Netgear R7800 exploration (IPQ8065, QCA9984)

Hope this shows us some results then o.O

It's built, I just need 15 min when no one is using the AP... later today I should have an answer.

btw in the tests above i do have:

echo 2 > /proc/irq/<irq # for eth0>/smp_affinity
echo 2 > /proc/irq/<adm_dma>/smp_affinity

if that matters, i'll test again without these
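For anyone who wants to repeat the pinning, here is a minimal sketch that looks up the IRQ numbers by name instead of filling them in by hand. The helper names `irq_for` and `pin_irqs` are made up for illustration, and the names `eth0` and `adm_dma` are assumed to match the last column of /proc/interrupts on your build, so check that file first.

```shell
#!/bin/sh
# irq_for NAME [FILE]: print the IRQ number(s) whose /proc/interrupts line
# ends with NAME. FILE defaults to /proc/interrupts (hypothetical helper).
irq_for() {
    awk -v name="$1" '$NF == name { sub(/:$/, "", $1); print $1 }' \
        "${2:-/proc/interrupts}"
}

# pin_irqs MASK NAME...: write the hex CPU mask to smp_affinity of every
# matching IRQ. Mask 2 means "CPU1 only", as in the commands above.
# Needs root, since it writes under /proc/irq.
pin_irqs() {
    mask="$1"; shift
    for name in "$@"; do
        for irq in $(irq_for "$name"); do
            echo "$mask" > "/proc/irq/$irq/smp_affinity"
        done
    done
}

# Example, left commented out so sourcing this file changes nothing:
# pin_irqs 2 eth0 adm_dma
```

Reading `/proc/irq/<n>/smp_affinity` back afterwards confirms whether the kernel accepted the mask.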

Sure, take your time. But something is still very strange here. We really need to understand exactly what is causing the regression.

250 Mbps is a lot!

If you want to do another test, comment out just the read part of the phy and leave the write one.

Whether both the read and the write are commented out or just the read, the result is the same: my 5 GHz Wi-Fi client can't do better than 200 Mbps, and the router's cpu0 is maxed out at 100% with cpu1 at 50+%.

Going back to kernel 5.10 + PR 4036, the same 5 GHz Wi-Fi client can do 450+ Mbps (even regularly hitting 500+ Mbps) with cpu0 at ~80% and cpu1 at ~50%.

For the record, the tested kernel 5.15 build was from the current head of master plus PR 4036 and PR 4748, included a cherry-pick of commit e972109 (test mdio improvement part 2), had the ds->pcs_poll line in qca8k.c commented out, and was tried with the two variants of disabling the "mdio for phy" that you suggested.

At this point, I think it would be helpful if someone else can confirm the slow down.

@quarky do you want to test a patch? No idea if it applies to 5.10, though...

I ported the coordinate clk and added support for ipq8064...

Sure. Point me to the patch link.

Still testing to make it work... but on the positive side, I fully reproduced the kernel crashes, and yes, they are caused by playing with the muxes... Setting the wrong mux always triggers a panic, with all sorts of errors such as kernel page faults, illegal ops, or NULL pointer dereferences...

So the instability is definitely triggered by a bad safe parent... I wonder if we should increase the udelay for the mux... maybe in some corner case it requires more time to switch?

Do we have any users who suffer from the instability problem a lot, so we can test the delay fix?


OK, it seems I have a working patch... give me a sec so I can create a good patchset to send...


@quarky here is the patch for 5.15 (from initial testing the scaling works; I tested this using mbw)

Hope they are not impossible to rebase on top of 5.10...

The only part where I found a discrepancy is the secondary mux safe sel... (I dropped the hardcoded values)

What I can't understand is: is safe_sel directly the value, or is it the index?

Because currently:

for the primary mux, a safe_sel of 2 produces a sel of 0
and for the secondary mux, a safe_sel of 0 produces a sel of 1

I need to check the original code from Netgear, as it could be that we got confused by this and we are sourcing from the mux divider instead of the secondary... (the primary mux has 3 inputs: secondary, hfpll and hfpll / 2)
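To make the value-vs-index question concrete, here is a tiny sketch of the two interpretations. The helper `index_to_val` and both maps are hypothetical: the map contents were picked only so that the lookup reproduces the two numbers quoted above, and they are not copied from the driver.

```shell
#!/bin/sh
# index_to_val INDEX VAL0 VAL1 ...: return the register value stored at
# position INDEX (0-based) of a parent map. Hypothetical helper.
index_to_val() {
    idx="$1"; shift
    shift "$idx"
    echo "$1"
}

# Hypothetical parent maps (position = parent index, entry = register value).
PRI_MAP="1 2 0"
SEC_MAP="1 0"

# If safe_sel is an index, it is translated through the map before being
# written, so the register ends up with a different number than safe_sel:
index_to_val 2 $PRI_MAP   # primary:   safe_sel 2 -> sel 0
index_to_val 0 $SEC_MAP   # secondary: safe_sel 0 -> sel 1

# If safe_sel were the value itself, it would be written unchanged
# (2 and 0), which would select different parents.
```

Under these assumptions the observed sel values are what an index lookup would give, so the thing to check against the original code is which parent each position of the map is supposed to name.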


Anyway, some more info about the clk mess... it doesn't make sense... From what I understand, acpu(0/1/l2)_aux is pointless... they are all connected to the same PLL (which in theory is pll8) and it runs at the same rate as qsb... so really it's never used. We always use hfpll.

The change I currently have locally is that I just dropped all the acpu mess and set the parent to pll8... (we just disable and enable the mux... we don't need to switch them, as pll8 is never scaled since it's a fixed clk source). Also, the clk names defined in the dts are just useless and should not be declared / connected correctly... (and I assume set the parent to the 2 mux?) What should be defined is the qsb or the pll8 clk... I'm considering dropping the qsb and declaring it directly in the dts, as it's a fixed clk...


btw, I'm checking the old R7800 source that has the very old clk init... I discovered all sorts of missing clks... and the prng running at half the clock rate... it's initialized at 32 MHz but the original firmware runs it at 64 MHz...

Router rebooted again this morning, despite locking scaling_min_freq to 800000. Nothing in pstore this time either...
I feel like something happened between mid november 2021 and mid january 2022, at least for me. Because I never had reboots before mid january when I upgraded OpenWrt and the image I was running before upgrading was from mid november 2021.

This happened on both the 5.10 kernel DSA and NSS builds, I know they're both very experimental but still...

I will try changing from schedutil to ondemand to see if that makes a difference.
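For reference, a minimal sketch of switching the governor through the standard cpufreq sysfs files. The function names `set_governor` and `show_cpufreq` are made up for illustration, and the sketch assumes the kernel exposes per-policy directories under /sys/devices/system/cpu/cpufreq and that the requested governor is listed in `scaling_available_governors`.

```shell
#!/bin/sh
# set_governor GOV [ROOT]: write GOV to every policy's scaling_governor.
# ROOT defaults to the usual cpufreq sysfs directory. Needs root.
set_governor() {
    gov="$1"
    root="${2:-/sys/devices/system/cpu/cpufreq}"
    for policy in "$root"/policy*; do
        [ -w "$policy/scaling_governor" ] || continue
        echo "$gov" > "$policy/scaling_governor"
    done
}

# show_cpufreq [ROOT]: print governor and min/max frequency (kHz) per policy.
show_cpufreq() {
    root="${1:-/sys/devices/system/cpu/cpufreq}"
    for policy in "$root"/policy*; do
        [ -r "$policy/scaling_governor" ] || continue
        printf '%s: %s min=%s max=%s\n' "${policy##*/}" \
            "$(cat "$policy/scaling_governor")" \
            "$(cat "$policy/scaling_min_freq")" \
            "$(cat "$policy/scaling_max_freq")"
    done
}

# Example, left commented out so sourcing this file changes nothing:
# set_governor ondemand && show_cpufreq
```

A change made this way does not survive a reboot, so it has to be reapplied from a startup script for a long-running stability test.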

I found 2x ramoops files on my router today (it crashed during a video meeting of course :upside_down_face: )

<1>[1195353.889862] Unable to handle kernel NULL pointer dereference at virtual address 00000000
<1>[1195353.892194] pgd = e28d2935
<1>[1195353.900283] [00000000] *pgd=00000000
<0>[1195353.903241] Internal error: Oops: 17 [#1] SMP ARM
<4>[1195353.906792] Modules linked in: imq xt_IMQ qcserial pppoe ppp_async option cdc_mbim ath10k_pci ath10k_core ath wireguard uvcvideo usb_wwan sierra_net sierra rndis_host qmi_wwan pptp pppox ppp_generic mac80211 libchacha20poly1305 libblake2s ipt_REJECT huawei_cdc_ncm gspca_zc3xx gspca_ov534 gspca_main ebtable_nat ebtable_filter ebtable_broute curve25519_neon cfg80211 cdc_ncm cdc_ether xt_time xt_tcpudp xt_tcpmss xt_string xt_statistic xt_state xt_recent xt_quota xt_pkttype xt_owner xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_iprange xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_bpf xt_addrtype xt_TCPMSS xt_REDIRECT xt_NETMAP xt_MASQUERADE xt_LOG xt_HL xt_FLOWOFFLOAD xt_DSCP xt_CT xt_CLASSIFY videobuf2_v4l2 videobuf2_common usbserial usbnet usblp ums_usbat ums_sddr55 ums_sddr09 ums_karma ums_jumpshot ums_isd200 ums_freecom ums_datafab ums_cypress ums_alauda ts_fsm ts_bm slhc poly1305_arm nf_reject_ipv4 nf_nat_tftp nf_nat_snmp_basic
<4>[1195353.907136]  nf_nat_sip nf_nat_pptp nf_nat_irc nf_nat_h323 nf_nat_ftp nf_nat_amanda nf_log_ipv4 nf_flow_table_hw nf_flow_table nf_conntrack_tftp nf_conntrack_snmp nf_conntrack_sip nf_conntrack_pptp nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp nf_conntrack_broadcast ts_kmp nf_conntrack_amanda nf_conncount libcurve25519_generic libblake2s_generic iptable_raw iptable_nat iptable_mangle iptable_filter ipt_ECN ip6table_raw ip_tables input_core ebtables ebt_vlan ebt_stp ebt_redirect ebt_pkttype ebt_mark_m ebt_mark ebt_limit ebt_among ebt_802_3 crc_ccitt compat chacha_neon cdc_wdm cdc_acm asn1_decoder fuse sch_teql sch_sfq sch_red sch_prio sch_pie sch_multiq sch_gred sch_fq sch_dsmark sch_codel em_text em_nbyte em_meta em_cmp act_simple act_police act_pedit act_ipt act_csum libcrc32c sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_tcindex cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred act_gact videobuf2_vmalloc videobuf2_memops videodev ledtrig_usbport
<4>[1195353.977490]  xt_set ip_set_list_set ip_set_hash_netportnet ip_set_hash_netport ip_set_hash_netnet ip_set_hash_netiface ip_set_hash_net ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set_hash_ipmark ip_set_hash_ip ip_set_bitmap_port ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set nfnetlink xt_weburl xt_webmon xt_timerange xt_bandwidth ip6table_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_NPT nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 nfsv4 nfsd nfs msdos ip_gre gre ip6_udp_tunnel udp_tunnel ip_tunnel rpcsec_gss_krb5 auth_rpcgss oid_registry tun vfat fat lockd sunrpc grace hfsplus dns_resolver dm_mirror dm_region_hash dm_log dm_crypt dm_mod dax nls_utf8 nls_koi8_r nls_iso8859_2 nls_iso8859_15 nls_iso8859_13 nls_iso8859_1 nls_cp866 nls_cp852 nls_cp850 nls_cp775 nls_cp437 nls_cp1251 nls_cp1250 dma_shared_buffer sha1_generic md5 ecb des_generic libdes cts cbc arc4 uas usb_storage leds_gpio
<4>[1195354.064901]  xhci_plat_hcd xhci_pci xhci_hcd dwc3 dwc3_qcom ohci_platform ohci_hcd phy_qcom_ipq806x_usb ahci fsl_mph_dr_of ehci_platform ehci_fsl sd_mod ahci_platform libahci_platform libahci libata scsi_mod ehci_hcd gpio_button_hotplug ext4 mbcache jbd2 crc32c_generic [last unloaded: imq]
<4>[1195354.178144] CPU: 0 PID: 5816 Comm: kworker/0:2 Not tainted 5.4.168 #0
<4>[1195354.200371] Hardware name: Generic DT based system
<4>[1195354.206732] Workqueue: events dbs_work_handler
<4>[1195354.211921] PC is at __timer_delay+0x38/0x70
<4>[1195354.216528] LR is at msm_read_current_timer+0x1c/0x28
<4>[1195354.220769] pc : [<c089ae54>]    lr : [<c07175cc>]    psr: 80000013
<4>[1195354.225894] sp : d608bd20  ip : 00000000  fp : dda03010
<4>[1195354.232487] r10: ffffffff  r9 : 00000000  r8 : 00000002
<4>[1195354.237869] r7 : d608bda4  r6 : 00000006  r5 : 7806604e  r4 : 00000000
<4>[1195354.243254] r3 : de806024  r2 : 1fffa6f0  r1 : 00000000  r0 : 00000003
<4>[1195354.250029] Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
<4>[1195354.256710] Control: 10c5787d  Table: 5dd2406a  DAC: 00000051
<0>[1195354.263742] Process kworker/0:2 (pid: 5816, stack limit = 0xec75e7fc)
<0>[1195354.269644] Stack: (0xd608bd20 to 0xd608c000)
<0>[1195354.276260] bd20: dd5d53d8 00000001 20000013 c05e652c dd5d53e4 00000000 00000000 c05e7be0
<0>[1195354.280779] bd40: ffffffff 00000000 00000000 c033dd70 dd6b8718 00000000 dd6b8704 00000002
<0>[1195354.289112] bd60: d608bda4 c033dfe8 00000000 dd574e00 dd6b8700 c0c142f8 dd6af780 00000002
<0>[1195354.297444] bd80: 2faf0800 dce46b00 dce46a00 c033e06c 00000000 00003208 dd574e00 c05d7e4c
<0>[1195354.305779] bda0: dd574e00 dd5d4a40 2faf0800 23c34600 dd574e00 00000000 dd6af780 23c34600
<0>[1195354.314112] bdc0: dd561000 c05da340 dd6af668 dd561000 23c34600 dd4c7240 2faf0800 dce46b00
<0>[1195354.322445] bde0: dce46a00 c05da388 dd6af780 00000000 23c34600 dd561000 dce46b80 dce46b00
<0>[1195354.330778] be00: dce46a00 c05da724 00000000 23c34600 00000000 ffffffff 23c34600 c0c16294
<0>[1195354.339112] be20: dce43f40 dce43f80 23c34600 23c34600 2faf0800 c05da90c dd7e8a00 23c34600
<0>[1195354.347445] be40: 23c34600 c06e44c0 c0c21b14 00000000 dce46b34 dce46bb4 00000000 23c34600
<0>[1195354.355778] be60: dd7e9000 dd7e9000 00000000 23c34600 00000000 00000001 000927c0 dce48500
<0>[1195354.364112] be80: dce48518 c06edd90 d608bec0 000927c0 dce48540 dce46e80 dd7e9000 dd7e9000
<0>[1195354.372444] bea0: 00000000 c0c5ed10 00000000 00000001 000927c0 00000000 ffffe000 c06e8b20
<0>[1195354.380780] bec0: dd7e9000 000c3500 000927c0 000000a1 dd7e9000 dce46f00 dce46f00 dce46f80
<0>[1195354.389111] bee0: dce46f80 dce486c0 00000000 c06ec374 dce46f38 00000000 dce46f04 dd7e9000
<0>[1195354.397444] bf00: c0c21e08 00000000 00000000 c06ecfd4 dce46f38 d84d2000 dda095c0 dda0c700
<0>[1195354.405778] bf20: 00000000 c0336ff8 00000008 c0c03d00 d84d2000 d84d2014 dda095c0 00000008
<0>[1195354.414113] bf40: c0c03d00 dda095d8 dda095c0 c03372b8 c0c0bc1c c095a8ec d949feac d85bcbdc
<0>[1195354.422445] bf60: d84d2000 d85bcbc0 d608a000 d8b17400 d949feac d85bcbdc d84d2000 c0337264
<0>[1195354.430779] bf80: 00000000 c033d0e4 00000000 d8b17400 c033cfa0 00000000 00000000 00000000
<0>[1195354.439111] bfa0: 00000000 00000000 00000000 c03010e8 00000000 00000000 00000000 00000000
<0>[1195354.447444] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<0>[1195354.455780] bfe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
<4>[1195354.464105] [<c089ae54>] (__timer_delay) from [<c05e652c>] (krait_mux_set_parent+0xc8/0xcc)
<4>[1195354.472435] [<c05e652c>] (krait_mux_set_parent) from [<c05e7be0>] (krait_notifier_cb+0x58/0xb4)
<4>[1195354.481125] [<c05e7be0>] (krait_notifier_cb) from [<c033dd70>] (notifier_call_chain+0x74/0xa8)
<4>[1195354.489975] [<c033dd70>] (notifier_call_chain) from [<c033dfe8>] (__srcu_notifier_call_chain+0x54/0xc0)
<4>[1195354.498829] [<c033dfe8>] (__srcu_notifier_call_chain) from [<c033e06c>] (srcu_notifier_call_chain+0x18/0x20)
<4>[1195354.508292] [<c033e06c>] (srcu_notifier_call_chain) from [<c05d7e4c>] (__clk_notify+0x70/0x94)
<4>[1195354.518188] [<c05d7e4c>] (__clk_notify) from [<c05da340>] (clk_change_rate+0xfc/0x29c)
<4>[1195354.527123] [<c05da340>] (clk_change_rate) from [<c05da388>] (clk_change_rate+0x144/0x29c)
<4>[1195354.535197] [<c05da388>] (clk_change_rate) from [<c05da724>] (clk_core_set_rate_nolock+0xfc/0x14c)
<4>[1195354.543616] [<c05da724>] (clk_core_set_rate_nolock) from [<c05da90c>] (clk_set_rate+0x38/0x9c)
<4>[1195354.552742] [<c05da90c>] (clk_set_rate) from [<c06e44c0>] (dev_pm_opp_set_rate+0x28c/0x49c)
<4>[1195354.561504] [<c06e44c0>] (dev_pm_opp_set_rate) from [<c06edd90>] (set_target+0x64/0x250)
<4>[1195354.569917] [<c06edd90>] (set_target) from [<c06e8b20>] (__cpufreq_driver_target+0x1a0/0x568)
<4>[1195354.578075] [<c06e8b20>] (__cpufreq_driver_target) from [<c06ec374>] (od_dbs_update+0xc8/0x19c)
<4>[1195354.586671] [<c06ec374>] (od_dbs_update) from [<c06ecfd4>] (dbs_work_handler+0x38/0x70)
<4>[1195354.595705] [<c06ecfd4>] (dbs_work_handler) from [<c0336ff8>] (process_one_work+0x234/0x4a0)
<4>[1195354.603859] [<c0336ff8>] (process_one_work) from [<c03372b8>] (worker_thread+0x54/0x604)
<4>[1195354.612366] [<c03372b8>] (worker_thread) from [<c033d0e4>] (kthread+0x144/0x148)
<4>[1195354.620608] [<c033d0e4>] (kthread) from [<c03010e8>] (ret_from_fork+0x14/0x2c)
<4>[1195354.628152] Exception stack(0xd608bfb0 to 0xd608bff8)
<4>[1195354.635710] bfa0:                                     00000000 00000000 00000000 00000000
<4>[1195354.640673] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<4>[1195354.649000] bfe0: 00000000 00000000 00000000 00000000 00000013 00000000
<0>[1195354.657334] Code: e12fff33 e1a05000 e5940000 ea000000 (e5940000) 
<4>[1195354.664407] ---[ end trace 2e9f02c89c3c3afb ]---
<0>[1195354.691761] Kernel panic - not syncing: Fatal exception
<2>[1195354.691808] CPU1: stopping
<4>[1195354.696143] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D           5.4.168 #0
<4>[1195354.699092] Hardware name: Generic DT based system
<4>[1195354.706490] [<c030f934>] (unwind_backtrace) from [<c030b928>] (show_stack+0x14/0x20)
<4>[1195354.711608] [<c030b928>] (show_stack) from [<c089d478>] (dump_stack+0x94/0xa8)
<4>[1195354.719330] [<c089d478>] (dump_stack) from [<c030eb60>] (handle_IPI+0x184/0x1b8)
<4>[1195354.726882] [<c030eb60>] (handle_IPI) from [<c059e01c>] (gic_handle_irq+0xb4/0xb8)
<4>[1195354.734262] [<c059e01c>] (gic_handle_irq) from [<c0301a8c>] (__irq_svc+0x6c/0x90)
<4>[1195354.742149] Exception stack(0xdd467f18 to 0xdd467f60)
<4>[1195354.749530] 7f00:                                                       00000000 00043f2b
<4>[1195354.754759] 7f20: 1ced7000 dda189c0 dcc14c00 00000000 dda17db0 00043f2b 00043f2b 00000000
<4>[1195354.763092] 7f40: 33c8f7c0 33a013a0 00000015 dd467f68 c06ef598 c06ef59c 20000013 ffffffff
<4>[1195354.771422] [<c0301a8c>] (__irq_svc) from [<c06ef59c>] (cpuidle_enter_state+0x94/0x498)
<4>[1195354.779750] [<c06ef59c>] (cpuidle_enter_state) from [<c06ef9e4>] (cpuidle_enter+0x30/0x4c)
<4>[1195354.788081] [<c06ef9e4>] (cpuidle_enter) from [<c03487dc>] (do_idle+0x1d8/0x240)
<4>[1195354.796584] [<c03487dc>] (do_idle) from [<c0348aec>] (cpu_startup_entry+0x1c/0x20)
<4>[1195354.803961] [<c0348aec>] (cpu_startup_entry) from [<423024cc>] (0x423024cc)

Based on krait, mux and clk_rate, it sounds like CPU frequency scaling, and based on "kernel NULL pointer dereference at virtual address 00000000" it looks like something is being called before the pointer variable has been given a value. Hopefully @Ansuel can figure out more.

P.S. Great that you got a ramoops file. I should probably do a PR to get it included in ipq806x by default.

FWIW, I haven't experienced any crashes on master with CPU governor set to performance. I guess that adds another datapoint that indicates something is wrong with the frequency scaling.

With @quarky we analyzed the code, and the kernel NULL pointer report doesn't make sense, since nothing there is allocated and then freed... I suspect it's a false positive and the actual problem is the scaling setting the CPU to the wrong frequency and crashing the system... I'm testing the patch I posted above to see if the situation improves... if anyone wants to test it too... I don't know if @quarky managed to port it to 5.10, though...

I'm actually running both my ipq806x routers on OpenWrt 21.02 with the 5.4 kernel. I backported the L2 freq scaling module you did for the master 5.10 kernel, though.

There's quite a number of files to be patched, so I think it'll take me a while to back-port it to 5.4. Will try it out tho. At the moment my focus is to try to find out why my Wi-Fi is so unstable. I think I found some discrepancies in logic between the old and new airtime fairness code backported to mac80211, but it seems to affect only ath10k. Anyway, I'm still trying to figure it out. Reverting back to the old algo. seems to resolve my ipq806x router's Wi-Fi issues, for now.

Did you get a log of this unstable Wi-Fi operation?

Router did not crash. Wi-Fi just slows to a crawl and I had to restart the Wi-Fi interface to get back to ‘normal’.

@quarky, have you found a relatively easy way to revert ATF completely (either disable at runtime or build an image with the ATF commits reverted) and test the algorithm prior to ATF?

I can now demonstrate that ATF is working on my device, but it seems asymmetric. I.e., with simultaneous netperf runs through the AP out to the clients, ATF seems to work. With simultaneous netperf runs from the clients through the AP to the netperf server, I have issues.

I can always go back to an old build (likely from the 4.19 days), but the comparison would be "cleaner", I think, if I can revert ATF on a current build.

Have you tried the ath10k module and qca firmware?

The non-ct qca9980 firmware is ancient. I've reverted to it previously (along with the non-ct ath10k driver) as part of my testing, but perhaps it would be good to try it one more time now that I can at least show ATF doing something.

Thanks.

EDIT: just tried it and the results are confusing.

First off, single netperf runs from my "fast" 5 GHz Wi-Fi clients drop by 50-75%, i.e. from 500 Mbps down to less than 100 Mbps. Ironically, one "slower" 5 GHz Wi-Fi client stayed about the same.

Second, I still see

r7500v2 # cat /sys/kernel/debug/ieee80211/phy0/netdev\:wlan0/stations/*/airtime
RX: 0 us
TX: 1430717014 us
Weight: 256
Virt-T: VO: 3517 us VI: 0 us BE: 1557380091 us BK: 0 us
RX: 0 us
TX: 155146654 us
Weight: 256
Virt-T: VO: 1777 us VI: 0 us BE: 1557406770 us BK: 0 us

i.e. the ATF infrastructure is still there and, unless I mod the ath10k driver like I did for the ath10k-ct driver, there is no way for me to demonstrate whether ATF is working or not.
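Since `cat stations/*/airtime` drops the station names, here is a small sketch that prints the counters per station. The function name `airtime_summary` is made up, the default path is the one from the command above, and the parsing assumes the `RX:` / `TX:` line format shown in that output.

```shell
#!/bin/sh
# airtime_summary [DIR]: for each station directory under DIR, print the
# directory name (the station's MAC address) and its RX/TX airtime,
# converted from microseconds to whole seconds.
airtime_summary() {
    dir="${1:-/sys/kernel/debug/ieee80211/phy0/netdev:wlan0/stations}"
    for sta in "$dir"/*/; do
        [ -r "${sta}airtime" ] || continue
        name=${sta%/}
        awk -v sta="${name##*/}" '
            $1 == "RX:" { rx = $2 }
            $1 == "TX:" { tx = $2 }
            END { printf "%s rx=%ds tx=%ds\n", sta, rx / 1000000, tx / 1000000 }
        ' "${sta}airtime"
    done
}
```

Running it before and after a netperf test and comparing the TX values per station gives a rough view of how airtime was shared, without touching the driver.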

Lastly, if I repeat the simultaneous netperf tests, they are different, possibly less "grinding to a halt" or at least some recovery, but it's not that clear, especially given the different netperf rates I now see.

Another example: a single netperf -t tcp_maerts from a sole client with the ath10k-ct driver and firmware gave 125 Mbps; with non-ct, the same test does 22 Mbps.

Are you saying that ATF kicks in when the clients are downloading but not when they're uploading?