Netgear R7800 exploration (IPQ8065, QCA9984)

The idea is that I need an R7800 with serial access so I can test an initramfs image. But I guess I will just wait until I'm back home on Tuesday.

Can I test it on my TP-Link C2600? It is also ipq806x.

If it's an ipq806x system, you can test it.

I'm testing it on my TP-Link C2600. It runs fine, but there are problems with dependencies, USB mount, and extroot.

Well, nice job getting it to build. I just tried and had to fix multiple issues before it would build. For others who might see this, it might be better to wait until @ansuel has a chance to rebase PR 4748 and clean up the commit a little.

For the impatient:

This commit causes a merge issue which is easy enough to fix after a little inspection.

Also, it looks like the dtb entries qcom-ipq8068-mr42.dtb and qcom-ipq8068-mr52.dtb are not included in target/linux/ipq806x/patches-5.15/0069-arm-boot-add-dts-files.patch, but the corresponding entries for these targets are still in target/linux/ipq806x/image/generic.mk. I'm not sure of the correct way to resolve this; I just removed the problematic entries from generic.mk, and then the image builds.

I can't test this image yet, the AP is in service.

Seems like my R7800 is getting rather cranky now. Just had another panic after less than 1 day of uptime, running a test build to troubleshoot a Wi-Fi slow-down issue. Looks like it is the CPU clock again. Will try to debug.

<1>[42559.968493] 8<--- cut here ---
<1>[42559.968527] Unable to handle kernel NULL pointer dereference at virtual address 00000000
<1>[42559.970564] pgd = d41ae606
<1>[42559.978686] [00000000] *pgd=00000000
<0>[42559.981308] Internal error: Oops: 17 [#1] SMP ARM
<4>[42559.984940] Modules linked in: ecm iptable_nat ath10k_pci ath10k_core ath xt_state xt_nat xt_conntrack xt_REDIRECT xt_MASQUERADE xt_FLOWOFFLOAD wireguard nf_nat nf_flow_table_hw nf_flow_table nf_conntrack mac80211 libchacha20poly1305 libblake2s ipt_REJECT ebtable_nat ebtable_filter ebtable_broute curve25519_neon cfg80211 xt_time xt_tcpudp xt_tcpmss xt_statistic xt_quota xt_pkttype xt_physdev xt_owner xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_ecn xt_dscp xt_comment xt_addrtype xt_TCPMSS xt_LOG xt_HL xt_DSCP xt_CLASSIFY ppp_async poly1305_arm nf_reject_ipv4 nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 macvlan libcurve25519_generic libblake2s_generic l2tp_ppp iptable_mangle iptable_filter ipt_ECN ip_tables ebtables ebt_vlan ebt_stp ebt_redirect ebt_pkttype ebt_mark_m ebt_mark ebt_limit ebt_among ebt_802_3 crc_ccitt compat chacha_neon sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_tcindex cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred act_gact
<4>[42559.985201]  qca_nss_tunipip6 qca_nss_tun6rd qca_nss_ipsecmgr qca_nss_cfi_cryptoapi qca_nss_qdisc qca_nss_crypto qca_nss_vlan qca_nss_pppoe pppoe pppox ppp_generic slhc qca_nss_gre qca_nss_bridge_mgr ledtrig_usbport xt_set ip_set_list_set ip_set_hash_netportnet ip_set_hash_netport ip_set_hash_netnet ip_set_hash_netiface ip_set_hash_net ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set_hash_ipmark ip_set_hash_ip ip_set_bitmap_port ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set nfnetlink nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 qca_mcs bonding ip6_gre ip_gre gre sit qca_nss_drv l2tp_netlink l2tp_core udp_tunnel ip6_udp_tunnel ipcomp6 xfrm6_tunnel esp6 ah6 xfrm4_tunnel ipcomp esp4 ah4 ipip ip6_tunnel qca_nss_gmac tunnel6 tunnel4 ip_tunnel tun qca_ssdk xfrm_user xfrm_ipcomp af_key xfrm_algo shortcut_fe_drv shortcut_fe_ipv6 shortcut_fe sha1_generic md5 echainiv des_generic libdes cbc authenc
<4>[42560.054477]  usb_storage leds_gpio xhci_plat_hcd xhci_pci xhci_hcd dwc3 dwc3_qcom ohci_platform ohci_hcd phy_qcom_ipq806x_usb ahci fsl_mph_dr_of ehci_platform ehci_fsl sd_mod ahci_platform libahci_platform libahci libata scsi_mod ehci_hcd gpio_button_hotplug ext4 mbcache jbd2 crc32c_generic
<4>[42560.166708] CPU: 1 PID: 20026 Comm: kworker/1:2 Not tainted 5.4.171 #0
<4>[42560.188936] Hardware name: Generic DT based system
<4>[42560.195552] Workqueue: events dbs_work_handler
<4>[42560.200323] PC is at __timer_delay+0x30/0x70
<4>[42560.204742] LR is at msm_read_current_timer+0x1c/0x28
<4>[42560.209158] pc : [<c090290c>]    lr : [<c075b920>]    psr: a0000013
<4>[42560.214108] sp : d8dc9d20  ip : 00000000  fp : dd998010
<4>[42560.220182] r10: ffffffff  r9 : 00000000  r8 : 00000002
<4>[42560.225390] r7 : d8dc9da4  r6 : 00000006  r5 : eed2e34c  r4 : 00000000
<4>[42560.230601] r3 : de806024  r2 : 1fffa6f0  r1 : 00000000  r0 : eed2e34c
<4>[42560.237203] Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
<4>[42560.243713] Control: 10c5787d  Table: 5a1c406a  DAC: 00000051
<0>[42560.250918] Process kworker/1:2 (pid: 20026, stack limit = 0x2ceb57a0)
<0>[42560.256645] Stack: (0xd8dc9d20 to 0xd8dca000)
<0>[42560.263090] 9d20: dd5b1858 00000001 20000013 c0629a84 dd5b1864 00000000 00000000 c062b610
<0>[42560.267519] 9d40: ffffffff 00000000 00000000 c033eb08 dd5b3218 00000000 dd5b3204 00000002
<0>[42560.275679] 9d60: d8dc9da4 c033ed80 00000000 dd52f600 dd5b3200 c0c1ab30 dd5b2300 00000002
<0>[42560.283838] 9d80: 2faf0800 dce58000 dce55f00 c033ee04 00000000 00003248 dd52f600 c061b3a0
<0>[42560.292004] 9da0: dd52f600 dd6f3e80 2faf0800 23c34600 dd52f600 00000000 dd5b2300 23c34600
<0>[42560.300162] 9dc0: dd51f240 c061d894 dd5b21e8 dd51f240 23c34600 dd4d23c0 2faf0800 dce58000
<0>[42560.308319] 9de0: dce55f00 c061d8dc dd5b2300 00000000 23c34600 dd51f240 dce58080 dce58000
<0>[42560.316481] 9e00: dce55f00 c061dc78 c0c04f28 23c34600 00000000 ffffffff 23c34600 c0c1cde4
<0>[42560.324639] 9e20: dce54680 dce546c0 23c34600 00000001 2faf0800 c061de60 dcc2a800 23c34600
<0>[42560.332802] 9e40: 00000001 c0728434 dcc2b400 c07273cc dce58034 dce580b4 dcc2b400 23c34600
<0>[42560.340960] 9e60: dce58588 00000000 3b9aca00 00000001 c0c04f28 00000001 c0c67644 dcc2b000
<0>[42560.349120] 9e80: 00000000 c0732cec d8dc9ec0 dce54fc0 23c34600 000927c0 dcc2b000 dcc2b000
<0>[42560.357280] 9ea0: 00000000 c0c67620 00000000 00000001 000927c0 00000000 ffffe000 c072c924
<0>[42560.365438] 9ec0: dcc2b000 000c3500 000927c0 000000a1 dcc2b000 dce58300 dce58300 dce55e00
<0>[42560.373596] 9ee0: dce55e00 dce54b80 00000000 c073017c dce58338 00000000 dce58304 dcc2b000
<0>[42560.381756] 9f00: c0c289fc 00000040 00000000 c0730ddc dce58338 d93e4100 dd99e680 dd9a1800
<0>[42560.389918] 9f20: 00000000 c0337a04 00000008 c0c03d00 d93e4100 d93e4114 dd99e680 00000008
<0>[42560.398078] 9f40: c0c03d00 dd99e698 dd99e680 c0337cc4 c0c0be4c c09c9c08 00000000 c0c0be90
<0>[42560.406237] 9f60: d93e4100 d8121c40 d8fbf340 00000000 d8dc8000 c0337c70 d93e4100 d4fabeac
<0>[42560.414399] 9f80: d8121c5c c033dca0 00000000 d8fbf340 c033db40 00000000 00000000 00000000
<0>[42560.422556] 9fa0: 00000000 00000000 00000000 c03010e8 00000000 00000000 00000000 00000000
<0>[42560.430712] 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<0>[42560.438873] 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
<4>[42560.447045] [<c090290c>] (__timer_delay) from [<c0629a84>] (krait_mux_set_parent+0xc8/0xcc)
<4>[42560.455190] [<c0629a84>] (krait_mux_set_parent) from [<c062b610>] (krait_notifier_cb+0x58/0xb4)
<4>[42560.463356] [<c062b610>] (krait_notifier_cb) from [<c033eb08>] (notifier_call_chain+0x74/0xa8)
<4>[42560.472030] [<c033eb08>] (notifier_call_chain) from [<c033ed80>] (__srcu_notifier_call_chain+0x54/0xc0)
<4>[42560.480707] [<c033ed80>] (__srcu_notifier_call_chain) from [<c033ee04>] (srcu_notifier_call_chain+0x18/0x20)
<4>[42560.490001] [<c033ee04>] (srcu_notifier_call_chain) from [<c061b3a0>] (__clk_notify+0x70/0x94)
<4>[42560.500063] [<c061b3a0>] (__clk_notify) from [<c061d894>] (clk_change_rate+0xfc/0x29c)
<4>[42560.508482] [<c061d894>] (clk_change_rate) from [<c061d8dc>] (clk_change_rate+0x144/0x29c)
<4>[42560.516382] [<c061d8dc>] (clk_change_rate) from [<c061dc78>] (clk_core_set_rate_nolock+0xfc/0x14c)
<4>[42560.524633] [<c061dc78>] (clk_core_set_rate_nolock) from [<c061de60>] (clk_set_rate+0x38/0x9c)
<4>[42560.533573] [<c061de60>] (clk_set_rate) from [<c0728434>] (dev_pm_opp_set_rate+0x28c/0x49c)
<4>[42560.542172] [<c0728434>] (dev_pm_opp_set_rate) from [<c0732cec>] (set_target+0x17c/0x1ec)
<4>[42560.550407] [<c0732cec>] (set_target) from [<c072c924>] (__cpufreq_driver_target+0x1a0/0x568)
<4>[42560.558737] [<c072c924>] (__cpufreq_driver_target) from [<c073017c>] (od_dbs_update+0xc8/0x19c)
<4>[42560.567249] [<c073017c>] (od_dbs_update) from [<c0730ddc>] (dbs_work_handler+0x38/0x70)
<4>[42560.575762] [<c0730ddc>] (dbs_work_handler) from [<c0337a04>] (process_one_work+0x234/0x4a0)
<4>[42560.583747] [<c0337a04>] (process_one_work) from [<c0337cc4>] (worker_thread+0x54/0x604)
<4>[42560.592429] [<c0337cc4>] (worker_thread) from [<c033dca0>] (kthread+0x160/0x164)
<4>[42560.600496] [<c033dca0>] (kthread) from [<c03010e8>] (ret_from_fork+0x14/0x2c)
<4>[42560.607862] Exception stack(0xd8dc9fb0 to 0xd8dc9ff8)
<4>[42560.614903] 9fa0:                                     00000000 00000000 00000000 00000000
<4>[42560.620037] 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<4>[42560.628189] 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000
<0>[42560.636352] Code: 0a000010 e5903000 e12fff33 e1a05000 (e5940000) 
<4>[42560.642953] ---[ end trace 492b3a2b02cadd53 ]---
<0>[42560.667088] Kernel panic - not syncing: Fatal exception
<2>[42560.667132] CPU0: stopping
<4>[42560.671124] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G      D           5.4.171 #0
<4>[42560.673899] Hardware name: Generic DT based system
<4>[42560.681472] [<c030f974>] (unwind_backtrace) from [<c030b968>] (show_stack+0x14/0x20)
<4>[42560.686064] [<c030b968>] (show_stack) from [<c0904f38>] (dump_stack+0x94/0xa8)
<4>[42560.693960] [<c0904f38>] (dump_stack) from [<c030eba0>] (handle_IPI+0x184/0x1b8)
<4>[42560.701001] [<c030eba0>] (handle_IPI) from [<c05e1570>] (gic_handle_irq+0xb4/0xb8)
<4>[42560.708550] [<c05e1570>] (gic_handle_irq) from [<c0301a8c>] (__irq_svc+0x6c/0x90)
<4>[42560.715917] Exception stack(0xc0c01ee0 to 0xc0c01f28)
<4>[42560.723486] 1ee0: 00000000 000026b5 1ce4e000 dd98fa80 dd7f7000 00000000 dd98ee30 000026b5
<4>[42560.728524] 1f00: 000026b5 00000000 6d71f540 6d645480 00000015 c0c01f30 c07338b8 c07338bc
<4>[42560.736666] 1f20: 20000013 ffffffff
<4>[42560.744832] [<c0301a8c>] (__irq_svc) from [<c07338bc>] (cpuidle_enter_state+0x94/0x498)
<4>[42560.748130] [<c07338bc>] (cpuidle_enter_state) from [<c0733d04>] (cpuidle_enter+0x30/0x4c)
<4>[42560.756116] [<c0733d04>] (cpuidle_enter) from [<c034ae34>] (do_idle+0x1d8/0x240)
<4>[42560.764449] [<c034ae34>] (do_idle) from [<c034b144>] (cpu_startup_entry+0x1c/0x20)
<4>[42560.772001] [<c034b144>] (cpu_startup_entry) from [<c0b00e5c>] (start_kernel+0x4dc/0x4ec)

That crash is similar to another kernel crash... the panic was also at the mux_set_parent... Interesting... I wonder if all this time we have been missing some lock or extra check in the safe parent logic (the notifier stuff).

@quarky any idea how to reproduce this? Like something to stress this?
In theory we should be able to reproduce this by stressing the CPU clk change...
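
A hedged sketch of one way to stress the CPU clk change from userspace: keep flipping the requested frequency so the kernel has to reprogram the clock on every write. It assumes the cpufreq userspace governor is available and selected, so that writes to the standard scaling_setspeed sysfs file take effect. The helper names and the two example frequencies (taken from the values discussed in this thread) are illustrative only, and nothing here has been tested on an R7800.

```c
#include <stdio.h>

/* Write one frequency (in kHz) to a cpufreq sysfs file.
 * Returns 0 on success, -1 on any error. */
int write_khz(const char *path, unsigned long khz)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fprintf(f, "%lu\n", khz) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Alternate between two frequencies: even rounds write khz_a, odd rounds
 * write khz_b. Stops at the first failed write and returns the number of
 * writes that succeeded. */
int toggle_khz(const char *path, unsigned long khz_a, unsigned long khz_b,
               int rounds)
{
    int done = 0;

    for (int i = 0; i < rounds; i++) {
        if (write_khz(path, (i % 2) ? khz_b : khz_a) != 0)
            break;
        done++;
    }
    return done;
}
```

For example, calling toggle_khz("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed", 384000, 800000, 100000) as root would request 100000 alternating changes on cpu0. Running a second copy against cpu1 at the same time might exercise the case where both cores scale independently.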

I mean, I can't tell whether the panic is caused by either

struct krait_mux_clk *mux = to_krait_mux_clk(hw);

or

struct krait_mux_clk *mux = container_of(nb, struct krait_mux_clk, clk_nb);

being NULL for some reason?


My suspicion is that we are switching to a safe parent while we are setting the clock of the other core, and the mux is set again for the other core's clock, and this causes a crash or other problems.

(If I'm not wrong, but I need to find the old ASCII scheme, the mux is shared across the 2 cores... I wonder if we should put it under a mutex?)
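
To make the suspicion concrete, here is a toy model of what could go wrong if two rate changes interleave on one shared mux without a lock. It is NOT the krait-cc driver code: the struct, the enum, and every function name are made up for illustration, and whether the real mux is actually shared this way is exactly the open question.

```c
/* Toy model only: names and layout are hypothetical, not from the driver. */
enum toy_parent { TOY_SRC_HFPLL = 0, TOY_SRC_SAFE = 1 };

struct toy_mux {
    enum toy_parent sel;    /* parent currently selected */
    enum toy_parent saved;  /* parent to restore after the rate change */
};

/* "Pre rate change" step: remember the parent, park on the safe source. */
void toy_pre_change(struct toy_mux *m)
{
    m->saved = m->sel;
    m->sel = TOY_SRC_SAFE;
}

/* "Post rate change" step: go back to the remembered parent. */
void toy_post_change(struct toy_mux *m)
{
    m->sel = m->saved;
}

/* Two rate changes (A and B) overlapping on the same mux, no lock held. */
enum toy_parent toy_interleaved(void)
{
    struct toy_mux m = { TOY_SRC_HFPLL, TOY_SRC_HFPLL };

    toy_pre_change(&m);   /* A: saved = HFPLL, sel = SAFE */
    toy_pre_change(&m);   /* B: saved = SAFE (overwrites A's), sel = SAFE */
    toy_post_change(&m);  /* A: restores SAFE instead of HFPLL */
    toy_post_change(&m);  /* B: restores SAFE */
    return m.sel;
}

/* Same two rate changes, but each pre/post pair runs to completion first. */
enum toy_parent toy_serialized(void)
{
    struct toy_mux m = { TOY_SRC_HFPLL, TOY_SRC_HFPLL };

    toy_pre_change(&m);
    toy_post_change(&m);
    toy_pre_change(&m);
    toy_post_change(&m);
    return m.sel;
}
```

In this model the interleaved order leaves the mux stuck on the safe source, while the serialized order ends back on the original parent. Holding one mutex across each pre/post pair would force the serialized order.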

The router is streaming an IPTV multicast to an STB, with 3 Wi-Fi clients that are idling. Nothing unusual that I can tell, though.

And I wonder if that's the real problem... this happens when the router is idle... I assume that's where the safe parent is triggered, since the high frequency is not used... In theory it's used to transition from high to low freq, so a router in idle state with some spikes...

Wait a second let me find the mux scheme.

https://patchwork.ozlabs.org/project/devicetree-bindings/cover/1529415925-28915-1-git-send-email-sricharan@codeaurora.org/

We need to read the message in the cover letter and understand if something is wrong in the logic... especially about the case where the cpu core can scale independently and no locking is implemented.

Also I notice one patch was removed from v3 and we should investigate if it was correct or not...


No idea if this is still present

https://patchwork.kernel.org/project/linux-arm-kernel/patch/1426920332-9340-4-git-send-email-sboyd@codeaurora.org/


That patch was resent and it was said that it needed coordinated clk changes... that seems to have been implemented in 2019 with https://patchwork.kernel.org/project/linux-clk/patch/20190305044936.22267-5-dbasehore@chromium.org/
where it looks like safe parent support was finally introduced in the generic clk... I wonder if this can be used to drop all the notifier stuff and improve this?

but nobody decided to review that...

From my understanding of reading the code, the container_of() macro just performs pointer arithmetic and doesn't use the pointer to access data, so it shouldn't cause any NULL pointer dereference.
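
A minimal userspace sketch of that point. The macro below is a simplified stand-in for the kernel's container_of(), and the two structs are a made-up layout for illustration, not the real struct krait_mux_clk.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel macro: step back from the member's
 * address by the member's offset inside the outer struct. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical layout, for illustration only. */
struct toy_notifier {
    int priority;
};

struct toy_mux_clk {
    unsigned int safe_sel;
    struct toy_notifier clk_nb;  /* embedded member, like clk_nb */
};

/* Recover the outer struct from a pointer to its embedded notifier.
 * Only address arithmetic happens here; nothing is read from memory. */
struct toy_mux_clk *toy_from_nb(struct toy_notifier *nb)
{
    return container_of(nb, struct toy_mux_clk, clk_nb);
}
```

Since the macro only subtracts an offset, a bad nb would not fault at this line. It would fault later, at the first access through the returned pointer.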

From the panic log, the problem seems to be caused by a NULL pointer dereference in the __timer_delay() function, since the CPU PC points at that function. But __timer_delay() is just doing some integer computation and a bunch of CPU nops. I can't see anything there that would cause a NULL pointer dereference. Interesting ...

Maybe I'm not reading the panic log correctly.
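
One way to cross-check the reading of the log: in an ARM oops, the word in parentheses on the "Code:" line is the instruction at the faulting PC, here e5940000. The sketch below decodes the classic ARM (A32) load/store with 12-bit immediate offset. The struct and function names are made up for illustration, and it deliberately ignores the condition field and all other instruction classes.

```c
#include <stdint.h>

/* Decoded fields of an A32 LDR/STR with 12-bit immediate offset. */
struct ldr_imm {
    int is_ldr_str_imm;  /* 1 if bits 27:25 == 010 */
    int is_load;         /* L bit (bit 20): 1 = LDR, 0 = STR */
    unsigned rn;         /* base register, bits 19:16 */
    unsigned rd;         /* transferred register, bits 15:12 */
    int offset;          /* immediate offset, negated if the U bit is clear */
};

struct ldr_imm decode_ldr_imm(uint32_t insn)
{
    struct ldr_imm d = { 0, 0, 0, 0, 0 };

    d.is_ldr_str_imm = ((insn >> 25) & 0x7) == 0x2;
    if (!d.is_ldr_str_imm)
        return d;

    d.is_load = (insn >> 20) & 1;
    d.rn = (insn >> 16) & 0xf;
    d.rd = (insn >> 12) & 0xf;
    d.offset = (int)(insn & 0xfff);
    if (!((insn >> 23) & 1))  /* U bit clear: offset is subtracted */
        d.offset = -d.offset;
    return d;
}
```

Fed with 0xe5940000 this gives a load with rn = 4, rd = 0, offset = 0, i.e. ldr r0, [r4]. The register dump shows r4 : 00000000, which would match the reported fault at virtual address 00000000. So the oops looks like a load through a NULL r4 inside __timer_delay() rather than anything in container_of(), assuming the decode above is right.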

@ansuel I saw that you rebased 4748. Thanks for that, and there are no more merge errors; however, I'm getting new errors about patches failing while building the toolchain after a make distclean. In particular, for 765-3-net-next-net-dsa-stop-updating-master-MTU-from-master.c.patch all hunks failed (it looks like these changes were already accepted?), and for 765-4 a couple succeeded with fuzz, but at least one failed.

It's possible that the one 5.15 build that did succeed for me above was done with a toolchain built with 5.10. I didn't realize the toolchain is dependent on the kernel version (or I thought that, if it mattered, the toolchain would be rebuilt after selecting the testing 5.15 kernel). Not sure if that could have caused the build-time errors about the missing mr42/52 dtb files, or if you just haven't got to that yet.

No hurry, I'll play some more and check back later.

same here........

Yes sorry. I pushed the wrong patchset. Now it's the one I use in my buildroot

IMHO to investigate the problem we shouldn't care about the NULL pointer dereference, but about the fact that these panics happen right after the mux notifier...
IMHO the NULL pointer is caused by the CPU being in a bad state, which causes all sorts of problems... (stall... random NULL pointer... random not implemented ops)

After all these years I'm still convinced that the krait notifier for the safe parent is implemented in a bad way, and that in a corner case the whole mux configuration can end up wrong, causing the CPU to be clocked at an abnormal freq or a too low freq... Now that the regulator is actually working, the problem is even worse: in the old days the regulator was set to the max, so the CPU could handle an abnormal freq spike, but now that it's set to the correct voltage, if a spike happens the system will 99% crash from insufficient voltage.

Also, in our system we are lacking the safe parent for the gmac... which looks to be called even if the NSS core never changes frequency...

I need to find some time to stop working on the qca8k driver and check if the coordinated clk patch that I posted earlier can actually be applied and used in our system, and prepare a patch so you guys can test it.

@quarky I was checking: did you ever try the qca-rfs module?


The thing is that this is pure luck. You can totally have an idle system + a good chip + a good power brick and never experience this bug. The bug is triggered by a mix of factors that in the end cause a glitch in the system.

ah ok, i got u.

I also correlate keeping my min cpu freq at 800 as contributing to maintaining long uptimes (as well as the great work by those maintaining ipq806x systems). However, there is also at least one comment in the forums that the bug @facboy mentioned was fixed and this is no longer necessary.

@facboy you could switch back to 325 MHz (or whatever the old min freq was) and see whether you get a crash. A subsequent crash over the next 4-5 days would both help confirm that keeping the CPU at 800 mitigates the issue and remove a bit of the "luck" factor @ansuel (justifiably) suggests.

Hmm, using your latest push to PR 4748 (I did notice that PR 4828 is no longer needed for k515/dsa) I got this error:

net/dsa/tag_qca.c: In function 'qca_tag_rcv':
net/dsa/tag_qca.c:46:25: error: 'struct dsa_switch' has no member named 'tagger_data'
   46 |         tagger_data = ds->tagger_data;
      |                         ^~
net/dsa/tag_qca.c: In function 'qca_tag_connect':
net/dsa/tag_qca.c:98:11: error: 'struct dsa_switch' has no member named 'tagger_data'
   98 |         ds->tagger_data = tagger_data;
      |           ^~
net/dsa/tag_qca.c: In function 'qca_tag_disconnect':
net/dsa/tag_qca.c:105:17: error: 'struct dsa_switch' has no member named 'tagger_data'        
  105 |         kfree(ds->tagger_data);
      |                 ^~
net/dsa/tag_qca.c:106:11: error: 'struct dsa_switch' has no member named 'tagger_data'        
  106 |         ds->tagger_data = NULL;
      |           ^~
net/dsa/tag_qca.c: At top level:
net/dsa/tag_qca.c:112:10: error: 'const struct dsa_device_ops' has no member named 'connect'  
  112 |         .connect = qca_tag_connect,
      |          ^~~~~~~
net/dsa/tag_qca.c:112:9: warning: the address of 'qca_tag_connect' will always evaluate as 'true' [-Waddress]
  112 |         .connect = qca_tag_connect,
      |         ^
net/dsa/tag_qca.c:113:10: error: 'const struct dsa_device_ops' has no member named 'disconnect'
  113 |         .disconnect = qca_tag_disconnect,
      |          ^~~~~~~~~~
net/dsa/tag_qca.c:113:23: warning: excess elements in struct initializer
  113 |         .disconnect = qca_tag_disconnect,
      |                       ^~~~~~~~~~~~~~~~~~
net/dsa/tag_qca.c:113:23: note: (near initialization for 'qca_netdev_ops')
make[7]: *** [scripts/Makefile.build:277: net/dsa/tag_qca.o] Error 1
make[6]: *** [scripts/Makefile.build:540: net/dsa] Error 2
make[6]: *** Waiting for unfinished jobs....

Two other missing patches force-pushed...
