Netgear R7800 exploration (IPQ8065, QCA9984)

Hello,
just for information: in addition to the Kong R7800 build,
there is also the Cezary Jackiewicz R7800 build, which comes with a full repository, so you can install all the kernel modules and software. Have fun!

dl.eko.one.pl

It's been a while since I last encountered a kernel panic on my R7800, but it happened. Here's the panic log:

<4>[345594.390046] ath10k_pci 0000:01:00.0: Invalid peer id 36 peer stats buffer
<4>[345594.393247] ath10k_pci 0000:01:00.0: Invalid peer id 28 peer stats buffer
<4>[356254.856446] ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 46 tid 0
<4>[356254.856628] ath10k_pci 0000:01:00.0: Invalid peer id 46 peer stats buffer
<1>[360672.175152] 8<--- cut here ---
<1>[360672.175183] Unable to handle kernel paging request at virtual address c02296a8
<1>[360672.177106] pgd = e10a68e4
<1>[360672.184387] [c02296a8] *pgd=4221141e(bad)
<0>[360672.187176] Internal error: Oops: 8000000d [#1] SMP ARM
<4>[360672.191336] Modules linked in: ecm iptable_nat ath10k_pci ath10k_core ath xt_state xt_nat xt_conntrack xt_REDIRECT xt_MASQUERADE xt_FLOWOFFLOAD wireguard nf_nat nf_flow_table_hw nf_flow_table nf_conntrack mac80211 libchacha20poly1305 libblake2s ipt_REJECT ebtable_nat ebtable_filter ebtable_broute curve25519_neon cfg80211 xt_time xt_tcpudp xt_tcpmss xt_statistic xt_quota xt_pkttype xt_physdev xt_owner xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_ecn xt_dscp xt_comment xt_addrtype xt_TCPMSS xt_LOG xt_HL xt_DSCP xt_CLASSIFY ppp_async poly1305_arm nf_reject_ipv4 nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 macvlan libcurve25519_generic libblake2s_generic iptable_mangle iptable_filter ipt_ECN ip_tables ebtables ebt_vlan ebt_stp ebt_redirect ebt_pkttype ebt_mark_m ebt_mark ebt_limit ebt_among ebt_802_3 crc_ccitt compat chacha_neon sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_tcindex cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred act_gact qca_nss_tun6rd
<4>[360672.191567]  qca_nss_ipsecmgr qca_nss_cfi_cryptoapi qca_nss_qdisc qca_nss_crypto qca_nss_vlan qca_nss_pppoe pppoe pppox ppp_generic slhc qca_nss_gre qca_nss_bridge_mgr ledtrig_usbport xt_set ip_set_list_set ip_set_hash_netportnet ip_set_hash_netport ip_set_hash_netnet ip_set_hash_netiface ip_set_hash_net ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set_hash_ipmark ip_set_hash_ip ip_set_bitmap_port ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set nfnetlink nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 bonding ip6_gre ip_gre gre ip6_udp_tunnel udp_tunnel sit qca_nss_drv ipcomp6 xfrm6_tunnel esp6 ah6 xfrm4_tunnel ipcomp esp4 ah4 ipip ip6_tunnel qca_nss_gmac tunnel6 tunnel4 ip_tunnel tun qca_ssdk xfrm_user xfrm_ipcomp af_key xfrm_algo shortcut_fe_drv shortcut_fe_ipv6 shortcut_fe sha1_generic md5 echainiv des_generic libdes cbc authenc usb_storage leds_gpio xhci_plat_hcd xhci_pci xhci_hcd dwc3 dwc3_qcom
<4>[360672.262359]  ohci_platform ohci_hcd phy_qcom_ipq806x_usb ahci fsl_mph_dr_of ehci_platform ehci_fsl sd_mod ahci_platform libahci_platform libahci libata scsi_mod ehci_hcd gpio_button_hotplug ext4 mbcache jbd2 crc32c_generic
<4>[360672.371743] CPU: 1 PID: 25660 Comm: kworker/1:1 Not tainted 5.4.163 #0
<4>[360672.391423] Hardware name: Generic DT based system
<4>[360672.398033] Workqueue: events dbs_work_handler
<4>[360672.402881] PC is at 0xc02296a8
<4>[360672.407402] LR is at krait_mux_set_parent+0xac/0xcc
<4>[360672.410865] pc : [<c02296a8>]    lr : [<c06296a8>]    psr: 60000013
<4>[360672.415816] sp : d6e15d30  ip : 00000000  fp : dd998010
<4>[360672.422150] r10: ffffffff  r9 : 00000000  r8 : 00000002
<4>[360672.427446] r7 : d6e15da4  r6 : 20000013  r5 : 00000001  r4 : dd5e9858
<4>[360672.432744] r3 : 0000b0c7  r2 : 0000e698  r1 : 20000013  r0 : c0c656cc
<4>[360672.439082] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
<4>[360672.445679] Control: 10c5787d  Table: 5a6d006a  DAC: 00000051
<0>[360672.452970] Process kworker/1:1 (pid: 25660, stack limit = 0x51ff8251)
<0>[360672.458785] Stack: (0xd6e15d30 to 0xd6e16000)
<0>[360672.465304] 5d20:                                     dd5e9864 00000000 00000000 c062b250
<0>[360672.469830] 5d40: ffffffff 00000000 00000000 c033eae8 dd5eb218 00000000 dd5eb204 00000002
<0>[360672.478077] 5d60: d6e15da4 c033ed60 00000000 dd52f600 dd5eb200 c0c1ab30 dd5ea300 00000002
<0>[360672.486324] 5d80: 23c34600 dce5a080 dce57f00 c033ede4 00000000 00003248 dd52f600 c061b010
<0>[360672.494569] 5da0: dd52f600 dd5b9e80 23c34600 2faf0800 dd52f600 00000000 dd5ea300 2faf0800
<0>[360672.502816] 5dc0: dd51f240 c061d504 dd5ea1e8 dd51f240 2faf0800 dd4d23c0 23c34600 dce5a080
<0>[360672.511062] 5de0: dce57f00 c061d54c dd5ea300 00000000 2faf0800 dd51f240 dce5a000 dce5a080
<0>[360672.519309] 5e00: dce57f00 c061d8e8 dce5a0b4 2faf0800 00000000 ffffffff 2faf0800 c0c1cde4
<0>[360672.527555] 5e20: dce56640 dce56680 2faf0800 00000000 23c34600 c061dad0 dcc2a800 2faf0800
<0>[360672.535803] 5e40: 00000000 c0727f80 dcc2b400 c0726f6c dce5a0b4 dce5a034 dcc2b400 2faf0800
<0>[360672.544048] 5e60: dce5a588 00000000 3b9aca00 00000001 c0c04f28 00000001 c0c67644 dcc2b000
<0>[360672.552296] 5e80: 00000000 c0732884 d6e15ec0 dce56f80 2faf0800 000c3500 dcc2b000 dcc2b000
<0>[360672.560541] 5ea0: 00000000 c0c67620 00000000 00000002 000c3500 00000000 ffffe000 c072c548
<0>[360672.568789] 5ec0: dcc2b000 000927c0 000c3500 000000a1 dcc2b000 dce5a300 dce5a300 dce57e00
<0>[360672.577034] 5ee0: dce57e00 dce56b40 00000000 c072fd14 dce5a338 00000000 dce5a304 dcc2b000
<0>[360672.585282] 5f00: c0c289fc 00000040 00000000 c0730974 dce5a338 d6e37980 dd99e680 dd9a1800
<0>[360672.593528] 5f20: 00000000 c03379e4 00000008 c0c03d00 d6e37980 d6e37994 dd99e680 00000008
<0>[360672.601774] 5f40: c0c03d00 dd99e698 dd99e680 c0337ca4 c0c0be4c c09c9b4c ffffe000 c0337c50
<0>[360672.610021] 5f60: d6e37980 d4d4b8c0 d4d4b900 00000000 d6e14000 c0337c50 d6e37980 dbd97eac
<0>[360672.618268] 5f80: d4d4b8dc c033dc80 00000000 d4d4b900 c033db20 00000000 00000000 00000000
<0>[360672.626513] 5fa0: 00000000 00000000 00000000 c03010e8 00000000 00000000 00000000 00000000
<0>[360672.634759] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<0>[360672.643007] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
<4>[360672.651253] [<c06296a8>] (krait_mux_set_parent) from [<c062b250>] (krait_notifier_cb+0x58/0xb4)
<4>[360672.659500] [<c062b250>] (krait_notifier_cb) from [<c033eae8>] (notifier_call_chain+0x74/0xa8)
<4>[360672.668434] [<c033eae8>] (notifier_call_chain) from [<c033ed60>] (__srcu_notifier_call_chain+0x54/0xc0)
<4>[360672.676854] [<c033ed60>] (__srcu_notifier_call_chain) from [<c033ede4>] (srcu_notifier_call_chain+0x18/0x20)
<4>[360672.686580] [<c033ede4>] (srcu_notifier_call_chain) from [<c061b010>] (__clk_notify+0x70/0x94)
<4>[360672.696385] [<c061b010>] (__clk_notify) from [<c061d504>] (clk_change_rate+0xfc/0x29c)
<4>[360672.704889] [<c061d504>] (clk_change_rate) from [<c061d54c>] (clk_change_rate+0x144/0x29c)
<4>[360672.712875] [<c061d54c>] (clk_change_rate) from [<c061d8e8>] (clk_core_set_rate_nolock+0xfc/0x14c)
<4>[360672.721209] [<c061d8e8>] (clk_core_set_rate_nolock) from [<c061dad0>] (clk_set_rate+0x38/0x9c)
<4>[360672.730242] [<c061dad0>] (clk_set_rate) from [<c0727f80>] (dev_pm_opp_set_rate+0x238/0x49c)
<4>[360672.738921] [<c0727f80>] (dev_pm_opp_set_rate) from [<c0732884>] (set_target+0x17c/0x1ec)
<4>[360672.747596] [<c0732884>] (set_target) from [<c072c548>] (__cpufreq_driver_target+0x1a0/0x568)
<4>[360672.755668] [<c072c548>] (__cpufreq_driver_target) from [<c072fd14>] (od_dbs_update+0xc8/0x19c)
<4>[360672.764264] [<c072fd14>] (od_dbs_update) from [<c0730974>] (dbs_work_handler+0x38/0x70)
<4>[360672.773211] [<c0730974>] (dbs_work_handler) from [<c03379e4>] (process_one_work+0x234/0x4a0)
<4>[360672.781279] [<c03379e4>] (process_one_work) from [<c0337ca4>] (worker_thread+0x54/0x604)
<4>[360672.789699] [<c0337ca4>] (worker_thread) from [<c033dc80>] (kthread+0x160/0x164)
<4>[360672.797856] [<c033dc80>] (kthread) from [<c03010e8>] (ret_from_fork+0x14/0x2c)
<4>[360672.805313] Exception stack(0xd6e15fb0 to 0xd6e15ff8)
<4>[360672.812437] 5fa0:                                     00000000 00000000 00000000 00000000
<4>[360672.817659] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<4>[360672.825900] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000
<0>[360672.834147] Code: e5943028 e1cd02f0 e0420003 ebfffffe (e1c423d8) 
<4>[360672.840993] ---[ end trace fa2b240b40f8a6fe ]---

I'm running a custom OpenWrt 21.02 build from my GitHub repo, with the NSS drivers.

From the panic log, it looks like the panic comes from the CPU clock driver, while it is trying to change the clock frequency?

It'll probably take me a while to figure out what's wrong, so I'm hoping that more eyes on the issue will help figure it out faster.

@Ansuel any idea?

How did you manage to get that crash log? That kind of crash looks to be very random. Also, are the ath10k errors normal, or is this the first time you have noticed them?

Yes, going by the log the crash was triggered from the cpufreq driver (which scales the CPU frequency), but it actually happened in the krait notifier, which is used to put the clock at a safe frequency before switching. So the problem is not in the cpufreq driver itself but at a much deeper level, in the clk driver that controls the frequency.

I don't know, but looking at the ath10k errors and the strange kernel panic, it looks like the system was already in a badly broken state and crashed on the first operation it attempted. (It could also really be a bug in the mux clk code that is hard to reproduce because it is very difficult to trigger.)
Any idea how to reproduce this?
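
A small check that can be done on the log itself, offered only as an observation and not as a diagnosis: the faulting PC (0xc02296a8) and the LR (0xc06296a8, krait_mux_set_parent+0xac) look almost identical. A minimal sketch of the comparison in shell arithmetic, using only the two addresses printed in the panic log above:

```shell
#!/bin/sh
# Compare the faulting PC with the LR from the panic log.
pc=0xc02296a8   # "PC is at 0xc02296a8"
lr=0xc06296a8   # "LR is at krait_mux_set_parent+0xac/0xcc"

# XOR leaves a 1 in every bit position where the two addresses differ.
diff=$(( pc ^ lr ))
printf 'pc ^ lr = 0x%08x\n' "$diff"

# A non-zero value with exactly one bit set satisfies x & (x - 1) == 0.
single_bit=no
if [ "$diff" -ne 0 ] && [ $(( diff & (diff - 1) )) -eq 0 ]; then
    single_bit=yes
fi
echo "differ in exactly one bit: $single_bit"
```

The two addresses differ only in bit 22 (0x00400000). That would fit a corrupted return address at least as well as a logic bug in the clk code, but a single sample cannot settle it either way.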

The crash log is from the ramoops pstore after the router rebooted.

The ath10k errors seem to be a recent thing, from what I remember. I think I started noticing them in early Dec 2021, after I made a new build to test the ar8337 IGMP snooping code. Since then, the router's WiFi has been unstable: I sometimes need to restart the WiFi interface to get traffic flowing through WiFi again. LAN traffic is fine, though.

Unfortunately, I have no idea what caused the panic, nor do I know how to reproduce it. I was reading forum posts, and the router was connected to an OpenVPN TAP tunnel from a remote site, when it crashed.

Hmm, was it under load? Can you share the config for the pstore? It would be handy for tracking down some problems with NSS scaling (I also get some crashes, but as they are random I have never managed to investigate them).

Sure. I configured pstore in the .dts file. Here's the snippet:

	reserved-memory {
		rsvd@5fe00000 {
			reg = <0x5fe00000 0x200000>;
			reusable;
		};

		ramoops@42100000 {
			compatible = "ramoops";
			reg = <0x42100000 0x40000>;
			record-size = <0x4000>;
			console-size = <0x4000>;
			ftrace-size = <0x4000>;
			pmsg-size = <0x4000>;
		};
	};

Enable the pstore filesystem and ramoops using make menuconfig, and oops logs should start appearing in /sys/fs/pstore after the router reboots.

Edit: You can test the ramoops log by triggering a crash with this command (if memory serves):

echo c > /proc/sysrq-trigger

Reboots caused by the watchdog (which I assume is what causes the spontaneous reboots with NSS scaling) are not kernel panics, so no logs will be stored for them.
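
Since ramoops lives in RAM, the records survive a warm reboot but not a power cycle, so it can be worth copying them somewhere persistent after each boot. A minimal sketch, assuming the /sys/fs/pstore path mentioned above; the function name save_pstore_logs and the destination /root/pstore-logs are hypothetical choices for illustration:

```shell
#!/bin/sh
# Copy any pstore records to a directory that survives reboots.
# PSTORE_DIR is the path named in the post; DEST_DIR is an assumed location.
PSTORE_DIR="${PSTORE_DIR:-/sys/fs/pstore}"
DEST_DIR="${DEST_DIR:-/root/pstore-logs}"

# save_pstore_logs SRC DEST
# Copies every regular file in SRC into DEST/<timestamp>/ and prints
# the number of files copied (0 if there was nothing to save).
save_pstore_logs() {
    src="$1"
    dest="$2"
    [ -d "$src" ] || { echo "no pstore directory at $src" >&2; return 1; }
    stamp=$(date +%Y%m%d-%H%M%S)
    count=0
    for f in "$src"/*; do
        [ -f "$f" ] || continue          # skips the unexpanded glob when empty
        mkdir -p "$dest/$stamp" || return 1
        cp "$f" "$dest/$stamp/" || return 1
        count=$((count + 1))
    done
    echo "$count"
}
```

It could be called as save_pstore_logs "$PSTORE_DIR" "$DEST_DIR", for example from a boot-time script. The sketch deliberately leaves the original records in place.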

Curiously, I also got a crash with my 21.02 build last week. First crash in ages.

Upstreaming and enabling ramoops for R7800 (and all ipq806x?) might be helpful. Mt7622 has it on by default, and it has helped a lot in the mt76 WiFi driver related crashes with e8450/rt3200.

Heh heh ... all my routers running OpenWrt now have ramoops enabled.

The thing is that it would reduce the available RAM. I assume other targets have a better way to store this kind of data. Interesting that it does work and U-Boot doesn't clear this region. I wonder if I could try implementing the original way QCOM stores kernel panics. It should be a fun project.

Curiously, mt7622 seems to be the only one where it is currently enabled in OpenWrt by default.

Only marginally, I think, from the ipq806x perspective. By 256 kB as quarky has set it? (Or by 64 kB, as with mt7622.)
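
The numbers can be worked out from the ramoops node posted earlier. A rough sketch of the arithmetic, assuming the driver hands the console, ftrace and pmsg areas their configured sizes and splits whatever remains into record-size chunks for oops/panic dumps, as the kernel's ramoops documentation describes:

```shell
#!/bin/sh
# Sizes taken from the ramoops node in the DTS snippet.
total=0x40000     # reg = <0x42100000 0x40000>
record=0x4000     # record-size
console=0x4000    # console-size
ftrace=0x4000     # ftrace-size
pmsg=0x4000       # pmsg-size

total_kib=$(( total / 1024 ))
# Area left for oops/panic records after the three fixed areas.
dump_area=$(( total - console - ftrace - pmsg ))
dump_kib=$(( dump_area / 1024 ))
dump_records=$(( dump_area / record ))

echo "reserved region: ${total_kib} KiB"
echo "left for oops/panic records: ${dump_kib} KiB"
echo "records of ${record} bytes: ${dump_records}"
```

With these values the reserved region is 256 KiB, of which 208 KiB (13 records of 16 KiB each) would be available for dumps. On a device with 512 MB of RAM such as the R7800, 256 KiB is about 0.05% of memory.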

It would for sure make collecting debug info easier, so I vote for it to be enabled by default.

Potentially dumb question: can an image that has ramoops enabled be flashed with sysupgrade? Or does the change in RAM layout make a factory flash necessary?

Sysupgrade is perfectly fine.

Flashing a factory image is only needed when the mtd partition layout changes. Here we only change how the reserved memory is allocated and used.
I honestly think we should consider enabling this feature by default. It would help with the ath10k problem and the crash problem. (For example, we might discover that the panic is not random but actually the same for everyone, and we simply never noticed.)
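
To compare panics between users, the call chain can be pulled out of a saved dump and reduced to a list of function names. A minimal sketch, assuming the ARM backtrace line format seen in the logs in this thread; the function name crash_signature is hypothetical:

```shell
#!/bin/sh
# crash_signature FILE
# Prints the function on the left-hand side of each backtrace line of the form
#   <4>[ts] [<addr>] (func) from [<addr>] (caller+0x.../0x...)
# one name per line, innermost frame first. Other lines are ignored.
crash_signature() {
    sed -n 's/^.*\] \[<[0-9a-f]*>\] (\([^)+]*\)) from .*$/\1/p' "$1"
}
```

For example, crash_signature /sys/fs/pstore/dmesg-ramoops-0 | head -n 5 would give the innermost five frames, which two people could compare without having to diff addresses and timestamps.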

Yeah, with mt7622 this revealed a common standard error: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000053, which then led to https://github.com/openwrt/mt76/issues/565 and a fix.

Yes, I will probably be enabling this on my personal builds at the least.
It would be great if this could be enabled for more devices more easily, instead of through individual dts mods.

It really depends on how U-Boot initializes the memory. If it's cleared on reboot, then RIP pstore.

Thanks for the example @quarky,

this seems to work nicely with R7800 in master:

Kernel config change + DTS change:

--- a/target/linux/ipq806x/config-5.10
+++ b/target/linux/ipq806x/config-5.10
@@ -365,6 +365,19 @@ CONFIG_POWER_RESET_MSM=y
 CONFIG_POWER_SUPPLY=y
 CONFIG_PPS=y
 CONFIG_PRINTK_TIME=y
+CONFIG_PSTORE=y
+# CONFIG_PSTORE_842_COMPRESS is not set
+CONFIG_PSTORE_COMPRESS=y
+CONFIG_PSTORE_COMPRESS_DEFAULT="deflate"
+# CONFIG_PSTORE_CONSOLE is not set
+CONFIG_PSTORE_DEFLATE_COMPRESS=y
+CONFIG_PSTORE_DEFLATE_COMPRESS_DEFAULT=y
+# CONFIG_PSTORE_LZ4HC_COMPRESS is not set
+# CONFIG_PSTORE_LZ4_COMPRESS is not set
+# CONFIG_PSTORE_LZO_COMPRESS is not set
+# CONFIG_PSTORE_PMSG is not set
+CONFIG_PSTORE_RAM=y
+# CONFIG_PSTORE_ZSTD_COMPRESS is not set
 CONFIG_PTP_1588_CLOCK=y
 # CONFIG_QCOM_A53PLL is not set
 CONFIG_QCOM_ADM=y
@@ -397,6 +410,9 @@ CONFIG_QCOM_WDT=y
 # CONFIG_QCS_TURING_404 is not set
 CONFIG_RAS=y
 CONFIG_RATIONAL=y
+CONFIG_REED_SOLOMON=y
+CONFIG_REED_SOLOMON_DEC8=y
+CONFIG_REED_SOLOMON_ENC8=y
 CONFIG_REGMAP=y
 CONFIG_REGMAP_MMIO=y
 CONFIG_REGULATOR=y
--- a/target/linux/ipq806x/files/arch/arm/boot/dts/qcom-ipq8065-nighthawk.dtsi
+++ b/target/linux/ipq806x/files/arch/arm/boot/dts/qcom-ipq8065-nighthawk.dtsi
@@ -13,6 +13,15 @@
 			reg = <0x5fe00000 0x200000>;
 			reusable;
 		};
+
+		ramoops@42100000 {
+			compatible = "ramoops";
+			reg = <0x42100000 0x40000>;
+			record-size = <0x4000>;
+			console-size = <0x4000>;
+			ftrace-size = <0x4000>;
+			pmsg-size = <0x4000>;
+		};
 	};
 
 	aliases {

pstore file after reboot:

 -----------------------------------------------------
 OpenWrt SNAPSHOT, r18562-0765466a42
 -----------------------------------------------------
root@router1:~# ls -l /sys/fs/pstore/
-r--r--r--    1 root     root         27199 Jan 14 18:30 dmesg-ramoops-0
root@router1:~# cat /sys/fs/pstore/dmesg-ramoops-0
Panic#1 Part1
...
<6>[  131.097393] br-lan: port 2(wlan0) entered forwarding state
<6>[  132.324175] sysrq: Trigger a crash
<0>[  132.324215] Kernel panic - not syncing: sysrq triggered crash
<2>[  132.326486] CPU1: stopping
<4>[  132.332288] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.10.90 #0
<4>[  132.334893] Hardware name: Generic DT based system
<4>[  132.341070] [<c030e32c>] (unwind_backtrace) from [<c030a1ac>] (show_stack+0x14/0x20)
<4>[  132.345668] [<c030a1ac>] (show_stack) from [<c062eac8>] (dump_stack+0x94/0xa8)
<4>[  132.353561] [<c062eac8>] (dump_stack) from [<c030d050>] (do_handle_IPI+0x140/0x184)
<4>[  132.360593] [<c030d050>] (do_handle_IPI) from [<c030d0b0>] (ipi_handler+0x1c/0x2c)
<4>[  132.368145] [<c030d0b0>] (ipi_handler) from [<c0370f7c>] (__handle_domain_irq+0x90/0xf4)
<4>[  132.375789] [<c0370f7c>] (__handle_domain_irq) from [<c0648e20>] (gic_handle_irq+0x90/0xb8)
<4>[  132.384034] [<c0648e20>] (gic_handle_irq) from [<c0300b0c>] (__irq_svc+0x6c/0x90)
<4>[  132.392100] Exception stack(0xc146df18 to 0xc146df60)
<4>[  132.399739] df00:                                                       00000000 0000001e
<4>[  132.404784] df20: 1cd5a000 dd9a0cc0 00000000 cf226ee0 c1c73040 00000000 dd99ffb0 0000001e
<4>[  132.412944] df40: 00000000 0000001e 0003fd40 c146df68 c07b5fec c07b600c 60000013 ffffffff
<4>[  132.421106] [<c0300b0c>] (__irq_svc) from [<c07b600c>] (cpuidle_enter_state+0x180/0x380)
<4>[  132.429256] [<c07b600c>] (cpuidle_enter_state) from [<c07b625c>] (cpuidle_enter+0x3c/0x5c)
<4>[  132.437417] [<c07b625c>] (cpuidle_enter) from [<c034df10>] (do_idle+0x208/0x2a4)
<4>[  132.445487] [<c034df10>] (do_idle) from [<c034e268>] (cpu_startup_entry+0x1c/0x20)
<4>[  132.453040] [<c034e268>] (cpu_startup_entry) from [<4230152c>] (0x4230152c)
root@router1:~#

(I applied the DTS change to the combined R7800+XR500 Nighthawk .dtsi in master, but in 21.02 it would be directly to the R7800 .dts file.)

Thanks, I was stuck on the dts part but your diff helped.