Just "opkg remove irqbalance". Try to keep the same image and make one small change at a time (e.g. disable irqbalance) while observing the behavior. A big change like switching to a new image can introduce lots of unknown variables into the equation.

I followed the Netgear R7800 exploration thread and learned that hnyman made a pull request to make this ramoops/pstore persistent after a random reboot.

Basically this means that all you need to do is install the kmod-ramoops package. This will pull in kmod-pstore and kmod-reed-solomon as dependencies. You can either include it in your own diffconfig or install it with opkg later on. When installed with opkg, you need to reboot once so that everything is set up properly before a random reboot happens. You can check that it's properly set up after this reboot with:

root@router1:~# logread | grep -i -E "pstore|ramo"
Wed Feb  9 23:49:14 2022 kern.info kernel: [   16.377464] pstore: Using crash dump compression: deflate
Wed Feb  9 23:49:14 2022 kern.info kernel: [   16.377494] pstore: Registered ramoops as persistent store backend
Wed Feb  9 23:49:14 2022 kern.info kernel: [   16.381920] ramoops: using 0x40000@0x42100000, ecc: 0
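
The same check can be wrapped in a small helper for use in scripts. This is only a minimal sketch: the function name ramoops_registered is hypothetical, and it matches nothing more than the message text shown in the output above.

```sh
#!/bin/sh
# Hypothetical helper: reads log text on stdin and succeeds (exit 0) only if
# the "Registered ramoops" message from the output above is present.
ramoops_registered() {
    grep -q -F "pstore: Registered ramoops as persistent store backend"
}

# Usage on the router (not executed here):
#   logread | ramoops_registered && echo "ramoops is active"
```

Because logread only holds recent messages, this is best run shortly after boot, before the boot messages have rotated out of the log buffer.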

Then I did:

echo c > /proc/sysrq-trigger

and immediately my ap-R7800 rebooted. Afterwards I found a file in /sys/fs/pstore:

root@ap-R7800:~# ls -l /sys/fs/pstore/
-r--r--r--    1 root     root         27182 Sep  9 14:40 dmesg-ramoops-0

It had the following content:

<4>[877148.946890] br-lan: received packet on eth1.1 with own address as source address (addr:12:33:3b:4c:73:ec, vlan:0)
<4>[877151.426933] br-lan: received packet on eth1.1 with own address as source address (addr:12:33:3b:4c:73:ec, vlan:0)
<4>[877151.427177] br-lan: received packet on eth1.1 with own address as source address (addr:12:33:3b:4c:73:ec, vlan:0)
<4>[877153.621894] br-lan: received packet on eth1.1 with own address as source address (addr:12:33:3b:4c:73:ec, vlan:0)
<4>[877153.622162] br-lan: received packet on eth1.1 with own address as source address (addr:12:33:3b:4c:73:ec, vlan:0)
<4>[882242.977602] ath10k_pci 0000:01:00.0: Invalid peer id 476 peer stats buffer
<4>[886458.948966] ath10k_pci 0000:01:00.0: Invalid peer id 482 peer stats buffer
<4>[886515.128957] ath10k_pci 0000:01:00.0: Invalid peer id 477 peer stats buffer
<4>[928488.983994] ath10k_pci 0000:01:00.0: Invalid peer id 490 peer stats buffer
<6>[940901.364018] sysrq: Trigger a crash
<0>[940901.364058] Kernel panic - not syncing: sysrq triggered crash
<2>[940901.366335] CPU0: stopping
<4>[940901.372228] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.138 #0
<4>[940901.374914] Hardware name: Generic DT based system
<4>[940901.381190] [<c030e46c>] (unwind_backtrace) from [<c030a204>] (show_stack+0x14/0x20)
<4>[940901.385955] [<c030a204>] (show_stack) from [<c062ef48>] (dump_stack+0x94/0xa8)
<4>[940901.393938] [<c062ef48>] (dump_stack) from [<c030d190>] (do_handle_IPI+0x140/0x184)
<4>[940901.401054] [<c030d190>] (do_handle_IPI) from [<c030d1f0>] (ipi_handler+0x1c/0x2c)
<4>[940901.409043] [<c030d1f0>] (ipi_handler) from [<c037174c>] (__handle_domain_irq+0x90/0xf4)
<4>[940901.416429] [<c037174c>] (__handle_domain_irq) from [<c06482e0>] (gic_handle_irq+0x90/0xb8)
<4>[940901.424759] [<c06482e0>] (gic_handle_irq) from [<c0300b8c>] (__irq_svc+0x6c/0x90)
<4>[940901.433252] Exception stack(0xc0d01ee0 to 0xc0d01f28)
<4>[940901.440647] 1ee0: 00000000 000357be 1cd4a000 dd990d80 00000000 abadff60 c1ca8840 00000000
<4>[940901.445773] 1f00: dd990030 000357be 00000000 000357be 826a5280 c0d01f30 c07b64ac c07b64cc
<4>[940901.454003] 1f20: 60000013 ffffffff
<4>[940901.462255] [<c0300b8c>] (__irq_svc) from [<c07b64cc>] (cpuidle_enter_state+0x180/0x380)
<4>[940901.465989] [<c07b64cc>] (cpuidle_enter_state) from [<c07b671c>] (cpuidle_enter+0x3c/0x5c)
<4>[940901.474061] [<c07b671c>] (cpuidle_enter) from [<c034e670>] (do_idle+0x208/0x2a4)
<4>[940901.482216] [<c034e670>] (do_idle) from [<c034e9c8>] (cpu_startup_entry+0x1c/0x20)
<4>[940901.489866] [<c034e9c8>] (cpu_startup_entry) from [<c0c01008>] (start_kernel+0x528/0x538)

So I've now learned and confirmed how to verify that ramoops is working. I got a bit wiser again today, and hopefully sharing this helps others verify it too.
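
Once a dump shows up in /sys/fs/pstore, it can be worth copying it somewhere safe before further testing. A minimal sketch, assuming the dumps appear as regular files as in the listing above; the function name save_pstore_dumps and its arguments are hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: copy every dump found in SRC (default /sys/fs/pstore)
# into DEST, prefixing each copy with a timestamp so older dumps are kept.
# Prints the number of files copied.
save_pstore_dumps() {
    dest=$1
    src=${2:-/sys/fs/pstore}
    stamp=$(date +%Y%m%d-%H%M%S)
    mkdir -p "$dest" || return 1
    count=0
    for f in "$src"/*; do
        [ -f "$f" ] || continue   # also skips the unexpanded glob when SRC is empty
        cp "$f" "$dest/$stamp-$(basename "$f")" || return 1
        count=$((count + 1))
    done
    echo "$count"
}

# Usage on the router (not executed here), assuming /root is on persistent storage:
#   save_pstore_dumps /root/crashdumps
```

The second argument exists only so the function can be tried against a scratch directory instead of the real pstore mount.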

I went through the Netgear R7800 exploration thread and saw that @quarky and @Ansuel were chasing down a kernel crash issue and were looking into CPU frequency changes as a possible cause. Around that time the WiFi slowdown related to ATF/AQL was also showing up, and a lot of joint effort was put into that too. Ansuel found a few strange things in old R7800 code around February this year. Some time later, in March, he discovered some more strange things; it's technical, but I think I can follow most of it. Up until this point I kept getting the idea that setting the CPU frequency to a fixed value might be a solution. After reading this post, I'm even more convinced to try a fixed CPU frequency on an R7800. Maybe not the performance governor, but fixed at 1400MHz or so?

Now, in reading all of this I'm developing a theory as I write. It appears that there's something strange in the R7800 CPU frequency code when scaling up/down. We all know that we need to increase the minimum frequency to at least 600MHz or perhaps 800MHz to improve stability, right? This has been a known issue on the R7800 for a long time anyway. Now bear with me: what if there's some kind of regression or enhancement somewhere in the master code for kernel 5.10 that somehow defeats the stability we had on the R7800 platform up until kernel 5.4? Meaning that if you have not fixed the CPU frequency to any value, your R7800 changes frequency on demand. This is where NSS acceleration without PPPoE support comes into play. Before we had NSS acceleration with PPPoE, our CPUs were probably maxed out most of the time, with hardly any on-demand frequency switching from 600/800MHz to 1700MHz, so any issue with CPU frequency switching on kernel 5.10 would probably go unnoticed for a while. With NSS acceleration on PPPoE we now see CPUs hardly doing much work, until sometimes a burst of work is required and the CPU quickly changes frequency. And after reading about all of these discoveries in R7800 code that seemed to be "quickly built without proper documentation", it wouldn't surprise me if changing CPU frequency is hit and miss as to whether you get a random reboot or not...

The latest work from Ansuel is a fix for the cache scaling driver on 5.15; from the looks of it, he put serious effort into this CPU frequency scaling issue. I don't think this has been backported (yet) to 5.10.

So, if anything, I'm going to build a fresh 22.03 image, keep my config, but set the CPU frequency scaling to a fixed number. Something like this:
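
A minimal sketch of such a clamp, assuming the standard Linux cpufreq sysfs files (scaling_min_freq and scaling_max_freq); the function name clamp_cpufreq and its optional second argument are hypothetical, and 1400000 kHz is simply the 1400MHz value floated above.

```sh
#!/bin/sh
# Hypothetical helper: pin every cpufreq policy to one frequency (in kHz).
# ROOT defaults to the usual sysfs location and is a parameter only so the
# function can be tried against a scratch directory.
clamp_cpufreq() {
    khz=$1
    root=${2:-/sys/devices/system/cpu/cpufreq}
    for policy in "$root"/policy*; do
        [ -d "$policy" ] || continue
        # Assumes the target lies inside the current min..max range, so
        # raising min first and then lowering max never crosses the limits.
        echo "$khz" > "$policy/scaling_min_freq" || return 1
        echo "$khz" > "$policy/scaling_max_freq" || return 1
    done
}

# Usage on the router (not executed here):
#   clamp_cpufreq 1400000
```

Values written to sysfs do not survive a reboot, so on a real device the call would have to be repeated at boot, for example from /etc/rc.local.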

Persistent ramoops has been working properly in ACwifidude's NSS 22.03 and Master images for quite some time. His diffconfig files also have the necessary kmod packages for it.

I messed around with CPU governor settings quite a lot in the past and noted that crashes caused by these CPU/cache frequency issues always tend to produce crash dumps with "clk_clock_change" or something else with "clk" in them.

However, there were also random reboots in which no crash dump was generated at all, like the RCU stalling reboots I have encountered lately. I don't think these random reboots have anything to do with the CPU/cache frequency scaling issue at all. So far disabling irqbalance seems to work well for my routers (4 RCU stalling reboots on 2 routers in 1 day after loading a new image; no reboots on either router after almost 3 days with irqbalance disabled). Of course, what works for me may not fix the "reboots every few hours" on your router. OpenWrt is our hobby, so to each his own.

You should definitely try clamping your CPU to a single frequency to see if it helps your router. FYI, Mpilon used the performance governor (single frequency) and he still kept getting RCU stalling reboots.

It is recommended that a 22.03 build use the same kernel as the official release. Otherwise, installing a new ipk often reports errors.

For any private images (e.g. ACwifidude's ones or any others), any kmod packages or packages that have some kmod dependencies must be installed manually using the kmod packages under the Packages-... folder on the download site.

I tested that by triggering a reboot, and it works. But I also had a few random reboots without any ramoops file being recorded. And from the looks of it, I wasn't alone. In some cases, random reboots also happened to @shelterx with an NSS-enabled build.

Yeah, this is kinda uncertain, and it's the reason I'm going to fix the CPUs to a single frequency of 1400MHz. Irqbalance is already disabled and will stay that way for now. In that R7800 exploration thread I also read a few posts about R7800's that would not crash when the performance governor was in use, all the time. So I'm going to hope for that, clamp CPU frequency and :crossed_fingers: :sweat_smile:

I've finished building 22.03.0 images, one with latest ath10k firmware and one with ath10k-ct driver for my C2600. Ramoops is enabled.

  • packet steering kept disabled (already done that for a longer time)
  • irqbalance disabled
  • clamped scaling_min_freq and scaling_max_freq on all ipq806x devices with NSS firmware to 1400000

I also discovered that my dumb AP's running NSS accelerated firmware had uptimes of a few days, while those are really just dumb AP's and not doing much. Perhaps they break some sweat when a few clients roam, but that's really all they do. But still they randomly reboot. But not as often as my router did (within hours with my previous 22.03-RC6 build). So all suspicious parameters are now configured in the same way on all devices.

So, there's only one thing left to say:

[image]

D43m0n: "In that R7800 exploration thread I also read a few posts about R7800's that would not crash when the performance governor was in use, all the time."

The Internet is a messy potpourri of good information, misinformation and conflicting information. Just ask Mpilon .. he will tell you that his R7800 with the performance governor crashed all the time! Maybe those crashes were caused by something else, but unsuspecting readers will scratch their heads ?w?t?h? Cheers.

You omitted the bit about What Works Wisdom changing over time - from hard knowledge to half baked theories to hands full of clutched straws - which gain credibility and then lose it. Lost to time.

'hardware' anyone?

There is no repository of What Works, complete with dated entries and a change log.

Some of us have a reproducible set of fails, different threads which crash but when calling the same func.

I don't suspect that func, I suspect one or more nss kernel drivers/modules/things mis-handling the file descriptors in flight.

They may be tickling a cache weakness, and if I could get hold of Qualcomm's errata for the CPU - or for ARMv7-A cores more generally - I'd have a better idea what to look for.

Have to go to a different computer for the next bit, stay tuned!

The Next Bit:

Submitted for your approval - a slightly-off character, been around for a long time, works tirelessly, but with a shady appearance.

openwrt/build_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/linux-ipq806x_generic/linux-5.10.136/fs/seq_file.c

seq_read_iter(struct kiocb *iocb, struct iov_iter *iter)

it doesn't look like this func has changed since at least kernel 5.4 -- this is one of the read funcs written into dev structs as its howto-read-this-device func ...

so debugging this means getting the actual device involved. everything done thru the function ptrs in the dev's ops struct.

here's the good bit, the suspicious bit: it locks a mutex and has the potential to loop forever:

ssize_t seq_read_iter(struct kiocb *iocb, struct iov_iter *iter)	/* line 171, seq_file.c */
{
	struct seq_file *m = iocb->ki_filp->private_data;
	/* ... */
	if (!iov_iter_count(iter))
		return 0;

	mutex_lock(&m->lock);	/* <<< line 182 */
	/* ... */
	/* Don't assume ki_pos is where we left it */
	if (unlikely(iocb->ki_pos != m->read_pos)) {
		while ((err = traverse(m, iocb->ki_pos)) == -EAGAIN)	/* line 195 */
			;	/* !!! the suspicious empty loop */

As I understand, this is called all over linux-land. I don't like the while ... ; but we've all lived with it for a long time - If it's involved it may be the specific funcs supporting this device ... private mem pool too small? 2 things fighting over the same mutex?

mmmv,
M

You’re far more knowledgeable to dig into this than I am. I did some C programming for a few months in school 12 years ago, to get to know the language and to get the point across that memory management is up to the developer.

I think you might have a chat with @Ansuel since he’s done some digging in older code as well and also has a suspicion set on cache timing and I believe also some unusual mutex handling. Now I don’t know if you guys are poking in the same corner, but both of you also are familiar with NSS firmware.

Thanks for the kind words.

My background is in the system side .. I used to say I didn't really care what the application was, I liked to work on the system underneath.

But I saw the posts from 3 days ago talking about the details of how nss works, and realized I'm in over my head.

I can diff code bases and have hunches about the kind of problems to look for ... But aside from diffing code bases, googling, and toggling gross features, there's not much else I can do.

I've been running the @KONG NSS 22.03 code base for 3 days now with no reboots. I'm not using irqbalance, because I'm going to feel silly if that's what shows the problem (!)

If that's the case it could point directly at the NSS ISRs needing to gate or dodge irqbalance .. changing cores close to or during ISR firing ... Maybe some data used by an ISR that's bound to one core and not happy if the code moves to the other core.

'way too much nss for me to get involved in' - with apologies to Chuck Berry.

I believe @KONG 22.03 NSS builds don't have @tishipp 's PPPoE fixes (yet). So if PPPoE is required for your connection, check if your CPU's stay low when doing a speedtest. The 22.03 and master NSS builds (kernel 5.10) from @ACwifidude were also stable, until PPPoE got accelerated again. I haven't removed the L2TP patch in my recent build yet, because in a previous build without that, it didn't make a difference.

Now... I've had 2 random reboots yesterday... One on the router and one on the R7800 dumb AP...
And unfortunately... No ramoops file in /sys/fs/pstore....
I've found this prior to the reboot in the remote syslog:

Sep  9 22:24:24 OpenWrt kernel: [ 8626.009090] ath10k_warn: 41 callbacks suppressed
Sep  9 22:24:24 OpenWrt kernel: [ 8626.009099] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
Sep  9 22:24:24 OpenWrt kernel: [ 8626.012794] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 1, skipped old beacon
Sep  9 22:24:24 OpenWrt kernel: [ 8626.019912] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
Sep  9 22:24:24 OpenWrt kernel: [ 8626.027115] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 1, skipped old beacon
Sep  9 22:24:24 OpenWrt kernel: [ 8626.034508] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
Sep  9 22:24:24 OpenWrt kernel: [ 8626.041769] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 1, skipped old beacon
Sep  9 22:24:24 OpenWrt kernel: [ 8626.049056] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
Sep  9 22:24:24 OpenWrt kernel: [ 8626.056283] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 1, skipped old beacon
Sep  9 22:24:24 OpenWrt kernel: [ 8626.063648] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
Sep  9 22:24:24 OpenWrt kernel: [ 8626.070940] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 1, skipped old beacon
Sep  9 22:24:26 OpenWrt dnsmasq-dhcp[1]: DHCPREQUEST(br-lan) 192.168.1.15 9e:bf:a1:74:51:42
Sep  9 22:24:26 OpenWrt dnsmasq-dhcp[1]: DHCPACK(br-lan) 192.168.1.15 9e:bf:a1:74:51:42 iPhone
Sep  9 22:24:28 OpenWrt dnsmasq-dhcp[1]: DHCPREQUEST(br-lan) 192.168.1.15 9e:bf:a1:74:51:42
Sep  9 22:24:28 OpenWrt dnsmasq-dhcp[1]: DHCPACK(br-lan) 192.168.1.15 9e:bf:a1:74:51:42 iPhone
Sep  9 22:24:29 OpenWrt dnsmasq-dhcp[1]: DHCPREQUEST(br-lan) 192.168.1.15 9e:bf:a1:74:51:42
Sep  9 22:24:29 OpenWrt dnsmasq-dhcp[1]: DHCPACK(br-lan) 192.168.1.15 9e:bf:a1:74:51:42 iPhone
Sep  9 22:24:31 OpenWrt kernel: [ 8633.408445] ath10k_warn: 110 callbacks suppressed
Sep  9 22:24:31 OpenWrt kernel: [ 8633.408454] ath10k_pci 0000:01:00.0: bss channel survey timed out
Sep  9 22:24:33 OpenWrt kernel: [ 8635.328580] ath10k_pci 0001:01:00.0: wmi command 36967 timeout, restarting hardware
Sep  9 22:24:34 OpenWrt kernel: [ 8636.448424] ath10k_pci 0000:01:00.0: wmi command 36967 timeout, restarting hardware
Sep  9 17:36:46 OpenWrt logread[949]: Logread connected to 192.168.1.9:514 via udp
Sep  9 17:36:46 OpenWrt pppd[2509]: Renamed interface ppp0 to pppoe-wan
Sep  9 17:36:46 OpenWrt pppd[2509]: Using interface pppoe-wan
Sep  9 17:36:46 OpenWrt pppd[2509]: Connect: pppoe-wan <--> eth0.6
Sep  9 17:36:46 OpenWrt pppd[2509]: Remote message: Authentication success,Welcome!
Sep  9 17:36:46 OpenWrt pppd[2509]: PAP authentication succeeded
Sep  9 17:36:46 OpenWrt pppd[2509]: peer from calling number 68:8F:84:EE:DC:F7 authorized
Sep  9 17:36:46 OpenWrt pppd[2509]: local  IP address 84.86.163.233
Sep  9 17:36:46 OpenWrt pppd[2509]: remote IP address 195.190.228.152
Sep  9 17:36:46 OpenWrt pppd[2509]: primary   DNS address 195.121.1.34
Sep  9 17:36:46 OpenWrt pppd[2509]: secondary DNS address 195.121.1.66

On the dumb AP:

Sep  9 22:40:37 ap-R7800 kernel: [10086.395713] br-lan: received packet on eth1.1 with own address as source address (addr:54:60:09:0b:ab:fe, vlan:0)
Sep  9 22:40:38 ap-R7800 kernel: [10087.384847] br-lan: received packet on eth1.1 with own address as source address (addr:54:60:09:0b:ab:fe, vlan:0)
Sep  9 22:40:38 ap-R7800 kernel: [10088.226137] br-lan: received packet on eth1.1 with own address as source address (addr:54:60:09:0b:ab:fe, vlan:0)
Sep  9 22:40:39 ap-R7800 kernel: [10088.386039] br-lan: received packet on eth1.1 with own address as source address (addr:54:60:09:0b:ab:fe, vlan:0)
Sep  9 22:40:39 ap-R7800 kernel: [10089.180670] br-lan: received packet on eth1.1 with own address as source address (addr:54:60:09:0b:ab:fe, vlan:0)
Sep  9 22:46:12 ap-R7800 hostapd: wlan0: STA c4:b3:01:d5:e0:ef IEEE 802.11: disassociated due to inactivity
Sep  9 22:46:12 ap-R7800 hostapd: wlan0: STA c4:b3:01:d5:e0:ef MLME: MLME-DISASSOCIATE.indication(c4:b3:01:d5:e0:ef, 4)
Sep  9 22:46:12 ap-R7800 hostapd: wlan0: STA c4:b3:01:d5:e0:ef MLME: MLME-DELETEKEYS.request(c4:b3:01:d5:e0:ef)
Sep  9 22:46:13 ap-R7800 hostapd: wlan0: STA c4:b3:01:d5:e0:ef IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE)
Sep  9 22:46:13 ap-R7800 hostapd: wlan0: STA c4:b3:01:d5:e0:ef MLME: MLME-DEAUTHENTICATE.indication(c4:b3:01:d5:e0:ef, 2)
Sep  9 22:46:13 ap-R7800 hostapd: wlan0: STA c4:b3:01:d5:e0:ef MLME: MLME-DELETEKEYS.request(c4:b3:01:d5:e0:ef)
Sep  9 22:46:13 ap-R7800 kernel: [10422.530516] ath10k_pci 0000:01:00.0: Invalid peer id 22 peer stats buffer
Sep  9 23:12:31 ap-R7800 hostapd: wlan0: STA e2:f1:77:4e:73:bc IEEE 802.11: handle_action - unknown action category 6 or invalid frame
Sep  9 23:17:44 ap-R7800 hostapd: wlan0: STA e2:f1:77:4e:73:bc IEEE 802.11: disassociated due to inactivity
Sep  9 23:17:44 ap-R7800 hostapd: wlan0: STA e2:f1:77:4e:73:bc MLME: MLME-DISASSOCIATE.indication(e2:f1:77:4e:73:bc, 4)
Sep  9 23:17:44 ap-R7800 hostapd: wlan0: STA e2:f1:77:4e:73:bc MLME: MLME-DELETEKEYS.request(e2:f1:77:4e:73:bc)
Sep  9 23:17:45 ap-R7800 hostapd: wlan0: STA e2:f1:77:4e:73:bc IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE)
Sep  9 23:17:45 ap-R7800 hostapd: wlan0: STA e2:f1:77:4e:73:bc MLME: MLME-DEAUTHENTICATE.indication(e2:f1:77:4e:73:bc, 2)
Sep  9 23:17:45 ap-R7800 hostapd: wlan0: STA e2:f1:77:4e:73:bc MLME: MLME-DELETEKEYS.request(e2:f1:77:4e:73:bc)
Sep  9 19:49:03 ap-R7800 logread[924]: Logread connected to 192.168.1.9:514 via udp
Sep  9 19:49:04 ap-R7800 kernel: [   51.232772] ath: EEPROM regdomain: 0x8210
Sep  9 19:49:04 ap-R7800 kernel: [   51.232780] ath: EEPROM indicates we should expect a country code
Sep  9 19:49:04 ap-R7800 kernel: [   51.232785] ath: doing EEPROM country->regdmn map search
Sep  9 19:49:04 ap-R7800 kernel: [   51.232789] ath: country maps to regdmn code: 0x37
Sep  9 19:49:04 ap-R7800 kernel: [   51.232794] ath: Country alpha2 being used: NL
Sep  9 19:49:04 ap-R7800 kernel: [   51.232798] ath: Regpair used: 0x37
Sep  9 19:49:04 ap-R7800 kernel: [   51.232804] ath: regdomain 0x8210 dynamically updated by user
Sep  9 19:49:04 ap-R7800 kernel: [   51.232831] ath: EEPROM regdomain: 0x8210
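
In the dumb AP log the reboot itself leaves no message; it is visible only because the kernel uptime stamp in brackets drops from about 10422 to about 51 seconds (and the wall-clock time jumps backwards until the clock is synced again). A rough sketch for locating such points in a saved remote syslog, assuming lines in the format shown above; find_reboots is a hypothetical name, and the input should contain the log of one host only.

```sh
#!/bin/sh
# Hypothetical helper: print the first kernel log line after each point where
# the kernel uptime stamp "[ seconds.micros]" goes backwards, i.e. a reboot.
# Reads the files given as arguments, or stdin if there are none.
find_reboots() {
    awk '
        match($0, /kernel: \[ *[0-9]+\.[0-9]+\]/) {
            s = substr($0, RSTART, RLENGTH)
            gsub(/[^0-9.]/, "", s)      # keep only the numeric uptime
            t = s + 0
            if (seen && t < prev)
                print "reboot before: " $0
            prev = t
            seen = 1
        }
    ' "$@"
}

# Usage on the syslog server (not executed here; the file name is an example):
#   grep ' ap-R7800 ' /var/log/remote.log | find_reboots
```

The lines just before each reported line are then the last messages the device managed to send before it went down.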

The C2600 is still up since Marshall Pentecost reset the clock.

@D43m0n

I checked the commit log and found that the kernel version prior to the introduction of these patches was 5.10.134.

One possibility is that some kernel changes may affect the interaction between the kernel and NSS stuff to some extent. You can try to revert the kernel commits 5.10.138/137/136/135 and build a new image with these patches.

Both ACwifidude and Tishipp said that their PPPoE patches were actually based on the work from @robimarko. Hopefully @robimarko can help us in this matter.

I gave up on NSS a long time ago, it's just not sustainable.

Thanks @robimarko. That's sad news.

I'd like to ask a question. Our recent random reboots did not seem to produce any crash dump (with ramoops/pstore enabled). In these cases, the last log entries indicate RCU stalling errors. Is it true that the Linux kernel always triggers a reboot if its rcu_sched detects a stall on a CPU?
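
One thing that can be checked on the device is how the running kernel is configured to react. In the kernel's sysctl interface, panic is the number of seconds to wait before rebooting after a panic (0 means no automatic reboot), and panic_on_rcu_stall set to 1 turns a detected RCU stall into a panic. A minimal sketch that prints these settings; the function name show_panic_knobs is hypothetical, and a knob is skipped if the kernel does not expose it.

```sh
#!/bin/sh
# Hypothetical helper: print the panic-related sysctls that exist under ROOT
# (default /proc/sys/kernel) as "name=value" lines.
show_panic_knobs() {
    root=${1:-/proc/sys/kernel}
    for knob in panic panic_on_oops panic_on_rcu_stall; do
        [ -r "$root/$knob" ] || continue   # a knob may be absent on some kernels
        echo "$knob=$(cat "$root/$knob")"
    done
}

# Usage on the router (not executed here):
#   show_panic_knobs
```

This only shows what the kernel would do by itself; a reset caused by something outside the kernel's panic path would not be explained by these values.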

After about 3 1/2 days, my R7800 got a random reboot without any dmesg or console ramoops. The EA8500 is still OK.

Since I don't use PPPoE, I will revert all 4 Tishipp commits and build a new image to see if some changes in the latest kernel commits may have played some role in these relatively frequent random reboots.

Would it be possible to run a test without the WiFi firmware and driver, skipping WiFi altogether?

NSS firmware and drivers are fully integrated in an NSS build. We cannot just toggle them freely like disabling/enabling a kernel proc entry.

For the WiFi firmware and driver, you can just use opkg to remove them and reboot the device.
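
A small sketch for finding which packages that would be, assuming the WiFi driver and firmware packages on these devices have "ath10k" in their names and that opkg list-installed prints one "name - version" line per package; the function name ath10k_packages is hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: read "name - version" lines on stdin and print the
# names of the packages whose name contains "ath10k".
ath10k_packages() {
    awk '$1 ~ /ath10k/ { print $1 }'
}

# Usage on the router (not executed here):
#   opkg list-installed | ath10k_packages
```

Reviewing the list first and then passing the names to opkg remove by hand is safer than piping it straight into a removal, since other installed packages may depend on them.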