Yes, it worked. As a matter of fact, I was able to use NSS fq_codel properly, but the device might still crash randomly with the same "Unable to handle kernel NULL pointer dereference at virtual address 00000138". Unfortunately, I did not save the dmesg dumps of last week's crashes because I thought the recent 22.03 commits might be causing the problem.
FYI, @D43m0n also encountered similar crashes with his private images based on the recent 22.03 branch.
D43m0n: can you please post your crash dumps for quarky to investigate?
I don't think it has anything to do with nssfq_codel, though. If you don't mind, can you comment out the part of your startup script that configures the shaping and codel?
Just enabling the nssifb driver will do:
modprobe nss-ifb && ip link set up nssifb
I suspect you will still see the crash regardless.
It looks like the skb received and sent by NSS somehow does not match what the kernel API in use expects.
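For anyone repeating this test, a small helper like the one below could check a saved log for the crash signature quoted in this thread. It is only a sketch: the function name `has_null_deref` and the file paths in the usage comments are made up for illustration, and since the router reboots after a crash, it is meant to be run against a dump that was saved beforehand.

```shell
#!/bin/sh
# has_null_deref FILE: hypothetical helper; succeeds if FILE contains the
# NULL pointer dereference message quoted in the crash reports.
has_null_deref() {
    grep -q "Unable to handle kernel NULL pointer dereference" "$1"
}

# Example usage (paths are assumptions, nothing here runs automatically):
#   modprobe nss-ifb && ip link set up nssifb    # enable only the driver
#   ... wait for a crash, then after the reboot check a saved dump:
#   has_null_deref /root/crashes/dmesg-ramoops-0 && echo "same signature"
```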
My crash above might just be a corner case (timing issue? etc.) when the kernel module nss-ifb was loaded and the nssifb interface was brought up. However, the crash dumps (from the last week, but I did not save them) happened when the router was already up and running overnight. All the crash dumps with the recent 22.03 + NSS fq_codel enabled had this same null pointer crash: "Unable to handle kernel NULL pointer dereference at virtual address 00000138".
After disabling NSS fq_codel, the same 22.03 image ran for 7 days straight without any problem.
modprobe nss-ifb && ip link set up nssifb
ifconfig nssifb
nssifb Link encap:Ethernet HWaddr 02:56:32:18:92:A2
inet6 addr: fe80::56:32ff:fe18:92a2/64 Scope:Link
UP BROADCAST RUNNING NOARP MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:32
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
You may be right that nssfq_codel may be causing it. It could be coincidence as well though.
I've had a quick look at the eth_type_trans() API code. I suspect the problem may be caused by NSS receiving a network packet that is empty. When an empty skb is passed to eth_type_trans(), the kernel will panic.
Do you by any chance run a wireless mesh setup in your environment? It seems mesh networks send empty network packets.
No, I don't run any wireless mesh setup. It's just a single R7800 serving as router + AP. @D43m0n should be able to provide you with more details about his similar crashes later.
Sorry, it's been quite late here and I need to go to sleep for tomorrow's work. Thanks for all your help master quarky.
Yes, it appears nssifb works. In this post I've kept track of what I've been doing to narrow down which change introduced the spontaneous crashes. I don't have a mesh setup, only a few VLANs and some dumb APs that are all wired to each other.
Overnight my NAS finished a build based on the stable version I built myself on April 30. In that build I dropped Felix's WiFi patches 330-* up to and including 339-* into the ~/openwrt/package/kernel/mac80211/patches/subsys directory. I'm guessing this applies those patches during the build, but I'm not sure. I'm no developer, though I think I can understand how things work. I didn't get any errors during the build and the finished files are a tiny bit different in size, although the version in version.buildinfo is exactly the same.
I can't flash this new build yet because the wife is still working from home, I need to wait a few hours.
I get the suspicion that there have been changes to the kernel that don't align with the NSS patches anymore.
I have two R7800s and keep both on the same build version. One is configured as the router, the other as a dumb AP. I enable NSS fq_codel on the router but not on the dumb AP. When I enable it, a crash will occur. It could be within a few hours, but it could also take 24 hours or more. I don't enable the nssifb driver in /etc/rc.local; I use KONG's sqm-scripts for that, but the result is basically the same: the nssifb driver gets enabled. So far the router has an uptime of just over 19 hours. After I flashed it with a different build yesterday, it took about 4 hours before the first crash occurred. I don't see many ramoops files anymore, and I don't know why. But the ones I did see were similar to the one from @vochong.
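Since several crash dumps in this thread were lost, here is a rough sketch of how ramoops records could be copied to persistent storage at boot. It assumes the records show up under /sys/fs/pstore (the usual pstore mount point); the helper name `save_pstore` and the /root/crashes destination are made up for illustration.

```shell
#!/bin/sh
# save_pstore SRC DST: hypothetical helper that copies any crash records from
# SRC (assumed to be /sys/fs/pstore) into a timestamped subdirectory of DST.
# Prints the directory it created, or nothing if SRC holds no records.
save_pstore() {
    src=$1
    dst=$2
    # Nothing to do if the source directory is missing or empty.
    [ -d "$src" ] || return 0
    [ -n "$(ls "$src" 2>/dev/null)" ] || return 0
    out="$dst/crash-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$out" || return 1
    cp "$src"/* "$out"/ || return 1
    echo "$out"
}

# Example usage, e.g. from /etc/rc.local (destination is an assumption):
#   save_pstore /sys/fs/pstore /root/crashes
```

Copying rather than moving leaves the original records in place, so nothing is lost if the copy fails halfway.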
-- EDIT --
I've finally flashed the new build, based on 22.03-RC1 and presumably including Felix's WiFi patches, to both R7800s. One is running as a dumb AP and doesn't have NSS fq_codel enabled. The other is the router and has NSS fq_codel enabled.
For reassurance, how can I verify that Felix's recent WiFi patches, added the way I tried, actually got included in this build? How can I verify they actually work?
Today, when using AnyConnect (SSLVPN) through the router and VNC through a websocket tunnel (also traversing the router), my router kept crashing every 15 minutes or so. It was running a private image I built yesterday based on all the latest commits from the 22.03 branch, and with NSS fq_codel disabled (nss_ifb module was NOT loaded and no tc was used). The crashes produced no ramoops dump at all.
After 3 such consecutive crashes, I had to quickly load ACwifidude's last 22.03 image onto the router in order to continue my work and it worked without any crash thereafter.
Something's definitely wrong with recent commits from 22.03. They have rendered NSS builds (with or without NSS fq_codel enabled) very much crash-prone.
Run "git show ec9f82fa18c7c8deb4875152d7907855d186f4c6" in your OpenWrt build root to see whether your build has the latest "mac80211: fix AQL issue with multicast traffic" fix. If the commit is not shown, your build does not have that fix.
Does your router crash if you do not use AnyConnect? If it doesn't, then we can probably try to study what's special about the AnyConnect traffic and maybe work around it. AFAIK, SSLVPN uses HTTPS over TCP, so there should not be any difference compared to a browser accessing a website secured with HTTPS.
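A slightly stricter variant of this check, as a sketch: "git show" can succeed for any commit that exists in the local repository, even one that is not part of the checked-out tree, whereas "git merge-base --is-ancestor" only succeeds when the commit is in the history of HEAD. The helper name `commit_in_build` is made up for illustration; the ~/openwrt path is the build root mentioned earlier in the thread.

```shell
#!/bin/sh
# commit_in_build REPO COMMIT: hypothetical helper; succeeds only if COMMIT is
# an ancestor of (or the same as) the currently checked-out HEAD in REPO.
commit_in_build() {
    git -C "$1" merge-base --is-ancestor "$2" HEAD 2>/dev/null
}

# Example usage:
#   if commit_in_build ~/openwrt ec9f82fa18c7c8deb4875152d7907855d186f4c6; then
#       echo "fix is in this tree"
#   else
#       echo "fix is not in this tree"
#   fi
#
# Patches dropped in by hand will not show up as a commit, so for those it is
# probably simpler to check that the files are in place:
#   ls ~/openwrt/package/kernel/mac80211/patches/subsys/33[0-9]-*
```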
Looks like there are changes to the kernel's TCP stack for 5.10 compared to 5.4.
Yes, I use AnyConnect every weekday and I don't think it has anything to do with the crash. It's just HTTPS/TCP traffic like regular web traffic.
ACwifidude's 20220709-Stable2203NSS image (no NSS fq_codel) has just crashed on me after about 10 hours while I was just surfing the web. There was no ramoops dump either. These unexpected and random crashes on the R7800 make me think of switching to an RPi4 and a dedicated AP, plus I can run lots of stuff on the RPi4 as well.
Is there any trick to make an NSS-enabled image behave like a non-NSS router? I was able to unload the qca_nss_qdisc and qca_nss_pppoe modules, but cannot remove qca_nss_drv and qca_nss_gmac on the fly.
I just want to see if the very same image can run for long periods of time without NSS acceleration.
If there's no such trick, I will build a plain 22.03 image without NSS and see if the newer kernel 5.10.x and CPU governor gremlin may have anything to do with the crash-prone behavior on R7800.
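This does not answer the question of how to turn NSS off, but as a sketch, the snippet below lists which qca_nss modules are still loaded along with their reference counts and the modules holding them, which may hint at why a given module refuses to unload. It relies on the /proc/modules layout (name, size, reference count, users); the helper name `loaded_nss_modules` is made up for illustration.

```shell
#!/bin/sh
# loaded_nss_modules [FILE]: hypothetical helper that prints, for every loaded
# module whose name starts with "qca_nss": name, reference count, and users.
# FILE defaults to /proc/modules and can be overridden for testing.
loaded_nss_modules() {
    awk '$1 ~ /^qca_nss/ { print $1, $3, $4 }' "${1:-/proc/modules}"
}

# Example usage:
#   loaded_nss_modules
# A non-zero reference count means something still holds the module, so
# rmmod on it would be expected to fail until the users are gone.
```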
Hello friend, I'm testing your latest 22.03 version on the R7800, and it seems PPPoE is not being accelerated by NSS. I get 300/300 on the LAN and 300/450 on WiFi, whereas with 22.02 I get 300/900 Mb. What could be happening?
Rebased master. I had to delete the "config all kmods" line in the diffconfig due to a random package that wouldn't compile (not in the build, just in master's repo).