Ipq806x NSS build (Netgear R7800 / TP-Link C2600 / Linksys EA8500)

Yes, it worked. In fact, I was able to use NSS fq_codel properly, but the device still crashes at random with the same "Unable to handle kernel NULL pointer dereference at virtual address 00000138". Unfortunately, I did not save the dmesg dumps of last week's crashes because I assumed the recent 22.03 commits were causing the problem.

FYI, @D43m0n also encountered similar crashes with his private images based on the recent 22.03 branch.
D43m0n: can you please post your crash dumps for quarky to investigate?

D432m0n's crash dumps are here:

<1>[   53.513375] 8<--- cut here ---                                                                                  
<6>[   53.513835] Nexthop successfully set for [eth0] to [nssifb]                                      
<1>[   53.516736] Unable to handle kernel NULL pointer dereference at virtual address 00000138                                                                                                                                              
<1>[   53.522252] pgd = 66bda50b                                                                                                                                                                                                            
<1>[   53.530560] [00000138] *pgd=476ee835, *pte=00000000, *ppte=00000000                                     
<0>[   53.533030] Internal error: Oops: 17 [#1] SMP ARM                                                               
<4>[   53.539188] Modules linked in: nss_ifb ecm ath10k_pci ath10k_core ath wireguard nft_fib_inet nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet mac80211 libchacha20poly1305 ipt_REJECT curve25519_neon cfg80211 xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_quota xt_pkttype xt_physdev xt_owner xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_ecn xt_dscp xt_conntrack xt_comment xt_cgroup xt_addrtype xt_TCPMSS xt_REDIRECT xt_MASQUERADE xt_LOG xt_HL xt_DSCP xt_CT xt_CLASSIFY sch_cake ppp_async poly1305_arm nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_objref nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_counter nft_chain_nat nf_tables nf_reject_ipv4 nf_log_ipv6 nf_log_ipv4 nf_log_common nf_flow_table nf_conntrack_netlink libcurve25519_generic iptable_nat iptable_mangle iptable_filter ipt_ECN ip_tables crc_ccitt compat chacha_neon fuse sch_tbf sch_ingress sch_htb
<4>[   53.539570]  sch_hfsc em_u32 cls_u32 cls_tcindex cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred act_gact qca_nss_qdisc qca_nss_pppoe pppoe pppox ppp_generic slhc ledtrig_usbport cryptodev xt_set ip_set_list_set ip_set_hash_netportnet ip_set_hash_netport ip_set_hash_netnet ip_set_hash_netiface ip_set_hash_net ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set_hash_ipmark ip_set_hash_ip ip_set_bitmap_port ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set nfnetlink ip6table_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_NPT ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 nfsv4 nfsv3 nfs nfs_ssc msdos bonding ifb ip6_udp_tunnel udp_tunnel sit qca_nss_drv qca_nss_gmac oid_registry tunnel4 ip_tunnel xfrm_user xfrm_ipcomp af_key xfrm_algo vfat fat lockd sunrpc grace hfsplus hfs cdrom cifs dns_resolver nls_utf8 nls_iso8859_15 nls_iso8859_1 nls_cp850 nls_cp437 nls_cp1250 wp512 twofish_generic
<4>[   53.609422]  twofish_common tgr192 tea serpent_generic khazad cast6_generic cast5_generic cast_common camellia_generic blowfish_generic blowfish_common anubis xts crypto_user algif_skcipher algif_rng algif_hash algif_aead af_alg sha512_generic sha1_generic seqiv md5 md4 kpp echainiv ecb des_generic libdes cmac authenc arc4 uas usb_storage leds_gpio xhci_plat_hcd xhci_pci xhci_hcd dwc3 dwc3_qcom ohci_platform ohci_hcd phy_qcom_ipq806x_usb ahci fsl_mph_dr_of ehci_platform ehci_fsl sd_mod ahci_platform libahci_platform libahci libata scsi_mod ehci_hcd ramoops reed_solomon pstore gpio_button_hotplug xfs libcrc32c f2fs ext4 mbcache jbd2 exfat dm_mirror dm_region_hash dm_log dm_crypt dm_mod dax crc32c_generic crc32_generic cbc encrypted_keys trusted tpm
<4>[   53.762019] CPU: 0 PID: 2924 Comm: modprobe Not tainted 5.10.127 #0                    
<4>[   53.784246] Hardware name: Generic DT based system                                                              
<4>[   53.790324] PC is at eth_type_trans+0x20/0x20c                                                                  
<4>[   53.795177] LR is at nss_ifb_data_cb+0x20/0x54 [nss_ifb]                                 
<4>[   53.799604] pc : [<c087720c>]    lr : [<bfaf52e8>]    psr: 80000113

I don't think it has anything to do with nssfq_codel, though. If you don't mind, can you comment out the part of your startup script that configures the shaping and codel?

Just enabling the nssifb driver will do:

modprobe nss-ifb && ip link set up nssifb

I suspect you will still see the crash regardless.

It looks like the skb passed between NSS and the kernel somehow does not match what the kernel API expects.
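
For comparing dumps from different crashes, it can help to reduce each one to three fields: the faulting address, the PC symbol and the LR symbol. A minimal sketch, assuming the log uses the same ARM oops wording as the dump above; oops_signature is a made-up helper name, not an existing tool:

```shell
#!/bin/sh
# Print a one-line signature (fault address, PC symbol, LR symbol) for an
# ARM oops log read from stdin. Hypothetical helper, not part of OpenWrt.
oops_signature() {
    awk '
        # The faulting address is the last word on the "Unable to handle" line.
        /NULL pointer dereference at virtual address/ { addr = $NF }
        # For PC/LR, take the word right after "at" (symbol+offset/size).
        /PC is at/ { for (i = 1; i < NF; i++) if ($i == "at") pc = $(i + 1) }
        /LR is at/ { for (i = 1; i < NF; i++) if ($i == "at") lr = $(i + 1) }
        END { printf "addr=%s pc=%s lr=%s\n", addr, pc, lr }
    '
}
```

Fed the dump above, this should print addr=00000138 pc=eth_type_trans+0x20/0x20c lr=nss_ifb_data_cb+0x20/0x54. If a log contains more than one oops, only the last one is reported.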


My crash above might just be a corner case (a timing issue, perhaps) that occurs when the nss-ifb kernel module is loaded and the nssifb interface is brought up. However, last week's crashes (whose dumps I did not save) happened when the router had already been up and running overnight. All the crash dumps with the recent 22.03 + NSS fq_codel enabled had this same NULL pointer crash: "Unable to handle kernel NULL pointer dereference at virtual address 00000138".
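
Since the earlier dumps were lost, one option is to copy whatever ramoops left behind to persistent storage on every boot. A rough sketch, assuming the records show up as plain files under /sys/fs/pstore and that /root survives reboots; both paths and the save_ramoops name are assumptions, so adjust them for your image:

```shell
#!/bin/sh
# Copy any pstore/ramoops records into a timestamped directory so they are
# not lost after the next crash. Prints the number of files copied.
# Default paths are assumptions, not something the thread confirms.
save_ramoops() {
    src="${1:-/sys/fs/pstore}"
    dst="${2:-/root/ramoops}"
    stamp="$(date +%Y%m%d-%H%M%S)"
    count=0
    for f in "$src"/*; do
        # Skips directories and the unexpanded glob when src is empty.
        [ -f "$f" ] || continue
        mkdir -p "$dst/$stamp" || return 1
        cp "$f" "$dst/$stamp/" || return 1
        count=$((count + 1))
    done
    echo "$count"
}
```

Calling save_ramoops early in /etc/rc.local would be one place to hook it in. The sketch only copies; it does not delete the originals.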

After disabling NSS fq_codel, the same 22.03 image ran for 7 days straight without any problem.

modprobe nss-ifb && ip link set up nssifb
ifconfig nssifb
nssifb    Link encap:Ethernet  HWaddr 02:56:32:18:92:A2  
          inet6 addr: fe80::56:32ff:fe18:92a2/64 Scope:Link
          UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:32 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

You may be right that nssfq_codel is causing it. It could be a coincidence as well, though.

I've had a quick look at the eth_type_trans() code. I suspect the problem may be caused by NSS receiving an empty network packet. When an empty skb is passed into eth_type_trans(), the kernel will panic.

Do you by any chance run a wireless mesh setup in your environment? It seems mesh networks send empty network packets.

No, I don't run any wireless mesh setup. It's just a single R7800 serving as router + AP. @D432m0n should be able to provide you with more details about his similar crashes later.

Sorry, it's quite late here and I need to get some sleep before work tomorrow. Thanks for all your help, master quarky.

Yes, it appears nssifb works. In this post I've kept track of what I've been doing to narrow down which change introduced the spontaneous crashes. I don't have a mesh setup, only a few VLANs and some dumb APs that are all wired to each other.

Overnight my NAS finished a build based on the stable version I built myself on April 30. In that build I dropped Felix's WiFi patches 330-* up to and including 339-* into the ~/openwrt/package/kernel/mac80211/patches/subsys directory. I'm guessing that the build applies those patches, but I'm not sure :sweat_smile: I'm no developer, but I think I can understand how things work. I didn't get any errors during the build and the finished files are a tiny bit different in size, although the version in version.buildinfo is exactly the same.
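
A quick sanity check is to list which patch files in that directory fall in the expected number range. This is only a sketch: patches_in_range is a hypothetical helper, and it confirms that the files are where the build looks for them, not that they changed the resulting binary. As far as I know, a patch that fails to apply aborts the build, so a clean build is a reasonable hint that they went in.

```shell
#!/bin/sh
# List patch files in a directory whose numeric prefix lies in [lo, hi].
# Hypothetical helper for checking what was dropped into patches/subsys.
patches_in_range() {
    dir="$1"; lo="$2"; hi="$3"
    for f in "$dir"/[0-9]*.patch; do
        [ -f "$f" ] || continue
        name="${f##*/}"      # strip the directory part
        num="${name%%-*}"    # numeric prefix before the first dash
        # Ignore names whose prefix is not purely numeric.
        case "$num" in ''|*[!0-9]*) continue ;; esac
        if [ "$num" -ge "$lo" ] && [ "$num" -le "$hi" ]; then
            echo "$name"
        fi
    done
}
```

For example, patches_in_range ~/openwrt/package/kernel/mac80211/patches/subsys 330 339 should print one line per patch that was copied in.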

I can't flash this new build yet because the wife is still working from home, I need to wait a few hours.

I get the suspicion that there have been changes to the kernel that don't align with the NSS patches anymore.

@D43m0n when you had the spontaneous reboot without SQM enabled, did you by any chance enable the nssifb driver in your startup script?

Does the reboot result in a ramoops log being created? If yes, is it similar to the one that @vochong posted a few posts earlier?

I have two R7800s. I keep both on the same build versions. One is configured as the router, the other as a dumb AP. On the dumb AP I don't enable NSS fq_codel; on the router I do. When I enable it, a crash will occur. It could be within a few hours, but it could also take 24 hours or more. I don't enable the nssifb driver in /etc/rc.local; I use KONG's sqm-scripts for that, but the result is basically the same: the nssifb driver gets enabled through them. So far the router has an uptime of just a little more than 19 hours. After I flashed it with a different build yesterday, it took about 4 hours before the first crash occurred. I don't see many ramoops files anymore, and I don't know why. But the ones I did see were similar to the one from @vochong.

-- EDIT --
I've finally flashed the new build, based on 22.03-RC1 and presumably including Felix's WiFi patches, to both R7800s. One is running as a dumb AP and doesn't have NSS fq_codel enabled. The other is the router and has NSS fq_codel enabled.

For reassurance, how can I verify that Felix's recent WiFi patches, added the way I tried to add them, actually got included in this build? How can I verify they actually work?


Guys, since I disabled WireGuard 20 days ago I haven't had any more crashes.
I think the problem is there.

I don’t use wireguard at all.

Today, when using AnyConnect (SSLVPN) through the router and VNC through a websocket tunnel (also traversing the router), my router kept crashing every 15 minutes or so. It was running a private image I built yesterday based on all the latest commits from the 22.03 branch, and with NSS fq_codel disabled (nss_ifb module was NOT loaded and no tc was used). The crashes produced no ramoops dump at all.
After 3 such consecutive crashes, I had to quickly load ACwifidude's last 22.03 image onto the router in order to continue my work and it worked without any crash thereafter.

Something's definitely wrong with recent commits from 22.03. They have rendered NSS builds (with or without NSS fq_codel enabled) very much crash-prone.

Run "git show ec9f82fa18c7c8deb4875152d7907855d186f4c6" in your OpenWrt build root to see if your build has the latest "mac80211: fix AQL issue with multicast traffic" fix. If it does not show it, then your build won't have that fix.
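
One caveat with git show: it also succeeds for a commit that was fetched but is not part of the branch you actually built from. A slightly stricter check is to ask git whether the commit is an ancestor of HEAD. A minimal sketch; has_commit is a made-up helper name, while git -C and git merge-base --is-ancestor are standard git:

```shell
#!/bin/sh
# Report whether a commit is part of the history of the checked-out branch.
# Prints "present" or "missing". An unknown hash also counts as missing.
has_commit() {
    repo="$1"; commit="$2"
    if git -C "$repo" merge-base --is-ancestor "$commit" HEAD 2>/dev/null; then
        echo "present"
    else
        echo "missing"
    fi
}
```

For example: has_commit ~/openwrt ec9f82fa18c7c8deb4875152d7907855d186f4c6. If the fix reached your branch as a cherry-pick it will carry a different hash and show up as missing here; in that case searching by subject with git log --oneline --grep='mac80211: fix AQL issue with multicast traffic' is the fallback.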


Does your router crash if you do not use AnyConnect? If it doesn't, then we can probably try to study what's special about the AnyConnect traffic and maybe work around it. AFAIK, SSLVPN uses HTTPS over TCP, so it should be no different from a browser accessing a website secured with HTTPS.

Looks like there are changes to the kernel's TCP stack for 5.10 compared to 5.4.

Yes, I use AnyConnect every weekday and I don't think it has anything to do with the crash. It's just HTTPS/TCP traffic like regular web traffic.

ACwifidude's 20220709-Stable2203NSS image (no NSS fq_codel) has just crashed on me after about 10 hours while I was just surfing the web. There was no ramoops dump either. These unexpected and random crashes on the R7800 make me think of switching to an RPi4 and a dedicated AP, plus I can run lots of stuff on the RPi4 as well.

Is there any trick to make an NSS-enabled image behave like a non-NSS router? I was able to unload the qca_nss_qdisc and qca_nss_pppoe modules, but cannot remove qca_nss_drv and qca_nss_gmac on the fly.

I just want to see if the very same image can run for long periods of time without NSS acceleration.

If there's no such trick, I will build a plain 22.03 image without NSS and see whether the newer 5.10.x kernel or the CPU governor gremlin has anything to do with the crash-prone behavior on the R7800.
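
To see why a module refuses to unload, /proc/modules lists a reference count and the dependent modules for each entry, and rmmod normally refuses while that count is non-zero. A small sketch that filters that format for the NSS modules; nss_module_refs is a hypothetical helper, and whether qca_nss_drv and qca_nss_gmac can be removed safely even at zero references is something I don't know:

```shell
#!/bin/sh
# Read /proc/modules-style text on stdin and print each qca_nss* module with
# its reference count (field 3) and the modules using it (field 4).
nss_module_refs() {
    awk '$1 ~ /^qca_nss/ { printf "%s refs=%s used_by=%s\n", $1, $3, $4 }'
}
```

Usage on the router would be nss_module_refs < /proc/modules. A used_by value of "-" means no other module depends on it, although a non-zero refs can still come from an interface that is up.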


Hello friend, I'm testing your latest 22.03 version on the R7800, and it seems as if PPPoE is not accelerated with NSS. I get 300/300 on the LAN and 300/450 on WiFi, whereas with 22.02 I get 300/900 Mb. What could be happening?

Please don't kill me :slight_smile:
Should I update?
After all the crash posts, here's one that isn't crashing :slight_smile:

Too good to be true :slight_smile: j/k

Don't upgrade unless you have some annoying problem that you hope the new image will get rid of.

Rebased master. I had to delete the “config all kmods” line in the diffconfig due to a random package that wouldn’t compile (not in the build, just in master’s repo).
