IPQ806x NSS Drivers

Hi people.
Which branch is the most stable?
I built OpenWrt with branch lede-17.01-ipq806x-nss, but my router reboots every few hours.
Has it been corrected?
@quarky - which branch can you recommend for R7800?

@sqter are you using the latest commit from the 'lede-17.01-ipq806x-nss' branch? Before the latest commit, I did not include the NSS core clock frequency patch to force the NSS cores to run at 800MHz. Without that, your R7800 will be unstable and reboot randomly and you have to manually apply the NSS clock patch.

My R7800 has been running for more than a month without any reboots, other than me rebooting it manually. My R7800's longest uptime to date is over 35 days.

Of course, it could be something specific to your environment that's causing the reboot. It would be helpful if you can provide logs from your router.
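If the router reboots before the log can be read, one option is to stream the system log to persistent storage. A minimal sketch, assuming a USB stick is mounted at /mnt/usb (a hypothetical path) and that OpenWrt's logread is available; log_file_for is just an illustrative helper name:

```shell
#!/bin/sh
# Sketch: follow the system log and append it to persistent storage so the
# last messages before an unexpected reboot are still on disk afterwards.
# LOG_DIR is a hypothetical mount point (e.g. a USB stick); adjust it.
LOG_DIR="${LOG_DIR:-/mnt/usb}"

# Build the log file name from a directory and a date stamp.
# $1: directory, $2: stamp such as 20200101-120000
log_file_for() {
    printf '%s/syslog-%s.txt\n' "$1" "$2"
}

# Only start following when logread is available and the directory exists.
if command -v logread >/dev/null 2>&1 && [ -d "$LOG_DIR" ]; then
    logread -f >> "$(log_file_for "$LOG_DIR" "$(date +%Y%m%d-%H%M%S)")" &
fi
```

A hard crash can still lose the last few lines, so treat this as best effort.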

Yes - I built with the latest commit and used your configuration 'config.seed'.
Unfortunately, no logs were recorded. The router reboots randomly and does not save any errors.
I think the problem may be the Tvheadend server (trunk), which uses RTSP or UDP as the source. I noticed that multicast streams are not accelerated and the CPU is loaded just as it is with the QCA-NSS driver disabled. Maybe the buffer is overflowing? Do you have any idea?
Maybe I'll try the older branch 'lede-17.01-quarkysg-qca-nss'?
Is there anything that can be changed in the sources? Maybe I will disable all qca modules and leave only qca-nss-drv and qca-nss-ecm enabled?

probably nothing... but oem spat this at me... may be of use
[11845.187504] L2 master port decode error
[11845.191315] ------------[ cut here ]------------
[11845.195939] WARNING: at arch/arm/mach-msm/cache_erp.c:449 msm_l2_erp_irq+0x244/0x290()
[11845.203811] L2 master port error detected
[11845.207654] Modules linked in: xt_TPROXY xt_mac loop usblp synoiprecord(PO) qca_nss_ipsec(O) qca_nss_ipsecmgr(O) qca_nss_cfi_o]
[11845.280319] [<c001539c>] (unwind_backtrace+0x0/0x128) from [<c005e8d4>] (warn_slowpath_common+0x54/0x64)
[11845.289753] [<c005e8d4>] (warn_slowpath_common+0x54/0x64) from [<c005e914>] (warn_slowpath_fmt+0x30/0x40)
[11845.299313] [<c005e914>] (warn_slowpath_fmt+0x30/0x40) from [<c0059820>] (msm_l2_erp_irq+0x244/0x290)
[11845.308528] [<c0059820>] (msm_l2_erp_irq+0x244/0x290) from [<c00a836c>] (handle_irq_event_percpu+0x34/0x16c)
[11845.318307] [<c00a836c>] (handle_irq_event_percpu+0x34/0x16c) from [<c00a84f4>] (handle_irq_event+0x50/0x70)
[11845.328116] [<c00a84f4>] (handle_irq_event+0x50/0x70) from [<c00ab388>] (handle_fasteoi_irq+0x128/0x188)
[11845.337582] [<c00ab388>] (handle_fasteoi_irq+0x128/0x188) from [<c00a7bfc>] (generic_handle_irq+0x30/0x40)
[11845.347235] [<c00a7bfc>] (generic_handle_irq+0x30/0x40) from [<c000f5a0>] (handle_IRQ+0x94/0xac)
[11845.355982] [<c000f5a0>] (handle_IRQ+0x94/0xac) from [<c00085f8>] (gic_handle_irq+0x8c/0xf4)
[11845.364417] [<c00085f8>] (gic_handle_irq+0x8c/0xf4) from [<c000e800>] (__irq_svc+0x40/0x54)
[11845.372758] Exception stack(0xc0905d70 to 0xc0905db8)
[11845.377788] 5d60:                                     dd8b0000 dd8b0100 0000000c 00000001
[11845.385942] 5d80: fb703dc0 dc326780 bf18d6c0 00000bc0 dc326780 0000005e 00000002 bf18d688
[11845.394095] 5da0: 00000000 c0905db8 c001a7cc bf156418 20000113 ffffffff
[11845.400750] [<c000e800>] (__irq_svc+0x40/0x54) from [<bf156418>] (nss_core_handle_napi+0x84c/0x1e18 [qca_nss_drv])
[11845.411059] [<bf156418>] (nss_core_handle_napi+0x84c/0x1e18 [qca_nss_drv]) from [<c03d9970>] (net_rx_action+0x90/0x168)
[11845.421806] [<c03d9970>] (net_rx_action+0x90/0x168) from [<c0064ac4>] (__do_softirq+0x104/0x1c4)
[11845.430553] [<c0064ac4>] (__do_softirq+0x104/0x1c4) from [<c0064f44>] (irq_exit+0x50/0x58)
[11845.438800] [<c0064f44>] (irq_exit+0x50/0x58) from [<c000f5a8>] (handle_IRQ+0x9c/0xac)
[11845.446704] [<c000f5a8>] (handle_IRQ+0x9c/0xac) from [<c00085f8>] (gic_handle_irq+0x8c/0xf4)
[11845.455139] [<c00085f8>] (gic_handle_irq+0x8c/0xf4) from [<c000e800>] (__irq_svc+0x40/0x54)
[11845.463449] Exception stack(0xc0905f18 to 0xc0905f60)
[11845.468478] 5f00:                                                       00000000 00000002
[11845.476663] 5f20: 00000000 00000000 00000000 00000000 c0ef1340 00000003 c0ae37c8 c09637b8
[11845.484817] 5f40: c0605058 00000000 00000000 c0905f60 00000001 c0049a6c 60000013 ffffffff
[11845.492971] [<c000e800>] (__irq_svc+0x40/0x54) from [<c0049a6c>] (msm_cpuidle_enter+0x68/0x70)
[11845.501562] [<c0049a6c>] (msm_cpuidle_enter+0x68/0x70) from [<c0385b04>] (cpuidle_idle_call+0xb8/0x124)
[11845.510934] [<c0385b04>] (cpuidle_idle_call+0xb8/0x124) from [<c000fb4c>] (cpu_idle+0x70/0xfc)
[11845.519556] [<c000fb4c>] (cpu_idle+0x70/0xfc) from [<c04ee76c>] (rest_init+0x90/0x98)
[11845.527366] [<c04ee76c>] (rest_init+0x90/0x98) from [<c0800960>] (start_kernel+0x434/0x440)
[11845.535676] ---[ end trace 38229f8de8450403 ]---
[11845.540331] L2 Error detected!
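
To check whether a saved log contains the same L2 error, something like the following could be used. It is only a sketch: the function names are made up for illustration, and the patterns are taken from the message text in the trace above.

```shell
#!/bin/sh
# Count L2 cache error lines in a kernel log read from stdin.
# Matches "L2 master port ..." and "L2 Error detected!" as seen in the trace.
count_l2_error_lines() {
    grep -c -E 'L2 (master port|Error)'
}

# List modules named in backtrace frames, e.g. "[qca_nss_drv])".
# The pattern requires the closing ")" so timestamps and addresses are skipped.
modules_in_trace() {
    grep -o -E '\[[a-z0-9_]+\]\)' | sed 's/^.//; s/..$//' | sort -u
}
```

For example, `dmesg | count_l2_error_lines` and `dmesg | modules_in_trace` on the affected router.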

You did not add or remove any kernel module(s) before building the firmware image, right?

The NSS firmware should accelerate all TCP and UDP traffic that passes thru the ethernet interfaces. For WiFi interfaces, 'acceleration' is limited; it's not as efficient as stock firmware, which I think bypasses the standard Linux WiFi network stack. When running LEDE/OpenWRT, your R7800's load will still go up when WiFi traffic is high, even with NSS core acceleration.

Not too sure about multicast traffic tho. My TV's set-top-box is also connected to my R7800 for IPTV services, but I configured it as a pass-thru, i.e. the R7800 Linux stack does not see the IPTV traffic at all. The physical switch just switches the VLAN traffic transparently, and the Linux kernel and NSS cores see no load from the IPTV traffic.

My R7800 has 4 WireGuard VPN connections to multiple sites using UDP. It transfers gigabytes of data daily, so I think UDP should be fine. I've also tried an OpenVPN TAP interface using UDP to another site, i.e. bridging both sides at L2, and multicast traffic seems fine. I can get DHCP service from the remote LAN. I only tested it for about a day or so, transferring about 10GB of data via the L2 WAN bridge.

When I was fiddling with the NSS drivers, the driver that handles multicast traffic seemed unstable. My 'config.seed' did not select the qca-mcs driver, although it's committed into my repo. As multicast is not my priority, I did not spend too much time trying to figure out why it's not working correctly. If you enabled that driver, that could be the problem.

What services are running in your R7800?

One caveat of the NSS cores' acceleration behaviour, which is similar to most 'acceleration' implementations, like SFE and the one built into the Linux kernel, is that it only accelerates traffic originating from outside the Linux kernel of your router. They basically provide a shortcut path past the Linux firewall and routing stack, which is where the acceleration comes from. Of course the NSS cores can also accelerate crypto functions, but I have yet to make this feature useful in my builds.

If you don't mind trying out my firmware, you can try installing it and see if there's any difference in stability.

Looks like an issue with the L2 cache. My repo's code basically uses the stock Linux L2 cache handling code, with whatever patches LEDE/OpenWRT applied on top of it. I did not port over much of the Krait CPU handling code from QSDK. Having said that, it could be one cause of instability on the R7800 even with LEDE/OpenWRT.

Thank you for the comprehensive reply.
I tried not to change your configuration too much. I added a few of my packages - nothing more.
I use several programs on it which depend on the kernel so I won't add them to your compilation.
I only have one router so any tests must wait.

I tried the source of codeaurora and Netgear GPL.
Unfortunately, the build did not succeed even once.

For now, I have to give up the NSS. Stability is most important to me now.

Glad to see you're still working on it :smiley:
I tried hard, but with my minimal knowledge of compiling, the router architecture, and kernel patches,
I can't get things done properly for the NBG6817.
What I mean is that I don't know what to change without bricking anything.
Referring to the changes that have to be made by @slh.

@quarky I'm currently testing your latest ipq806x nss branch and for the most part things are working great. However, I'm having issues with WLAN where clients on the same AP (5G or 2G) are not able to communicate/ping between each other. I AM however able to ping between 5G and 2G and any clients that are hardwired.

I've used your custom image from Aug 15 to see if I messed something up, but it exhibits the same problem.

I've confirmed that the "isolate" option is NOT set, and even set it manually to "0" using uci.

@qosmio do you see the same result with the plain OpenWRT or LEDE firmware, i.e. without the NSS cores enabled? I'm able to ping my iMac from my iPad, both on the wireless network of my R7800, running my own firmware with the NSS drivers running. I'm also able to remote desktop using VNC from my iPad to my iMac. Pinging my iPad from my iMac works too, although the iPad needs to be awake.

My guess would be your wireless clients are dropping ICMP packets.

You can disable NSS processing for wireless traffic in my firmware by executing the following from an SSH shell:

echo 0 > /sys/module/mac80211/parameters/is_nss_enable

Try pinging after disabling and see if the issue is due to the NSS stack. I would think it's unlikely to be NSS related tho.

Disabling NSS for mac80211 will result in a drop in wireless thruput tho.
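
For switching back and forth while testing, a small wrapper may help. This is a sketch only: set_nss_wifi is a made-up helper name, the path is the one quoted above, and reading the value back assumes the parameter is exported as readable.

```shell
#!/bin/sh
# Path quoted in the post above; it only exists on builds that carry the
# NSS-patched mac80211.
NSS_PARAM="${NSS_PARAM:-/sys/module/mac80211/parameters/is_nss_enable}"

# set_nss_wifi VALUE [PATH]: write 0 or 1, then print what the file now holds.
set_nss_wifi() {
    value="$1"
    path="${2:-$NSS_PARAM}"
    case "$value" in
        0|1) ;;
        *) echo "usage: set_nss_wifi 0|1 [path]" >&2; return 2 ;;
    esac
    if [ ! -w "$path" ]; then
        echo "cannot write $path (is the NSS mac80211 patch present?)" >&2
        return 1
    fi
    echo "$value" > "$path" && cat "$path"
}
```

Run `set_nss_wifi 0` before the ping test and `set_nss_wifi 1` afterwards to restore NSS processing.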

@quarky thanks for the reply! So, I tried disabling NSS with the setting you mentioned (using your factory image) and I'm getting some pretty inconsistent results. I was still not able to communicate between my clients (Rpi 4, MacBook, iPhone and iPad) when on the same WLAN.

I then tried reconnecting my clients after using your setting, and after about 2-3 mins it looked like I was able to communicate between my MacBook and Pi. I tried enabling the NSS cores and it seemed to work. But in subsequent reconnects I could no longer ping or communicate in any form (SSH/TELNET). Only 2G<->5G or LAN<->5G/2G work. This is for both your image and my own I customized from your latest branch.

I then tried both @hnyman and Kong's builds and both worked in allowing my WLAN clients to communicate.

It's a shame really, I pretty much had everything working great with your branch with my customization.

@qosmio Now I see what you mean. I see the same issue if both wireless clients are in the same band. It's as if the clients are all isolated from each other. Previously I tried with two clients on different wireless bands, i.e. one of 2G and the other on 5G.

Will look into it when I have some spare time.

Thanks.

could be a limitation of the nss core?

Thanks! Honestly, I'm a bit relieved you're having the same issue. I thought I may have done something wonky. I've switched back to my R7000 XWRT build in the meantime. I tried troubleshooting with this post (Clients in same WLAN can't reach each other).

No luck, not sure if it's even related. Figured it was worth a shot.

From what I can see, it looks like the issue lies with the old ath10k or mac80211 driver. Newer drivers likely resolved this issue. It's unlikely to be caused by the NSS cores, or all communications would have issues. I am able to ping clients between the 2G and 5G WiFi networks. The problem seems to exist only when clients are on the same WiFi frequency band. I suspect the mac80211 driver is enforcing client isolation even when it’s not enabled.

Both the mac80211 and ath10k drivers would be kernel-specific, correct? I tried upgrading the kernel all the way up to 4.4.206 in the hope that there were some fixes to mac80211 and ath10k that mitigated this. I had to tweak the patches a little bit but it compiled. However, upon activating the wireless I had a kernel panic and it rebooted :neutral_face:

The highest version I was able to get was 4.4.196 that didn't kernel panic, but that didn't seem to fix my same WLAN client isolation issue.

Lastly, I tried compiling the 17.04 branch without any of the NSS patches and it didn't exhibit the issue. So, something seems to be related to the NSS patches.

OpenWrt uses an out-of-tree version of mac80211, so upgrading the kernel has little effect

So, I got it working. It's not an ideal solution but it works for what I need, and doesn't compromise on all the work @quarky's done getting NSS acceleration working.

For me, the solution was to ensure ap_isolate=0 in the WiFi configs hostapd generates. I had to edit /lib/functions/hostapd.sh and hardcode the value since setting "option isolate 0" has no effect in the generated config. I believe the reason for this was to have everything forwarded to the bridge, and have the isolation controlled there... but, since that doesn't work to begin with...

Hope this helps someone. Again, not ideal as it means you're removing the isolation feature for every WLAN. I suppose I could update the hostapd.sh script to only apply it to non-guest WLANs, but that's not really something I wanna muck around with more.
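
To confirm what actually ends up in the generated hostapd configs, a quick check along these lines may be useful. It is a sketch: report_ap_isolate is a made-up helper name, and the /var/run/hostapd-phy*.conf path in the usage comment is an assumption about where the generated files live on your build.

```shell
#!/bin/sh
# report_ap_isolate FILE...: print "file: ap_isolate=N" for each config,
# or "file: ap_isolate not set" when the key is absent.
report_ap_isolate() {
    for conf in "$@"; do
        # Take the last matching line in case the key appears more than once.
        line=$(grep -E '^ap_isolate=' "$conf" | tail -n 1)
        echo "$conf: ${line:-ap_isolate not set}"
    done
}

# Typical use on the router (path pattern is an assumption):
# report_ap_isolate /var/run/hostapd-phy*.conf
```

Running it before and after editing hostapd.sh shows whether the change reached the generated files.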

@quarky Do you remember if the nss-drv needed the qsdk package?

@Ansuel qsdk is not used by the NSS drivers. As far as I know it’s standalone. It’s basically a utility tool used to configure the qca8337 switch.