Ipq806x NSS build (Netgear R7800 / TP-Link C2600 / Linksys EA8500)

I'm sorry to hear that. Yes, my router rebooted today too, after I upgraded the firmware. I have no idea why it stayed alive for 4 days before that...

It's really discouraging, but hey, it works really well without that enabled. So thanks for them, and thanks to you for trying to fix promisc mode; in the end, you'll get it. Greetings and encouragement.

Well, I noticed I needed to set /sys/class/net/br-lan/brif/eth1.1/hairpin_mode to 1 for another issue... that for sure makes the router reboot if br-lan is set to promisc mode.
So if anyone wants to debug, go ahead :slight_smile:

Having pstore active doesn't help in this case, /sys/fs/pstore/ is empty. :frowning:
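For anyone who wants to check their own router after a reboot, here is a minimal sketch that lists whatever pstore kept and flags records containing a given string. The function name `pstore_scan` and its arguments are made up for illustration, and it assumes a pstore backend is actually configured on your build; an empty directory just gets reported as such.

```shell
#!/bin/sh
# pstore_scan [PATTERN] [DIR]   (hypothetical helper, not part of any package)
# Lists pstore records and says which ones contain PATTERN.
# DIR defaults to /sys/fs/pstore; it is a parameter only so the function
# can also be pointed at a saved copy of the records.
pstore_scan() {
    pattern="${1:-Kernel panic}"
    dir="${2:-/sys/fs/pstore}"
    found=0
    for rec in "$dir"/*; do
        [ -f "$rec" ] || continue   # empty directory: the glob stays literal
        found=1
        if grep -q "$pattern" "$rec"; then
            echo "${rec##*/}: matches '$pattern'"
        else
            echo "${rec##*/}: no match"
        fi
    done
    [ "$found" -eq 1 ] || echo "no pstore records in $dir"
}
```

Calling `pstore_scan` with no arguments looks for "Kernel panic" in /sys/fs/pstore; pass another string as the first argument to look for something else.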


Did you enable pstore for console messages? I had a random reboot earlier this week and didn't get any panic/oops logged, but it did capture the log below, which I think indicates the NSS firmware crashed and decided to reboot the device for me...

That one was caused by me setting scaling_min_freq to the minimum permitted value, with the reboot occurring ~2 days later. Since capping the min freq at 600000, reboots are normally caused by me updating firmware rather than being a "normal" occurrence, but obviously everyone has different workloads and clients; a lot of mine are connected via dumb APs rather than to the main router...

[334464.821769] NSS core 0 signal COREDUMP COMPLETE 4000
[334464.821858]
[334464.821858] 69d42857: Starting NSS-FW logbuffer dump for core 0
[334464.825865] 69d42857: Warn: trap[813]: Trap on CHIP ID 00050000
[334464.833417] 69d42857: Warn: trap[620]: Trapped: TRAP_TD(00000004) DCAPT(3C000080)
[334464.839504] 69d42857: Warn: trap[645]: Trapped: Thread: 2, reason: 00000020, PC: 4002F30C, previous PC: 4002F308
[334464.846944] 69d42857: Warn: trap[594]: A0_3: 588B9B50 402301C0 3F01ABF8 588B9B52
[334464.857316] 69d42857: Warn: trap[594]: A4_7: 588B9B52 40052304 3F01ABF8 3F00AEF0
[334464.864719] 69d42857: Warn: trap[599]: D0_3: 00000026 00000009 00000001 588B9B40
[334464.872185] 69d42857: Warn: trap[599]: D4_7: 00060000 00000026 000003F4 00000009
[334464.879575] 69d42857: Warn: trap[599]: D8_11: 406AAE20 4C5220BC 55A001D0 00000000
[334464.887118] 69d42857: Warn: trap[599]: D12_15: 00000000 00000000 00D84001 00005805
[334464.894583] 69d42857: Warn: trap[649]: Thread_2 has non-recoverable trap
[334464.906955] NSS core 1 signal COREDUMP COMPLETE 4000
[334464.909015]
[334464.909015] f517ac2e: Starting NSS-FW logbuffer dump for core 1
[334464.914175] Kernel panic - not syncing: NSS FW coredump: bringing system down
[334464.921507] CPU1: stopping
[334464.928703] CPU: 1 PID: 1045 Comm: logd Not tainted 5.10.100 #0
[334464.931384] Hardware name: Generic DT based system
[334464.937666] [<c030ebb8>] (unwind_backtrace) from [<c030a820>] (show_stack+0x14/0x20)
[334464.942266] [<c030a820>] (show_stack) from [<c0679638>] (dump_stack+0x94/0xa8)
[334464.950246] [<c0679638>] (dump_stack) from [<c030d7b0>] (do_handle_IPI+0x140/0x184)
[334464.957359] [<c030d7b0>] (do_handle_IPI) from [<c030d810>] (ipi_handler+0x1c/0x2c)
[334464.965344] [<c030d810>] (ipi_handler) from [<c0373c28>] (__handle_domain_irq+0x90/0xf4)
[334464.972728] [<c0373c28>] (__handle_domain_irq) from [<c0694c70>] (gic_handle_irq+0x90/0xb8)
[334464.981061] [<c0694c70>] (gic_handle_irq) from [<c0300e90>] (__irq_usr+0x50/0x80)
[334464.989549] Exception stack(0xc61bffb0 to 0xc61bfff8)
[334464.996939] ffa0:                                     004b3130 b6e46b20 00000098 00000098
[334465.002080] ffc0: b6efd0b3 b6e075d0 b6e0762c 00000039 004b31ec 00000000 b6ce9080 b6e075d0
[334465.010317] ffe0: 004b2f1c befdcd88 004a1c20 b6e602bc 60000010 ffffffff
[334465.242752] Rebooting in 3 seconds..
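The 600000 floor on scaling_min_freq mentioned before the log can be applied to every cpufreq policy in one go. This is only a sketch: `set_min_freq` is a made-up helper name, the directory argument exists purely so it can be tried against a copy of the tree, and whether 600000 is the right value for your workload is something you would have to test yourself.

```shell
#!/bin/sh
# set_min_freq FREQ_KHZ [CPUFREQ_DIR]   (hypothetical helper)
# Writes FREQ_KHZ into scaling_min_freq of every policy and echoes back
# what the file contains afterwards.
set_min_freq() {
    freq="$1"
    dir="${2:-/sys/devices/system/cpu/cpufreq}"
    for policy in "$dir"/policy*; do
        [ -w "$policy/scaling_min_freq" ] || continue
        echo "$freq" > "$policy/scaling_min_freq"
        echo "${policy##*/}: scaling_min_freq=$(cat "$policy/scaling_min_freq")"
    done
}
```

On the router this would be run as `set_min_freq 600000`. The setting does not survive a reboot, so it has to be reapplied at boot if you want to keep it.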

I have no idea what you're saying, but you're the first to log an NSS core reboot.

I think you are correct, console log messages weren't enabled... I'll enable them now.
Any errors I get I will post in the ipq806x-nss-drivers thread. Hopefully Quarky or Ansuel can help us.

client 1 -> eth1.1 -> bridge -> wan (hairpin) -> bridge -> eth1.1 -> client 1

@quarky regarding this, I noticed that after running echo 1 > /sys/class/net/br-lan/brif/eth1.1/hairpin_mode, the forwarded port is actually open, but no data is received and Ethernet traffic becomes very laggy...

Actually you should not set hairpin mode for eth1.1, unless one of your switch ports bridged to br-lan is connected to more than one subnet and br-lan is routing between those networks.

I would have thought the issue would be purely firewall configuration?

You want your LAN clients to be able to access your router’s WAN IP which is forwarded back to a LAN server?

^ Yes, this

It WORKS if br-lan is in promisc mode... but br-lan in promisc mode makes the router unstable so I want to avoid that.

EDIT:
Setting hairpin like I said makes the router go nuts and eat all CPU.
I'm going to try br-lan in promisc mode AND using the performance governor, it might be a frequency scaling issue after all...
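When comparing notes on which combination triggers the reboots, it helps to record the actual state of the bridge first. A minimal sketch, assuming the usual sysfs layout: `bridge_report` is a hypothetical name, and the second argument is only there so it can be run against a fake tree. It reads the promisc bit (IFF_PROMISC, 0x100) from the bridge's flags file and prints hairpin_mode for each bridge port.

```shell
#!/bin/sh
# bridge_report [BRIDGE] [SYSFS_NET_DIR]   (hypothetical helper)
# Prints whether BRIDGE has IFF_PROMISC set and the hairpin_mode of each port.
bridge_report() {
    br="${1:-br-lan}"
    net="${2:-/sys/class/net}"
    [ -r "$net/$br/flags" ] || { echo "$br: not found" >&2; return 1; }
    flags=$(cat "$net/$br/flags")          # e.g. 0x1303
    if [ $(( $flags & 0x100 )) -ne 0 ]; then   # 0x100 is IFF_PROMISC
        echo "$br promisc=1"
    else
        echo "$br promisc=0"
    fi
    for port in "$net/$br/brif"/*; do
        [ -f "$port/hairpin_mode" ] || continue
        echo "${port##*/} hairpin_mode=$(cat "$port/hairpin_mode")"
    done
}
```

Running `bridge_report br-lan` before and after a change gives something concrete to paste next to a crash log.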

@xeonpj Btw, if it is a CPU scaling issue that's causing the reboots, you could just try

echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
echo performance > /sys/devices/system/cpu/cpufreq/policy1/scaling_governor

It makes the router about 2°C hotter, but it also seems more stable, and of course more responsive since it's running at max frequency all the time.
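The two echo lines can also be written as a loop over whatever policies exist, with a check that the governor is actually offered. This is just a sketch: `set_governor` is a made-up name and the directory argument is only there for trying it out on a fake tree.

```shell
#!/bin/sh
# set_governor NAME [CPUFREQ_DIR]   (hypothetical helper)
# Sets NAME as the governor on every policy that lists it in
# scaling_available_governors, and prints the resulting governor.
set_governor() {
    gov="$1"
    dir="${2:-/sys/devices/system/cpu/cpufreq}"
    for policy in "$dir"/policy*; do
        [ -f "$policy/scaling_governor" ] || continue
        if ! grep -qw "$gov" "$policy/scaling_available_governors" 2>/dev/null; then
            echo "${policy##*/}: $gov not available" >&2
            continue
        fi
        echo "$gov" > "$policy/scaling_governor"
        echo "${policy##*/}: $(cat "$policy/scaling_governor")"
    done
}
```

Usage would be `set_governor performance`. Like the plain echo commands, it has to be rerun after every boot.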


I did it, mate. I copied your rc.local.

I have tested the NSS build for a while and, besides slightly worse bufferbloat scores, there is a significant difference in LAN Ethernet performance:

NSS
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  48.0 MBytes   403 Mbits/sec    5   1.26 MBytes       
[  5]   1.00-2.00   sec  51.2 MBytes   430 Mbits/sec    0   1.41 MBytes       
[  5]   2.00-3.00   sec  48.8 MBytes   409 Mbits/sec    0   1.52 MBytes       
[  5]   3.00-4.00   sec  46.2 MBytes   388 Mbits/sec    0   1.61 MBytes       
[  5]   4.00-5.00   sec  47.5 MBytes   398 Mbits/sec    0   1.68 MBytes       
[  5]   5.00-6.00   sec  40.0 MBytes   336 Mbits/sec  136   1.23 MBytes       
[  5]   6.00-7.00   sec  31.2 MBytes   262 Mbits/sec   48    950 KBytes       
[  5]   7.00-8.00   sec  43.8 MBytes   367 Mbits/sec   20    714 KBytes       
[  5]   8.00-9.00   sec  58.8 MBytes   493 Mbits/sec    0    771 KBytes       
[  5]   9.00-10.00  sec  51.2 MBytes   430 Mbits/sec   33    597 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   467 MBytes   392 Mbits/sec  242             sender
[  5]   0.00-10.00  sec   464 MBytes   389 Mbits/sec                  receiver

vs

no-NSS
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  73.4 MBytes   615 Mbits/sec    0   1.79 MBytes       
[  5]   1.00-2.00   sec  73.8 MBytes   619 Mbits/sec    0   1.79 MBytes       
[  5]   2.00-3.00   sec  80.0 MBytes   671 Mbits/sec    0   1.79 MBytes       
[  5]   3.00-4.00   sec  78.8 MBytes   661 Mbits/sec    0   1.79 MBytes       
[  5]   4.00-5.00   sec  53.8 MBytes   451 Mbits/sec    0   1.79 MBytes       
[  5]   5.00-6.00   sec  78.8 MBytes   661 Mbits/sec    0   1.79 MBytes       
[  5]   6.00-7.00   sec  81.2 MBytes   682 Mbits/sec    0   1.79 MBytes       
[  5]   7.00-8.00   sec  78.8 MBytes   661 Mbits/sec    0   1.79 MBytes       
[  5]   8.00-9.00   sec  78.8 MBytes   661 Mbits/sec    0   1.79 MBytes       
[  5]   9.00-10.00  sec  80.0 MBytes   671 Mbits/sec    0   1.79 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   757 MBytes   635 Mbits/sec    0             sender
[  5]   0.00-10.02  sec   756 MBytes   633 Mbits/sec                  receiver

What might be the reason for such a difference?
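To compare runs like the two above without eyeballing the tables, the summary lines can be pulled out with a bit of awk. A minimal sketch, assuming iperf3's default text output as pasted here; `iperf_summary` is a hypothetical helper name.

```shell
#!/bin/sh
# iperf_summary   (hypothetical helper)
# Reads iperf3 text output on stdin and prints "<role> <Mbits/sec>"
# for each sender/receiver summary line.
iperf_summary() {
    awk '$NF == "sender" || $NF == "receiver" {
        for (i = 1; i < NF; i++)
            if ($(i + 1) == "Mbits/sec") print $NF, $i
    }'
}
```

Typical use would be `iperf3 -c hostname | iperf_summary`, once per build, and then comparing the numbers side by side.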

Something is wrong with LAN in NSS. Several things cause problems; possibly, once the experts find the cause, everything will work perfectly.

Sorry for the dumb question, but is there any benefit to running this NSS build vs a standard build (e.g. hnyman) when your ISP connection is only 100Mbit?

Those are odd results. I get upper 500mbps / low 600mbps wireless, full line speed wired on LAN.

What does your client / server setup look like, and what iperf settings are you using?

@zabolots honestly with a 100mbps ISP speed you are better off with standard OpenWrt. I’d run cake sqm at that speed and enjoy excellent latency for your speed. My build is more ideal for 500mbps+ speeds for squeezing out every bit of performance for faster internet connections.


I was testing from one R7800 (AP mode) to R7800 router and only over copper. For iperf3 options nothing fancy: iperf3 -s and iperf3 -c hostname.

R7800s will run out of processor if they are the client or the server. The router or the AP should only be the “in between”, never the server or receiver for iperf at these speeds (R7800s work great for iperf at much slower speeds, but when maxing out line rate or a 2x2 Wi-Fi 5 wireless connection, the R7800 CPU falls flat on its face and can’t keep up).

I’d test using two wired PCs for wired, and a wired PC plus a wireless client for wireless. With a wired PC as the iperf server, an iPhone 13 as the Wi-Fi client and an R7800 in between, it looks like this:


ath10k-ct, 5.10 Kernel with NSS Hardware Offloading
[SUM]   0.00-30.01  sec  2.30 GBytes   659 Mbits/sec                  receiver
[SUM]   0.00-30.01  sec  1.99 GBytes   569 Mbits/sec  189             sender

ath10k (OpenWrt with no offloading)
[SUM]   0.00-30.01  sec  1.60 GBytes   459 Mbits/sec                  receiver
[SUM]   0.00-30.01  sec  1.14 GBytes   326 Mbits/sec  699             sender

ath10k-ct (OpenWrt with no offloading)
[SUM]   0.00-30.01  sec  1.53 GBytes   437 Mbits/sec                  receiver
[SUM]   0.00-30.01  sec  1.21 GBytes   347 Mbits/sec  763             sender
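For anyone wanting to repeat this kind of test with the router only in between, here is a rough sketch of the client side. The wired PC just runs `iperf3 -s`. The stream count is an assumption (the [SUM] lines suggest parallel streams were used, but the exact settings were not posted), the 30 seconds matches the interval shown, and the server address is a placeholder. `run_iperf` and `IPERF_CMD` are hypothetical names.

```shell
#!/bin/sh
# run_iperf SERVER [STREAMS] [SECONDS]   (hypothetical helper)
# Runs one test client -> server, then one in reverse mode (-R, server sends).
# Set IPERF_CMD="echo iperf3" to print the commands instead of running them.
run_iperf() {
    server="$1"
    streams="${2:-4}"    # assumed value, not taken from the thread
    secs="${3:-30}"
    ${IPERF_CMD:-iperf3} -c "$server" -P "$streams" -t "$secs"
    ${IPERF_CMD:-iperf3} -c "$server" -P "$streams" -t "$secs" -R
}
```

Run it from a PC or phone-side client on the other side of the router, e.g. `run_iperf 192.168.1.10`, with the address replaced by that of your own wired iperf3 server.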

I've done some more tests, and apparently the 'worse' performance can be seen only when testing from the access point (non-NSS) towards the NSS-enabled router. In the opposite direction it is wire speed. Therefore no issue.


I waited a few days, no change, so I filed a bug report. I haven't found a similar issue, and if this is truly a typo in the package code, I can't imagine I'd be the only one who has run into it over the past week(s). The last change on OpenWrt is from February 28, 2021, so the coding error may very well be in the upstream GNU release of findutils?

Haven't you tried ath10k with offloading? Are you using 80MHz or 160MHz? With plain ath10k I can't get any advantage from a 160MHz channel width (on the contrary, I get less performance), but even with 80MHz it's not bad at all (uptime over 14 days).

this 21.02 is so damn good..
