IPQ806x NSS Drivers

While doing performance checks using iperf3 running on my Zyxel NBG6817 with "-P4" I managed to crash the NSS core driver but I'm not sure how...

[  445.305962] NSS core 1 signal COREDUMP COMPLETE 4000
[  445.306014] 
[  445.306014] 1807b2aa: Starting NSS-FW logbuffer dump for core 1
[  445.309992] 1807b2aa: Warn: trap[813]: Trap on CHIP ID 00050000
[  445.317370] 1807b2aa: Warn: trap[620]: Trapped: TRAP_TD(00000020) DCAPT(3C000080)
[  445.323088] 1807b2aa: Warn: trap[645]: Trapped: Thread: 5, reason: 00000800, PC: 4080B158, previous PC: 4080B154
[  445.330655] 1807b2aa: Warn: trap[594]: A0_3: 3F01E900 00000000 3F02D398 3F00AAE0
[  445.340949] 1807b2aa: Warn: trap[594]: A4_7: 3F01F6E8 3F000658 3F02D7B8 3F007794
[  445.348276] 1807b2aa: Warn: trap[599]: D0_3: 00000002 00000001 00000000 00000000
[  445.355655] 1807b2aa: Warn: trap[599]: D4_7: 00000000 00000000 00000000 00000000
[  445.363031] 1807b2aa: Warn: trap[599]: D8_11: 00000000 00000000 00000000 00000000
[  445.370410] 1807b2aa: Warn: trap[599]: D12_15: 00000000 00000000 00000000 00000000
[  445.377788] 1807b2aa: Warn: trap[649]: Thread_5 has non-recoverable trap
[  445.388411] NSS core 0 signal COREDUMP COMPLETE 4000
[  445.392110] 
[  445.392110] c933bd90: Starting NSS-FW logbuffer dump for core 0
[  445.397122] Kernel panic - not syncing: NSS FW coredump: bringing system down
[  445.404444] CPU1: stopping
[  445.411475] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.4.85 #0
[  445.414072] Hardware name: Generic DT based system
[  445.419916] [<c030f974>] (unwind_backtrace) from [<c030b984>] (show_stack+0x14/0x20)
[  445.424769] [<c030b984>] (show_stack) from [<c08f4ea0>] (dump_stack+0x94/0xa8)
[  445.432665] [<c08f4ea0>] (dump_stack) from [<c030eba0>] (handle_IPI+0x184/0x1b8)
[  445.439703] [<c030eba0>] (handle_IPI) from [<c05d9c10>] (gic_handle_irq+0xb4/0xb8)
[  445.447249] [<c05d9c10>] (gic_handle_irq) from [<c0301a8c>] (__irq_svc+0x6c/0x90)
[  445.454614] Exception stack(0xdd46bf18 to 0xdd46bf60)
[  445.462170] bf00:                                                       00000000 00000067
[  445.467232] bf20: 1d060000 ddba0a80 dcc58000 00000000 ddb9fe30 00000067 00000067 00000000
[  445.475391] bf40: b42c47c0 b3c24dc0 00000015 dd46bf68 c072b100 c072b104 80000013 ffffffff
[  445.483540] [<c0301a8c>] (__irq_svc) from [<c072b104>] (cpuidle_enter_state+0x94/0x498)
[  445.491693] [<c072b104>] (cpuidle_enter_state) from [<c072b54c>] (cpuidle_enter+0x30/0x4c)
[  445.499506] [<c072b54c>] (cpuidle_enter) from [<c034ac1c>] (do_idle+0x1d8/0x240)
[  445.507836] [<c034ac1c>] (do_idle) from [<c034af2c>] (cpu_startup_entry+0x1c/0x20)
[  445.515388] [<c034af2c>] (cpu_startup_entry) from [<423024cc>] (0x423024cc)
[  445.722778] Rebooting in 3 seconds..
[  449.723230] bad: scheduling from the idle thread!
[  449.723249] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.85 #0
[  449.726899] Hardware name: Generic DT based system
[  449.732633] [<c030f974>] (unwind_backtrace) from [<c030b984>] (show_stack+0x14/0x20)
[  449.737498] [<c030b984>] (show_stack) from [<c08f4ea0>] (dump_stack+0x94/0xa8)
[  449.745395] [<c08f4ea0>] (dump_stack) from [<c034a938>] (dequeue_task_idle+0x38/0x4c)
[  449.752434] [<c034a938>] (dequeue_task_idle) from [<c090d480>] (__schedule+0x258/0x438)
[  449.760329] [<c090d480>] (__schedule) from [<c090d6b4>] (schedule+0x54/0xf4)
[  449.768141] [<c090d6b4>] (schedule) from [<c09114a8>] (schedule_timeout+0x174/0x30c)
[  449.775433] [<c09114a8>] (schedule_timeout) from [<c0386f40>] (msleep+0x34/0x4c)
[  449.783158] [<c0386f40>] (msleep) from [<c071e918>] (qcom_wdt_restart+0xac/0xbc)
[  449.790535] [<c071e918>] (qcom_wdt_restart) from [<c071cde4>] (watchdog_restart_notifier+0x18/0x30)
[  449.797920] [<c071cde4>] (watchdog_restart_notifier) from [<c033e8f0>] (notifier_call_chain+0x74/0xa8)
[  449.806684] [<c033e8f0>] (notifier_call_chain) from [<c033e960>] (atomic_notifier_call_chain+0x1c/0x24)
[  449.816064] [<c033e960>] (atomic_notifier_call_chain) from [<c030a0c4>] (machine_restart+0x80/0x84)
[  449.825352] [<c030a0c4>] (machine_restart) from [<c031e4c4>] (panic+0x27c/0x2fc)
[  449.834479] [<c031e4c4>] (panic) from [<bf2f3638>] (nss_fw_coredump_notify+0x258/0x25c [qca_nss_drv])
[  449.842122] [<bf2f363

So far I have only been playing with qca-nss-cfi (cryptoapi part only) and qca-nss-crypto.
Any suggestions?? @Ansuel, @quarky

Unrelated to the above post, but what am I missing in the nbg6817.dts to fix this:

[    3.425009] dwc3-qcom 110f8800.usb3: IRQ hs_phy_irq not found
[    3.427475] dwc3-qcom 110f8800.usb3: IRQ dp_hs_phy_irq not found
[    3.433280] dwc3-qcom 110f8800.usb3: IRQ dm_hs_phy_irq not found
[    3.439404] dwc3-qcom 110f8800.usb3: IRQ ss_phy_irq not found
[    3.446275] dwc3-qcom 100f8800.usb3: IRQ hs_phy_irq not found
[    3.450980] dwc3-qcom 100f8800.usb3: IRQ dp_hs_phy_irq not found
[    3.458698] dwc3-qcom 100f8800.usb3: IRQ dm_hs_phy_irq not found
[    3.462898] dwc3-qcom 100f8800.usb3: IRQ ss_phy_irq not found
[    3.470098] dwc3 11000000.dwc3: Failed to get clk 'ref': -2
[    3.536901] dwc3 10000000.dwc3: Failed to get clk 'ref': -2

I might have overlooked something in the R7800.dts, but it seems the dwc3 part is defined "in-tree" as part of the ipq8064??
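
Not an answer to the DTS question, just a way to read the messages above: a minimal Python sketch that groups the missing names per device. The sample lines are copied from the log above; the regexes and the `summarize()` helper are my own and only written against these messages. The -2 in the clock lines is -ENOENT, i.e. the named entry was not found.

```python
import errno
import re
from collections import defaultdict

# Lines copied from the boot log quoted above.
SAMPLE = """\
[    3.425009] dwc3-qcom 110f8800.usb3: IRQ hs_phy_irq not found
[    3.427475] dwc3-qcom 110f8800.usb3: IRQ dp_hs_phy_irq not found
[    3.433280] dwc3-qcom 110f8800.usb3: IRQ dm_hs_phy_irq not found
[    3.439404] dwc3-qcom 110f8800.usb3: IRQ ss_phy_irq not found
[    3.446275] dwc3-qcom 100f8800.usb3: IRQ hs_phy_irq not found
[    3.450980] dwc3-qcom 100f8800.usb3: IRQ dp_hs_phy_irq not found
[    3.458698] dwc3-qcom 100f8800.usb3: IRQ dm_hs_phy_irq not found
[    3.462898] dwc3-qcom 100f8800.usb3: IRQ ss_phy_irq not found
[    3.470098] dwc3 11000000.dwc3: Failed to get clk 'ref': -2
[    3.536901] dwc3 10000000.dwc3: Failed to get clk 'ref': -2
"""

IRQ_MISSING = re.compile(r"(\S+) (\S+): IRQ (\S+) not found")
CLK_MISSING = re.compile(r"(\S+) (\S+): Failed to get clk '([^']+)': (-?\d+)")


def summarize(log_text):
    """Return (missing IRQ names per device, missing clock per device)."""
    irqs = defaultdict(list)
    clks = {}
    for line in log_text.splitlines():
        m = IRQ_MISSING.search(line)
        if m:
            irqs[m.group(2)].append(m.group(3))
            continue
        m = CLK_MISSING.search(line)
        if m:
            code = int(m.group(4))
            # Translate the negative errno into its symbolic name.
            clks[m.group(2)] = (m.group(3), errno.errorcode.get(-code, "?"))
    return dict(irqs), clks


irqs, clks = summarize(SAMPLE)
for dev, names in irqs.items():
    print(dev, "missing IRQs:", ", ".join(names))
for dev, (clk, err) in clks.items():
    print(dev, "missing clk:", clk, err)
```

For this log, both controllers miss the same four named interrupts and both dwc3 cores miss the 'ref' clock, which suggests comparing the interrupt-names and clock-names entries of the USB nodes rather than anything board-specific.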

check the pwr_en_pins... maybe 7800 pins were accidentally copied... pretty sure current master dts should be ok for usb...

At a minimum, the following QCA drivers will be needed for basic ethernet NSS acceleration:

qca-nss-gmac + related kernel patches
qca-nss-drv + related kernel patches
shortcut-fe/simulated-driver
qca-nss-ecm + related kernel patches
NSS firmware

If QoS is required, then add in 'nss-ifb' driver.
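
A minimal sketch of the list above as a checklist, assuming you already have some set of selected components for a build. The names are copied from the list and are labels from this thread, not necessarily the exact OpenWrt package names; `missing_components()` is a hypothetical helper.

```python
# Labels copied from the list above (assumption: not exact package names).
MINIMUM = [
    "qca-nss-gmac",
    "qca-nss-drv",
    "shortcut-fe/simulated-driver",
    "qca-nss-ecm",
    "NSS firmware",
]
QOS_EXTRA = ["nss-ifb"]


def missing_components(selected, want_qos=False):
    """Return the required components that are not in `selected`, in list order."""
    required = MINIMUM + (QOS_EXTRA if want_qos else [])
    return [name for name in required if name not in selected]


print(missing_components({"qca-nss-gmac", "qca-nss-drv"}, want_qos=True))
```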

If these can get accepted into the master branch, it'll benefit all ipq806x routers.

It will be tough though, as someone needs to maintain the code as the master branch evolves. Even before maintenance starts, it has to be accepted into master, which I think will be really tough, as the kernel changes are very invasive. They likely will not sit well with the OpenWrt maintainers.

Looking at the changes in the master branch, it looks like a moving target at the moment. Maybe we should wait until the next release based on the 5.x kernel?

Looks like there could be a bug in the NSS firmware that's related to the crypto engine. NSS Core 1 is dedicated to crypto engine operation. If it's a bug in the firmware, there's not much we can do, unfortunately.

From the stack trace, it looks like the core dump is caused by the watchdog? If that's the case, it could be due to your code hitting an infinite loop or a deadlock, causing the watchdog timer to time out and resulting in a core dump and restart?
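
To make it easier to see which core and thread trapped, here is a minimal Python sketch that pulls the trap fields out of a dump like the one posted above. The sample lines are copied from that log; the regexes and the `find_traps()` helper are my own and only tested against this format.

```python
import re

# Lines copied from the crash log at the top of the thread.
SAMPLE = """\
[  445.305962] NSS core 1 signal COREDUMP COMPLETE 4000
[  445.306014] 1807b2aa: Starting NSS-FW logbuffer dump for core 1
[  445.323088] 1807b2aa: Warn: trap[645]: Trapped: Thread: 5, reason: 00000800, PC: 4080B158, previous PC: 4080B154
[  445.377788] 1807b2aa: Warn: trap[649]: Thread_5 has non-recoverable trap
[  445.392110] c933bd90: Starting NSS-FW logbuffer dump for core 0
"""

DUMP_START = re.compile(r"Starting NSS-FW logbuffer dump for core (\d+)")
TRAP = re.compile(
    r"Trapped: Thread: (\d+), reason: ([0-9A-Fa-f]+), "
    r"PC: ([0-9A-Fa-f]+), previous PC: ([0-9A-Fa-f]+)"
)


def find_traps(log_text):
    """Return one dict per 'Trapped: Thread' line, tagged with the core being dumped."""
    traps = []
    core = None  # core number of the most recent "logbuffer dump" header
    for line in log_text.splitlines():
        m = DUMP_START.search(line)
        if m:
            core = int(m.group(1))
            continue
        m = TRAP.search(line)
        if m:
            traps.append({
                "core": core,
                "thread": int(m.group(1)),
                "reason": int(m.group(2), 16),
                "pc": int(m.group(3), 16),
                "prev_pc": int(m.group(4), 16),
            })
    return traps


for t in find_traps(SAMPLE):
    print(f"core {t['core']} thread {t['thread']} "
          f"reason 0x{t['reason']:08X} pc 0x{t['pc']:08X}")
```

In the posted dump the only trap line sits inside the core 1 section, and the core 0 section contains none, which fits the crypto core being the one that went down first.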

@Ansuel - I am still treating you as "upstream" for the NSS patchset. Just wondering where your interest is in updating some of the NSS code? There was talk about:

I’m going to do some work this weekend to consolidate and reduce the commits. I’ll need some help on the 5.8 wifi and QoS.

Thanks. Works for me - amazing work! I need this router on a daily basis, so I haven't run it for that long tbh.

@Ansuel can you take a look? it seems fairly straightforward, only thing i'm unsure about is the new sta block. it looked like something we should include in the offload path still, but i'm unsure.

FYI i haven't run this on my R7800 yet, waiting for the work(ing from home) day to end.

hello i lost track of this a bit....

I'm investigating a problem with the regulator... Do we still have problems with random freezes and crashes?

Several folks are still having crashes. Your voltage comments are interesting in the exploration thread. Seems like a plausible reason.

I want to test a hack to see whether hardcoding the max voltage fixes the problem.

Hi,

In the NSS drivers, does conntrack work the same way as before, or does it work differently?

The reason I am asking is that, on non-NSS boards like the IPQ4019, I was doing some content filtering with this project: https://github.com/vel21ripn/nDPI

This project enables us to do something like this -
iptables -t mangle -I PREROUTING -m ndpi --youtube -j DROP

Now with NSS enabled on the IPQ6018 chipset, the above does not work. I mean, it drops the traffic momentarily and then all of a sudden the traffic gets through again.

Then I thought it could be because of connection tracking, so I even marked the connection and then dropped on the mark, but the behavior is still similar.

Do you have any idea what the issue could be?

can you test the hack when i publish it?

@Ansuel, is there a way to check the active voltage used by the router?
I'm asking b/c my last build, from November, is still up:

Model: Netgear Nighthawk X4S R7800
Architecture: ARMv7 Processor rev 0 (v7l)
Firmware Version: OpenWrt SNAPSHOT r14530+605-9085343 / LuCI Master unknown
Kernel Version: 5.4.80
Local Time: 2021-01-07 10:32:35
Uptime: 41d 14h 27m 24s

The build is supposedly using the NSS cores:

[   21.040820] **********************************************************
[   21.040848] * Driver    :NSS GMAC Driver - RTL v(3.72a)
[   21.046233] * Version   :1.0
[   21.051425] * Copyright :Copyright (c) 2013-2018 The Linux Foundation. All rights reserved.
[   21.054483] **********************************************************
[   21.206638] nss_driver - fw of size 536324  bytes copied to load addr: 40000000, nss_id : 0
[   21.207489] nss_driver - Turbo Support 1
[   21.213870] Supported Frequencies - 
[   21.213872] 800Mhz 
[   21.217953] 800Mhz 
[   21.221585] 800Mhz 
[   21.223330] 
[   21.227766] 3782ad7e: meminfo init succeed
[   21.253122] nss_driver - fw of size 218224  bytes copied to load addr: 40800000, nss_id : 1

Performance is also where it needs to be.

There isn't... The problem here is that, as with smartphones and any other device with a silicon chip, every CPU has its own voltage. Some chips are better than others, and some need more voltage than others.
On ipq806x there are efuses, set by qcom (or whoever manufactures the CPU), that tell how good the chip is. With that as background...
The regulator system was never actually used (nobody enabled the regulator in the kernel in the first place). The regulator communicates with the system through the rpm firmware. This is one-way communication: we tell it to set a value and wait for a response (most of the time, from some tests, it rejects the request). AFAIK there is no way to ask the rpm interface what voltage is actually in use. The current regulator driver is shit, and if the regulator is never enabled, it just records the voltage as if it had been applied (the voltage is stored in a simple struct, and the value changes even with the regulator disabled). So the numbers we get from regulator_summary are just what we told the system to set, not what is actually set.

tl;dr: the regulator system is broken; we can't read back the voltage currently set in the system.
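
To illustrate the point, a toy Python model of the behaviour described above. This is not the real driver: `CachedRegulator` and the `firmware_accepts` callback are hypothetical stand-ins for the regulator struct and the rpm firmware reply.

```python
class CachedRegulator:
    """Toy model: remembers the last requested voltage whether or not it was applied."""

    def __init__(self, firmware_accepts):
        self.enabled = False
        self.requested_uv = None  # what a summary would print
        # Hypothetical stand-in for the rpm firmware's accept/reject reply.
        self._firmware_accepts = firmware_accepts

    def set_voltage(self, microvolts):
        # The value is cached unconditionally, which is the problem described above.
        self.requested_uv = microvolts
        if not self.enabled:
            return False  # nothing is sent to the firmware at all
        return self._firmware_accepts(microvolts)

    def summary(self):
        # Reports the request, not the voltage the hardware is really running at.
        return self.requested_uv


reg = CachedRegulator(firmware_accepts=lambda uv: False)  # firmware rejects everything
applied = reg.set_voltage(1050000)
print("applied:", applied, "summary says:", reg.summary())
```

The request is never applied here, yet the summary still shows 1050000, which is why the numbers cannot be trusted as a measurement.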

Also, in your case you probably have a working system because you have very good silicon... (mainly, the last PVS is fused and you can run the max freq with just 1050000 instead of 1262500,
so your CPU can work at about 1 V at max freq instead of about 1.2 V. Anyone with some overclocking experience knows that 0.2 V is a world of difference for a CPU.)
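
For a rough sense of scale, a small Python sketch of the arithmetic. Two assumptions that are mine, not from the thread: the two numbers are microvolts (the usual Linux regulator unit), and dynamic power scales roughly with the square of the voltage at a fixed frequency (the common CMOS approximation).

```python
def compare_voltages(low_uv, high_uv):
    """Compare two voltages given in microvolts (assumed unit)."""
    low_v = low_uv / 1_000_000
    high_v = high_uv / 1_000_000
    return {
        "low_v": low_v,
        "high_v": high_v,
        "delta_v": high_v - low_v,
        # Approximation: dynamic power ~ V^2 at the same frequency.
        "dynamic_power_ratio": (low_v / high_v) ** 2,
    }


r = compare_voltages(1050000, 1262500)
print(f"{r['low_v']:.4f} V vs {r['high_v']:.4f} V, delta {r['delta_v']:.4f} V")
print(f"dynamic power ratio ~ {r['dynamic_power_ratio']:.2f}")
```

So the gap is about 0.21 V, and under the V^2 approximation the better-binned chip would burn roughly 30% less dynamic power at the same max frequency.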

This is a very interesting point. To make sure we are on the same page, I need to reiterate that prior to this build, I was having random reboots like everybody else.
I don't think I have exceptionally good silicon. I'm never that lucky :)

Suggestion 1: if you can manage to get your code working with the current master, probably someone else can build it and test (for example, me :))
Suggestion 2: if you got it right and it is a matter of voltage (great catch!), I remember that when the NSS cores are allowed to throttle themselves automatically, the reboots are much more frequent. Maybe this can help in testing? My R7800 reboots more or less once every two or three weeks, so I would not be able to test this otherwise.

(and welcome back, nice to see you again :))

I'm currently testing with some very bad code... Luckily the rpm firmware reports immediately when it rejects the voltage request.
Again, this must be fixed or I must find a workaround...
The silicon world is so beautiful. If we apply the same rules that we apply to normal CPU overclocking, the first thing to do is disable CPU scaling to keep stability. And what we are doing here is the same.

well, but if we apply the same rules as normal CPU/GPU overclocking, NSS should be much more unstable when fixed at maximum frequency than when fixed at minimum.. and that does not seem to be the case, or am I wrong?