Felix has communicated with Toke (creator of the VTBA scheduler) and decided to drop the VTBA scheduler and revert to the round-robin scheduler, plus some additional improvements:
The virtual time scheduler code has a number of issues:
- queues slowed down by hardware/firmware powersave handling were not properly
  handled.
- on ath10k in push-pull mode, tx queues that the driver tries to pull from
  were starved, causing excessive latency
- delay between tx enqueue and reported airtime use were causing excessively
  bursty tx behavior

The bursty behavior may also be present on the round-robin scheduler, but there
it is much easier to fix without introducing additional regressions

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Quarky's finding seems to have been corroborated. Hope OpenWrt Wifi will become reliable again!
Can confirm, my WLAN is completely broken after a few hours of uptime with the current kernel 5.10 build.
Is there a way to download one of the older builds so I can at least revert until it's fixed? Unfortunately, I don't have a backup of the image I used before.
I have several backups. If you want I can share the 20220519 master build which works OK for me.
It periodically logs the following warning:
kern.warn kernel: [276542.060762] ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 83 tid 0
but the WLAN at least continues working.
Felix published updated info
I rebased your master onto the latest commits and tried to build it, but the build failed. With -j4 V=sc it failed earlier on, so I switched to -j1 V=sc, but it kept failing when building qca-nss-drv. The errors are as follows:
from /home/python/Venv/OpenWrt/openwrt/build_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/linux-ipq806x_generic/qca-nss-drv-809a00de/nss_tx_rx_common.h:25,
from /home/python/Venv/OpenWrt/openwrt/build_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/linux-ipq806x_generic/qca-nss-drv-809a00de/nss_bridge.c:17:
/home/python/Venv/OpenWrt/openwrt/build_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/linux-ipq806x_generic/qca-nss-drv-809a00de/nss_core.h: In function 'nss_core_dma_cache_maint':
/home/python/Venv/OpenWrt/openwrt/build_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/linux-ipq806x_generic/qca-nss-drv-809a00de/nss_core.h:120:17: error: implicit declaration of function 'dmac_inv_range'; did you mean 'outer_inv_range'? [-Werror=implicit-function-declaration]
120 | dmac_inv_range(start, start + size);
| ^~~~~~~~~~~~~~
| outer_inv_range
/home/python/Venv/OpenWrt/openwrt/build_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/linux-ipq806x_generic/qca-nss-drv-809a00de/nss_core.h:123:17: error: implicit declaration of function 'dmac_clean_range'; did you mean 'dmac_flush_range'? [-Werror=implicit-function-declaration]
123 | dmac_clean_range(start, start + size);
| ^~~~~~~~~~~~~~~~
| dmac_flush_range
cc1: some warnings being treated as errors
make[5]: *** [scripts/Makefile.build:280: /home/python/Venv/OpenWrt/openwrt/build_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/linux-ipq806x_generic/qca-nss-drv-809a00de/nss_bridge.o] Error 1
make[4]: *** [Makefile:1822: /home/python/Venv/OpenWrt/openwrt/build_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/linux-ipq806x_generic/qca-nss-drv-809a00de] Error 2
make[4]: Leaving directory '/home/python/Venv/OpenWrt/openwrt/build_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/linux-ipq806x_generic/linux-5.10.120'
make[3]: *** [Makefile:127: /home/python/Venv/OpenWrt/openwrt/build_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/linux-ipq806x_generic/qca-nss-drv-809a00de/.built] Error 2
make[3]: Leaving directory '/home/python/Venv/OpenWrt/openwrt/package/qca/qca-nss-drv'
time: package/qca/qca-nss-drv/compile#7.54#1.44#8.32
ERROR: package/qca/qca-nss-drv failed to build.
make[2]: *** [package/Makefile:116: package/qca/qca-nss-drv/compile] Error 1
make[2]: Leaving directory '/home/python/Venv/OpenWrt/openwrt'
make[1]: *** [package/Makefile:110: /home/python/Venv/OpenWrt/openwrt/staging_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/stamp/.package_compile] Error 2
make[1]: Leaving directory '/home/python/Venv/OpenWrt/openwrt'
make: *** [/home/python/Venv/OpenWrt/openwrt/include/toplevel.mk:230: world] Error 2
Felix committed his fix for mac80211 several hours ago; please kick off a new master build. I trust a build from your machine more than one from mine, plus you said you always test it first for us all.
Quarky must be chuckling at his Eureka moment
mac80211: add airtime fairness improvements
This reverts the airtime scheduler back from the virtual-time based scheduler
to the deficit round robin scheduler implementation.
This reduces burstiness and improves fairness by improving interaction with AQL.
https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=6d49a25804d78d639e08a67c86b26991ce6485d8
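For anyone wondering what "deficit round robin" means here, below is a minimal textbook-style sketch in Python. It is not the mac80211 code and makes no attempt to model AQL or how the driver reports airtime; the function name, the station names, and the quantum value are all made up for illustration. The idea is that each station earns a fixed airtime quantum per visit and may only send frames it can pay for, so a station with cheap (fast) frames and a station with expensive (slow) frames end up with a similar share of airtime.

```python
from collections import deque


def drr_airtime_schedule(pending, quantum_us, max_visits):
    """Textbook deficit round robin, charging airtime in microseconds.

    pending:    dict mapping a (hypothetical) station name to a list of
                per-frame airtime costs in microseconds.
    quantum_us: airtime credit added to a station on each visit.
    max_visits: how many station visits to simulate.
    Returns the transmissions as a list of (station, cost) in send order.
    """
    queues = {sta: deque(frames) for sta, frames in pending.items()}
    deficit = {sta: 0 for sta in queues}
    # Only stations with queued frames take part in the rotation.
    active = deque(sta for sta in queues if queues[sta])
    order = []

    for _ in range(max_visits):
        if not active:
            break
        sta = active.popleft()
        deficit[sta] += quantum_us
        q = queues[sta]
        # Send frames while the station can afford the next one.
        while q and q[0] <= deficit[sta]:
            cost = q.popleft()
            deficit[sta] -= cost
            order.append((sta, cost))
        if q:
            # Still backlogged: keep the leftover credit, go to the back.
            active.append(sta)
        else:
            # Queue drained: idle stations do not accumulate credit.
            deficit[sta] = 0
    return order


if __name__ == "__main__":
    # "fast" needs 100 us per frame, "slow" needs 300 us per frame.
    sent = drr_airtime_schedule(
        {"fast": [100] * 6, "slow": [300] * 6}, quantum_us=300, max_visits=4
    )
    for sta in ("fast", "slow"):
        total = sum(cost for name, cost in sent if name == sta)
        print(sta, "airtime used:", total, "us")
```

In this toy run the fast station sends three frames for every one frame of the slow station, yet both use the same total airtime, which is the fairness property the scheduler is after.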
Honestly I cannot exactly say that the Wi-Fi is improved compared to the things that were present (with VTBS) around 15 days ago (before the patches that spoiled the WLAN). But at least now (with latest commits) the WLAN is stable and with really good performance. At least that is in my case. @ACwifidude
After the latest master firmware update I see this in status->firewall
I have been using the 5.10 kernel build from the "OpenWrt 22.03 (Stable) + NSS Hardware Offloading" download, and I seem to be having issues getting full speeds. My connection is PPPoE with BT in the UK, 980 Mbps down and 120 Mbps up. I would like to add that with the "OpenWrt 21.02 (Stable) + NSS Hardware Offloading" download there are no issues. I've set up FQ-CoDel for NSS as per the instructions, with the performance governor and irqbalance, but I don't use packet steering. The loss of speed only seems to affect the 5.10 kernel builds. Is PPPoE offloading broken in these builds? Is anyone else having the same issue? I've reverted to the 21.02 branch for now and everything is working as expected. @ACwifidude - out of interest, will there be any new builds for the 21.02 branch with the new wifi patches? And are there any other reports of PPPoE offloading issues (assuming this is the issue)?
I've just installed an updated 21.02 and there's something wrong.
Now and then (I think when some device leaves the wifi coverage area) all connections simply lock up, both wired and wireless.
After some minutes, everything starts working again.
It doesn't seem to be @quarky's issue, since this also affects wired connections. Possibly the router itself is busy doing something else; I'll try to check CPU usage next time it happens.
Since switching to the schedutil governor, my R7800 has crashed twice. The first crash did not save any ramoops, but the second crash gave me the following ramoops dump, which shows something related to NSS. Could you please take a look? I used the latest master snapshot build from ACwifidude.
I have switched back to the ondemand governor for now, since I had never seen this NSS-related crash prior to switching to the schedutil governor.
<1>[48528.076300] NSS core 0 signal COREDUMP COMPLETE 4000
<1>[48528.076338]
<1>[48528.076338] fd47b999: Starting NSS-FW logbuffer dump for core 0
<1>[48528.080421] fd47b999: Warn: trap[813]: Trap on CHIP ID 00050000
<1>[48528.087796] fd47b999: Warn: trap[620]: Trapped: TRAP_TD(00000004) DCAPT(3C000080)
<1>[48528.093361] fd47b999: Warn: trap[645]: Trapped: Thread: 2, reason: 00000020, PC: 4002F30C, previous PC: 4002F308
<1>[48528.101073] fd47b999: Warn: trap[594]: A0_3: 4AC96ED0 402301C0 3F020D88 4AC96ED2
<3>[48528.104389] wlan0: NSS TX failed with error: NSS_TX_FAILURE_NOT_READY
<1>[48528.111316] fd47b999: Warn: trap[594]: A4_7: 4AC96ED2 40052304 3F020D88 3F00AEF0
<1>[48528.111326] fd47b999: Warn: trap[599]: D0_3: 00000026 00000009 00000006 4AC96EC0
<1>[48528.111334] fd47b999: Warn: trap[599]: D4_7: 00060000 00000026 4368E0CC 4368E0B4
<1>[48528.111342] fd47b999: Warn: trap[599]: D8_11: 4368E0B8 4368E0BC 4C08867C 00000000
<1>[48528.111356] fd47b999: Warn: trap[599]: D12_15: 00000000 00000000 00D84001 00003C00
<1>[48528.154617] fd47b999: Warn: trap[649]: Thread_2 has non-recoverable trap
<1>[48528.165281] NSS core 1 signal COREDUMP COMPLETE 4000
<1>[48528.169143]
<1>[48528.169143] 7f68f8b9: Starting NSS-FW logbuffer dump for core 1
<0>[48528.173840] Kernel panic - not syncing: NSS FW coredump: bringing system down
<2>[48528.181215] CPU1: stopping
<4>[48528.188233] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.10.120 #0
<4>[48528.190833] Hardware name: Generic DT based system
<4>[48528.197017] [<c030e46c>] (unwind_backtrace) from [<c030a204>] (show_stack+0x14/0x20)
<4>[48528.201701] [<c030a204>] (show_stack) from [<c0632ea8>] (dump_stack+0x94/0xa8)
<4>[48528.209597] [<c0632ea8>] (dump_stack) from [<c030d190>] (do_handle_IPI+0x140/0x184)
<4>[48528.216627] [<c030d190>] (do_handle_IPI) from [<c030d1f0>] (ipi_handler+0x1c/0x2c)
<4>[48528.224178] [<c030d1f0>] (ipi_handler) from [<c037184c>] (__handle_domain_irq+0x90/0xf4)
<4>[48528.231821] [<c037184c>] (__handle_domain_irq) from [<c064c154>] (gic_handle_irq+0x90/0xb8)
<4>[48528.240068] [<c064c154>] (gic_handle_irq) from [<c0300b8c>] (__irq_svc+0x6c/0x90)
<4>[48528.248130] Exception stack(0xc146df18 to 0xc146df60)
<4>[48528.255768] df00: 00000000 00002c22
<4>[48528.260822] df20: 1cd58000 dd99fd80 00000000 d8cba8a0 c1c69040 00000000 dd99f030 00002c22
<4>[48528.268980] df40: 00000000 00002c22 0e22a980 c146df68 c07bd41c c07bd43c 60000013 ffffffff
<4>[48528.277137] [<c0300b8c>] (__irq_svc) from [<c07bd43c>] (cpuidle_enter_state+0x180/0x380)
<4>[48528.285292] [<c07bd43c>] (cpuidle_enter_state) from [<c07bd68c>] (cpuidle_enter+0x3c/0x5c)
<4>[48528.293450] [<c07bd68c>] (cpuidle_enter) from [<c034e678>] (do_idle+0x208/0x2a4)
<4>[48528.301522] [<c034e678>] (do_idle) from [<c034e9d0>] (cpu_startup_entry+0x1c/0x20)
<4>[48528.309072] [<c034e9d0>] (cpu_startup_entry) from [<423015ac>] (0x423015ac)