IPQ806x NSS Drivers

If you can run gdb on ecm.o, please show the output for:

l *ecm_nss_ported_ipv4_connection_destroy_callback+0x21c/0x4c0

Sure...

l *ecm_nss_ported_ipv4_connection_destroy_callback+0x21c/0x4c0
qca-nss-ecm-standard/qca-nss-ecm-9228212b/frontends/nss/ecm_nss_ported_ipv4.c:1302
1297            struct ecm_nss_ported_ipv4_connection_instance *npci;
1298
1299            /*
1300             * Is this a response to a destroy message?
1301             */
1302            if (nim->cm.type != NSS_IPV4_TX_DESTROY_RULE_MSG) {
1303                    DEBUG_ERROR("%p: ported destroy callback with improper type: %d\n", nim, nim->cm.type);
1304                    return;
1305            }
1306

and just l *ecm_nss_ported_ipv4_connection_destroy_callback+0x21c gives:

qca-nss-ecm-standard/qca-nss-ecm-9228212b/frontends/nss/ecm_nss_ported_ipv4.c:1419
1414            DEBUG_CHECK_MAGIC(npci, ECM_NSS_PORTED_IPV4_CONNECTION_INSTANCE_MAGIC, "%p: magic failed", npci);
1415            /*
1416             * Increment the decel pending counter
1417             */
1418            spin_lock_bh(&ecm_nss_ipv4_lock);
1419            ecm_nss_ipv4_pending_decel_count++;
1420            spin_unlock_bh(&ecm_nss_ipv4_lock);
1421
1422            /*
1423             * Prepare deceleration message

If any of the extra output helps...

There should be a line number printed for where the main problem is supposedly caused? I suppose it should point to the spin_lock_bh() line?

I updated the post with file and line number.

Weird, it hung at a simple variable increment. Was your router's CPU under prolonged 100% load when it happened?

I have no idea. I wish I had a way to see CPU usage before it reboots. :slight_smile:

EDIT:
I'm wondering if these particular reboots are because of packet steering... do you have it enabled yourself?
Because I enabled it just a day ago... and a few minutes after boot started to see:

NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!

but it only shows up about 5 times after boot.

I'll disable packet steering and see if I still get the CPU stall error...

No. I don't have packet steering enabled, as I think it doesn't help. With NSS, the CPU is not really the bottleneck, so CPU cache misses are not a big penalty.

But I think I may see a problem. The methods ecm_nss_ported_ipv4_connection_defunct_callback() and ecm_nss_ported_ipv4_connection_destroy_callback() both take the same lock. It looks like both methods are called from the same thread, and that is causing a deadlock.
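
To illustrate the scenario being described, here is a minimal, hypothetical C sketch (the lock and function names are invented for illustration; this is not the actual ECM code). Spinlocks are not recursive, so re-taking a lock you already hold on the same CPU from a nested call spins forever, which would match a stall that points at spin_lock_bh():

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(demo_lock);

/* hypothetical stand-in for the destroy callback */
static void demo_destroy_callback(void)
{
        spin_lock_bh(&demo_lock);       /* second acquisition on the same CPU: spins forever */
        /* ... bump the pending-decel counter, build the destroy message ... */
        spin_unlock_bh(&demo_lock);
}

/* hypothetical stand-in for the defunct callback */
static void demo_defunct_callback(void)
{
        spin_lock_bh(&demo_lock);       /* first acquisition succeeds */
        demo_destroy_callback();        /* nested call while still holding the lock -> self-deadlock */
        spin_unlock_bh(&demo_lock);     /* never reached */
}

Whether the real defunct path still holds ecm_nss_ipv4_lock when it calls into the destroy callback is exactly what would need to be confirmed in the ECM source.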

I think packet steering is probably the cause? Looks like we cannot turn on packet steering with NSS.

Btw, I found out that #08 is probably NET_TX_SOFTIRQ ... don't take my word for it tho'.
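
For reference, a sketch of the softirq numbering from include/linux/interrupt.h in a 5.10 kernel. One caveat: the NOHZ message appears to print the pending-softirq value as two hex digits (a bitmask), so whether "#08" means bit 3 or softirq number 8 depends on how it is read; treat the list below as a reference, not a conclusion.

/* include/linux/interrupt.h (Linux 5.10) */
enum
{
        HI_SOFTIRQ = 0,
        TIMER_SOFTIRQ,
        NET_TX_SOFTIRQ,
        NET_RX_SOFTIRQ,
        BLOCK_SOFTIRQ,
        IRQ_POLL_SOFTIRQ,
        TASKLET_SOFTIRQ,
        SCHED_SOFTIRQ,
        HRTIMER_SOFTIRQ,
        RCU_SOFTIRQ,
        NR_SOFTIRQS
};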

Yes, I've disabled packet steering now and I'll keep you updated if I get a new crash. I'm really after hunting down the crash we get when the bridge is in promisc mode.

Well... @quarky, it seems like NSS doesn't like having br-lan in promisc mode. This is without packet steering.

[161721.019986] rcu: INFO: rcu_sched self-detected stall on CPU
[161721.020018] rcu:    0-...!: (2100 ticks this GP) idle=7aa/1/0x40000004 softirq=3868679/3868679 fqs=0
[161721.024711]         (t=2100 jiffies g=6680641 q=3)
[161721.033820] rcu: rcu_sched kthread starved for 2101 jiffies! g6680641 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[161721.038079] rcu:    Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[161721.048579] rcu: RCU grace-period kthread stack dump:
[161721.057342] task:rcu_sched       state:I stack:    0 pid:   11 ppid:     2 flags:0x00000000
[161721.062572] [<c09c6864>] (__schedule) from [<c09c6bfc>] (schedule+0x68/0x110)
[161721.071154] [<c09c6bfc>] (schedule) from [<c09ca7cc>] (schedule_timeout+0x74/0xd8)
[161721.078191] [<c09ca7cc>] (schedule_timeout) from [<c038576c>] (rcu_gp_kthread+0x544/0xd18)
[161721.085740] [<c038576c>] (rcu_gp_kthread) from [<c033e3d8>] (kthread+0x15c/0x160)
[161721.094068] [<c033e3d8>] (kthread) from [<c0300148>] (ret_from_fork+0x14/0x2c)
[161721.101703] Exception stack(0xc1469fb0 to 0xc1469ff8)
[161721.108910] 9fa0:                                     00000000 00000000 00000000 00000000
[161721.114126] 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[161721.122371] 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[161721.130615] NMI backtrace for cpu 0
[161721.137464] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.100 #0
[161721.141024] Hardware name: Generic DT based system
[161721.147018] [<c030e32c>] (unwind_backtrace) from [<c030a1ac>] (show_stack+0x14/0x20)
[161721.151797] [<c030a1ac>] (show_stack) from [<c062f3e8>] (dump_stack+0x94/0xa8)
[161721.159780] [<c062f3e8>] (dump_stack) from [<c0637990>] (nmi_cpu_backtrace+0xdc/0x108)
[161721.166897] [<c0637990>] (nmi_cpu_backtrace) from [<c0637adc>] (nmi_trigger_cpumask_backtrace+0x120/0x158)
[161721.174888] [<c0637adc>] (nmi_trigger_cpumask_backtrace) from [<c0381334>] (rcu_dump_cpu_stacks+0xe8/0x118)
[161721.184612] [<c0381334>] (rcu_dump_cpu_stacks) from [<c0386dc8>] (rcu_sched_clock_irq+0x728/0x8f8)
[161721.194682] [<c0386dc8>] (rcu_sched_clock_irq) from [<c038dec4>] (update_process_times+0x64/0x90)
[161721.203453] [<c038dec4>] (update_process_times) from [<c03a0824>] (tick_sched_timer+0x88/0x130)
[161721.212475] [<c03a0824>] (tick_sched_timer) from [<c038e4c8>] (__hrtimer_run_queues+0x184/0x254)
[161721.221415] [<c038e4c8>] (__hrtimer_run_queues) from [<c038f500>] (hrtimer_interrupt+0x130/0x374)
[161721.230190] [<c038f500>] (hrtimer_interrupt) from [<c07e44a4>] (msm_timer_interrupt+0x3c/0x4c)
[161721.239043] [<c07e44a4>] (msm_timer_interrupt) from [<c0377608>] (handle_percpu_devid_irq+0x84/0x178)
[161721.247631] [<c0377608>] (handle_percpu_devid_irq) from [<c037115c>] (__handle_domain_irq+0x90/0xf4)
[161721.257005] [<c037115c>] (__handle_domain_irq) from [<c0649740>] (gic_handle_irq+0x90/0xb8)
[161721.266292] [<c0649740>] (gic_handle_irq) from [<c0300b0c>] (__irq_svc+0x6c/0x90)
[161721.274794] Exception stack(0xc0d01130 to 0xc0d01178)
[161721.282174] 1120:                                     bf95a100 00000000 0000bc0a 0000bc08
[161721.287305] 1140: c64abe00 00000000 bf95a160 c64abe74 c0d014fc c64ab4a8 00000004 00000000
[161721.295548] 1160: 00000000 c0d01180 bf9378c4 c09cb498 80000113 ffffffff
[161721.303793] [<c0300b0c>] (__irq_svc) from [<c09cb498>] (_raw_spin_lock_bh+0x44/0x58)
[161721.310769] [<c09cb498>] (_raw_spin_lock_bh) from [<bf9378c4>] (ecm_nss_ported_ipv4_connection_destroy_callback+0x21c/0x4c0 [ecm])
[161721.318544] [<bf9378c4>] (ecm_nss_ported_ipv4_connection_destroy_callback [ecm]) from [<bf938028>] (ecm_nss_ported_ipv4_connection_defunct_callback+0xfc/0x164 [ecm])
[161721.330135] [<bf938028>] (ecm_nss_ported_ipv4_connection_defunct_callback [ecm]) from [<bf926934>] (ecm_db_connection_make_defunct+0x34/0xa0 [ecm])
[161721.345064] [<bf926934>] (ecm_db_connection_make_defunct [ecm]) from [<bf934d4c>] (ecm_conntrack_ipv4_event+0xbc/0xd8 [ecm])
[161721.358462] [<bf934d4c>] (ecm_conntrack_ipv4_event [ecm]) from [<c0341200>] (atomic_notifier_call_chain+0x64/0x94)
[161721.369621] [<c0341200>] (atomic_notifier_call_chain) from [<bf781e9c>] (nf_conntrack_eventmask_report+0xa8/0x324 [nf_conntrack])
[161721.379866] [<bf781e9c>] (nf_conntrack_eventmask_report [nf_conntrack]) from [<bf7772ec>] (nf_ct_delete+0x5c/0x144 [nf_conntrack])
[161721.391660] [<bf7772ec>] (nf_ct_delete [nf_conntrack]) from [<bf7774a0>] (nf_ct_kill_acct+0xcc/0x530 [nf_conntrack])
[161721.403378] [<bf7774a0>] (nf_ct_kill_acct [nf_conntrack]) from [<bf778f6c>] (nf_conntrack_tuple_taken+0x44c/0x750 [nf_conntrack])
[161721.414196] [<bf778f6c>] (nf_conntrack_tuple_taken [nf_conntrack]) from [<bf941728>] (ecm_classifier_nl_connection_added+0x19c/0x34c [ecm])
[161721.425880] [<bf941728>] (ecm_classifier_nl_connection_added [ecm]) from [<bf92874c>] (ecm_db_connection_add+0x4a4/0x6a8 [ecm])
[161721.438638] [<bf92874c>] (ecm_db_connection_add [ecm]) from [<bf93a478>] (ecm_nss_ported_ipv4_process+0xffc/0x1080 [ecm])
[161721.450183] [<bf93a478>] (ecm_nss_ported_ipv4_process [ecm]) from [<bf935ad0>] (ecm_nss_ipv4_init+0x6cc/0x1240 [ecm])
[161721.461032] [<bf935ad0>] (ecm_nss_ipv4_init [ecm]) from [<bf936804>] (ecm_nss_ipv4_post_routing_hook+0xe4/0x118 [ecm])
[161721.471660] [<bf936804>] (ecm_nss_ipv4_post_routing_hook [ecm]) from [<c089c41c>] (nf_hook_slow+0x48/0xd8)
[161721.482272] [<c089c41c>] (nf_hook_slow) from [<c08ad658>] (ip_output+0x138/0x170)
[161721.491988] [<c08ad658>] (ip_output) from [<c09b46f0>] (ip_sabotage_in+0x60/0x70)
[161721.499625] [<c09b46f0>] (ip_sabotage_in) from [<c089c41c>] (nf_hook_slow+0x48/0xd8)
[161721.507178] [<c089c41c>] (nf_hook_slow) from [<c08a7280>] (ip_rcv+0x68/0xe0)
[161721.515079] [<c08a7280>] (ip_rcv) from [<c0824ccc>] (__netif_receive_skb_one_core+0x48/0x58)
[161721.522196] [<c0824ccc>] (__netif_receive_skb_one_core) from [<c0824dc0>] (netif_receive_skb+0x58/0xf8)
[161721.530707] [<c0824dc0>] (netif_receive_skb) from [<c09989c0>] (br_handle_frame_finish+0x1a4/0x468)
[161721.540252] [<c09989c0>] (br_handle_frame_finish) from [<c09b4bbc>] (br_nf_pre_routing_finish_bridge+0x19c/0x1b4)
[161721.549366] [<c09b4bbc>] (br_nf_pre_routing_finish_bridge) from [<c09b5800>] (br_nf_hook_thresh+0xc4/0x110)
[161721.559522] [<c09b5800>] (br_nf_hook_thresh) from [<c09b62b0>] (br_nf_pre_routing_finish+0x158/0x3c8)
[161721.569504] [<c09b62b0>] (br_nf_pre_routing_finish) from [<c09b69b8>] (br_nf_pre_routing+0x498/0x4b8)
[161721.578620] [<c09b69b8>] (br_nf_pre_routing) from [<c0998e04>] (br_handle_frame+0x180/0x444)
[161721.587907] [<c0998e04>] (br_handle_frame) from [<c0822558>] (__netif_receive_skb_core.constprop.0+0x268/0xe90)
[161721.596503] [<c0822558>] (__netif_receive_skb_core.constprop.0) from [<c082325c>] (__netif_receive_skb_list_core+0xdc/0x1dc)
[161721.606747] [<c082325c>] (__netif_receive_skb_list_core) from [<c0823530>] (netif_receive_skb_list_internal+0x1d4/0x2d4)
[161721.617944] [<c0823530>] (netif_receive_skb_list_internal) from [<c08244bc>] (napi_complete_done+0x7c/0x1e0)
[161721.628988] [<c08244bc>] (napi_complete_done) from [<bf371e7c>] (nss_core_handle_napi+0x1f0/0x238 [qca_nss_drv])
[161721.638840] [<bf371e7c>] (nss_core_handle_napi [qca_nss_drv]) from [<c0824654>] (__napi_poll+0x34/0x150)
[161721.649020] [<c0824654>] (__napi_poll) from [<c0824970>] (net_rx_action+0xdc/0x270)
[161721.658565] [<c0824970>] (net_rx_action) from [<c03012f8>] (__do_softirq+0x110/0x2b8)
[161721.666378] [<c03012f8>] (__do_softirq) from [<c0322a18>] (irq_exit+0xb8/0x118)
[161721.674103] [<c0322a18>] (irq_exit) from [<c0371160>] (__handle_domain_irq+0x94/0xf4)
[161721.681655] [<c0371160>] (__handle_domain_irq) from [<c0649740>] (gic_handle_irq+0x90/0xb8)
[161721.689381] [<c0649740>] (gic_handle_irq) from [<c0300b0c>] (__irq_svc+0x6c/0x90)
[161721.697971] Exception stack(0xc0d01ee0 to 0xc0d01f28)
[161721.705356] 1ee0: 00000000 00009310 1cd49000 dd990d00 00000000 b849b5e0 c1c68840 00000000
[161721.710481] 1f00: dd98ffb0 00009310 00000000 00009310 f9751e00 c0d01f30 c07b6c84 c07b6ca4
[161721.718720] 1f20: 60000013 ffffffff
[161721.726969] [<c0300b0c>] (__irq_svc) from [<c07b6ca4>] (cpuidle_enter_state+0x180/0x380)
[161721.730702] [<c07b6ca4>] (cpuidle_enter_state) from [<c07b6ef4>] (cpuidle_enter+0x3c/0x5c)
[161721.738775] [<c07b6ef4>] (cpuidle_enter) from [<c034dfb0>] (do_idle+0x208/0x2a4)
[161721.746932] [<c034dfb0>] (do_idle) from [<c034e308>] (cpu_startup_entry+0x1c/0x20)
[161721.754572] [<c034e308>] (cpu_startup_entry) from [<c0c00eb0>] (start_kernel+0x53c/0x54c)
[161721.762040] Sending NMI from CPU 0 to CPUs 1:
[161721.771147] NMI backtrace for cpu 1
[161721.771149] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.10.100 #0
[161721.771151] Hardware name: Generic DT based system
[161721.771153] PC is at _raw_spin_lock_bh+0x44/0x58
[161721.771155] LR is at ecm_nss_ported_ipv4_process+0xd60/0x1080 [ecm]
[161721.771157] pc : [<c09cb498>]    lr : [<bf93a1dc>]    psr: 80000113
[161721.771159] sp : c146d690  ip : 000001da  fp : c884f700
[161721.771160] r10: c24ed000  r9 : c24ed000  r8 : c146d764
[161721.771163] r7 : c68db800  r6 : 00000001  r5 : 00000006  r4 : c68db800
[161721.771164] r3 : 0000bc08  r2 : 0000bc09  r1 : 00000000  r0 : bf95a100
[161721.771167] Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[161721.771168] Control: 10c5787d  Table: 493b006a  DAC: 00000051
[161721.771170] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.10.100 #0
[161721.771172] Hardware name: Generic DT based system
[161721.771174] [<c030e32c>] (unwind_backtrace) from [<c030a1ac>] (show_stack+0x14/0x20)
[161721.771176] [<c030a1ac>] (show_stack) from [<c062f3e8>] (dump_stack+0x94/0xa8)
[161721.771178] [<c062f3e8>] (dump_stack) from [<c0637978>] (nmi_cpu_backtrace+0xc4/0x108)
[161721.771180] [<c0637978>] (nmi_cpu_backtrace) from [<c030cf84>] (do_handle_IPI+0x74/0x184)
[161721.771183] [<c030cf84>] (do_handle_IPI) from [<c030d0b0>] (ipi_handler+0x1c/0x2c)
[161721.771185] [<c030d0b0>] (ipi_handler) from [<c037115c>] (__handle_domain_irq+0x90/0xf4)
[161721.771187] [<c037115c>] (__handle_domain_irq) from [<c0649740>] (gic_handle_irq+0x90/0xb8)
[161721.771189] [<c0649740>] (gic_handle_irq) from [<c0300b0c>] (__irq_svc+0x6c/0x90)
[161721.771191] Exception stack(0xc146d640 to 0xc146d688)
[161721.771193] d640: bf95a100 00000000 0000bc09 0000bc08 c68db800 00000006 00000001 c68db800
[161721.771195] d660: c146d764 c24ed000 c24ed000 c884f700 000001da c146d690 bf93a1dc c09cb498
[161721.771197] d680: 80000113 ffffffff
[161721.771199] [<c0300b0c>] (__irq_svc) from [<c09cb498>] (_raw_spin_lock_bh+0x44/0x58)
[161721.771201] [<c09cb498>] (_raw_spin_lock_bh) from [<bf93a1dc>] (ecm_nss_ported_ipv4_process+0xd60/0x1080 [ecm])
[161721.771203] [<bf93a1dc>] (ecm_nss_ported_ipv4_process [ecm]) from [<bf935ad0>] (ecm_nss_ipv4_init+0x6cc/0x1240 [ecm])
[161721.771206] [<bf935ad0>] (ecm_nss_ipv4_init [ecm]) from [<bf936804>] (ecm_nss_ipv4_post_routing_hook+0xe4/0x118 [ecm])
[161721.771208] [<bf936804>] (ecm_nss_ipv4_post_routing_hook [ecm]) from [<c089c41c>] (nf_hook_slow+0x48/0xd8)
[161721.771210] [<c089c41c>] (nf_hook_slow) from [<c08ad658>] (ip_output+0x138/0x170)
[161721.771212] [<c08ad658>] (ip_output) from [<c09b46f0>] (ip_sabotage_in+0x60/0x70)
[161721.771214] [<c09b46f0>] (ip_sabotage_in) from [<c089c41c>] (nf_hook_slow+0x48/0xd8)
[161721.771216] [<c089c41c>] (nf_hook_slow) from [<c08a7280>] (ip_rcv+0x68/0xe0)
[161721.771219] [<c08a7280>] (ip_rcv) from [<c0824ccc>] (__netif_receive_skb_one_core+0x48/0x58)
[161721.771221] [<c0824ccc>] (__netif_receive_skb_one_core) from [<c0824dc0>] (netif_receive_skb+0x58/0xf8)
[161721.771223] [<c0824dc0>] (netif_receive_skb) from [<c09989c0>] (br_handle_frame_finish+0x1a4/0x468)
[161721.771225] [<c09989c0>] (br_handle_frame_finish) from [<c09b4bbc>] (br_nf_pre_routing_finish_bridge+0x19c/0x1b4)
[161721.771228] [<c09b4bbc>] (br_nf_pre_routing_finish_bridge) from [<c09b5800>] (br_nf_hook_thresh+0xc4/0x110)
[161721.771230] [<c09b5800>] (br_nf_hook_thresh) from [<c09b62b0>] (br_nf_pre_routing_finish+0x158/0x3c8)
[161721.771232] [<c09b62b0>] (br_nf_pre_routing_finish) from [<c09b69b8>] (br_nf_pre_routing+0x498/0x4b8)
[161721.771234] [<c09b69b8>] (br_nf_pre_routing) from [<c0998e04>] (br_handle_frame+0x180/0x444)
[161721.771236] [<c0998e04>] (br_handle_frame) from [<c0822558>] (__netif_receive_skb_core.constprop.0+0x268/0xe90)
[161721.771239] [<c0822558>] (__netif_receive_skb_core.constprop.0) from [<c082325c>] (__netif_receive_skb_list_core+0xdc/0x1dc)
[161721.771241] [<c082325c>] (__netif_receive_skb_list_core) from [<c0823530>] (netif_receive_skb_list_internal+0x1d4/0x2d4)
[161721.771243] [<c0823530>] (netif_receive_skb_list_internal) from [<c08244bc>] (napi_complete_done+0x7c/0x1e0)
[161721.771246] [<c08244bc>] (napi_complete_done) from [<bf371e7c>] (nss_core_handle_napi+0x1f0/0x238 [qca_nss_drv])
[161721.771248] [<bf371e7c>] (nss_core_handle_napi [qca_nss_drv]) from [<c0824654>] (__napi_poll+0x34/0x150)
[161721.771250] [<c0824654>] (__napi_poll) from [<c0824970>] (net_rx_action+0xdc/0x270)
[161721.771252] [<c0824970>] (net_rx_action) from [<c03012f8>] (__do_softirq+0x110/0x2b8)
[161721.771254] [<c03012f8>] (__do_softirq) from [<c0322a18>] (irq_exit+0xb8/0x118)
[161721.771256] [<c0322a18>] (irq_exit) from [<c0371160>] (__handle_domain_irq+0x94/0xf4)
[161721.771258] [<c0371160>] (__handle_domain_irq) from [<c0649740>] (gic_handle_irq+0x90/0xb8)
[161721.771260] [<c0649740>] (gic_handle_irq) from [<c0300b0c>] (__irq_svc+0x6c/0x90)
[161721.771262] Exception stack(0xc146df18 to 0xc146df60)
[161721.771264] df00:                                                       00000000 00009310
[161721.771266] df20: 1cd58000 dd99fd00 00000000 b844eb00 c1c69040 00000000 dd99efb0 00009310
[161721.771268] df40: 00000000 00009310 f9751e00 c146df68 c07b6c84 c07b6ca4 60000013 ffffffff
[161721.771271] [<c0300b0c>] (__irq_svc) from [<c07b6ca4>] (cpuidle_enter_state+0x180/0x380)
[161721.771273] [<c07b6ca4>] (cpuidle_enter_state) from [<c07b6ef4>] (cpuidle_enter+0x3c/0x5c)
[161721.771275] [<c07b6ef4>] (cpuidle_enter) from [<c034dfb0>] (do_idle+0x208/0x2a4)
[161721.771277] [<c034dfb0>] (do_idle) from [<c034e308>] (cpu_startup_entry+0x1c/0x20)
[161721.771279] [<c034e308>] (cpu_startup_entry) from [<4230152c>] (0x4230152c)

I have not tried promisc mode though, but when I run tcpdump on br-lan it goes into promisc mode. Granted, I do not run tcpdump monitoring over a prolonged period.

Did you try the 21.02 builds as well? Not sure if this is peculiar to master?

No, I never tried the stable NSS build.
But promiscuous mode makes port forwards work like they do in a non-NSS build; that's why some need it on.
(Promiscuous mode is not needed in a normal master build.)

I can't have the router crashing like this as we still work from home here. I have to go back to the normal master build for now

Thanks for your help so far


I have an IPQ601x chipset which has NSS support, and have loaded QSDK on my access point.

I have configured it in bridge mode. I push the packets from L2 to L3 using net.bridge.bridge-nf-call-iptables=1.

This is my bridge setup

br-vlan80               7fff.587be915a963       no              ath242113
                                                        ath512113
                                                        eth0.80

where ath242113 and ath512113 are the wireless interfaces and eth0.80 is the tagged Ethernet interface.

I am trying to implement a captive portal redirect using the following rules:

iptables -A PREROUTING -m physdev --physdev-in ath242113 -j prt_captive_2113
iptables -A PREROUTING -m physdev --physdev-in ath512113 -j prt_captive_2113
iptables -A prt_captive_2113 -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 8080
iptables -A prt_captive_2113 -p tcp -m tcp --dport 443 -j REDIRECT --to-ports 8443

I also tried enabling net.bridge.bridge-nf-filter-vlan-tagged=1

but the redirection doesn't work.

Oh wow, I have this problem where wifi on 5G stalls for a few minutes when a device (a Pixel phone) leaves the house. It doesn't always happen; I've tried to monitor for it. It may have to do with what other devices are doing, the amount of traffic, etc. I notice it if I'm on Zoom on my laptop: the Pixel leaves the house and my Zoom freezes. It all started for me with the upgrade to the official 21.02.2 build; it was rock solid before. I suspect it's some of the hostapd or mac80211 backports that were applied between .1 and .2.
I use the mainline driver + firmware because CT (ath10k-ct) doesn't work well at all for some of my clients.
I'll try disabling aql_enable and report back.

This is what I see in the log when this happens:

Wed Mar  9 09:45:50 2022 daemon.notice hostapd: wlan0: AP-STA-DISCONNECTED xx:xx:xx:xx:xx:xx
Wed Mar  9 09:45:50 2022 daemon.info hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due to inactivity
Wed Mar  9 09:45:51 2022 daemon.info hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE)
Wed Mar  9 09:45:51 2022 kern.warn kernel: [150007.197590] ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 16 tid 0
Wed Mar  9 09:45:51 2022 kern.warn kernel: [150007.197631] ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 16 tid 0
Wed Mar  9 09:45:51 2022 kern.warn kernel: [150007.203782] ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 16 tid 0
Wed Mar  9 09:45:51 2022 kern.warn kernel: [150007.211061] ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 16 tid 0
Wed Mar  9 09:45:51 2022 kern.warn kernel: [150007.218685] ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 16 tid 7
Wed Mar  9 09:45:51 2022 kern.warn kernel: [150007.225663] ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 16 tid 7
Wed Mar  9 09:45:51 2022 kern.warn kernel: [150007.232951] ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 16 tid 7
Wed Mar  9 09:45:51 2022 kern.warn kernel: [150007.240228] ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 16 tid 7

But this gets printed 5 minutes after the device leaves, which makes sense given the default max_inactivity=300 hostapd setting.

I notice the same thing... with a Zoom call... something is wrong with AQL... as soon as the Zoom call ended, everything started to work normally...

Hmm, what happened between 21.02.1 and 21.02.2? It was stable for me before for a long time.

Anyone know what is up with that kern.warn? My dmesg is full of it, but that's always been the case, not new.

hostapd / backports bump? (to fix some vulnerabilities?)

From my trial and error, it looks like the issue was introduced with this commit, so I did a build up to this commit and reverted the commit I suspected. So far my R7800 (with 19+ days of uptime) does not exhibit the high latency or Wi-Fi stalls that I would almost always encounter without the revert.

Wed Mar  9 09:45:51 2022 kern.warn kernel: [150007.197590] ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 16 tid 0

@wired I still see this in my R7800 logs, but it doesn't seem to affect Wi-Fi performance. It's most likely that when hostapd disconnected an inactive client, it somehow did not notify ath10k, or that ath10k did not remove the client from its database (maybe the txq had not timed out due to inactivity?).


Hmm, I don't have those at all...

It doesn’t happen frequently for me, but I see it on and off.

They happen when phones or other low-power devices go into power-saving mode or leave the house, but it doesn't seem to affect anything, just noise in the logs (I hope).

As for the wifi stalls, the aql_enable workaround above seems to help so far.

I filed a GitHub issue for more visibility and better tracking:
