Ipq806x NSS build (Netgear R7800 / TP-Link C2600 / Linksys EA8500)

My very unscientific impression is my network feels snappier with this change.

Would you be so kind as to run some flent tests and share them with us?

Maybe, I've never flent-ed before, can you point me to a HowTo?

@dtaht not trying to beat a dead horse. I'm not sure whether tweaking NAPI is something we should do to reduce overall latency for WiFi. I've been reading up on the NAPI flow and it seems that reducing the weight/budget may cause more CPU overhead.

What I understand from the NAPI architecture is as follows:

Each NAPI polling pass is allocated a budget (default 300, adjustable via net.core.netdev_budget), which is shared across the NAPI-enabled network interfaces being serviced.

When a frame arrives on a NAPI-enabled network interface, the hardware DMAs the frame to CPU memory and raises an interrupt. The driver services the interrupt by disabling further interrupts and scheduling the interface's napi_poll() function, which will have been registered with the default NAPI_POLL_WEIGHT value of 64.

The driver's napi_poll() will retrieve as many frames from CPU memory as it can (the assumption being that more frames will have arrived since the interrupt occurred), up to the registered weight value, in this case 64.

The number of frames received (i.e. the work done) by napi_poll() is then subtracted from the budget (300). If the work done equals the entire weight (64 in this case), the interface's NAPI structure is moved to the end of the NAPI poll list to be called again, as long as the remaining budget is still positive and the time limit (2 jiffies) has not elapsed since polling began.

When the work done is less than the weight, or the budget goes negative, or the 2-jiffy limit expires, the interface's NAPI will be disabled and its interrupt re-enabled to wait for the next frame.

So if the weight is reduced to 8, then when an interrupt occurs only 8 frames will be processed (assuming the client is bursting frames), and napi_poll() will have to wait for its turn again to process another 8, and so on, until the entire budget is exhausted or there are no more frames to process. This seems like it would cause received frames to be processed more slowly, but probably makes it fairer to other NAPI-enabled interfaces.

Maybe we should try increasing the budget from its default of 300 instead?

On the other hand, if the CPU is fast enough to clear the received packets before the budget is used, it's probably a moot point to adjust either value.
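
The weight/budget accounting described above can be sketched as a userspace toy model (not kernel code; the struct and function names here are made up for illustration):

```c
/* Toy model of the NAPI accounting described above -- not kernel code.
 * Each napi_poll() round pulls at most `weight` frames (default
 * NAPI_POLL_WEIGHT = 64); the pass stops when the shared budget
 * (default net.core.netdev_budget = 300) runs out or the ring drains. */
#include <assert.h>

struct napi_sim {
    int queued; /* frames waiting in the rx ring */
    int polls;  /* how many napi_poll() rounds it took */
};

/* Drain one interface; returns frames left unprocessed when the
 * budget is exhausted (0 if everything was handled). */
static int napi_drain(struct napi_sim *n, int weight, int budget)
{
    while (n->queued > 0 && budget > 0) {
        int work = n->queued < weight ? n->queued : weight;

        n->queued -= work;
        budget    -= work;
        n->polls++;
        if (work < weight) /* ring drained: interrupts re-enabled */
            break;
    }
    return n->queued;
}
```

With the default weight of 64, a 128-frame burst drains in 2 polls; with a weight of 8 the same burst takes 16 polls, i.e. the interface yields its turn far more often, which is exactly the fairness-versus-overhead trade-off above.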


Hello, @qosmio @Ansuel. I also built an image from qosmio's repo branch 5.15-qsdk11-new-krait-cc; sometimes the router crashes with this error in pstore:

<1>[11257.107439] 8<--- cut here ---
<1>[11257.107477] Unable to handle kernel NULL pointer dereference at virtual address 00000024
<1>[11257.109400] pgd = c857ac36
<1>[11257.117639] [00000024] *pgd=00000000
<0>[11257.120162] Internal error: Oops: 5 [#1] SMP ARM
<4>[11257.123893] Modules linked in: ecm xt_connlimit pppoe ppp_async nf_conncount iptable_nat ath10k_pci ath10k_core ath xt_state xt_nat xt_helper xt_conntrack xt_connmark xt_connbytes xt_REDIRECT xt_MASQUERADE xt_CT wireguard pptp pppox ppp_generic nft_redir nft_nat nft_masq nft_flow_offload nft_fib_inet nft_ct nft_chain_nat nf_nat nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet nf_flow_table nf_conntrack_netlink nf_conntrack mac80211 libchacha20poly1305 ipt_REJECT curve25519_neon cfg80211 xt_time xt_tcpudp xt_tcpmss xt_statistic xt_recent xt_physdev xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_ecn xt_dscp xt_comment xt_TCPMSS xt_LOG xt_HL xt_DSCP xt_CLASSIFY slhc sch_cake poly1305_arm nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_quota nft_objref nft_numgen nft_log nft_limit nft_hash nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_counter nft_compat nf_tables nf_reject_ipv4 nf_log_syslog nf_defrag_ipv6 nf_defrag_ipv4 libcurve25519_generic libcrc32c iptable_raw
<4>[11257.124625]  iptable_mangle iptable_filter ipt_ECN ip_tables crc_ccitt compat chacha_neon sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_tcindex cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred act_gact ledtrig_usbport ledtrig_gpio ledtrig_activity xt_set ip_set_list_set ip_set_hash_netportnet ip_set_hash_netport ip_set_hash_netnet ip_set_hash_netiface ip_set_hash_net ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set_hash_ipmark ip_set_hash_ipmac ip_set_hash_ip ip_set_bitmap_port ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set nfnetlink ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 qca_mcs ip_gre gre ifb ip6_udp_tunnel udp_tunnel sit tunnel4 ip_tunnel tun cifs oid_registry cifs_md4 cifs_arc4 asn1_decoder dns_resolver nls_utf8 nls_iso8859_1 nls_cp437 seqiv md5 kpp ecb des_generic libdes cmac usb_storage leds_gpio xhci_plat_hcd xhci_pci xhci_hcd dwc3 dwc3_qcom ohci_platform ohci_hcd
<4>[11257.193875]  phy_qcom_ipq806x_usb ahci fsl_mph_dr_of ehci_platform ehci_fsl sd_mod ahci_platform libahci_platform libahci libata scsi_mod scsi_common ehci_hcd qca_nss_drv qca_nss_gmac ramoops reed_solomon pstore gpio_button_hotplug vfat fat f2fs ext4 mbcache jbd2 crc32c_generic crc32_generic
<4>[11257.305393] CPU: 1 PID: 7939 Comm: kworker/1:0 Not tainted 5.15.69 #0
<4>[11257.327627] Hardware name: Generic DT based system
<4>[11257.333962] Workqueue: events dbs_work_handler
<4>[11257.338647] PC is at __timer_const_udelay+0xc/0x24
<4>[11257.343074] LR is at krait_mux_set_parent+0xd4/0x120
<4>[11257.347850] pc : [<c06b7120>]    lr : [<c072c868>]    psr: 60000093
<4>[11257.352973] sp : c64f1d10  ip : 00000000  fp : c1e22980
<4>[11257.358961] r10: c1cce818  r9 : 00000000  r8 : c64f1d5c
<4>[11257.364170] r7 : 20000013  r6 : 00000001  r5 : 00000001  r4 : c1cd9360
<4>[11257.369379] r3 : 00000018  r2 : c0ee533c  r1 : 60000093  r0 : 000346dc
<4>[11257.375977] Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
<4>[11257.382490] Control: 10c5787d  Table: 47d0806a  DAC: 00000051
<1>[11257.389690] Register r0 information: non-paged memory
<1>[11257.395506] Register r1 information: non-paged memory
<1>[11257.400541] Register r2 information: non-slab/vmalloc memory
<1>[11257.405575] Register r3 information: non-paged memory
<1>[11257.411304] Register r4 information: slab kmalloc-128 start c1cd9300 pointer offset 96 size 128
<1>[11257.416258] Register r5 information: non-paged memory
<1>[11257.424758] Register r6 information: non-paged memory
<1>[11257.429968] Register r7 information: non-paged memory
<1>[11257.435001] Register r8 information: non-slab/vmalloc memory
<1>[11257.440038] Register r9 information: NULL pointer
<1>[11257.445767] Register r10 information: slab kmalloc-256 start c1cce800 pointer offset 24 size 256
<1>[11257.450373] Register r11 information: slab kmalloc-128 start c1e22980 pointer offset 0 size 128
<1>[11257.459227] Register r12 information: NULL pointer
<0>[11257.467641] Process kworker/1:0 (pid: 7939, stack limit = 0xdd30094a)
<0>[11257.472505] Stack: (0xc64f1d10 to 0xc64f2000)
<0>[11257.479015] 1d00:                                     c1cd936c c16ccc00 ffffffff 00000002
<0>[11257.483366] 1d20: c64f1d5c c072e054 00000000 c16ccc00 ffffffff c034b120 c1cce800 c16ccc00
<0>[11257.491525] 1d40: c0eac650 00000002 c1680e40 2faf0800 23c34600 c07191bc c161b200 c1cd85c0
<0>[11257.499684] 1d60: 2faf0800 23c34600 c161b200 c16ccc00 00000000 23c34600 c0ed8460 c071e68c
<0>[11257.507844] 1d80: c64f0000 2faf0800 23c34600 c1680e40 c16ccae8 23c34600 c0ed8460 c1437f00
<0>[11257.516004] 1da0: 2faf0800 23c34600 c1e22980 c071e6d0 c64f0000 c1680e40 00000001 c16ccc00
<0>[11257.524162] 1dc0: 00000000 23c34600 c1680e40 dd995010 23c34600 00000000 c1e22980 c071ea40
<0>[11257.532323] 1de0: 00000000 23c34600 00000000 ffffffff 23c34600 c0eaf040 00000002 c2055000
<0>[11257.540482] 1e00: 23c34600 00000001 c1e22b80 dd995010 23c34600 00000000 c1e22980 c071f3c8
<0>[11257.548643] 1e20: c1d21a00 c1e22b00 00000001 c1e22b80 dd995010 23c34600 00000000 c0873198
<0>[11257.556803] 1e40: 23c34600 c071f700 c2055000 00000000 c16ccc00 c071f700 00000a3d c1e22b80
<0>[11257.564960] 1e60: 00000000 ffffffff 23c34600 c1d21a00 dd995010 00000006 23c34600 c1e22b00
<0>[11257.573121] 1e80: 00000001 00000000 000927c0 c08735a0 c0ec0810 c034b140 c161a600 23c34600
<0>[11257.581280] 1ea0: c161a600 00000000 000c3500 c0ee24b0 c0f1f0a4 c0878764 dd99ccc0 00000001
<0>[11257.589441] 1ec0: c0d661c8 00000000 c161a600 000c3500 000927c0 00000024 dd99c5c0 c161a600
<0>[11257.597602] 1ee0: c1e22f00 c1e22e80 c1e22f00 c2055180 c1e22e80 dd99f405 c558db40 c087c28c
<0>[11257.605760] 1f00: c1e22f38 00000000 c1e22f04 c0ec0af0 00000000 c0ed62e0 dd99f405 c087d010
<0>[11257.613921] 1f20: c1e22f38 c558db00 dd99c1c0 dd99f400 00000000 c034106c c64f0000 dd99c1c0
<0>[11257.622079] 1f40: 00000008 c558db00 c558db18 dd99c1c0 00000008 dd99c1d8 c0e03d00 dd99c380
<0>[11257.630240] 1f60: c64f0000 c0341414 caabded4 ce216dc0 ccc3fdc0 c03413b0 c558db00 c64f0000
<0>[11257.638398] 1f80: caabded4 ccc3fde0 00000000 c03496fc ce216dc0 c03495ac 00000000 00000000
<0>[11257.646557] 1fa0: 00000000 00000000 00000000 c0300130 00000000 00000000 00000000 00000000
<0>[11257.654717] 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<0>[11257.662878] 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
<0>[11257.671037] [<c06b7120>] (__timer_const_udelay) from [<c072c868>] (krait_mux_set_parent+0xd4/0x120)
<0>[11257.679197] [<c072c868>] (krait_mux_set_parent) from [<c072e054>] (krait_notifier_cb+0x5c/0xd0)
<0>[11257.688048] [<c072e054>] (krait_notifier_cb) from [<c034b120>] (srcu_notifier_call_chain+0x84/0xfc)
<0>[11257.696730] [<c034b120>] (srcu_notifier_call_chain) from [<c07191bc>] (__clk_notify+0x74/0x94)
<0>[11257.705757] [<c07191bc>] (__clk_notify) from [<c071e68c>] (clk_change_rate+0x184/0x478)
<0>[11257.714435] [<c071e68c>] (clk_change_rate) from [<c071e6d0>] (clk_change_rate+0x1c8/0x478)
<0>[11257.722334] [<c071e6d0>] (clk_change_rate) from [<c071ea40>] (clk_core_set_rate_nolock+0xc0/0x23c)
<0>[11257.730670] [<c071ea40>] (clk_core_set_rate_nolock) from [<c071f3c8>] (clk_set_rate+0x48/0x180)
<0>[11257.739610] [<c071f3c8>] (clk_set_rate) from [<c0873198>] (_set_opp+0x28c/0x5ac)
<0>[11257.748204] [<c0873198>] (_set_opp) from [<c08735a0>] (dev_pm_opp_set_rate+0xe8/0x214)
<0>[11257.755841] [<c08735a0>] (dev_pm_opp_set_rate) from [<c0878764>] (__cpufreq_driver_target+0xfc/0x308)
<0>[11257.763570] [<c0878764>] (__cpufreq_driver_target) from [<c087c28c>] (od_dbs_update+0xd0/0x1a4)
<0>[11257.772858] [<c087c28c>] (od_dbs_update) from [<c087d010>] (dbs_work_handler+0x40/0x7c)
<0>[11257.781362] [<c087d010>] (dbs_work_handler) from [<c034106c>] (process_one_work+0x244/0x588)
<0>[11257.789350] [<c034106c>] (process_one_work) from [<c0341414>] (worker_thread+0x64/0x5a8)
<0>[11257.798029] [<c0341414>] (worker_thread) from [<c03496fc>] (kthread+0x150/0x174)
<0>[11257.806102] [<c03496fc>] (kthread) from [<c0300130>] (ret_from_fork+0x14/0x24)
<0>[11257.813477] Exception stack(0xc64f1fb0 to 0xc64f1ff8)
<0>[11257.820510] 1fa0:                                     00000000 00000000 00000000 00000000
<0>[11257.825637] 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<0>[11257.833797] 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000
<0>[11257.841957] Code: c0f1b8b8 e52de004 e8bd4000 e59f3010 (e593300c) 
<4>[11257.848377] ---[ end trace 0dd45bf147c47bbf ]---
<0>[11257.876026] Kernel panic - not syncing: Fatal exception
<2>[11257.876066] CPU0: stopping
<4>[11257.880057] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G      D           5.15.69 #0
<4>[11257.882839] Hardware name: Generic DT based system
<4>[11257.890392] [<c03118f4>] (unwind_backtrace) from [<c030c3c4>] (show_stack+0x1c/0x28)
<4>[11257.894994] [<c030c3c4>] (show_stack) from [<c06b9b08>] (dump_stack_lvl+0x40/0x4c)
<4>[11257.902892] [<c06b9b08>] (dump_stack_lvl) from [<c030fc88>] (do_handle_IPI+0x2b4/0x32c)
<4>[11257.910272] [<c030fc88>] (do_handle_IPI) from [<c030fd20>] (ipi_handler+0x20/0x34)
<4>[11257.918169] [<c030fd20>] (ipi_handler) from [<c0387d04>] (handle_percpu_devid_irq+0x98/0x220)
<4>[11257.925811] [<c0387d04>] (handle_percpu_devid_irq) from [<c0380fb4>] (handle_domain_irq+0x6c/0xa0)
<4>[11257.934404] [<c0380fb4>] (handle_domain_irq) from [<c06d3f64>] (gic_handle_irq+0x88/0xbc)
<4>[11257.943259] [<c06d3f64>] (gic_handle_irq) from [<c0300b7c>] (__irq_svc+0x5c/0x78)
<4>[11257.951502] Exception stack(0xc0e01ee0 to 0xc0e01f28)
<4>[11257.958971] 1ee0: 00000000 00000a3d 1cc1f000 dd98c5c0 00000000 2dab0a80 c1d40840 00000000
<4>[11257.964009] 1f00: dd98b830 00000a3d 00000000 00000a3d ff3bfb60 c0e01f30 c087f7e0 c087f800
<4>[11257.972164] 1f20: 60000013 ffffffff
<4>[11257.980317] [<c0300b7c>] (__irq_svc) from [<c087f800>] (cpuidle_enter_state+0x1ac/0x434)
<4>[11257.983623] [<c087f800>] (cpuidle_enter_state) from [<c087fae8>] (cpuidle_enter+0x44/0x64)
<4>[11257.991956] [<c087fae8>] (cpuidle_enter) from [<c035d2e8>] (do_idle+0x1f0/0x284)
<4>[11258.000028] [<c035d2e8>] (do_idle) from [<c035d690>] (cpu_startup_entry+0x24/0x28)
<4>[11258.007580] [<c035d690>] (cpu_startup_entry) from [<c0d011dc>] (start_kernel+0x62c/0x720)

Firmware versions: OpenWrt SNAPSHOT r20871-690e238f5a / LuCI Master git-22.260.19132-34dd31a


There are several different factors in play here. This is an attempt to experiment, tweaking this particular knob to see if it moves the needle in some direction.

  1. I do not care about CPU, I care about network latency. :slight_smile: If you run out of CPU and drop some packets, it will fix itself.

  2. Many (not all) drivers process the NAPI_POLL_WEIGHT rx frames first and then process the write frames. The huge disparity between read and write performance we were encountering on multiple rrul tests elsewhere strongly suggested to me that it was better to process writes more often.

  3. WiFi, in general, cannot interrupt at the same rate as ethernet can; there is generally a minimum of over 120us built in between txops, which can be as great as 5.6ms. The ath10k driver in particular had no NAPI when I was working on it (6+ years ago). I regarded the addition as a "cargo cult" exercise that needed the kinds of benchmarks we've been running. In all honesty I don't know how many wifi drivers interrupt per packet as a baseline, or per txop.

Otherwise I'd rip NAPI out entirely from the ath10k at least. A better driver design is actually one that drives rx and tx entirely independently, of course....

  4. NAPI was designed before we had multicore systems, and a change to the input path to process more stuff at a time went into Linux about 5? years ago, which looked good on x86 for the limited benchmarks that were run. x86 has a vastly bigger L2 cache than ARM does, and also cannot context switch anywhere near as fast as ARM can. So doing less work, more often, might be more cache friendly.

  5. The principal thing to optimize for in networking is reducing RTT. To use another example, a lot of ethernet devices attempt to coalesce interrupts into 100usec windows by default (in addition to NAPI), because they are trying to optimize for stuff running in userspace, not for forwarding packets. If you get a big gbit packet (13us)... then wait... then push it to another interface... then wait.

A typical TCP transaction takes 5 round trips to start, each of which is essentially a single tiny packet, so in this simplified example it would be (at a gbit) 1us times 5 (they are tiny packets) + overhead, vs say 100us times 5 + less overhead. Two useful simple tests for this would be iperf's new bounceback test as well as netperf's TCP_RR test, running while the device was otherwise idle.

Given that most of the time a system is not heavily loaded, in fact loafing, optimizing for the forwarding latency more thoroughly when it is very lightly loaded has always seemed like a win, on a router.

  6. Take this example - say you are pulling 64 packets off the ethernet port and dumping them into the wifi. The most you can stuff into an ac txop is 96 packets, and wireless-n is typically 32, and these numbers are usually much less at distance, call it half that... so you end up with several txops used up per read in this case. If, on the other hand, you stuff 8 at a time into the wifi driver, which may or may not be backed up, you get lower latency and a smaller txop...

The ideal fluid model is 1 packet in = 1 packet out, and switches, rather than routers, do even better than that by cutting through as soon as the header arrives. I'm puzzled as to why more TCAMs haven't shown up yet in mid-range routers....


What CPU settings do you use? Fixed frequency, performance, ondemand or maybe other?

No irqbalance, default image settings, scaling_governor=ondemand, and IRQs pinned to specific cores.

@neonman63 : your crash was definitely caused by the CPU frequency scaling problem. Please don't have high expectations that it has actually been fixed in 5.15. @sppmaster has experienced the same problem running the Qosmio 5.15-based image.

Luckily for me, I have not bumped into such CPU frequency scaling related crash lately (with either 22.03|master|Qosmio 5.15). The other mysterious crashes (e.g. RCU stalling) I encountered were mostly gone by simply blasting irqbalance into oblivion. I've always used the ondemand governor.


TCAM is very expensive, power hungry, and consumes a lot of silicon die area, especially at the low-end 28 nm (or larger) manufacturing processes normally used to make chips for low-end or mid-range routers/switches. This also translates to higher cost for bulkier heatsinks, power circuitry, etc.

Consumer market = as much cost-cutting as possible.

Definitely not isolated to Lenovo. I have 3 Samsung devices, and for two of them those messages are being generated:
note 20 ultra - android 12
s10 - android 12

The third device is on Android 11 (with a possible update to 12), and for it those messages are not generated.
I've tested "almost everything" to get rid of them, as in addition they cause connection issues between the Home Assistant mobile app and the local server.
Yesterday as a test I flashed "OpenWrt 21.02 (Stable) + NSS Hardware Offloading Download" and restored the backup; it's been more than 12 hours since that point and I've not seen "br-lan: received packet on wlan0 with own address as source address"

I've thought about doing clean install today but since I'm not the one affected I don't know if there is any point in doing so.

Sounds like this should be elevated as a new 22.03 bug to be evaluated more ...

... I'll poke at this some, starting with the error msg itself and where it's generated.

@ACwifidude , any thoughts?

This is not a bug. They are simply messages from the kernel bridge driver indicating that the internal bridge has learned a new source MAC address from a frame arriving on one of its bridge ports, either Ethernet LAN ports or WiFi. The default aging timer is 300 seconds. If a learned MAC address is not seen for 300 seconds, it is removed from the MAC/port mapping table and has to be re-learned (producing a new log message) when a frame with that source MAC address comes in later.
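
The 300-second aging rule can be illustrated with a small sketch (field names loosely mirror net/bridge/br_fdb.c, but this is a toy model, not kernel code):

```c
/* A learned entry expires once it has gone `ageing` seconds without
 * being refreshed by a frame from that source MAC. The bridge
 * default ageing time is 300 seconds. */
#include <assert.h>
#include <stdbool.h>

struct fdb_entry_sim {
    unsigned long updated; /* last time a frame refreshed this entry */
};

static bool fdb_expired(const struct fdb_entry_sim *f,
                        unsigned long now, unsigned long ageing)
{
    return now - f->updated > ageing;
}
```

Every frame from a known source MAC bumps `updated`, so a chatty client never ages out; a silent one is forgotten after 300 seconds and triggers a fresh learn (and a fresh log line) when it speaks again.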


Yep, like @vochong said.

@dtaht Made some small stability improvements first. I’ll make a test image for the R7800 hopefully this weekend for people to try out.


Is it still normal that the log is truncated because of the number of messages being generated? (especially with the messages being "kern.warn")
Peaks for me were 40+ messages being hidden.
Why are they not present on 21.02 stable?

I did a clean install of 22.03 master stable without restoring the config (re-adding everything manually) and the messages are back.

Thu Sep 29 20:33:12 2022 kern.warn kernel: [ 2838.547779] br-lan: received packet on wlan1 with own address as source address (addr:XX:XX:XX:XX:XX:XX, vlan:0)
Thu Sep 29 20:35:05 2022 kern.warn kernel: [ 2951.510415] br-lan: received packet on wlan1 with own address as source address (addr:XX:XX:XX:XX:XX:XX, vlan:0)

Fortunately just two of them for now, but I'll monitor the system log.
(Not trying to ask stupid questions; I just want to fully understand this, as the connection issues on my network happened at the same time as the router upgrade and those messages appearing.)

@ACwifidude, @vochong and others with interest in the received packet on %s with own address as source address ...

those interested in latency issues may want to check on the hardcoded timing interval for DEFINE_RATELIMIT_STATE(net_ratelimit_state, 5 * HZ, 10);

a quick look at net/bridge/br_fdb.c gives a little context for this message -- here's the start of the relevant function:

void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
		   const unsigned char *addr, u16 vid, unsigned long flags)
{
	struct net_bridge_fdb_entry *fdb;

	/* some users want to always flood. */
	if (hold_time(br) == 0)
		return;

	fdb = fdb_find_rcu(&br->fdb_hash_tbl, addr, vid);
	if (likely(fdb)) {
		/* attempt to update an entry for a local interface */
		if (unlikely(test_bit(BR_FDB_LOCAL, &fdb->flags))) {
			if (net_ratelimit())
				br_warn(br, "received packet on %s with own address as source address (addr:%pM, vlan:%u)\n",
					source->dev->name, addr, vid);
		} else {
			unsigned long now = jiffies;
			bool fdb_modified = false;

			if (now != fdb->updated) {
				fdb->updated = now;
				fdb_modified = __fdb_mark_active(fdb);
			}

... the message is only generated if the fdb is local (interface) and rate-limited ...
BUT this path doesn't seem to do any updating of anything, unless the call to net_ratelimit() is touching something behind the scenes -- but it's not passed any params.

the conditional path resulting in the warning message does no other work. I don't see this particular msg as informational; it's acting like it's complaining about getting blasted with 'local' packets (?)

and if the msg correctly reflects what's going on, -- it's calling packets with client device addrs as its own - the reported hw addr is the client device's.

it may serve the purpose of informing when a new addr is added, but net_ratelimit() returns false once too many messages have been printed in the current window, so under load the warning is simply suppressed ... whine and quit, which this does.

what's not clear to me is how this message (own address) relates to the reception of > threshold packets.
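
For context, the net_ratelimit() policy mentioned above (DEFINE_RATELIMIT_STATE(net_ratelimit_state, 5 * HZ, 10): at most 10 messages per 5-second window) can be sketched in userspace. This is a simplified toy version of the kernel's lib/ratelimit.c logic, with made-up names:

```c
#include <assert.h>
#include <stdbool.h>

struct ratelimit_sim {
    unsigned long begin; /* start of the current window, seconds */
    int interval;        /* window length: 5 (i.e. 5 * HZ jiffies) */
    int burst;           /* messages allowed per window: 10 */
    int printed;
    int missed;          /* suppressed messages, bookkeeping only */
};

/* returns true when the caller may emit its message */
static bool ratelimit_ok(struct ratelimit_sim *rs, unsigned long now)
{
    if (now - rs->begin >= (unsigned long)rs->interval) {
        rs->begin = now; /* open a new window */
        rs->printed = 0;
    }
    if (rs->printed < rs->burst) {
        rs->printed++;
        return true;
    }
    rs->missed++;
    return false;
}
```

So the state net_ratelimit() touches behind the scenes is just this global window bookkeeping (no parameters needed), and a flood of "own address" frames produces at most 10 warnings per 5 seconds, which would match the truncated-log behaviour reported above.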

... it's fdb_insert() that provides the msg: "adding interface %s with same address as a received packet", and I believe that's where @ACwifidude and @vochong were going.

the error msg we're talking about here is at line 631 in br_fdb.c ... and looks to be a cut/paste/modify that skipped the last step.

It's reporting on a known condition -- too much too fast -- rather than what the text claims, and maybe this msg should be dropped, as it's not an error nor something to be warned about. If anything, warn about the heavy traffic.

@amadeo -- it sounds like you have a legit traffic problem, but not solved or helped in your case by the normal handling of heavy traffic for the interface.

you may want to isolate what's happening on the network that's so frantic.

The other possibility is that something is clobbering the hash table / cache used by fdb_find_rcu()


br_fdb is the bridge forwarding database. Bridges reside at Layer 2, so a bridge learns the MAC addresses (as well as VLAN tag IDs) seen at its ports in order to forward (switch) frames with a specific destination MAC address to the port at which that MAC address was previously seen, instead of flooding them to all ports the way a Layer-1 device such as a hub does.

When a static FDB entry is added, the MAC address from the entry is added to the bridge's private HW address list, and all required ports are then updated with the new information.
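
The learn-and-forward behaviour described above can be shown with a toy table (illustrative only; the real bridge uses an RCU hash table in net/bridge/br_fdb.c):

```c
#include <assert.h>
#include <string.h>

#define TOY_FDB_MAX 16
#define FLOOD (-1) /* unknown destination: flood to all ports */

struct toy_fdb {
    unsigned char mac[TOY_FDB_MAX][6];
    int port[TOY_FDB_MAX];
    int n;
};

/* learn (or refresh) which port a source MAC was last seen on */
static void fdb_learn(struct toy_fdb *f, const unsigned char *mac, int port)
{
    for (int i = 0; i < f->n; i++)
        if (!memcmp(f->mac[i], mac, 6)) {
            f->port[i] = port;
            return;
        }
    if (f->n < TOY_FDB_MAX) {
        memcpy(f->mac[f->n], mac, 6);
        f->port[f->n++] = port;
    }
}

/* where should a frame for `mac` go? FLOOD when we've never seen it */
static int fdb_lookup(const struct toy_fdb *f, const unsigned char *mac)
{
    for (int i = 0; i < f->n; i++)
        if (!memcmp(f->mac[i], mac, 6))
            return f->port[i];
    return FLOOD;
}
```

A hub behaves as if every lookup returned FLOOD; a bridge only floods until the first frame from a given MAC teaches it the right port.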

I remember seeing these same messages in 19.07 and 21.02 as well.

The same messages get logged on my Broadcom-based routers running dd-wrt firmware, and that's running Linux kernel 4.4.

Seems quite harmless.

You might see my updates to the post just below yours about br_fdb_update().