Queuing issue after snapshot update (ipq40xx + ath79)

setup

Bigger installation (>40) with a mix of WPJ531 and WPJ428 devices,
These errors started after a firmware update from 3835c92e065ab1b98e3e307cfebc0c8cf2f9f729 to 961e01fc67e7d9e60557df3474fa326216aa4839 (automated toolchain creates config files and adds packages).
It's still there in 1996d600bd4693279929fe9d1914c2fe4d2dd9ef.

diagnose so far

After the update, clients didn't receive DHCP responses after roaming. Even after automated reconnects.
After a few attempts with pauses in between, connections would start working again.

Some people also had problems with initial connections but I didn't get to those yet.
The error happens regardless of auth_alg=ft or auth_alg=open, so I'm ruling out WiFi-protocol-related issues for the time.

.. some boring stuff .. skip to the relevant part:

While the client wasn't sending anything, it wasn't responding to pings from the router (WPJ428/ipq40xx/ath10k).
The client itself was able to ping with decent response times and while it did that, the router also received responses to its own ping requests.

Here's what the router saw while the clients was pinging with -i 10:

64 bytes from 10.xx.xx.xx: seq=54 ttl=63 time=4235.418 ms
64 bytes from 10.xx.xx.xx: seq=55 ttl=63 time=3233.571 ms
64 bytes from 10.xx.xx.xx: seq=56 ttl=63 time=2232.333 ms
64 bytes from 10.xx.xx.xx: seq=57 ttl=63 time=1231.903 ms
64 bytes from 10.xx.xx.xx: seq=58 ttl=63 time=231.490 ms
64 bytes from 10.xx.xx.xx: seq=65 ttl=63 time=3232.787 ms
64 bytes from 10.xx.xx.xx: seq=66 ttl=63 time=2232.851 ms
64 bytes from 10.xx.xx.xx: seq=67 ttl=63 time=1232.997 ms
64 bytes from 10.xx.xx.xx: seq=68 ttl=63 time=232.661 ms
64 bytes from 10.xx.xx.xx: seq=69 ttl=63 time=15.641 ms
64 bytes from 10.xx.xx.xx: seq=70 ttl=63 time=4552.084 ms # tcp keep-alive
64 bytes from 10.xx.xx.xx: seq=73 ttl=63 time=1559.227 ms
64 bytes from 10.xx.xx.xx: seq=71 ttl=63 time=3561.172 ms
64 bytes from 10.xx.xx.xx: seq=72 ttl=63 time=2561.476 ms
64 bytes from 10.xx.xx.xx: seq=74 ttl=63 time=560.888 ms
64 bytes from 10.xx.xx.xx: seq=75 ttl=63 time=3259.119 ms
64 bytes from 10.xx.xx.xx: seq=76 ttl=63 time=2259.409 ms
64 bytes from 10.xx.xx.xx: seq=77 ttl=63 time=1259.576 ms
64 bytes from 10.xx.xx.xx: seq=78 ttl=63 time=260.443 ms
64 bytes from 10.xx.xx.xx: seq=85 ttl=63 time=3235.971 ms
64 bytes from 10.xx.xx.xx: seq=86 ttl=63 time=2249.590 ms
64 bytes from 10.xx.xx.xx: seq=87 ttl=63 time=1249.227 ms
64 bytes from 10.xx.xx.xx: seq=88 ttl=63 time=248.889 ms
64 bytes from 10.xx.xx.xx: seq=95 ttl=63 time=3232.052 ms
64 bytes from 10.xx.xx.xx: seq=96 ttl=63 time=2231.592 ms
64 bytes from 10.xx.xx.xx: seq=97 ttl=63 time=1232.011 ms
64 bytes from 10.xx.xx.xx: seq=98 ttl=63 time=231.759 ms

Burst of packets while the client sends? Smells like a queuing issue.
To confirm that sending traffic 'unclogs' the incoming data lane, I had the cooperative admin run ip a add 10.yyy.0.1/24 dev $dev; ping 10.yyy.0.2 while network-manager was busy with DHCP and it worked:

09:06:55.233036 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from xx:xx:xx:xx:xx:xx (oui Unknown), length 303
09:06:55.741784 IP6 fe80::xxxx:xxxx:xxxx:xxxx > ip6-allrouters: ICMP6, router solicitation, length 8
09:06:57.673814 ARP, Request who-has 10.xxx.0.2 tell 10.xxx.0.1, length 28
09:06:57.680549 IP 10.xxx.0.1 > igmp.mcast.net: igmp v3 report, 1 group record(s)
09:06:57.890660 IP 10.xxx.0.1.5353 > 224.0.0.251.5353: 0 [3q] [4n] ANY (QM)? 3.a.9.5.5.8.a.a.0.5.8.3.9.d.c.b.0.0.0.0.0.0.0.0.0.0.0.0.0.8.e.f.ip6.arpa. ANY (QM)? The-Client.local. ANY (QM)? 1.0.xxx.10.in-addr.arpa. (218)
09:06:58.141465 IP 10.xxx.0.1.5353 > 224.0.0.251.5353: 0 [3q] [4n] ANY (QM)? 3.a.9.5.5.8.a.a.0.5.8.3.9.d.c.b.0.0.0.0.0.0.0.0.0.0.0.0.0.8.e.f.ip6.arpa. ANY (QM)? The-Client.local. ANY (QM)? 1.0.xxx.10.in-addr.arpa. (218)
09:06:58.360553 IP 10.xxx.0.1 > igmp.mcast.net: igmp v3 report, 1 group record(s)
09:06:58.392182 IP 10.xxx.0.1.5353 > 224.0.0.251.5353: 0 [3q] [4n] ANY (QM)? 3.a.9.5.5.8.a.a.0.5.8.3.9.d.c.b.0.0.0.0.0.0.0.0.0.0.0.0.0.8.e.f.ip6.arpa. ANY (QM)? The-Client.local. ANY (QM)? 1.0.xxx.10.in-addr.arpa. (218)
09:06:58.593397 IP 10.xxx.0.1.5353 > 224.0.0.251.5353: 0*- [0q] 4/0/0 (Cache flush) PTR The-Client.local., (Cache flush) A 10.xxx.0.1, (Cache flush) PTR The-Client.local., (Cache flush) AAAA fe80::xxxx:xxxx:xxxx:xxxx (200)
09:06:58.680653 ARP, Request who-has 10.xxx.0.2 tell 10.xxx.0.1, length 28
09:06:59.595336 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from xx:xx:xx:xx:xx:xx (oui Unknown), length 303
09:06:59.693751 ARP, Request who-has 10.xxx.0.2 tell 10.xxx.0.1, length 28
09:06:59.763408 IP 10.xxx.0.1.5353 > 224.0.0.251.5353: 0*- [0q] 4/0/0 (Cache flush) PTR The-Client.local., (Cache flush) A 10.xxx.0.1, (Cache flush) PTR The-Client.local., (Cache flush) AAAA fe80::xxxx:xxxx:xxxx:xxxx (200)
09:07:01.934168 IP 10.xxx.0.1.5353 > 224.0.0.251.5353: 0*- [0q] 4/0/0 (Cache flush) PTR The-Client.local., (Cache flush) A 10.xxx.0.1, (Cache flush) PTR The-Client.local., (Cache flush) AAAA fe80::xxxx:xxxx:xxxx:xxxx (200)
09:07:03.456766 IP6 fe80::xxxx:xxxx:xxxx:xxxx > ip6-allrouters: ICMP6, router solicitation, length 8
09:07:08.032137 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from xx:xx:xx:xx:xx:xx (oui Unknown), length 303
09:07:08.033093 IP 10.yy.0.1.67 > 10.yy.255.158.68: BOOTP/DHCP, Reply, length 300

There are no updates for ath10k in the commit range but there are 53 commits for mac80211. mac80211 is also used on those other devices (ath79/WPJ531/ath9k+ath10k). So I asked the client to move to one of those.
But there, the problem looks different. Responses come through regardless of whether the client is sending but there's also a timing component:

64 bytes from 10.xx.xx.xx: seq=0 ttl=63 time=32.773 ms
64 bytes from 10.xx.xx.xx: seq=1 ttl=63 time=62.852 ms
64 bytes from 10.xx.xx.xx: seq=2 ttl=63 time=204.118 ms
64 bytes from 10.xx.xx.xx: seq=3 ttl=63 time=30.652 ms
64 bytes from 10.xx.xx.xx: seq=4 ttl=63 time=36.900 ms
64 bytes from 10.xx.xx.xx: seq=5 ttl=63 time=265.002 ms
64 bytes from 10.xx.xx.xx: seq=6 ttl=63 time=287.134 ms
64 bytes from 10.xx.xx.xx: seq=7 ttl=63 time=195.082 ms
64 bytes from 10.xx.xx.xx: seq=8 ttl=63 time=13.278 ms
64 bytes from 10.xx.xx.xx: seq=9 ttl=63 time=37.209 ms
64 bytes from 10.xx.xx.xx: seq=10 ttl=63 time=56.085 ms
64 bytes from 10.xx.xx.xx: seq=11 ttl=63 time=79.974 ms
64 bytes from 10.xx.xx.xx: seq=12 ttl=63 time=100.691 ms
64 bytes from 10.xx.xx.xx: seq=13 ttl=63 time=121.912 ms
64 bytes from 10.xx.xx.xx: seq=14 ttl=63 time=2.599 ms
64 bytes from 10.xx.xx.xx: seq=15 ttl=63 time=168.146 ms
64 bytes from 10.xx.xx.xx: seq=16 ttl=63 time=189.272 ms
64 bytes from 10.xx.xx.xx: seq=17 ttl=63 time=289.188 ms
64 bytes from 10.xx.xx.xx: seq=18 ttl=63 time=144.044 ms
64 bytes from 10.xx.xx.xx: seq=19 ttl=63 time=170.746 ms
64 bytes from 10.xx.xx.xx: seq=20 ttl=63 time=195.774 ms
64 bytes from 10.xx.xx.xx: seq=21 ttl=63 time=211.513 ms
64 bytes from 10.xx.xx.xx: seq=22 ttl=63 time=234.636 ms
64 bytes from 10.xx.xx.xx: seq=23 ttl=63 time=147.964 ms
64 bytes from 10.xx.xx.xx: seq=24 ttl=63 time=2.608 ms

list of mac80211 commits in the range

git log --no-decorate --abbrev-commit  --pretty=oneline  up_to_xyz...up_to_wxy -- package/kernel/mac80211/ 
8246ab6a27 mac80211: fix masking nested A-MSDU support for mesh
e4b9538ddb mac80211: use KERNEL_MAKEOPTS instead of KERNEL_MAKE_FLAGS
58b65525f3 mac80211: Update to version 5.15.74-1
a7ca1b2314 mac80211: use board.json provided phy names in generated default config
50a03decdf mac80211: change the default config for a renamed wiphy
db9c4a066a mac80211: fix detecting highest radio* config section index
4d323303e7 mac80211: rename phy according to board.json entries on bringup
6603748e0c mac80211: change default ifname to <phy>-<type><index>
da6b77215b mac80211: fix typo in netifd script
88803cb0e6 mac80211: add patch that gives the driver more control over netdev offloads
26f400210d mac80211: backport security fixes
590eaaeed5 mac80211: fix issues in HE capabilities
f6c359a655 mac80211: sync rx STP fix with updated version
cec7dfa497 mac80211: fix issues with receiving small STP packets
45109f69a6 mac80211: fix compile error when mesh is disabled
f96744ba6b mac80211: mask nested A-MSDU support for mesh
eb07020de2 mac80211: fix decap offload for stations on AP_VLAN interfaces
b02b95d1b0 mac80211: fix parameter reading for AC_BE tx bursting
a6fd151308 mac80211: remove broken CFG80211_WEXT select from the r8723bs driver
3f814ecd24 mac80211: mark lib80211 as hidden
f613c1368b mac80211: disable drivers that rely on wireless extensions
6eeb5d4564 kernel: disable wireless extensions only when needed
412fcf3d44 mac80211: rt2x00: replace patches with v3 of pending series
d826c91704 mac80211: rt2x00: fix typo
e785ca05e9 mac80211: clean and submit a bunch of rt2x00 patches
aa9be386d4 mac80211: merge upstream fixes
8b06e06832 mac80211: merge pending fixes for tx queueing issues
31a6605de0 mac80211: rt2x00: experimental improvements for MT7620 wifi
d4feb66048 mac80211: add patch descriptions to rt2x00 patches
f338f76a66 mac80211: parse the correct set of HE capabilities for AP mode
1ea563ae13 base-files: wifi: add random MAC support for wifi-iface
8b804cae5e mac80211: backport upstream fix for unitialized lock usage
3aa18f71f9 mac80211: Update to version 5.15.58-1
9f1d622328 mac80211: fix AQL issue with multicast traffic
489d792c85 mac80211: enable CONFIG_BRCMFMAC_SDIO for imx/cortexa7
576b62712f mac80211: Add DRIVER_11AX_SUPPORT dependency to mac80211-hwsim and iwlwifi
24eee4b244 mac80211: rename patch to avoid duplicate patch number
51e9d496ba mac80211: fix mesh queue selection issue
a5526efde0 mac80211: read alternative brcm fw names from DT
7f1ac858e3 mac80211: add patch for mwifiex to fix cryptic errors/warnings
f8376ffbc1 mac80211: mwl: add patch to raise global limit of SSID up to 4
d2d5aabdbb hostapd: introduce min_tx_power option
f2982bcc89 hostapd: introduce background_radar option
40c2cd8bd0 ath10k: enable encapsulation offload by default
7bf5233a2a mac80211: increase airtime scheduler quantum
6d49a25804 mac80211: add airtime fairness improvements
946e42ed89 mac80211: brcmfmac: allow setting MAC from NVMEM cells
00f64ed660 ath10k: backport encapsulation offload support
7e627f12f6 Revert "mac80211: add airtime fairness rework/fixes"
1c377a176b Revert "mac80211: sync airtime fairness fixes with updated upstream submission"
958785508c mac80211: sync airtime fairness fixes with updated upstream submission
a10e0d7d35 mac80211: sync nl80211.h with upstream
6c152ce5b0 hostapd: randomize default BSS color

one small concern about the idea of mac80211 being the culprit

The router (who's not the DHCP server on that network) didn't see any DHCP responses when it wasn't working on the WPJ428 while it did see them when it worked:

09:04:43.194092 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from xx:xx:xx:xx:xx:xx (oui Unknown), length 303
09:04:44.876971 IP6 fe80::xxxx:xxxx:xxxx:xxxx > ip6-allrouters: ICMP6, router solicitation, length 8
09:04:47.209011 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from xx:xx:xx:xx:xx:xx (oui Unknown), length 303
09:04:53.822613 IP6 fe80::xxxx:xxxx:xxxx:xxxx > ip6-allrouters: ICMP6, router solicitation, length 8
09:04:55.578235 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from xx:xx:xx:xx:xx:xx (oui Unknown), length 303

I'm not sure that really means that the DHCP server never replied. Maybe the reply was 'routed' through the bridge from ethernet to wifi device without tcpdump seeing it?

questions about what to do next

There's no way i can do a bisect in that environment and i'm working on a reproduction at home but, overall, this feels out of my depths.

There are lot of commits (including those 'inside' the updates to mac80211) :smiley:
and I don't even know for sure this is caused by mac80211.
Should I continue trying to reproduce in order to be able to bisect or is there another thing that I can try?

Is it reasonable to assume those cross-platform timing issues are related?
Is anybody else experiencing similar issues?

EDIT: additional info

I've just realised the first five commits in the list are related to airtime fairness:
There are five SSIDs each with an airtime weight of 100.
Tx retries are at 0.2%.

Just a wild guess. So caution!

I came across this thread: IPQ40xx Switch Config "Strangeness" (swconfig) recently.

Because my first unfounded thought/impression was, as I saw those pings, something is looping within your network. And I wonder if this could be related to the VLAN quirks. There is still WIP about this VLAN issues.

Out of own experience. If you set this value explicit within your configuration this can cause wired issues with different devices. So better do not mess with this value.

@anon54515283 ipq40xx devices have DSA in master for a couple of weeks or even months now :tada: So the message you quoted doesn't apply anymore. If @yogo1212 has upgraded, they need to upgrade without keeping config

1 Like

Thx for this fast reply. Didn't know that you get triggerd for quoting others in this forum. Sorry for triggering you and dragging you into my wild guess. But your input about upgrading could also be the root cause.

1 Like

Pew! I never considered that.
Airtime weight is tied into the platform that generates the wifi config. I guess it's possible to not set the value on any interfaces if they're all the same.

I use VLANs, yes. The driver doesn't have any hard-coded VLANs, though. And the erroneous behaviour appears to be influenced more by WiFi.

EDIT: Ahh. @paraskevas cleared it up.

I appreciate the hint :slight_smile:
My firmware only keeps a certificate and a JSON file describing the network config on sysupgrade. /etc/config/network is configured by a program that reads the ports from /etc/board.json and detects swconfig appropriately, even automatically generating switch_vlans.

Do you also think something like this could be switch-related?

No idea. I had a lot of random issues with ipq40xx before DSA (drops, timeouts). I've been using DSA since it was a draft PR and all my issues went away immediately. Keep in mind that DSA doesn't use swconfig though :thinking:

1 Like

As you have mentioned this value explicit (so I guess you've set it) I was remembering that I've tinkerd around with this value recently which has caused severe wifi issues (with the wifi of the routers itself and the end devices). And someone (cannot remember actually so no trigger this time) gave me the hint that it is useless to adjust this value and it's causing just issues.

Normally this value is not set within the config explicit. So just delete it within the config.