AQL and the ath10k is *lovely*

@72105

Please don't restart your Wi-Fi every day. Keep it running without restarting, because some problems may only arise after several days. We also don't want perfection in anything, because perfection does not exist (in the general sense of Heisenberg's uncertainty principle); we just want things to work reasonably well and reliably for extended periods of time :slight_smile:

2 Likes

@quarky:

From what I can understand of the ATF code, the flags control the use of TX and RX airtime accounting. If set, the TX/RX time will be added to the ATF airtime consumption of a client transmit queue, which will then be used to determine whether the client's transmit queue has used up its share of transmit time. So clearing the flag means the transmit queue will always be eligible for transmit, as the code will think that no airtime has been consumed.
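To make that concrete, here's a toy shell model of the accounting described above. The names (`atf`, `deficit`, `charge`, `may_transmit`) are illustrative, not the real mac80211/ath10k symbols: a queue stays eligible while its airtime deficit is positive, and with the flag cleared nothing is ever charged, so the queue always looks eligible.

```shell
#!/bin/sh
# Toy model of ATF airtime accounting (illustrative names only).
atf=1 deficit=1000   # deficit in microseconds of airtime share

charge() { # charge $1 microseconds of airtime, only when ATF is on
    if [ "$atf" -eq 1 ]; then deficit=$((deficit - $1)); fi
}

may_transmit() {
    [ "$atf" -eq 0 ] || [ "$deficit" -gt 0 ]
}

charge 5000   # queue overspends its share
may_transmit && echo "eligible" || echo "throttled"   # -> throttled

atf=0 deficit=1000
charge 5000   # no-op: flag cleared, nothing is accounted
may_transmit && echo "eligible" || echo "throttled"   # -> eligible
```

Which matches the observed behavior: with the flags cleared, no queue is ever held back for having spent its airtime.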

This looks like a Quick-and-Dirty way to eliminate that high latency problem. All-you-can-eat buffet à l'américaine :slight_smile:

Well, an all-you-can-eat buffet will be a cause of high latency for latency-sensitive traffic like VoIP or online gaming. For example, downloading a large software update file will drown out VoIP traffic. ATF is supposed to solve this problem, but we'll have to work together to iron out the bugs.

1 Like

@quarky, It was more of a joke, but theoretically a possible Q&D workaround for my problem.

My main problem is extremely high latency for associated clients when some clients leave the house. My guess is that some corner-case logic may have depleted the associated clients' transmit queues and rendered them unable to send/receive anything for a few minutes (aka high latency). Your original explanation mentioned that disabling the ATF flags would make the client transmit queues always available for transmit, so theoretically it would remedy my high latency problem in a Q&D way :-). As a matter of fact, so far it does seem to help after disabling the ATF flags (I kept AQL enabled).

Ah ... apologies. Didn't get the joke.

I think this is a good work-around, since AQL will limit the airtime as well.

Unfortunately, errors still pop up (we'll see if the network becomes useless)...

[50208.914881] ath10k_ahb a000000.wifi: received unexpected tx_fetch_ind event: in push mode
[50208.921160] ath10k_ahb a000000.wifi: received unexpected tx_fetch_ind event: in push mode
[50208.929182] ath10k_ahb a000000.wifi: received unexpected tx_fetch_ind event: in push mode
[50208.937336] ath10k_ahb a000000.wifi: received unexpected tx_fetch_ind event: in push mode

I'm joining in the fun and have applied @nbd's experimental patch to both 21.02 and master branches.

My R7800 is running the 21.02 build while my Linksys E8450 is running the master build. Both have the latest pull from the respective branches.

I must say so far things are looking good after a couple of hours of uptime. No latency issue that I can observe, so I'm optimistic.

Let's see how it goes this round.

I'm a little encouraged. I would like more folks to post the output of:
/sys/kernel/debug/iee*/phy*/netdev:*/aqm

I fear that this code path is not having its drop and mark counters updated correctly.
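For anyone wanting to post those numbers, a small helper that prints just the drops/marks columns; the column positions (5 = drops, 6 = marks) are assumed from the header shown in the outputs below, so adjust if your layout differs:

```shell
#!/bin/sh
# Print the drops and marks columns from an aqm stats file.
aqm_drops() {
    awk 'NR > 1 { printf "drops=%s marks=%s\n", $5, $6 }' "$1"
}

# Demo against a captured sample; on the router you would point it
# at /sys/kernel/debug/ieee80211/phy0/netdev:wlan0/aqm instead.
cat > /tmp/aqm.sample <<'EOF'
ac backlog-bytes backlog-packets new-flows drops marks overlimit collisions tx-bytes tx-packets
2 0 0 13507 0 0 0 1 2174129 13533
EOF
aqm_drops /tmp/aqm.sample   # -> drops=0 marks=0
```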

2 Likes

I applied the patches to 22.03 and so far it looks good, but it needs more long-term testing.

1 Like

Please...

/$ cat /sys/kernel/debug/ieee80211/phy0/netdev:wlan0/aqm
ac backlog-bytes backlog-packets new-flows drops marks overlimit collisions tx-bytes tx-packets
2 0 0 13507 0 0 0 1 2174129 13533

and 10 minutes after :wink:

/$ cat /sys/kernel/debug/ieee80211/phy0/netdev:wlan0/aqm
ac backlog-bytes backlog-packets new-flows drops marks overlimit collisions tx-bytes tx-packets
2 0 0 17994 0 0 0 1 2848428 18123
/$ cat /sys/kernel/debug/ieee80211/phy0/netdev:wlan0/aqm
ac backlog-bytes backlog-packets new-flows drops marks overlimit collisions tx-bytes tx-packets
2 0 0 17998 0 0 0 1 2849344 18127

master on 5.15.45, + patches, 2,4GHz (phy0), 5GHz disabled

There are bugs, but the wifi is still usable (i.e. 10 hours after visible bugs in the log). There are now 5 wifi users on 2.4 GHz: one two-stream and four one-stream. Wireless is good up close (1-5 meters, on the same floor, ping ~2 ms) and not so good farther away (10 meters), but it works stably.

1 Like
r7500v2 # cat /sys/kernel/debug/iee*/phy*/netdev:*/aqm
ac backlog-bytes backlog-packets new-flows drops marks overlimit collisions tx-bytes tx-packets
2 0 0 86768 0 0 0 0 21823552 86771
ac backlog-bytes backlog-packets new-flows drops marks overlimit collisions tx-bytes tx-packets
2 0 0 86690 0 0 0 0 21806307 86700
r7500v2 # uptime
 16:20:49 up 1 day,  8:33,  load average: 0.00, 0.01, 0.00

Build described above here.

EDIT: if it helps, a few minutes later

r7500v2 # cat /sys/kernel/debug/iee*/phy*/netdev:*/aqm
ac backlog-bytes backlog-packets new-flows drops marks overlimit collisions tx-bytes tx-packets
2 0 0 87688 0 0 0 0 22095101 87691
ac backlog-bytes backlog-packets new-flows drops marks overlimit collisions tx-bytes tx-packets
2 0 0 87610 0 0 0 0 22077856 87620
r7500v2 # uptime
 16:34:44 up 1 day,  8:47,  load average: 0.08, 0.03, 0.00

In no case above are we actually logging that codel dropped any packets. Doing a stress test and a packet capture to show that we did indeed drop some would be reassuring, but I do think that (at least the logging) part of the code path is presently being bypassed somehow. @nbd?

... Or you are simply not dropping any packets. Dropping a ton of packets would be bad and would lead to some of the behaviors described in this thread, but dropping or marking at least some is necessary for good congestion control.

I'll try netem (configured to drop some packets) and then see if any drops show up.

EDIT: hmm, I can get netem to drop packets on a client (netperf slows down when transmitting from that client to the AP with 1% loss), but I'll need to figure out where to put netem (on an ifb device on the client and then stream from the AP to the client? on the wifi interface or br-lan on the r7500v2?) and where to look on the r7500v2 in addition to the aql log.

Sorry, non-expert here, beer in one hand and bbq in the other, so it will take time.

EDIT: I can put an ifb device on the client with netem configured to drop 1% of packets. Streaming from the AP to that client, I see it is dropping packets (netperf throughput decreases versus netem at 0% loss, and the ifb0 device shows drops). I just don't know where to look on the r7500v2. The cat ...*aqm output still shows no drops, and I don't see a change in drops on any other r7500v2 interface that correlates with the ifb0 device drop count.

Sorry - maybe someone with more experience (and does not have a beer in one hand) can do this better.

Hold that beer! netem drops them in the wrong place. Hitting the interface as hard as you can would be a way to see if codel is working. Multiple `ping -f -s 1400 the_router` instances perhaps, but ideally a stress test like flent or iperf to another server within your wifi network.
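One way to see whether the stress run actually moved the counter is to snapshot the aqm stats before and after and diff the drops column (column 5, assuming the header layout posted earlier in this thread):

```shell
#!/bin/sh
# Report how much the drops counter moved between two aqm snapshots;
# a working codel under load should show a nonzero delta.
drop_delta() {
    b=$(awk 'NR == 2 { print $5 }' "$1")
    a=$(awk 'NR == 2 { print $5 }' "$2")
    echo "codel drops during test: $((a - b))"
}

# Demo with captured samples; on the router, snapshot
# /sys/kernel/debug/ieee80211/phy0/netdev:wlan0/aqm before and
# after the ping flood / flent run.
printf 'hdr\n2 0 0 100 7 0 0 0 1 1\n' > /tmp/aqm.before
printf 'hdr\n2 0 0 200 19 0 0 0 1 1\n' > /tmp/aqm.after
drop_delta /tmp/aqm.before /tmp/aqm.after   # -> codel drops during test: 12
```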

Hi @anon98444528,

  1. Does your device have 512MB or 256MB RAM?
  2. When testing, only a single client is sending traffic and all other devices are not sending anything other than ssh traffic (no updates or anything else)
  3. I am testing on 5.10 with ath10k-ct and ath10k-ct-htt

I am going through my test setup again, double checking all linux clients and I also found more macos clients I can use for testing :slight_smile:

thanks for the update!

Hi @celle1234,

Are you using ath10k-ct (that's default with openwrt 22.03) or did you change it to ath10k mainline when you applied these patches?

Thanks!

  1. 512MB
  2. Single-client netperf - other clients were mostly not transmitting, but from past testing I know something occasionally sends data, as I see it in the netperf results (drops that recover). In my "baseline" test above, my network was pretty quiet and I saw little, if any, drop in throughput when doing a one-client netperf. If you see a netperf throughput dip that recovers, likely one of your macOS clients is "phoning home"; I don't think it will be easy to control that.
  3. That should be safe.
1 Like

flent rrul from one client to a netperf server wired to the AP, with two other clients simultaneously (with each other and the rrul) doing netperf to a different netperf server also wired to the AP (one client streaming from the server, the other streaming to the server, all clients sending data through the AP on the 5 GHz wifi). I got 3 collisions.

r7500v2 # cat /sys/kernel/debug/iee*/phy*/netdev:*/aqm
ac backlog-bytes backlog-packets new-flows drops marks overlimit collisions tx-bytes tx-packets
2 0 0 112477 0 0 0 3 29045537 112481

I'll keep trying and report back if I see something.

1 Like

I'd like to try a few tests with rtt_fair. I have 3 Ubuntu boxes attached to the AP via wires. One of those boxes is the router, so I won't mess with its interfaces/qdiscs. The router (box 1) rtt is 1-4 ms. With the other two boxes I can do as I please, so I'm thinking about using netem and ifb to introduce an artificial rtt delay. On box 2, something like

sudo ip link set dev ifb0 up
sudo tc qdisc add dev eth0 ingress
sudo tc filter add dev eth0 parent ffff: protocol ip u32 match u32 0 0 flowid 1:1 action mirred egress redirect dev ifb0
sudo tc qdisc add dev ifb0 root netem delay 50ms 10ms distribution normal
sudo tc qdisc add dev eth0 root netem delay 50ms 10ms distribution normal

and a similar setup, but with `delay 100ms 20ms distribution normal`, on the third box.

EDIT: I'm still thinking this part through, i.e. the netperf stream and ping rtt should be consistent. Since I'm delaying 50 ms on both ingress and egress on box 2, the total average rtt should be ~100 ms (~200 ms for box 3). Since box 2's rtt is ~100 ms, the box 2 netperf packets should be delayed 50 ms in each direction, upload or download.
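A quick sanity check of that arithmetic (the ~2 ms base is my assumption from the 1-4 ms unaltered rtt mentioned above): netem adds the delay once per direction (egress directly, ingress via the ifb redirect), so the round trip gains twice the one-way delay.

```shell
#!/bin/sh
# expected_rtt base_ms one_way_ms: round trip gains 2x the per-direction delay
expected_rtt() { echo $(( $1 + 2 * $2 )); }

echo "box2 expected rtt = $(expected_rtt 2 50) ms"    # -> 102 ms
echo "box3 expected rtt = $(expected_rtt 2 100) ms"   # -> 202 ms
```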

Will that be better than all 3 boxes having about the same (unaltered) rtt? Is this rtt spacing sufficient (2 ms, 100 ms, 200 ms)? Should I use the distribution? If so, are variances of 10 ms and 20 ms OK?

Likely I'll post about this on my own thread to keep the noise down on this one.

I use ath10k mainline on my r7800. The CT driver and different CT firmwares were never usable in the past.

1 Like