Thanks to @ParanoidZoid's great observations, I am now convinced the latest results I sent you are due to Ben's low-water mark change. I go into more detail in Edits 1 and 2 here: AQL and the ath10k is *lovely*
That said, I do acknowledge your desire to see the watermarks replaced with a BQL implementation. Still, I am curious about the results I have seen in testing. Ben's htt->max_num_pending_tx / 4 with ath10k-ct's 2048 tx buffers means the low-watermark would be 512. Since I have been running ath10k-ct-smallbuffers at 512 tx buffers from the start, the low-watermark for me would be 128.
It seems to me this would, at least in theory, result in even better/more consistent latency (at the potential cost of throughput) on the ath10k-ct-smallbuffers driver than on the ath10k-ct driver. Or am I crazy? (I can handle honesty.)
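For anyone following along, the arithmetic behind those two watermark numbers (assuming Ben's htt->max_num_pending_tx / 4 heuristic, as described above) works out like this:

```python
# Low-watermark per Ben's htt->max_num_pending_tx / 4 heuristic.
# Buffer counts are the ones discussed above for the two driver builds.
def low_watermark(max_num_pending_tx: int) -> int:
    return max_num_pending_tx // 4

# ath10k-ct with 2048 tx buffers
print(low_watermark(2048))  # 512

# ath10k-ct-smallbuffers with 512 tx buffers
print(low_watermark(512))   # 128
```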
It's not directly relevant to this discussion. Just curious. The same goes for which ethernet driver it is, but I don't know how to find that out from sysfs.
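(For what it's worth, the ethernet driver name can be pulled from sysfs or ethtool; interface name here is an assumption, adjust to taste:)

```shell
# Kernel driver bound to the interface (eth0 assumed)
readlink /sys/class/net/eth0/device/driver

# Or via ethtool, which also shows driver version and bus info
ethtool -i eth0
```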
You are utterly correct that 25% of a smaller tx ring results in less latency and jitter than 25% of a larger one, at a possible cost in throughput. And you are not crazy.
However, packets range in size from 64 bytes to 64k bytes (with GSO), and bytes = time on ethernet.
On wifi it's airtime, but on 802.11ac (not n), bytes = time is a decent proxy, better than the tx ring watermarks. Figuring out airtime is what AQL is sort of supposed to do, but in this first version aql_threshold is a fixed value, and it shouldn't be.
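To put rough numbers on "bytes = time": serialization delay is just frame size over line rate, so the 64-byte-to-64k packet-size spread alone spans three orders of magnitude in time on the wire. A quick sketch (sizes and rates are illustrative, not from the thread):

```python
def serialization_us(size_bytes: int, rate_mbps: float) -> float:
    """Time a single frame occupies the link, in microseconds.
    bits / (Mbit/s) comes out directly in microseconds."""
    return size_bytes * 8 / rate_mbps

# On gigabit ethernet:
print(serialization_us(64, 1000))     # ~0.5 us for a minimum-size frame
print(serialization_us(65536, 1000))  # ~524 us for a 64k GSO super-packet
```

The same byte count is a much worse proxy on wifi, where the effective rate per station varies wildly, which is why estimating actual airtime (AQL's job) is the better long-term answer.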
I kicked off a 90-second bi-directional iperf3 between my MacBook and a server that is also gigabit connected (through a Netgear switch to my R7800). The Netgear switch has tx/rx flow control enabled.
During that iperf3 run, I never saw the limit value on the R7800 go above 6123:
root@OpenWrt:~# ethtool -k eth0
Features for eth0:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: on
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
tx-tcp-segmentation: off [fixed]
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp-mangleid-segmentation: off [fixed]
tx-tcp6-segmentation: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on [fixed]
tx-vlan-offload: off [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
tx-gso-partial: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: on [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
I should have been checking eth1 instead of eth0. My primary SSID is tagged to eth1--it didn't dawn on me earlier that I should have been watching it.
Running the iperf3 again via WiFi resulted in eth1 BQL limit values of 100k - 250k. Those sound like more believable values compared to what you were seeing.
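For reference, the BQL limit values I was watching live in sysfs, one per tx queue (interface name assumed; single-queue devices will just have tx-0):

```shell
# Current BQL limit, in bytes, for each tx queue on eth1
for q in /sys/class/net/eth1/queues/tx-*/byte_queue_limits/limit; do
    echo "$q: $(cat "$q")"
done
```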
Just for grins, I also connected my Mac to the R7800 via ethernet (gig) and ran the same iperf3 back to my server.
Still way off topic! I'd be rather curious what cake does as a default qdisc on that hw without the shaper turned on. In its full-blown glory it's pretty cpu intensive, but with gro splitting it should be able to get down to about 40kb on bql and thus lower latency. On the other hand, you can't push 2Gbit (bidir) on this hw through fq_codel at present, so you are hitting a limit somewhere in the rx path; ironically enough, the rx ring might be too small. I'm no fan of gro... particularly when done in software, I'd just as soon rip it out of the fast path. You can turn it off with ethtool. That said...
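(The ethtool knob I mean, if you want to try a run with GRO off; interface name is an assumption:)

```shell
# Disable generic receive offload on the interface under test
ethtool -K eth1 gro off

# Verify it took effect
ethtool -k eth1 | grep generic-receive-offload
```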
Anyway, if you are bored, try this:
sysctl -w net.core.default_qdisc=cake
tc qdisc replace dev eth0 root pfifo
tc qdisc replace dev eth1 root pfifo
and rerun that test, as sort of a speed test of the simplest algo we have.
then try
tc qdisc del dev eth0 root
tc qdisc del dev eth1 root
(this should make cake be the default qdisc, check with tc -s qdisc show)
rerun the iperf test
tc -s qdisc show > cake_default.log
then
tc qdisc replace dev eth0 root cake besteffort flows
tc qdisc replace dev eth1 root cake besteffort flows
I still don't fully comprehend how DQL or the low-water mark helps reduce latency when we've already got AQL & ATF. I can see the improvements, but I don't understand the mechanism with regard to the rest of the ath10k stack. Can anyone shed light on this?
Which make/model would be best? I'll seriously send it.
Thank you for the offer, and I really do appreciate the sentiment. However, my problem is not so much lack of hardware as lack of time to set up a proper testbed and run tests. I am planning to try to resurrect my old testbed, but I only have remote access, so there are some limits to what I can do there. Otherwise I do have a lot of empty shelf space these days, but setting up a stack of routers there requires a bit more time investment...
I still don't fully comprehend how DQL or the low-water mark helps reduce latency when we've already got AQL & ATF. I can see the improvements, but I don't understand the mechanism with regard to the rest of the ath10k stack. Can anyone shed light on this?
Astute observation, and this is actually the reason I don't think spending more time on DQL for ath10k is the right thing to do. Making things airtime-based is clearly the right thing to do, so I'd rather spend the effort tweaking AQL. The main thing that is missing there (I think) is a global limit on the whole interface; AQL only has a per-station throttle. Ben's original patch (with the low-watermark tweak) was because he was running tests with a lot of stations (as in, hundreds), which has not been a case that has seen a lot of focus for AQL. So it ought to be possible to improve this case with the existing infrastructure...
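A rough sketch of the per-station vs. whole-interface distinction, in toy Python (the real AQL lives in mac80211 and tracks estimated airtime per station and per access category; the interface-wide limit here is the hypothetical missing piece, and all numbers are made up):

```python
class AqlSketch:
    """Toy model: per-station airtime throttle (what AQL has today)
    plus a hypothetical whole-interface cap. Units: microseconds of
    estimated in-flight airtime."""

    def __init__(self, station_limit_us=5000, interface_limit_us=24000):
        self.station_limit_us = station_limit_us
        self.interface_limit_us = interface_limit_us  # not in mainline AQL
        self.pending = {}       # station -> airtime currently in flight
        self.total_pending = 0  # sum across all stations

    def may_transmit(self, station):
        if self.pending.get(station, 0) >= self.station_limit_us:
            return False  # per-station throttle (exists today)
        if self.total_pending >= self.interface_limit_us:
            return False  # global throttle (the missing piece)
        return True

    def on_tx(self, station, airtime_us):
        self.pending[station] = self.pending.get(station, 0) + airtime_us
        self.total_pending += airtime_us

    def on_tx_complete(self, station, airtime_us):
        self.pending[station] -= airtime_us
        self.total_pending -= airtime_us
```

The point of the sketch: with hundreds of stations each sitting just under its own per-station limit, the *sum* of queued airtime can still balloon; only an interface-wide cap bounds that sum, which is why the many-stations case motivated the low-watermark workaround.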
I'd like openwrt to ship what we have so far. It's an enormous improvement and more users need it. We can keep sorting out better approaches as we have time.
I don't care about 100 users, I care about, oh, a max of 32. For openwrt.
Please consider >32 no longer unreasonable. Maybe not >32 high bandwidth users, but definitely >32 users.
These days vacuums, thermostats, smoke detectors, fridges, doorbells, TVs, chromecasts, speakers, game consoles, and cars in the driveway all want on wifi. Having a smart speaker that can control colour-changing smart bulbs and smart plugs, all with their own wifi connections, is popular with kids these days.
Not to mention occasionally hosting big multi-family events like Christmas, where the kids run off into little groups to have tiktok watching parties while others want to stream 4k netflix. Ideally openwrt can support all this. I'm more worried that 64 isn't enough than 32, at least in a non-covid-19 world.
If I have any one end goal with this work in making the ath10k sing and dance, it's to finally be able to test the l4s vs sce concepts on real hardware, on wifi. I really lack time and braincells for hard-core kernel development. I'm mostly just a theorist, and although I LOVE hacking on code, I have to spend too much time at layers 8 and 9 of the stack these days to focus on it. So if you know anyone that can lend a hand to this effort over here:
Perhaps we can make progress on adding similar features to the wifi implementation and analyzing them.
thx. There's a ton of other stuff worth doing in wifi as well... probably more important than this. I keep thinking that importing an optional ack filter based on cake's ack-filter would help, and I also think the wifi implementation needs to adopt the drop-batching stuff that is already in the qdisc....