AQL and the ath10k is *lovely*

Partially because that test measures the wrong thing. :slight_smile: The "rrul" test exercises 3 of the 4 hardware queues, which fill up and occupy airtime independently. A better comparison test for the 5ms vs 10ms target is the rrul_be test, or the tcp_nup test I just mentioned above.

This thread has gotten really long, and I so appreciate everyone leaping on it. Earlier in this thread we talked about how using packets, rather than bytes, for the high and low watermarks was probably a source of weirdness. One way to test this is for someone to try the regular -ct code, not the smallbuffers version, with the lower target. @tohojo, who has his fingers in too many pies and doesn't even have this chip, seemed to have some insight on that.

Elsewhere (not on this thread, I think) I talked about how unbounded retries were messing wifi up in the presence of interference. The only way I know how to look at this is with aircaps.

And over here I talked about how testing for only a minute bit me badly: http://blog.cerowrt.org/post/disabling_channel_scans/ - so -l 300 on a given test, and taking a long walk, might help...
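
For anyone wanting to reproduce these runs, here is a rough sketch of the flent invocations being suggested above (the netperf server hostname and the titles are placeholders; substitute your own):

```sh
# comparison runs for the 5ms vs 10ms target, 300 seconds each so that
# periodic channel scans and rate-control wobbles show up in the data
flent rrul_be -H netperf.example.org -l 300 -t "ath10k-ct-smallbuffers, 10ms target"
flent tcp_nup -H netperf.example.org -l 300 -t "ath10k-ct-smallbuffers, 5ms target"
# each run writes a *.flent.gz data file; inspect the results later with:
# flent --gui *.flent.gz
```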

So far, to me, the net sum of testing was that 10ms seemed a safe default for the -ct-smallbuffers version, and worst case we should try to push that into the next mainline release. Actually getting something even better and more consistent than that would be great. I did just get a new x86_64 box that I can start putting custom kernels on.

My take on AQL was that it needed to be an algorithm, not a constant.

In terms of trying to explain how we might think about the interactions between AQL, NAPI, high/low watermarks, BQL, etc., I kind of find it useful to refer to this paper:

https://sci-hub.tw/10.1016/j.comnet.2020.107136

In it, the pie folk finally realized that draining the rx ring and filling the tx ring in a real OS, as opposed to a simulation, is a bulk operation, NOT a fluid model, and that it broke pie's default rate estimator. They switched pie to timestamps, which sort of fixed the problem (codel has always used timestamps, and we always knew pie's rate estimator was broken...), AND did a really nice analysis of BQL and interrupt handling for two different ethernet cards.

If you can grok that...

Ath10k is even more complicated than that. I so wish we had access to firmware.

Mmmh, curious, rrul uses CS0, CS1, CS5, and EF. As far as I can tell in WMM that will map to AC_BE, AC_BK, AC_VI, AC_VI, so rrul should only exercise 3 of the 4 queues (which essentially does not invalidate your argument), no?

I guess I need to run an rrul test over wifi and then check which queues accumulated more packets....

Ha, so much for the PIE camp's claim that PIE is "better" than codel because, by not needing timestamps, it could be more easily implemented in silicon....

You are correct, rrul only tests 3 of the 4 queues. My bad - both in the original spec and the implementation, and too late to fix now (rrulV2, anyone?). I also edited my comment above.

Three queues is still two too many to correctly test the 5ms target mod. However, the side effects of using the extra hardware queues do seem to suggest that we should fiddle with targets (or, preferably, intervals) based on how many hardware queues are in use.

Well, there is rrul_cs8, which by using all 8 CSs should map two flows in each AC. This nicely demonstrates how anti-social greedy flows in AC_VO behave....
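
For reference, the DSCP-to-AC mapping being discussed here can be sketched with the usual precedence-based rule (the upper three DSCP bits become the 802.1D user priority). This is just an illustration of that table, not the actual mac80211 classifier code:

```sh
# map a DSCP value to its default WMM access category via the 3-bit
# precedence (802.1D user priority)
dscp_to_ac() {
    up=$(( $1 >> 3 ))       # upper 3 bits of the DSCP
    case $up in
        1|2) echo AC_BK ;;  # background
        0|3) echo AC_BE ;;  # best effort
        4|5) echo AC_VI ;;  # video
        6|7) echo AC_VO ;;  # voice
    esac
}
dscp_to_ac 0    # CS0 -> AC_BE
dscp_to_ac 8    # CS1 -> AC_BK
dscp_to_ac 40   # CS5 -> AC_VI
dscp_to_ac 46   # EF  -> AC_VI (precedence 5), so AC_VO stays idle in rrul
```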

An AQL airtime estimation patch has caused some issues for me. The patch assumes an average aggregate length of 16, which doesn't seem to work well at the very edge of my network.

Notice that the latency stays at around 100 ms under load with the patch:

Without the patch I'm getting a far better value of just under 40 ms:

Interesting find. I had noticed some [new] spotty connectivity patterns on some of my devices near the edge of my network as well, but didn't associate it with this patch until you pointed it out.

Did you end up manually removing these patches and rebuilding?

Curious to hear if others are noticing issues as well.

Yeah, I reverted the patch locally and compiled a new image.

The behavior pattern I am noticing (while not as scientific as a flent test) is this:
I SSH to one of my internal servers from a wirelessly connected iPhone/iPad (VHT80 connected) and start something like a watch -n1 ... command. It refreshes every second for a few seconds, then it "pauses" and takes several seconds (anywhere from 2-5+ secs) to refresh. It might pick back up with refreshes for a second or two before a long delay again. This cycle repeats endlessly.

This started fairly recently, so I reverted the same patch thinking it might have been related, but no dice. There appears to be something else going on in my case. I'm not losing sleep over it at this point, but hopefully I (or someone else) can get to the bottom of it soon.

Hmm, I haven't experienced the issue you describe. If you can reliably reproduce it you could try to bisect to find the commit that causes it.
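
If you do get a reliable reproducer, a standard bisect run would look roughly like this (the commit ID is a placeholder; each step means building, flashing, and testing the revision git checks out before judging it):

```sh
git bisect start
git bisect bad HEAD              # current revision shows the stalls
git bisect good <last-good-rev>  # last revision known to behave
# build/flash/test the revision git checks out, then mark it:
git bisect good   # ...or: git bisect bad
# repeat until git names the first bad commit, then:
git bisect reset
```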

Hi @huaracheguarache,

Thanks for reporting this.
For high throughput links, my patch is necessary for decent aggregation performance, because otherwise the hardware queues will run empty too often.
Regarding your setup that produces these latency spikes, do you simply have too many stations producing traffic (not filling the individual queues enough to aggregate packets), or are the links using lower data rates because of weak signal strength?
Unfortunately I can't make the patch use the actual A-MPDU length, because most drivers can't report it properly. So I have to find another way to work around this.

In my case I'm only testing one station at the edge of my network, where the signal strength is poor, which causes low data rates. There have been some issues with VoIP performance in these areas of my house lately, and I suspect this is related to the patch.

Not sure if it's relevant to your observation, but...

There is a report of lag with apple devices and ath10k-ct firmware on the r7800 here.

If you haven't already tried it, this post (EDIT new link) has a development firmware for the r7800 that apparently helps for kernel 5.4.55 (but not 5.4.56).

Something else to try if you find the time.

HTH

Huge thanks for calling my attention to this issue! I definitely experience the same behavior as described there and added a post to that thread as a result. It looks like great progress is being made there, so I will be following it for sure. Thanks again!

The OP in that issue requested minimal contributions from others. I think your post is fine, though, given you have similar hardware and symptoms. That said, it's best to create a new issue as @greearb suggested.

I may be experiencing similar issues and possibly related to apple devices, but different hardware (running ath10k-firmware-qca99x0-ct-full-htt firmware).

While pinging from an iPhone 7, I do see some lost packets while also running an iperf3 test from the phone to the AP. However, my ping RTTs are much better than you report. Pings during iperf3 from a 2019 MacBook Air to the AP seem fine...
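
For anyone who wants to repeat this kind of check from a laptop instead of a phone app, the basic combination is just a saturating iperf3 run plus a concurrent ping (the AP address below is a placeholder):

```sh
# on the AP (or a wired box behind it):
iperf3 -s
# on the wireless client, in two terminals:
iperf3 -c 192.168.1.1 -t 60     # saturate the link for a minute
ping 192.168.1.1                # watch RTT and loss under that load
```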

A few other observations if it helps...

I have to be careful not to run an iperf test when my spouse is connected (on the same 5 GHz network, Windows 10 client) and doing a video conference, as the audio will cut out - I don't recall this behavior earlier this year (Jan-Feb time frame).

I initially got very similar symptoms (sporadic 1000+ ms ping RTTs) after upgrading an Ubuntu client from 18.04 to 20.04, but this turned out to be client-related (some wifi power-saving setting). Testing with the iPhone 7 plugged in vs. on battery, I don't see a difference, but perhaps there are other Apple device power-saving features I could try.

I'm not sure if this is "ath10k AQL" related, but the Windows client video/voice behavior makes me suspicious.

HTH

EDIT: another "symptom" worth mentioning. I can no longer reliably use DSLReports to test bufferbloat from any wifi client of the AP (I have a separate DIY x86 router running SQM). If my AP's wifi network is quiet, I can get results from a wifi client that match a test done over the wire (on the same AP).

However, if the wifi network is "busy", bandwidth sporadically drops off (sometimes by as much as half) midway through the test, and the test reports a B or C for bufferbloat. A "busy" network does not seem to give these results when testing over the wire.

In the Jan-March time frame, I could get straight "A"s using fq_codel and simple.qos on the router, testing with a wifi client, even when the network was "busy". I'm trying cake/piece_of_cake now, but I don't think this is router/SQM related - I'm pretty sure it's happening upstream on the AP, for wifi clients only.

@huaracheguarache, please test the mac80211 commit from my staging tree at https://git.openwrt.org/?p=openwrt/staging/nbd.git;a=summary
Hopefully it should resolve your AQL latency issue without hurting the high-throughput case.
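
For anyone else wanting to try it, one way to test is to build an image straight from that staging tree. A rough sketch (the clone URL is inferred from the gitweb link above; pick the target/profile for your own device):

```sh
git clone https://git.openwrt.org/openwrt/staging/nbd.git openwrt-nbd
cd openwrt-nbd
./scripts/feeds update -a && ./scripts/feeds install -a
make menuconfig           # select your target/profile (e.g. the R7800)
make -j"$(nproc)"         # then flash the resulting sysupgrade image
```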

Ok, it seems like your latest patch has fixed the regression. Here are the results I got with no patches (I reverted the patch that caused the regression):

To test the behaviour with multiple stations downloading I ran an iperf3 test on my smartphone at around 150 seconds, hence the latency spike and drop in throughput. Here's also a close-up of the area where the ping is more stable:

And these are the results I got with the original patch and the fix:

It actually looks a bit better than the test without any of the patches. And a close-up:

Which looks pretty good! What I don't really understand, though, is why the ping climbs so high when I run a concurrent download on my smartphone. This seems to be an issue with AQL itself, unrelated to your patches, which needs to be looked at.

@dtaht Any idea what might be happening during the part with the high latency? I'm running a build with the codel target lowered to 10 ms and aql_threshold is set to 6000.
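
For context on the knobs mentioned here: the codel target in mac80211 is compiled in (hence the custom build), while aql_threshold is exposed through mac80211's debugfs on recent kernels. Roughly, and assuming the debugfs entries exist on your kernel version (paths and writability can differ):

```sh
# per-phy AQL threshold under mac80211's debugfs (if present):
cat /sys/kernel/debug/ieee80211/phy0/aql_threshold
echo 6000 > /sys/kernel/debug/ieee80211/phy0/aql_threshold
# per-phy fq/codel statistics, useful to watch while a test runs:
cat /sys/kernel/debug/ieee80211/phy0/aqm
```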
