AQL and the ath10k is *lovely*

At the risk of being pedantic...

http://linuxwireless.sipsolutions.net/en/users/Documentation/iw/__v76.html#Modifying_transmit_bitrates describes how to set your wifi bitrates. I used to use mcs-4 a lot for reference...
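
For example, to pin a 5GHz link to HT MCS 4 (a sketch; assuming wlan0 is your station interface - see the page above for the full syntax):

iw dev wlan0 set bitrates ht-mcs-5 4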

When stress testing wifi against a server in the cloud, and your bandwidth to/from the cloud is less than your minimum achievable wifi rate, you are testing the cloud more than the wifi. Testing underloaded wifi is still valuable, btw, if you are experiencing dropouts, etc., but flood/stress testing it at an artificially low bitrate is a useful test of the fq and codel code, as well as of ATF if you are doing multiple-station tests.

In the good ole days, I also would do a sweep of bitrates, as I did here:
https://blog.cerowrt.org/flent/airtime-c2/latency_flat_at_all_rates_cdf.svg

The highest rates like mcs-15 would have a tendency towards extreme flakiness, and it was always good to compare the highest rates against what the minstrel rate controller was detecting and achieving, as well as taking aircaps.

So many variables! So many chipsets! So little time, so little...

3 Likes

Ah, sorry, this test was just to show the issues on macOS, and not related to the main thread. My current router uses neither AQL nor ATF yet, so I cannot really contribute. Bowing out....

EDIT: With "bowing out..." I just want to say I am stopping my little flent-on-macOS subthread, since that was off-topic....

1 Like

@amteza I followed your steps, but I am getting an error about fping:

% flent rrul_be -p all_scaled -l 60 -H netperf-eu.bufferbloat.net -o filename.png
Starting Flent 2.0.1 using Python 3.9.13.
Starting rrul_be test. Expected run time: 70 seconds.
WARNING: Found fping, but couldn't parse its output. Not using.
ERROR: Runner Ping (ms) ICMP failed check: Cannot parse output of the system ping binary (/sbin/ping). Please install fping v3.5+.

% which ping
/sbin/ping

% which fping
/usr/local/bin/fping

% fping -v
fping: Version 5.1

% uname -a
Darwin NAME.local 21.5.0 Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:22 PDT 2022; root:xnu-8020.121.3~4/RELEASE_X86_64 x86_64

@dtaht Are there any Flent testing servers in India or closer to India?

Looks fine to me, same fping. No idea why it cannot parse the output.

@reaper$ ➜  ~ fping -v
fping: Version 5.1
@reaper$ ➜  ~ which fping
/opt/homebrew/bin/fping

Note: Homebrew on ARM64 uses /opt/homebrew in place of /usr/local; that is not relevant here.

@ka2107, if you have time, please try to run flent with the --debug-error parameter.

% flent rrul_be -p all_scaled -l 60 -H netperf-eu.bufferbloat.net -o filename.png --debug-error -v
Starting Flent 2.0.1 using Python 3.9.13.
Executing test environment file /usr/local/lib/python3.9/site-packages/flent/tests/rrul_be.conf
Looking up hostname 'netperf-eu.bufferbloat.net'.
Executing test environment file /usr/local/lib/python3.9/site-packages/flent/tests/rrul_be.conf
Gathering local metadata
Executing 'uname -s' on localhost
Executing 'uname -r' on localhost
Executing 'find /sys/module -name .note.gnu.build-id' on localhost
Executing 'sysctl -e net.core.rmem_max net.core.wmem_max net.ipv4.tcp_autocorking net.ipv4.tcp_early_retrans net.ipv4.tcp_ecn net.ipv4.tcp_pacing_ca_ratio net.ipv4.tcp_pacing_ss_ratio net.ipv4.tcp_dsack net.ipv4.tcp_fack net.ipv4.tcp_sack net.ipv4.tcp_fastopen net.ipv4.tcp_syncookies net.ipv4.tcp_window_scaling net.ipv4.tcp_notsent_lowat net.ipv4.tcp_limit_output_bytes net.ipv4.tcp_timestamps net.ipv4.tcp_congestion_control net.ipv4.tcp_allowed_congestion_control net.ipv4.tcp_available_congestion_control net.ipv4.tcp_mem net.ipv4.tcp_rmem net.ipv4.tcp_wmem net.ipv4.tcp_moderate_rcvbuf net.ipv4.tcp_no_metrics_save' on localhost
Looking up hostname 'netperf-eu.bufferbloat.net'.
Executing 'ip route get 193.10.227.30' on localhost
Executing 'route -n get 193.10.227.30' on localhost
Executing 'tc qdisc show dev en0' on localhost
Executing 'tc -s qdisc show dev en0' on localhost
Executing 'tc class show dev en0' on localhost
Executing 'ethtool -k en0' on localhost
Executing 'for i in /sys/class/net/en0/queues/tx-*; do [ -d $i/byte_queue_limits ] && echo -n "$(basename $i) " && cat $i/byte_queue_limits/limit_max; done' on localhost
Executing 'basename $(readlink /sys/class/net/en0/device/driver)' on localhost
Executing 'ip link show dev en0' on localhost
Executing 'ifconfig en0' on localhost
Executing 'ethtool en0' on localhost
Starting rrul_be test. Expected run time: 70 seconds.
which: Found netperf executable at /usr/local/bin/netperf
which: Found netperf executable at /usr/local/bin/netperf
which: Found netperf executable at /usr/local/bin/netperf
which: Found netperf executable at /usr/local/bin/netperf
which: Found netperf executable at /usr/local/bin/netperf
which: Found netperf executable at /usr/local/bin/netperf
which: Found netperf executable at /usr/local/bin/netperf
which: Found netperf executable at /usr/local/bin/netperf
Ping (ms) UDP BE1: Adding child IrttRunner
which: /usr/local/bin/irtt is not an executable file
which: /usr/bin/irtt is not an executable file
which: /bin/irtt is not an executable file
which: /usr/sbin/irtt is not an executable file
which: /sbin/irtt is not an executable file
which: /usr/local/share/dotnet/irtt is not an executable file
UDP RTT test: Cannot use irtt runner (No irtt binary found in PATH.). Using netperf UDP_RR
Ping (ms) UDP BE1: Adding child NetperfDemoRunner
which: Found netperf executable at /usr/local/bin/netperf
Ping (ms) UDP BE2: Adding child IrttRunner
which: /usr/local/bin/irtt is not an executable file
which: /usr/bin/irtt is not an executable file
which: /bin/irtt is not an executable file
which: /usr/sbin/irtt is not an executable file
which: /sbin/irtt is not an executable file
which: /usr/local/share/dotnet/irtt is not an executable file
UDP RTT test: Cannot use irtt runner (No irtt binary found in PATH.). Using netperf UDP_RR
Ping (ms) UDP BE2: Adding child NetperfDemoRunner
which: Found netperf executable at /usr/local/bin/netperf
Ping (ms) UDP BE3: Adding child IrttRunner
which: /usr/local/bin/irtt is not an executable file
which: /usr/bin/irtt is not an executable file
which: /bin/irtt is not an executable file
which: /usr/sbin/irtt is not an executable file
which: /sbin/irtt is not an executable file
which: /usr/local/share/dotnet/irtt is not an executable file
UDP RTT test: Cannot use irtt runner (No irtt binary found in PATH.). Using netperf UDP_RR
Ping (ms) UDP BE3: Adding child NetperfDemoRunner
which: Found netperf executable at /usr/local/bin/netperf
which: Found fping executable at /usr/local/bin/fping
which: /usr/local/bin/ping is not an executable file
which: /usr/bin/ping is not an executable file
which: /bin/ping is not an executable file
which: /usr/sbin/ping is not an executable file
which: Found ping executable at /sbin/ping
WARNING: Found fping, but couldn't parse its output. Not using.
ERROR: Runner Ping (ms) ICMP failed check: Cannot parse output of the system ping binary (/sbin/ping). Please install fping v3.5+.
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/flent/aggregators.py", line 123, in collect
    t.check()
  File "/usr/local/lib/python3.9/site-packages/flent/runners.py", line 1196, in check
    self.command = self.find_binary(host=self.host, **args)
  File "/usr/local/lib/python3.9/site-packages/flent/runners.py", line 1287, in find_binary
    raise RunnerCheckError(
flent.runners.RunnerCheckError: Cannot parse output of the system ping binary (/sbin/ping). Please install fping v3.5+.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/flent/__init__.py", line 60, in run_flent
    b.run()
  File "/usr/local/lib/python3.9/site-packages/flent/batch.py", line 618, in run
    return self.run_test(self.settings, self.settings.DATA_DIR, True)
  File "/usr/local/lib/python3.9/site-packages/flent/batch.py", line 515, in run_test
    res = self.agg.postprocess(self.agg.aggregate(res))
  File "/usr/local/lib/python3.9/site-packages/flent/aggregators.py", line 237, in aggregate
    measurements, metadata, raw_values = self.collect()
  File "/usr/local/lib/python3.9/site-packages/flent/aggregators.py", line 125, in collect
    raise RuntimeError("Runner %s failed check: %s" % (n, e))
RuntimeError: Runner Ping (ms) ICMP failed check: Cannot parse output of the system ping binary (/sbin/ping). Please install fping v3.5+.

If you're not testing your WAN (ISP) connection, all you need to use flent is a netperf server ("netserver") and the ability to respond to pings (an irtt client/server is recommended).

It is possible to install netperf (with a netserver) via opkg on your OpenWrt wifi AP: opkg update; opkg install netperf. You will need to keep an eye on your AP CPU usage to make sure you're not making observations that are limited by the AP CPU.

You can record the AP CPU usage (plus other AP stats if you're interested) via flent if you also install flent-tools (opkg install flent-tools). To get the stats, I use a passwordless ssh key on the AP and then include --test-parameter=cpu_stats_hosts=<AP-host> on the flent command line.
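
A full invocation might look something like this (a sketch; root@ap.lan is a placeholder for your AP's ssh-reachable address):

flent rrul_be -l 60 -H netperf-eu.bufferbloat.net --test-parameter=cpu_stats_hosts=root@ap.lan -o result.png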

Again, I've tried both a netperf server and an irtt server running from a macOS box - I don't recommend you do this except to convince yourself it's not a good idea (but running flent from macOS out to a non-macOS server may still be useful). Linux netperf/irtt servers can work (just look out for the wifi powersave feature mentioned above).

Keep in mind, ATF (and I suspect AQL) have the most impact when streaming from your AP to your clients. Streaming from your client to your AP will not be as informative.

HTH

1 Like

Sorry to have wasted your time; I cannot think of why it is unable to parse the output if you are using the right binary.

@moeller0

Your contributions are always welcome. In my case I was kind of flashing on some old history, and directing Rodney Dangerfield in the direction of an uncaring universe.

I am happy we have got so many people pitching in, frustrated with all the bugs in wifi here and elsewhere, and wishing there was funding for at least flent.

2 Likes

I have a TP-Link EAP245v3 (QCA9982) and had wifi latency issues that were really annoying when using SSH for example (scrolling was very laggy).

This seems to be fixed by the latest airtime fairness patches on 22.03 (my current build is from commit 32e9095662).
However, I noticed strange latency issues when testing bufferbloat on my DSL line today: latency is a lot higher when the line is idle. While downloading and uploading, latency is excellent. The Waveform bufferbloat test shows this very well:

This seems to be caused by my wifi connection - if I connect via ethernet, latency is great, even when idle:

I guess this shouldn't happen - or is something like that to be expected over wifi?

Here is a flent plot (via wifi), if that helps:

2 Likes

Is SQM on or off on the router? You are bottlenecked at the ISP link, not the wifi, in both tests.

Idle latency being poor on wifi is normal, due to the impact of powersave (usually a minimum of ~100ms).
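
To rule powersave in or out on the client, you can inspect and toggle it (assuming wlan0 is the client's wifi interface):

iw dev wlan0 get power_save
iw dev wlan0 set power_save off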

The gross disparity between the first BE flow and the other flows is puzzling. The four should have converged, long before they did, and what happened at T+250 is really puzzling. What's the client driving the test?

You actually should not be seeing 0 bufferbloat on this test, but somewhere between 5-20ms.

1 Like

Ah, this explains the idle latency, thanks!

SQM on the router is on. It's an APU2 that handles the PPPoE connection. That's also the node running netserver for flent, so the flent plot should show the wifi speed. The flent client is a Thinkpad P14s with a Realtek RTL8852AE card.

Netserver should be able to drive the network harder than that. A rrul test on APU2-class hardware should be able to crack 500Mbits over ethernet. (Low latency at the cost of this much bandwidth seems odd.)

Are you using BBR?

No, I don't use BBR. I have to check if Fedora (installed on my laptop) sets any fancy networking options, but I doubt that.
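
For reference, a quick way to check which congestion control and default qdisc are in use (standard Linux sysctls, nothing Fedora-specific):

sysctl net.ipv4.tcp_congestion_control
sysctl net.core.default_qdisc
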
Here is a flent plot of a run via ethernet:

This also shows a big difference between flows. But since this doesn't seem to be related to wifi, should I open another thread? This one is already huge...

1 Like

Heh. Just having baseline performance figures for this hardware on this thread would also be good, so no need to fork it. You've now shown that this device could - if it was working right - drive the wifi to saturation, well past 40Mbit, well past 200Mbit even, and it isn't, due to some other problem we have not found yet. On some other bug thread here, people are reporting problems with "ax" mode; try ac?

On the ethernet front...

My guess is that the APU2 has 4 hardware queues and you don't have irqbalance installed. (A "tc -s qdisc show dev the_lan_network_device" would show mq + 4 instances of fq_codel.) In this test two flows landed in one hardware queue; another ended up well mapped to the right cpu, the other less so. tc -s qdisc show on your fedora box will probably also show mq + fq_codel.

A test of the lan ethernet device with just fq_codel on it (tc qdisc replace dev the_lan_device root fq_codel) will probably show the downloads achieving parity between each other, but not a full gbit.

The icmp-induced latency looks to be about right (it's usually ~500us); the induced udp latency of > 5ms is surprisingly high. I'd suspect TSO/GRO. Trying cake on the lan interface (without mq but with the gso-splitting option) would probably cut that (due to cutting BQL size). There are numerous other subsystems in play, like TSQ.

Trying 4 instances of cake with gso-splitting on mq would also be interesting; see the sketch below.
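
A minimal sketch of those experiments (assuming eth0 is the lan device and that it exposes 4 tx queues; note that cake's actual keyword is split-gso):

# one fq_codel instance, replacing mq entirely
tc qdisc replace dev eth0 root fq_codel

# one cake instance, splitting superpackets back into packets
tc qdisc replace dev eth0 root cake split-gso

# or one cake per hardware queue under mq
tc qdisc replace dev eth0 root handle 1: mq
tc qdisc replace dev eth0 parent 1:1 cake split-gso
tc qdisc replace dev eth0 parent 1:2 cake split-gso
tc qdisc replace dev eth0 parent 1:3 cake split-gso
tc qdisc replace dev eth0 parent 1:4 cake split-gso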

The world keeps bulking up things on us. So much of Linux's development is done on really high-end boxes, and some recent modifications, like running more of the stack on rx, looked good on machines with large caches but I suspect hurt on everything else.

What does OpenWrt use for RT_PREEMPT and clock ticks these days?

1 Like

These may be of interest:
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=6fcc06205c15bf1bb90896efdf5967028c154aba
and
https://lwn.net/Articles/883713/

I applied this patch:

--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -4764,7 +4764,6 @@ static void ath10k_mac_op_wake_tx_queue(struct ieee80211_hw *hw,
 					struct ieee80211_txq *txq)
 {
 	struct ath10k *ar = hw->priv;
-	int ret;
 	u8 ac;
 
 	ath10k_htt_tx_txq_update(hw, txq);
@@ -4777,11 +4776,9 @@ static void ath10k_mac_op_wake_tx_queue(struct ieee80211_hw *hw,
 	if (!txq)
 		goto out;
 
-	while (ath10k_mac_tx_can_push(hw, txq)) {
-		ret = ath10k_mac_tx_push_txq(hw, txq);
-		if (ret < 0)
-			break;
-	}
+	if (ath10k_mac_tx_can_push(hw, txq))
+		ath10k_mac_tx_push_txq(hw, txq);
+
 	ieee80211_return_txq(hw, txq, false);
 	ath10k_htt_tx_txq_update(hw, txq);

and maybe it is better - I have not seen the "ath10k_ahb a000000.wifi: failed to lookup txq for peer_id X tid X" bug in the log for about 24h of use (max 7 clients, and an internet connection under maximum load). The connection in the 10-meter test room is better (now about 15-20Mbits; without the patch it was about 2Mbits and sometimes 12-15)...

How long will the tests stay good? I don't know yet :slight_smile:

We won't be ready for BIG TCP for a few more years. Certainly I can see the core Google devs not testing anything on weak hw.

2 Likes

mumbai.starlink.taht.net

2 Likes

You are about the only person I know that can offer reasonably nearby RTT/OWD reflectors, independent of who asks from wherever :wink: Just waiting for mcmurdo.starlink.taht.net to come on-line :wink:

2 Likes

Sorry, I should have done my homework...
irqbalance was installed but not enabled (something I regularly forget on new OpenWrt installs). With irqbalance and just cake on eth1+2, latency indeed looks great:
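
For anyone else who forgets this: on OpenWrt, installing the irqbalance package is not enough, since it ships disabled. A sketch of enabling it (assuming the package's default config section name, irqbalance):

uci set irqbalance.irqbalance.enabled='1'
uci commit irqbalance
/etc/init.d/irqbalance enable
/etc/init.d/irqbalance start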

So back to the EAP245v3. It only supports 802.11ac; the previous plot was ac as well. It's still slow, but at least it looks less chaotic.

I don't know what happened after 250s. I had to reconnect, but dmesg on the AP didn't reveal anything. This was right after a reboot. I'll test again after a few hours/days of uptime, since wifi tends to get worse after a few days.

2 Likes

As it has been many months of effort for all concerned here, I am in search of long-term stability first, regaining performance later. How many other simultaneous bugs do we still have? mt76 still has a range problem at 2.4GHz, there are other ath10k chipsets to try (are you using -ct, stock, -ct-smallbuffers?), nobody's tested the ath9k lately, qosify vs wifi of any sort, the "iphone leaving the network" problem, and I forget what else.

More https://en.wikipedia.org/wiki/Six_Sigma

Less agile.

If we can get down to where all that stuff goes away: my original goal this quarter was to find more ways of making wifi more capable of twitch and cloud gaming! There's a bunch of easy stuff along that road that I'd have hoped to have tried by now, but setting the bar right now at "not crashing" and "not doing weird stuff over long intervals", and getting confirmation from everyone testing that those are the results we're getting across more chipsets, will make for a MUCH better next OpenWrt release, which I'd settle for!

Anyway, your wifi result is very good. The latency inflation you see is typical of the codel settings (target 20ms) and the number of txops (2 each way, max 5.7ms) we currently use. (I would like to reduce these in the future.) It does seem to be quite a lot less than what ac is capable of, and the upload especially seems low. I don't know anything about the driver for that chipset.

One thing to explore is to try and figure out whether it is the rate controller or some form of starvation that is limiting your bandwidth. (From your results above it "feels" like the rate controller.) Doing an aircap during a short test can pull out what the advertised rates and typical transmissions actually are; google for "aircap tshark". I don't remember enough about how to get at ath10k rate controller stats from the chip, or even if that's possible. @nbd? @tohojo?
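
A minimal monitor-mode capture sketch (assuming phy0 is a radio already tuned to the channel under test; the interface and file names are placeholders):

iw phy phy0 interface add mon0 type monitor
ip link set mon0 up
tshark -i mon0 -w aircap.pcap -a duration:30

The radiotap headers in the resulting pcap show the actual rate/MCS used per frame.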

Another simpler test is to just test downloads and uploads separately.

flent -H APU -t the_test_conditions --step-size=.05 --te=download_streams=4 --socket-stats tcp_ndown

flent -H APU -t the_test_conditions --step-size=.05 --te=upload_streams=4 --socket-stats tcp_nup

Another is setting the range of rates manually.
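
For instance, to cap a 5GHz HT link to single-stream rates (a sketch, using the iw syntax linked earlier in the thread; wlan0 is a placeholder):

iw dev wlan0 set bitrates ht-mcs-5 0 1 2 3 4 5 6 7

Running iw dev wlan0 set bitrates with no arguments should restore the defaults.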

Also, if -l 300 is reliably cutting off after 250s, that is either a flent bug or a wifi bug! I can think of a multiplicity of causes for that - a dhcp renew failure, an overflow bug somewhere in the DRR implementation... if it's repeatable, that would be weirdly comforting. If it were tied to the rate being achieved (say we got twice the throughput by testing tcp_ndown, and it crashed in half the time), that would also be "better".

-l 600, maybe. I'd be really happy if folk ran the rrul_be, rrul, and rtt_fair tests for days and days at a time (I think the upper limit is about 2000s per test). Six Sigma... One of the nice things about fq_codel-derived solutions is that - when it's working right - you hardly notice when something else is saturating the system.
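
A trivial soak-test loop for the inclined (a sketch; <server> is your netserver host, and the timestamped title keeps the data files from colliding):

while true; do
    flent rrul_be -l 2000 -H <server> -t soak-$(date +%s)
done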

4 Likes