Adding OpenWrt support for Xiaomi AX3600 (Part 1)

Leak, delayed free, periodic cleanup... I could agree with any of those theories.

I've been trying different iperf3 workaround settings over the past few weeks: router-to-router, WDS chains, exercising just one or both ath11k radios, bandwidth-limiting iperf3. I always have all three radios enabled. Traffic never passes over ath10k for the workaround. (Adding OpenWrt support for Xiaomi AX3600 - #5636 by dspalu32)

  • If traffic is zero or near-zero, memory usage keeps growing, often to the point of OOM.
  • If traffic bursts high often, it holds a solid low usage value: e.g. full-rate iperf3 every 10 minutes = a flat 150 MB free.
  • Between those two extremes, if iperf3 is run less frequently (e.g. every 30 minutes) and/or bandwidth-limited (-b 10M or -b 100M), there's a sawtooth profile to RAM usage, with the peak-trough magnitude directly affected by transfer rate. The workaround is effectively useless at 1M, OK at 10M, and better again at 100M+ (a minimal cron sketch follows this list).
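
For reference, the workaround is just a cron entry along these lines (the server IP, rate, and interval below are placeholders, pick your own):

# /etc/crontabs/root - bandwidth-limited iperf3 burst every 10 minutes
# 10.0.0.2 stands in for whatever box runs "iperf3 -s" on the far end
*/10 * * * * iperf3 -c 10.0.0.2 -b 100M -t 10 >/dev/null 2>&1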

I never tried to trace the source of the RAM usage. Presumably lsmod would show it accumulating in ath11k? (Though lsmod only reports a module's static size, so logging free memory over time is probably more telling, see the sketch below.)
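
Untested sketch of such a logger:

# log free memory once a minute to watch the growth/sawtooth
while true; do
    echo "$(date +%s) $(grep MemFree /proc/meminfo)" >> /tmp/memfree.log
    sleep 60
done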

1 Like

This is only my opinion, but searching for workarounds is not a viable option. Efforts should go into at least finding the cause so we can reliably reproduce it, which would give Robi or Ansuel a chance to actually fix it.

Since the low-traffic scenario seems to be the key, I've long been thinking about setting up a single-client environment where traffic on both ends is guaranteed not to reach the wifi stack, so we can simulate a zero-traffic scenario. The problem is that my router carries my live traffic, so...

2 Likes

Yep. I'm already setting these manually in the startup file to do some tuning:

#assign the 4 rx interrupts to one core each (bitmask: 8=CPU3, 4=CPU2, 2=CPU1, 1=CPU0)
echo 8 > /proc/irq/50/smp_affinity
echo 4 > /proc/irq/51/smp_affinity
echo 2 > /proc/irq/52/smp_affinity
echo 1 > /proc/irq/53/smp_affinity

#spread the 3 TCL completion interrupts across 3 CPUs
echo 4 > /proc/irq/73/smp_affinity
echo 2 > /proc/irq/74/smp_affinity
echo 1 > /proc/irq/75/smp_affinity

It definitely helped, but it might not be the optimal solution. Nonetheless: with this and with SW offload + packet steering, I can get gigabit speeds on PPPoE, without NSS. If we can tune this further, that would be a delight.
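
One caveat: the IRQ numbers are specific to my board and can change between builds, so look yours up first. Something along these lines (the ring names are what the DP interrupts are called in my ath11k build, verify on yours):

# find the rx ring and tx completion IRQs before pinning them
grep -E 'reo2host-destination|wbm2host-tx-completions' /proc/interrupts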

3 Likes

@robimarko And what about that 220-napi_threaded.patch? Is that supposed to work with the upstream ath11k or only the qsdk ath11k?

No idea, I was just referring to the SMP affinity settings.

I can confirm that this helps a lot on my setup also, especially for the WireGuard VPN tunnels. I also use irqbalance and haveged though.

Oh yeah, another thing that helps is changing the CPU governor to performance:

# CPU governor
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor

It almost doubles the speeds; I can easily reach 250-400 Mbps over WireGuard.
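
To make it stick across reboots, I just put the same thing into /etc/rc.local as a loop (a sketch, assuming the usual sysfs layout):

# /etc/rc.local - pin all 4 cores to the performance governor at boot
for g in /sys/devices/system/cpu/cpu[0-3]/cpufreq/scaling_governor; do
    echo performance > "$g"
done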

Hmm, currently we use ondemand by default. I was already looking at switching to schedutil, since it looks to be the replacement for ondemand and should be better.

BTW, has anybody seen commits on wlan-open that look like enabling PCI and AHB at the same time?

1 Like

Yes, I've seen that ondemand is used by default. I investigated it a while ago and realised that the CPU cores were mostly "stuck" at lower frequencies, which clearly impacted VPN speeds (and probably NAT performance and other things too). As soon as I switched to performance, I noticed immediate improvements. Combined with the SMP affinities, it bumped the max speeds up by about 1.5-2x.

In my experience with governors, especially on x86 CPUs, you use either performance or the newer schedutil for best results. ondemand is by far the worst of them; performance is king, though.

Well, performance is just a codename for keeping the CPU at the max OPP, nothing else.
It doesn't provide any scaling at all; schedutil should be better than ondemand.

Yeah, of course, it just pins the cores at max frequency. :slight_smile: It's a cheap device; if it burns out in 2 years because of its crappy thermals, c'est la vie.

I think that for people who have many devices connected, or who do NAT, VPN, and lots of firewalling, it's the best option.

If we had the NSS offload fully working, and especially the NSS crypto acceleration (and if Qualcomm weren't the crappy company they are and also provided offloads for WireGuard's algorithms in nss-clients), you could get away with schedutil in most situations. But that's not the case for us.

Well, then make sure that tsens and passive cooling work as intended.
Cause it's gonna get toasty for sure.

I switched the config to schedutil; I meant to do that a while ago but forgot.

4 Likes

Haha, I've been running it like this for about a month now and I don't feel the case heating up, so it's fine :slight_smile: Worst case scenario, I might set up a crontab to switch to schedutil during the night.

I had a Linksys years ago that was practically burning up by default, and it's still going after all these years. The kirkwood/viper one :slight_smile:

Ah, awesome, should definitely help.

It would be great if you could log the temperatures; I'm really interested to see them.

1 Like

Sure, can do, I'm curious also. I can write a script in a sec.

What's the easiest place to get the values from? Do you know the /sys nodes by any chance?

1 Like

/sys/class/thermal/thermal_zoneX/temp

X is 0 to 11

1 Like

I only found this regarding ath11k_ahb and pci

https://source.codeaurora.org/quic/qsdk/oss/system/feeds/wlan-open/commit/?h=NHSS.QSDK.12.1.r2&id=e367130ca1dae2ff43bab1273aa266320f6f6288

Sweet, thanks!

So it seems to hover around 55 °C (the raw values are in millidegrees):

55100 55700 55400 54400 54400 54700 55400 54700 55400 55400 55400 54100

I don't think this should be a concern, right? What would you say are optimal temps for this SoC?

I'll make a script to log them over time and keep you posted.
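
Probably something as simple as this (untested sketch, reading the zones you listed):

#!/bin/sh
# append all 12 thermal zone readings (millidegrees C) once a minute
while true; do
    echo "$(date +%s) $(cat /sys/class/thermal/thermal_zone*/temp | tr '\n' ' ')" >> /tmp/temps.log
    sleep 60
done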

My advice would be to just go through the ath11k code by hand: grep out a list of all the buffers being declared/initialized, as well as the allocation and free calls (kmalloc()/kfree(), dma_alloc_coherent(), and friends). There can't be THAT many.
Go through each relevant part of the code, find any #defines or static variables which might be related (I'd expect that specific traffic threshold to be a static value, if your assumptions are correct).
And simply start playing around with the values until you see different behaviours. I'd probably multiply or divide each relevant value by 2, one at a time, as a starting point.
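
Getting the initial list should only take a couple of greps, something like this (paths assume a mainline-style kernel tree; adjust the patterns to taste):

# from the kernel tree root: list allocation/free sites in ath11k
grep -rnE 'k[mz]alloc|kfree|dma_alloc_coherent|dma_free_coherent' drivers/net/wireless/ath/ath11k/
# and candidate ring/buffer size constants
grep -rniE '#define.*(ring|buf)' drivers/net/wireless/ath/ath11k/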

I was never able to trigger this bug since I always have stuff connected on each radio, but if it affected me, that's how I think I'd go about it.

Even just finding the relevant functions/entry points that trigger this buffer cleanup would be more than enough to implement a temporary stop-gap: you could expose that part of the code as a /sys or /proc node that triggers the free sequence on demand.

Something like:
echo 3 > /proc/sys/vm/drop_caches

Instead of focusing on ugly iperf3 hacks and the like, I think a collaborative effort by those affected would be the best way to finally solve this problem.

This might be worth trying, though the chances are probably slim.
What do you reckon, @robimarko?