24.10 EdgeRouter X, poor SQM performance

My EdgeRouter X performs great with 22.03 and 23.05 fq_codel SQM on a ~100Mbps connection, but after upgrading to 24.10 performance absolutely plummeted. Without a router, I get approximately +30ms latency when downloading and +80ms when uploading. With 22.03 and 23.05, both drop to +0ms. With 24.10 it's closer to +10ms and +130ms, and speeds seem to fluctuate a bit more. I'm not sure where to begin troubleshooting, so any advice is appreciated.

Do you need SQM at all?
Please show results from https://www.waveform.com/tools/bufferbloat after disabling SQM and rebooting the router.
Then with firewall software offload enabled.
Then with firewall hardware offload.
Then disable firewall offload again.
Now set SQM download/ingress to zero and upload/egress to half of what you achieved in the no-offload case.
Measure.
Step up toward the upload bandwidth; when latency grows, step back.
If download is still bad, repeat with the ingress direction.
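
The stepping procedure above could be sketched roughly as follows. This is only an illustration: the helper name `rate_steps` and the number of steps are made up for the example, and the SQM section name `eth1` in the comments is an assumption (check `/etc/config/sqm` for the real one).

```shell
#!/bin/sh
# rate_steps MEASURED_KBIT [STEPS]
# Print candidate shaper rates (kbit/s), from half of the measured
# no-offload throughput up to the full measured rate.
rate_steps() {
    awk -v m="$1" -v n="${2:-4}" 'BEGIN {
        for (i = 0; i <= n; i++) printf "%d\n", m / 2 + i * (m / 2) / n
    }'
}

# Example: roughly 120 Mbit/s measured without SQM and without offload.
rate_steps 120000 4

# Applying one candidate by hand on the router
# (the section name 'eth1' is an assumption):
#   uci set sqm.eth1.download='0'      # 0 leaves ingress unshaped
#   uci set sqm.eth1.upload='60000'
#   uci commit sqm && /etc/init.d/sqm restart
```

Run a loaded-latency test after each step and stop at the last rate where latency under load stays flat.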

I would manage expectations with an ERX: its CPU is very slow. With SQM, make sure packet steering is enabled and set to all CPUs under /luci/admin/network/network.

If that doesn't resolve it, try to install and enable luci-app-irqbalance. It might help a bit.

Obviously disable HFO (hardware flow offloading), since that bypasses QoS. If those don't resolve it, try enabling SFO (software flow offloading); it might work in combination with SQM but is unlikely to help much. Both are now under a dropdown box on /luci/admin/network/firewall.

For SQM, are you sure it's set up right under /luci/admin/network/sqm? Make sure it's enabled and on the correct interface. Set fq_codel and the per-packet overhead too.

Yes, I need SQM. As mentioned, without SQM I get +30ms when downloading and +80ms when uploading. Firewall offloading has no noticeable impact. The ingress and egress targets are accurate for my connection and perform great before 24.10, and I don't plan on reducing them when I can just run 23.05 instead.

I knew that going into the purchase, but it manages fq_codel SQM at 100Mbps quite well with 22.03 and 23.05. I think I pushed it to 150ish Mbps when I had a faster connection. I believe I tried tweaking packet steering, but it didn't seem to help. I'll give irqbalance a shot later, since I've managed to get back to 23.05 and the partition migrating process is a bit of a pain. And yeah, all the SQM settings were set according to the wiki and consistent between versions.

Please show the numbers.

Is your connection symmetric?

Please show the numbers.

Okay. For the third time. They haven't changed.

Without SQM on any version:
Unloaded latency: 15ms
DL loaded at ~120Mbps: +30ms
UL loaded at ~120Mbps: +80ms

With SQM on 22.03 or 23.05 (fq_codel, simple.qos, targets set to 100Mbps):
Unloaded latency: 15ms
DL loaded at consistent 100Mbps: +0ms
UL loaded at consistent 100Mbps: +0ms

With SQM on 24.10 (exact same SQM settings):
Unloaded latency: 15ms
DL loaded at 80-100Mbps: +10ms
UL loaded at 80-100Mbps: +130ms

If your only advice is to sacrifice even more bandwidth when the hardware is perfectly capable with a previous version of the software, I don't think you are worth listening to.

And @hecatae, yes, symmetric at 100-120Mbps, varying throughout the day.

I asked whether firewall offload improves throughput, i.e. whether the router is capped by CPU resources even without any SQM. And please post links.
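
One rough way to check the CPU-bound hypothesis directly on the router is to compare two samples of the aggregate `cpu` line in `/proc/stat` taken during a speed test. The sketch below is only an illustration: the helper name `cpu_busy_pct` is made up, and the aggregate figure can hide a single saturated core, so treat it as a hint rather than proof.

```shell
#!/bin/sh
# cpu_busy_pct SAMPLE1 SAMPLE2
# Each sample is the first line of /proc/stat:
#   cpu user nice system idle iowait irq softirq steal ...
# Prints the percentage of time spent non-idle between the two samples.
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i   # user .. steal
            idle = $5 + $6                         # idle + iowait
        }
        NR == 1 { t0 = total; i0 = idle }
        NR == 2 {
            dt = total - t0
            if (dt > 0) printf "%.1f\n", 100 * (1 - (idle - i0) / dt)
            else print "0.0"
        }'
}

# Usage on the router while a speed test is running:
#   a=$(head -n 1 /proc/stat); sleep 5; b=$(head -n 1 /proc/stat)
#   cpu_busy_pct "$a" "$b"
```

A value close to 100 during the test would suggest the router itself is the bottleneck.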

Now, that is exactly my advice :wink: but hear me out, this is useful as a diagnostic.

I also want to see the output of:

ifstatus wan | grep device
cat /etc/config/sqm
tc -s qdisc
tc -d qdisc

to get an idea how sqm is set up...
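
To make that output easier to paste as one block, the four commands could be wrapped in a small script. This is just a convenience sketch: `run_labeled` and `collect_sqm_diagnostics` are made-up helper names, and only the commands listed above are actually run.

```shell
#!/bin/sh
# run_labeled COMMAND_STRING
# Print the command as a header, then its output (stdout and stderr).
run_labeled() {
    printf '### %s\n' "$1"
    sh -c "$1" 2>&1 || printf '(command failed or unavailable)\n'
}

# Run the diagnostics requested in this thread, in order.
collect_sqm_diagnostics() {
    run_labeled 'ifstatus wan | grep device'
    run_labeled 'cat /etc/config/sqm'
    run_labeled 'tc -s qdisc'
    run_labeled 'tc -d qdisc'
}

# On the router:
#   collect_sqm_diagnostics > /tmp/sqm-diag.txt
```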

EDIT: fixed typo, thanks @hecatae

@moeller0 I think you are missing a t in that output?

Now that I can understand.

ifstatus and SQM config on 23.05, but eth0 was definitely still my WAN device on 24.10 and I copied the config file verbatim after updating:

# ifstatus wan | grep device
	"l3_device": "eth0",
	"device": "eth0",

# cat /etc/config/sqm
config queue 'eth1'
	option interface 'eth0'
	option debug_logging '0'
	option verbosity '5'
	option linklayer 'ethernet'
	option qdisc 'fq_codel'
	option script 'simple.qos'
	option overhead '44'
	option download '100000'
	option upload '100000'
	option enabled '1'

Do you want to see tc for 23.05, or just 24.10 for various bandwidth targets?

In theory, comparing these outputs might be interesting, but since the goal is to get 24.10.0 up, let's start with 24, please.

Quick note with:

	option overhead '44'
	option download '100000'
	option upload '100000'

You can maximally expect the following TCP/IP throughput
IPv4: 100 * ((1500-20-20)/(1500+44)) = 94.56 Mbps
IPv6: 100 * ((1500-40-20)/(1500+44)) = 93.26 Mbps

So getting 100 in a speedtest is suspicious.
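
The same arithmetic can be reproduced with a short helper, sketched here under the assumption of a 1500-byte MTU and a 20-byte TCP header with no options. The name `max_goodput` is made up; the shaper rate, per-packet overhead, and IP header size are passed as parameters so other values can be tried.

```shell
#!/bin/sh
# max_goodput SHAPER_MBPS OVERHEAD_BYTES IP_HEADER_BYTES
# Upper bound on TCP payload throughput through a shaper that accounts
# for OVERHEAD_BYTES on top of every full-size packet.
max_goodput() {
    awk -v rate="$1" -v oh="$2" -v ip="$3" 'BEGIN {
        mtu = 1500   # assumed IP packet size
        tcp = 20     # assumed TCP header without options
        printf "%.2f\n", rate * (mtu - ip - tcp) / (mtu + oh)
    }'
}

max_goodput 100 44 20   # IPv4, with the overhead from the quoted config
max_goodput 100 44 40   # IPv6, same overhead
```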

Could you post a screenshot of a capacity test here please:

Cloudflare gives slightly different numbers than Waveform and there is always a hefty amount of noise in any of these measurements, but fine. I think I'm done being interrogated about these measurements, though. If you don't want to believe me, I'll just run 23.05 and move on.

This is not about belief or disbelief, but about getting well-described data to reason about what might have happened. I am not sitting at your router and cannot get this data myself, so we need to work together and try to work around the language barrier as well (English is not my first language). Yet, if you prefer to leave it as it is, I am fine with that as well.

I have provided plenty of accurate data. It obviously isn't sufficient to diagnose the problem, but screenshots of single samples of that data will not clarify anything. Asking for them just means you don't trust the data I already shared. Perhaps if flashing between 23.05 and 24.10 on the ERX was less annoying I would entertain this some more, but since most of the replies here have been more interested in proving my measurements wrong than diagnosing the problem, I am going to leave it as it is.

You provided exactly zero waveform links with latency distributions.

Maybe, maybe not, I am happy to work with you to help trying to figure out what went wrong between 23 and 24, but you also need to work with me.

Indeed, so I expect that I will have to ask for more data in due course, I am/was trying to get an overview of the current situation with 24. Sure you know this, but I am at a loss and hence would need some data, if that rubs you the wrong way, maybe we should stop.

Not really, all it means is that I wanted to see a specific set of speedtest and information.

Mmmh, fair enough, however to understand why 24 is misbehaving without being able to step by step diagnose 24 seems next to impossible, at least I am out of my league with that.

Well, I am sad you see it this way...

...but that is your choice.

However, that should have given me pause in retrospect... if someone asking for help starts by drawing red lines, maybe my spare time is better spent elsewhere.
That comes across as pretty rude and entitled...

Can you see why this is needlessly frustrating to work with? If you doubt me at the very first step, it doesn't bode well for future steps.

When you wanted additional data, like the tc information, I was happy to provide once I got around to flashing 24.10 again. I definitely didn't expect anyone to have any proposed solutions before I could provide additional data from 24.10. You asking for the same data that I already provided multiple times is what rubs me the wrong way.

I'm sad you see it this way, but I have better things to do than go back and forth with brada4 when they do nothing but demand the same data in a different form that is impossible to provide without dedicating a significant amount of time to their request. They can have their links if I ever bother flashing 24.10 for another reason, but I'm not doing it just for that.