Issues with latency spikes when using SQM/Cake (is the slooow ramp-up during speedtests without SQM to blame?)

Sorry, but as I was trying to put my thoughts into somewhat coherent sentences, I noticed my English skills are just not sufficient for that task. This is why I used some translation software. I hope this is okay!

I don't even know where to start... :disappointed_face:

The situation is that I moved into a shared apartment with a friend, and we simply took our existing networks with us. He uses an AVM Fritz!Box 7490 and I use some x86 China box from AliExpress, on which I have installed OPNsense. To connect the networks to the Internet, I bought a NanoPi R6S from Amazon and slapped OpenWrt 24.10.1 on it (following this installation guide and this guide for performance optimization (option #1)). After I deactivated NAT in both the Fritz!Box and the OPNsense router and configured the static routes in OpenWrt, everything worked fine (well, except for the problems with DS-lite, but even that works fine now. Big THANK YOU to @moeller0 and @pavelgl!).
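
For illustration, the static routes in /etc/config/network on the R6S look roughly like this (just a sketch; the subnets and gateway addresses are placeholders, not our real ones):

config route
	option interface 'lan'
	# route to the OPNsense LAN via its WAN address (placeholders)
	option target '192.168.10.0/24'
	option gateway '192.168.1.2'

config route
	option interface 'lan'
	# route to the Fritz!Box LAN via its WAN address (placeholders)
	option target '192.168.20.0/24'
	option gateway '192.168.1.3'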

However, we quickly realized that as soon as one of us starts a download (or even just streams videos), the other can no longer use the internet. You can completely forget about playing Counter-Strike or Call of Duty (yes, I know, typical first-world problem...).

After a brief search, I came across the term “bufferbloat”, which seems to describe the situation pretty well. So I did some research on how to counteract this. I quickly came across this article on the OpenWrt wiki. Although I'm pretty sure I followed everything exactly as described at first, we still had problems.

I don't know how many hours I've spent changing a setting and then using FLENT to check whether the change had a positive or negative effect. All for naught.

The problem: as soon as the bandwidth is fully utilized, there is a sharp (10-25x) increase in latency. The ping drops almost as quickly as it shot up, but it is enough to cause extreme stuttering in games, and there are persistent, noticeable delays for as long as the connection stays that busy. Although the ping is fine most of the time, warnings such as "extrapolation" pop up (and opponents no longer topple over so easily :sweat_smile:). You can see this with every speed test, completely independent of the server used; only the baseline latency is of course slightly higher with more distant servers.

I have noticed that (without SQM) the available bandwidth is not fully utilized at first during these speed tests, especially in the download direction. It first jumps to around 30-40 Mbps and then veeery slowly climbs up to the maximum possible 150-160 Mbps. I can observe the same in the upload direction, although there it doesn't take quite as long for the bandwidth to be fully utilized. Strangely enough, this only happens when I don't run one speed test right after the other, i.e. only when the Internet connection "rests" a little between tests.

As a layman, I suspect a connection... maybe CAKE can't cope with this slow ramp-up in bandwidth?
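
A crude way to watch the ping while a single flow ramps up, independent of any speedtest site (a sketch; the download URL is just a placeholder for any large file):

# run from a LAN client: log the ping during one long download
ping 1.1.1.1 > /tmp/ramp.log &
PING_PID=$!
curl -o /dev/null https://speed.example.net/1GB.bin
kill "$PING_PID"
# show the five worst round-trip times seen during the transfer
grep -o 'time=[0-9.]*' /tmp/ramp.log | sort -t= -k2 -n | tail -n 5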

As I wrote, I initially adopted the settings from the guide in the OpenWrt Wiki (for VDSL2 with PPPoE and VLAN tagging), but have now (after countless FLENT tests) ended up with the following settings:

Basic Settings:
^- Enable this SQM instance: [x]
^- Interface name: pppoe-wan (Is that even correct? Or should I have chosen "eth1" or "eth1.7" here?)
^- Download speed (ingress): 135000 (speedtest without SQM: ~150-160 Mbps)
^- Upload speed (egress): 35000 (speedtest without SQM: ~40-45 Mbps)
^- Enable debug logging: [ ]
^- Log verbosity: warning

Queue Discipline:
^- Queueing discipline: cake
^- Queue setup script: piece_of_cake.qos
^- Advanced Configuration: [x]
^- Squash DSCP (ingress): SQUASH
^- Ignore DSCP (ingress): Allow
^- ECN (ingress): ECN (default)
^- ECN (egress): NOECN (default)
^- Dangerous Configuration: [x]
^- Hard queue limit (ingress): 
^- Hard queue limit (egress): 
^- Latency target (ingress): 
^- Latency target (egress): 
^- Qdisc options (ingress): ingress pppoe-ptm ether-vlan nat
^- Qdisc options (egress): besteffort pppoe-ptm ether-vlan nat

Link Layer Adaptation:
^- Link layer: none (default)

A few notes on this:

  1. According to the speed tests (e.g. OOKLA's, but also our provider Vodafone's own), the real bandwidth is somewhere around 150-160 Mbps in the download and 40-45 Mbps in the upload direction.
  2. According to our DrayTek 167 modem, the actual rate is 177277 kbps down and 46655 kbps up.
  3. I'm extremely unsure whether I have to use pppoe-wan, eth1 or eth1.7 as the interface; I don't find the information in the wiki that clear-cut, which may of course be down to my lacking English skills. Because of this thread here, I settled on pppoe-wan.
  4. I use besteffort in the egress direction because with the default diffserv3 (and also with diffserv4), traffic consistently lands in the wrong tins according to tc -s qdisc show.
  5. It probably makes little sense to use diffserv3 as an ingress option while I have Squash DSCP (ingress) set to SQUASH at the same time, right? Still, that damn ping spike seems to occur less frequently that way. Could be coincidence, though.
  6. As suggested in the wiki, I previously had Link layer set to Ethernet with overhead: select for e.g. VDSL2. and stepped Per Packet Overhead (bytes) in increments of 2 from 22 to 46, as well as Minimum packet size from 68 to 86 (without pppoe-ptm and ether-vlan, of course), testing the result with FLENT each time (a sketch of that loop follows right after these notes). The best result came with an overhead of 34, which is why I now use the two keywords that achieve the same thing instead.
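
Roughly, that sweep boils down to this loop (a sketch; I actually changed the values by hand in LuCI, and it assumes the link layer is still set to Ethernet so the overhead option is honoured):

# step the overhead from 22 to 46 in increments of 2 and test each value
for OH in $(seq 22 2 46); do
	uci set sqm.eth1.overhead="$OH"
	uci commit sqm
	/etc/init.d/sqm restart
	flent rrul -l 60 -H netperf-eu.bufferbloat.net -t "overhead-$OH"
done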

Anyway, with these settings, tc qdisc show now reports:

qdisc cake 805d: dev pppoe-wan root refcnt 2 bandwidth 35Mbit besteffort triple-isolate nat nowash no-ack-filter split-gso rtt 100ms ptm overhead 34
qdisc cake 805e: dev ifb4pppoe-wan root refcnt 2 bandwidth 135Mbit besteffort triple-isolate nat wash ingress no-ack-filter split-gso rtt 100ms ptm overhead 34

I have also made the following changes in /etc/sysctl.conf, which apparently improved the situation somewhat, though here too I can't rule out that the FLENT results only got better by chance:

# disable packet steering
net.core.rps_sock_flow_entries = 0

# increase per-CPU receive queue length
net.core.netdev_max_backlog = 2000

# increase packet processing budget per interrupt
net.core.netdev_budget = 1000

# maximum socket receive buffer size
net.core.rmem_max = 1048576

# default socket receive buffer size
net.core.rmem_default = 262144

# disable timer migration
kernel.timer_migration = 0
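
(These only take effect at boot or when loaded explicitly; to apply and spot-check them right away:)

# reload the settings and verify one of the values
sysctl -p /etc/sysctl.conf
sysctl net.core.netdev_max_backlog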

And in /etc/rc.local:

# set the CPU frequency governor to performance
find /sys/devices/system/cpu/cpufreq -name scaling_governor -exec awk '{print "performance" > "{}" }' {} \;

# disable NIC offload features on eth1
ethtool -K eth1 gso off tso off gro off rx-gro-list off rx off tx off rx-vlan-offload off tx-vlan-offload off

# disable energy efficient ethernet
ethtool --set-eee eth0 eee off
ethtool --set-eee eth1 eee off
ethtool --set-eee eth2 eee off
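
(And to verify after a reboot that the governor and the offload settings actually stuck:)

# all policies should report "performance"
cat /sys/devices/system/cpu/cpufreq/policy*/scaling_governor
# the disabled offloads should show up as "off"
ethtool -k eth1 | grep -E 'segmentation|offload'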

With these changes, the FLENT results (always run directly on the OpenWrt machine) went from this:


to, in the vast majority of cases, this:


But unfortunately, every now and then, also this:


Thanks in advance for reading!

I will reply in English, for the sake of the forum; if you have questions, I am happy to answer specifics in German. The first step is helping me get a better feel for your situation.

So the NanoPi R6S is your primary router then, the one that does the PPPoE/DS-lite tunnel to the ISP and runs SQM?
In that case I would like to see the following pieces of information, taken from a terminal/shell on the R6S:

ifstatus wan | grep -e device
cat /etc/config/sqm
tc -s qdisc

I would also like to see a screenshot of a capacity test from here:

(Achtung/Attention: this contains a small map and your current IP address; if you consider these sensitive, just black them out in the screenshot before posting. The GeoIP information is often wrong, so it does not necessarily require redaction, but IP addresses might be considered sensitive.)

One question: why do you still run the Fritz!Box and OPNsense machines, why not use the R6S as the primary router handling everything?

2 Likes

Thank you so much for your reply! I will try to respond in English now, too, as I also feel rather uncomfortable using any language other than English in a primarily English-speaking forum.

Yes, this is correct. The topology looks like this:

FTTC → APL → DrayTek Vigor 167 → OpenWrt → OPNsense → Zyxel XMG-105 → my PC
                                      ||                          ↳ my TV
                                      |↳ AVM Fritz!Box → roommate's PC
                                      |              ↳ roommate's printer/scanner
                                      ↳ roommate's TV

(The cable between the APL and the modem was replaced with a Cat. 7 Ethernet cable by a DTAG technician just the other day. All the cables within the LAN are Cat. 7 Ethernet cables, too. I didn't bother to mention the WiFi devices.)

Here's the output of the commands you asked me for:

root@openwrt ~# ifstatus wan | grep -e device
"l3_device": "pppoe-wan",
"device": "eth1.7",
root@openwrt ~# cat /etc/config/sqm

config queue 'eth1'
	option enabled '1'
	option interface 'pppoe-wan'
	option download '135000'
	option upload '35000'
	option qdisc 'cake'
	option script 'piece_of_cake.qos'
	option linklayer 'none'
	option debug_logging '0'
	option verbosity '2'
	option qdisc_advanced '1'
	option squash_dscp '1'
	option squash_ingress '0'
	option ingress_ecn 'ECN'
	option egress_ecn 'NOECN'
	option qdisc_really_really_advanced '1'
	option iqdisc_opts 'ingress pppoe-ptm ether-vlan nat'
	option eqdisc_opts 'besteffort pppoe-ptm ether-vlan nat'
root@openwrt ~# tc -s qdisc
qdisc noqueue 0: dev lo root refcnt 2
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc mq 0: dev eth0 root
Sent 1272911864 bytes 1004400 pkt (dropped 4, overlimits 0 requeues 1463)
backlog 0b 0p requeues 1463
qdisc fq_codel 0: dev eth0 parent :2 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 4Mb ecn drop_batch 64
Sent 1202997304 bytes 850341 pkt (dropped 0, overlimits 0 requeues 845)
backlog 0b 0p requeues 845
maxpacket 1506 drop_overlimit 0 new_flow_count 462 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc fq_codel 0: dev eth0 parent :1 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 4Mb ecn drop_batch 64
Sent 69914560 bytes 154059 pkt (dropped 4, overlimits 0 requeues 618)
backlog 0b 0p requeues 618
maxpacket 1506 drop_overlimit 0 new_flow_count 483 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc mq 0: dev eth1 root
Sent 32702461624 bytes 76491166 pkt (dropped 0, overlimits 0 requeues 5)
backlog 0b 0p requeues 5
qdisc fq_codel 0: dev eth1 parent :1 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 4Mb ecn drop_batch 64
Sent 32702461624 bytes 76491166 pkt (dropped 0, overlimits 0 requeues 5)
backlog 0b 0p requeues 5
maxpacket 1514 drop_overlimit 0 new_flow_count 10 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc mq 0: dev eth2 root
Sent 139839104787 bytes 107321592 pkt (dropped 0, overlimits 0 requeues 4)
backlog 0b 0p requeues 4
qdisc fq_codel 0: dev eth2 parent :1 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 4Mb ecn drop_batch 64
Sent 139839104787 bytes 107321592 pkt (dropped 0, overlimits 0 requeues 4)
backlog 0b 0p requeues 4
maxpacket 4542 drop_overlimit 0 new_flow_count 850 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc noqueue 0: dev br-lan root refcnt 2
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth1.7 root refcnt 2
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc cake 805d: dev pppoe-wan root refcnt 2 bandwidth 35Mbit besteffort triple-isolate nat nowash no-ack-filter split-gso rtt 100ms ptm overhead 34
Sent 16149424662 bytes 41122501 pkt (dropped 20606, overlimits 34643030 requeues 0)
backlog 0b 0p requeues 0
memory used: 1086208b of 4Mb
capacity estimate: 35Mbit
min/max network layer size:           29 /    1492
min/max overhead-adjusted size:       64 /    1550
average network hdr offset:            0

Tin 0
thresh         35Mbit
target            5ms
interval        100ms
pk_delay         83us
av_delay          5us
sp_delay          1us
backlog            0b
pkts         41143107
bytes     16180021426
way_inds      2016733
way_miss       342165
way_cols          226
drops           20606
marks               1
ack_drop            0
sp_flows            1
bk_flows            1
un_flows            0
max_len         59200
quantum          1068

qdisc ingress ffff: dev pppoe-wan parent ffff:fff1 ----------------
Sent 83713852519 bytes 65086534 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc cake 805e: dev ifb4pppoe-wan root refcnt 2 bandwidth 135Mbit besteffort triple-isolate nat wash ingress no-ack-filter split-gso rtt 100ms ptm overhead 34
Sent 81173861854 bytes 63379625 pkt (dropped 1706909, overlimits 103240368 requeues 0)
backlog 0b 0p requeues 0
memory used: 6752128b of 6750000b
capacity estimate: 135Mbit
min/max network layer size:           28 /    1492
min/max overhead-adjusted size:       63 /    1550
average network hdr offset:            0

Tin 0
thresh        135Mbit
target            5ms
interval        100ms
pk_delay       1.07ms
av_delay        507us
sp_delay          2us
backlog            0b
pkts         65086534
bytes     83713852519
way_inds      3781114
way_miss       404940
way_cols           49
drops         1706909
marks               1
ack_drop            0
sp_flows            5
bk_flows            1
un_flows            0
max_len          1492
quantum          1514

That's a pretty cool speedtest I wasn't even aware of before. Thanks for pointing out it shows the (surprisingly accurate) geo-location.

With SQM still enabled:

With SQM temporarily disabled:

Yes, I'm aware this adds unnecessary complexity, because we could just set up some firewall rules on OpenWrt so we can't access each other's devices. There's also the fact that my roommate uses his Fritz!Box to its fullest extent, i.e. as a NAS (mainly for saving scans from his printer/scanner), as a DECT base station (including call rules and a phone book with lots of entries) and to send/receive faxes. I also use many OPNsense features that I didn't want to figure out how to replicate on OpenWrt (I really like OpenWrt and think it's an amazing project, but I still prefer the intuitive web UI of OPNsense over OpenWrt's LuCI). When we first moved in together, we had a lot on our minds (deaths in our families and severe health issues), and this seemed to be the simplest solution to a problem we didn't want to spend much time on. And to be honest, except for that weird issue with DS-lite, it was surprisingly easy to get OpenWrt working exactly as we wanted, once I figured out how to disable NAT on the Fritz!Box to prevent double NAT. Maybe at a later date we'll get rid of the redundant routers. In the end, I don't think this should have any effect on the issue discussed in this thread though, given that I ran FLENT and speed tests (using speedtest-go) on the OpenWrt router itself.

PS: I don't have any traffic shaping/schedulers enabled on OPNsense either, but I don't know whether the Fritz!Box does something that could mess with the traffic, because the web UI of that device is anything but transparent in that regard.

1 Like

So, I am on record as not being a fan of the overhead keywords, so I would recommend changing that to something like:

config queue 'eth1'
	option ingress_ecn 'ECN'
	option egress_ecn 'ECN'
	option itarget 'auto'
	option etarget 'auto'
	option verbosity '5'
	option qdisc 'cake'
	option qdisc_advanced '1'
	option squash_dscp '1'
	option squash_ingress '0'
	option qdisc_really_really_advanced '1'
	option linklayer 'ethernet'
	option linklayer_advanced '1'
	option tcMTU '2047'
	option tcTSIZE '128'
	option linklayer_adaptation_mechanism 'default'
	option debug_logging '1'
	option tcMPU '88'
	option enabled '1'
	option interface 'pppoe-wan'
	option upload '35000'
	option script 'layer_cake.qos'
	option overhead '34'
	option iqdisc_opts 'dual-dsthost nat ingress memlimit 32mb'
	option eqdisc_opts 'dual-srchost nat memlimit 32mb'
	option download '135000'
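
(After replacing /etc/config/sqm with that, reload SQM and confirm that the explicit overhead and mpu now show up on both qdiscs:)

/etc/init.d/sqm restart
tc -s qdisc show dev pppoe-wan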

And about that overhead, a bit more research might be in order; yours is the first confirmed DS-lite link I have ever seen data from, so we might need a few iterations to get this right.
What would help is to see the output of ifconfig, but please redact the MAC addresses such that different MACs get different redacted versions while the actual MACs cannot be recovered; this IMHO is especially relevant for WiFi interfaces (if the R6S actually has WiFi, just remove those records completely). What I really want to see are the records for pppoe-wan, eth1, and eth1.7.

That is a lot of hash collisions on upload; I wonder whether the DS-lite tunnel somewhat confuses the flow dissector. cake needs to poke deep into the packets to find the L4 headers for its flow hashing, but if it does not find these headers, all flows will hash into the same bin, which would result both in poor performance under load and loads of collisions... Mind you, that is just a hypothesis/speculation, but at least something to guide further research/testing.
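
A crude way to probe that hypothesis: restart SQM to reset cake's counters, saturate the upload with a single test, and watch whether the collision counter climbs again:

/etc/init.d/sqm restart
# ... run one upload-saturating test, then:
tc -s qdisc show dev pppoe-wan | grep -E 'way_(inds|miss|cols)'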

Regarding the capacity tests: did these use IPv6 or IPv4 (what type of address was reported at the bottom of the map)?

Never say never, but I agree; I was just asking out of curiosity.

Oh, FritzOS has its own traffic shapers, even for ingress (where I think they actually use cake), but even if these are engaged, they should mostly/only affect the traffic going through the FB, not your traffic through OPNsense.
No, what we see is an issue with the SQM instance on the R6S. I guess CPU load might be a factor, as the R6S IIRC has the small A55 cores at the lower CPU numbers and the beefier A76 cores at the higher numbers, while Linux tends to assign CPUs from the bottom...

So that might also be an issue to look at.
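
If the core placement turns out to matter, here is a sketch for moving the WAN NIC's interrupts onto the big cores (on the RK3588S the A76 cores should be cpu4-cpu7, i.e. affinity mask f0, but please verify against /proc/interrupts first; <IRQ> is a placeholder):

grep eth1 /proc/interrupts               # note the IRQ number(s)
echo f0 > /proc/irq/<IRQ>/smp_affinity   # repeat for each IRQ found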

That is not an ideal test of the routing capabilities of the R6S, as it puts additional load on the router that would not occur during normal operation... But let me ask: what did you test against with flent, some machine on the internet (if yes, which one), or just from behind the OPNsense to the R6S?
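
(To separate shaping load from test load, a crude per-core sampler like this could run alongside a speed test:)

# print the per-core idle counters from /proc/stat once per second;
# a core whose idle counter barely moves is saturated
while sleep 1; do
	awk '/^cpu[0-9]/ { print $1, "idle:", $5 }' /proc/stat
	echo ---
done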

1 Like

But those A55 cores should have no problem with 135 Mb/s. If the OP hasn't tried it yet, changing from the Realtek 8169 driver to the 8125-rss one to spread the load onto more cores is a possibility.
Other than that, I agree that speed tests should be performed on end devices; the Waveform bufferbloat test works well for that.

1 Like

Thank you! I have replaced my /etc/config/sqm with what you provided.

Yeah, about DS-lite... I actually called our ISP, Vodafone, about our connection issues about two weeks ago. During that call I mentioned in passing that I had issues with DS-lite, and the support person immediately said something along the lines of "yeah, we actually have many customers complaining about that, I'll just go ahead and disable that for you" (I'm paraphrasing). This is also why a DTAG technician came the other day and replaced the old cable between the APL and the TAE. But as far as I can tell, this really did nothing to improve the situation. Anyway, I guess we're on dual stack now:

root@openwrt ~# ifconfig
br-lan    Link encap:Ethernet  HWaddr [MAC_ADDRESS_A]
inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
inet6 addr: [LINK_LOCAL_IPV6_ADDRESS_A]/64 Scope:Link
inet6 addr: [GLOBAL_UNICAST_IPV6_ADDRESS_A]/60 Scope:Global
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:923932 errors:0 dropped:1482 overruns:0 frame:0
TX packets:3220286 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:371681983 (354.4 MiB)  TX bytes:4414886769 (4.1 GiB)

eth0      Link encap:Ethernet  HWaddr [MAC_ADDRESS_A]
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:4179 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B)  TX bytes:500347 (488.6 KiB)
Interrupt:58

eth1      Link encap:Ethernet  HWaddr [MAC_ADDRESS_B]
inet addr:192.168.0.2  Bcast:192.168.0.255  Mask:255.255.255.0
inet6 addr: [LINK_LOCAL_IPV6_ADDRESS_B]/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:3280378 errors:0 dropped:0 overruns:0 frame:0
TX packets:1080698 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:4459237276 (4.1 GiB)  TX bytes:425914187 (406.1 MiB)
Interrupt:72

eth1.7    Link encap:Ethernet  HWaddr [MAC_ADDRESS_B]
inet6 addr: [LINK_LOCAL_IPV6_ADDRESS_B]/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:3279993 errors:0 dropped:0 overruns:0 frame:0
TX packets:1080691 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:4446061864 (4.1 GiB)  TX bytes:421590597 (402.0 MiB)

eth2      Link encap:Ethernet  HWaddr [MAC_ADDRESS_C]
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:1018586 errors:0 dropped:0 overruns:0 frame:0
TX packets:3220345 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:377580603 (360.0 MiB)  TX bytes:4414890431 (4.1 GiB)
Interrupt:84

ifb4pppoe-wan Link encap:Ethernet  HWaddr [MAC_ADDRESS_D]
inet6 addr: [LINK_LOCAL_IPV6_ADDRESS_C]/64 Scope:Link
UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1
RX packets:127324 errors:0 dropped:0 overruns:0 frame:0
TX packets:127324 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:32
RX bytes:166070218 (158.3 MiB)  TX bytes:166070218 (158.3 MiB)

lo        Link encap:Local Loopback
inet addr:127.0.0.1  Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING  MTU:65536  Metric:1
RX packets:456 errors:0 dropped:0 overruns:0 frame:0
TX packets:456 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:47445 (46.3 KiB)  TX bytes:47445 (46.3 KiB)

pppoe-wan Link encap:Point-to-Point Protocol
inet addr:[PUBLIC_IPV4]  P-t-P:84.59.211.1  Mask:255.255.255.255
inet6 addr: [LINK_LOCAL_IPV6_ADDRESS_D]/128 Scope:Link
UP POINTOPOINT RUNNING NOARP MULTICAST  MTU:1492  Metric:1
RX packets:3279635 errors:0 dropped:0 overruns:0 frame:0
TX packets:1080322 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:3
RX bytes:4419809257 (4.1 GiB)  TX bytes:397811015 (379.3 MiB)

I have restarted the OpenWrt router since my last reply in this thread. Also, please let me know if I accidentally redacted information you needed or failed to redact "compromising" information.

Again, I'm very sorry for not clarifying earlier that we no longer use DS-lite.

Both tests used IPv6, but for the second test Cloudflare for some reason picked a different server that's physically closer to us.

Thanks for letting me know. I had suspected that to be the case, but as you said, I don't think this could possibly affect speed tests executed either on any of my devices or on the OpenWrt router itself.

This is correct, although I had hoped that following that CPU optimization guide (option #1 specifically) would have solved this issue.

:person_facepalming: I hadn't even considered that!

Since I couldn't get netperf/netserver on my dedicated server (OVH) to play nicely with FLENT running on the OpenWrt router (I suspect the remote firewall is the culprit, even with nftables rules added that should have allowed the netperf traffic to pass), I just used netperf-eu.bufferbloat.net every time (see also the bottom text on the FLENT screenshots).

Thank you so much for your reply, too! And sorry that I failed to mention that I'm already using kmod-r8125-rss - 6.6.86.9.015.00-r1!

Could you post a current version of tc -s qdisc after the last reboot again, please?

1 Like

Sure. After the reboot and with your SQM settings applied, it now looks like this:

root@openwrt ~# tc -s qdisc
qdisc noqueue 0: dev lo root refcnt 2
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc mq 0: dev eth0 root
Sent 842451 bytes 6953 pkt (dropped 1, overlimits 0 requeues 12)
backlog 0b 0p requeues 12
qdisc fq_codel 0: dev eth0 parent :2 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 4Mb ecn drop_batch 64
Sent 436004 bytes 4788 pkt (dropped 0, overlimits 0 requeues 5)
backlog 0b 0p requeues 5
maxpacket 561 drop_overlimit 0 new_flow_count 3 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc fq_codel 0: dev eth0 parent :1 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 4Mb ecn drop_batch 64
Sent 406447 bytes 2165 pkt (dropped 1, overlimits 0 requeues 7)
backlog 0b 0p requeues 7
maxpacket 561 drop_overlimit 0 new_flow_count 15 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc mq 0: dev eth1 root
Sent 1243621677 bytes 2935261 pkt (dropped 0, overlimits 0 requeues 9)
backlog 0b 0p requeues 9
qdisc fq_codel 0: dev eth1 parent :1 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 4Mb ecn drop_batch 64
Sent 1243621677 bytes 2935261 pkt (dropped 0, overlimits 0 requeues 9)
backlog 0b 0p requeues 9
maxpacket 1462 drop_overlimit 0 new_flow_count 7 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc mq 0: dev eth2 root
Sent 7681346421 bytes 5743677 pkt (dropped 0, overlimits 0 requeues 1)
backlog 0b 0p requeues 1
qdisc fq_codel 0: dev eth2 parent :1 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 4Mb ecn drop_batch 64
Sent 7681346421 bytes 5743677 pkt (dropped 0, overlimits 0 requeues 1)
backlog 0b 0p requeues 1
maxpacket 1506 drop_overlimit 0 new_flow_count 44 ecn_mark 0
new_flows_len 0 old_flows_len 0
qdisc noqueue 0: dev br-lan root refcnt 2
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth1.7 root refcnt 2
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc cake 8005: dev pppoe-wan root refcnt 2 bandwidth 35Mbit diffserv3 dual-srchost nat nowash no-ack-filter split-gso rtt 100ms noatm overhead 34 mpu 88 memlimit 32Mb
Sent 787611871 bytes 1900518 pkt (dropped 13831, overlimits 1443784 requeues 0)
backlog 0b 0p requeues 0
memory used: 396Kb of 32Mb
capacity estimate: 35Mbit
min/max network layer size:           29 /    1492
min/max overhead-adjusted size:       88 /    1526
average network hdr offset:            0

Bulk  Best Effort        Voice
thresh       2187Kbit       35Mbit     8750Kbit
target         8.33ms          5ms          5ms
interval        103ms        100ms        100ms
pk_delay          0us       3.21ms        163us
av_delay          0us       1.76ms         28us
sp_delay          0us          2us          3us
backlog            0b           0b           0b
pkts                0      1892683        21666
bytes               0    795651336     12291241
way_inds            0        53717            0
way_miss            0        19031           47
way_cols            0          123            0
drops               0        13829            2
marks               0            0            0
ack_drop            0            0            0
sp_flows            0            0            1
bk_flows            0            1            0
un_flows            0            0            0
max_len             0        38720         1280
quantum           300         1068          300

qdisc ingress ffff: dev pppoe-wan parent ffff:fff1 ----------------
Sent 4154046817 bytes 3253628 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc cake 8006: dev ifb4pppoe-wan root refcnt 2 bandwidth 135Mbit diffserv3 dual-dsthost nat wash ingress no-ack-filter split-gso rtt 100ms noatm overhead 34 mpu 88 memlimit 32Mb
Sent 4050530021 bytes 3181890 pkt (dropped 71738, overlimits 5054428 requeues 0)
backlog 0b 0p requeues 0
memory used: 1900800b of 32Mb
capacity estimate: 135Mbit
min/max network layer size:           32 /    1492
min/max overhead-adjusted size:       88 /    1526
average network hdr offset:            0

Bulk  Best Effort        Voice
thresh       8437Kbit      135Mbit    33750Kbit
target            5ms          5ms          5ms
interval        100ms        100ms        100ms
pk_delay          0us        114us        264us
av_delay          0us          8us         57us
sp_delay          0us          3us          5us
backlog            0b           0b           0b
pkts                0      3248480         5148
bytes               0   4153045209      1001608
way_inds            0       280480           33
way_miss            0        19068         2587
way_cols            0            1            0
drops               0        71738            0
marks               0            0            0
ack_drop            0            0            0
sp_flows            0            0            4
bk_flows            0            1            0
un_flows            0            0            0
max_len             0         1492          456
quantum           300         1514         1029

And the Cloudflare speed test now looks like this:

I just realized I completely ignored this part of the question. No, to be clear: I've always run FLENT on the OpenWrt router against netperf-eu.bufferbloat.net. I then copied the resulting .gz files to my desktop PC using scp to open them in the FLENT GUI and take screenshots of the graphs.

1 Like

Could you post the output of ifstatus wan and ifstatus wan6, please?
ACHTUNG: these are mildly sensitive, but typically change after a PPPoE reconnect. So capture the data but do not post it yet; initiate a PPPoE reconnection, check that the new values for the external addresses and prefixes did change, and then post the old, obsolete data. Or, if you trust me, post it via PM :slight_smile: (mind you, there is little reason to trust me and almost no way to verify my trustworthiness :slight_smile: )

Just to reiterate: without SQM, the slow ramp-up of the download bandwidth is present on my desktop PC, on my roommate's desktop PC, on our smartphones, and on the OpenWrt router itself.

Even with your SQM settings, there's still the occasional speed test that shows a massive increase in latency, but this mostly happens only when the connection has been idle for a couple of minutes. No idea if this is just a coincidence or actually a clue. Of course, the issue is especially elusive now that I'm actively trying to catch it (the "Vorführeffekt", where the problem hides the moment you try to demonstrate it, in full force).

About five minutes ago, I also double-checked the perceived delay in online games (Call of Duty: Black Ops 6 specifically; it even showed the "Extrapolation" warning again) once my roommate started watching some videos on YouTube.

I shall do as you tell me:

root@openwrt ~# ifstatus wan
{
    "up": true,
    "pending": false,
    "available": true,
    "autostart": true,
    "dynamic": false,
    "uptime": 7632,
    "l3_device": "pppoe-wan",
    "proto": "pppoe",
    "device": "eth1.7",
    "updated": [
        "addresses",
        "routes"
    ],
    "metric": 0,
    "dns_metric": 0,
    "delegation": true,
    "ipv4-address": [
        {
            "address": "2.206.20.83",
            "mask": 32,
            "ptpaddress": "84.59.211.1"
        }
    ],
    "ipv6-address": [
        {
            "address": "fe80::688a:86d0:a022:c6ad",
            "mask": 128
        }
    ],
    "ipv6-prefix": [

    ],
    "ipv6-prefix-assignment": [

    ],
    "route": [
        {
            "target": "0.0.0.0",
            "mask": 0,
            "nexthop": "84.59.211.1",
            "source": "0.0.0.0/0"
        }
    ],
    "dns-server": [
        "176.95.16.250",
        "176.95.16.251"
    ],
    "dns-search": [

    ],
    "neighbors": [

    ],
    "inactive": {
        "ipv4-address": [

        ],
        "ipv6-address": [

        ],
        "route": [

        ],
        "dns-server": [

        ],
        "dns-search": [

        ],
        "neighbors": [

        ]
    },
    "data": {

    }
}

root@openwrt ~# ifstatus wan6
{
    "up": true,
    "pending": false,
    "available": true,
    "autostart": true,
    "dynamic": false,
    "uptime": 7804,
    "l3_device": "pppoe-wan",
    "proto": "dhcpv6",
    "device": "pppoe-wan",
    "updated": [
        "prefixes"
    ],
    "metric": 0,
    "dns_metric": 0,
    "delegation": true,
    "ipv4-address": [

    ],
    "ipv6-address": [

    ],
    "ipv6-prefix": [
        {
            "address": "2a00:1e:9ac0:8000::",
            "mask": 56,
            "preferred": 3005,
            "valid": 6605,
            "class": "wan6",
            "assigned": {
                "lan": {
                    "address": "2a00:1e:9ac0:8000::",
                    "mask": 60
                }
            }
        }
    ],
    "ipv6-prefix-assignment": [

    ],
    "route": [
        {
            "target": "::",
            "mask": 0,
            "nexthop": "fe80::12e8:78ff:fed4:e147",
            "metric": 512,
            "valid": 3446,
            "source": "2a00:1e:9ac0:8000::/56"
        }
    ],
    "dns-server": [
        "2a01:860::53",
        "2a01:860::153"
    ],
    "dns-search": [

    ],
    "neighbors": [

    ],
    "inactive": {
        "ipv4-address": [

        ],
        "ipv6-address": [

        ],
        "route": [

        ],
        "dns-server": [

        ],
        "dns-search": [

        ],
        "neighbors": [

        ]
    },
    "data": {
        "passthru": "001700202a0108600000000000000000000000532a010860000000000000000000000153"
    }
}

I've run FLENT again to demonstrate that dreaded initial latency spike. From the OpenWrt router:


From my desktop PC (not as bad, but it's really random):


This looks indeed more like dual stack without CG-NAT to me than DS-lite or any CG-NAT abomination.

That gets us back to the slow ramp-up... sure, this can happen somewhere within Vodafone's network, but I have no clue where and why.

One thing to consider is maybe running cake-autorate, if the available capacity fluctuates/ramps up and down reliably?

2 Likes

This is something I had suspected, too. Maybe Vodafone is doing some ISP-side dynamic rate adaptation or something? I'm really not looking forward to calling support again (and having to answer questions akin to "have you tried turning the modem off and on again?" and "are you using WiFi?" for what feels like the 100th time), but I guess I have no choice? But you're fairly confident this is an issue outside of our LAN / not specific to OpenWrt, correct?

I hope running the following was sufficient to make that work at least in theory:

root@openwrt ~# wget -O /tmp/cake-autorate_setup.sh https://raw.githubusercontent.com/lynxthecat/cake-autorate/master/setup.sh^C
root@openwrt ~# sh /tmp/cake-autorate_setup.sh
Detected Operating System: openwrt
Installation directories for detected Operating System:
- Script prefix: /root/cake-autorate
- Config prefix: /root/cake-autorate
Continue with installation? [Y/n] Y
Installing cake-autorate using /root/cake-autorate (script) and /root/cake-autorate (config) directories...

Now edit the config.primary.sh file as described in:
https://github.com/lynxthecat/cake-autorate/tree/master#installation-on-openwrt

3.3.0-PRERELEASE successfully installed, but not yet running

Start the software manually with:
cd /root/cake-autorate; ./cake-autorate.sh
Run as a service with:
service cake-autorate enable; service cake-autorate start

root@openwrt ~# service cake-autorate enable; service cake-autorate start

If so, these are the results.

From the OpenWrt router:



From my desktop machine:



1 Like

Can you try enabling the data, summary and cpu data fields and obtaining a data file encompassing a couple of speed tests and a minute or so of a large file download like say a Linux ISO from a good host? How to do this is explained in the README on GitHub.
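
For the download part, something like this should do (the mirror URL is just a placeholder for any fast host):

# roughly a minute of single-flow download, discarded to /dev/null
timeout 60 wget -O /dev/null https://mirror.example.org/linux.iso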

1 Like

Of course. I hope this is sufficient.

I'll have a look tomorrow if I get a chance. In the meantime, you can try @moeller0's Octave plotting tool listed on the GitHub page.

1 Like

Thanks!

Unfortunately, no matter what I do, I can't seem to get it working. For me, Octave (v10.2.0) just slows down until it eventually freezes once I try to load the log file.

Thanks to AI, I've vibe-coded a Python translation (knowing neither MATLAB nor Python, nor what the graphs are even supposed to show me). These are the results:







Maybe some of this turns out to be helpful by chance?

1 Like

The top graphs look about right... I can run the Octave script fine on Octave 9.4, but at least on macOS, PDF export is still broken...

First observation: the delay thresholds can be set considerably tighter...

1 Like

Yes, perhaps try:

# Delta thresholds
dl_avg_owd_delta_max_adjust_up_thr_ms=8.0 # (milliseconds)
ul_avg_owd_delta_max_adjust_up_thr_ms=8.0 # (milliseconds)

dl_owd_delta_delay_thr_ms=10.0 # (milliseconds)
ul_owd_delta_delay_thr_ms=10.0 # (milliseconds)

dl_avg_owd_delta_max_adjust_down_thr_ms=20.0 # (milliseconds)
ul_avg_owd_delta_max_adjust_down_thr_ms=20.0 # (milliseconds)

So edit the config.primary.sh file to include the above lines at the end and restart.
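
In other words, roughly:

# append the threshold lines above to the config, then:
service cake-autorate restart
logread -e cake-autorate | tail   # confirm it came back up cleanly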

Now open up a terminal and run:

tail -f /var/log/cake-autorate/cake-autorate.primary.log | grep SUMMARY

Now try saturating the connection and observe what happens on the terminal. Try to get a feel for the output lines as this will help you work out whether it's behaving as it should.

Open a separate terminal that pings 1.1.1.1 and verify that the latency remains within a sensible range despite whatever you do with the connection.

If you could re-run with those settings and provide data that'd be ace.

Amazed that you were able to convert @moeller0's code to Python and that it still works.

2 Likes

I assume this is the most important one for experts such as yourself and @Lynx to detect problems? Can you tell me what's wrong with the other graphs? As I have no idea what they should even look like, and they just seemed okay to me, I hadn't asked the AI to explain the logic to me.

I assume this is what @Lynx's proposed changes to the configuration are for?

Thanks, I shall do that next!

My lacking English skills are only part of it; I cannot even begin to explain to you how much all of this is basically Greek to me.

Well, yeah, so am I... given that I, again, have absolutely no idea what I'm doing and don't even know how to program (except for maybe some very much non-POSIX-compliant Bash scripts). :sweat_smile: Those LLMs have come a long way.

1 Like

A super crude test is this one here:

https://www.waveform.com/tools/bufferbloat

Try this with and without SQM.

1 Like