SQM Bufferbloat Issues

I have a 1000/50Mbps DOCSIS 3.1 (coax) connection to my ISP. On older versions of OpenWrt, and even with just a Raspberry Pi 4B, I was able to keep bufferbloat at 0ms (or very close) by setting my SQM speeds to 800000/45000 (download/upload).

However, I started to notice issues with my RPi 4B and thought it was time to switch to an x86 box. Fast forward: I now have an iKOOLCORE R2 and I'm still seeing bad bufferbloat. At first I thought it was because my CPU was clocking down, so after fixing that with intel_pstate=passive in grub and setting min values in /sys, I was able to get a stable 2.5GHz across all cores:

root@ROUTER:~# cat /proc/cpuinfo | grep MHz
cpu MHz         : 2500.000
cpu MHz         : 2500.150
cpu MHz         : 2500.000
cpu MHz         : 2498.211
cpu MHz         : 2500.000
cpu MHz         : 2500.000
cpu MHz         : 2494.878
cpu MHz         : 2500.000

I still had bufferbloat issues, so I checked htop; none of the cores were hitting 100%, so I don't think it's a CPU bottleneck.

I did also disable ipv6 using grub.cfg (ipv6.disable=1), but I was having issues before making this change.

At this point I'm not sure what else to try. If anyone has experienced this issue or has any suggestions, I'd be grateful.

In hopes of helping with troubleshooting, here's my sqm config file:

config queue 'wan'
        option qdisc 'cake'
        option debug_logging '1'
        option verbosity '5'
        option squash_ingress '0'
        option interface 'eth1'
        option ingress_ecn 'ECN'
        option squash_dscp '0'
        option qdisc_advanced '1'
        option qdisc_really_really_advanced '1'
        option iqdisc_opts 'docsis besteffort ingress nat dual-dsthost'
        option eqdisc_opts 'docsis nat ack-filter dual-srchost'
        option script 'layer_cake.qos'
        option linklayer 'ethernet'
        option overhead '22'
        option upload '45000'
        option enabled '1'
        option download '750000'
        option egress_ecn 'NOECN'

Edit: Forgot to include Waveform Results: https://www.waveform.com/tools/bufferbloat?test-id=6b8aa49d-e874-403e-a70d-ef1e3bfe23d8

Set that to 650000 and you'll be doing a lot better, at least based on the achieved download speed.


I'm with @dlakelan: this likely is an oversubscribed segment where your true capacity share is well below the former 8xx Mbps, so your cake shaper never engages but you still see the CMTS bufferbloat.
I also agree with the proposed immediate remedy of reducing the shaper rate; beyond that, you might want to consider one of the autorate scripts.


What NICs are you using? When I tried SQM with Realtek 2.5GbE NICs and a certain cable modem, there was some weird incompatibility between that NIC and my cable modem.

On that machine (an ODROID-H2+), it could do SQM at >1 gigabit when the WAN was connected to a gigabit switch, yet when connected to the cable modem, an SQM limit of 1 gigabit resulted in something like 300-500 megabits of actual throughput.


It uses onboard Intel i226-V NICs (both eth0 and eth1). However, I did swap back to the Raspberry Pi 4B last night and that didn't help, so I switched back to the iKOOLCORE R2.

The first Waveform link in my original post was taken with shaping enabled, my apologies. With sqm off, I was getting a solid 950/54Mbps last night.

This is an older one from back when I was using the Raspberry Pi 4b and it was working properly:

700000/50000:

650000/50000:

500000/35000:

Edit: I had to lower to 500000/30000 to keep upload latency consistently down to 0-1ms. https://www.waveform.com/tools/bufferbloat?test-id=d7755310-d6ea-4291-b899-058c16b5df15

Also, I wanted to share that I am using ethtool to lower the advertised link speed toward both my modem (Arris S33v2) and my primary switch (TP-Link TL-SG116E) to gigabit full duplex.

This is my rc.local:

# Put your custom commands here that should be executed once
# the system init finished. By default this file does nothing.

# This will disable eth2
echo 1 > /sys/bus/pci/devices/0000:04:00.0/remove
# Optimizations for eth0 and eth1
ethtool --set-eee eth0 eee off
ethtool --set-eee eth1 eee off
ethtool -A eth0 autoneg off rx off tx off
ethtool -A eth1 autoneg off rx off tx off
ethtool -K eth0 rx-checksumming off tx-checksumming off scatter-gather off generic-segmentation-offload off generic-receive-offload off hw-tc-offload off tx-gre-segmentation off tx-gre-csum-segmentation off tx-ipxip4-segmentation off tx-ipxip6-segmentation off tx-udp_tnl-segmentation off tx-udp_tnl-csum-segmentation off tx-gso-partial off receive-hashing off
ethtool -K eth1 rx-checksumming off tx-checksumming off scatter-gather off tcp-segmentation-offload off generic-receive-offload off rx-vlan-offload off tx-vlan-offload off hw-tc-offload off tx-gre-segmentation off tx-gre-csum-segmentation off tx-ipxip4-segmentation off tx-ipxip6-segmentation off tx-udp_tnl-segmentation off tx-udp_tnl-csum-segmentation off tx-gso-partial off receive-hashing off
ethtool -K br-lan highdma off tx-checksumming off tx-checksum-ip-generic off generic-receive-offload off tx-vlan-offload off tx-gre-segmentation off tx-gre-csum-segmentation off tx-ipxip4-segmentation off tx-ipxip6-segmentation off tx-udp_tnl-segmentation off tx-udp_tnl-csum-segmentation off tx-gso-partial off tx-tunnel-remcsum-segmentation off tx-esp-segmentation off tx-vlan-stag-hw-insert off
# Advertise only 1000baseT Full (bitmask 0x020) so links negotiate at gigabit
ethtool -s eth0 autoneg on advertise 0x020
ethtool -s eth1 autoneg on advertise 0x020

#/etc/init.d/network restart

echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu4/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu5/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu6/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor

echo -n '2500000' > /sys/devices/system/cpu/cpufreq/policy0/scaling_min_freq
echo -n '2500000' > /sys/devices/system/cpu/cpufreq/policy1/scaling_min_freq
echo -n '2500000' > /sys/devices/system/cpu/cpufreq/policy2/scaling_min_freq
echo -n '2500000' > /sys/devices/system/cpu/cpufreq/policy3/scaling_min_freq
echo -n '2500000' > /sys/devices/system/cpu/cpufreq/policy4/scaling_min_freq
echo -n '2500000' > /sys/devices/system/cpu/cpufreq/policy5/scaling_min_freq
echo -n '2500000' > /sys/devices/system/cpu/cpufreq/policy6/scaling_min_freq
echo -n '2500000' > /sys/devices/system/cpu/cpufreq/policy7/scaling_min_freq

echo -n '2500000' > /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq
echo -n '2500000' > /sys/devices/system/cpu/cpufreq/policy1/scaling_max_freq
echo -n '2500000' > /sys/devices/system/cpu/cpufreq/policy2/scaling_max_freq
echo -n '2500000' > /sys/devices/system/cpu/cpufreq/policy3/scaling_max_freq
echo -n '2500000' > /sys/devices/system/cpu/cpufreq/policy4/scaling_max_freq
echo -n '2500000' > /sys/devices/system/cpu/cpufreq/policy5/scaling_max_freq
echo -n '2500000' > /sys/devices/system/cpu/cpufreq/policy6/scaling_max_freq
echo -n '2500000' > /sys/devices/system/cpu/cpufreq/policy7/scaling_max_freq

(sleep 15; /etc/init.d/ddns restart)&

exit 0
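As an aside, the per-core governor and min/max frequency writes above can be collapsed into one loop (a sketch over the same sysfs paths; policyN/scaling_governor is equivalent to the per-cpu path, and the guard makes it skip policies that don't exist):

```shell
# Same effect as the repeated echo lines above: pin every cpufreq policy to the
# performance governor at a fixed 2.5GHz.
for policy in /sys/devices/system/cpu/cpufreq/policy[0-7]; do
    [ -d "$policy" ] || continue
    echo performance > "$policy/scaling_governor"
    echo -n 2500000 > "$policy/scaling_min_freq"
    echo -n 2500000 > "$policy/scaling_max_freq"
done
```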

I did some testing using iperf3 tonight.

As a sanity test to make sure nothing is wrong with my LAN side, I tested connecting from my laptop over ethernet (gigabit) to the openwrt router:

send_results
{
        "cpu_util_total":       8.700490,
        "cpu_util_user":        2.177608,
        "cpu_util_system":      6.522882,
        "sender_has_retransmits":       0,
        "streams":      [{
                        "id":   1,
                        "bytes":        1137967104,
                        "retransmits":  -1,
                        "jitter":       0,
                        "errors":       0,
                        "packets":      0
                }]
}
get_results
{
        "cpu_util_total":       17.760828,
        "cpu_util_user":        0.747836,
        "cpu_util_system":      17.012992,
        "sender_has_retransmits":       -1,
        "congestion_used":      "cubic",
        "streams":      [{
                        "id":   1,
                        "bytes":        1137836032,
                        "retransmits":  -1,
                        "jitter":       0,
                        "errors":       0,
                        "omitted_errors":       0,
                        "packets":      0,
                        "omitted_packets":      0,
                        "start_time":   0,
                        "end_time":     10.007259
                }]
}
[  4]   9.00-10.00  sec   108 MBytes   910 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-10.00  sec  1.06 GBytes   910 Mbits/sec                  sender
[  4]   0.00-10.00  sec  1.06 GBytes   910 Mbits/sec                  receiver

Reverse of same test:

send_results
{
        "cpu_util_total":       16.295695,
        "cpu_util_user":        4.809217,
        "cpu_util_system":      11.486477,
        "sender_has_retransmits":       -1,
        "streams":      [{
                        "id":   1,
                        "bytes":        1182154372,
                        "retransmits":  -1,
                        "jitter":       0,
                        "errors":       0,
                        "packets":      0
                }]
}
get_results
{
        "cpu_util_total":       1.142716,
        "cpu_util_user":        0,
        "cpu_util_system":      1.142706,
        "sender_has_retransmits":       1,
        "congestion_used":      "cubic",
        "streams":      [{
                        "id":   1,
                        "bytes":        1183186944,
                        "retransmits":  0,
                        "jitter":       0,
                        "errors":       0,
                        "omitted_errors":       0,
                        "packets":      0,
                        "omitted_packets":      0,
                        "start_time":   0,
                        "end_time":     10.004835
                }]
}
[  4]   9.00-10.00  sec   113 MBytes   949 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  1.10 GBytes   947 Mbits/sec    0             sender
[  4]   0.00-10.00  sec  1.10 GBytes   946 Mbits/sec                  receiver

I ran a constant ping and saw <1-2ms while running iperf3 on the LAN, so I don't think bufferbloat on the LAN side is an issue.

Bidirectional test external server (client) to openwrt (server) with sqm:

[ ID][Role] Interval           Transfer     Bitrate         Retr
[  5][TX-C]   0.00-10.00  sec   519 MBytes   435 Mbits/sec    3             sender
[  5][TX-C]   0.00-10.02  sec   517 MBytes   433 Mbits/sec                  receiver
[  7][RX-C]   0.00-10.00  sec  24.4 MBytes  20.4 Mbits/sec   24             sender
[  7][RX-C]   0.00-10.02  sec  24.0 MBytes  20.1 Mbits/sec                  receiver

Bidirectional test openwrt (client) to external server (server) with sqm:

[ ID][Role] Interval           Transfer     Bitrate         Retr
[  5][TX-C]   0.00-10.00  sec  37.1 MBytes  31.1 Mbits/sec   10             sender
[  5][TX-C]   0.00-10.01  sec  36.6 MBytes  30.6 Mbits/sec                  receiver
[  7][RX-C]   0.00-10.00  sec   537 MBytes   450 Mbits/sec    3             sender
[  7][RX-C]   0.00-10.01  sec   534 MBytes   447 Mbits/sec                  receiver

Bidirectional test external server (client) to openwrt (server) without sqm:

[ ID][Role] Interval           Transfer     Bitrate         Retr
[  5][TX-C]   0.00-10.00  sec  1.17 GBytes  1.01 Gbits/sec    7             sender
[  5][TX-C]   0.00-10.03  sec  1.17 GBytes  1.00 Gbits/sec                  receiver
[  7][RX-C]   0.00-10.00  sec  63.8 MBytes  53.5 Mbits/sec    7             sender
[  7][RX-C]   0.00-10.03  sec  60.1 MBytes  50.3 Mbits/sec                  receiver

Bidirectional test openwrt (client) to external server (server) without sqm:

[ ID][Role] Interval           Transfer     Bitrate         Retr
[  5][TX-C]   0.00-10.00  sec  26.4 MBytes  22.1 Mbits/sec   15             sender
[  5][TX-C]   0.00-10.02  sec  25.5 MBytes  21.3 Mbits/sec                  receiver
[  7][RX-C]   0.00-10.00  sec  1.17 GBytes  1.01 Gbits/sec    0             sender
[  7][RX-C]   0.00-10.02  sec  1.17 GBytes  1.00 Gbits/sec                  receiver

ethtool -S eth1 | grep -E "(err|drop)" | grep -v ": 0"
     rx_long_length_errors: 46
     rx_errors: 46
     rx_length_errors: 46
     rx_fifo_errors: 7
     rx_queue_3_drops: 7
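A simple way to see whether those counters actually grow under load (a sketch; same ethtool filter as above, temp file paths are arbitrary):

```shell
# Snapshot the NIC error/drop counters before and after a speed test; if they
# climb during the test, the problem is at the NIC/link, not in the qdisc.
ethtool -S eth1 | grep -E '(err|drop)' | grep -v ': 0' > /tmp/eth1_errs.before
# ... run the waveform / iperf3 test here ...
ethtool -S eth1 | grep -E '(err|drop)' | grep -v ': 0' > /tmp/eth1_errs.after
diff /tmp/eth1_errs.before /tmp/eth1_errs.after
```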

When I tested using Waveform's bufferbloat test, I saw high latency without sqm (as expected), but also with sqm enabled unless I drop my speed down to about half (500000/25000). This wasn't the case before, so I'm not sure what has changed. I'm hopeful it's something obvious that I simply cannot see and that a second pair of eyes can spot.

So here is something to keep in mind: CoDel-style AQMs operate by allowing a small 'standing' queue (controlled by a parameter called 'target' that can be explicitly changed in codel/fq_codel but only indirectly in cake*), and will tolerate larger queues only transiently. Think of a queue as a shock absorber: you typically want it at its default position so it keeps the capability to absorb a shock; if you routinely operate it close to maximum compression, good luck evening out the next bump in the road.
BUT the result of this is that if you operate a link at saturation long enough, your average queueing delay will end up close to target (or, for bidirectionally saturating traffic, often close to 2*target). In other words, with default cake/fq_codel the EXPECTED latency increase under load during a saturating test is around +5ms... (but see **)
Also keep in mind that modern browsers are great jack-of-all-trades tools, but they are not high-precision measurement environments, so expect some delay spikes in web tests like the Waveform bufferbloat test that reflect delays inside the browser, not the network.

*) According to the theory behind CoDel, there are two control variables: interval (called 'rtt' in cake) and 'target'. If the minimum sojourn time for any packet (that is, the time delta between when a packet is enqueued and when that same packet is dequeued) stays above target for a duration of interval, codel will schedule a drop/mark event and at the same time transiently set interval to a smaller value, so if the above-target delay persists, the next mark/drop will be scheduled even earlier. It can be shown that if target is set to 5-10% of interval, this control method results in relatively low queueing delay while still yielding pretty decent throughput. Based on this, cake simply sets its internal target to 5% of the rtt variable...

**) Cake, for all its great features, is a bit CPU hungry in its shaper, and unfortunately, if CPU-starved, its shaper will still try to deliver the set rate while allowing larger delays. So for cake, larger-than-expected delays with the achieved rate staying below the set shaper rate often indicate running out of CPU cycles...


Well, there is always the chance that your ISP's local segment is congested and cannot deliver the contracted rates at acceptable responsiveness... SQM operates on the theory that you use the shaper to make sure SQM is in control of the bottleneck queue, but if you set your shaper to X Mbps and the true bottleneck rate drops below X, SQM will no longer be able to control the bufferbloat effectively...

Maybe have a look at cake-autorate for an adaptive method that tries to adjust the shaper rate dynamically based on experienced delay.


Possibly it's also worth trying fq_codel + simple.qos (instead of cake)?


I agree. TBF is far less taxing on the CPU but lacks some bells and whistles. Nonetheless, I get an A+ at my full 600/30 with proper overheads using simplest_tbf.qos; htop shows an average of 20% across each core at max.

My NICs are i226-V as well. EEE should already be off by default in the driver, so these lines shouldn't be needed:

ethtool --set-eee eth0 eee off
ethtool --set-eee eth1 eee off

You'll need receive offloads such as GRO to achieve speeds greater than 500Mbit, and cake has GSO peeling activated by default for transmit, so I would suggest removing all of these lines (sans TSO) unless you're using a very underpowered device. Even in that case, you'll want to reduce the bandwidth pipe down to acceptable levels, as others have said.

NOTE: Disabling offloads on br-lan won't accomplish anything, as queueing doesn't happen there. The IFB (ingress) would be the other "device" to disable them on, besides the interface itself (egress).

ethtool -K eth0 rx-checksumming off tx-checksumming off scatter-gather off generic-segmentation-offload off generic-receive-offload off hw-tc-offload off tx-gre-segmentation off tx-gre-csum-segmentation off tx-ipxip4-segmentation off tx-ipxip6-segmentation off tx-udp_tnl-segmentation off tx-udp_tnl-csum-segmentation off tx-gso-partial off receive-hashing off
ethtool -K eth1 rx-checksumming off tx-checksumming off scatter-gather off tcp-segmentation-offload off generic-receive-offload off rx-vlan-offload off tx-vlan-offload off hw-tc-offload off tx-gre-segmentation off tx-gre-csum-segmentation off tx-ipxip4-segmentation off tx-ipxip6-segmentation off tx-udp_tnl-segmentation off tx-udp_tnl-csum-segmentation off tx-gso-partial off receive-hashing off
ethtool -K br-lan highdma off tx-checksumming off tx-checksum-ip-generic off generic-receive-offload off tx-vlan-offload off tx-gre-segmentation off tx-gre-csum-segmentation off tx-ipxip4-segmentation off tx-ipxip6-segmentation off tx-udp_tnl-segmentation off tx-udp_tnl-csum-segmentation off tx-gso-partial off tx-tunnel-remcsum-segmentation off tx-esp-segmentation off tx-vlan-stag-hw-insert off

Here's a script to help with disabling all offloads if you want to test on/off scenarios or the limits of your device:

#!/bin/sh

# Disable every toggleable offload on all non-noqueue interfaces
for interface in $(ip -o link show | awk -F': ' '!/noqueue/{print $2}')
do
    echo "Checking offloads for interface: $interface"

    # Walk the feature list and disable anything currently on.
    # Note: match ": off"/": on" including the colon, otherwise feature
    # names containing "offload" false-match the off/on checks.
    ethtool -k "$interface" | while read -r line; do
        if echo "$line" | grep -q "fixed"; then
            echo "Skipping fixed offload: $line"
        elif echo "$line" | grep -q ": off"; then
            echo "Skipping pre-disabled offload: $line"
        elif echo "$line" | grep -q "not supported"; then
            echo "Skipping unsupported offload: $line"
        elif echo "$line" | grep -q ": on"; then
            offload=$(echo "$line" | awk '{print $1}')
            offload=${offload%:} # Remove trailing colon

            # Disable the offload since it's not fixed, pre-disabled, or unsupported
            echo "Disabling offload: $offload"
            ethtool -K "$interface" "$offload" off
        fi
    done
done


So I've re-enabled cake-autorate (I used it in the past, but haven't recently) and I also turned everything back to default (commented out those lines mentioned earlier) except for hw-tc-offload, which I turned off.

ethtool -K eth0 hw-tc-offload off
ethtool -K eth1 hw-tc-offload off

Here's the current ethtool -k settings for the interfaces:

root@ROUTER:~# ethtool -k eth1 | grep -v "fixed" | grep -v ": off"
Features for eth1:
rx-checksumming: on
tx-checksumming: on
        tx-checksum-ip-generic: on
        tx-checksum-sctp: on
scatter-gather: on
        tx-scatter-gather: on
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: on
        tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
receive-hashing: on
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
root@ROUTER:~# ethtool -k ifb4eth1 | grep -v "fixed" | grep -v ": off"
Features for ifb4eth1:
tx-checksumming: on
        tx-checksum-ip-generic: on
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: on
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: on
        tx-tcp-mangleid-segmentation: on
        tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
tx-vlan-offload: on
highdma: on
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-sctp-segmentation: on
tx-udp-segmentation: on
tx-gso-list: on
tx-vlan-stag-hw-insert: on
root@ROUTER:~# ethtool -k eth0 | grep -v "fixed" | grep -v ": off"
Features for eth0:
rx-checksumming: on
tx-checksumming: on
        tx-checksum-ip-generic: on
        tx-checksum-sctp: on
scatter-gather: on
        tx-scatter-gather: on
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: on
        tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
receive-hashing: on
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on

After making the changes and rebooting, I'm still seeing bufferbloat.

When I first ran some tests, download was perfect (+0ms) but upload was +20ms. Now it's lower on the upload but higher on the download. Looking at htop, CPU utilization on all cores doesn't go over 20%.

Config of cake-autorate:

dl_if=ifb4eth1 # download interface
ul_if=eth1     # upload interface

adjust_dl_shaper_rate=1 # enable (1) or disable (0) actually changing the dl shaper rate
adjust_ul_shaper_rate=1 # enable (1) or disable (0) actually changing the ul shaper rate

min_dl_shaper_rate_kbps=225000  # minimum bandwidth for download (Kbit/s)
base_dl_shaper_rate_kbps=450000 # steady state bandwidth for download (Kbit/s)
max_dl_shaper_rate_kbps=900000  # maximum bandwidth for download (Kbit/s)

min_ul_shaper_rate_kbps=11250  # minimum bandwidth for upload (Kbit/s)
base_ul_shaper_rate_kbps=22500 # steady state bandwidth for upload (KBit/s)
max_ul_shaper_rate_kbps=45000  # maximum bandwidth for upload (Kbit/s)

connection_active_thr_kbps=2000  # threshold in Kbit/s below which dl/ul is considered idle
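As an aside, the min/base values above are exactly 25% and 50% of the max rates; if you retune the max later, they can be derived rather than hand-edited (a sketch; those ratios are just what this config happens to use, not a cake-autorate requirement):

```shell
# Derive base (50%) and min (25%) shaper rates from the chosen max rates.
max_dl_shaper_rate_kbps=900000
max_ul_shaper_rate_kbps=45000
echo "dl: base=$((max_dl_shaper_rate_kbps / 2)) min=$((max_dl_shaper_rate_kbps / 4))"
echo "ul: base=$((max_ul_shaper_rate_kbps / 2)) min=$((max_ul_shaper_rate_kbps / 4))"
```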

Based on what I'm seeing with htop, I don't think it's bottlenecking at the CPU. @mindwolf - Since you also have the i226-V, would you mind sharing what kind of internet connection you have and whether you have any issues with it?

I also removed the gigabit negotiation limit and am allowing 2.5Gbps, since I installed a new 2.5G switch and the Arris S33 supports 2.5Gbps (I don't see a way to tell what it negotiated at in the web UI, but I'm also not seeing errors, so I presume it's correct).

Edit: I turned off packet steering and rebooted again. It doesn't really seem to make a difference.

The below is the best I've been able to get after multiple test runs, without making any other changes beyond what I mentioned above, including disabling packet steering (the UI notes it might help or hurt performance). I also know that splitting workloads across CPUs can impact latency, so I figured if a single core is sufficient, why not just disable it. Even with it off, I don't see any single core go over 33%.

Have you tried disabling C-States and Intel P-States?

Probably negligible when it comes to latency, but it might also help to install intel-microcode.


Just a hunch: try swapping the order of eth1 and eth0, so that wan is now eth0 and lan is now eth1, in /etc/config/network, then /etc/init.d/network restart. Edit the sqm settings and switch the cabling to match.


@Hudra - It seems crazy to me too, but so far, after updating the Intel microcode, it has been performing better. Maybe it's a coincidence; time will tell.

@mindwolf - Interesting thought. I did notice that the R2 has a separate board with the last interface (eth0) so I was considering disabling that altogether and using eth2 as eth0 instead for the LAN (that way I don't have to be assigned a new public IP address, either :slight_smile: ). I'll see how things go, and if they persist, that'll be my next step.

Mini PC manufacturers often release BIOS updates infrequently or not at all, which can leave systems without the latest fixes and optimizations for performance, stability, and security. In such scenarios, updating the Intel microcode becomes particularly helpful. Perhaps in your specific case it has indeed provided a notable improvement.

Some ideas and further approaches to optimizing latency beyond the solutions already discussed or tried:

As previously mentioned, disabling Intel P-States and C-States can reduce latency.
Additionally, turning off hyper-threading and virtualization in the BIOS could also be beneficial.


Thanks to everyone for the suggestions, it is genuinely appreciated. It's great to see a helpful community still thriving.

I have already disabled Intel P-States and C-States as far as I can tell, and I e-mailed iKOOLCORE to make sure there aren't any other settings I'm missing. HT isn't a factor on this CPU (the i3-N300 only has efficiency cores), and virtualization is disabled.

I'm getting really good results on download, but upload still jumps a little:

Not sure if that can be optimized more. If not, I'm still getting A & A+. Again, huge thank you to everyone.

Edit: I made some further adjustments to the sqm config and the cake-autorate config, which seems to have me in a good place.

sqm:

config queue 'wan'
        option qdisc 'cake'
        option script 'layer_cake.qos'
        option interface 'eth1'
        option linklayer 'none'
        option download '800000'
        option upload '40000'
        option enabled '1'
        option debug_logging '1'
        option verbosity '5'
        option squash_ingress '0'
        option ingress_ecn 'ECN'
        option egress_ecn 'NOECN'
        option squash_dscp '0'
        option qdisc_advanced '1'
        option qdisc_really_really_advanced '1'
        option iqdisc_opts 'docsis besteffort ingress nat dual-dsthost'
        option eqdisc_opts 'docsis nat ack-filter dual-srchost'

cake-autorate (config.primary.sh):

#!/usr/bin/env bash

# *** INSTANCE-SPECIFIC CONFIGURATION OPTIONS ***
#
# cake-autorate will run one instance per config file present in the /root/cake-autorate
# directory in the form: config.instance.sh. Thus multiple instances of cake-autorate
# can be established by setting up appropriate config files like config.primary.sh and
# config.secondary.sh for the respective first and second instances of cake-autorate.

### For multihomed setups, it is the responsibility of the user to ensure that the probes
### sent by this instance of cake-autorate actually travel through these interfaces.
### See ping_extra_args and ping_prefix_string

dl_if=ifb4eth1 # download interface
ul_if=eth1     # upload interface

# Set either of the below to 0 to adjust one direction only
# or alternatively set both to 0 to simply use cake-autorate to monitor a connection
adjust_dl_shaper_rate=1 # enable (1) or disable (0) actually changing the dl shaper rate
adjust_ul_shaper_rate=1 # enable (1) or disable (0) actually changing the ul shaper rate

min_dl_shaper_rate_kbps=225000  # minimum bandwidth for download (Kbit/s)
base_dl_shaper_rate_kbps=450000 # steady state bandwidth for download (Kbit/s)
max_dl_shaper_rate_kbps=800000  # maximum bandwidth for download (Kbit/s)

min_ul_shaper_rate_kbps=11250  # minimum bandwidth for upload (Kbit/s)
base_ul_shaper_rate_kbps=22500 # steady state bandwidth for upload (KBit/s)
max_ul_shaper_rate_kbps=40000  # maximum bandwidth for upload (Kbit/s)

connection_active_thr_kbps=2000  # threshold in Kbit/s below which dl/ul is considered idle

# *** OVERRIDES ***

### See defaults.sh for additional configuration options
### that can be set in this configuration file to override the defaults.
### Place any such overrides below this line.

# *** OUTPUT AND LOGGING OPTIONS ***

output_processing_stats=1       # enable (1) or disable (0) output monitoring lines showing processing stats
output_load_stats=1             # enable (1) or disable (0) output monitoring lines showing achieved loads
output_reflector_stats=1        # enable (1) or disable (0) output monitoring lines showing reflector stats
output_summary_stats=1          # enable (1) or disable (0) output monitoring lines showing summary stats
output_cake_changes=1           # enable (1) or disable (0) output monitoring lines showing cake bandwidth changes
debug=1                         # enable (1) or disable (0) output of debug lines

Hopefully this can help someone else who might stumble upon this thread.


As I tried to explain, the way CoDel-based AQMs operate means you have to expect the standing queue to increase by at least 1 * target, with the default target being 5ms...
If you really need lower numbers you can reduce the codel interval (e.g. rtt 50 will reduce the target to 2.5ms), but that will cost throughput for flows with RTTs noticeably larger than 50ms. That might or might not be an acceptable trade-off for you. (Please note that according to cake's man page there is a lower limit for interval/target; once you set target close to the kernel's own timing jitter you will get way more drops/markings and hence much less throughput, so realistically something like 10ms is a reasonable lower bound for rtt, resulting in a target of 0.5ms.)
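To make the interval/target relationship concrete, here is the 5% rule applied to a few candidate rtt settings (a quick sketch):

```shell
# cake derives target as 5% of the configured rtt, so the trade-off looks like:
for rtt in 100 50 25 10; do
    awk -v r="$rtt" 'BEGIN { printf "rtt %dms -> target %.2fms\n", r, r * 0.05 }'
done
# rtt 100ms -> target 5.00ms
# rtt 50ms  -> target 2.50ms
# rtt 25ms  -> target 1.25ms
# rtt 10ms  -> target 0.50ms
```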

You explained it well the first time, at least the concept of the 5ms, but the extra clarity on how that number is derived helps. However, I'm struggling to understand where "rtt" can be set, since you mentioned cake derives target from it. And yes, I did "Google" earlier and again just now, but I'm clearly not using the right keywords or something. Would you be able to direct me to where to make this change? Does it have to be made for both ingress and egress? Again, my apologies if I should know this already. :neutral_face:

In sqm-scripts not directly, you will need to use something like:

        option iqdisc_opts 'docsis besteffort ingress nat dual-dsthost rtt 25ms'
        option eqdisc_opts 'docsis nat ack-filter dual-srchost rtt 25ms'

See e.g. here:

ROUND TRIP TIME PARAMETERS

   Active Queue Management (AQM) consists of embedding congestion
   signals in the packet flow, which receivers use to instruct
   senders to slow down when the queue is persistently occupied.
   CAKE uses ECN signalling when available, and packet drops
   otherwise, according to a combination of the Codel and BLUE AQM
   algorithms called COBALT.

   Very short latencies require a very rapid AQM response to
   adequately control latency.  However, such a rapid response tends
   to impair throughput when the actual RTT is relatively long.
   CAKE allows specifying the RTT it assumes for tuning various
   parameters.  Actual RTTs within an order of magnitude of this
   will generally work well for both throughput and latency
   management.

   At the 'lan' setting and below, the time constants are similar in
   magnitude to the jitter in the Linux kernel itself, so congestion
   might be signalled prematurely. The flows will then become sparse
   and total throughput reduced, leaving little or no back-pressure
   for the fairness logic to work against. Use the "metro" setting
   for local lans unless you have a custom kernel.

   rtt TIME
        Manually specify an RTT.

You can set this differently for ingress and egress, but the logic behind this parameter implies that using the same value for both directions makes some sense... (though in my book, practical beats theoretical here, so if an asymmetric setting works best for your traffic and network, just go for it).
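A quick way to confirm the rtt override actually took effect after restarting sqm (a sketch; interface/ifb names as used earlier in the thread — cake echoes its rtt in the qdisc dump):

```shell
# Pull the rtt field out of the live qdisc dump for egress and ingress.
tc qdisc show dev eth1 | grep -o 'rtt [^ ]*'      # expect: rtt 25ms
tc qdisc show dev ifb4eth1 | grep -o 'rtt [^ ]*'  # expect: rtt 25ms
```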


Glad I asked; that's exactly what I needed to know. I made the change to 25ms for ingress/egress and I'm getting a more solid A+ and better max latency numbers (the clusters are tighter).

Lowering to 20ms gives even better results, so I think I'm going to stop there (for now :slight_smile: ).

Really appreciate the explanations and help!!