Sometimes web pages got stuck loading or kept waiting for a long time. After searching the forum, I found @moeller0's advice in this post. There were also large packets on the wan and ifb4wan interfaces (max_len in the cake statistics). I installed ethtool and ran these commands.
ethtool -K wan tso off gso off gro off
ethtool -K lan2 tso off gso off gro off
ethtool -K lan3 tso off gso off gro off
ethtool -K lan4 tso off gso off gro off
ethtool -K lan5 tso off gso off gro off
After some iperf tests running upload and download at the same time, I got these results.
These are likely switch ports; adding fq_codel to them will, I would guess, direct all switch traffic through the CPU, so not a real option if you want line-rate switching... A switch typically does not queue much and needs to operate in hardware to be fast...
If you want a typical sqm installation you should select your wan interface for sqm... as you did.
The configuration looks OK. You could add the word ingress to option iqdisc_opts 'nat dual-dsthost', and then maybe even increase option download '40000' to 46000, but that is not going to change much.
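For reference, the relevant part of /etc/config/sqm would then look roughly like this (a sketch; the section name and your other options will differ):

```
config queue
	option interface 'wan'
	option download '46000'
	option iqdisc_opts 'nat dual-dsthost ingress'
```

Reload sqm afterwards, e.g. with /etc/init.d/sqm restart, so the new options take effect.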
As that other post states, that should not be necessary, but if you tested it and it improves things, by all means stick with it. BTW, what does ethtool -k lan2 return?
root@RB750Gr3:~# ethtool -k lan2
Features for lan2:
rx-checksumming: on [fixed]
tx-checksumming: on
tx-checksum-ipv4: on [fixed]
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: on [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on [fixed]
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on [fixed]
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp-mangleid-segmentation: on [fixed]
tx-tcp6-segmentation: on [fixed]
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: off [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on [fixed]
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: on [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
tx-gso-partial: off [fixed]
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
tx-gso-list: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: on
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
rx-gro-list: off
macsec-hw-offload: off [fixed]
That is hard, but for a best-effort approach add these commands to /etc/rc.local. Unless these interfaces go away and come back up again, this should help.
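As a sketch (interface names taken from your commands above), the /etc/rc.local fragment could look like this; note that exit 0 must remain the last line of that file:

```shell
# best effort: disable TSO/GSO/GRO on these interfaces at boot
for ifc in wan lan2 lan3 lan4 lan5; do
    ethtool -K "$ifc" tso off gso off gro off
done
exit 0
```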
However, I would very much try to reconfirm that this actually helps: cake deals with meta packets on its own and should not need it; that was more important for fq_codel on slower links.
That does not really matter per se: cake with the split-gso option will split these meta packets into normal ~1500-byte packets and send them out individually, so a 21196-byte meta packet will not cause problems...
However, these download delay statistics after loading several pages look way too large; the av_delay should be much closer to 5 ms even under load. This might imply that your router is already partially CPU bound. What kind of router do you have, and what software do you run on it?
Also, with the ethtool commands issued, do you still see the same high delay values?
The router is a Mikrotik RB750Gr3. As you can see from the image below, the architecture is MediaTek MT7621 ver:1 eco:3. I have flashed it with OpenWrt 22.03.3 r20028-43d71ad93e / LuCI openwrt-22.03 branch git-22.361.69894-438c598.
Okay, a dual-core 880 MHz MIPS, clearly not the fastest router on the planet... but properly configured it should still allow traffic shaping at your aggregate ~55 Mbps...
Which implies that GSO/GRO meta packets are not your problem, so I would leave them enabled, as they help reduce the load on the router's networking stack...
Yes, please, preferably with the router's radio disabled. When I tested a BT HomeHub5A years ago (also a dual-core MIPS, though with older and slower CPU cores), I already ran into CPU limits when trying to traffic shape 50/10 and use the WiFi at the same time.
Maybe have a look here:
and see whether you could distribute the processing better over your CPUs to potentially gain a bit more throughput...
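One low-effort knob (assuming a reasonably recent OpenWrt; this is a sketch, not a guaranteed win) is the global packet_steering option in /etc/config/network, which lets the kernel spread receive processing over both cores:

```
config globals 'globals'
	option packet_steering '1'
```

Apply it with /etc/init.d/network reload, then re-run the iperf test to check whether throughput actually improves on your hardware.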
Also, if only for testing, switch your sqm configuration from piece_of_cake.qos/cake to simplest.qos/fq_codel, as that will do the following:
a) reduce the CPU load, as simplest is computationally a bit cheaper than cake
b) on CPU overload, HTB+fq_codel will tend to lose throughput while keeping latency low, whereas cake will show less throughput loss but a larger increase in latency.
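For the test, the change in /etc/config/sqm amounts to roughly these two lines (a sketch; leave the rest of your queue section as it is, and restart sqm afterwards):

```
	option qdisc 'fq_codel'
	option script 'simplest.qos'
```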
Note that the reported latency is calculated from the sojourn time of the packets in the queue, that is, the time from entering the queue to leaving it again. If cake does not get the CPU when it wants to transmit packets, it will send a few more packets on the next cycle (keeping throughput high), but that means all later packets see the delay in getting the CPU as additional delay. This description is not 100% correct, but it should describe the underlying principle reasonably well.
Other than that, yes, I agree ECN can be nice and helpful. And many servers are already prepared to use ECN if only the endpoints ask for it (that is, the endpoint TCP stacks negotiate to use ECN).
This can also, depending on the TCP stack, result in somewhat higher goodput (fewer retransmissions).
You can't really, but try:
tc -s qdisc
tc -d qdisc
but you need to do it while the test is still running, as the transient fq_codel stats are not kept very long. I also think they do not contain convenient measures of average and peak delay.