OK, tried a couple of things, including changing quantum and qlen; nothing really changes things, but if I set rate to 950000, then we get 50000 more, with the same pings etc.
Only explanation for that is, that there is a bug in some calculation.
@moeller0 is our "king of weird framing math" guy, perhaps he can figure out the proper curve to set the rate here. Try 450 and 475 perhaps to get other data points?
Me, I am trying to figure out... hmm... do you overclock? underclock? have a dynamic cpu governor? the token bucket should fill independently of the actual clock rate...
Anyway, it sounds like lowering the burst size was an improvement, and tweaking it further up accomplishes little. Going back to the bifurcation we saw at flows 1024 vs flows 16, we have confirmed that codel is doing the right thing (5ms queue depth for ping and all flows holding steady except for some puzzling anomalies), but I'm really fond of having tons of queues for normal workloads.
One of my great joys in life is how much better than PIE this is, and has always been. Why do anything else, I've always thought... even if you could only hash into 16 or 32 queues as early experiments at cablelabs tried, drop head works so great....
If your brain hasn't exploded from overwork... a rrul test always makes me happy. There's one that has a settable number of flows.
Just for giggles, if you are bored, and for SCIENCE! and since this is not an internet path, try target 1ms interval 10ms, flows 1024, tcp_nup with 128 flows. Nobody bothers to tune codel for the data center, and in software 1ms was about the lowest target we could achieve. What many don't "get" about that still being miserable is that on a 1Gbit path, a single packet takes 13us to traverse the network and an ack about 1us, so in an ideal world (one envisioned by both the l4s and SCE folk, with some interesting work now in BBRv2 for sub-packet windows) - you'd have 27us worth of buffering.
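If you want to try that, here is a sketch of the commands (eth0 is a placeholder for your interface, this needs root, and none of it is from the post itself):

```shell
# Datacenter-flavored codel constants suggested above (eth0 is a placeholder):
tc qdisc replace dev eth0 root fq_codel target 1ms interval 10ms flows 1024

# The suggested 128-flow upload test against a nearby flent server:
flent tcp_nup --test-parameter upload_streams=128 -H netperf-eu.bufferbloat.net
```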
Are there any other cool or weird options nss lets you fiddle with? What does this accel_mode thing do?
I think you are talking about rrul_var. Here is my template for running this:
date ; ping -c 10 netperf-eu.bufferbloat.net ; ./run-flent --ipv4 -l 120 -H netperf-eu.bufferbloat.net rrul_var --firstname.lastname@example.org --email@example.com --socket-stats --test-parameter bidir_streams=8 --test-parameter markings=0,32,64,96,128,160,192,224 --test-parameter ping_hosts=184.108.40.206 -D . -t file_name_stem_string
0,32,64,96,128,160,192,224 are the TOS byte decimal values for CS0,CS1,CS2,CS3,CS4,CS5,CS6,CS7
192.168.42.1 is my openwrt router with flent-tools installed, and 220.127.116.11 is just an easy to remember address for an additional, close by ICMP reflector. I like to start out with a few manual pings so I have a better feel for the unloaded latency to the test servers...
853 Mbps TCP/IPv4 goodput out of 950 gross shaper rate
Before it was 800 out of 900... so a ~100Mbps difference (or ~10%), which is a lot for overhead.... Typically on Linux, if no overhead is specified, a qdisc will take the kernel's idea of an skb's size, which is 14 bytes larger than the IP packet size (which makes sense, as these 14 bytes are something the kernel itself might need to handle, but that is obviously not a complete picture, as most link layers require a wee bit more overhead/headers to do their thing). But I have no idea what the NSS code does internally. I do doubt though that
853 = 950 * ((1500-20-20)/(1500+X))
(1500 + X) = (950/853) * (1500-20-20)
X = ((950/853) * (1500-20-20)) - 1500 = 126.025791325
X = ((900/800) * (1500-20-20)) - 1500 = 142.5
100 Bytes of per-packet-overhead added inside the NSS qdisc would be the root cause of the issue.
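For reference, the arithmetic above can be sketched as a back-of-the-envelope calculation (my restatement, assuming TCP/IPv4 without options, i.e. 40 bytes of IP+TCP headers per packet, as in the formulas):

```python
# Solve goodput = rate * MSS / (MTU + X) for the implied per-packet
# overhead X, mirroring the equations above (TCP/IPv4, no options):
def implied_overhead(rate_mbps, goodput_mbps, mtu=1500, headers=40):
    mss = mtu - headers                          # payload bytes per packet
    return (rate_mbps / goodput_mbps) * mss - mtu

print(round(implied_overhead(950, 853), 1))      # -> 126.0 bytes
print(round(implied_overhead(900, 800), 1))      # -> 142.5 bytes
```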
Sorry, but I am out of ideas.
P.S.: The math is typically the lesser issue in "weird framing" compared to getting the information about the applicable per-packet-overheads.
I've been busy all weekend so I haven't been able to fiddle with flent, time is limited I'm afraid
I see you've been working with @KONG to get some decent measurements? I'll have to find some time to get flent started for the first time
-- EDIT --
flent --step-size=.05 --socket-stats --te=upload_streams=4 tcp_nup -H netperf-eu.bufferbloat.net
flent --step-size=.05 --socket-stats --te=download_streams=4 tcp_ndown -H netperf-eu.bufferbloat.net
Now, to be clear, this is from a NSS build without SQM enabled. So you might call this my baseline without any form of SQM.
Here are the result files:
I'm getting to know flent step by step, I see that you can create plots from the data someone else gathered by using flent. I've attached a few to this post. I can't "read" these plots like you and KONG can but I can see a very clear difference between KONG's results with SQM for NSS and mine without:
And another one:
If you guys would like to see more stuff, let me know what flent commands I need to enter and what I need to change/activate on my R7800 for NSS fq_codel.
Thx for contributing. I like having "teaching moments".
I'm not sure why the green flow is so slow for so long but it's not a huge problem.
What happens when a machine runs out of cpu is always variable, and I imagine your router ran out. You end up with drops on the rx ring typically. You could also have run out of memory bandwidth.
It may be your router doesn't have BQL in this mode. bqlmon is a good tool.
Check your NAS to see if it's running sch_fq or fq_codel (tc -s qdisc show). It doesn't look fq-y or fq-codelly, it looks like your queue is short as it looks like you ran out of buffer space on the rx ring or the egress qdisc.
In terms of teaching how to read stuff like this:
What we look for is a good tcp sawtooth, which this plot is a reflection of, but as it is sampled (the real behavior is per packet), you don't see the detail you would see with an accurate simulation such as NS3. The cwnd plot can be more revealing. The "tighter" the sawtooth, the more quickly a new flow can enter the link. The various tcp_square_wave tests are helpful in understanding that.
The width of the sawtooth is relative to the perceived delay (buffering + physical distance), the height relative to the number of flows and bandwidth, and the average, average. In a non-FQ'd world, you'll see the sawtooths cross up and down across the entire range of bandwidth, with FQ they hit the average and bounce off. The ultimate goal of tcp over any RTT is to give a fair share of the network to other TCPs but it takes a geometrically increasing amount of time to do the longer the perceived delay.
In order to see the bufferbloat problem in all its glory that everyone had in 2011, and so many still have today, replace the egress qdisc on the router with pfifo, and repeat the test.
tc qdisc replace dev your_wan_interface root pfifo limit 10000
A sawtooth is typical of Cubic and Reno. Cubic's sawtooth is, well, cubic, it has a distinct curve that cannot be seen at this resolution.
To see something very, very different, switch your tcp to bbr.
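For the curious, switching the sender's congestion control is a one-liner (this assumes tcp_bbr is built for your kernel; it affects new connections only):

```shell
modprobe tcp_bbr                                  # load BBR if it is modular
sysctl -w net.ipv4.tcp_congestion_control=bbr     # use it for new connections
sysctl net.ipv4.tcp_available_congestion_control  # verify bbr is listed
```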
A lot of people have missed that flent has a gui (flent-gui). It lets you flip between plots rapidly and merge two or more pieces of data to create a comparison plot.
btw, we have flent servers all over the world. There's one in Germany, another in England.
Luckily I monitor my R7800 with luci_statistics and send that to InfluxDB/Grafana. These are the stats around the time I measured with flent:
The spikes you see above are the times I measured with flent.
No errors were measured
The router's load (it's a dual core, so no sweat at all)
And the one we're after: CPU usage.
And how memory usage was at that time.
I can provide the RRD files too but they don't carry so much detail anymore like these pictures from Grafana.
It appears that BQL is zero, is that good?
root@OpenWrt:~# cat /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit
0
I don't think I have BQL, this R7800 is directly connected to the FTU/NT, so no ISP modem is interfering. How can I enable BQL?
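One note on that question: BQL is implemented inside the NIC driver, so it can't be switched on from userspace; whether it exists at all depends on the driver. You can inspect its sysfs knobs like this (eth0 and tx-0 are placeholders for your device and queue):

```shell
# BQL state is exposed per tx queue in sysfs; a driver without BQL
# support won't honor these. "limit" is the current dynamic value;
# "limit_max" can be lowered to cap driver-level queueing.
for f in limit limit_max limit_min inflight; do
  printf '%s: ' "$f"
  cat /sys/class/net/eth0/queues/tx-0/byte_queue_limits/"$f"
done
```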
I'm learning a lot! I had no idea my NAS (self built Ubuntu machine) was running anything like that:
~$ tc -s qdisc
qdisc noqueue 0: dev lo root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc fq_codel 0: dev eth0 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
 Sent 48942431499 bytes 67601851 pkt (dropped 7, overlimits 0 requeues 4560923)
 backlog 0b 0p requeues 4560923
 maxpacket 67770 drop_overlimit 0 new_flow_count 67801 ecn_mark 0
 new_flows_len 0 old_flows_len 0
qdisc noqueue 0: dev br-24e8282b2b0b root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev docker0 root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev br-65dbb47417f0 root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev br-d105b742a060 root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev veth08
I'm running docker on this NAS, I didn't include all the queue's for each container.
Ah, since it's all about buffering, a tighter sawtooth means less time spent in buffer processing. A wide sawtooth means some device is waiting for packets that aren't being transmitted by another. A tighter sawtooth means both communicating devices are perfectly teamed up and "synchronized". Am I right?
Gotcha, so if a sawtooth is very large and wide, that's a huge waste of time and adds up to latency in the chain from source to destination. Every layer in the OSI model adds its own piece of buffering; the goal is to get them all aligned and tuned to one another so that buffers can be smaller and pass data along as soon as it's received.
I'll do that when I'm the only one on the home network, right now I'm not
Just a quick note, you can try running your nas with cake gso-split without a bandwidth parameter, and get less latency for mixed traffic... if you have "enough cpu" on it. I really need to come up with another term for that, as we don't mean straight line cpu performance but the ability to context switch rapidly.
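A sketch of that cake invocation on the NAS (eth0 is a placeholder; `split-gso` is the tc spelling of the option, and leaving out the bandwidth parameter means cake runs unshaped):

```shell
# Unshaped cake (no bandwidth parameter) with GSO splitting on egress:
tc qdisc replace dev eth0 root cake besteffort split-gso
tc -s qdisc show dev eth0    # confirm it took
```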
Also, if you are a grafana whiz? I've longed to plot dynamically the output of a high-resolution irtt sample, but the output has two problems: it uses mixed units (us, ms, s), and the one-way delay measurements require a sync'd clock. The rtt measurement does not. (OK, that's three problems.) If your nas however is doing nothing but TCP (no tunnels), sch_fq is better than fq_codel.
The json output doesn't use mixed units but requires the test complete before plotting. four problems.
a typical test I do is:
irtt client -i3ms -d1m --dscp=0xfe
and then I parse that with tr, awk and gnuplot. There isn't a way to send the json output (-o whatever.json) to a stream either. I have some really cool plots of how starlink works but haven't twiddled it enough to do one way delay yet (five problems).
@dtaht @moeller0 OK. I saw dropped packets on the machine which previously hosted netserver, so I installed netserver on my laptop and verified that I see no dropped packets on either the client or the server side. The result looks better. Upload is about 840Mbps now when I configure a limit of 900Mbps.
Heh. Thx for digging further into it. Ironically chrome will no longer let me download flent.gz files from this site for some reason, but your result now seems close enough to correct to just ship.
I'm still quite puzzled as to the bifurcation thing that happened, and would like to somehow test ipv6.
For giggles how does it work if you hash into a prime number of buckets? Say 17 or 31.
67 or 997 might be better. I know I was a really weird kid, but for some reason I memorized a bunch of primes like that.
At one point I was helping Hisham and he didn't have access to cake. We were trying to get per-internal-ip fairness using DRR. If I remember right, we did some packet captures and decided on some average number of streams on his network, and then picked a prime number such that the typical number of streams hashing to a bucket was something like 6; the prime might have been like 11 or 7 or something. The idea being that whether you had the avg number or +1 or -1 or whatever, it'd be a smallish percentage change from the avg. If you try to get always 1 per bucket, then if you get a collision it's suddenly 2x as much as normal!
Just some thoughts.
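A toy sketch of that bucket-sizing argument (the flow counts and bucket counts here are illustrative, not from the original experiment):

```python
import random

# Toy model of the idea above: with ~6 flows per bucket, one extra
# collision is a ~17% change in a flow's share; with ~1 per bucket,
# a collision doubles it. Hashing is modeled as uniform random choice.
def bucket_loads(n_flows, n_buckets, seed=1):
    rng = random.Random(seed)
    loads = [0] * n_buckets
    for _ in range(n_flows):
        loads[rng.randrange(n_buckets)] += 1
    return loads

for n_buckets in (7, 11, 64):
    loads = bucket_loads(64, n_buckets)
    print(n_buckets, "buckets: max", max(loads), "min", min(loads))
```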
Thanks, the flent series meta reports a sum of 819 Mbps, while the theoretical maximum for a shaper rate of 900 is (assuming TCP/IPv4 without options and no explicit overhead configured):
900 * ((1500-20-20)/(1500+14)) = 867.90 Mbps
IMHO that seems pretty usable, but I will happily trade-in throughput for better latency-under-load performance.
Aside from testing with ipv6, how about with nat enabled?
What other offloads are in this thing?
Another thought is flows 16 limit 1024. I keep thinking we are out of some on-chip resource. Each flow consumes 64 bytes of extra memory; a packet limit of 1024 uses up 2k each for the packet list, and the default is a whopping 4096, which makes sense without GRO at 10Gbit, but not at 1Gbit.
I've "installed" the NSS SQM scripts from your repo on my R7800:
I've also "installed" the selectable NSS.qos setup script option from your repo:
I've enabled SQM this way on a stable 21.02 build from @ACwifidude's repo (OpenWrt 21.02-SNAPSHOT r16328+19-f441be3921).
When I look at the log I see some errors, I'd like to resolve them to get SQM with NSS running based on your latest findings but realize I might be walking on thin ice here. Maybe you can tell from the messages what I've done wrong?
Wed Nov 24 11:35:05 2021 user.notice SQM: Starting SQM script: nss.qos on eth1, in: 490000 Kbps, out: 490000 Kbps
Wed Nov 24 11:35:06 2021 daemon.err modprobe: failed to find a module named act_ipt
Wed Nov 24 11:35:06 2021 daemon.err modprobe: failed to find a module named sch_fq_codel
Wed Nov 24 11:35:06 2021 kern.err kernel: [102179.705566] debugfs: File 'virt_if' in directory 'stats' already present!
Wed Nov 24 11:35:06 2021 kern.info kernel: [102179.705885] Created a NSS virtual interface for dev [nssifb]
Wed Nov 24 11:35:06 2021 kern.info kernel: [102179.711564] NSS IFB data callback registered
Wed Nov 24 11:35:06 2021 kern.info kernel: [102179.717390] NSS IFB transmit callback registered
Wed Nov 24 11:35:06 2021 kern.info kernel: [102179.722088] NSS IFB module loaded.
Wed Nov 24 11:35:06 2021 daemon.err modprobe: failed to find a module named act_ipt
Wed Nov 24 11:35:06 2021 daemon.err modprobe: failed to find a module named sch_fq_codel
Wed Nov 24 11:35:07 2021 user.notice SQM: ERROR: cmd_wrapper: tc: FAILURE (2): /sbin/tc filter add dev eth0 parent 1:0 protocol ip prio 0 u32 match ip protocol 1 0xff flowid 1:13
Wed Nov 24 11:35:07 2021 user.notice SQM: ERROR: cmd_wrapper: tc: LAST ERROR: RTNETLINK answers: Not supported We have an error talking to the kernel
Wed Nov 24 11:35:07 2021 user.notice SQM: ERROR: cmd_wrapper: tc: FAILURE (2): /sbin/tc filter add dev eth0 parent 1:0 protocol ipv6 prio 1 u32 match ip protocol 1 0xff flowid 1:13
Wed Nov 24 11:35:07 2021 user.notice SQM: ERROR: cmd_wrapper: tc: LAST ERROR: RTNETLINK answers: Not supported We have an error talking to the kernel
Wed Nov 24 11:35:07 2021 user.notice SQM: WARNING: sqm_start_default: nss.qos lacks an egress() function
Wed Nov 24 11:35:07 2021 kern.warn kernel: [102180.383768] nss_qdisc_init:NSS qdisc 25ef04a5 (type 1) used along with non-nss qdiscs, or the interface is currently down
Wed Nov 24 11:35:07 2021 kern.info kernel: [102180.393108] 61a7cae9: Found net device [eth0]
Wed Nov 24 11:35:07 2021 kern.info kernel: [102180.403845] 61a7cae9: Net device [eth0] has NSS intf_num
Wed Nov 24 11:35:07 2021 kern.info kernel: [102180.409845] Nexthop successfully set for [eth0] to [nssifb]
Wed Nov 24 11:35:07 2021 user.notice SQM: nss.qos was started on eth0 successfully
I reckon that SQM isn't working (properly) on eth1. It's probably because some kernel modules that should be loaded can't be found. That very likely results in the message that the cmd_wrapper is unable to talk to the kernel.
I'm also a bit puzzled that SQM is also started on eth0 since I explicitly set eth1:
root@OpenWrt:/etc/config# cat sqm
config queue 'eth1'
	option interface 'eth1'
	option linklayer 'none'
	option verbosity '5'
	option qdisc 'fq_codel'
	option script 'nss.qos'
	option download '490000'
	option upload '490000'
	option debug_logging '1'
	option enabled '1'
Any clues what I did wrong?
I've found this:
And indeed, that package is not installed on this build. Since it's a community build, I can't install it anymore. I was able to install those missing packages, and the modules are loaded (checked with lsmod). There is one module missing however:
daemon.err modprobe: failed to find a module named sch_fq_codel
I can't find it in the modules directory, these are the ones loaded:
# lsmod | grep sch_
sch_cake     40960  0
sch_codel    20480  0
sch_dsmark   20480  0
sch_fq       20480  0
sch_gred     24576  0
sch_hfsc     28672  0
sch_htb      28672  0
sch_ingress  16384  0
sch_multiq   16384  0
sch_pie      20480  0
sch_prio     20480  0
sch_red      20480  0
sch_sfq      24576  0
sch_tbf      20480  0
sch_teql     16384  0
Since I see a sch_codel and a sch_fq, could it be a (re)naming issue of module names, or is there actually another module?
fq_codel is built into the kernel in openwrt, not a module. Not really an error here.