What is currently the fastest non-x64 router?

Not directly with S3: I would need one or two intermediate steps. An EC2 instance, then copy files from the bucket to EC2 local storage, then rsync locally from/to EC2.
Another option is an IPsec tunnel into my VPC with S3 configured as a VPC service endpoint, which would allow me to use the regular s3 CLI over the VPN.
Either approach is cumbersome to set up, but possible.
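
For illustration, a rough sketch of the two-hop variant (instance name, bucket and paths are placeholders, not an actual setup):

# on the EC2 instance: stage the objects from the bucket onto instance storage
aws s3 sync s3://my-bucket/prefix/ /data/staging/
# from home: pull the staged copy over a single long-lived ssh/rsync connection
rsync -av ec2-host:/data/staging/ /local/dest/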

It is a flat list of objects without directories. They fake out directories by "interpreting" the / character in the Console, but the object store itself does not support a hierarchy. Objects cannot be modified in place, only overwritten. It is possible to use multi-part uploads/downloads though.
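
For what it is worth, the "directory" illusion is visible in the API itself; listing with a prefix and delimiter is all there is (bucket and prefix here are made-up examples):

aws s3api list-objects-v2 --bucket my-bucket --prefix photos/ --delimiter /
# returns the objects whose keys start with "photos/" plus a CommonPrefixes list,
# which the Console then renders as folders; no real directory objects exist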

It is not, just like its alternatives: they use the same S3 API to access the objects, so they do not solve the problem. They arguably make it worse by introducing more complexity.

I am curious as to the difference you might perceive if, at the 500 setting, you enabled ECN? ( https://www.bufferbloat.net/projects/cerowrt/wiki/Enable_ECN/ )

A) does aws accept it? (marks in cake) if so
B) what does the cake backlog and average delay go to?
C) Is there a difference in cpu if you are marking frantically instead of dropping frantically?
D) improvement in FCT?

I too wondered about rclone. I use that all the time and like it. Pity it won't help here then.

Dumb thought perhaps, but what about WireGuard or the like, either through a well-connected VPS or a VPN? Wouldn't that permit encapsulation into one flow over the last-mile link?
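
Something like this on the laptop, so that everything towards the VPS rides one UDP 5-tuple over the last mile (addresses, keys and endpoint are placeholders):

[Interface]
PrivateKey = <laptop-private-key>
Address = 10.9.0.2/32

[Peer]
PublicKey = <vps-public-key>
Endpoint = vps.example.com:51820
AllowedIPs = 0.0.0.0/0
PersistentKeepalive = 25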

I think CP/M had a similar flat filesystem (not saying S3 and CP/M's filesystem are identical, just that hierarchy is not a strict requirement for a filesystem)... This however is not really the problematic point IIUC; what is missing is some serialization of the small files so that the set-up cost and ramp-up time of a TCP connection can be amortized over a larger number of total packets transferred. The advantage would be that such flows run closer to their achievable capacity share and are considerably more reactive to AQM intervention than a flock of flows that terminate while still inside slow start or even in their initial window. At that point a "reasonable" level of parallelism would still help to speed things up, but it would not require a triple-digit number of concurrent connections.
But that is all moot, unless one can somehow MacGyver something like this with a close-to-the-storage VPS/VPC/VM.
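
Something along those lines could perhaps be MacGyvered with tar and ssh (host name and paths are hypothetical): serialize the small files into one stream, push that single long-lived flow to the close-to-the-storage VM, and let the VM fan it out into the bucket:

tar -C /local/src -cf - . | ssh s3-vm 'tar -C /staging -xf -'
# then, on the VM, something like: aws s3 sync /staging s3://my-bucket/prefix/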

Well, you were right all along. I ended up getting myself a Lenovo SFF desktop:

Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz
16GB RAM
WAN - RTL8168g/8111g (onboard)
LAN - Intel(R) PRO/1000 Network Connection (a PCIe card)
Configured exactly the same as wrt3200acm 
+5 Watts idle (temp 25..30C)
+10 Watts (on top of idle)  when fully loaded 300/300 Mbps both ways (temp 40..45C) 

Q: Should Intel NIC be WAN and Realtek be LAN? Not sure if there is a difference.

With aws cli and s4cmd (both Python) I can do 256 concurrent uploads and 256 downloads (bidirectional). s5cmd (Golang) can double that, at 512/512 concurrent streams for download/upload. Both of those max out the CPU on the client laptop, and chances are I cannot have a voice/video call at the same time: ping shows some packet loss and latency jumps to about three times its normal value. I can live with that.
All the other clients are not impacted, and if a second one starts downloading/uploading it gets 50% of the bandwidth in either direction. No packet losses on other clients. Latency is effectively unchanged.
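
For reference, the aws cli side of that concurrency is controlled by its s3 transfer settings; 256 is simply the value used in this test, not a recommendation:

aws configure set default.s3.max_concurrent_requests 256
aws configure set default.s3.max_queue_size 10000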

Above was a test with large files; I will test small files later. My goal was to tune for the worst case (small files / more concurrent uploads and downloads) and then use the same setting for mixed transfers. Sometimes it will be all large files and I would rather not have to remember to reconfigure concurrency. Sometimes I will not even know the file size distribution.

Here is some data in case @moeller0 would like to take a look :wink:

Conntrack usage >3K (with 2 x 512 concurrent streams)
CPU usage between 20 and 40% on all cores.
config queue 'eth1'
	option qdisc 'cake'
	option script 'piece_of_cake.qos'
	option ingress_ecn 'ECN'
	option egress_ecn 'ECN'
	option itarget 'auto'
	option etarget 'auto'
	option enabled '1'
	option interface 'pppoe-wan'
	option download '317000'
	option upload '317000'
	option linklayer 'ethernet'
	option overhead '50'
	option qdisc_advanced '1'
	option squash_dscp '1'
	option squash_ingress '1'
	option qdisc_really_really_advanced '1'
	option iqdisc_opts 'nat dual-dsthost ingress mpu 64'
	option eqdisc_opts 'nat dual-srchost mpu 64'
qdisc noqueue 0: dev lo root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc fq_codel 0: dev eth0 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64 
 Sent 226685211813 bytes 262228085 pkt (dropped 20, overlimits 0 requeues 259) 
 backlog 0b 0p requeues 259
  maxpacket 4542 drop_overlimit 0 new_flow_count 9029 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 0: dev eth1 root refcnt 2 limit 10240p flows 1024 quantum 1522 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64 
 Sent 242625872117 bytes 273040064 pkt (dropped 21, overlimits 0 requeues 70358) 
 backlog 0b 0p requeues 70358
  maxpacket 1522 drop_overlimit 0 new_flow_count 97940 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 0: dev eth2 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
  maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc noqueue 0: dev eth0.10 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.20 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.30 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.40 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.50 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.60 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.71 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.72 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.73 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.74 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.75 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.76 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.77 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth1.35 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc cake 8005: dev pppoe-wan root refcnt 2 bandwidth 317Mbit besteffort dual-srchost nat nowash no-ack-filter split-gso rtt 100ms noatm overhead 50 mpu 64 
 Sent 236619008615 bytes 273039825 pkt (dropped 11496347, overlimits 393509251 requeues 0) 
 backlog 1034883b 835p requeues 0
 memory used: 15639680b of 15220Kb
 capacity estimate: 317Mbit
 min/max network layer size:           40 /    1500
 min/max overhead-adjusted size:       90 /    1550
 average network hdr offset:            0

                  Tin 0
  thresh        317Mbit
  target            5ms
  interval        100ms
  pk_delay       51.5ms
  av_delay       14.6ms
  sp_delay       1.18ms
  backlog      1036355b
  pkts        284537006
  bytes    251400357086
  way_inds     83813485
  way_miss       135036
  way_cols      9508112
  drops        11496347
  marks             172
  ack_drop            0
  sp_flows          469
  bk_flows          396
  un_flows            0
  max_len         66240
  quantum          1514

qdisc ingress ffff: dev pppoe-wan parent ffff:fff1 ---------------- 
 Sent 230263130128 bytes 267281179 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc cake 8006: dev ifb4pppoe-wan root refcnt 2 bandwidth 317Mbit besteffort dual-dsthost nat wash ingress no-ack-filter split-gso rtt 100ms noatm overhead 50 mpu 64 
 Sent 222968238855 bytes 262062557 pkt (dropped 5217703, overlimits 454633670 requeues 0) 
 backlog 1240360b 920p requeues 0
 memory used: 15505472b of 15140Kb
 capacity estimate: 317Mbit
 min/max network layer size:           29 /    1500
 min/max overhead-adjusted size:       79 /    1550
 average network hdr offset:            0

                  Tin 0
  thresh        317Mbit
  target            5ms
  interval        100ms
  pk_delay       62.7ms
  av_delay       19.4ms
  sp_delay        731us
  backlog      1241832b
  pkts        267281179
  bytes    230263130128
  way_inds    101565684
  way_miss       193724
  way_cols      8605323
  drops         5217703
  marks             291
  ack_drop            0
  sp_flows          422
  bk_flows          437
  un_flows            0
  max_len          1500
  quantum          1514
          RX packets:295944344 errors:0 dropped:319 overruns:0 frame:0
          RX packets:277632471 errors:0 dropped:2616 overruns:0 frame:0

There's not going to be much of a difference either way, so keeping it the way it comes up by default would have its merits (convenience/laziness).

Personally, I would put the 'better' card on the LAN side, as that's typically where most of the magic happens (multiple VLANs, complex inter-VLAN routing at full 1 GBit/s wire speed, etc.) - while your WAN 'just' has to meet the rather simple demands of your ISP (single VLAN, 'just' 330 MBit/s) - with all the complex stuff (PPPoE(?), SQM(!)) happening on the CPU (and not the ethernet card/driver). On paper, that should be the Intel card (quite some differences between e1000, e1000e, igb, igc - and the various supported generations of each), which should offer more in the sense of offloading to the hardware, but don't underestimate the r8168! It might not do as much in hardware, but since PCIe and first-generation Core CPUs, that doesn't need to be a disadvantage (fewer bugs in Linux than some early Intel silicon, firmware and offload engines) and 1 GBit/s is easy for the aforementioned (17 years old or newer) x86_64 hardware anyway (offloading matters for >=10 GBit/s, the faster, the more).

So, really, it doesn't matter - your hardware is way faster than needed and would be just as good a router as it is a desktop (arbitrary Windows 11 CPU requirements aside).

Still a relatively large average delay, well above the 5ms target, which means some level of dropping is guaranteed: 100*11496347/284537006 = 4.04%. That is still noticeable, but down from the ~20% you had before. However, the mere 172 ECN marks indicate that S3 might not support ECN, or that you did not enable it on your laptop? The ECN settings in /etc/config/sqm are ignored by cake; it will always use ECN signaling for packets/flows that have either ECT(0) or ECT(1) set, but for that to be the case both endpoints need to negotiate ECN.
If we assume bulk flows to carry data and sparse flows to carry ACKs (which is not 100% correct) the full rotation delay would be:
(1000*((1500+50)*8)/(317*1000^2)) * 396 = 15.4902208202 milliseconds
(1000*((100+50)*8)/(317*1000^2)) * 469 = 1.77539432177 milliseconds
-> ~17ms, again still above the target, which will more or less keep all flows in drop mode.
The average rate of the bulk flows would be 317/(396+(469/40)) = 0.78 Mbps (assuming ACK flows take 1/40 of the forward data flows), or (0.78*1000^2)/(1550*8) = ~62.9 packets/second with an average packet interval of 1000/((0.78*1000^2)/(1550*8)) = ~15.9 ms. As expected, most flows seem squeezed down to a single queued packet, implying that we see the cake behavior that Dave described.
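
If you want to reproduce those two estimates, this is all there is to it (numbers taken from the egress cake stats above):

echo '1000*((1500+50)*8)/(317*10^6)*396' | bc -l   # bulk flows  -> ~15.49 ms
echo '1000*((100+50)*8)/(317*10^6)*469' | bc -l    # sparse/ACK flows -> ~1.78 ms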
(On first glance the download direction looks similar so I will just refrain from running the numbers for that).

If the cyclic delay/CPU load behavior is gone, it indicates that your old router was not fully up to the task... surely cake was part of the problem, but I still think it probably was not cake/fq_codel alone; that is pure speculation, though.

Fair enough. Maybe you can create two scripts to start the synchronization, one tuned for large and one tuned for small files, and test which behaves better with mixed loads, so you know what to use when in doubt?
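
For example, two thin wrappers around s5cmd, one per profile (bucket name, paths and worker counts are just placeholders based on the numbers earlier in the thread):

# sync-large.sh: few fat transfers, modest parallelism
s5cmd --numworkers 64 cp "s3://my-bucket/large/*" /local/large/
# sync-small.sh: many tiny objects, high parallelism
s5cmd --numworkers 512 cp "s3://my-bucket/small/*" /local/small/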

I guess if you are happy as is, the one thing still to try is whether you can configure your laptop to use ECN, which might allow reducing the 4% drop rate a bit, hopefully resulting in more effective throughput and smoother traffic per flow (fewer retransmissions).

Did not make much difference as far as I can see. Probably not related to enabling ECN, but I have just finished a phone call with no issues. It does look like I no longer have to coordinate/schedule bulk uploads/downloads and can take calls from the same laptop at the same time.
That was the goal: not having to run around asking whether I can start moving large amounts of data, sometimes in both directions.

qdisc noqueue 0: dev lo root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc fq_codel 0: dev eth0 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64 
 Sent 318259549952 bytes 390660402 pkt (dropped 109, overlimits 0 requeues 911) 
 backlog 0b 0p requeues 911
  maxpacket 6056 drop_overlimit 0 new_flow_count 16749 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 0: dev eth1 root refcnt 2 limit 10240p flows 1024 quantum 1522 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64 
 Sent 370137296573 bytes 406472140 pkt (dropped 1004, overlimits 0 requeues 107538) 
 backlog 0b 0p requeues 107538
  maxpacket 1522 drop_overlimit 0 new_flow_count 137387 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 0: dev eth2 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
  maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc noqueue 0: dev eth0.10 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.20 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.30 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.40 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.50 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.60 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.71 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.72 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.73 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.74 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.75 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.76 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.77 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth1.35 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc cake 8015: dev pppoe-wan root refcnt 2 bandwidth 317Mbit besteffort dual-srchost nat nowash no-ack-filter split-gso rtt 100ms noatm overhead 50 mpu 64 
 Sent 81578484660 bytes 84737325 pkt (dropped 1828275, overlimits 129613193 requeues 0) 
 backlog 1452088b 1008p requeues 0
 memory used: 15252Kb of 15220Kb
 capacity estimate: 317Mbit
 min/max network layer size:           35 /    1500
 min/max overhead-adjusted size:       85 /    1550
 average network hdr offset:            0

                  Tin 0
  thresh        317Mbit
  target            5ms
  interval        100ms
  pk_delay        126ms
  av_delay       13.9ms
  sp_delay        951us
  backlog      1452088b
  pkts         86566608
  bytes     83941889994
  way_inds     27354810
  way_miss        44954
  way_cols       441436
  drops         1828275
  marks               0
  ack_drop            0
  sp_flows          302
  bk_flows          210
  un_flows            0
  max_len         63296
  quantum          1514

qdisc ingress ffff: dev pppoe-wan parent ffff:fff1 ---------------- 
 Sent 61164595558 bytes 84001339 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc cake 8016: dev ifb4pppoe-wan root refcnt 2 bandwidth 317Mbit besteffort dual-dsthost nat wash ingress no-ack-filter split-gso rtt 100ms noatm overhead 50 mpu 64 
 Sent 60075495068 bytes 83256466 pkt (dropped 744153, overlimits 125211398 requeues 0) 
 backlog 1028334b 720p requeues 0
 memory used: 6946688b of 15140Kb
 capacity estimate: 317Mbit
 min/max network layer size:           32 /    1500
 min/max overhead-adjusted size:       82 /    1550
 average network hdr offset:            0

                  Tin 0
  thresh        317Mbit
  target            5ms
  interval        100ms
  pk_delay         46ms
  av_delay         11ms
  sp_delay        223us
  backlog      1028334b
  pkts         84001339
  bytes     61164595558
  way_inds     31868321
  way_miss        49542
  way_cols       363182
  drops          744153
  marks               0
  ack_drop            0
  sp_flows          255
  bk_flows          253
  un_flows            0
  max_len          1500
  quantum          1514

ECN was not used, otherwise we would see some marks here... it is possible that the AWS servers do not support it, or maybe the local S3 tools disable it...

Well, I have no control over that. BTW, you mentioned TCP slow start a few times: should it be disabled?

Often one needs to explicitly configure the OS to use ECN; Dave collected some instructions some time ago:
https://www.bufferbloat.net/projects/cerowrt/wiki/Enable_ECN/
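
Roughly what that page boils down to (from memory, so double-check the page; the macOS defaults of 2 are the ones mentioned further down in the thread):

# Linux (the default of 2 only accepts ECN when the peer requests it)
sudo sysctl -w net.ipv4.tcp_ecn=1
# macOS
sudo sysctl -w net.inet.tcp.ecn_initiate_out=1
sudo sysctl -w net.inet.tcp.ecn_negotiate_in=1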

Slow start is, as far as I can tell, not really avoidable.
But in spite of the name, TCP will essentially double its congestion window once every RTT, so this is exponential growth, which really is not that slow. The idea behind slow start is to probe pretty aggressively for the capacity, and once the limit is found, to switch to congestion avoidance with its significantly less aggressive probing behavior.

Slow start is what makes TCP adapt to links from a few kbps to Gbps in a relatively short while. Where it hurts is in your use case, where you apparently use a fresh TCP connection for every file, and many of these files are so small that the TCP connection is still far away from reaching capacity when it already terminates; add to this the TCP and TLS(?) handshakes and you get a pretty high latency for small files and hence a low effective rate...

In one of the Amazon documents they recommend keeping a pool of TCP connections around and re-using them for multiple transfers, but IIRC that was for the REST API or so. Probably the tools you use do not offer such an option?

I did exactly that on macOS.

Which version of macOS? @dtaht mentioned in some other venue that Apple does some wonky things with ECN (it might be that they are already trying to phase out RFC 3168 ECN to prepare for the L4S experiment).
Or, also quite possible, AWS might not offer ECN for transfers to/from S3 buckets.

Latest Monterey. The default was set to 2 and I changed both to 1. CAKE does not squash ECN bits?

I have the same OS on Intel, will try to see whether I can reliably enable ECN.

Now cake will honor ECN and mark packets CE instead of dropping them. However, I believe the overload BLUE component will drop when it engages, but I have not looked at the code recently.

"Squash" is a term we invented to describe stripping out DSCP values. I no longer remember the difference between wash and squash, but in both cases they preserve the ECN bits. Depending on how well the L4S experiment goes, we may have to invent a way to squash that selectively.

The other hidden parameter on OSX is:

sudo sysctl -w net.inet.tcp.disable_tcp_heuristics=1

I have some data showing that later versions of OSX only negotiate ECN for HTTPS connections. So the only way to tell whether it is working is to do a packet capture and/or inspect cake.
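
A quick way to do that check from the laptop (interface name is just an example; the filters match packets with ECN TCP flags and CE-marked IPv4 packets respectively):

sudo tcpdump -ni en0 'tcp[13] & 0xc0 != 0'   # ECE and/or CWR set, i.e. ECN negotiation/feedback
sudo tcpdump -ni en0 'ip[1] & 0x3 == 0x3'    # packets carrying a CE mark
# and on the router: tc -s qdisc show dev pppoe-wan   (watch the "marks" counters)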

Yes, on overload, BLUE will start dropping packets. It's a feature that rarely engages but might in your scenario... I use RFC3168-style ECN as a means to debug the qdisc and AQM...

Wash is cake's term for: first look at the DSCP and sort the packet into the matching priority tin for the selected diffserv mode (for besteffort there is only a single tin, so all DSCPs end up in the same tin), and then re-mark the DSCP field so as not to leak your home DSCPs to your ISP, who might do silly things with DSCPs, including dropping the packets completely. The idea is that the combination of diffserv modes and wash allows more different, potentially helpful courses of action than a pure 're-mark DSCP before cake does its priority tinning' would.
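
In tc terms it is just the wash keyword on the cake invocation; for illustration, the egress cake from the stats above with wash instead of nowash (SQM normally generates this line for you):

tc qdisc replace dev pppoe-wan root cake bandwidth 317Mbit besteffort wash \
    nat dual-srchost overhead 50 mpu 64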

thx! Yeah, I viewed getting CS1 into wifi as a bad thing at the time (and still do), but I wasn't allergic to trying to respect it somewhat on the ISP link. How long have we been at this now? 12 years? I'm ready for a rest home when I cannot remember key design decisions like this, or where to find them in the docs....

OpenWrt by now allows users to specify their own qos_map, which should allow moving CS1 and LE back into AC_BE (or just CS1).

Not sure I would consider wash a key feature of cake ;), we really just cleaned up what you had originally prototyped in simple.qos by folding the DSCP re-marking into cake (as fans of one-stop shopping). It was natural to put it after tin sorting (whoever does not want prioritization at all just uses besteffort), since the opposite, putting wash before tin sorting while allowing different diffserv modes, just results in empty tins :wink:

Interesting, because I have seen numerous posts suggesting that all offloads should be disabled and that SQM works better in that case. Is there truth to that?