Not directly with S3: I would need one or two intermediate steps. An EC2 instance, then copy files from the bucket to the instance's local storage, then rsync from/to that instance.
Another option is an IPsec tunnel into my VPC with S3 configured as a service endpoint, which would let me use the regular S3 CLI over the VPN.
Either approach is cumbersome to set up, but possible.
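Roughly, the EC2-relay variant would look something like this (instance address, bucket and paths are placeholders, not my actual setup):

```
# 1) On the relay instance: copy from the bucket to local disk over the fast intra-AWS path
ssh ec2-user@my-relay-instance 'aws s3 sync s3://my-bucket/data/ /mnt/scratch/data/'

# 2) From the laptop: rsync against the instance over a single SSH connection
rsync -avz --partial ec2-user@my-relay-instance:/mnt/scratch/data/ ./data/
```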
It is a flat list of objects without directories. They fake directories by "interpreting" the / character in the Console, but the object store itself does not support a hierarchy. Objects cannot be modified in place, only overwritten. It is possible to use multi-part uploads/downloads though.
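For illustration, this is how the "directory" illusion and the multipart knobs look from the AWS CLI (bucket and prefix names are made up):

```
# Ask the API to group keys on the '/' delimiter -- these "CommonPrefixes" are the
# pseudo-directories the Console shows; the store itself only knows flat keys
aws s3api list-objects-v2 --bucket my-bucket --prefix photos/ --delimiter / \
    --query 'CommonPrefixes[].Prefix'

# The AWS CLI switches to multipart transfers automatically above a threshold, which is tunable
aws configure set default.s3.multipart_threshold 64MB
aws configure set default.s3.multipart_chunksize 16MB
```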
It is not, and neither are its alternatives: they use the same S3 API to access the objects, so they do not solve the problem. They arguably make it worse by introducing more complexity.
A) Does AWS accept it (marks in cake)? If so,
B) what do the cake backlog and average delay go to?
C) Is there a difference in CPU load if you are marking frantically instead of dropping frantically?
D) Is there an improvement in FCT?
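For A) and B), a simple way to watch this would be cake's own counters on the shaped interface (interface name is just an example):

```
# Run during a transfer and again a few seconds later to see the deltas.
# 'marks' answers A); 'pk_delay'/'av_delay' and 'backlog' answer B).
tc -s qdisc show dev eth0
```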
I too wondered about rclone. I use that all the time and like it. Pity it won't help here then.
Dumb thought perhaps, but what about WireGuard or the like, either through a well-connected VPS or a VPN? Wouldn't that permit encapsulating everything into one flow over the last-mile link?
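Something like this minimal WireGuard client config, I imagine (keys, addresses and the VPS endpoint are obviously placeholders):

```
# /etc/wireguard/wg0.conf on the laptop
[Interface]
PrivateKey = <laptop-private-key>
Address = 10.8.0.2/24

[Peer]
PublicKey = <vps-public-key>
Endpoint = vps.example.com:51820
# Send everything through the tunnel, so the last mile only sees one UDP flow
AllowedIPs = 0.0.0.0/0
PersistentKeepalive = 25
```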
I think CP/M had a similar flat filesystem (not saying S3's and CP/M's filesystems are identical, just that hierarchy is not a strict requirement for a filesystem)... This however is not really the problematic point IIUC; what is missing is some serialization of small files so that the set-up cost and ramp-up time of a TCP connection can be amortized over a larger number of total packets transferred. The advantage would be that such flows run closer to their achievable capacity share and are considerably more reactive to AQM intervention than a flock of flows that terminate while still in slow start or even in their initial window. At that point a "reasonable" level of parallelism would still help to speed things up, but it would not require a triple-digit number of concurrent connections.
But that is all moot, unless one can somehow MacGyver something like this with a close-to-the-storage VPS/VPC/VM.
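A crude sketch of what I mean, assuming such a close-to-the-storage VM exists (host, paths and bucket are hypothetical): serialize the small files into a single tar stream over one SSH connection and let the VM do the fan-out to S3 on its side.

```
# Push: one long-lived TCP flow over the last mile, many short S3 transfers on the VM side
tar -cf - ./small-files | ssh user@close-to-s3-vm \
    'mkdir -p /tmp/stage && tar -xf - -C /tmp/stage && aws s3 sync /tmp/stage s3://my-bucket/prefix/'

# Pull: same idea in reverse
ssh user@close-to-s3-vm \
    'aws s3 sync s3://my-bucket/prefix/ /tmp/stage && tar -cf - -C /tmp/stage .' | tar -xf -
```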
Well, you were right all along. I ended up getting myself a Lenovo SFF desktop:
Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz
16GB RAM
WAN - RTL8168g/8111g (onboard)
LAN - Intel(R) PRO/1000 Network Connection (a PCIe card)
Configured exactly the same as wrt3200acm
+5 Watts idle (temp 25..30C)
+10 Watts (on top of idle) when fully loaded 300/300 Mbps both ways (temp 40..45C)
Q: Should the Intel NIC be WAN and the Realtek be LAN? Not sure if there is a difference.
With aws cli and s4cmd (both Python) I can do 256 concurrent uploads and 256 downloads (bidirectional). s5cmd (Go) can double that, at 512/512 concurrent streams for download/upload. Both of those max out the CPU on the client laptop, and chances are I cannot have a voice/video call at the same time: ping shows some packet loss and latency jumps to roughly three times its baseline. I can live with that.
All the other clients are unaffected, and if a second one starts downloading/uploading it gets 50% of the bandwidth in either direction. No packet losses on the other clients. Latency is effectively unchanged.
The above was a test with large files. I will test small files later. My goal was to tune for the worst use case (small files / more concurrent uploads and downloads) and then use the same setting for mixed transfers. Sometimes it will be all large files and I would rather not have to remember to reconfigure concurrency. Sometimes I will not even know the file size distribution.
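For reference, these are roughly the concurrency knobs involved; the exact values are just what I am experimenting with, not recommendations (bucket and paths are placeholders):

```
# AWS CLI: raise the S3 transfer concurrency (default is 10)
aws configure set default.s3.max_concurrent_requests 256

# s4cmd: thread count
s4cmd --num-threads 256 put ./upload/* s3://my-bucket/up/

# s5cmd: global worker pool size
s5cmd --numworkers 512 cp "./upload/*" s3://my-bucket/up/
```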
Here is some data in case @moeller0 would like to take a look:
Conntrack usage >3K (with 2 x 512 concurrent streams)
CPU usage between 20 and 40% on all cores.
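In case it matters, this is roughly how I read those numbers off the router:

```
# Current vs. maximum tracked connections
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Per-core CPU load while the transfer is running (htop needs to be installed separately)
htop
```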
There's not going to be much of a difference either way, so keeping it the way it comes up by default would have its merits (convenience/laziness).
Personally, I would put the 'better' card on the LAN side, as that's typically where most of the magic happens (multiple VLANs, complex inter-VLAN routing at full 1 GBit/s wire speed, etc.) - while your WAN 'just' has to meet the rather simple demands of your ISP (single VLAN, 'just' 330 MBit/s) - with all the complex stuff (PPPoE(?), SQM(!)) happening on the CPU (and not the ethernet card/driver). On paper, that should be the Intel card (quite some differences between e1000, e1000e, igb, igc - and the various supported generations of each), which should offer more in the sense of offloading to the hardware, but don't underestimate the r8168! It might not do as much in hardware, but since PCIe and first-generation Core CPUs that doesn't need to be a disadvantage (fewer bugs in Linux than some early Intel silicon, firmware and offload engines), and 1 GBit/s is easy for the aforementioned (17 years old or newer) x86_64 hardware anyway (offloading matters for >=10 GBit/s; the faster, the more).
So, really, it doesn't matter - your hardware is way faster than needed and would be just as good as a router as it is as a desktop (arbitrary Windows 11 CPU requirements aside).
Still a relatively large average delay, well above the 5ms target, which means some level of dropping is guaranteed: 100*11496347/284537006 = 4.04%. That is still noticeable, but down from the ~20% you had before. However, the 172 ECN drops indicate that S3 might not support ECN, or that you did not enable it on your laptop? The ECN settings in /etc/config/sqm are ignored by cake; it will always use ECN signaling for packets/flows that have either ECT(0) or ECT(1) set, but for that to be the case both endpoints need to negotiate ECN.
If we assume bulk flows to carry data and sparse flows to carry ACKs (which is not 100% correct), the full rotation delay would be:
bulk/data: (1000 * ((1500+50) * 8) / (317 * 1000^2)) * 396 = 15.49 milliseconds
sparse/ACK: (1000 * ((100+50) * 8) / (317 * 1000^2)) * 469 = 1.78 milliseconds
-> ~17ms, again still above the target which will more or less keep all flows in drop mode
The average rate of the bulk flows would be 317 / (396 + (469/40)) = 0.78 Mbps (assuming ACK flows take 1/40 of the forward data flows), or (0.78 * 1000^2) / (1550 * 8) = 62.9 packets/second, giving an average packet interval of 1000 / 62.9 = 15.9 ms. As expected, most flows seem depressed to a single queued packet, implying that we see the cake behavior that Dave described.
(At first glance the download direction looks similar, so I will just refrain from running the numbers for that.)
If the cyclic delay/CPU load behavior is gone it indicates that your old router was not fully up to the task... surely cake was part of the problem, but I still think that probably it was not cake/fq_codel alone, but that is pure speculation.
Fair enough. Maybe you can create two scripts to start the synchronization, one tuned for large and one tuned for small files, and test which behaves better with mixed loads, so you know what to use when in doubt?
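A minimal sketch of such a pair of wrappers, assuming s5cmd; the worker counts and bucket name are placeholders to be tuned:

```
#!/bin/sh
# sync-large.sh -- fewer, fatter streams for big files
exec s5cmd --numworkers 32 cp "$1/*" "s3://my-bucket/$2/"

#!/bin/sh
# sync-small.sh -- many workers to hide the per-file connection setup cost
exec s5cmd --numworkers 512 cp "$1/*" "s3://my-bucket/$2/"
```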
I guess if you are happy as-is, the one thing still worth trying is to see whether you can configure your laptop to use ECN, which might reduce the 4% drop rate a bit, hopefully resulting in more effective throughput and smoother traffic per flow (fewer retransmissions).
It did not make much difference as far as I can see. Probably not related to enabling ECN, but I have just finished a phone call with no issues. It does look like I no longer have to coordinate/schedule bulk uploads/downloads and can have calls from the same laptop at the same time.
This was the goal: not having to run around asking if I can start moving large amounts of data, sometimes in both directions at once.
ECN was not used, otherwise we would see some marks here... It is possible that the AWS servers do not support this, or maybe the local S3 tools disable it...
Slow start is as far as I can tell not really avoidable.
But in spite of the name, TCP will essentially double its congestion window once every RTT, so this is exponential growth, which really is not that slow. The idea behind slow start is to probe pretty aggressively for the capacity, and once the limit is found switch to congestion avoidance with significantly less aggressive probing behavior.
Slow start is what makes TCP adapt to links from a few kbps to gbps in a relatively short time. However, where it hurts is with your use case, where you apparently use fresh TCP connection(s) for every file, and many of these files are so small that the TCP connection is still far away from reaching capacity when the connection already terminates; add to this the TCP and (TLS?) handshakes and you have a pretty high latency for small files and hence a low effective rate...
In one of the Amazon documents they recommend keeping a pool of TCP connections around and re-using them for multiple transfers, but IIRC that was for the REST API or so. Probably the tools you use do not offer such an option?
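Just to illustrate the connection-reuse idea (not something the bulk S3 tools necessarily expose): curl, for example, reuses a single TCP/TLS connection when several objects from the same host are fetched in one invocation (URLs are placeholders).

```
# One TCP+TLS handshake for three objects, versus three handshakes
# if each object were fetched in its own invocation
curl -s -O https://my-bucket.s3.amazonaws.com/a.bin \
        -O https://my-bucket.s3.amazonaws.com/b.bin \
        -O https://my-bucket.s3.amazonaws.com/c.bin
```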
Which version of macOS? @dtaht mentioned in some other venue that Apple does some wonky things with ECN (it might be that they are already trying to phase out RFC 3168 ECN to prepare for the L4S experiment).
Or, also quite possible, AWS might not offer ECN for transfers to/from S3 buckets.
I have the same OS on Intel, will try to see whether I can reliably enable ECN.
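If it helps, the knobs I am planning to poke are the macOS TCP ECN sysctls (as far as I know, 1 forces ECN negotiation, while the default is a more conservative heuristic):

```
# Check the current state
sysctl net.inet.tcp.ecn_initiate_out net.inet.tcp.ecn_negotiate_in

# Try to negotiate ECN on outgoing and incoming connections (lasts until reboot)
sudo sysctl -w net.inet.tcp.ecn_initiate_out=1
sudo sysctl -w net.inet.tcp.ecn_negotiate_in=1
```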
Now cake will honor ECN and mark packets CE instead of dropping them. However, I believe the overload BLUE component will drop when it engages, but I have not looked at the code recently.
The term "squash" we invented to describe stripping out dscp values. I no longer remember the difference between wash and squash, but in both cases they preserve the ecn bits. Depending on how well the l4s experiment goes we may have to invent a way to squash that selectively.
I have some data that shows that later OSXs are only negotiating ecn for https connections. So the only way to tell if it is working or not is to do a packet capture and/or inspect cake.
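For example, something along these lines (interface names are assumptions): check whether the SYN/SYN-ACK exchange carries ECE/CWR, and whether cake's marks counter moves during a transfer.

```
# On the laptop: ECN-negotiating handshakes set ECE/CWR on the SYN exchange
sudo tcpdump -ni en0 'tcp port 443 and tcp[tcpflags] & (tcp-ece|tcp-cwr) != 0'

# On the router: the 'marks' row in cake's statistics should increase if ECN is in use
tc -s qdisc show dev eth0
```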
Yes, on overload, blue will start dropping packets. It's a feature that rarely engages but might be engaging in your scenario... I use RFC3168-style ECN as a means to debug the qdisc and AQM...
Wash is cake's term for 'first look at the DSCP and sort the packet into the matching priority tin for the selected diffserv mode (for besteffort there is only a single tin, so all DSCPs end up in the same tin), and then re-mark the DSCP field so as not to leak your home DSCPs to your ISP', who might do silly things to DSCPs, including dropping the packets completely. The idea is that the combination of diffserv modes and wash allows more different, potentially helpful courses of action than purely re-marking the DSCP before cake does its priority tinning.
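In raw tc terms the two extremes would look roughly like this (interface and bandwidth are placeholders):

```
# Sort into four tins by DSCP, then clear the DSCP on the way out to the ISP
tc qdisc replace dev eth0 root cake bandwidth 330mbit diffserv4 wash

# Versus: ignore DSCP for scheduling entirely (single tin);
# 'wash' can still be appended here if you want the field cleared anyway
tc qdisc replace dev eth0 root cake bandwidth 330mbit besteffort
```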
thx! yea, I viewed getting CS1 into wifi a bad thing at the time (and still do) but I wasn't allergic to trying to respect it somewhat on the ISP link. How long have we been at this now? 12 years? I'm ready for a rest home, when I cannot remember key design decisions like this, or where to find them in the doc....
OpenWrt by now allows users to specify their own qos_map, which should allow moving CS1 and LE back into AC_BE (or just CS1).
Not sure I would consider wash a key feature of cake ;). We really just cleaned up what you had originally prototyped in simple.qos by folding the DSCP re-marking into cake (as fans of one-stop shopping); it was natural to put it after tin sorting (whoever does not want prioritization at all just uses besteffort), as the opposite, putting wash before tin sorting while still allowing different diffserv modes, would just result in empty tins.
Interesting, because I have seen numerous posts suggesting that all offloads should be disabled and that SQM works better in that case. Is there any truth to that?
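For what it's worth, the offload state can be inspected and toggled per interface with ethtool, so it is easy to test both ways (interface name is an assumption; not every NIC supports every feature):

```
# Show the current offload settings
ethtool -k eth0

# Example: turn off the segmentation/receive offloads most often blamed
# for interfering with traffic shaping
ethtool -K eth0 gro off gso off tso off lro off
```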