Netfilter "Flow offload" / HW NAT


#1

Looks like Linux is finally getting hardware NAT support:

http://patchwork.ozlabs.org/cover/852523/

More information here:

https://lwn.net/Articles/738214/

The code is new so no drivers will support hardware offloading off the bat. Still, the code is 2.5 times faster than the regular software path.

The patch applies cleanly to 4.14 which seems to be the next LEDE/OpenWrt kernel version.


#2

This is awesome news! What are the chances of this being merged into LEDE? The fastpath implementation was rejected because, AFAIK, it bypassed a lot of the kernel's processing of the traffic.


#3

All these "let's fast-forward by bypassing the kernel classification of packets" schemes will invalidate QoS tools to some extent, so none of them will probably be on by default.

But if this gets into the upstream Linux kernel (which the Fastpath has not done yet), then the functionality might be selectable at least, although it might not be on by default.


#4

That's good to hear. Choices are always nice to have. Will enabling this have any implications on the security of the router? Speed is nice to have, but I value my security higher than some additional routing performance.


#5

@Mushoz

This will likely be mainlined, so LEDE/OpenWrt will get it for free eventually. It's easily applied to the 4.14 kernel, so my guess is community builds will appear with it.

@hnyman

Hey man, arokh here, back from the dead. Good to see some familiar names around here :slight_smile: I think soft/hard flow offloading is more likely to be useful when you've got big pipes, and then QoS is likely not so important. In any case, it seems that you can apply this flow offloading to specific traffic. If you have asymmetrical upload/download, it would make sense not to do this for upload.

This code is by Pablo Neira Ayuso, who is the head of the Netfilter team. Looks promising!


#6

Has anyone played around with these patches yet, now that the master branch of LEDE also supports the 4.14 kernel? @nbd, I noticed your message on the netdev mailing list stating you wanted to start playing with the patches on LEDE/OpenWrt ( https://www.mail-archive.com/netdev@vger.kernel.org/msg199872.html ). Is that something you have done already? I'm really curious how difficult this is to get up and running on LEDE and what the performance implications are going to be.


#7

With a gigabit fiber I found it was totally unusable for VOIP without extensive QoS, in fact much more so than for slower connections. One assumes this is probably because you can slam thousands of packets into a FIFO in a very short time, but particularly for WiFi connections you can't necessarily drain that packet queue at the same fast rate.

Anyway, my 2c is that if you're routing anything over a few hundred megabits, you're going to want an x86 box and extensive QoS. I'm sure some people, especially those who don't game, don't use VOIP, Skype or Google Hangouts, and don't do anything else latency sensitive, will benefit a lot, but I doubt that it should be on by default.


#8

Try adjusting the TxQueue of your router’s network interfaces to a lower figure, like 8, and let the client TCP congestion control handle it automatically.

IMHO QoS is only useful for router uplink as your ISP is controlling the downlink rate.

Home routers are not powerful enough to shape traffic effectively, IMHO.


#9

Depends on the bandwidth they have to shape. Reasonably modern home routers shape a typical 25/3 Mbit DSL connection just fine, and downstream shaping is helpful: it assigns the packet loss to the less important streams, thereby choosing which streams to slow down. Setting a short tx queue just means you'll lose packets indiscriminately, which will kill your voice performance.

In my extensive experience over the last 5+ years of dealing with VOIP systems, anyone who has VOIP has no choice but to shape for that service if they want to actually have anyone be willing to talk to them. Voice is extremely latency and packet-loss sensitive.

I've finally arrived at a custom HFSC-based script, and I route with an x86, and it works well.


#10

For a gigabit wan pipe:

8 packets * 1500 bytes/packet * 8 bits/byte / (1e9 bits/s) = 96 microseconds, so with your suggestion the queue holds only about 100 microseconds of traffic and the router buffers overflow within that time. Most likely you want a minimum of 10 times that value, and more likely 50-100 times, which is in fact close to the default of 1000 packets. However, downstream, on a 20 Mbit WiFi connection between an AP and a mobile phone perhaps, the same 1000 packets will take:

1000 packets * 1500 bytes/packet * 8 bits/byte / (20e6 bits/s) = 600 ms to empty

which just goes to show that for VOIP you really need end to end quality of service, starting by tagging DSCP on your router, shaping both directions of the router flow, then honoring that DSCP in your switches, and having an AP that takes advantage of the DSCP marks to set the WMM queues. There's really no other way, traffic shaping is an integral part of a modern network.
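
For anyone who wants to plug in their own numbers, here is a rough sketch of the arithmetic above. The function name and the 1500-byte packet size are just assumptions for illustration; real traffic has a mix of packet sizes and link-layer overhead that this ignores.

```python
def drain_time_s(packets, link_bps, packet_bytes=1500):
    """Seconds needed to transmit `packets` full-size packets at `link_bps`."""
    return packets * packet_bytes * 8 / link_bps


# The two cases discussed above (assumed 1500-byte packets).
wan = drain_time_s(8, 1e9)        # 8-packet queue on a gigabit WAN
wifi = drain_time_s(1000, 20e6)   # 1000-packet queue on a 20 Mbit WiFi link

print(f"8 packets at 1 Gbit/s:     {wan * 1e6:.0f} microseconds")
print(f"1000 packets at 20 Mbit/s: {wifi * 1e3:.0f} ms")
```

The same function shows why one queue length can't suit both ends: a length sized for the gigabit side is far too deep for the slow WiFi hop.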


#11

future is bright :slight_smile:

https://git.openwrt.org/?p=openwrt/staging/nbd.git;a=summary


#12

Fantastic news. I cannot wait until this reaches the master branch.


#13

Wouldn't 600 ms make VoIP unusable? Imagine a queue 1000 packets deep: a VoIP packet that has just joined the end of the full queue has to wait 600 ms just to be sent out onto the wire.

Linux routers have TCP/UDP receive and transmit buffers, which can be adjusted to hold received packets before being sent into the transmit queue.

My limited understanding of Linux networking is as follows:

[1] Socket buffer -> [2] QDisc Queue -> [3] Network interface queue -> [4] Wire/Phy

So I would think that [2] and [3] should be as short as possible, i.e. just enough to saturate the wire/phy. [1] should be as big as possible to avoid dropped packets.

QDisc can be implemented in most consumer routers, but it would not be able to scale effectively.

IMHO, most consumer routers just do not have enough grunt and resources to effectively shape traffic. Also, it only ever makes sense to shape uplink traffic in a home setup, as there's no effective way to shape downlink, since it's controlled by your ISP.

From my experience with my 50/20 Mbps link, VoIP, video streaming and app update downloads that saturate the downlink do not really present a major issue when the uplink is relatively free.

My router gets an A from DSLReports.com test as well without any QoS enabled, so I guess ISP plays a part as well? It could also be because my links are not considered fast. Anyway, I would think networking is more an art than science :stuck_out_tongue: so probably have to trial and error until we get a setup that works for our own use.


#14

Yes, that's the point, which is why you need prioritization in your switches and your APs. When an AP with WMM gets a VOIP packet with DSCP tag it will send it through the AC_VO queue right away instead of making it wait in a single 600ms FIFO.

In your notation, [2] doesn't need to be short, it just needs to be multi-lane, so that low-latency packets like VOIP or game traffic or whatever go into a short queue that gets serviced right away. No one cares if there's a 600ms delay in an all-night torrent download.
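
To make the multi-lane idea concrete, here is a toy sketch comparing one FIFO with a two-lane strict-priority queue. This is only an illustration under simplifying assumptions (1500-byte packets, a 20 Mbit link, the voice lane always served first). It is not how WMM or any real qdisc is implemented, and the function name is made up.

```python
from collections import deque

LINK_BPS = 20e6          # assumed 20 Mbit link
PKT_BITS = 1500 * 8      # assumed full-size packets


def voip_wait_ms(lanes, backlog=1000):
    """Milliseconds a VoIP packet waits behind `backlog` bulk packets.

    lanes=1 models a single FIFO; lanes=2 models strict priority with a
    separate voice lane that is always served before the bulk lane.
    """
    bulk = deque("bulk" for _ in range(backlog))
    voice = deque()
    # The VoIP packet arrives after the bulk backlog has built up.
    (voice if lanes == 2 else bulk).append("voip")

    sent = 0
    while True:
        queue = voice if voice else bulk   # voice lane first, if non-empty
        if queue.popleft() == "voip":
            return sent * PKT_BITS / LINK_BPS * 1e3
        sent += 1


print(f"single FIFO: {voip_wait_ms(1):.0f} ms")
print(f"two lanes:   {voip_wait_ms(2):.0f} ms")
```

With one lane the voice packet sits behind the whole backlog; with two lanes it goes out next, while the bulk traffic waits exactly as long as before.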

At high bandwidth, yes, that's correct: many consumer routers will have difficulty with more than, say, 50 or 100 Mbit and a decent shaper. Some of the newer ones, such as the ARM-based ones, can probably handle 200 to 400 Mbit with shapers.

That you can't shape the downlink is a myth. The point of downstream shaping is that torrents, banner ad downloads and the like don't care about hundreds of milliseconds of delay or dropped packets, and VOIP does. So if you need more bandwidth, you should figure out which streams to slow down. You do this by making sure the VOIP packets have priority over the torrents and banner ads, so that the banner ads build up a backlog in your qdisc. The size of this backlog tells something like fq_codel that the stream should drop packets, and the upstream sender then sees the dropped packets and slows its sending rate. In other words, TCP and many other protocols are a feedback loop, and you need to send feedback to the latency-tolerant streams to slow down so your intolerant streams can have some bandwidth.


#15

Future is here :slight_smile:
Scroll all the way down for a first version of flow offload implemented by @nbd

https://github.com/lede-project/source/pull/1269#issuecomment-367056477


#16

Flow offload in trunk.
https://github.com/openwrt/openwrt/commit/820f03099894bd48638fb5be326b5c551f0f2b98


#17

What packages do I need to build in or install in the firmware? How do I configure it? Any drawbacks?


#18

You will need to compile a build for your device with kernel 4.14 and the module required for flow offload, then add an iptables rule that uses that module.


#19

What kind of rule would you use if you want to offload all traffic?


#20

Add a rule to the FORWARD chain with: "-j FLOWOFFLOAD"