Routing Bottleneck

Mushoz · May 16, 2017, 7:19pm

I did a benchmark with my DIR-860L to test its NAT performance with two local machines. I was wondering if this test was a good way to test the NAT performance:

All settings not mentioned were set to their defaults
The default WAN interface was changed from DHCP to a static IP at 192.168.2.1
A client was connected to the WAN interface with a static IP of 192.168.2.2 and an iperf3 server was started on it
A second client was connected to the LAN side and got an IP through DHCP in the 192.168.1.x range
An iperf3 client was run on the second client and was connected to the server on the WAN side.
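
For anyone wanting to repeat this setup, here is a minimal sketch of how the LAN-side client run could be scripted. It is an illustration, not what was actually run for this post: the 30-second duration and the function names are assumptions, and the JSON fields read here (`end.sum_received.bits_per_second`) are what iperf3 reports for TCP tests with `-J`, as far as I know. The `-R` flag reverses the direction so the server sends to the client.

```python
import json
import subprocess


def iperf3_command(server, seconds=30, reverse=False):
    """Build the iperf3 client command line (JSON output requested with -J)."""
    cmd = ["iperf3", "-c", server, "-t", str(seconds), "-J"]
    if reverse:
        cmd.append("-R")  # server sends, client receives
    return cmd


def received_mbit(report):
    """Extract the receiver-side throughput in Mbit/s from an iperf3 JSON report."""
    return report["end"]["sum_received"]["bits_per_second"] / 1e6


def run_test(server, seconds=30, reverse=False):
    """Run one iperf3 test against `server` and return throughput in Mbit/s."""
    result = subprocess.run(
        iperf3_command(server, seconds, reverse),
        capture_output=True,
        text=True,
        check=True,
    )
    return received_mbit(json.loads(result.stdout))
```

Calling `run_test("192.168.2.2")` and then `run_test("192.168.2.2", reverse=True)` would cover both directions one at a time, which matches how the test above was done. It does not load both directions simultaneously.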

Results:

WAN <-> LAN was 900+ mbit in both directions (Gigabit speeds) with ~60-70% CPU load on the router
WAN <-> LAN speed was limited to ~600-700 mbit in both directions with fq_codel + qos_simplest running, at ~80% CPU load (probably not perfectly multithreaded, which prevented it from scaling all the way to 100% CPU)

I am now running the router in a real world scenario (you know, as an actual router) and the following observations have me baffled:

I am reaching 500 mbit down and the upload oscillates between 430-500 mbit on my 500/500 mbit connection. The ISP-provided router would always do 500 mbit in both directions with the help of hardware NAT.
Running fq_codel + qos_simplest would bring down the speed to around 350-400 mbit in both directions, regardless of ingress and egress settings. CPU load on the router would be 80%+

My question is: Why are my results that much worse compared to the artificial tests I did before? I have a few theories:

The real world scenario uses PPPoE to connect to my ISP. Is the PPPoE CPU overhead large enough to explain the differences between the two tests? If so, are there any optimizations that I could apply?
The real world scenario uses IPv6, both on the LAN and WAN. Is the IPv6 CPU overhead that much higher than that of IPv4, which was used in the artificial test? And again, if so, are there any optimizations that I could apply?
Was my artificial test flawed in the first place and not representative of real world routing/NAT performance? What would explain the differences and how should I have tested the maximum NAT performance instead?

I'm looking forward to any explanations that could hopefully clear up some of my confusion. Thank you very much in advance!

moeller0 · May 16, 2017, 7:53pm

Side note, and you probably already know this, but for the sake of others stumbling over this thread: I typically recommend monitoring the idle value in top's output. If it goes close to zero, it's a good sign that you are CPU limited. sqm-scripts loads often cause high sirq load, which is not accounted to either sys or user, but idle shows everything not accounted for by the other columns.
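
As a rough illustration of the "100 - idle" idea, the following sketch computes the busy and softirq percentages from two samples of the aggregate `cpu` line in `/proc/stat`. The function names are made up for this example, and only the first eight counters (user, nice, system, idle, iowait, irq, softirq, steal) are used, assuming a reasonably recent Linux kernel.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # user, nice, system, idle, iowait, irq, softirq, steal
    return [int(value) for value in fields[1:9]]


def load_percent(before, after):
    """Return (busy %, softirq %) between two samples, where busy = 100 - idle."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0, 0.0
    idle = deltas[3]
    softirq = deltas[6] if len(deltas) > 6 else 0
    return 100.0 * (total - idle) / total, 100.0 * softirq / total


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (the aggregate over all cores)."""
    with open(path) as handle:
        return handle.readline()
```

Taking one `read_cpu_line()` sample before and one after a speed test and passing the parsed values to `load_percent` gives the average load over the test. Because the aggregate line averages over all cores, a single saturated core on a multi-core router can still show up as well below 100%.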

Did you test a load that simultaneously saturated both ingress and egress?

When you say regardless of settings, how do things look if you set both directions to 100 Mbps (to pick a value where the combined sum stays well below your reported actual throughput numbers)?

Mushoz · May 16, 2017, 7:59pm

Correct. I was stating the 100 - idle % as the load percentage. Sorry for any confusion caused.

I did not. For the artificial test I was running an iperf3 test one direction at a time. For the real world test, I was running dslreports.com speedtest which also tests one direction at a time.

Allow me to clarify: Setting egress and/or ingress higher would not allow the speed to increase above the aforementioned 350-400 mbit. Setting egress and ingress lower would cap the speed at just below the setting that was set, which shows that SQM is working correctly.

drbrains · May 17, 2017, 9:43am

I'm sure you looked into tweaking settings a bit like they did on OpenWrt:
https://wiki.openwrt.org/toh/d-link/dir-860l

Hardware NAT would be great if I could get the source to compile on kernel 4.4 or 4.9!

Mushoz · May 17, 2017, 11:50am

I have looked at those tweaks, but I have not applied them yet. I'm first trying to figure out why a performance discrepancy exists between the two tests. After that has been sorted out, I can look at further optimizations. So I am really curious:

Was my artificial test done correctly? If not, what is a good way to test NAT performance locally?
How big is the PPPoE CPU overhead? Can this explain the difference in performance between the two situations? Any tweaks I can apply to reduce the CPU overhead caused by PPPoE?
How big is the CPU overhead for IPv6 vs IPv4? Can this cause the difference between the observed performance? The artificial test was done with IPv4, while the real world test was using IPv6 routing on both the WAN and LAN side.

Mushoz · May 17, 2017, 8:30pm

I found out that PPPoE does have a decent CPU overhead, so maybe this explains the performance discrepancy. I would have to try a local NAT test with a PPPoE connection as well to confirm. I did find some documentation about the pppoe client that is used with pppd on OpenWrt/LEDE over here: https://linux.die.net/man/8/pppoe

The interesting part is:

[QUOTE]-s

Causes pppoe to use synchronous PPP encapsulation. If you use this option, then you must use the sync option with pppd. You are encouraged to use this option if it works, because it greatly reduces the CPU overhead of pppoe. However, it MAY be unreliable on slow machines -- there is a race condition between pppd writing data and pppoe reading it. For this reason, the default setting is asynchronous. If you encounter bugs or crashes with Synchronous PPP, turn it off -- don't e-mail me for support![/QUOTE]

Where is the configuration for pppoe located on LEDE? I did find the config file for pppd at /etc/ppp/options, but I can't find the pppoe configuration file. I would love to try whether this can improve performance over my PPPoE connection.

drbrains · May 18, 2017, 8:31am

LEDE uses a kernel-side PPPoE implementation, not a user-space one like the reference you linked, so CPU overhead should be minimal.

If your ISP allows it, increase your eth0 MTU to 1508 so your PPPoE MTU can be 1500. You might need the mini-jumbo frame enable patch to compile it like that. This will keep all your packets at 1500, like in your local LAN test. Splitting packets will cause overhead, though I am not sure it would cause that much. I can't test it myself at the moment.

moeller0 · May 18, 2017, 10:02am

This is most likely not going to make a big difference, as the router hopefully uses MSS clamping to make sure that all TCP flows fit a 1492-byte MTU instead of 1500. If drbrains is correct you would see lots of IPv4 fragments in your tests, but since you have problems with IPv6 flows (which routers do not fragment in transit) I am almost certain this is not your issue. I do agree that if the ISP allows baby-jumbo frames it makes sense to use them; as far as I can tell BT does allow them, everybody else does not...
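
To make the numbers in this exchange concrete, here is a small worked sketch of the MTU and MSS arithmetic. It assumes the standard 8 bytes of PPPoE overhead (6-byte PPPoE header plus 2-byte PPP protocol field), 20-byte IPv4 and 40-byte IPv6 headers, and a 20-byte TCP header without options. The function names are only for illustration.

```python
PPPOE_OVERHEAD = 8   # 6-byte PPPoE header + 2-byte PPP protocol field
TCP_HEADER = 20      # without TCP options
IPV4_HEADER = 20     # without IP options
IPV6_HEADER = 40     # fixed header, no extension headers


def pppoe_mtu(ethernet_mtu):
    """MTU available inside the PPPoE session for a given Ethernet MTU."""
    return ethernet_mtu - PPPOE_OVERHEAD


def tcp_mss(mtu, ipv6=False):
    """Largest TCP payload per packet that fits the given MTU."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - TCP_HEADER


standard = pppoe_mtu(1500)    # plain Ethernet underneath PPPoE
baby_jumbo = pppoe_mtu(1508)  # Ethernet MTU raised by 8 bytes
print(standard, tcp_mss(standard), tcp_mss(standard, ipv6=True))
print(baby_jumbo, tcp_mss(baby_jumbo), tcp_mss(baby_jumbo, ipv6=True))
```

With MSS clamping the TCP endpoints simply send slightly smaller segments (1452 instead of 1460 bytes of payload over IPv4), so the router forwards about 0.5% more packets for the same payload rate. That is a small effect compared to the throughput gap described in this thread.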

Best Regards

Mushoz · May 18, 2017, 6:58pm

Yes, MSS clamping is enabled. The router doesn't currently support jumbo frames, but the master branch just received a commit 6 hours ago that implemented jumbo frames up to 2k for mt7621. It doesn't really help much with throughput though. I already applied the MTU test a few weeks ago myself, and I didn't see any improvements in throughput.

I was just thinking: NAT shouldn't really be needed when using IPv6. Is it possible to disable NAT for IPv6 routed traffic, while keeping it enabled for IPv4 routed traffic? I will have to look into this. The router not having to do NAT should help tremendously in getting the maximum performance possible out of the little box.

If anybody has some tips on how to do this, feel free to give me some pointers. In the meantime, I will put my Google-fu to the test. Wish me luck!

moeller0 · May 18, 2017, 7:27pm

I venture the guess that your router is not even attempting to NAT the IPv6 traffic at all, since for the longest time Linux did not allow NAT for IPv6 (the kernel has since relented and allows some NAT for IPv6, but I believe that LEDE does not enable that out of the box, and rightfully so).

Best Regards

Mushoz · May 18, 2017, 7:32pm

That's good to hear. But then it is even stranger that I am seeing lower speeds in the real world test, right? I thought that most of the CPU usage when routing traffic is NAT related, so if NAT is disabled the CPU usage should be quite low, but it isn't. What gives?

Mushoz · May 18, 2017, 9:03pm

Disregard my IPv6 comments. While my PC connects to dslreports.com through IPv6, the connections to the speed test servers are actually IPv4 connections. I guess PPPoE is thus the only logical explanation for the performance discrepancy. I would have to retest locally and use a PPPoE connection as well to confirm.
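
For anyone who wants to check the same thing, this is a minimal sketch of how a list of remote addresses (for example copied from the connection list while a speed test runs) could be tallied by IP version. The addresses in the usage note are placeholders from the documentation ranges, not the real speed test servers, and the function names are invented for this example.

```python
import ipaddress


def ip_version(address):
    """Return 4 or 6 for a textual IP address."""
    return ipaddress.ip_address(address).version


def count_by_version(addresses):
    """Count how many of the given remote addresses are IPv4 and how many IPv6."""
    counts = {4: 0, 6: 0}
    for address in addresses:
        counts[ip_version(address)] += 1
    return counts
```

Calling `count_by_version(["192.0.2.10", "2001:db8::1", "192.0.2.11"])` tallies two IPv4 addresses and one IPv6 address. If the bulk of the test connections turn out to be IPv4, the traffic is going through NAT regardless of how the page itself was loaded.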