Issue with IPv6 (6in4) and SQM

I recently updated to 17.01.0-rc1 on a TP-Link C2600. It's working fine so far.

However, there is an old issue (also present in previous snapshot builds) that is still bugging me. Whenever I enable SQM (I tried every qdisc and strategy), my IPv4 works fine and bufferbloat is properly eliminated, but my IPv6 traffic slows down to below 50/s (from 4000k/s). There are no errors in the logs and no excessive CPU usage.

Configuration:

WAN on eth0 (cable)
IPv6 using 6in4 and HE.net

Note: I tried 6to4 and the result is the same.

Note 2: If I manually set up QoS and only enable outgoing shaping, there are no issues; it only causes issues when I enable incoming shaping.

When you are using 6in4, you are not actually using IPv6 at all from your router outwards. The router encapsulates IPv6 packets into IPv4 packets and those are sent to the HE.NET endpoint router.

@moeller0 likely knows better, but I guess that for smooth traffic you need to account for some encapsulation overhead (as for ATM, PPPoE etc.)

But most users don't usually need much incoming shaping, so it is great that your outgoing shaping works.

I believe we currently have no method to peek into encapsulated IP packets: we only ever see the outer TCP/IP headers, and those are invariant in your case, so the whole tunnel is treated as a single flow. With per-flow fairness that really hurts. Using cake with the "nat dual-srchost" options might work a bit better; at least then the HE source IP will not be treated worse than any other external IP (but that is still a far cry from ideal). Another option might be to instantiate SQM on an internal interface before the tunnel; sqm-scripts will then at least see the IPv6 addresses, and that might actually be enough to solve your issue...
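For reference, a minimal sketch of how those cake options could be passed through sqm-scripts in /etc/config/sqm (the option names are sqm-scripts'; which of dual-srchost/dual-dsthost belongs on which direction is my assumption, not something stated above, and the usual interface/bandwidth settings are omitted):

    config queue 'wan'
            option qdisc 'cake'
            option script 'piece_of_cake.qos'
            option qdisc_advanced '1'
            option qdisc_really_really_advanced '1'
            option eqdisc_opts 'nat dual-srchost'   # egress: per internal source host fairness
            option iqdisc_opts 'nat dual-dsthost'   # ingress: per internal destination host fairness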

Hi,

Tried nat dual-srchost, no difference.

After a LOT of fiddling around, here is what works:

  • SQM on eth0
  • SQM on wan6
  • Extra filter =>
    tc filter add dev eth0 parent ffff: protocol ip prio 5 u32 match ip protocol 41 0xff flowid 1:1

What this does is prevent the encapsulated IPv6 packets (IP protocol 41) from being redirected into ifb4eth0, effectively leaving inbound IPv6 unmanaged by the eth0 shaper; it is instead managed by its own SQM instance on wan6. Since prio 5 is lower than the default 10, the filter is matched before the default redirect rule.
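If anyone wants to check that the bypass is in effect, something like the following should show it (a sketch: ifb4eth0 is the IFB device sqm-scripts creates for eth0, and the tunnel device name below is an assumption, use whatever your 6in4 interface is called):

    tc filter show dev eth0 parent ffff:   # the prio 5 protocol-41 rule should list before the prio 10 redirect
    tc -s qdisc show dev ifb4eth0          # inbound IPv6 bytes should no longer be counted here
    tc -s qdisc show dev 6in4-wan6         # ...but on the tunnel's own SQM instance instead (device name assumed)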


Results:
DSLReports speedtest (legend: Overall, Bufferbloat, Quality):

IPv4 speedtest => A, A, A
IPv6 speedtest => a little slower, but => A, A, A


The only remaining issue is that if I have both IPv4 and IPv6 downloads saturating the link, SQM can't reduce bufferbloat all the way. It's still better than without shaping, though: pings sit around 80 ms instead of ~200 ms (reference with a single stream: ~20 ms). It's not a common use case, and it's understandable, since the traffic is managed by two independent shapers instead of one.

Also, if encapsulated packets are a problem, should we add this "hack" to the UI and SQM scripts?

For example:


Advanced options:

Do not manage encapsulated IPv6 packets: Yes/No


We should probably do the same for other problematic packets. Better to have unmanaged packets than uber-slow packets. We can always add another layer for this traffic.

+1 for that.

But there should also be a short explanation of when this option should be used. People should understand which problem it solves in practice. If you just put "Do not manage encapsulated IPv6 packets: Yes/No", most people will skip over it and then open an issue thread.

Honestly, I believe this to be out of scope for sqm-scripts. I think I will look into the flow dissector to figure out whether we could not teach cake to pick the flow-identifying information from the innermost IP header. That would a) be generic for all/most tunnel types, b) not run into the problem of either sacrificing too much bandwidth or getting too little buffer control with two independent shapers for IPv4 and IPv6, and c) not confuse normal users with highly specialized options.
You can of course simply add your filter to one of the .qos scripts on your system, but for general use I am not yet convinced that would be a good solution. Please note that @tohojo is the person to convince to get new features into sqm-scripts; all I can say is that I would rather spend my time on the flow dissector, without any ETA.
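Purely as an illustration of that do-it-yourself route (this is not an sqm-scripts feature), the filter from above could also be re-applied from a small local script after SQM has set up its ingress hook, for example:

    #!/bin/sh
    # Re-apply the protocol-41 bypass once sqm has created its ingress hook on eth0.
    # Device name and priorities follow the earlier post; adjust to your setup.
    if tc qdisc show dev eth0 | grep -q "qdisc ingress"; then
        tc filter add dev eth0 parent ffff: protocol ip prio 5 u32 \
            match ip protocol 41 0xff flowid 1:1
    fi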

Two points: The kernel flow dissector will automatically decapsulate 6in4 packets already (unless something is broken, in which case we should fix it). So no need to add a switch for it.

Second, the issue with IPv6 getting almost no throughput even though IPv4 works is most likely not related to SQM. I have experienced similar issues, and simply turning off GRO on the underlying interface (ethtool -K ethX gro off) fixes it if it's the same issue you're seeing.
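A minimal sketch of checking and disabling it, assuming eth0 is the underlying WAN device:

    ethtool -k eth0 | grep generic-receive-offload   # show the current GRO state
    ethtool -K eth0 gro off                          # disable GRO on the WAN device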

@tohojo

I tried the ethtool tuning and removed my other "hacks", and it seems to work.

Can someone explain why this is an issue?

Not really, no. But we should probably find out :wink:

I did not make the connection with SQM before. Does it only occur when SQM is turned on? And if so, what happens if you turn off downstream shaping (without the GRO fix)?

I can confirm that IPv6 without SQM was working well (and it was not just cake; any shaping that relied on an IFB interface caused the issue). Turning off GRO fixed everything.

Perfect, this fixes my problems with SQM as well! Youtube now behaves normally :smiley:


Can you confirm that this happens with fq_codel as well as cake? What previous kernels did you use?

We really should be inspecting all the way into the headers here, even on GRO.

That said, for a gateway that is not running very fast on that interface, disabling GRO does no harm. Perhaps sqm-scripts should default to turning it off on the interface it is running on. Is GRO on traffic entering the system OK? (This is really hard to test.)

I thought that any giant packet entering the system will be kept intact, so unless cake is used, GRO should be disabled on all interfaces. I have been pondering this; maybe we need another option/checkbox to disable offloads on the SQM interface and/or on all interfaces?
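As a rough sketch of what such an option might end up doing (this is not an existing sqm-scripts checkbox), offloads could be switched off across all ethernet devices, e.g. from something like /etc/rc.local:

    # Disable common offloads on every ethX device; features a driver
    # does not support are silently skipped.
    for path in /sys/class/net/eth*; do
        dev="$(basename "$path")"
        ethtool -K "$dev" gro off gso off tso off 2>/dev/null
    done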

Hi @lede-0x7f,

Not sure if my problem is related to what you experienced. I'm using a ZBT-WG3526 with 17.01.0 r3205-59508e3 stable.

I also have a HE 6in4 tunnel, and the router was producing dumps and rebooting, sometimes several times per day.

I decided to start disabling all kinds of things. I initially thought it was something from collectd, but now I think it may be a combination of collectd and SQM. Right now, after 4 days without SQM (and with collectd enabled again), I have had no reboots or dumps...

I plan to enable SQM again to see if something changes with your suggestion. By the way, my link is via a GPON ONT with a Gigabit interface to the LEDE router; the WAN is 50 Mbps up/down.

In my case, the WAN port is eth0.6 (eth0.1 is LAN and eth0.3 is VoIP) via PPPoE, so should I use eth0.6 instead of eth0?

Should the tc filter go into rc.local or somewhere else, so that it comes back up after a reboot, etc.?

Have you ended up disabling GRO? (I tried that in case it was the cause of the reboots, without success.) Any other offload settings?

What queue discipline and queue script do you recommend?

Thanks!

If eth0.1 and eth0.6 are truly just VLAN interfaces on the same physical port between the SoC and the switch, then you should not instantiate SQM on eth0 (that would shape both directions to the minimum of the configured bandwidths). If you use PPPoE, the best interface would probably be pppoe-wan. Sorry for the delay in noticing this...
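For illustration, the corresponding /etc/config/sqm section might then look roughly like this (option names are sqm-scripts'; the shaping rates are placeholders somewhat below the 50 Mbps line rate, not a recommendation):

    config queue 'wan'
            option enabled '1'
            option interface 'pppoe-wan'    # shape on the PPPoE device, not on eth0 or eth0.6
            option download '45000'         # kbit/s, placeholder below the 50 Mbps line rate
            option upload '45000'           # kbit/s, placeholder
            option qdisc 'cake'
            option script 'piece_of_cake.qos'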