Thanks @plater. I'm not on @dave14305's level - hardly anyone is. @dave14305 any chance you could elaborate a little on why the new version doesn't work and why the old version works, for the benefit of @plater, me and other interested readers? I'd also be interested if you see any other way to implement the ct approach whilst still covering this edge case, since it seems that's still 5-10% faster according to @plater's testing (at least the version that doesn't work!).
The new version relies on ct state, which changes during the course of the connection's lifetime. I read this article so that I appear smarter (I think @ldir posted this somewhere a long time back):
https://thermalcircle.de/doku.php?id=blog:linux:connection_tracking_3_state_and_examples
For the current implementation to work, the connection state has to be "new" or "untracked" when it's being sent out the wan interface. It's designed around marking traffic on egress to wan and restoring the DSCP on ingress from wan. This covers most scenarios just fine.
But a port forward from the Internet to our local LAN is "backwards" in regards to our design. The initial packet comes in on the wan interface, with a ct state of "new", and gets forwarded to an internal IP. Since it is coming IN on the wan interface, it's not captured by our rule watching for the outbound interface (oifname $ul_if ct state new…).
When the internal LAN client replies to the new connection, it will get routed out the wan (upload) interface, but as soon as the conntrack system sees a reply to the original "new" packet, it will change the ct state to "established". So by the time the packet reaches the oifname $ul_if ct state new… rule, the outbound interface matches, but the ct state does not.
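To make that sequence concrete, here is a toy shell model of the rule. It is only a sketch, not the cake-qos-simple code: the function names are made up, br-lan is an assumed name for the LAN-side interface, and the model checks nothing beyond the two conditions discussed above (outbound interface and ct state).

```shell
#!/bin/sh
# Toy model of: oifname wan ct state new,untracked goto classify-and-store-dscp
# rule_matches OIF CT_STATE -> exit status 0 if the rule would match
rule_matches() {
    [ "$1" = "wan" ] || return 1
    case "$2" in
        new|untracked) return 0 ;;
        *) return 1 ;;
    esac
}

# describe OIF CT_STATE -> prints "classified" or "skipped"
describe() {
    if rule_matches "$1" "$2"; then echo classified; else echo skipped; fi
}

# Connection opened from the LAN: first packet leaves via wan while still "new".
echo "LAN-initiated, first packet out wan:   $(describe wan new)"

# Port forward: the first packet arrives from the Internet and leaves via the
# LAN side (br-lan is an assumed name), so the oifname test fails.
echo "Port forward, first packet out br-lan: $(describe br-lan new)"

# The LAN host's reply leaves via wan, but by then the connection is
# "established", so the ct state test fails.
echo "Port forward, reply out wan:           $(describe wan established)"
```

Only the first case prints "classified"; in the port forward case neither packet ever satisfies both conditions at once.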
Don't believe everything you read on the Internet…
I don't share the desire to hang on to the ct approach since it already demonstrates fragility. The conditional bit in act_ctinfo lets you keep track of whether the connection has already been dealt with.
I can't really see how such a subtle change in a rule would contribute to a noticeable performance difference. He may see a difference, but probably not because of the rule itself.
Oh boy. I must apologize. I retested with the idea that the performance difference I was seeing could in part be due to the fact that the DSCPs are now being altered when they previously weren't and not so much because the lack of ct hurt performance. I hadn't considered that, sorry.
I re-ran the test with the port closed for a more equal comparison. I don't exactly know how to isolate this for a scientific like result but the difference is negligible. I honestly can't see much if any difference in performance now. I guess that makes the decision easier.
Splendid explanation. Thank you.
So if there's no performance difference isn't a good way forward just to switch to:
oifname wan ct state new,untracked,established goto classify-and-store-dscp
What would be the effect of adding the ct direction reply here?
Wait no. I'm sorry, I tested
oifname wan ct state new,untracked goto classify-and-store-dscp
vs
oifname wan goto classify-and-store-dscp
There's little to no difference between the two.
Edit: I retested again with
oifname wan ct state new,untracked,established goto classify-and-store-dscp
Unless something went wrong with my restore or I had two rules enabled at once last night (possible) I'm not seeing too much difference between the three now. At this point maybe someone else can confirm?
By adding established, you now process (nearly) every outbound packet, even if it's already been marked and saved. Plus you have the tc filter on egress already restoring the DSCP from conntrack. So it's extra, duplicative work, mangling packets many times.
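A rough way to see the difference is to count how often the classify chain would run for the egress packets of a single flow. The shell sketch below is a toy model under stated assumptions: the mode names are invented, each argument stands for the ct state of one packet leaving wan, and "mark-bit" stands in for the conditional-bit approach (classify only while the bit in the conntrack mark is unset).

```shell
#!/bin/sh
# count_classified MODE STATE... -> number of packets sent to the classify chain
count_classified() {
    mode=$1; shift
    marked=0   # models the "already handled" bit kept in the conntrack mark
    count=0
    for state in "$@"; do
        hit=0
        case "$mode" in
            state-new)          # ct state new,untracked
                case "$state" in new|untracked) hit=1 ;; esac ;;
            state-established)  # ct state new,untracked,established
                case "$state" in new|untracked|established) hit=1 ;; esac ;;
            mark-bit)           # classify only while the bit is unset
                [ "$marked" -eq 0 ] && hit=1 ;;
        esac
        if [ "$hit" -eq 1 ]; then
            count=$((count + 1))
            marked=1
        fi
    done
    echo "$count"
}

# Port-forwarded flow: every packet leaving wan is already "established".
echo "new,untracked:             $(count_classified state-new established established established established)"
echo "new,untracked,established: $(count_classified state-established established established established established)"
echo "conditional bit:           $(count_classified mark-bit established established established established)"
```

For these four packets the model gives 0, 4 and 1: the first variant never classifies the flow, the second classifies every packet, and the bit-based variant classifies exactly once.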
OK so it seems like the special bit approach is still the best.
Absent any objections, I'll revert to:
And @plater thanks for raising all of this. It's been super interesting. And I've definitely learnt a couple of things.
Remember you can now combine the old way into:
ct mark set ip dscp or 128
ct mark set ip6 dscp or 128
Edit: but there is a bug in nftables 1.0.8 that makes it display wrongly. So keep it as separate lines until 1.0.9.
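For anyone wondering what ends up in the conntrack mark, here is a small shell sketch of the arithmetic. The function names are made up, and reading the DSCP back from the low six bits (mask 63) is my assumption about how the mark is laid out; the 128 bit is the one from the rules above.

```shell
#!/bin/sh
# mark_for_dscp DSCP -> mark value after "ct mark set ip dscp or 128"
mark_for_dscp() {
    echo $(( $1 | 128 ))
}

# already_stored MARK -> exit status 0 if the 128 bit is set
already_stored() {
    [ $(( $1 & 128 )) -ne 0 ]
}

# dscp_from_mark MARK -> DSCP held in the low six bits (assumed mask 63)
dscp_from_mark() {
    echo $(( $1 & 63 ))
}

mark=$(mark_for_dscp 32)    # CS4 is DSCP 32
echo "cs4: mark=$mark dscp=$(dscp_from_mark "$mark")"

mark=$(mark_for_dscp 0)     # CS0 is DSCP 0
echo "cs0: mark=$mark dscp=$(dscp_from_mark "$mark")"
```

The CS0 case shows why the extra bit matters: without it, a stored DSCP of 0 would look the same as a mark that was never set.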
Display wrongly but still work fine? If so, I'm inclined to go with it!
Just display.
Does this look OK?
@plater does it work for you too?
Yes. I've been running with these exact changes for a couple of days. Everything works great.
Thank you and @dave14305 for putting up with me!
On the contrary, this has been a very enjoyable and helpful exchange. Thank you for helping improve cake-qos-simple.
@moeller0 you might be interested in reading from here if you haven't already. Turned out that using 'ct state new, untracked' on packets directed to wan to work out whether to classify tracked connections (classification upon connection creation rather than classification of every packet) breaks in the case of connections opened in the other direction (from outside to inside rather than from inside to outside). The solution is to go back to using a conditional bit on the conntrack.
@dave14305 in writing the above summary for @moeller0 it just dawned on me that another solution might have been instead to additionally include iifname wan ct state new, untracked - would that have worked too? That is, classify not only egress packets, but also ingress packets associated with new connections. Though perhaps that's less efficient because it means evaluating not only all egress packets but also all ingress packets.
Sure, but you would still be left wondering if any other ct states might cause issues.
hello in this script when
are we supposed to see the tos or is it only for the download part? thanks
# First check correct flows and DSCPs correctly set by your LAN client on upload
tcpdump -i wan -vv always 0x00 ???
# Second check correct flows and corresponding DSCPs are getting set by router on download
tcpdump -i ifb-wan -vv example cs4 0x80 ?
With tcpdump you can use "(ip[1]!=0)" to filter on non-zero TOS values.
Here is an example:
root@OpenWrt-1:~# tcpdump -i ifb-wan -v "(ip[1]!=0)"
tcpdump: listening on ifb-wan, link-type EN10MB (Ethernet), snapshot length 262144 bytes
09:57:03.365488 IP (tos 0x80, ttl 54, id 0, offset 0, flags [DF], proto TCP (6), length 60)
1.1.1.3.853 > 192.168.0.2.39254: Flags [S.], cksum 0x9c63 (correct), seq 1450840744, ack 1678205163, win 65160, options [mss 1320,sackOK,TS val 3645466303 ecr 1224120091,nop,wscale 13], length 0
This means a packet from Cloudflare's family DNS service was sent to my LAN with tos 0x80, i.e. cs4 (voice).
Here is the correspondence between TOS and DSCP:
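As a rough sketch of that correspondence: the DSCP sits in the upper six bits of the TOS byte, so with the ECN bits at zero the TOS value is the DSCP shifted left by two. The shell snippet below (function name made up for illustration) prints the class selectors cs0 to cs7 the way tcpdump would show them.

```shell
#!/bin/sh
# tos_for_dscp DSCP -> TOS byte with ECN bits zero, in tcpdump's hex style
tos_for_dscp() {
    printf '0x%x\n' $(( $1 << 2 ))
}

# Class selectors cs0..cs7 are DSCP 0, 8, 16, ..., 56.
n=0
while [ "$n" -le 7 ]; do
    dscp=$(( n * 8 ))
    printf 'cs%d  dscp %2d  tos %s\n' "$n" "$dscp" "$(tos_for_dscp "$dscp")"
    n=$(( n + 1 ))
done
```

The cs4 row comes out as dscp 32, tos 0x80, which matches the capture above.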
ok thanks but if you do tcpdump -i wan -vv do you also have the correspondence or is it normal to have 0x00?
Normal with the default settings in cake-qos-simple owing to the use of the 'wash' setting in cake in respect of the upload options - see here:
cake_ul_options="diffserv4 triple-isolate nat wash no-ack-filter noatm overhead 0"
https://man7.org/linux/man-pages/man8/tc-cake.8.html
With 'wash':
root@OpenWrt-1:~# tcpdump -i wan -v port 853
10:02:01.469332 IP (tos 0x0, ttl 65, id 5791, offset 0, flags [DF], proto TCP (6), length 52)
192.168.0.2.59426 > 1.0.0.3.853: Flags [.], cksum 0xc1d3 (incorrect -> 0xfe3a), ack 3837, win 1002, options [nop,nop,TS val 2053242796 ecr 555599014], length 0
With 'nowash':
root@OpenWrt-1:~# tcpdump -i wan -v port 853
10:03:32.399027 IP (tos 0x80, ttl 65, id 32670, offset 0, flags [DF], proto TCP (6), length 52)
192.168.0.2.54100 > 1.1.1.3.853: Flags [.], cksum 0xc2d4 (incorrect -> 0x834f), ack 493, win 1002, options [nop,nop,TS val 1224509170 ecr 3023558283], length 0
Here is my tcpdump invocation that will show any packet not CS0/best effort
tcpdump -i pppoe-wan -v -n '(ip and (ip[1] & 0xfc) >> 2 != 0)' or '(ip6 and (ip6[0:2] & 0xfc0) >> 6 != 0)'
Replace pppoe-wan with the desired interface, and != 0 with the test you are after. Note the main difference to @Lynx's version is in masking out the ECN bits, and in adding IPv6. Also note that for IPv6 tcpdump reports class instead of tos, but both contain the DSCP bitfield.
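To check what those masks and shifts do, the same arithmetic can be run in the shell. This is just an illustrative sketch with made-up function names; the sample values are hand-built header fields, not captured traffic.

```shell
#!/bin/sh
# dscp_from_tos TOS -> DSCP from the IPv4 TOS byte: (ip[1] & 0xfc) >> 2
dscp_from_tos() {
    echo $(( ($1 & 0xfc) >> 2 ))
}

# dscp_from_ip6_word WORD -> DSCP from the first 16 bits of the IPv6 header:
# (ip6[0:2] & 0xfc0) >> 6
dscp_from_ip6_word() {
    echo $(( ($1 & 0xfc0) >> 6 ))
}

echo "tos 0x80 -> dscp $(dscp_from_tos 0x80)"    # cs4, ECN bits zero
echo "tos 0x82 -> dscp $(dscp_from_tos 0x82)"    # cs4 with ECT(0) set
echo "tos 0x02 -> dscp $(dscp_from_tos 0x02)"    # cs0 with ECT(0) set
echo "ip6 0x6800 -> dscp $(dscp_from_ip6_word 0x6800)"  # version 6, class 0x80
```

The tos 0x02 case is where masking makes a difference: the byte is non-zero, so a plain ip[1]!=0 test matches it, yet the DSCP is 0.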
For completeness the following operates on the 2 bit ECN bitfield:
tcpdump -i pppoe-wan -v -n '(ip6 and (ip6[0:2] & 0x30) >> 4 != 0)' or '(ip and (ip[1] & 0x3) != 0)' # NOT Not-ECT
It will report all ECT(0), ECT(1) and CE marked packets; replace != 0 for specific values, e.g. for IPv6:
tcpdump -i pppoe-wan -v -n 'ip6 and (ip6[0:2] & 0x30) >> 4 == 0' # Not-ECT
tcpdump -i pppoe-wan -v -n 'ip6 and (ip6[0:2] & 0x30) >> 4 == 1' # ECT(1)
tcpdump -i pppoe-wan -v -n 'ip6 and (ip6[0:2] & 0x30) >> 4 == 2' # ECT(0)
tcpdump -i pppoe-wan -v -n 'ip6 and (ip6[0:2] & 0x30) >> 4 == 3' # CE
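The numeric values in those filters map to ECN codepoint names as in the comments above. A small shell sketch (function names are hypothetical) that decodes the two ECN bits from either an IPv4 TOS byte or the first 16 bits of an IPv6 header:

```shell
#!/bin/sh
# ecn_name VALUE -> name of the ECN codepoint held in the low two bits
ecn_name() {
    case $(( $1 & 3 )) in
        0) echo "Not-ECT" ;;
        1) echo "ECT(1)" ;;
        2) echo "ECT(0)" ;;
        3) echo "CE" ;;
    esac
}

# ecn_from_ip6_word WORD -> ECN bits from the IPv6 header: (ip6[0:2] & 0x30) >> 4
ecn_from_ip6_word() {
    echo $(( ($1 & 0x30) >> 4 ))
}

echo "tos 0x80 -> $(ecn_name 0x80)"
echo "tos 0x82 -> $(ecn_name 0x82)"
echo "ip6 0x6830 -> $(ecn_name "$(ecn_from_ip6_word 0x6830)")"
```

For IPv4 the TOS byte can be passed straight to ecn_name, since the ECN bits are its low two bits; for IPv6 they first have to be shifted down out of the 16-bit word.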