Max active connections (nf_conntrack_max) artificially fixed to 16384?

By default, only /etc/sysctl.conf (and not /etc/sysctl.d/*) is copied over during system upgrades, so I would still go with /etc/sysctl.conf.

I dug a bit into nf_conntrack_max. The kernel formulas in net/netfilter/nf_conntrack_core.c are (pseudo-code):

  • if hashsize module parameter is not provided (and it is not in 21.02 as far as I can find)
    • nf_conntrack_htable_size = min(256k, max(1k, ram_size_in_bytes / 16k / sizeof(struct *)))
    • nf_conntrack_max = 1 * nf_conntrack_htable_size
  • if hashsize is provided
    • nf_conntrack_htable_size = hashsize
    • nf_conntrack_max = 8 * nf_conntrack_htable_size

So, on a 256MB, 32-bit device, nf_conntrack_max = nf_conntrack_htable_size = 256M / 16k / 4 = 4096, a quarter of the openwrt-provided override. And indeed:

# sysctl net.netfilter.nf_conntrack_buckets
net.netfilter.nf_conntrack_buckets = 4096
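For concreteness, here is a quick Python sketch of the sizing formula above. This is my paraphrase of the pseudo-code (assuming, as in 21.02, that the hashsize module parameter is not set), not the literal kernel source:

```python
# Default conntrack sizing when no hashsize module parameter is given,
# per the pseudo-code above (not the literal kernel source).
def conntrack_defaults(ram_bytes, ptr_size=4):
    """Return (nf_conntrack_htable_size, nf_conntrack_max)."""
    htable_size = min(256 * 1024, max(1024, ram_bytes // (16 * 1024) // ptr_size))
    return htable_size, 1 * htable_size  # nf_conntrack_max == htable_size

# 256 MB, 32-bit device:
print(conntrack_defaults(256 * 1024 * 1024))  # (4096, 4096)
```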

Some more notes: the kernel documentation and comments mention that with nf_conntrack_htable_size == nf_conntrack_max, the average number of entries per hash table slot will be 2, because each connection needs 2 entries. With openwrt's default, in the case of my device, nf_conntrack_max = nf_conntrack_htable_size * 4, so an average of 8 entries per slot. I have not benchmarked anything, but to me this means that openwrt should rather consider increasing net.netfilter.nf_conntrack_buckets to the same value as nf_conntrack_max. And maybe increase nf_conntrack_max in the process.
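The chain-length arithmetic above can be written out explicitly. A small sketch (assuming, per the kernel comments, 2 hashtable entries per tracked connection):

```python
# Each tracked connection occupies 2 hashtable entries (one per direction),
# so the average chain length per bucket at the connection limit is:
def avg_entries_per_bucket(ct_max, buckets):
    return 2 * ct_max / buckets

print(avg_entries_per_bucket(4096, 4096))   # 2.0 -> kernel default on this device
print(avg_entries_per_bucket(16384, 4096))  # 8.0 -> with openwrt's 16384 override
```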

Going further, my feeling is that the kernel should maybe expose the 16k divisor applied to total ram size: this would allow using the in-kernel formula by just expressing that this distribution is geared towards handling a large number of connection rather than trying to leave a lot of free ram for userland programs (which is my interpretation of this default 16k).

  • What do you mean by "expose the 16k divisor"...you mean code a formula recalculated based on how the developers arrived at 16k for 4 MB of RAM circa 7 years ago?
  • To be clear, that's what the 16k setting was (for a 4 MB device, allotting about 25% to it)...so this would be a special patch to the kernel???

BTW, that was all noted in "unset" - i.e. the original information we're discussing. It's good you verified the math though!

Also...e.g. on a 4 GB device...this would allow 16 million connections...why would a normal use case OpenWrt device be statefully tracking 16 million connections???

:spiral_notepad: Recall: that is more connections than the ports a single IP on WAN can provide on TCP and UDP.

Additionally, there are only about 65,000 ports per single IP anyway, and the router will reserve some for its own use and the kernel's.

(I can only imagine also wanting to record something like Netflow...which would be a program you mentioned.)

A firewall or NAT gateway can handle more than 64k connections. It's too early for me to do proper math, but you have roughly 32k src ports times 64k dst ports per IP, and IIRC it's even slightly more on Linux. I had a gateway running with about 2 to 3 million connections tracked and synced by conntrackd, just by adjusting the bucket size and such. So this 64k connection limit is somehow a myth. IIRC this was explained in the Linux network docs too.

Edit: if I'm not still half asleep, for two hosts alone about 2*10^9 connections can be tracked (32k*64k), assuming defaults for outgoing connections (which should be >32k or something like that).
And that's only a single protocol, so for TCP and UDP together it's 4*10^9... But you get my point.

Edit: and in the case of a NAT gateway you also have the original src IP, original src port, original dst IP and original dst port. What's that called? A quadruple? I assume you will run out of memory far earlier than you reach any theoretical connection limit, if your settings for bucket size and connection max are high enough.
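To put rough numbers on the point above: conntrack keys flows on the full tuple, so the "64k ports" bound applies per (src IP, dst IP, dst port) combination, not globally. A back-of-envelope sketch, assuming the typical Linux ip_local_port_range default of 32768-60999:

```python
# Illustrative only: distinct TCP flows possible between a single pair of
# hosts, assuming the default Linux ephemeral port range 32768-60999.
ephemeral_ports = 60999 - 32768 + 1   # 28232
dst_ports = 65536
flows = ephemeral_ports * dst_ports
print(flows)  # 1850212352 (~1.85e9), and UDP doubles it again
```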

I mean the 16k divisor in the formula the kernel uses to compute the default value of nf_conntrack_htable_size. In case I was being ambiguous 16k != 16kB. 16k as in 16 * 1024, without unit. I do not know how it should be exposed, though:

  • another sysctl would probably cause ambiguous situations, where userland tweaks this coefficient and also adjusts the values it produces.
  • a module parameter is ugly: I think it would require a restart to change, as it seems very unlikely for nf_conntrack to be possible to unload on a router. And if it is modifiable without unload/reload, I think it has the same issues as a sysctl knob.
  • ...maybe a .config entry then ? Just as, in another domain, the tick frequency is adjustable depending on the type of use (responsiveness for desktops, throughput for servers), this would allow trading memory for conntrack entries.

A patch yes, but hopefully an upstream one :slight_smile: .

Oops, did not read the whole backlog, sorry.

As an anecdotal data point, I had openwrt run out of conntrack entries in a sustained manner, with a handful of machines in the LAN: a single web page in a single tab in a single browser got an ad which was trying to phone home. Doing so it tried resolving some domain name, which failed for whatever reason. It did that in a busy loop, issuing DNS queries as fast as the browser would let it, which turned out to be faster than conntrack entries would expire.

The easiest solution in this case was to install an ad blocker, of course, but I am annoyed that my router would let itself be essentially DoS'ed by a single ad.

It's been a while, but to change connection max and bucket size you do not need to unload the kernel module; you can just set them at run time.

Yes, with sysctl, but this is not what I am suggesting: I am suggesting changing a hard-coded code constant used to derive the values from ram size into something which would at least not be hard-coded (configurable at build-time), if not a variable (configurable at run-time).

EDIT: IOW, I think it would be nice if openwrt did not have to hardcode a total conntrack size, which is bound to cause limits both ways, but instead could tweak the coefficient defining the conntrack size based on device RAM size. I guess 1 entry per 64kB (128kB on 64-bit archs) of RAM (which is what the current kernel does) was picked to keep enough free RAM for userland on desktops and servers, but is not enough for openwrt-style uses (low-RAM devices which can spend a larger proportion of it on managing connections). So rather than having to run behind the kernel and fix the consequences of its estimations, why not teach it to produce better estimates to begin with ?

Why are you mad at your router - that you have a single IP from your ISP (with only ~64,000 ports)?

Any router woulda been DDoSed...perhaps you're missing my point.

I realize that it sounds good to increase that number, I was noting that there was another logical "limitation" to whatever it is someone's trying to accomplish by raising this value.

Since a single IP only has ~65,000 ports (and the router reserves use of some), I'm not sure what anything greater than about ~40,000 would truly accomplish???

Except if one was being DDoSed...then I'm not sure why someone would want to track those connections (hence wasting router resources).

:man_shrugging:

Let. me. repeat. You can have far more than 64k connections even though your NAT gateway has only 64k ports.

Each host on your LAN comes with a src IP and src port and wants to connect to a dst IP and dst port.
Even if we assume all your hosts want to connect to the same HTTP server IP, conntrack is capable of establishing and keeping track of more than 64k connections. And even if the NAT gateway hits these limits, it will just start deleting connections, beginning with the oldest one.


I understand that, it's a 4-tuple...it seems everyone is missing my point...by simply overlooking the question about tracking (and the resource cost of why you're tracking).

That tends to happen when someone just wants a thing, but didn't think about the consequences.

  • Maybe I should also note, you only need connection tracking in a router doing NAT...or stateful firewalling :wink:
  • You can also tell it not to track with a firewall rule

(Most routers that can handle that don't track...or they're a specially-built firewall device.)

And a correction: you can only have 64k tracked connections when the maximum is set to 64k. It will drop older ones to make room for new ones.

I see. Cool, we agree. But OpenWrt is not just used by small family networks. I too would prefer the Linux approach of a default based on available memory.
Just a feeling, but 16k could be hit fast on a mid-sized network, like a co-working space or something similar.


In the "olden days" of P2P...I tested a commercial router motherboard...my friend would come over and fire up her Limewire...Bearshare, Morpheus, etc. The router would keep crashing to a halt on old kernels...guess what?

The board itself couldn't handle 16k. I was gonna place this on a radio tower...she saved me!!!

That device has 64 MB RAM: https://openwrt.org/toh/zinwell/zw4400

So all devices are not built the same.

  • So I can see how a medium-sized family can easily use 16k connections in prime time - I agree.
  • I understand how especially 1 and 10 Gbps users may be feeding other downstream devices, etc.

That's the original issue with unset: the default is too low. @vpelletier suggested we patch the kernel to use a formula that works out to roughly 16k per ~64 MB of RAM.

Are you sure about this ? AFAIK, conntrack is for all connections, even non-NATed ones.
EDIT: I just remembered about the UNTRACKED connection state ("The packet is not tracked at all, which happens if you explicitly untrack it by using -j CT --notrack in the raw table", from iptables-extensions(8)), which may be what you are referring to.
My understanding is that once conntrack is enabled in any rule (or maybe even just loaded at all ? I did not check), all connections are tracked. In any case, my firewall settings are stock openwrt with the corresponding packages installed (luci-firewall and some more; I can check which if needed, but I think it is beside the point). I did not add any conntrack rule (or any rule, actually) beyond the ones present by default (not sure what gets installed by packages, and what gets applied when creating the interfaces themselves).

Anyway, I came back to try to give a very crude estimation of how much memory a conntrack entry takes. Again, this is very crude, especially there are certainly more data which can be put in every conntrack entry, but I wanted a general idea. So I ran this on my router to get CSV-ish output:

 while :; do free_ram="$(free | grep '^Mem' | sed 's/.* \([0-9]*$\)/\1/')"; conntrack_count="$(sysctl -n net.netfilter.nf_conntrack_count)"; printf '%i,%i\n' "$free_ram" "$conntrack_count"; sleep 1 || break; done

And this a few times in parallel on a LAN computer, 192.168.1.1 being the openwrt LAN IP:

#!/usr/bin/env python3
import socket
for index in range(500):
  with socket.create_connection(('192.168.1.1', 80)):
    pass
  print(index, end='\r')

The result is the following plot:
[plot: free RAM vs. conntrack entry count]

Which gives a rough value of 1.5MB for 3.5k~4k conntrack entries, or around 375~450 bytes per entry.
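The spread in that per-entry figure comes from how "1.5MB" and "3.5k~4k entries" are read; a quick sketch of the arithmetic:

```python
# Rough bytes-per-entry bounds from the measurement above; "1.5MB" is read
# as either 1.5e6 or 1.5*2**20 bytes, against 3500-4000 entries.
for ram_delta in (1_500_000, int(1.5 * 1024 * 1024)):
    for entries in (3500, 4000):
        print(round(ram_delta / entries))  # values fall in the 375..450 range
```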

If it matters, I measured with net.netfilter.nf_conntrack_buckets = 65536 and net.netfilter.nf_conntrack_max = 65536 (but I am nowhere near these values, and at worst it would eat a larger fixed base amount for the hashtable structure). If the estimation above holds, I should be able to use this many conntrack entries and use around 30MB of RAM, which seems reasonable to me. To get the kernel to produce these values the 16k divisor would be replaced with 1k, which would give 8k conntrack entries on hardware with 32MB of RAM (would this be considered a regression ?), 16k conntrack entries with 64MB of RAM, 128k with 512M of RAM, and reach the kernel's 256k entries maximum with 1GB of RAM (everything assuming 32-bit pointers). I'm not saying anyone needs this many entries, just trying to see what a 1k divisor would look like.
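The projection above can be checked with a small sketch, reusing the in-kernel formula's min/max clamps but with the suggested 1k divisor (32-bit pointers assumed):

```python
# Projected entry counts with a 1k divisor instead of 16k, keeping the
# same 1k floor and 256k ceiling as the formula described earlier.
def entries_with_divisor(ram_bytes, divisor=1024, ptr_size=4):
    return min(256 * 1024, max(1024, ram_bytes // divisor // ptr_size))

for mb in (32, 64, 512, 1024):
    print(mb, entries_with_divisor(mb * 1024 * 1024))
# 32 -> 8192, 64 -> 16384, 512 -> 131072, 1024 -> 262144 (the 256k cap)
```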

Again: this is very crude, I am opening connections to the built-in http server which may decide to allocate stuff on top of what the kernel is doing for conntrack. But I think it hence gives a reasonable upper bound of what memory the conntrack may be using.

EDIT: FWIW, the rough chronological order follows the plot in an anti-clockwise direction: going up on the rightmost branch during the test, and then down in the leftmost branch after all python scripts exited, and conntrack entries expired.


Wow, yeah this is another aspect of this problem. Any idea where this came from ? This being hardware-dependent would point at offloading logic... maybe ?

Yes, very sure. Example: I have a tunnel with a /24. If I disable tracking, I cannot firewall it. This is good because I can forward the traffic to another firewall. :smiley:

On versions before 17, I had to explicitly enable connection tracking on the zone to firewall the traffic on the OpenWrt.

I know...unless disabled. (but my use case enters multiple IPs, needing more than 16k connections, etc.) :wink:

Yep, that's been calculated.

But accurate!

NAPI being added to the kernel helped slightly...but it's odd that an Intel IXP chip couldn't handle it.

EDIT: I think the RAW firewall table was mentioned as accepting without tracking - as I recall, that's correct (as one method). I use this table to RAW-drop traffic from internal (i.e. invalid source IPs forwarded outbound to my /24 example tunnel), so it is not tracked via Netflow.

I also use it to drop any routed packet that the kernel would respond to with an ICMP packet not already blocked, since those responses use CPU (e.g. TTL Exceeded In Transit, which traceroute uses - wonder why hops on some commercial routers are missing when tracerouting). :wink: See: http://www.cisco.com/web/about/security/intelligence/ttl-expiry.html

...except this is not true. I just built a qemu image with a tentative kernel patch and removed the OpenWRT sysctl override for nf_conntrack_max. With $ qemu-system-arm -nographic -M virt -m 256 -kernel bin/targets/armvirt/32/openwrt-armvirt-32-zImage-initramfs I am getting:

# sysctl net.netfilter.nf_conntrack_buckets
net.netfilter.nf_conntrack_buckets = 62464
# sysctl net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_max = 249856

So neither of these have the expected values...

First the uninteresting one: 62464 != 65536 buckets. Part of this is because the total amount of available RAM is not quite 256MB. I'm not versed well enough in the details to tell what uses it exactly and why it would be excluded from totalram_pages()'s return value; free tells me I have 249528kB total memory in the end.
249528k / 1024 / 4 = 62382, which is 82 entries short of the actual value: the nf_ct_alloc_hashtable function rounds nf_conntrack_htable_size up to fill entire pages. At 4 bytes per bucket and 4kB pages, this means 1024 buckets per page, and 62382 is 82 buckets short of a full page boundary, so these get added, and we get 62464.
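That arithmetic, written out (assuming the patched 1k divisor, 4-byte buckets, and 4 KiB pages as described above):

```python
# Reproducing the bucket-count derivation: 1k divisor, 4-byte pointers,
# then nf_ct_alloc_hashtable rounds up to a whole number of 4 KiB pages.
total_kb = 249528                          # total memory reported by `free`
buckets = total_kb * 1024 // 1024 // 4     # 62382 before rounding
per_page = 4096 // 4                       # 1024 buckets fit in one page
buckets_rounded = -(-buckets // per_page) * per_page  # ceil to page boundary
print(buckets, buckets_rounded)            # 62382 62464
```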

Second, the important one: 249856 is 4*62464 and not the factor of 1 I expected. It turns out this is because max_factor used to be 4, until vanilla commit d532bcd0b2699d84d71a0c71d37157ac6eb3be25.

Anyway, I have prepared a first version of the changes, in need of a review and some hand-holding.
