Nftables chokes on very large sets

I have a firewall script that runs on firewall reload/restart to set things up for DSCP tagging - about 20 sets of various types (MACs, ports, IPs) with fewer than 30 elements each, about 20 chains, and about 100 rules. It's not the most efficient thing in the world because it doesn't add the sets and rules atomically (not yet anyway, it's a work in progress) - it currently runs lots of individual nft commands, rather than a single transaction like the one sketched below.
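
For illustration, the atomic version I'm working towards would be roughly this - one nft -f transaction instead of hundreds of individual commands. The table, set and chain names here are made up for the example:

```
#!/bin/sh
# Load the whole ruleset in one atomic transaction.
# "dscp_tag", "voip_hosts" and "mark_voip" are placeholder names.
nft -f - <<'EOF'
# declare then delete, so the load works whether or not the table exists yet
table inet dscp_tag
delete table inet dscp_tag
table inet dscp_tag {
    set voip_hosts {
        type ipv4_addr
        elements = { 192.0.2.10, 192.0.2.11 }
    }
    chain mark_voip {
        type filter hook forward priority mangle; policy accept;
        ip daddr @voip_hosts ip dscp set ef
    }
}
EOF
```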

That script usually completes in about three seconds, but it runs really slowly in the presence of the 80K set, and basically stalls forever in the presence of the 200K set.

I also have dnsmasq configured to add individual IP addresses to various sets based on hostname and I suspect the performance of those operations tanks as well, although I have no way of measuring that directly.
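
For context, the dnsmasq side is the standard nftset mechanism, along these lines (the domain, table and set names are placeholders; on OpenWrt this needs the dnsmasq-full variant, as far as I know):

```
# /etc/dnsmasq.conf
# add resolved IPv4 addresses for this domain to an nft set
nftset=/streaming.example.com/4#inet#dscp_tag#video_hosts
# and the IPv6 addresses to a companion set
nftset=/streaming.example.com/6#inet#dscp_tag#video_hosts6
```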

Look at or install banIP and see how it handles this.

Note it uses country blocks from IPdeny (IP country CIDR blocks), which might differ from the ones you are using.

Thanks for this. Somehow I've gone my whole OpenWrt career without hearing about this package! I'm not at home today but I'll definitely give it a go when I get a chance.

There are far fewer IP ranges in the IPdeny sets - e.g. only 65,855 ranges in the unaggregated US set vs 159,383 for the equivalent db-ip.com data. I have no idea which data is better in terms of accuracy but clearly the IPdeny sets will perform better, especially the aggregated sets.

In any case I still think there's an issue here worth investigating. Switching to a different approach might help my particular requirement right now, but it still seems that something starts to go very wrong inside nftables after populating sets with >100K elements. I would expect lots of memory usage, and maybe a performance dip while actually populating the table, but once that's done the performance of nftables shouldn't be affected, especially if the set isn't even being used. The presence of the set certainly shouldn't affect the performance of operations on unrelated sets.

I've been doing some more testing with interesting results.

Firstly, setting the set policy to "memory" instead of "performance" seems to make the nftables "auto-merge" setting work properly, and the CIDRs are merged in a similar way (maybe exactly the same way?) as they are when the set is defined using the UCI firewall configuration file, resulting in a set of 80K elements. Ironically this results in much better performance than using the default "performance" setting...
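
For reference, the set declaration I'm testing with looks roughly like this (made-up names, trimmed element list):

```
table inet dscp_tag {
    set geo_blocks {
        type ipv4_addr
        flags interval
        auto-merge
        policy memory
        elements = { 203.0.113.0/24, 198.51.100.0/22 }
    }
}
```

With policy memory, adjacent CIDRs get merged on load; with the default policy performance they apparently don't, despite auto-merge.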

Secondly, if I leave the set policy as "performance" but split the 200K set into 10 sets of ~20K each, the performance is about the same as having a single 20K set. So the problem does seem to be related to the size of the largest set, not the total number of elements across all sets.
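
The splitting itself was nothing clever, roughly this (illustrative names, assuming one CIDR per line in ranges.txt):

```
#!/bin/sh
# split the big list into ~20K-line chunks, one set per chunk
split -l 20000 ranges.txt chunk_
i=0
for f in chunk_*; do
    i=$((i + 1))
    nft add set inet dscp_tag "geo_blocks_$i" \
        '{ type ipv4_addr; flags interval; }'
    # join the chunk's lines into one comma-separated element list
    nft add element inet dscp_tag "geo_blocks_$i" \
        "{ $(paste -sd, "$f") }"
done
```

The matching rules then just test @geo_blocks_1 through @geo_blocks_10 in sequence.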

I'm currently trying banIP with the much smaller geoip dataset from IPdeny (about 30K elements for the same country list). I've already hit issues with accuracy (an IP I know is definitely in the UK is listed as located in the US), but hopefully I can work around this and I won't have to pursue my home-rolled solution any further. Maybe I'll need a combination of banIP / home-rolled, by generating a custom blocklist for banIP from a more accurate location database.

Welcome to geoIP :wink: It is really, at best, a heuristic... one that does not map well onto how the internet actually operates. Granted, for many things it is clearly "better than nothing", but precise, robust or reliable it is not. But knowing that does not really help if the situation is bad enough to demand some action...

Yeah, I fully appreciate the limitations of geoip databases, especially the free versions :slight_smile:

I can just drop the US ranges for now, but I've asked on the banIP thread about whether it's possible to create a custom local feed that I can keep up to date myself based on the "more accurate" db-ip database.

EDIT: It turns out the "more accurate" db-ip database classifies that UK IP as being located in the Netherlands, so it's wrong as well but it "works" in my scenario.

Welcome back to the EU then, we missed you! :wink:

"No way to do the same" - what in OpenWrt 21.02?

Can you provide more information?

Just a quick update to say that I'm still having issues in this area.

As mentioned a few posts back, setting the set policy to "memory" instead of "performance" improves things a lot, but nft's performance does still start to degrade in the presence of very large sets.

I abandoned my attempts at a home-rolled solution and switched to using banIP (which uses the "memory" set policy by default), with blocklists on WAN-input and WAN-forward. Generally speaking this has been an excellent solution.

Everything is fine if I use a selection of blocklists with a total of around 70,000 elements, but when I add the very large firehol_level4 set as well, for a total of ~220,000 elements, any subsequent nft command involving any set runs very slowly, and ping response from the WAN starts to suffer. These effects are similar to the issues I reported earlier in this thread when using large (but not quite this large!) sets in "performance" mode.

The degraded ping response from the WAN can be seen in the thinkbroadband latency monitor graph below:

The large set was in place for most of the graph, until I removed it near the end and you can see that the ping times returned to normal.

Note that this is all running on a relatively powerful x86 box. I imagine the consequences would be much more severe on a standard router device, unless this issue is somehow specific to x86?

Anyway, I'm happy enough to proceed without the firehol_level4 set. I just thought I'd provide an update here.

As an experiment I used strace to try and get some more insight.

When the firehol_level4 set is present (total of 220K elements in various sets), a simple nft command to add five elements to an unrelated set (in a different table), running under strace, took 14 seconds to complete and called mmap 50,000 times, with a similar number of munmap calls.

Without the firehol_level4 set (but still with 77K elements in other sets), the same nft command took 4 seconds and mmap was called "only" 15,000 times. This seems to be the only significant difference between the traces.
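
For anyone who wants to reproduce the measurement, it was essentially this (the table/set names are placeholders):

```
# summarise syscall counts for a small update to an unrelated set
strace -c -o trace-summary.txt \
    nft add element inet dscp_tag voip_hosts \
    '{ 192.0.2.20, 192.0.2.21, 192.0.2.22, 192.0.2.23, 192.0.2.24 }'
# the mmap/munmap counts show up in the summary table
grep -E 'mmap|munmap' trace-summary.txt
```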

I know nothing about the innards of nftables so I'm not really sure what this means, but clearly nft is doing a lot more of something when that large set exists.

EDIT: Obviously this doesn't provide any information about the elevated ping times from the WAN - I don't know how to get any insight into that aspect.

There have been changes to netfilter/nftables (nft) since the 5.15 kernel release aimed at improving the performance of set loading and modification, and they might not have been backported. It could be worthwhile to check on a Linux distribution that offers both 5.15 and 6.10 kernels (the latter being what OpenWrt will move to next) and see if there is a difference.

I don't think I'm going to be able to do that testing, but it's good to know that there might be some improvements in the pipeline.

I'll probably just revisit this after upgrading to the next version of OpenWrt.

Just a heads-up: on AlmaLinux 8 systems running firewalld and nftables, we recently saw a big performance decrease in ipset handling and firewall reloading when dealing with a large ipset (in our case 12k IP addresses).

With an old/prior version of nftables we were seeing firewalld reload times of just a few seconds; after an update to nftables and a restart of firewalld, subsequent reloads took a matter of minutes on some lower-spec machines. This affected hundreds of systems in our case. I know it isn't OpenWrt, but I just thought I'd share our findings in case an update to nftables is also negatively affecting its performance on other platforms.

For reference on our OS the "bad" update went out around 22nd Sept 2023.

Issue Discussion: https://github.com/AlmaLinux/almalinux-deploy/issues/186#issuecomment-1857895214

Not trying to necro the thread here, but I'm wondering if anyone ever found a resolution for this? I am running banIP on an ipq807x (Xiaomi AX3600) and see the exact same problem with IP sets and nftables. Loading or reloading a large set takes many minutes (like 20-30) while pegging the CPU the entire time. I had to adjust settings to break the set into 1024-element chunks or the router will easily go OOM. I've been writing this post on and off, and while I'm writing this it has been 21 minutes and the sets are still reloading as we speak. Surely this behavior can't be normal? Even if the aggregate set has 300K IPs in it (not actually sure how many just now), it shouldn't be consuming hundreds of MB to do this, no? It seems like there has to be some kind of memory allocation bug here.

Also, after about 7 days uptime I have noticed that the router will simply stop passing traffic to certain internal IPs. No idea why. They seem to be IoT-type devices for the most part, which are always dropping and reconnecting. I have seen it also on a phone or a laptop if I'm screwing around with the WiFi and the devices keep reconnecting. Not sure if it's related at all, but a reboot seems to fix it.

Edit to add:

Active Feeds

allowlistv4MAC, allowlistv6MAC, allowlistv4, allowlistv6, bogonv4, cinsscorev4, countryv4, deblv4, feodov4, dshieldv4, etcompromisedv4, greensnowv4, proxyv4, iblockspyv4, ipthreatv4, sslblv4, threatv4, threatviewv4, talosv4, torv4, urlvirv4, urlhausv4, turrisv4, voipv4, yoyov4, webclientv4, blocklistv4MAC, blocklistv6MAC, blocklistv4, blocklistv6

Last Run

action: reload, log: logread, fetch: uclient-fetch, duration: 21m 6s, date: 2024-05-03 15:46:59

System Information

cores: 4, memory: 80, device: Xiaomi AX3600, OpenWrt 23.05.3 r23809-234f1a2efa

This thread was about nft sets - an ipset gets duplicated in memory to be used in nftables.

Can you run

nft -c -d netlink -f xx.nft
nft -c -o -d netlink -f xxx.nft

and try to determine what's broken?

Both commands return 0 byte files. We may be seeing weirdness because the device has gone on an oom rampage in my absence.

I may be confusing myself when it comes to IP sets. The banIP scripts are clearly calling the nft command to add elements, and each addition is taking a few seconds of processing time.

EDIT: I looked into this further and found that there is a bad interaction with zram, with nearly constant swapping during process reload. After disabling zram the situation is significantly improved, but it still sometimes takes more than 1 second to process a single 512-entry nft set insertion, with occasional commands taking up to 3.5 seconds to complete. This still seems like it can't be right.

Set ban_splitsize to something like 8192 (lines) to load the ruleset in smaller chunks, as opposed to all at once. It is a lot of script, but I hope to get to it.

You need to set ban_split to 1 too.
The idea is that the huge transaction is currently sent to the kernel in API-sized chunks, while the atomic transaction grows to the total list size in the kernel and is then copied to a list-sized permanent list, eating 3x the list size at its maximum, with the last two list-sizes being non-swappable. The idea is to pass only API-sized atomic transactions to the kernel, to keep memory allocations at a sustainable size. TBD how much decoration is added to each IP/subnet line in the text list, so as to accurately estimate the size of one command and make it the default. The next idea is sort-to-aggregate, which would involve shell bitops and probably other pains complicating the script.
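
In UCI terms, if I've understood the above correctly, that means something like this (option names as given above; 8192 is just an example value):

```
uci set banip.global.ban_split='1'
uci set banip.global.ban_splitsize='8192'
uci commit banip
/etc/init.d/banip reload
```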

Maybe post in a dedicated banIP support thread? Especially if your issue is not nft-set-specific and also applies to ipsets, which this thread is not about.