Nice to get this after investing 20 minutes of my life to help humanity
Also now I understand why there are no recent messages posted on that board...
This is the message that "looked spammy":
Hello, I'm noticing some odd things regarding performance and memory use with large sets (with kernel 5.15.137, nftables 1.0.8).
When importing an IP list into a set, there is a large transient spike in memory consumption. I observed it briefly using 180MB of memory when importing the US IPv4 list (which has around 56k subnets). This is most likely the same behavior described in Bug 1584 on the bugzilla. I just want to confirm that this is still a major problem for memory-constrained embedded devices such as routers.
I found that using auto-merge helps to some degree, but there is still a transient spike of at least 50MB in memory consumption. It is also possible to mitigate this by splitting the IP list into chunks and feeding those to nftables one by one. This does not solve the issue completely, but it reduces the magnitude of the spike (I don't have exact numbers, but this way I can avoid hitting an OOM condition in some cases).
It would be nice if some mechanism to control this behavior was implemented and exposed via a flag, even if using it has a performance impact, because this basically makes or breaks geoip and similar applications on a huge number of embedded devices on the market, many of which still ship with just 128MB or 256MB of memory.
Populating sets gets progressively slower with each added set. It is not obvious to me why populating a set is affected by having another unrelated set loaded but this is the apparent behavior. This also seems to affect the magnitude of the memory consumption spike. So for example, when having 16 sets loaded, 50MB free memory is no longer enough to load another set which has about 8k elements.
No such issues existed with iptables, however OpenWRT switched to nftables, and now I don't see how to implement geoip blocking without imposing unreasonable hardware requirements. I'm sure these issues will get solved with time, I just wanted to let the devs know that this is needed.
[ I wanted to add sample code I'm using to load the sets, but then the mail filter rejects it because it thinks that it's html ]
@antonk you may want to check out @dibdot's banip package. As far as I understand it uses large nft sets as well. Or maybe @dibdot can chime in on performance of banip.
Thanks. I think this is a kernel-level problem, so I would be surprised if @dibdot has come up with a userspace solution to it but if they did then I'll be interested to hear of course.
I'm thinking that an interim solution (until netfilter fixes the memory consumption and the fix lands in OpenWRT) would be to add the iptables-nftables-compat package. As far as I can tell on my Linux Mint machine, that package allows the use of ipsets (rather than native nftables sets), which don't suffer from any of the aforementioned issues. The way it links up with ipset is undocumented, or at least I couldn't find any documentation for it, but I see that (via iptables commands) I can attach the ipsets to rules (which transparently get translated into nftables rules) and they work fine, even though no nftables sets appear to exist when issuing the command "nft list sets".
Some time measurements of adding a set. In this case it's for GB, but the specific country doesn't matter of course. The methodology is simply to add the same set several times (changing the name of the set on every iteration) and measure the time via /usr/bin/time.
Note that the effect is across families, i.e. it slows down when adding a number of IPv4 sets, and then when adding IPv6 sets it continues to get slower even though no IPv6 sets existed in the system beforehand. Also, the nft CLI becomes visibly laggy: subsequently applying a rule involving one of these sets takes a couple of seconds on its own.
As to memory, I've observed it consume about 300MB when running this, but I can't tell whether the sampling by 'top' coincided with the peak in memory consumption.
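For anyone who wants something finer-grained than eyeballing 'top', here is a rough sketch that polls MemAvailable in /proc/meminfo while a command runs and prints the lowest value seen. The helper names are made up. It is still sampling, so a very short spike can be missed, and the figure is system-wide, so other processes add noise. The command runs in the background with its output discarded, so pass it a file rather than piping into it.

```shell
#!/bin/sh
# Current MemAvailable in kB (system-wide, needs kernel 3.14 or later).
mem_avail_kb() {
    awk '/^MemAvailable:/ { print $2 }' /proc/meminfo
}

# Run a command in the background and print the lowest MemAvailable (kB)
# seen while it was running. min_avail_during is a hypothetical helper name.
min_avail_during() {
    min=$(mem_avail_kb)
    "$@" >/dev/null 2>&1 &
    pid=$!
    # Poll until the child exits; each poll forks awk, which paces the loop.
    while kill -0 "$pid" 2>/dev/null; do
        cur=$(mem_avail_kb)
        [ "$cur" -lt "$min" ] && min=$cur
    done
    wait "$pid" 2>/dev/null || true
    echo "$min"
}
```

Example: compare the output of mem_avail_kb right before the run with the output of min_avail_during nft -f sets.nft (where sets.nft is a placeholder for whatever file holds the set definitions). The difference is an estimate of the transient spike.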
40 iterations for each family (ipv4 and ipv6). The numbers are in seconds.
I also tested running 'nft list chain inet $geotable $geochain' in each iteration. This command on its own takes only marginally less time than adding a set. This chain doesn't even exist, so this is the time it takes for the nft CLI to simply check the table and come up with an error.
Update: I wanted to report the issue of the nftables CLI getting increasingly laggy with every added set on netfilter's bugzilla, but it turns out you can't just open an account there, you need to send an email asking for one. So I did that. So far no response. I also slightly re-phrased my email to their mailing list to try and get past their spam filter. This time I didn't get a rejection email in reply, but so far the message isn't showing up on their board. [found the rejection response in my spam folder... slightly ironic]
Wait and see, I guess.
banIP uses a different syntax for filling in Sets and listing tables (listing the entire rule set and then parsing the table information) - this is much faster and has no performance penalty.
I've slightly modified your "bug" iteration script, e.g.:
#!/bin/sh
# Download the GB IPv4 list and turn it into a comma-separated element list
curl "https://www.ipdeny.com/ipblocks/data/countries/gb.zone" | tr '\n' ',' > "test.set"
# Start from a clean test table
nft delete table inet test 2>/dev/null
nft add table inet test
for i in $(seq 1 40); do
    echo "Iteration #$i"
    printf %s "Adding a set: "
    # Create the set together with all of its elements in a single 'add set' command
    printf '%s\n' "add set inet test testset${i} { type ipv4_addr; flags interval; auto-merge; policy memory; elements={ $(cat test.set) }; }" | /usr/bin/time -f %es nft -f -
    printf %s "Running 'nft list ruleset': "
    # -t (terse) lists the ruleset without the set elements
    /usr/bin/time -f %es sh -c "nft -t list ruleset 1>/dev/null"
    printf '\n'
done
Test output (tested on a BananaPI R3):
Iteration #1
Adding a set: 0.31s
Running 'nft list ruleset': 0.02s
Iteration #2
Adding a set: 0.31s
Running 'nft list ruleset': 0.02s
[...]
Iteration #39
Adding a set: 0.31s
Running 'nft list ruleset': 0.02s
Iteration #40
Adding a set: 0.31s
Running 'nft list ruleset': 0.02s
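To illustrate the "list the entire rule set and then parse the table information" approach mentioned above, here is a minimal sketch. It is not taken from banIP; the function name is made up, and it only assumes that 'nft -t list ruleset' prints each table header at the start of a line as "table <family> <name> {".

```shell
#!/bin/sh
# Print "table <family> <name>" for every table found in a ruleset listing
# read from stdin (same shape as the output of 'nft list tables').
# tables_from_ruleset is a hypothetical helper name.
tables_from_ruleset() {
    awk '/^table / { print $1, $2, $3 }'
}
```

It would be used as: nft -t list ruleset | tables_from_ruleset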
I can confirm that your script doesn't induce the bug in my setup either. I'm not sure where the fundamental difference is, though. Reading the file with cat directly, as opposed to piping elements through 'tr' into nft, is slightly faster, which is understandable. However, the main issue is not the speed of populating elements but rather the fact that after populating elements in a set, nftables becomes less and less responsive with each iteration (as evidenced by the increasing time it takes to get a response from the 'nft list tables' command). So I'm starting to suspect that elements get populated differently with your method and this somehow avoids the bug. Do you have some insight into what's happening?
Edit: I just noticed that you replaced the command 'nft list tables' with 'nft -t list ruleset', and that command somehow doesn't induce a progressively larger lag, while 'nft list tables' does (with your test script as well). So I suspect that some commands cause nftables to re-process all accumulated sets, and some don't. 'nft list tables' does, while 'nft -t list ruleset' doesn't. Your method of populating the sets doesn't, while my method does. I still really don't understand why, and this looks like a bug to me, or perhaps a couple of separate bugs in nftables.
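A quick way to see the difference between the two listing commands side by side, once a number of large sets are loaded, is something like this. It's a sketch with made-up helper names; it reads /proc/uptime for timing, so it is Linux-only with 10 ms resolution, and it needs to run as root on a box with nft installed.

```shell
#!/bin/sh
# Print the wall-clock seconds a command takes, based on /proc/uptime.
# elapsed and compare_listing are hypothetical helper names.
elapsed() {
    read -r t0 _ < /proc/uptime
    "$@" >/dev/null 2>&1 || true
    read -r t1 _ < /proc/uptime
    awk -v a="$t0" -v b="$t1" 'BEGIN { printf "%.2f\n", b - a }'
}

# Time the two listing commands discussed above.
compare_listing() {
    for cmd in "nft list tables" "nft -t list ruleset"; do
        # $cmd is intentionally unquoted so it splits into command and arguments
        printf '%-22s %ss\n' "$cmd" "$(elapsed $cmd)"
    done
}
```

Calling compare_listing after each added set should show whether one of the two commands keeps getting slower while the other stays flat.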
Edit2: I attached your version of the test script to the bug on bugzilla, I hope you don't mind.
So eventually by trial and error, and following the tip from @dibdot, I came up with workarounds for the nftables bug which pretty much avoid all slowness issues with the current version of nftables available in OpenWRT (1.0.8). I believe that memory use is also reasonable this way. About 40MB of free memory seems enough for 22 large sets (including ipv4 and ipv6). About 20MB out of that is used temporarily when applying the sets.
A quick update. After having done some more experiments with nftables, I am noticing some performance issues with it still, although the workaround definitely helps.
There has also been some progress with the reported bug. A patch for at least one command ('nft list tables') has been submitted. Apparently, the root cause is indeed that some commands cause nftables to "fetch" elements from the kernel when they are not supposed to. That includes the (soon to be fixed) 'nft list tables' command, the (acknowledged but not yet fixed) 'nft add element' command, and the (neither acknowledged nor fixed yet) 'nft -t list table' and 'nft -t list set' commands. I'm not sure that this list is exhaustive.
In addition, memory footprint of nftables sets is much more significant than iptables+ipset, but patches which fix this have AFAIK already been submitted and accepted, so hopefully the next nftables release will come with a big improvement in this area.
Update: the bug with the 'nft -t list table' and 'nft -t list set' commands has been acknowledged and a patch has been submitted by the devs. That only leaves the 'nft add element' command without a fix, for now. The netfilter dev said that it's a more complicated fix than the others, but they are leaving the issue open because they believe that it is possible to fix.
I'm not posting links to the specific PRs because, well, it turns out that they leak my email address, which I'm not very happy about, and I'd rather not contribute to that.
It's not really leaking mail addresses, as you've mailed to a public mailing list - of course the mailing list archive will display the mail address, as will the linux git repository once/if it gets merged. It's public (including the mail address), but not leaking covertly.
This is not the case. I did email to another netfilter mailing list, and there my email address was properly hidden. Reporting the bug, however, was what caused my email address to be published, as it is inside patch messages issued by the devs. I didn't send those messages, and when reporting the bug, I couldn't have guessed that this is going to happen. On the Bugzilla, the email addresses of the posters are also properly hidden.
If you search for https://lkml.kernel.org/r/<message_id of your previous mails>, or for the plain message ID, in your favourite search engine, you're likely to get results.
Ah true. Didn't expect this either. Viewing the rendered message on the board makes it look like the email address is hidden, but it is included in the raw message. Personally, I think they should do something about this.