So I'm in the process of converting my project from iptables to nftables, and at this point I'm testing performance. The project basically imports geoip ip lists into the firewall to create either a blacklist or a whitelist.
So far, nftables looks significantly worse for this purpose, at least with regards to memory. I can't make a 1:1 comparison because my router only runs iptables while my OpenWrt VM only runs nftables, but FWIW, on my router with 128MB of memory, importing the US ip list (which has around 65k addresses) into an ipset is no issue at all and I can barely notice the difference in memory consumption. On my VM with nftables, 64MB of free memory is not enough to complete this import - it runs out of memory every single time. Looks like a big step backwards. That is when using "policy memory" when creating the nft set.
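For reference, the iptables-side import I'm comparing against is essentially something like this (set name, maxelem value and file path here are just illustrative, not the project's actual code):

# roughly how the iptables+ipset import works (illustrative names and paths):
ipset create US_ipv4 hash:net maxelem 70000
# ipdeny-style lists are one CIDR per line; 'ipset restore' takes "add <set> <entry>" lines
sed 's/^/add US_ipv4 /' /tmp/us_ipv4_iplist.txt | ipset restore
# a single iptables rule then references the whole set:
iptables -I INPUT -m set --match-set US_ipv4 src -j DROP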
I also tried to split the ip list into chunks of 10k elements each and feed those one by one into the set. It chokes after 50k elements regardless.
P.s. enabling auto-merge works around this because it basically shrinks the set from 65k elements to just 5.8k (!). I don't know how it does that, and honestly an order-of-magnitude reduction makes me worry that it may be removing something which should not be removed. Regardless, while the good old 'ipset' utility doesn't do such magic, it could easily handle huge sets with barely any impact on memory.
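If I understand it right, auto-merge just folds adjacent/overlapping intervals into a single element rather than discarding anything, something like this (scratch table/set names are made up):

# minimal sketch with made-up table/set names:
nft add table inet geotest
nft add set inet geotest demo '{ type ipv4_addr; auto-merge; flags interval; }'
nft add element inet geotest demo '{ 192.0.2.0/25, 192.0.2.128/25 }'
# expected to show a single element 192.0.2.0/24 if the merge is lossless -
# the element count drops but the covered address space stays the same
nft list set inet geotest demo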
I use adblock on a router with 64 MB of RAM, with about 80 thousand domains loaded. Memory is very tight, so I installed the additional packages zram-swap and kmod-lib-lzo, which provide extra virtual memory - more than enough for my case.
That's the solution I found for myself; you can give it a try. I'd generally advise installing these packages on any router. As I understand it, it works something like a swap file on Windows - I could be wrong, but in any case extra memory won't hurt.
The most I was able to load in this setup was 250 thousand domains, and the router kept working.
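For reference, setting it up is basically just installing the packages (the init script name might differ between releases, so treat that part as an assumption):

# install zram swap on OpenWrt (package names as mentioned above;
# the init script name is an assumption and may vary between releases):
opkg update
opkg install zram-swap kmod-lib-lzo
/etc/init.d/zram restart    # or just reboot
# verify that the compressed swap device is active:
cat /proc/swaps
free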
And I get your point - yes, I agree, memory consumption went up a little, but performance actually seemed to improve to me, so the zram solution isn't bad in any case for routers with a small amount of memory.
But adblock uses dnsmasq host names and @antonk is using nftsets of IP addresses, completely different mechanisms involved here.
@antonk What does your set definition look like? Are you enabling the interval flag? (That should consolidate any overlapping IP ranges, which might reduce memory footprint.) Look for 'interval' here https://wiki.nftables.org/wiki-nftables/index.php/Sets
I'm thinking something like this:
set geoip_ipv4 {
    type ipv4_addr
    # counter
    flags timeout, interval
    timeout 7d
    gc-interval 6h
    comment "GEOIP: Block list for IPv4 geoip ranges."
}
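And the rule referencing it would be something like this (table and chain names here are just illustrative):

# example rule using the set; 'my_table'/'my_input_chain' are placeholders
nft add rule inet my_table my_input_chain ip saddr @geoip_ipv4 drop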
# import $iplist_file
{
    printf '%s\n' "add set inet $geotable $new_ipset \
        { type ${family}_addr; flags interval; policy memory; }"
    printf '%s\n' "add element inet $geotable $new_ipset { "
    tr '\n' ',' < "$iplist_file"
    printf '%s\n' " }"
} | nft -f - || die_a "Failed to import the iplist from '$iplist_file' into ipset '$new_ipset'."
Resulting set looks like this:
set US_ipv4_2024-01-28_geoip-shell {
    type ipv4_addr
    flags interval
}
Re. memory consumption, to me it looks like the kernel optimizes the sets during import. That optimization process, while transient, consumes memory and probably that's what's causing the issue. Once the import is complete, I'm observing a modest reduction of 3.5MB in free memory for the largest ip list which is US ipv4.
It's probably a good thing to have the sets optimized, but the fact that this imposes an additional memory requirement as compared to iptables+ipset is unpleasant.
I'm thinking that a workaround for memory-constrained devices would be to split large ip lists into separate sets. Obviously that's less than ideal performance-wise, but considering the set optimization nftables does, maybe the end result won't be much worse than iptables.
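Something like this, just as a rough sketch (table/chain/set names are placeholders), with one rule per partial set:

# rough sketch of the "split into several smaller sets" idea:
nft -f - <<'EOF'
table inet geoblock {
    set US_ipv4_part1 { type ipv4_addr; flags interval; }
    set US_ipv4_part2 { type ipv4_addr; flags interval; }
    chain inbound {
        type filter hook input priority 0; policy accept;
        ip saddr @US_ipv4_part1 drop
        ip saddr @US_ipv4_part2 drop
    }
}
EOF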
P.s. also tried writing to a file and importing from it, makes no difference.
Monitoring free memory with top while the import is in progress, I am indeed noticing a transient drop of about 50MB, that is with auto-merge enabled... Without auto-merge, I'm noticing a transient drop of about 180MB in free memory (had to add more memory to the VM to have this complete).
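In case anyone wants to reproduce this, a crude way to capture the dip without watching top (the log path is arbitrary):

# log MemFree once per second while the import runs:
while :; do
    printf '%s %s\n' "$(date +%T)" "$(grep MemFree /proc/meminfo)"
    sleep 1
done > /tmp/memfree.log &
mon_pid=$!
# ... run the nft import here ...
kill "$mon_pid"
# the lowest MemFree value in the log approximates the peak transient usage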
I think it would be good if someone from the OpenWRT dev team filed a feature request with Netfilter to implement some control over the import process in order to reduce these short-term memory spikes.
Ah, you beat me to it! I recalled having read about this same issue somewhere, and just dug that bug out.
I suspect those patches will make it into nftables v1.0.10 (or whatever it will be called), but for now we're stuck with v1.0.9. Maybe if you pile on, add a comment on that bug, it might spur a sooner release...
I doubt that a first comment from a new account opened by a random dude will affect them much
A request from the OpenWRT dev team would probably have a bit more weight to it.
Have you tried importing in non-atomic chunks? Say, do nft add ... { 100-odd elements } as many times as needed? There's another bug I recall where someone was reporting that this sort of thing was very slow, but maybe you can tolerate that. Ok, found it; maybe only tangentially related: https://bugzilla.netfilter.org/show_bug.cgi?id=1706
So I tried splitting into chunks of different sizes, and it looks like it does help somewhat, up to a certain point after which it actually becomes slightly worse. At least going from 500-element to 200-element chunks makes it fail slightly earlier (in terms of elements applied; time-wise it's much slower). With 256MB of memory assigned to the VM, I couldn't get the import to complete without auto-merge no matter what. Going up to 292MB, I can now import the whole US ip list without auto-merge if it's split into chunks of 30k to 40k elements (the exact number varies slightly across reboots etc.). I'd estimate that with 40k chunks it uses about 120MB for the import.
To sum it up, splitting into 2-3 large chunks helps somewhat, splitting into smaller chunks doesn't make sense in my experience.
As far as atomic/non-atomic goes, to my understanding, all nftables operations are atomic, regardless of how you spell them. I'm using the command | nft -f - interface because it's more convenient in a shell script. Based on some other bugs I've read, I'm pretty sure that issuing nft add commands won't make any difference in terms of performance.
Well, that's what I'm doing already, just not passing hundreds/thousands of ip's via a command line as your suggestion implies but rather piping that into nft -f -.
# read $iplist_file, split into chunks, feed to nftables
i=1; max=$ip_cnt; incr=4000
while [ $i -le $max ]; do
    last_el=$((i+incr-1))
    [ $last_el -gt $max ] && last_el=$max
    echo "importing elements from $i to $last_el"
    {
        printf '%s\n' "add set inet $geotable $new_ipset \
            { type ${family}_addr; auto-merge; flags interval; policy memory; }"
        printf '%s\n' "add element inet $geotable $new_ipset { "
        tail -n "+$i" "$iplist_file" | head -n "$incr" | tr '\n' ','
        printf '%s\n' " }"
    } | nft -f - || die_a "Failed to import the iplist from '$iplist_file' into ipset '$new_ipset'."
    i=$((i+incr))
done
Looks like there is an additional/related bug with nftables. Adding more ipsets gets progressively slower, and at some point it fails with OOM, even with auto-merge enabled and using the compacted ip lists from ipdeny. So for instance, I added individual sets US_ipv4 US_ipv6 CA_ipv4 CA_ipv6 GB_ipv4 GB_ipv6 DE_ipv4 DE_ipv6 NL_ipv4 NL_ipv6 JP_ipv4 JP_ipv6 KR_ipv4 KR_ipv6 AR_ipv4 AR_ipv6 AU_ipv4 AU_ipv6.
That goes slower with each set but eventually succeeds (only if I split the sets into chunks). Now trying to add and populate new sets for BR_ipv4 and BR_ipv6, it manages to add the ipv4 list but chokes with OOM on the ipv6 list (which has only 8736 ip's). That despite having 50MB of free memory before the import.
Here's a thought, have you tried setting the size on the set at creation? nftables may be doing "keep the table small" and reallocating it as it grows, whereas if you specify size up front, then it should malloc just once.
Since you've already calculated ip_cnt, use it directly (or add 10 or something for a bit of headroom):
nft add set ... "{
    type ${family}_addr;
    size $((ip_cnt));
    auto-merge;
    flags interval;
    policy memory;
}"
for chunking algorithm; do
    nft add element <chunk>
done
Good thought, but unfortunately that doesn't work with intervals. I tried that, and also tried flags constant - neither works with intervals:
{ type ipv4_addr; auto-merge; flags interval; size 890; policy memory; }
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/dev/stdin:2:1-14097: Error: Could not process rule: Too many open files in system
I'll definitely be dreaming about going back to iptables+ipset... I know this won't happen but man, I miss the rock-solid implementation that "just works". I'm sure that eventually nftables will get there, but when? And how are we supposed to implement geoip in the meanwhile?
Update: turns out that size does work with flags interval, provided that the size parameter is large enough. Experimentally, setting it to 4*ip_cnt is large enough for all the ip lists I tried. Not sure why it needs to be that large; I suspect that the kernel expands some of the ranges while optimizing them. Alas, this doesn't actually make any difference to performance or memory consumption.
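Concretely, the set creation that now goes through looks like this (same variable names as in my earlier snippets; set_size is just a helper variable here):

# the import only succeeds with 'size' if the value is large enough;
# experimentally ~4*ip_cnt was enough for all the lists I tried:
set_size=$((ip_cnt * 4))
printf '%s\n' "add set inet $geotable $new_ipset \
    { type ${family}_addr; auto-merge; flags interval; size $set_size; policy memory; }" | nft -f -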
Huh, I'm sure I read a bug at one point about size and ranges being incompatible. Can't blame it on lack of coffee this late in the day.
Anyhow, maybe post on the users' list, netfilter@vger.kernel.org? I subbed a few years ago, and that's where the nft devs and some very expert users answer hard questions.