I wanted to share a shell script I wrote for aggregating IPv4 and IPv6 subnets. It's quite a specific tool, probably useful mainly to developers and sysadmins. I needed this functionality for my project, but I think it can be useful as a stand-alone tool. I don't think this has been available as a portable shell script before, so here it is, ready to use.
It builds on top of my other little side project, which I also published here before: get-subnet (now called trim-subnet). The aggregation script uses trim-subnet.sh as a library to perform part of the calculation, so both scripts are required.
The script attempts to calculate an efficient configuration for the subnets given as input, by trimming each input subnet down to its mask bits and removing subnets that are encapsulated inside other subnets on the list. It is designed for easier automated creation of firewall rules, but perhaps someone has a different application for it.
The algorithm is honestly very primitive: it iterates over the bytes of each subnet and compares them to the corresponding bytes of the other subnets on the list, so it is quite slow. It works great for a small number of subnets though, which is good enough for me.
If someone good with math wants to offer a faster solution which doesn't impose additional dependencies, I will welcome your contribution.
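To make the trim-then-eliminate idea concrete, here is a minimal IPv4-only sketch. It is not the code from the actual script: it swaps the byte-by-byte comparison for integer arithmetic, all function names are hypothetical, and it assumes a shell with 64-bit arithmetic and input that is already validated as a.b.c.d/bits.

```shell
#!/bin/sh
# Sketch only (IPv4, hypothetical helper names). Assumes 64-bit shell
# arithmetic and validated input of the form a.b.c.d/bits.

# Dotted quad -> integer. The ( ) body is a subshell, so IFS stays local.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Integer -> dotted quad.
int_to_ip() {
    echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# Netmask for a prefix length, as an integer.
mask_for() {
    echo $(( (0xFFFFFFFF << (32 - $1)) & 0xFFFFFFFF ))
}

aggregate() {
    # Step 1: trim every subnet to its network address and emit "bits net",
    # sorted so that wider subnets (fewer mask bits) come first.
    trimmed=$(
        for subnet in "$@"; do
            bits=${subnet#*/}
            ip=$(ip_to_int "${subnet%/*}")
            mask=$(mask_for "$bits")
            echo "$bits $(( ip & mask ))"
        done | sort -u -k1,1n -k2,2n
    )

    # Step 2: keep a subnet only if no already-kept (wider or equal) subnet
    # contains it.
    kept=""
    while read -r bits net; do
        [ -n "$bits" ] || continue
        covered=0
        for k in $kept; do
            kbits=${k%%:*}
            knet=${k#*:}
            kmask=$(mask_for "$kbits")
            if [ $(( net & kmask )) -eq "$knet" ]; then
                covered=1
                break
            fi
        done
        if [ "$covered" -eq 0 ]; then
            kept="$kept $bits:$net"
            echo "$(int_to_ip "$net")/$bits"
        fi
    done <<EOF
$trimmed
EOF
}

aggregate 192.168.1.1/24 192.168.0.0/16 192.169.0.9/16
# prints:
# 192.168.0.0/16
# 192.169.0.0/16
```

Sorting by prefix length first means each subnet only has to be checked against subnets that were already kept, which avoids comparing every pair in both directions.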
If the purpose is to output the subnet that encompasses all of the listed subnets, I'm not sure I understand why this is outputting 192.168.1.0/24 and then 192.169.0.0/16. It seems that it totally ignored 192.168.0.0/16 as an input, and it still provided two subnets rather than a single subnet that combines 192.168.0.0/16 (of which 192.168.1.1/24 is a subset) and 192.169.0.9/16. I would think the output would be 192.168.0.0/15.
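The /15 arithmetic can be checked with a few lines of shell. This is only an illustration of the reasoning in this post, with hypothetical helper names and assuming 64-bit shell arithmetic: 168 and 169 differ only in the lowest bit of the second octet, so the two /16 networks share their first 15 bits.

```shell
#!/bin/sh
# Hypothetical helper: dotted quad -> integer (assumes 64-bit shell arithmetic).
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Succeeds if both addresses fall into the same network for prefix length $3.
same_network() {
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
    [ $(( a & mask )) -eq $(( b & mask )) ]
}

same_network 192.168.0.0 192.169.0.0 15 && echo "one /15 covers both"
same_network 192.168.0.0 192.169.0.0 16 || echo "a single /16 does not"
```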
Nope, you are right - looks like there is a bug
I'll go through the code to see what's causing this.
I just wrote it in the past couple of hours, so consider this an alpha version.
Ok, I think I see what the issue is. It's not doing all the comparisons, so it's basically ignoring the wider subnet.
Update: it's the sort -un command. It is supposed to remove duplicates but for some reason it removes non-duplicates... Working on the fix.
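One plausible explanation, which is not confirmed anywhere in this thread: with -n, sort compares only the leading numeric string of each line, and with -u it keeps a single line from each group that compares equal. In the C locale, 192.168.1.0/24 and 192.168.0.0/16 both start with the number 192.168, so they count as duplicates. A small reproduction, using the trimmed subnets from the example above as assumed input:

```shell
#!/bin/sh
# Assumed input: the trimmed subnets from the example discussed above.
list='192.168.1.0/24
192.168.0.0/16
192.169.0.0/16'

# Force the C locale so that "." is treated as the decimal point.
LC_ALL=C
export LC_ALL

# Numeric + unique: both 192.168.* lines compare equal as the number 192.168,
# so only one of them survives (which one may depend on the implementation).
printf '%s\n' "$list" | sort -un

# Plain unique compares whole lines, so all three subnets survive.
printf '%s\n' "$list" | sort -u
```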
The purpose, BTW, is to eliminate subnets that are encompassed by other subnets on the list, not to find a larger subnet that would encompass all of them. Maybe "aggregate" is the wrong word for this?
I personally have an application for this, which is why I wrote this script. When setting up whitelist-based geoip blocking, I want to whitelist all local subnets, so I have to deal with multiple entries like fdxx:xxxx:xxxx:10::7bf/128 and fdxx:xxxx:xxxx:10:xxxx:xxxx:xxxx:xxxx/64.
Now I could just add all entries to the whitelist, but I think that this is lame. So this script lets me eliminate the smaller subnets that are already encompassed.
That's good to know. Also, I knew about similar functionality in ipset. However, I thought that creating an ipset (or nftables set) for a handful of subnets (in most cases this will be 2 or 3) is a waste of memory. Maybe I'm wrong.
I mean, obviously I am using ipsets for the geoip whitelist, but I think there should be a separate firewall rule for local networks, and creating a set for it seemed wasteful.
Updated the description, so hopefully it's clearer now what the script actually does.
Also, if you think the functionality could be usefully extended to some related area, I'm open to suggestions. @psherman
Found a few ways to optimize these scripts, so now they work about 2.5x faster. It still takes almost 2 seconds to aggregate 3 IPv6 subnets on my old router CPU, but that's better than the 5 seconds it took before.
if i read correctly, in this part (actually parts, as it is repeated) (25[0-5]|(2[0-4]|1[0-9]|[1-9]|) there is an empty pattern at the end, here: |). it looks like a typo to me. am i wrong, and is there a reason to match an empty string?
Actually, it turns out that it's not a typo. The empty alternative makes the capturing group (2[0-4]|1[0-9]|[1-9]|) optional, which is required to match a single-digit octet of the IPv4 address. That's a peculiar way to design a regex, and I wasn't even aware of it, since I basically copied the regex verbatim from a stackoverflow answer and just adapted it to ERE syntax (the link to the answer is in the code comment, btw). At the time, I tested so many different regex options that I couldn't go into much detail on each one. I was just happy to find this one, which performed 40x faster than some of the others (including the shortest and sexiest one in the same answer) while still seemingly being completely watertight. The regex speed doesn't really matter in this script, but I'm using the same regex to validate lists containing thousands of IP addresses in my other scripts, and there it makes a huge difference.
yes, i see now that a one-digit octet will be matched this way, but there will be two capturing groups, which may create a problem if you process capturing groups.
you don't need a one-liner regexp per se, as it is a script. there are other options to find x.x.x.x format and validate using script language - but that's purely personal preference.
strangely though, according to the regex101.com site, (25[0-5]|(2[0-4]|1[0-9]|[1-9]|)[0-9])\. matches on 256 too.
this regexp creates capturing groups which i doubt you use, so maybe you can even fine-tune it by using non-capturing groups (?:) - but that should be measured, it may not give any performance benefit at all.
In theory yes, but in practice I don't think implementing IP validation through shell logic can compete performance-wise with a well-designed regex.
If you take the complete regex ^((25[0-5]|(2[0-4]|1[0-9]|[1-9]|)[0-9])\.){3}(25[0-5]|(2[0-4]|1[0-9]|[1-9]|)[0-9])$ (I added the anchors at the beginning and at the end because I'm adding them in the code anyway) and check it on that website in the ECMAScript flavor (which, as far as I know, should be close to ERE syntax), it doesn't match any number above 255 in any octet. Even just the portion you quoted, on its own and without anchors, doesn't match 256 for me. Not sure which of us is doing something wrong.
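The anchored regex can also be checked directly with grep -E instead of a website. A minimal sketch: the is_ipv4 wrapper is a hypothetical name used only for this example, and since POSIX leaves an empty alternative like |) undefined in ERE, this assumes an implementation that accepts it (GNU grep does).

```shell
#!/bin/sh
# The complete anchored regex quoted above.
ipv4_re='^((25[0-5]|(2[0-4]|1[0-9]|[1-9]|)[0-9])\.){3}(25[0-5]|(2[0-4]|1[0-9]|[1-9]|)[0-9])$'

# Hypothetical wrapper: succeeds only if the whole argument is a valid IPv4 address.
is_ipv4() {
    printf '%s\n' "$1" | grep -Eq "$ipv4_re"
}

for addr in 192.168.1.1 8.8.8.8 256.1.1.1 1.2.3 01.2.3.4; do
    if is_ipv4 "$addr"; then
        echo "$addr: valid"
    else
        echo "$addr: invalid"
    fi
done
# prints:
# 192.168.1.1: valid
# 8.8.8.8: valid
# 256.1.1.1: invalid
# 1.2.3: invalid
# 01.2.3.4: invalid
```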
Yeah, I should test this sometime, thanks for the tip.
Also, since we've had a pretty good chat about code here, I'd like to ask if you would agree to test another script related to the same mother project that I'm working on. It's here, in the comment marked as Solution. I just need some statistics to see if the heuristics I found actually work in environments other than mine. All you need to do is copy the shell code into a new .sh file (like find-local-subnets.sh), download the getsubnet.sh script (which that code depends on) to the same directory, run sh find-local-subnets.sh (no root required) and check the output. Would be very thankful if you agreed.
The website is trying to be helpful by displaying which portion of the input does match. However, as long as the complete input is not highlighted, there is no actual match.
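The same partial-match effect can be reproduced on the command line. A small sketch, assuming a grep that supports the non-POSIX -o option and accepts the empty alternative (GNU grep does both): the unanchored octet pattern finds a match inside 256., but never on the whole string.

```shell
#!/bin/sh
# The unanchored single-octet portion quoted earlier in the thread.
octet_re='(25[0-5]|(2[0-4]|1[0-9]|[1-9]|)[0-9])\.'

# -o prints only the part of the input that matched: the "2" is skipped.
printf '256.\n' | grep -oE "$octet_re"
# prints: 56.

# With anchors added, the whole input has to match, and it does not.
printf '256.\n' | grep -Eq "^$octet_re\$" || echo "no full match"
```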