Optimization of adblock-lean

sed 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/; s/\([^#]\)#.*$/\1/g; s/\(#\|[ \t]*\)$//; /^\(#\|$\)/d; s/\(^address=\|^server=\)/local=/; $a\' "blocklist.1" | awk -F'/' 'NR==FNR { allow[$0]; next } { n=split($2,arr,"."); addr = arr[n]; for ( i=n-1; i>=1; i-- ) { addr = arr[i] "." addr; if ( addr in allow ) next } } 1' "allowlist" -

What does \1 mean though?

And just to clarify, unfortunately my command runs slightly slower than yours after I fixed the regex.

Edit:

So with the following sed snippet the command actually runs about 4% faster than the original, while also being shorter. Not sure it does everything the original command is doing though, as I don't understand the aforementioned part of it (that part doesn't actually remove comment lines, at least not with the version of sed installed on my computer).

sed 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/; s/#.*$//; s/[ \t]*$//; /^$/d; s/\(^address=\|^server=\)/local=/; $a\'
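
On the \1 question: in sed, \( ... \) captures whatever matched inside it, and \1 in the replacement puts that captured text back. So s/\([^#]\)#.*$/\1/g deletes a # and everything after it, but only when some non-# character comes before the #, and it keeps that character. A small sketch on made-up sample lines (the function names strip_orig and strip_short are only for illustration) shows why that expression on its own leaves whole-line comments in place, while the shortened s/#.*$// empties them:

```shell
# \( ... \) captures one non-# character; \1 writes that character back.
strip_orig() { printf '%s\n' "$1" | sed 's/\([^#]\)#.*$/\1/g'; }
# Shortened form: drop everything from the first # onward.
strip_short() { printf '%s\n' "$1" | sed 's/#.*$//'; }

strip_orig  'local=/ads.example/ # note'   # trailing comment removed, the space before it is kept
strip_orig  '# whole-line comment'         # unchanged: no character precedes the #
strip_short '# whole-line comment'         # reduced to an empty line
```

In the original command the whole-line comments are then dropped by the later /^\(#\|$\)/d expression, and in the shortened one by /^$/d, so on these samples both versions should end up with the same result.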

TLDR for everyone: the awk "remove allowlist" method remains unchanged (no faster method has been found to date). The sed "sanitise blocklist" step may have a small tweak available.

@antonk thanks for reviewing everything so far.
Your tweak running on my router (netgear r7800): 1m 22s
My existing command line on router: 1m 21s

I'll call this even speed, as the Linux command "date" only rounds to the nearest second, so the difference is probably just a rounding error.
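
If sub-second timing would help settle this, one option on Linux is to read /proc/uptime, which reports seconds since boot with two decimal places. A rough sketch, assuming /proc/uptime is readable (it falls back to whole seconds from date otherwise); the helper names now and time_cmd are made up for this example:

```shell
# Seconds since boot from /proc/uptime (Linux only, 0.01 s resolution);
# falls back to whole seconds from date if that file is not readable.
now() {
    if [ -r /proc/uptime ]; then
        cut -d' ' -f1 /proc/uptime
    else
        date +%s
    fi
}

# Hypothetical helper: run a command and report the elapsed time on stderr.
time_cmd() {
    start=$(now)
    "$@"
    end=$(now)
    awk -v s="$start" -v e="$end" 'BEGIN { printf "elapsed: %.2f s\n", e - s }' >&2
}

time_cmd sleep 1
```

To time a whole pipeline, it can be wrapped as time_cmd sh -c '...' so that the pipeline runs as a single command.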

However, I'm going to review your code and see what the difference is, and I'll also get back to you on what the \1/g part does. We've been refining adblock-lean over the last couple of years now, and I pulled together these commands from all sorts of forums, help from stackexchange peeps etc. So I can't remember exactly what \1/g does. It might not even be needed at all.

I can confirm, when run in my x86 VM with OpenWrt, the performance is identical. On my computer I do get a consistent 4.5% improvement. So looks like GNU sed has some optimization which busybox sed doesn't have.

Edit: turns out the limitation is not in Busybox sed but rather in Busybox awk. gawk performs much better in my OpenWrt VM, and with it I can actually see a performance difference when altering the sed command. Simply switching from awk to gawk brings a whopping 47% performance increase. So perhaps you could recommend that users install gawk, and change your code to use gawk if it's present.

Edit2: using GNU sed brings an additional 16% performance improvement in my VM.

Edit3: Not sure why but now I'm getting better results with busybox sed (about 12% faster). So disregard Edit2 :)

Awesome, and the gawk speed is especially interesting. I'll test/confirm over the following days. The adblock-lean philosophy is to use default OpenWrt packages only. But yeah, can't ignore a 47% speed improvement.

I'm not suggesting making gawk a dependency. However some users who have the ~273kB to spare could as well be made aware of the performance gain, and all your code needs to support this is a little bit of detection.

P.s. I saw a 47% increase when using a much larger allowlist (which basically includes all entries the blocklist has). When using the sample allowlist posted by Lynx, I'm seeing a 56% performance gain.

Fantastic analysis gents! Seems like our existing lines are already well optimised and hard to beat. Nice work in arriving at these @Wizballs.

Yes as a design philosophy we’ve always tried to leverage only base system stuff. We could start to check for dependencies and leverage those if present, but that feels like a big step and also introduces some additional complexity.

We had a similar discussion about uclient-fetch and curl.

It's really not a big deal and very easy to implement. I can help with this if you like.

Testing with a 1.5 million entry blocklist and a 1.5 thousand entry allowlist. I removed the sed component to isolate the speed difference of the awk/gawk component alone.

Well damn, huge speed increase confirmed by me.
gawk 0m 29s
awk 1m 20s

Nearly a three-fold speed increase for this standalone component.

Output seems identical. Strangely enough, piping sed | gawk seems to have even more performance gains than sed | awk. Need to do more confirmation testing. But I'm switching to gawk haha.

@Lynx How do you feel about adding a little spice to adblock-lean?

If gawk exists, then party with gawk: sed | gawk.
If gawk doesn't exist, then hang out with sed | awk.
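
A minimal sketch of that check, assuming a POSIX shell such as the Busybox ash on OpenWrt; the variable name awk_cmd is made up here and is not taken from the adblock-lean source:

```shell
# Use gawk when it is installed, otherwise fall back to the default awk.
if command -v gawk >/dev/null 2>&1; then
    awk_cmd="gawk"
else
    awk_cmd="awk"
fi

# The rest of the pipeline stays the same; only the binary name differs.
printf 'local=/example.com/\n' | "$awk_cmd" -F'/' '{ print $2 }'
```

Keeping the choice in one variable means the awk program itself only has to be maintained once.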

I’m a bit torn on it. On the plus side we can improve speed with an optional dependency, but on the minus side it potentially opens up scope for inconsistency and harder maintenance, since one change becomes two changes for two separate commands. It also means we might start doing this for other bits of the code. There is a simplicity I like about sticking with the original dependency-free commands.

Ah @Wizballs just read your post above since we posted at approximately the same time.

So I think you’re in favour of checking and leveraging if it exists? Easy enough to do. Are the commands otherwise pretty much the same just different binaries?

For small/negligible gains I agree, stick to default packages. But this is anything but a small gain!

And the command line for awk and gawk is identical apart from one letter (g).

Whichever way you decide to go, I can't help but consider that my install will run gawk for this operation. It's leagues faster than awk.

Yep, awk syntax is luckily very much standard (unlike sed), and as long as you don't use some unique GNU extensions (which you don't), code should always be compatible. AFAIK.

This one I'm not torn on, not even the slightest bit. The gains are just too big to ignore. And plus, adblock-lean will continue to function 100% perfectly well whether the user has gawk installed or not. Aaaand it will save me editing the script every time there is an update ;)

Yep, the commands are fortunately 100% identical.

I just realized that the performance figures I posted were incorrect. What I was calculating was time reduction, not performance increase. So my gawk vs awk results are closer to @Wizballs's results: about a 2x speed increase.
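
To make the conversion explicit: if a run takes a fraction r less time, the speed-up factor is 1 / (1 - r). A quick sketch with awk, using only the percentages and timings quoted in this thread (the function name speedup is made up):

```shell
# Convert a fractional time reduction into a speed-up factor: 1 / (1 - reduction).
speedup() {
    awk -v r="$1" 'BEGIN { printf "%.2f\n", 1 / (1 - r) }'
}

speedup 0.47    # 47% less time
speedup 0.56    # 56% less time
# The standalone timings quoted earlier: 80 s with awk, 29 s with gawk.
awk 'BEGIN { printf "%.2f\n", 80 / 29 }'
```

That works out to roughly 1.9x and 2.3x for the two time reductions, against roughly 2.8x for the 1m 20s vs 0m 29s measurement.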

OK I’m sold. I can implement (if I know the exact commands for each) and otherwise perhaps either of you could make a pull request against the improve-process-flow branch? Really appreciate your interest and input here to our little project @antonk!

The commands can stay exactly the same as they currently are in adblock-lean. There might be a minor tweak to review, but there is zero difference in the output or speed.

I don't mind doing that in case @Wizballs doesn't want or have time for it (this may take me some time though because I'll want to get acquainted with the project's structure before offering patches).

Ah, is the call the same? I mean, if you install gawk, does a call to awk then go through the gawk binary?

Not in my setup, at least. After gawk installation it's available with the gawk call, while calling awk still brings up Busybox awk.

No, but the only thing that needs to change is that the binary called is gawk instead of awk. So one letter different. One option is maybe to put the trailing command into a variable, and then call that variable from either awk or gawk. But I'm only guessing.
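
Something along these lines might work as a starting point. It is only a sketch: the awk program is the allowlist filter from the command at the top of the thread, while the names awk_cmd, awk_prog and remove_allowlisted are invented for this example and do not come from adblock-lean.

```shell
# Prefer gawk when it is installed, otherwise keep the default awk.
awk_cmd=awk
command -v gawk >/dev/null 2>&1 && awk_cmd=gawk

# The allowlist filter from the current command, stored once in a variable.
awk_prog='NR==FNR { allow[$0]; next }
{ n=split($2,arr,"."); addr = arr[n];
  for ( i=n-1; i>=1; i-- ) { addr = arr[i] "." addr; if ( addr in allow ) next } } 1'

# Hypothetical wrapper: $1 is the allowlist file, the blocklist arrives on stdin.
remove_allowlisted() {
    "$awk_cmd" -F'/' "$awk_prog" "$1" -
}
```

It would slot into the existing pipeline as sed '...' "blocklist.1" | remove_allowlisted "allowlist", so the awk program is written once and only the binary name varies.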
