Optimization of adblock-lean

How would the awk line:

awk -F'/' 'NR==FNR { allow[$0]; next } { n=split($2,arr,"."); addr = arr[n]; for ( i=n-1; i>=1; i-- ) { addr = arr[i] "." addr; if ( addr in allow ) next } } 1' "/tmp/allowlist" -

be amended by incorporating -v use_allowlist="${use_allowlist}" and amending the pattern such that:

  • if use_allowlist is set to "1", then the above existing (allowlist) processing is performed; and

  • otherwise, the input is simply printed to the output

Context here:

Namely, it would seem nicer to have awk handle the condition rather than use the ugly if statement in shell.

I think this should do it.

awk -F'/' -v use_allowlist="${use_allowlist}" 'use_allowlist==123 {print $0; next} NR==FNR { allow[$0]; next } { n=split($2,arr,"."); addr = arr[n]; for ( i=n-1; i>=1; i-- ) { addr = arr[i] "." addr; if ( addr in allow ) next } } 1' "/tmp/allowlist" -

That said, IMO calling awk just to read and print the file line by line is much less efficient than the native shell method.


Thanks for your response!

Just so I understand, how does the use_allowlist==123 work?

Ah, interesting. But isn't:

# 1 Convert to lowercase; 2 Remove comment lines and trailing comments; 3 Remove trailing address hash, and all whitespace; 4 Convert to local=; 5 Add newline
sed 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/; s/\([^#]\)#.*$/\1/g; /^#/d; s/#$//; s/^[ \t]*//; s/[ \t]*$//; /^$/d; s/\(^address=\|^server=\)/local=/; $a\' "/tmp/blocklist.${blocklist_id}" |
if [[ "${use_allowlist}" == 1 ]]
then
	awk -F'/' 'NR==FNR { allow[$0]; next } { n=split($2,arr,"."); addr = arr[n]; for ( i=n-1; i>=1; i-- ) { addr = arr[i] "." addr; if ( addr in allow ) next } } 1' "/tmp/allowlist" -
else
	cat
fi > "/tmp/blocklist.${blocklist_id}.new"
mv "/tmp/blocklist.${blocklist_id}.new" "/tmp/blocklist.${blocklist_id}"

super ugly? Is there a way to improve this and perhaps avoid the call to cat?

Oops, this was copy-pasted from my test. Should read as use_allowlist==1.
Basically this is a shortened if statement: if use_allowlist equals 1, print the current line, skip the rest of the rules, and go to the next line.

Edit: Bruh, you also wanted it to be NOT equal 1. So use_allowlist!=1
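Putting both fixes together, a sketch of the full command. One detail I'd add (my own caution, not in the original suggestion): the allowlist-loading rule should stay first, otherwise the passthrough rule would also echo the allowlist file's own lines into the output. Loading the allowlist when it isn't used is harmless.

```shell
awk -F'/' -v use_allowlist="${use_allowlist}" '
    NR==FNR { allow[$0]; next }         # first file: load allowlist (harmless when unused)
    use_allowlist != 1 { print; next }  # allowlisting disabled: pass input through unchanged
    { n = split($2, arr, ".")           # otherwise: build and check each parent domain
      addr = arr[n]
      for (i = n-1; i >= 1; i--) {
          addr = arr[i] "." addr
          if (addr in allow) next       # allowlisted suffix found: drop the line
      } }
    1' "/tmp/allowlist" -
```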

Shell code is not pretty in general, but I don't see anything particularly ugly in this one. Think about performance: your native shell code performs one if evaluation and then prints the entire file, while the awk command performs as many if evaluations as there are lines in the file, reading and printing the file line by line. Now that is ugly, from my perspective.


And yeah, I don't see any way to make it look more elegant or avoid using cat.

Edit:

Looking closer at the awk command, I'm wondering if it can be improved for better performance. It looks like the idea is to eliminate duplicate IPv4 addresses. What I don't understand is why it needs to split each address into octets and then run a for loop making up to four comparisons per address, rather than comparing the whole address at once.

@Lynx


Really appreciate your input here! For context, see here:

And here:

Any thoughts @Wizballs?

Because if the user wants to allow e.g. google.com, you also want to automatically allow all subdomains such as calendar.google.com, gmail.google.com.

And at the same time, you don't want to accidentally allow e.g. google.com.fake.com.

Sure, full line matches are much faster, but they don't achieve what I just outlined here.
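To see what the loop actually compares, here is a standalone trace (hypothetical input, trace printing added by me) of the suffixes it builds from the TLD upward. Note that the bare TLD itself is never looked up, since the loop prepends a label before each comparison:

```shell
echo 'gmail.google.com' | awk '{
    n = split($0, arr, ".")
    addr = arr[n]                 # start from the TLD
    for (i = n-1; i >= 1; i--) {
        addr = arr[i] "." addr    # prepend the next label
        print "check: " addr      # in the real command: if (addr in allow) next
    } }'
# prints:
# check: google.com
# check: gmail.google.com
```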

But always happy to try suggestions/improvements to this part!


Oh, I incorrectly assumed that it was comparing IP addresses rather than domain names. I'd like to understand this command further, though. Could you perhaps post sample input files /tmp/blocklist.${blocklist_id} and /tmp/allowlist?


Sorry for the delay - here are the example files:

I'll drop in some short examples here if that also helps. Keep in mind we are often running a blocklist of maybe 1,000,000 entries and an allowlist of 1,000 entries.

allowlist

google.com
aiidoge.smurfs-fun.com
aiieer.mangnut2.com
aiiegro-info.com

blocklist

local=/calendar.google.com/
local=/gmail.google.com/
local=/keep.google.com/
local=/google.com.fake.com/
local=/fakegoogle.com/
local=/aiidoge.smurfs-fun.com/
local=/aiieer.mangnut2.com/
local=/aiiegro-info.com/
local=/aiiegro-lokalnie.pl/
local=/aiiegrolokainie.pl/
local=/aiienxchain.com/

output using current awk

local=/google.com.fake.com/
local=/fakegoogle.com/
local=/aiiegro-lokalnie.pl/
local=/aiiegrolokainie.pl/
local=/aiienxchain.com/
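For anyone who wants to reproduce this, the sample run above can be scripted end to end (the /tmp/blocklist.test name is made up for this demo; the thread's real file is /tmp/blocklist.${blocklist_id}):

```shell
cat > /tmp/allowlist <<'EOF'
google.com
aiidoge.smurfs-fun.com
aiieer.mangnut2.com
aiiegro-info.com
EOF

cat > /tmp/blocklist.test <<'EOF'
local=/calendar.google.com/
local=/gmail.google.com/
local=/keep.google.com/
local=/google.com.fake.com/
local=/fakegoogle.com/
local=/aiidoge.smurfs-fun.com/
local=/aiieer.mangnut2.com/
local=/aiiegro-info.com/
local=/aiiegro-lokalnie.pl/
local=/aiiegrolokainie.pl/
local=/aiienxchain.com/
EOF

# The exact awk command from the thread, reading the blocklist on stdin
awk -F'/' 'NR==FNR { allow[$0]; next }
    { n=split($2,arr,"."); addr=arr[n]
      for (i=n-1; i>=1; i--) { addr = arr[i] "." addr; if (addr in allow) next } }
    1' /tmp/allowlist - < /tmp/blocklist.test
# prints the five surviving lines shown above
```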


I gotta say, I tried to optimize the awk command, but I found that I can't improve on the existing code, at least when running on the sample files you posted.

I also cannot find anything faster. This command does not have to be limited to awk however, so there may still be a better way.

I updated the short lists above to include fakegoogle.com, which the current awk command also correctly handles.

I read over this and thought: OK, we can wrap all the allowlist entries in both local=/xxx/ and also .xxx/ (essentially doubling the allowlist entries, but who cares if it's faster overall, right?), and it would achieve the same result as the current awk.

Before doubling the entries, I tried it as is, in the current list formats:
1,500 allowlist entries in example.com format
1.4 million blocklist entries in local=/example.com/ format

current awk: 1m 20s

grep -v -i -f allowlist blocklist > outputfile
I cancelled it at 27 mins, and it was only 15% through processing, based on the output file size.

Why?
Awk stores the allowlist entries as keys of a hash table, so each comparison is a constant-time lookup... much, much faster.
grep, I believe, is just comparing each input line against every pattern in turn (no hashing).
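The awk side of that claim is easy to see in isolation: the in operator is a hash lookup on the array's keys, which is exactly what addr in allow does in the main command. A minimal self-contained sketch (the allowlist entries here are made up):

```shell
echo 'google.com' | awk '
    BEGIN { allow["google.com"]; allow["example.net"] }  # referencing a key creates it
    { if ($0 in allow) print "allowed"; else print "not allowed" }'
# prints: allowed
```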

So yeah. Feel free to speed test this, but my router is still cooling down :wink:

@dave14305


Full disclosure: my personal whitelist has 1 entry. :innocent:

I'm just testing on my computer. Less scientific but at least I don't have to wait so long.

I was thinking that integrating the sed command into awk might speed things up, but the opposite is true. sed is just much faster. So instead here is a slightly shortened sed command which at least on my computer speeds things up by about 20% with the sample inputs.

sed 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/; s/\([^#]\)#.*$/\1/g; s/\(#$\|[ \t]*\)//; /^\(#\|$\)/d; s/\(^address=\|^server=\)/local=/; $a\'

I don't know what some parts of this command do, so I didn't touch those parts. For instance, what does s/\([^#]\)#.*$/\1/g do? I'm confused by the \1 here. And what does $a\ do?


ahah ok, grep can handle 1 entry just fine :wink:

sed being 20% faster than awk is pretty much what we found in most cases, and we therefore use sed wherever possible. OK cool, let me try your sed version on the router itself - always good to test final performance on the actual SoC.

s/\([^#]\)#.*$/\1/g
This strips trailing comments: it matches a non-# character followed by # and everything after it, and \1 (a backreference to the captured group) puts that one character back, e.g.
example.com # trailing comment
Lines that are entirely comments are removed separately by /^#/d.

$a\
adds a new line at the bottom of the list, helpful when combining multiple files
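A quick, made-up example of the backreference in action. \1 re-inserts the single captured non-# character, so the line survives with only its trailing comment removed; the space captured before the # is kept, which the later whitespace-trimming expressions then clean up:

```shell
echo 'example.com # trailing comment' | sed 's/\([^#]\)#.*$/\1/'
# prints: "example.com " (note the kept trailing space)
```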

Eh, my bad: I messed up this part s/\(#$\|[ \t]*\)//. Should be s/\(#\|[ \t]*\)$//. Unfortunately, with this fixed, it actually runs slightly slower :slight_smile:


Can I please grab the full command you are running, so that I end up running the exact same as you?