Adblock-lean: set up adblock using dnsmasq blocklist

But doesn’t awk have to see everything at some point so doesn’t that require more or less total decompression?

Yes, my goal was to show you an approach to getting rid of those memory problems - I think even small devices could then load large blocklists. Be careful when reducing memory: we have two memory domains here - the /tmp filesystem, and the memory used by processes (like awk).
In your original code you use a lot of space on /tmp - and that is what my code reduces - but not the RAM a process like awk uses for deduplication. That needs some more ideas...


Isn't the memory used by awk etc. pretty small though, so the main issue is reducing the memory footprint in /tmp? And the biggest file is the combined blocklist. And if awk has to see that uncompressed to function, then is there a way around that? With zram-swap, is it possible to force a certain process to operate entirely within the zram-swap?

Looks like low-hanging fruit would be to compress the existing blocklist file early on, rather than after the new one is generated.
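As a rough illustration of how much /tmp headroom compressing early frees up - the hostnames below are synthetic, but real domain lists compress very well with gzip:

```shell
# Synthetic example: measure how much gzip shrinks a hostname list.
f=$(mktemp)
awk 'BEGIN { for (i = 0; i < 5000; i++) print "host" i ".example.com" }' > "$f"

before=$(wc -c < "$f")
gzip -f "$f"                  # replaces $f with the compressed $f.gz
after=$(wc -c < "$f.gz")
rm -f "$f.gz"
```

The compressed copy is a small fraction of the original, so holding only the .gz on /tmp while the new list is downloaded keeps peak /tmp usage much lower.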

So, swapping is a mechanism to get more RAM for processes. I think /tmp does not use swap - but I'm not sure...
You could test with dd:

dd if=/dev/zero of=/tmp/test bs=1M count=20
dd if=/dev/random of=/tmp/test2 bs=1M count=20

So, which one eats up more RAM on /tmp? I think both use the same amount, because /tmp is not using swap...

About gzipping the blocklist: my idea was the other way around - generate a new blocklist.new.gz; if everything was OK, overwrite the old archive with "mv blocklist.new.gz blocklist.gz" and extract it to blocklist for dnsmasq.
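A minimal sketch of that "build new, then swap" flow (the directory and file names here are illustrative, not adblock-lean's actual paths):

```shell
# Sketch of the gzip-then-swap update flow.
dir=$(mktemp -d)
printf 'ads.example\ntracker.example\n' > "$dir/blocklist.new"

gzip "$dir/blocklist.new"                          # 1. compress the freshly built list
mv "$dir/blocklist.new.gz" "$dir/blocklist.gz"     # 2. checks passed: swap it in atomically
gunzip -c "$dir/blocklist.gz" > "$dir/blocklist"   # 3. plain copy for dnsmasq to read

lines=$(wc -l < "$dir/blocklist")
rm -r "$dir"
```

Because the mv only happens after validation, a failed download never clobbers the known-good blocklist.gz.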


tmpfs will use swap (as well), if available. It's like anything else held in RAM, except that it can't be evicted under OOM conditions.

Thanks for that! But /tmp has a fixed size - on boot, it detects the amount of memory and uses about 50% of it - is that correct? So in my case /tmp has about 60 MB on my Archer C7 v4, because it has about 128 MB RAM...

tmpfs uses 50% of the detected RAM by default; that can be changed, but I haven't checked (and doubt) that procd offers access to that setting.
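If a different size is wanted anyway, tmpfs supports resizing via a remount (the size value here is illustrative; this needs root and does not persist across reboots):

```shell
# Grow /tmp to 96 MB on the fly (illustrative value; root required,
# not persistent - reapply after reboot, e.g. from a local startup script).
mount -o remount,size=96M /tmp
```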

For the memory usage of the awk dedup, I found a way to show the peak memory:

awk '!seen[$0]++;END{system("egrep \"^Name|^Vm\" /proc/$PPID/status >&2")}'

In my case, with default lists, it is:

VmPeak:    26804 kB

man proc -> /proc/pid/status -> VmPeak: Peak virtual memory size.


Cool command to grab this value. It should be about 2.5x the blocklist size - is that what you are observing?

10.6M  dnsmasq

You are right - but why not 1x the blocklist size? What is the theory behind that?

I did a test with sort -u, in one window:

gunzip -kc 1.gz 2.gz | { cat - ; sleep 5 ; } | sort -u | wc -l

and this line in the other window:

while :;do for p in $(pidof sort); do grep VmPeak /proc/$p/status; done; done

-> VmPeak: 4204 kB
...so, we should use "sort -u" instead...

Edit:
sort -u: 15s
awk seen: 34s
:flushed:

awk builds a hash table of every unique line in memory and checks each incoming line against it. The hash lookup itself is extremely fast, but the whole table stays loaded for the entire run rather than being processed line by line. That's all I know.
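One difference worth noting when swapping between the two dedup commands: awk keeps first-seen input order, while sort -u emits sorted order. For a dnsmasq blocklist the order doesn't matter, so either works, but a quick check makes the distinction concrete:

```shell
# awk '!seen[$0]++' preserves input order; sort -u sorts the output.
input='b.example
a.example
b.example'

awk_out=$(printf '%s\n' "$input" | awk '!seen[$0]++')
sort_out=$(printf '%s\n' "$input" | sort -u)
```

Here awk_out keeps b.example first (first-seen order), while sort_out starts with a.example.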

Flip, if we can more than halve CPU time and reduce memory by 85%, that would be incredible. Is this too good to be true?

From my own testing sort -u seems slightly slower on my RT3200:

awk:
real    0m 4.25s
user    0m 3.01s
sys     0m 1.19s

sort -u:
real    0m 4.72s
user    0m 4.60s
sys     0m 0.07s

Confirmed, I think this is the answer. Nice one

Without any compression, monitoring via htop -d 1:

16 MB blocklist, every line doubled:

awk '!seen[$0]++' /tmp/oisd.txt > /tmp/oisd.out
8 seconds, 24 MB VIRT memory use

sort -u /tmp/oisd.txt > /tmp/oisd.out
6 seconds, 29 MB VIRT memory use

8 MB blocklist, all unique, no duplicates:

awk '!seen[$0]++' /tmp/oisd.txt > /tmp/oisd.out
6 seconds, 24 MB VIRT memory use

sort -u /tmp/oisd.txt > /tmp/oisd.out
1 second, 15 MB VIRT memory use

Doesn't this reflect a mixed bag though - e.g. sort in your first test there had 29 MB VIRT vs awk's 24 MB VIRT? And in my run sort -u took longer? I'm using busybox awk, by the way.

I'll try a real-world test of Hagezi and OISD combined next. Just need 30 mins.


Mine was real-world above with both combined. 4 to 5 seconds seems nice with my RT3200! But more testing is needed.

@mth404 thank you for coming on with all of these great ideas. This has been fun.

With dnsmasq, once it has been loaded with a blocklist in /tmp, can't that blocklist actually be deleted? So an alternative might be simply to compress the blocklist following a restart?

I'm still interested to know whether piping decompressed data from a compressed file through awk or sort actually results in less memory usage for the full run or not.
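A sketch of that streaming approach (file names illustrative): dedupe straight from the compressed files, so the only full-size plain-text copy ever written to /tmp is the final output.

```shell
# Stream-dedup two compressed list fragments without an uncompressed
# intermediate file on disk.
dir=$(mktemp -d)
printf 'a.example\nb.example\n' | gzip > "$dir/1.gz"
printf 'b.example\nc.example\n' | gzip > "$dir/2.gz"

# gunzip -c concatenates the decompressed streams to stdout.
gunzip -c "$dir/1.gz" "$dir/2.gz" | sort -u > "$dir/blocklist"

lines=$(wc -l < "$dir/blocklist")
rm -r "$dir"
```

This avoids the /tmp footprint of the concatenated plain-text input, though sort itself may still spill to temporary files for very large inputs.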


OK, both OISD and Hagezi on an R7800 (mine is limited to 1400 MHz instead of the default 1700 MHz; easily covers my 100/40 connection):

awk '!seen[$0]++' /tmp/oisd.txt > /tmp/oisd.out
9 seconds, 44 MB VIRT memory use

sort -u /tmp/oisd.txt > /tmp/oisd.out
7 seconds, 28 MB VIRT memory use

I suspect a highly unorganised list might impact sort -u more, but it does seem to be the better option in most circumstances.

This was discussed waaaaay back in the early days, but it was generally agreed to leave it in /tmp/dnsmasq.d/ in case of an unexpected dnsmasq restart.

There is also this feature of dnsmasq to consider.

--conf-script=<file>[ <arg>]
Execute <file>, and treat what it emits to stdout as the contents of a configuration file. If the script exits with a non-zero exit code, dnsmasq treats this as a fatal error. The script can be passed arguments, space separated from the filename and each other so, for instance --conf-script=/etc/dnsmasq-uncompress-ads /share/ads-domains.gz

with /etc/dnsmasq-uncompress-ads containing

#!/bin/sh
# Decompress the list and wrap each domain in dnsmasq address=/.../ syntax.
set -e
zcat "${1}" | sed -e "s:^:address=/:" -e "s:$:/:"
exit 0

and /share/ads-domains.gz containing a compressed list of ad server domains will save disk space with large ad-server blocklists.
