But doesn’t awk have to see everything at some point, so doesn’t that require more or less total decompression?
Yes, that was my goal: to show you some ideas for getting rid of those memory problems - I think even small devices could load large blocklists. Be careful when reducing memory, though - we have two memory domains here: the /tmp filesystem, and the memory used by processes (like awk).
In your original code, you are using a lot of space on /tmp - that is what my code reduces - but not the RAM a process like awk uses during deduplication - that needs some more ideas...
Isn’t the memory used by awk etc. pretty small though, so the main issue is reducing the memory footprint in /tmp? And the biggest file is the combined blocklist. And if awk has to see that uncompressed to function, is there a way around that? With zram-swap, is it possible to force a certain process to operate entirely within the zram-swap?
Looks like low-hanging fruit would be to compress the existing blocklist file early on, rather than after the new one is generated.
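A rough sketch of what I mean - the file names here are my guesses, not the script's real paths:
gzip -f /tmp/blocklist                          # compress the current list up front to free /tmp space
# ... download and build the new list here ...
gunzip -c /tmp/blocklist.gz > /tmp/blocklist    # fall back to the old list if the new one fails checks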
So, swapping is a mechanism to get more RAM for processes. I think /tmp is not using swap - but I'm not sure...
You could test with dd:
dd if=/dev/zero of=/tmp/test bs=1M count=20      # highly compressible data (all zeros)
dd if=/dev/random of=/tmp/test2 bs=1M count=20   # incompressible data
So, which one eats up more RAM via /tmp? I think both use the same, because /tmp is not using swap...
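To see whether those pages actually end up compressed in zram, you could watch the zram stats in parallel - assuming zram0 is the swap device here:
cat /sys/block/zram0/mm_stat    # first two fields: orig_data_size and compr_data_size
If tmpfs pages do get swapped out, the zero-filled file should compress far better than the random one, and those two numbers will diverge.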
About gzipping the blocklist: my idea was the other way around - generate a new blocklist.new.gz - if all was OK, overwrite with "mv blocklist.new.gz blocklist.gz" and extract to blocklist for dnsmasq.
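Roughly this flow - a sketch only, where generate_list stands in for whatever the script really does:
generate_list | gzip > /tmp/blocklist.new.gz            # build the new list straight into a compressed file
# ... sanity checks on the new list ...
mv /tmp/blocklist.new.gz /tmp/blocklist.gz              # replace the old one only after checks pass
gunzip -c /tmp/blocklist.gz > /tmp/dnsmasq.d/blocklist  # extract for dnsmasq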
tmpfs will use swap (as well), if available. It's like anything else held in RAM, except that it can't be evicted under OOM conditions.
Thanks for that! But /tmp has a fixed size - on boot, it detects the amount of memory and uses about 50% of it - is that correct? So, in my case, /tmp has about 60 MB on my Archer C7 v4, because it has about 128 MB RAM...
tmpfs uses 50% of the detected RAM by default; that can be changed, but I haven't checked (and doubt) that procd offers access to that setting.
For the memory usage of the awk dedup, I found a way to show the peak memory:
awk '!seen[$0]++;END{system("egrep \"^Name|^Vm\" /proc/$PPID/status >&2")}'
(system() spawns a shell, and $PPID inside that shell is the awk process itself - so this prints awk's own peak figures.)
In my case, with default lists, it is:
VmPeak: 26804 kB
man proc -> /proc/pid/status -> VmPeak: Peak virtual memory size.
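For context, wired into the actual dedup step it would look something like this (list names are just placeholders):
gunzip -c 1.gz 2.gz | awk '!seen[$0]++;END{system("egrep \"^Name|^Vm\" /proc/$PPID/status >&2")}' > /tmp/blocklist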
Cool command to grab this value. It should be about 2.5x the blocklist size - is that what you are observing?
10.6M dnsmasq
You are right - but why is it not 1x the blocklist size? What is the theory behind that?
I did a test with sort -u. In one window:
gunzip -kc 1.gz 2.gz | { cat - ; sleep 5 ; } | sort -u | wc -l    # the sleep delays EOF, so sort stays alive long enough to be sampled
and this line in the other window:
while :; do for p in $(pidof sort); do grep VmPeak /proc/$p/status; done; done    # poll sort's peak memory while it runs
-> VmPeak: 4204 kB
...so, we should use "sort -u" instead...
Edit:
sort -u: 15s
awk seen: 34s
awk has to keep every unique line as a key in an in-memory hash table, so the whole set of lines ends up in RAM at once. Each incoming line is checked against that table - the hash lookups are extremely fast, but the table only grows over the run; nothing is released line by line. That's all I know.
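The idiom spelled out in long form - functionally the same as awk '!seen[$0]++', just explicit about what gets stored:
awk '{ if (!($0 in seen)) { print; seen[$0] = 1 } }' /tmp/oisd.txt
Every unique line becomes a key in the seen array, which is why peak memory tracks the number (and length) of unique lines rather than the raw input size.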
Flip, if we can more than halve CPU time and reduce memory by 85%, that would be incredible. Is this too good to be true?
From my own testing sort -u seems slightly slower on my RT3200:
awk:
real 0m 4.25s
user 0m 3.01s
sys 0m 1.19s
sort -u:
real 0m 4.72s
user 0m 4.60s
sys 0m 0.07s
Confirmed, I think this is the answer. Nice one
Without any compression, monitoring via: htop -d 1
16 MB blocklist, every line doubled:
awk '!seen[$0]++' /tmp/oisd.txt > /tmp/oisd.out
8 seconds, 24 MB VIRT memory use
sort -u /tmp/oisd.txt > /tmp/oisd.out
6 seconds, 29 MB VIRT memory use
8 MB blocklist, all unique lines, no doubles:
awk '!seen[$0]++' /tmp/oisd.txt > /tmp/oisd.out
6 seconds, 24 MB VIRT memory use
sort -u /tmp/oisd.txt > /tmp/oisd.out
1 second, 15 MB VIRT memory use
Doesn't this reflect a mixed bag though? E.g. sort in your second run there had 29 MB VIRT vs awk's 24 MB VIRT. And in my run sort -u took longer? I'm using busybox awk, by the way.
I'll try a real-world test of hagezi and oisd combined next. Just need 30 mins.
Mine was real-world above with both combined. 4 to 5 seconds seems nice with my RT3200! But more testing is needed.
@mth404 thank you for coming on with all of these great ideas. This has been fun.
With dnsmasq, once it has loaded a blocklist from /tmp, can't that blocklist actually be deleted? So an alternative might be simply to compress the blocklist after the restart?
I'm still interested to know whether piping decompressed data from the compressed file through awk or sort actually results in less memory usage over the full run or not.
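Something like the following, side by side, with the VmPeak polling loop from earlier running in another window (paths assumed):
gunzip -c /tmp/oisd.txt.gz | sort -u > /tmp/oisd.out    # streamed from the compressed file
sort -u /tmp/oisd.txt > /tmp/oisd.out                   # read from the flat file
The sort process itself should peak about the same either way; the open question is the combined /tmp plus process footprint.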
OK, both OISD and Hagezi on an R7800 (mine is limited to 1400 MHz instead of the default 1700 MHz; it easily covers my 100/40 connection):
awk '!seen[$0]++' /tmp/oisd.txt > /tmp/oisd.out
9 seconds, 44 MB VIRT memory use
sort -u /tmp/oisd.txt > /tmp/oisd.out
7 seconds, 28 MB VIRT memory use
I suspect a highly unorganised list (or lists) might impact sort -u more, but it does seem to be the better option in most circumstances.
This was discussed waaaaay back in the early days, but it was generally agreed to leave it in /tmp/dnsmasq.d/ in case of an unexpected dnsmasq restart.
There is also this feature of dnsmasq to consider.
--conf-script=<file>[ <arg>]
Execute <file>, and treat what it emits to stdout as the contents of a configuration file. If the script exits with a non-zero exit code, dnsmasq treats this as a fatal error. The script can be passed arguments, space separated from the filename and each other, so, for instance, --conf-script="/etc/dnsmasq-uncompress-ads /share/ads-domains.gz" with /etc/dnsmasq-uncompress-ads containing
set -e
zcat ${1} | sed -e "s:^:address=/:" -e "s:$:/:"
exit 0
and /share/ads-domains.gz containing a compressed list of ad server domains will save disk space with large ad-server blocklists.
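If that route were taken, it should just need a line like this in the dnsmasq config (using the man page's example paths - and note that conf-script only exists in dnsmasq 2.86 and later, if I remember right):
conf-script=/etc/dnsmasq-uncompress-ads /share/ads-domains.gz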