But doesn’t awk have to see everything at some point, so doesn’t that require more or less total decompression?
Yes, that was my goal: to show you some ideas for getting rid of those memory problems - I think even small devices could load large blocklists. Be careful when reducing memory, though - we have two memory domains here: the /tmp filesystem, and the memory used by processes (like awk).
In your original code, you are using a lot of space on /tmp - that is what my code reduces - but not the RAM a process like awk uses during deduplication - that needs some more ideas...
Isn’t the memory used by awk etc. pretty small though, so the main issue is reducing the memory footprint in /tmp? And the biggest file is the combined blocklist. And if awk has to see that uncompressed to function, is there a way around that? With zram-swap, is it possible to force a certain process to operate entirely within the zram-swap?
Looks like low-hanging fruit would be to compress the existing blocklist file early on, rather than after the new one is generated.
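A rough sketch of what I mean - the file names here are my guesses, not the script's real paths:
gzip -f /tmp/blocklist                          # compress the current list up front to free /tmp space
# ... download and build the new list here ...
gunzip -c /tmp/blocklist.gz > /tmp/blocklist    # fall back to the old list if the new one fails checks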
So, swapping is a mechanism to get more RAM for processes. I think /tmp is not using swap - but I'm not sure...
You could test with dd:
dd if=/dev/zero of=/tmp/test bs=1M count=20      # highly compressible data (all zeros)
dd if=/dev/random of=/tmp/test2 bs=1M count=20   # incompressible data
So, which one eats up more RAM via /tmp? I think both use the same, because /tmp is not using swap...
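To see whether those pages actually end up compressed in zram, you could watch the zram stats in parallel - assuming zram0 is the swap device here:
cat /sys/block/zram0/mm_stat    # first two fields: orig_data_size and compr_data_size
If tmpfs pages do get swapped out, the zero-filled file should compress far better than the random one, and those two numbers will diverge.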
About gzipping the blocklist: my idea was the other way around - generate a new blocklist.new.gz - if all was OK, overwrite with "mv blocklist.new.gz blocklist.gz" and extract to blocklist for dnsmasq.
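Roughly this flow - a sketch only, where generate_list stands in for whatever the script really does:
generate_list | gzip > /tmp/blocklist.new.gz            # build the new list straight into a compressed file
# ... sanity checks on the new list ...
mv /tmp/blocklist.new.gz /tmp/blocklist.gz              # replace the old one only after checks pass
gunzip -c /tmp/blocklist.gz > /tmp/dnsmasq.d/blocklist  # extract for dnsmasq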
tmpfs will use swap (as well), if available. It's like anything else held in RAM, except that it can't be evicted under OOM conditions.
Thanks for that! But /tmp has a fixed size - on boot, it detects the amount of memory and uses about 50% of it - is that correct? So, in my case, /tmp has about 60 MB on my Archer C7 v4, because it has about 128 MB RAM...
tmpfs uses 50% of the detected RAM by default; that can be changed, but I haven't checked (and doubt) that procd offers access to that setting.
For the memory usage of the awk dedup, I found a way to show the peak memory:
awk '!seen[$0]++;END{system("egrep \"^Name|^Vm\" /proc/$PPID/status >&2")}'
(system() spawns a shell, and $PPID inside that shell is the awk process itself - so this prints awk's own peak figures.)
In my case, with default lists, it is:
VmPeak: 26804 kB
man proc -> /proc/pid/status -> VmPeak: Peak virtual memory size.
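For context, wired into the actual dedup step it would look something like this (list names are just placeholders):
gunzip -c 1.gz 2.gz | awk '!seen[$0]++;END{system("egrep \"^Name|^Vm\" /proc/$PPID/status >&2")}' > /tmp/blocklist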
Cool command to grab this value. It should be about 2.5x the blocklist size - is that what you are observing?
10.6M dnsmasq
You are right - but why is it not 1x the blocklist size? What is the theory behind that?
I did a test with sort -u. In one window:
gunzip -kc 1.gz 2.gz | { cat - ; sleep 5 ; } | sort -u | wc -l    # the sleep delays EOF, so sort stays alive long enough to be sampled
and this line in the other window:
while :; do for p in $(pidof sort); do grep VmPeak /proc/$p/status; done; done    # poll sort's peak memory while it runs
-> VmPeak: 4204 kB
...so, we should use "sort -u" instead...
Edit:
sort -u: 15s
awk seen: 34s
awk has to keep every unique line as a key in an in-memory hash table, so the whole set of lines ends up in RAM at once. Each incoming line is checked against that table - the hash lookups are extremely fast, but the table only grows over the run; nothing is released line by line. That's all I know.
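The idiom spelled out in long form - functionally the same as awk '!seen[$0]++', just explicit about what gets stored:
awk '{ if (!($0 in seen)) { print; seen[$0] = 1 } }' /tmp/oisd.txt
Every unique line becomes a key in the seen array, which is why peak memory tracks the number (and length) of unique lines rather than the raw input size.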
Flip, if we can more than halve CPU time and reduce memory by 85%, that would be incredible. Is this too good to be true?
From my own testing sort -u seems slightly slower on my RT3200:
awk:
real 0m 4.25s
user 0m 3.01s
sys 0m 1.19s
sort -u:
real 0m 4.72s
user 0m 4.60s
sys 0m 0.07s
Confirmed, I think this is the answer. Nice one
Without any compression, monitoring via: htop -d 1
16 MB blocklist, every line doubled:
awk '!seen[$0]++' /tmp/oisd.txt > /tmp/oisd.out
8 seconds, 24 MB VIRT memory use
sort -u /tmp/oisd.txt > /tmp/oisd.out
6 seconds, 29 MB VIRT memory use
8 MB blocklist, all unique lines, no doubles:
awk '!seen[$0]++' /tmp/oisd.txt > /tmp/oisd.out
6 seconds, 24 MB VIRT memory use
sort -u /tmp/oisd.txt > /tmp/oisd.out
1 second, 15 MB VIRT memory use
Doesn't this reflect a mixed bag though? E.g. sort in your second run there had 29 MB VIRT vs awk's 24 MB VIRT. And in my run sort -u took longer? I'm using busybox awk, by the way.
I'll try a real-world test of hagezi and oisd combined next. Just need 30 mins.
Mine was real-world above with both combined. 4 to 5 seconds seems nice with my RT3200! But more testing is needed.
@mth404 thank you for coming on with all of these great ideas. This has been fun.
With dnsmasq, once it has loaded a blocklist from /tmp, can't that blocklist actually be deleted? So an alternative might be simply to compress the blocklist after the restart?
I'm still interested to know whether piping decompressed data from the compressed file through awk or sort actually results in less memory usage over the full run or not.
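Something like the following, side by side, with the VmPeak polling loop from earlier running in another window (paths assumed):
gunzip -c /tmp/oisd.txt.gz | sort -u > /tmp/oisd.out    # streamed from the compressed file
sort -u /tmp/oisd.txt > /tmp/oisd.out                   # read from the flat file
The sort process itself should peak about the same either way; the open question is the combined /tmp plus process footprint.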
OK, both OISD and Hagezi on an R7800 (mine is limited to 1400 MHz instead of the default 1700 MHz; it easily covers my 100/40 connection):
awk '!seen[$0]++' /tmp/oisd.txt > /tmp/oisd.out
9 seconds, 44 MB VIRT memory use
sort -u /tmp/oisd.txt > /tmp/oisd.out
7 seconds, 28 MB VIRT memory use
I suspect a highly unorganised list (or lists) might impact sort -u more, but it does seem to be the better option in most circumstances.
This was discussed waaaaay back in the early days, but it was generally agreed to leave it in /tmp/dnsmasq.d/ in case of an unexpected dnsmasq restart.
There is also this feature of dnsmasq to consider.
--conf-script=<file>[ <arg>]
Execute <file>, and treat what it emits to stdout as the contents of a configuration file. If the script exits with a non-zero exit code, dnsmasq treats this as a fatal error. The script can be passed arguments, space separated from the filename and each other, so, for instance, --conf-script="/etc/dnsmasq-uncompress-ads /share/ads-domains.gz" with /etc/dnsmasq-uncompress-ads containing
set -e
zcat ${1} | sed -e "s:^:address=/:" -e "s:$:/:"
exit 0
and /share/ads-domains.gz containing a compressed list of ad server domains will save disk space with large ad-server blocklists.
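If that route were taken, it should just need a line like this in the dnsmasq config (using the man page's example paths - and note that conf-script only exists in dnsmasq 2.86 and later, if I remember right):
conf-script=/etc/dnsmasq-uncompress-ads /share/ads-domains.gz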