Optimization of adblock-lean

I need to retest everything with the allowlist enabled again because of a testing mistake earlier on. I'll post the results later.

This one-liner is handy for memory testing. I have coreutils-sleep installed so that sleep accepts sub-second intervals (e.g. 0.1 s) instead of the busybox minimum of 1 s.

minmem=$(grep MemAvailable /proc/meminfo | awk '{print $2}')
while true; do
    clear
    memavail=$(grep MemAvailable /proc/meminfo | awk '{print $2}')
    if [ "$memavail" -lt "$minmem" ]; then minmem=$memavail; fi
    echo "Current: $memavail"; echo "Minimum: $minmem"
    sleep 0.1
done

Compression enabled, no allowlist:
Total processing time inc dnsmasq restart: 1m7s
Dnsmasq restart time: 27s
Min available memory during 2nd, 3rd etc run: 253,000 kB
Available memory after dnsmasq restart: 299,000 kB

Compression disabled, no allowlist:
Total processing time inc dnsmasq restart: 58s
Dnsmasq restart time: 27s
Min available memory during 2nd, 3rd etc run: 217,000 kB
Available memory after dnsmasq restart: 280,000 kB

As you can see, there are huge memory savings with compression enabled (roughly 28 to 36 MB more memory available at the low point), with only a small performance impact during processing (about 9 s).

Compression enabled, using allowlist:
Total processing time inc dnsmasq restart: 1m17s
Dnsmasq restart time: 27s
Min available memory during 2nd, 3rd etc run: 244,000 kB
Available memory after dnsmasq restart: 294,000 kB

Compression disabled, using allowlist:
Total processing time inc dnsmasq restart: 1m8s
Dnsmasq restart time: 27s
Min available memory during 2nd, 3rd etc run: 216,000 kB
Available memory after dnsmasq restart: 280,000 kB

Huge and small are relative terms. 40 MB is a lot when you have 256 MB, but negligible if you have 1 GB. And that 9 s saving is about 13% of 1m7s, so in some cases that's not bad either IMO. Everything considered, compressing by default is a good choice.
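
For anyone who wants to redo this arithmetic with their own measurements, here is a minimal sketch. The figures are copied from the "no allowlist" results posted above; the variable names and the pct helper are just illustrative, not part of adblock-lean.

```shell
#!/bin/sh
# Figures copied from the "no allowlist" results in this thread.
mem_comp_kb=253000      # min available memory, compression enabled
mem_nocomp_kb=217000    # min available memory, compression disabled
time_comp_s=67          # 1m7s, compression enabled
time_nocomp_s=58        # 58s, compression disabled

# pct PART WHOLE: print PART as a percentage of WHOLE with one decimal place.
# (Hypothetical helper name.)
pct() {
    awk -v p="$1" -v w="$2" 'BEGIN { printf "%.1f\n", 100 * p / w }'
}

mem_saved_kb=$((mem_comp_kb - mem_nocomp_kb))
time_cost_s=$((time_comp_s - time_nocomp_s))

echo "Memory saved by compression: ${mem_saved_kb} kB"
echo "Extra time with compression: ${time_cost_s}s ($(pct "$time_cost_s" "$time_comp_s")% of the compressed run)"
```

With these inputs the memory difference comes out at 36000 kB and the time difference at 9 s, i.e. 13.4% of the compressed run.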

This is true, good point. But yes definitely enable by default.

Also, everything appears to be working as intended in this PR.

@Wizballs also, which lists were you testing with? Maybe I can infer something about memory used per downloaded byte from that. I've made a script which tries to profile memory use with /usr/bin/time, but the results I'm getting seem way off: something like 8-9 bytes used per downloaded byte.

Curious to try profiling with this.

I use Hagezi pro (4.4MiB) and tif (19.8MiB)

I think the PR is ready to merge?

Also another idea I had is to have a standard selection of list names which adblock-lean will understand, with a new option like "standard_lists", where you could put values like "hagezi_tif hagezi_pro hagezi_medium_tif" etc. These would be automatically translated to URLs. In addition, keep the current options "blocklist_urls" and "allowlist_urls" but change their names to "custom_blocklist_urls" and "custom_allowlist_urls". Would probably make it easier for the user to set it up.
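
To make the "standard_lists" idea a bit more concrete, here is a rough sketch of how the name-to-URL translation might look. Everything in it is an assumption for illustration: the function names, the lists.example.com base URL and the example.org custom URL are placeholders, and only the option names come from the proposal above.

```shell
#!/bin/sh
# Placeholder base URL; a real implementation would point at the actual list host.
LIST_BASE_URL="https://lists.example.com"

# list_name_to_url NAME: print the URL for a known short name, or fail.
# (Hypothetical function; the name-to-path mapping is illustrative only.)
list_name_to_url() {
    case "$1" in
        hagezi_pro) echo "${LIST_BASE_URL}/hagezi/pro.txt" ;;
        hagezi_tif) echo "${LIST_BASE_URL}/hagezi/tif.txt" ;;
        *) echo "Unknown list name: $1" >&2; return 1 ;;
    esac
}

# expand_standard_lists "name1 name2 ...": print one URL per line.
expand_standard_lists() {
    for name in $1; do
        list_name_to_url "$name" || return 1
    done
}

# Config values as a user might set them.
standard_lists="hagezi_pro hagezi_tif"
custom_blocklist_urls="https://example.org/my-extra-list.txt"

# Final URL list: translated standard names first, then the custom URLs.
blocklist_urls="$(expand_standard_lists "$standard_lists" | tr '\n' ' ')${custom_blocklist_urls}"
echo "$blocklist_urls"
```

A case statement keeps the mapping in one place and lets an unknown name fail loudly instead of silently producing an empty URL.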

Sure thing. Merged!

I was about to report that processing time jumped by ~20s, but I realise this is just including restarting dnsmasq:

Restarting dnsmasq.
Waiting for dnsmasq initialization.
Restart of dnsmasq completed.

Processing time for blocklist generation and import: 1m:21s.

Disabling compression reduces this just a tad:

Restarting dnsmasq.
Waiting for dnsmasq initialization.
Restart of dnsmasq completed.

Processing time for blocklist generation and import: 1m:14s.

But not enough for me to disable compression - I'll just keep it enabled.

BTW should /tmp/adblock-lean get deleted on exit?

Any other processing optimizations we can think of to reduce processing time even further?

Yes, my thinking was that dnsmasq restart is what is doing the actual import of the blocklist. Before that, we just have a file. So if we want to report the time it takes to generate and import the blocklist, we need to include that time.

The only thing which gets stored in that directory is the lock file, and it gets deleted on exit. Leaving the directory in place doesn't do any harm IMO. If you think it should be deleted, we could just do rm -rf "${abl_pid_dir}" in rm_lock() to make it delete the directory along with the file.
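
Roughly what that change could look like, as a sketch only. The names abl_pid_dir and rm_lock come from the post above; the pid file name, the mk_lock helper and the safety check are assumptions added for illustration and may not match the actual script.

```shell
#!/bin/sh
# Directory named in the thread; the pid file name below is an assumption.
abl_pid_dir="${abl_pid_dir:-/tmp/adblock-lean}"
abl_pid_file="${abl_pid_dir}/adblock-lean.pid"

# mk_lock: create the directory and write our PID (hypothetical helper).
mk_lock() {
    mkdir -p "${abl_pid_dir}" && echo "$$" > "${abl_pid_file}"
}

# rm_lock: remove the lock file together with its directory.
rm_lock() {
    # Refuse to run rm -rf on an empty or root path.
    case "${abl_pid_dir}" in
        ''|/) return 1 ;;
    esac
    rm -rf "${abl_pid_dir}"
}
```

The guard before rm -rf is cheap insurance in case the variable ever ends up unset.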

I'm not sure but maybe there is a way to pipe the blocklist directly into dnsmasq (without resorting to dnsmasq restart). We would still need to gzip or move the file into /tmp/dnsmasq.d so further dnsmasq restarts pick it up. But maybe (?) this will speed it up somewhat. Probably negligibly though.

Another way would be to make a direct pipe from parts processing into complete blocklist processing (rather than saving parts to intermediate files). The downside is that if there is an issue with one of the parts, we can only find out about it after we complete the whole procedure, and then everything (starting with parts download) would need to be retried. Not ideal IMO.

Otherwise the only way I can think of would be finding ways to reduce the processing performed on the parts and/or on the final blocklist. Going through the code again, I can't see anything which makes sense to remove. Perhaps the case conversion with tr could be made conditional: if the list is known to be well-behaved in this regard, then don't do case conversion. This won't save much time anyway. Then I suppose some parts of the sed sanitization could be made conditional, if we know that a given list doesn't need them.
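
A minimal sketch of what conditional case conversion could look like. The process_part function and its flag argument are hypothetical, and the single sed command here only stands in for the real sanitization steps.

```shell
#!/bin/sh
# process_part LOWERCASE_FLAG: read domains on stdin, write cleaned domains to stdout.
# LOWERCASE_FLAG=1 runs the tr case conversion; anything else skips it
# (for lists known to be lowercase already).
process_part() {
    if [ "$1" = 1 ]; then
        tr 'A-Z' 'a-z'
    else
        cat
    fi | sed '/^[[:space:]]*$/d'   # stand-in for sanitization: drop blank lines
}
```

Usage would be something like `process_part 0 < part_file` for a list flagged as well-behaved and `process_part 1 < part_file` otherwise. Note that the skipped branch still costs a cat process, so the saving is likely small, as said above.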

Interesting.

Recall that to make the blocklist compression work we have a script that dnsmasq calls to extract the blocklist to stdout, which dnsmasq reads in.

So we could modify that script to involve passing the content directly to dnsmasq.

Just a random thought here, but we could generate the file parts and then have dnsmasq extract them with the final processing.

But I'm not sure this is a great idea.

One wants to minimize dnsmasq downtime and restart time, because lookups are blocked whilst dnsmasq is restarting. And it presumably wouldn't be great if every dnsmasq restart involved a lot of processing (e.g. an interface or networking restart that involves restarting the dnsmasq service). One nice feature of our existing setup is that the blocklist file, whether compressed or not, is always there, ready for dnsmasq to quickly read in on restart.

I'm confused. How can you pass the content more directly than piping it into dnsmasq?

I didn't mean that. I was just thinking of a mechanism to feed in the data to dnsmasq since that's what the decompression script held within the dnsmasq.d folder does. You might have had another mechanism in mind?

I was thinking something like cat blocklist | dnsmasq but I don't know if this works.

There is always the dnsmasq interface:

--conf-script=<file>[ <arg]
Execute <file>, and treat what it emits to stdout as the contents of a configuration file. If the script exits with a non-zero exit code, dnsmasq treats this as a fatal error. The script can be passed arguments, space separated from the filename and each other so, for instance --conf-dir=/etc/dnsmasq-uncompress-ads /share/ads-domains.gz

with /etc/dnsmasq-uncompress-ads containing

set -e

zcat ${1} | sed -e "s:^:address=/:" -e "s:$:/:"

exit 0

and /share/ads-domains.gz containing a compressed list of ad server domains will save disk space with large ad-server blocklists.

https://thekelleys.org.uk/dnsmasq/docs/dnsmasq-man.html

We leverage this here:

This script gets called on dnsmasq restart and at the moment extracts the compressed blocklist and feeds it to dnsmasq on stdout.

This mechanism could be manipulated to actually process and feed in blocklist content. Perhaps conditionally based on the arguments mentioned in the --conf-script description above. So regular dnsmasq restart does something different to dnsmasq restart with special adblock-lean flag(s).

Alternatively, the dnsmasq script could be clever and alter its processing such that if there are intermediate files, then process them into blocklist and feed to dnsmasq in parallel, and otherwise just feed the blocklist file into dnsmasq (on additional restarts independent of adblock-lean).

In any case, we could have the service script generate the intermediate files, then restart dnsmasq so that our special dnsmasq script processes the intermediate files whilst feeding the result directly into the dnsmasq binary.
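
A rough sketch of how such a "clever" script could branch, under stated assumptions: the emit_blocklist function, the *.part naming and the paths in the usage note are all hypothetical, and sort -u merely stands in for whatever final processing would be needed.

```shell
#!/bin/sh
# emit_blocklist PARTS_DIR BLOCKLIST_GZ
# Writes dnsmasq config lines to stdout, which is what a --conf-script is
# expected to do. Both arguments and the *.part naming are assumptions.
emit_blocklist() {
    parts_dir="$1"
    blocklist_gz="$2"
    if ls "${parts_dir}"/*.part >/dev/null 2>&1; then
        # Intermediate parts exist: do the final processing while streaming.
        cat "${parts_dir}"/*.part | sort -u
    elif [ -f "${blocklist_gz}" ]; then
        # Ordinary restart: just extract the ready-made blocklist.
        gunzip -c "${blocklist_gz}"
    else
        echo "No blocklist found" >&2
        return 1
    fi
}
```

The script body would then end with something like `emit_blocklist /tmp/adblock-lean/parts /tmp/blocklist.gz` (hypothetical paths). Since dnsmasq treats a non-zero exit code from the script as fatal, the failure branch would need careful thought, and the service script would still have to write the final blocklist and remove the parts afterwards so that later restarts stay fast.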

That is, our processing is spread across:

  • the service script in /etc/init.d/adblock-lean; and
  • our custom dnsmasq initialization script inside /tmp/dnsmasq.d/ or wherever.

See what I mean? Albeit as you can see this gets a little complicated.

So reading the dnsmasq manual again, I got some ideas.

Idea 1: rather than defining each blocklist entry individually, define all of them at once:
local=/ads1.com/ads2.com/ads3.com/.../
The manual says that multiple domains are supported for an entry, and it doesn't say that there is a limit on how many domains can be specified. Not sure if it will be OK with hundreds of thousands of domains, but we can test this. Also not sure if dnsmasq would then process DNS requests to these domains in the same way as it does currently, or whether the performance of DNS resolving would be affected. So if this works, performance would need to be tested. The rationale is simply producing smaller files, which also means a bit less processing to do on those files. On the downside, some of our sed commands will need to be changed or replaced, since sed can only loop through newline-delimited entries. We could potentially use tr to replace slashes with newlines. So there are some technical difficulties as well. They're solvable, but it's unclear how much performance gain will be achieved in the end. Anyway, the first thing is to see if dnsmasq will accept the one long entry at all.
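
For testing, the conversion between the two formats could be sketched like this. The function names are made up for illustration, and whether dnsmasq accepts a single entry of this size is exactly the open question, so this only covers the file format side.

```shell
#!/bin/sh
# domains_to_single_entry: read one domain per line on stdin and print a
# single "local=/d1/d2/.../" line. Assumes the input ends with a newline.
domains_to_single_entry() {
    printf 'local=/'
    tr '\n' '/'
    printf '\n'
}

# single_entry_to_domains: reverse the conversion, so that line-oriented
# tools like sed can work on one domain per line again.
single_entry_to_domains() {
    sed 's|^local=/||; s|/$||' | tr '/' '\n'
}
```

For example, `printf 'ads1.com\nads2.com\n' | domains_to_single_entry` prints `local=/ads1.com/ads2.com/`. One caveat: the reverse direction passes the whole entry through sed as a single very long line, which may itself be slow or memory-hungry on a router, so that would need checking too.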

Definitely worth testing. I found out that with bash, declaring all variables in one line or modifying them in one line is much faster than doing so over separate lines, and similarly, bunching up arithmetic operations into one big operation is also faster.

Perhaps the same might be true for dnsmasq when it comes to reading in and processing.

Worth testing surely.
