Optimization of adblock-lean

antonk · July 11, 2024, 12:00pm

That sure sounds like a great idea, but doing this in a shell script is way above my paygrade

Wizballs · July 11, 2024, 12:36pm

I think part of the allure of parallel processing is solving the problem itself

Something important to consider is the order of blocklist processing. Having the blocklist files download in order from biggest to smallest gives the fastest sort -u speed (the parts are essentially processed in that same order). I'm only guessing, but it probably leaves less sorting to do.

Times for "Performing dnsmasq --test on the blocklist file parts and sorting and merging the blocklist lines into a single blocklist file":

17 seconds with my largest blocklist part (Hagezi TIF, 750k lines) listed first
20 seconds with Hagezi TIF listed last
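
To make the ordering idea concrete, here is a minimal sketch of listing the downloaded parts largest first before they are fed to sort -u. The helper name list_largest_first is hypothetical and not part of adblock-lean, and it assumes file names contain no tabs or newlines.

```shell
# Hypothetical helper (not from adblock-lean): print the given files, largest first.
# Assumes file names contain neither tabs nor newlines.
list_largest_first()
{
	for f in "$@"
	do
		[ -f "$f" ] || continue
		# Emit "<size in bytes><TAB><path>" so that sort can order by size
		printf '%s\t%s\n' "$(wc -c < "$f")" "$f"
	done | sort -rn | cut -f2-
}
```

Something like list_largest_first /tmp/blocklist.*.gz could then drive the gunzip loop. With gzipped parts the compressed size is only a proxy for the line count, so the benefit would need to be measured rather than assumed.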

Lynx · July 11, 2024, 1:03pm

That reminds me of the script on the main thread that first downloaded the files and sorted their lines into different files depending on the starting letters, and then passed those on for further processing. That way the files are of even size rather than, say, two big and ten small files.

Here is the bit that I am thinking about:

# tolower Input and split lines by first character to different output files
# stores output files to $1_.._$2, e.g. "/tmp/adblock-lean-wd/ablbl_.._httpsabc.new.gz"
abl_split_compress()
{
  # abort early: an empty modulo would otherwise cause a division by zero in awk
  [ -z "${SPLIT_MODULO}" ] && { echo "FATAL: SPLIT_MODULO is empty!!!" >&2; return 1; }
  nice -n 19 awk -v NPMODULO="${SPLIT_MODULO}" -v NPPREP="$1" -v NPPOST="$2" 'BEGIN{
    for(i=0;i<=255;i++){_ORD[ sprintf("%c",i) ]=i} # init _ORD
    for(x=0;x<NPMODULO;x++){
      o="" NPPREP sprintf("%02x",x) NPPOST # convert to hex, 10 -> 0a
      system("mkfifo \"" o "\"") # make named pipe
      system("nice -n 19 gzip < \"" o "\" > \"" o ".new.gz\" &") # open named pipe for read
      printf("") > o # open named pipe for write
    }
    system("sleep 1") # give OS some time to start all sub processes and connect pipes
  };{
    S=tolower($0)
    o="" NPPREP sprintf("%02x", ( _ORD[ substr(S, 1, 1) ] % NPMODULO ) ) NPPOST
    print S > o # write to named pipe - depending on first char
  };END{
    for(i=0;i<NPMODULO;i++){
      o="" NPPREP sprintf("%02x",i) NPPOST
      close(o) # close named pipe
    }
    system("sleep 1") # give OS some time to finish all sub processes
    system("sync")
    for(i=0;i<NPMODULO;i++){
      o="" NPPREP sprintf("%02x",i) NPPOST
      system("rm \"" o "\"") # remove named pipe file
    }
  }'
}

This is quite a piece of work!

Perhaps we can adopt something along these lines.

antonk · July 11, 2024, 2:01pm

Could you elaborate on what you mean and what you want to achieve?

That awk script is quite something...

Also, I still don't know what dnsmasq is actually doing and which files it loads. Is there a good explanation of this somewhere?

Lynx · July 11, 2024, 2:18pm

So dnsmasq is responsible for converting domain names to IP addresses by sending queries to a resolver and caching along the way. So a first lookup will get sent out to e.g. 1.1.1.1 and then a second lookup of the same thing will just get retrieved from the cache. Some time ago, dnsmasq introduced a new capability to read in a blocklist file so that it would instead return e.g. NXDOMAIN when such queries are sent - see here:

https://thekelleys.org.uk/dnsmasq/CHANGELOG

It implemented this in an extremely efficient manner that basically instantly rendered obsolete many existing adblocking solutions that were generally extremely clunky and memory intensive. Now it was possible to set up enormous blocklists on even very weedy devices. Later on, dnsmasq expanded this capability further to support compressed blocklists, so that the blocklist can be stored in compressed form (albeit decompressed for loading into memory upon dnsmasq initialization).

Popular dnsmasq blocklists now exist like these from hagezi:

And as you can see they differ tremendously in size.

So this one is for Windows telemetry:

github.com

hagezi/dns-blocklists/blob/main/dnsmasq/native.winoffice.txt

# Title: HaGeZi's Windows/Office Tracker DNS Blocklist
# Description: Blocks Windows/Office native broadband tracker that track your activity.
# Homepage: https://github.com/hagezi/dns-blocklists
# License: https://github.com/hagezi/dns-blocklists/blob/main/LICENSE
# Issues: https://github.com/hagezi/dns-blocklists/issues
# Expires: 7 days
# Last modified: 10 Jul 2024 22:49 UTC
# Version: 2024.0710.2249.45
# Syntax: DNSMasq v2.86 or newer
# Number of entries: 335
#
local=/in.appcenter.ms/
local=/applicationinsights.azure.com/
local=/dc.applicationinsights.azure.cn/
local=/dc.applicationinsights.azure.us/
local=/in.applicationinsights.azure.cn/
local=/in.applicationinsights.azure.us/
local=/live.applicationinsights.azure.cn/
local=/azsc-bingads-centralus.centralus.cloudapp.azure.com/
local=/inference-app-gateway.eastus2.cloudapp.azure.com/

This file has been truncated.

Then this one is the so-called PRO list:

https://raw.githubusercontent.com/hagezi/dns-blocklists/main/dnsmasq/pro.txt

Even though dnsmasq greatly helped reduce memory footprint, an overarching theme of these blocklist games is to facilitate use on low memory devices - see:

As @Wizballs pointed out earlier on, one issue with this file size discrepancy is that processing is uneven, so a lot of binaries get called just to process the little files. It also complicates minimizing total memory use throughout the processing, since it's desirable to delete files as you go to free up memory.

The script above takes X input blocklists of different sizes, then sorts all lines from them into Y buckets based on the first character, so the Y buckets end up roughly even in size. It then feeds the Y buckets through further processing, deleting each bucket as it goes along to free up memory.
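
A stripped-down sketch of just the bucketing step may make this easier to follow. It leaves out the named pipes, gzip and nice from the script above and writes plain files instead; the function name split_into_buckets and its arguments are hypothetical.

```shell
# Hypothetical simplification of abl_split_compress: no named pipes, no gzip.
# Usage: split_into_buckets N PREFIX < input
# Lowercases every input line and appends it to PREFIX00 .. PREFIX<N-1> (hex),
# chosen by the byte value of the line's first character modulo N.
split_into_buckets()
{
	awk -v n="$1" -v prefix="$2" '
	BEGIN { for (i = 0; i < 256; i++) ord[sprintf("%c", i)] = i }   # char -> byte value
	{
		s = tolower($0)
		out = prefix sprintf("%02x", ord[substr(s, 1, 1)] % n)
		print s > out
	}'
}
```

Because identical lines always land in the same bucket, each bucket can later be de-duplicated on its own. How even the buckets turn out depends on how the first characters are distributed, so "even" is an expectation here, not a guarantee.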

Consider this line of adblock-lean:

github.com

lynxthecat/adblock-lean/blob/e34f9a7e8e1b54390bbbabda710224b340d210b3/adblock-lean#L417


      
          
rm -f /tmp/dnsmasq_err

{
	[[ "${use_allowlist}" == 1 ]] && sed 's~.*~server=/&/#~; $a\' /tmp/allowlist
	rm -f /tmp/allowlist

	for blocklist_file_part_gz in /tmp/blocklist.*.gz
	do
		gunzip -c "${blocklist_file_part_gz}"
		rm -f "${blocklist_file_part_gz}"
	done
} | sort -u | tee >(dnsmasq_test_output=$(dnsmasq --test -C - 2>&1); [[ $? != 0 ]] && printf '%s\n' "${dnsmasq_test_output}" > /tmp/dnsmasq_err) > /tmp/blocklist

if [[ -f /tmp/dnsmasq_err ]]
then
	log_msg "The dnsmasq --test on one of the blocklist file parts failed."
	log_msg "Last dnsmasq --test error:"
	log_msg "$(cat /tmp/dnsmasq_err)"
	return 1
else
Notice the use of gzip and also how files are deleted as they are fed through, to help free up memory.
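
The sed expression in that snippet is what turns plain allowlist domains into server=/domain/# entries, which, as I understand the dnsmasq man page, tell dnsmasq to use the standard upstream servers for those names. Here is a minimal sketch of the same conversion as a filter. The function name allowlist_to_dnsmasq is hypothetical, and unlike the original it skips empty lines, which would otherwise come out as server=//#.

```shell
# Hypothetical filter: read plain domains on stdin, write dnsmasq allowlist entries.
# Empty input lines are dropped instead of being turned into "server=//#".
allowlist_to_dnsmasq()
{
	sed -n '/./s~.*~server=/&/#~p'
}
```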

Aims of adblock-lean from my perspective are:

minimize memory footprint;
minimize processing time;
facilitate use of ever larger blocklists on ever weedier devices; and last, but not least,
entertain @dave14305 by creating something that is obscenely overengineered for what it needs to be.

@Wizballs may have some further thoughts on the above.

antonk · July 11, 2024, 2:31pm

Thanks for this explanation. Specifically, which files generated by adblock-lean are loaded by dnsmasq? I'm asking to better understand the script, but also because currently the files generated by adblock-lean have generic names like "blocklist" and "allowlist" which IMO should be replaced with something more unique like "adblock-lean-blocklist" or "abl-blocklist", both for the user to have an easier time figuring out what they're looking at in the filesystem, and in order to avoid possible filename collisions with other pieces of software. So I started working on a PR renaming these files but I'm not sure how to handle file names in the /tmp/dnsmasq.d folder.

As to the above awk script, I'm still not sure what your idea is. "Adopt something along these lines" is not specific enough.

Lynx · July 11, 2024, 2:35pm

root@OpenWrt-1:~# cd /tmp/dnsmasq.d/
root@OpenWrt-1:/tmp/dnsmasq.d# ls -alh
drwxr-xr-x    2 root     root         100 Jul 11 12:00 .
drwxrwxrwt   21 root     root         520 Jul 11 12:01 ..
-rw-r--r--    1 root     root        6.5M Jul 11 12:00 .blocklist.gz
-rw-r--r--    1 root     root          54 Jul 11 12:00 .extract_blocklist
-rw-r--r--    1 root     root          59 Jul 11 12:00 conf-script

.blocklist.gz is a compressed file with lots of blocklist lines:

root@OpenWrt-1:/tmp/dnsmasq.d# gunzip -c -d .blocklist.gz | head

bogus-nxdomain=1.0.1.0
bogus-nxdomain=1.0.1.1
bogus-nxdomain=1.0.1.4
bogus-nxdomain=1.0.215.176
bogus-nxdomain=1.0.218.19
bogus-nxdomain=1.0.218.230
bogus-nxdomain=1.0.4.0
bogus-nxdomain=1.0.4.4
bogus-nxdomain=1.0.5.0
root@OpenWrt-1:/tmp/dnsmasq.d# gunzip -c -d .blocklist.gz | tail
local=/zzzz662.cyou/
local=/zzzzaaaa.ddns.net/
local=/zzzzd6.icu/
local=/zzzzza.zapto.org/
local=/zzzzzzzz23.weeblysite.com/
local=/zzzzzzzzzzz.no-ip.biz/
local=/zzzzzzzzzzzzz.com/
local=/zzzzzzzzzzzzzz.no-ip.biz/
server=//#
server=/js.monitor.azure.com/#

The conf-script file tells dnsmasq what to run to extract the blocklist:

root@OpenWrt-1:/tmp/dnsmasq.d# cat conf-script
conf-script="busybox sh /tmp/dnsmasq.d/.extract_blocklist"

And .extract_blocklist is the extraction script:

root@OpenWrt-1:/tmp/dnsmasq.d# cat .extract_blocklist
busybox gunzip -c /tmp/dnsmasq.d/.blocklist.gz
exit 0

The above is set up this way because it fits within the dnsmasq framework for working with the files in its configuration directory.
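
For anyone wanting to reproduce that layout by hand, here is a rough sketch that builds the three files shown above from an uncompressed blocklist. It is not adblock-lean's actual code: the function name install_compressed_blocklist and the GUNZIP_CMD override are assumptions made for illustration, and it assumes the paths contain no spaces.

```shell
# Hypothetical helper, not adblock-lean's actual code.
# Usage: install_compressed_blocklist SRC_BLOCKLIST TARGET_DIR
# GUNZIP_CMD is an assumed override; the default mirrors the listing above.
install_compressed_blocklist()
{
	src="$1"
	dir="$2"
	gunzip_cmd="${GUNZIP_CMD:-busybox gunzip}"

	mkdir -p "$dir" || return 1
	# compressed blocklist
	gzip -c "$src" > "$dir/.blocklist.gz" || return 1
	# extraction script that prints the decompressed blocklist to stdout
	printf '%s -c %s\nexit 0\n' "$gunzip_cmd" "$dir/.blocklist.gz" > "$dir/.extract_blocklist"
	# conf-script line pointing at the extraction script
	printf 'conf-script="busybox sh %s"\n' "$dir/.extract_blocklist" > "$dir/conf-script"
}
```

Running sh on the generated .extract_blocklist should print the original blocklist again, which is a quick way to check the round trip before restarting dnsmasq.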

Lynx · July 11, 2024, 2:38pm

So rather than feeding files of different lengths through a processing pipe, it is possible to download all the files, sort them into e.g. 10 evenly sized files, then feed those evenly sized files through further processing and delete them as things go on. I'm just toying with ideas here since, as I mentioned above, maximising the efficiency of downloading and processing requires some careful thought. I'm not sure what the best way is, and I don't have a concrete idea in mind.

A downside of that sorting and bucketing approach is that it eats up a lot of processing just to help lower memory consumption. Granted, the script taking a bit longer at 5am may not be an issue, whereas memory consumption is, but reducing both CPU and memory use is ideal.

Really I think what we are trying to figure out is how to most optimally download and process a set of blocklists into a compressed dnsmasq file format, whilst performing all the sanity and safety checks we are already employing to help safeguard against nefarious online blocklist entries and the like. For example, one bad dnsmasq entry could render the network inoperable if it causes dnsmasq to crash. Or another could redirect users that try to visit a domain e.g. mybank.com to a hijacked alternative version. That's why we have all these checks in place.

antonk · July 11, 2024, 2:58pm

This sounds bad enough to make me think about adding another check which would essentially run a regex to validate that each entry is a valid domain/subdomain... Unfortunately, each additional check will hamper the performance. But maybe an existing regex could be enhanced.

Which part of the code do you think is the most memory-intensive?

Lynx · July 11, 2024, 3:00pm

I think maybe the huge sort to remove duplicates and pass through the dnsmasq test? Since at that point everything is completely decompressed. Ideally there we’d remove big evenly sized chunks rather than variable size files. Hence the bucketing approach described above.
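
As a sketch of what that could look like: if lines have already been bucketed so that identical lines always share a bucket, each bucket can be de-duplicated separately and deleted straight away, so sort only ever holds one bucket at a time. The function name dedupe_buckets is hypothetical, and whether this really lowers peak memory on a given device is something to measure, not something I can promise.

```shell
# Hypothetical helper: de-duplicate bucket files one at a time.
# Usage: dedupe_buckets OUTPUT_FILE BUCKET...
# Assumes identical lines always live in the same bucket, so a per-bucket
# sort -u removes all duplicates. The output is sorted within each bucket only.
dedupe_buckets()
{
	out="$1"
	shift
	: > "$out"
	for bucket in "$@"
	do
		[ -f "$bucket" ] || continue
		sort -u "$bucket" >> "$out"
		rm -f "$bucket"   # free the space before touching the next bucket
	done
}
```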

antonk · July 11, 2024, 3:12pm

Does passing dnsmasq test guarantee that there will be no dnsmasq crash when loading the entries?

In the meantime, this regex seems good enough for rough domain/subdomain validation:
^(?!.{256})(?:[a-z0-9](?:[a-z0-9-_]{0,61}[a-z0-9])?\.)+(?:[a-z]{1,63}|xn--[a-z0-9]{1,59})$
(taken from https://stackoverflow.com/questions/7930751/regexp-for-subdomain, with support for underscores added)
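
One caveat: the (?!.{256}) lookahead is PCRE syntax, which grep -E and awk do not understand. Below is a rough adaptation that does the length check separately and uses an extended regular expression for the rest. The function name filter_valid_domains is hypothetical, and it expects bare lowercase domains on stdin, one per line, not local=/.../ entries.

```shell
# Hypothetical filter: keep only lines that look like a valid domain/subdomain.
# The PCRE lookahead is replaced by an explicit length check (max 255 chars).
filter_valid_domains()
{
	awk 'length($0) <= 255' |
		LC_ALL=C grep -E '^([a-z0-9]([a-z0-9_-]{0,61}[a-z0-9])?\.)+([a-z]{1,63}|xn--[a-z0-9]{1,59})$'
}
```

For example, fed the lines example.com, -bad.example.com and sub_domain.example.com, it should pass the first and third and drop the second.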

Lynx · July 11, 2024, 3:21pm

I'm not entirely sure. And for sure it wouldn't guarantee that basic lookups of certain addresses still work (which is why we check amazon.com, microsoft.com and something else after the restart), since a blocklist could go nuts and add way too much.
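
Roughly, such a post-restart check could look like the sketch below. This is not the actual adblock-lean code: the function name check_test_domains and the LOOKUP_CMD override are assumptions, and it relies on the lookup command exiting non-zero when a name fails to resolve.

```shell
# Hypothetical sketch of a post-restart sanity check.
# LOOKUP_CMD is an assumed override; nslookup is used by default and is
# assumed to exit non-zero when the lookup fails.
check_test_domains()
{
	for domain in "$@"
	do
		if ! ${LOOKUP_CMD:-nslookup} "$domain" > /dev/null 2>&1
		then
			printf 'Lookup of %s failed\n' "$domain" >&2
			return 1
		fi
	done
	return 0
}
```

Called as check_test_domains amazon.com microsoft.com after the dnsmasq restart, a non-zero return would be the cue to roll back to the previous blocklist.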

antonk · July 11, 2024, 3:25pm

Well, I suppose I'll leave the idea of adding another domain/subdomain regex validation, for now. If people start to report dnsmasq crashes then the regex above will be available.

antonk · July 11, 2024, 3:27pm

@Lynx do you have a preference for filename prefix?

Lynx · July 11, 2024, 3:33pm

Maybe adblock-lean_blocklist.gz and adblock-lean_extract_blocklist? Not sure if the conf-script name can be altered. Maybe @Wizballs has a view here.

antonk · July 11, 2024, 3:53pm

How does dnsmasq know to read the conf-script file? I looked through the adblock-lean code and I'm not seeing calls to dnsmasq with the --conf-script option.

Lynx · July 11, 2024, 3:58pm

I think perhaps dnsmasq knows to look for conf-script inside /tmp/dnsmasq.d. This is essential because dnsmasq may be restarted outside the scope of adblock-lean, e.g. upon certain hotplug events.

We also add support for compression by giving dnsmasq access to /bin/busybox:

check_blocklist_compression_support()
{
        if ! dnsmasq --help | grep -qe "--conf-script"
        then
                log_msg "The version of dnsmasq installed on this system does not support blocklist compression."
                log_msg "Blocklist compression support in dnsmasq can be verified by checking the output of: dnsmasq --help | grep -e \"--conf-script\""
                log_msg "Either upgrade OpenWrt and/or dnsmasq to a newer version that supports blocklist compression or disable blocklist compression in config."
                return 1
        fi

        addnmount_str=$(uci get dhcp.@dnsmasq[0].addnmount 2> /dev/null)

        for addnmount_path in ${addnmount_str}
        do
                printf "%s" "$addnmount_path" | grep -qE "^/bin(/*|/busybox)?$" && return 0
        done

        log_msg "No appropriate 'addnmount' entry in /etc/config/dhcp was identified."
        log_msg "This is leveraged to give dnsmasq access to busybox gunzip to extract compressed blocklist."
        log_msg "Add: \"list addnmount '/bin/busybox'\" to /etc/config/dhcp at the end of the dnsmasq section."
        log_msg "Or simply run this command: uci add_list dhcp.@dnsmasq[0].addnmount='/bin/busybox' && uci commit"
        log_msg "Either edit /etc/config/dhcp as described above or disable blocklist compression in config."
        return 1
}

dnsmasq is jailed for security reasons that I don't entirely understand. We have to break free from the jail to give support for gzip to decompress the blocklist. How much of a security threat this poses I don't understand. @dave14305 maybe you can elaborate on this?

antonk · July 11, 2024, 4:09pm

I'm not sure about the risk but perhaps we could limit dnsmasq's access to some specific functionality of busybox? If all it needs is gunzip then perhaps creating a symlink pointing to busybox gunzip and changing the entry in /etc/config/dhcp to list addnmount '<symlink_path>' will be sufficient.

Or even this way: change the /etc/config/dhcp entry to list addnmount /etc/init.d/adblock-lean, then change the conf-script file contents to adblock-lean extract_blocklist, then add a new action to the adblock-lean script with this code:

busybox gunzip -c /tmp/dnsmasq.d/.blocklist.gz
exit 0

This way the only escape from jail will be the call to adblock-lean, so fewer potential security vulnerabilities (IMO).
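
To sketch what I mean by a new action (untested against the real jail): the names abl_action, BLOCKLIST_GZ and GUNZIP_CMD below are made up for illustration, and how the real script dispatches its actions may well differ.

```shell
# Hypothetical sketch of an extra "extract_blocklist" action.
# BLOCKLIST_GZ and GUNZIP_CMD are assumed overrides; defaults mirror the
# current .extract_blocklist script.
extract_blocklist()
{
	${GUNZIP_CMD:-busybox gunzip} -c "${BLOCKLIST_GZ:-/tmp/dnsmasq.d/.blocklist.gz}"
}

abl_action()
{
	case "$1" in
		extract_blocklist) extract_blocklist ;;
		*) printf 'Unknown action: %s\n' "$1" >&2; return 1 ;;
	esac
}
```

Note that this still calls busybox gunzip internally, so whether the jail allows it once only the script itself is mounted is exactly the open question.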

BTW could you explain this part:

if [[ "${action}" != "help" && "${action}" != "gen_config" ]]
then
	load_config
fi

The action variable is not assigned a value anywhere in the script, as far as I can tell.

Lynx · July 11, 2024, 4:19pm

That sounds pretty clever to me. I didn't think of it.

antonk:

Or even this way: change the /etc/config/dhcp entry to list addnmount /etc/init.d/adblock-lean, then change the conf-script file contents to adblock-lean extract_blocklist, then add a new action to the adblock-lean script with this code:
busybox gunzip -c /tmp/dnsmasq.d/.blocklist.gz
exit 0

Wouldn't that require busybox access though? I confess I don't fully understand how the jail aspect works.

antonk · July 11, 2024, 4:21pm

I expect it to simply restrict calls to external binaries/scripts, unless specifically permitted. So once permitted to call a script, I expect no more jail for that script.