That sure sounds like a great idea, but doing this in a shell script is way above my paygrade
I think part of the allure of parallel processing is solving the problem itself
Something important to consider is the order of blocklist processing. Having the blocklist files download in order from biggest to smallest gives the fastest sort -u speed (the parts are essentially processed in the same order they are downloaded). Only guessing, but it probably has less sorting to do.
times for: "Performing dnsmasq --test on the blocklist file parts and sorting and merging the blocklist lines into a single blocklist file":
17 seconds with my largest blocklist part first (Hagezi tif, 750k lines)
20 seconds with hagezi tif listed last
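If the order matters this much, one option might be to sort the already-downloaded parts by size before they reach sort -u, rather than relying on the order in the config. A rough sketch only, not adblock-lean code: list_by_size_desc is a made-up helper name, and it assumes the part file names contain no tabs or newlines.

```shell
# Hypothetical helper (not part of adblock-lean): print the given files,
# largest first, one path per line.
# Assumes file names contain no tab or newline characters.
list_by_size_desc()
{
	for f in "$@"
	do
		[ -f "$f" ] || continue
		# prefix each path with its size in bytes, separated by a tab
		printf '%s\t%s\n' "$(wc -c < "$f")" "$f"
	done | sort -rn | cut -f2-
}
```

The output could then drive the merge, e.g. list_by_size_desc /path/to/parts/* | while read -r f; do cat "$f"; done | sort -u (the path is a placeholder). Whether this reproduces the 17 vs 20 second difference would need measuring.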
That reminds me of the script on the main thread that first downloaded the files and then sorted their lines into different files depending on the starting letter, before passing those on for further processing. That way the files are evenly sized, rather than say two big and ten small files.
Here is the bit that I am thinking about:
# tolower Input and split lines by first character to different output files
# stores output files to $1_.._$2, e.g. "/tmp/adblock-lean-wd/ablbl_.._httpsabc.new.gz"
abl_split_compress()
{
[ -z "${SPLIT_MODULO}" ] && { echo "FATAL: SPLIT_MODULO is empty!!!" >&2; return 1; } # bail out: an empty modulo would make awk divide by zero
nice -n 19 awk -v NPMODULO=${SPLIT_MODULO} -v NPPREP="$1" -v NPPOST="$2" 'BEGIN{
for(i=0;i<=255;i++){_ORD[ sprintf("%c",i) ]=i} # init _ORD
for(x=0;x<NPMODULO;x++){
o="" NPPREP sprintf("%02x",x) NPPOST # convert to hex, 10 -> 0a
system("mkfifo \"" o "\"") # make named pipe
system("nice -n 19 gzip < \"" o "\" > \"" o ".new.gz\" &") # open named pipe for read
printf("") > o # open named pipe for write
}
system("sleep 1") # give OS some time to start all sub processes and connect pipes
};{
S=tolower($0)
o="" NPPREP sprintf("%02x", ( _ORD[ substr(S, 1, 1) ] % NPMODULO ) ) NPPOST
print S > o # write to named pipe - depending on first char
};END{
for(i=0;i<NPMODULO;i++){
o="" NPPREP sprintf("%02x",i) NPPOST
close(o) # close named pipe
}
system("sleep 1") # give OS some time to finish all sub processes
system("sync")
for(i=0;i<NPMODULO;i++){
o="" NPPREP sprintf("%02x",i) NPPOST
system("rm \"" o "\"") # remove named pipe file
}
}'
}
This is quite a piece of work!
Perhaps we can adopt something along these lines.
Could you elaborate on what you mean and what you want to achieve?
That awk script is quite something...
Also, I still don't know what dnsmasq is actually doing and which files it loads. Is there a good explanation of this somewhere?
So dnsmasq is responsible for converting domain names to IP addresses by sending queries to a resolver and caching along the way. So a first lookup will get sent out to e.g. 1.1.1.1 and then a second lookup of the same thing will just get retrieved from the cache. Some time ago, dnsmasq introduced a new capability to read in a blocklist file so that it would instead return e.g. NXDOMAIN when such queries are sent - see here:
https://thekelleys.org.uk/dnsmasq/CHANGELOG
It implemented this in an extremely efficient manner that basically instantly rendered obsolete many existing adblocking solutions that were generally extremely clunky and memory intensive. Now it was possible to set up enormous blocklists on even very weedy devices. Later on dnsmasq expanded this capability further to support compressed blocklists, so that the blocklist can be stored in compressed form (albeit decompressed for loading into memory upon dnsmasq initialization).
Popular dnsmasq blocklists now exist like these from hagezi:
And as you can see they differ tremendously in size.
So this one is for Windows telemetry:
Then this one is the so-called PRO list:
https://raw.githubusercontent.com/hagezi/dns-blocklists/main/dnsmasq/pro.txt
Even though dnsmasq greatly helped reduce memory footprint, an overarching theme of these blocklist games is to facilitate use on low memory devices - see:
As @Wizballs pointed out earlier on, one issue with this file size discrepancy is that processing is uneven, so you call a lot of binaries just to process the little files. It also complicates minimizing total memory use throughout the processing, since it's desirable to delete files as early as possible to free up memory.
The script above takes X input blocklists of different sizes, then sorts all lines from them into Y buckets based on first character, so the Y buckets end up of roughly even size. It then feeds the Y buckets through further processing, deleting each bucket as it goes along to free up memory during processing.
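To make the bucket idea a bit more concrete, here is a stripped-down sketch without the named pipes and gzip of the script above. The function names split_into_buckets and merge_buckets are made up for illustration, and a plain sort -u stands in for whatever further processing would really happen. One property worth noting: duplicate lines always share a first character, so they always land in the same bucket, and deduplicating bucket by bucket still removes every duplicate.

```shell
# Hypothetical helpers (names are made up, not taken from adblock-lean).

# Lowercase stdin and split it into bucket files by first character.
# $1 = output path prefix, $2 = number of buckets
split_into_buckets()
{
	awk -v prefix="$1" -v n="$2" '
	BEGIN { for (i = 0; i <= 255; i++) ord[sprintf("%c", i)] = i }
	{
		line = tolower($0)
		# bucket number = character code of first char, modulo n, as 2-digit hex
		bucket = prefix sprintf("%02x", ord[substr(line, 1, 1)] % n)
		print line > bucket
	}'
}

# Deduplicate each bucket in turn and delete it as soon as it is processed,
# so at most one bucket is being sorted at any time.
# $1 = the same path prefix that was passed to split_into_buckets
merge_buckets()
{
	for bucket in "$1"*
	do
		[ -f "$bucket" ] || continue
		sort -u "$bucket"
		rm -f "$bucket"
	done
}
```

Usage would be something like cat part1 part2 | split_into_buckets /tmp/bucket_ 8 followed by merge_buckets /tmp/bucket_ > merged (paths are placeholders). The merged output is sorted within each bucket but not globally, which should not matter if only deduplication is needed.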
Consider this line of adblock-lean:
Notice the use of gzip and also how files are deleted as they are fed through, to help free up memory.
Aims of adblock-lean from my perspective are:
- minimize memory footprint;
- minimize processing time;
- facilitate use of ever larger blocklists on ever weedier devices; and last, but not least,
- entertain @dave14305 by creating something that is obscenely overengineered for what it needs to be.
@Wizballs may have some further thoughts on the above.
Thanks for this explanation. Specifically, which files generated by adblock-lean are loaded by dnsmasq? I'm asking to better understand the script, but also because currently the files generated by adblock-lean have generic names like "blocklist" and "allowlist" which IMO should be replaced with something more unique like "adblock-lean-blocklist" or "abl-blocklist", both for the user to have an easier time figuring out what they're looking at in the filesystem, and in order to avoid possible filename collisions with other pieces of software. So I started working on a PR renaming these files but I'm not sure how to handle file names in the /tmp/dnsmasq.d folder.
As to the above awk script, I'm still not sure what your idea is.
adopt something along these lines
is not specific enough.
root@OpenWrt-1:~# cd /tmp/dnsmasq.d/
root@OpenWrt-1:/tmp/dnsmasq.d# ls -alh
drwxr-xr-x 2 root root 100 Jul 11 12:00 .
drwxrwxrwt 21 root root 520 Jul 11 12:01 ..
-rw-r--r-- 1 root root 6.5M Jul 11 12:00 .blocklist.gz
-rw-r--r-- 1 root root 54 Jul 11 12:00 .extract_blocklist
-rw-r--r-- 1 root root 59 Jul 11 12:00 conf-script
.blocklist.gz is a compressed file with lots of blocklist lines:
root@OpenWrt-1:/tmp/dnsmasq.d# gunzip -c -d .blocklist.gz | head
bogus-nxdomain=1.0.1.0
bogus-nxdomain=1.0.1.1
bogus-nxdomain=1.0.1.4
bogus-nxdomain=1.0.215.176
bogus-nxdomain=1.0.218.19
bogus-nxdomain=1.0.218.230
bogus-nxdomain=1.0.4.0
bogus-nxdomain=1.0.4.4
bogus-nxdomain=1.0.5.0
root@OpenWrt-1:/tmp/dnsmasq.d# gunzip -c -d .blocklist.gz | tail
local=/zzzz662.cyou/
local=/zzzzaaaa.ddns.net/
local=/zzzzd6.icu/
local=/zzzzza.zapto.org/
local=/zzzzzzzz23.weeblysite.com/
local=/zzzzzzzzzzz.no-ip.biz/
local=/zzzzzzzzzzzzz.com/
local=/zzzzzzzzzzzzzz.no-ip.biz/
server=//#
server=/js.monitor.azure.com/#
The conf-script tells dnsmasq what to run to extract the file:
root@OpenWrt-1:/tmp/dnsmasq.d# cat conf-script
conf-script="busybox sh /tmp/dnsmasq.d/.extract_blocklist"
And .extract_blocklist is the extraction script:
root@OpenWrt-1:/tmp/dnsmasq.d# cat .extract_blocklist
busybox gunzip -c /tmp/dnsmasq.d/.blocklist.gz
exit 0
It is set up this way because it fits within the dnsmasq framework for working with files in its configuration directory.
So rather than feed in different files with different lengths through a processing pipe, it is possible to download all the files, sort them into e.g. 10 evenly sized files, then feed those evenly sized files through further processing and delete them as things go on. Just toying with ideas here, since as I mentioned above how to maximise efficiency of downloading and processing requires some careful thought. I'm not sure what the best way is. I don't have a concrete idea in my mind.
A downside of that sorting and bucketing approach is that it eats up a lot of processing just to help lower memory consumption. For sure, the script taking a bit longer at 5am may not be an issue, whereas memory consumption is, but reducing both CPU and memory use is ideal.
Really I think what we are trying to figure out is how to most optimally download and process a set of blocklists into a compressed dnsmasq file format, whilst performing all the sanity and safety checks we are already employing to help safeguard against nefarious online blocklist entries and the like. For example, one bad dnsmasq entry could render the network inoperable if it causes dnsmasq to crash. Or another could redirect users that try to visit a domain e.g. mybank.com to a hijacked alternative version. That's why we have all these checks in place.
This sounds bad enough to make me think about adding another check which would essentially run a regex to validate that each entry is a valid domain/subdomain... Unfortunately, each additional check will hamper the performance. But maybe an existing regex could be enhanced.
Which part of the code do you think is the most memory-intensive?
I think maybe the huge sort to remove duplicates and the pass through the dnsmasq test? At that point everything is completely decompressed. Ideally we'd remove big, evenly sized chunks there rather than variable-size files. Hence the bucketing approach described above.
Does passing dnsmasq test guarantee that there will be no dnsmasq crash when loading the entries?
In the meantime, this regex seems good enough for rough domain/subdomain validation:
^(?!.{256})(?:[a-z0-9](?:[a-z0-9-_]{0,61}[a-z0-9])?\.)+(?:[a-z]{1,63}|xn--[a-z0-9]{1,59})$
(taken from here
https://stackoverflow.com/questions/7930751/regexp-for-subdomain
and added support for underscores)
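One caveat: the (?!...) lookahead and the (?:...) groups are PCRE syntax, which as far as I know the busybox grep on OpenWrt does not support. A rough equivalent, as a sketch only, could do the length check in awk and the rest with grep -E. filter_valid_domains is a made-up name, the hyphen is moved to the end of the bracket expression so it is unambiguous in POSIX regex, and the filter expects bare lowercase domains, one per line, not local=/.../ lines.

```shell
# Hypothetical filter (not part of adblock-lean): read bare domains from
# stdin, one per line, and print only those that look valid.
filter_valid_domains()
{
	# 1) drop lines longer than 255 characters (replaces the PCRE lookahead)
	# 2) keep lines made of dot-separated labels followed by a TLD
	LC_ALL=C awk 'length($0) <= 255' |
	LC_ALL=C grep -E '^([a-z0-9]([a-z0-9_-]{0,61}[a-z0-9])?\.)+([a-z]{1,63}|xn--[a-z0-9]{1,59})$'
}
```

For example, printf 'example.com\n-bad.example.com\n' | filter_valid_domains should let only example.com through. The extra awk and grep pass over every line is exactly the performance cost mentioned above, so it would need timing on a real list.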
I'm not entirely sure. And for sure it wouldn't safeguard against basic lookups to certain addresses failing (which is why we check for amazon.com, microsoft.com and something else, post restart), since a blocklist could go nuts and add way too much.
Well, I suppose I'll leave the idea of adding another domain/subdomain regex validation, for now. If people start to report dnsmasq crashes then the regex above will be available.
@Lynx do you have a preference for filename prefix?
Maybe adblock-lean_blocklist.gz
and adblock-lean_extract_blocklist
? Not sure if the conf-script name can be altered. Maybe @Wizballs has a view here.
How does dnsmasq know to read the conf-script file? I looked through the adblock-lean code and I'm not seeing calls to dnsmasq with the --conf-script
option.
I think perhaps dnsmasq knows to look for conf-script inside /tmp/dnsmasq.d. This is essential because dnsmasq may be restarted outside the scope of adblock-lean, e.g. upon certain hotplug events.
We also add support for compression by giving dnsmasq access to /bin/busybox:
check_blocklist_compression_support()
{
if ! dnsmasq --help | grep -qe "--conf-script"
then
log_msg "The version of dnsmasq installed on this system does not support blocklist compression."
log_msg "Blocklist compression support in dnsmasq can be verified by checking the output of: dnsmasq --help | grep -e \"--conf-script\""
log_msg "Either upgrade OpenWrt and/or dnsmasq to a newer version that supports blocklist compression or disable blocklist compression in config."
return 1
fi
addnmount_str=$(uci get dhcp.@dnsmasq[0].addnmount 2> /dev/null)
for addnmount_path in ${addnmount_str}
do
printf "%s" "$addnmount_path" | grep -qE "^/bin(/*|/busybox)?$" && return 0
done
log_msg "No appropriate 'addnmount' entry in /etc/config/dhcp was identified."
log_msg "This is leveraged to give dnsmasq access to busybox gunzip to extract compressed blocklist."
log_msg "Add: \"list addnmount '/bin/busybox'\" to /etc/config/dhcp at the end of the dnsmasq section."
log_msg "Or simply run this command: uci add_list dhcp.@dnsmasq[0].addnmount='/bin/busybox' && uci commit"
log_msg "Either edit /etc/config/dhcp as described above or disable blocklist compression in config."
return 1
}
dnsmasq is jailed for security reasons that I don't entirely understand. We have to break free from the jail to give support for gzip to decompress the blocklist. How much of a security threat this poses I don't understand. @dave14305 maybe you can elaborate on this?
I'm not sure about the risk but perhaps we could limit dnsmasq's access to some specific functionality of busybox? If all it needs is gunzip then perhaps creating a symlink pointing to busybox gunzip
and changing the entry in /etc/config/dhcp to list addnmount '<symlink_path>'
will be sufficient.
Or even this way: change the /etc/config/dhcp entry to list addnmount /etc/init.d/adblock-lean
, then change the conf-script file contents to adblock-lean extract_blocklist
, then add a new action to the adblock-lean script with this code:
busybox gunzip -c /tmp/dnsmasq.d/.blocklist.gz
exit 0
This way the only escape from the jail will be the call to adblock-lean, so fewer potential security vulnerabilities (IMO).
BTW could you explain this part:
if [[ "${action}" != "help" && "${action}" != "gen_config" ]]
then
load_config
fi
The action
variable is not assigned a value anywhere in the script, as far as I can tell.
That sounds pretty clever to me. I didn't think of it.
Wouldn't that require busybox access though? I confess I don't fully understand how the jail aspect works.
I expect it to simply restrict calls to external binaries/scripts, unless specifically permitted. So once permitted to call a script, I expect no more jail for that script.