Adblock-lean: set up adblock using dnsmasq blocklist

OK I've implemented support for: a) curl; and b) multiple blocklist files.

If curl is available, it is used; otherwise wget is used. The benefit of curl is that the size of the blocklist can be checked before actually downloading the blocklist file.
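
For anyone curious, the size check is roughly along these lines (just a sketch, not the exact commands in the script; the URL and limit are only examples, and it relies on the server reporting a Content-Length header):

	# sketch only: ask the server for the size via a HEAD request before downloading
	max_blocklist_file_size_B=20000000
	blocklist_url="https://big.oisd.nl/dnsmasq2"
	reported_size_B=$(curl -sIL "${blocklist_url}" | grep -i '^content-length:' | tail -1 | tr -d '\r' | awk '{print $2}')
	if [ -n "${reported_size_B}" ] && [ "${reported_size_B}" -le "${max_blocklist_file_size_B}" ]
	then
		curl -sL "${blocklist_url}" -o /tmp/blocklist
	else
		echo "Blocklist size unknown or too large; skipping download."
	fi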

@Wizballs and @dave14305 how important is it to keep wget compatibility? If not very important, dropping it would help simplify the code.

Multiple blocklist file URLs can now be specified and all successfully downloaded blocklist file parts will be merged and duplicates removed using @Wizballs's awk solution to form a single blocklist file that will then be used as before. This approach seems to me a good compromise between facilitating multiple blocklist sources and not introducing excessive additional complexity.
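
The merge itself boils down to something like this (a rough sketch with illustrative file names; the actual command in the script may differ):

	# concatenate all downloaded parts, keeping only the first occurrence of each line
	awk '!seen[$0]++' /tmp/blocklist.* > /tmp/blocklist
	rm -f /tmp/blocklist.*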


I tested using:

blocklist_urls="https://big.oisd.nl/dnsmasq2 https://raw.githubusercontent.com/hagezi/dns-blocklists/main/dnsmasq/pro.txt"

And whereas OISD big only gives:

New blocklist installed with good line count: 294959.

OISD big plus Hagezi multi pro gives:

New blocklist installed with good line count: 677515.

@Wizballs the use of multiple blocklist file parts complicates the issue of file size. I see three options:

  • a single maximum file size, making sure that a) no one part exceeds the maximum file size and b) the merged total does not exceed that maximum file size;
  • a merged maximum file size (for the merged parts) and a single part maximum file size (for any one part); or
  • a global merged size and multiple part maximum file sizes (possible but trickier to implement).

What are your thoughts on the three options above?

OISD big plus Hagezi multi pro gives a merged blocklist file size of just shy of 20 megabytes.

That was fast - nicely done! I can test tonight after work and see how it goes.

My intention is to continue using oisd, along with another list for cookie popups, maybe fanboy annoyances - not sure yet. I only used Hagezi temporarily for duplicate processing/testing etc.

My thoughts on this are to just test the input files, via the variables set at the top.
So eg for OISD I would continue using max file size 20mb, min good line count 100,000.
For the Fanboy annoyance list I would ideally apply different values, eg max file size 5mb, min good line count 25,000. Whatever the output from multiple files will be, will be. Therefore leave the final number of lines up to each user to manage in relation to their hardware limitations.
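
To illustrate, something like this per-list configuration (the variable names and second URL are purely hypothetical, just to show the idea):

	# hypothetical per-list settings, one value per URL in the same order
	blocklist_urls="https://big.oisd.nl/dnsmasq2 https://example.com/fanboy-annoyance-dnsmasq.txt"
	max_blocklist_file_part_sizes_KB="20000 5000"
	min_good_line_counts="100000 25000"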

Personally I'll be using curl, so not worried either way. I can see the benefit to leaving wget in there though: adblock-lean will still run if someone forgets to install curl or doesn't want to for whatever reason.

So it sounds like you favour just one maximum file size? If so, does it make sense to you that we simply check that: a) any one part is less than the maximum; and b) the merged total is less than the maximum? And leave it at that? And similarly just one minimum line count for the merged total?
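
In other words, something like this (a sketch of the checks I have in mind, not actual script code):

	# single user-configurable maximum, applied both to each part and to the merged total
	max_blocklist_file_size_KB=20000
	for blocklist_file_part in /tmp/blocklist.*
	do
		part_size_KB=$(( $(wc -c < "${blocklist_file_part}") / 1024 ))
		[ "${part_size_KB}" -gt "${max_blocklist_file_size_KB}" ] && { log_msg "Blocklist part too large."; return; }
	done
	merged_size_KB=$(( $(cat /tmp/blocklist.* | wc -c) / 1024 ))
	[ "${merged_size_KB}" -gt "${max_blocklist_file_size_KB}" ] && { log_msg "Merged blocklist too large."; return; }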

@Wizballs your awk-based merge without duplicates solution seems super fast by the way! And in general the new merging code seems like it's working, but clearly this needs some more testing and there's surely some refinement and tweaking still needed.

Anyone else have thoughts on the above or wget vs curl? I'm tempted to just drop wget and force all users to use curl. But if even just one person objects to this, we should keep wget in.

I would try to keep the script functional with no additional packages required. Stick to your lean guns. The wget binary is compiled with limited features compared to other platforms I’ve used wget on, so creativity may be required.

I don’t know where the size obsession comes from, besides the Y chromosome, but you could also compare the variance in size from old list to new list instead of some fixed value.

lololol. Actually there is a reason for it. A couple of times the OISD size increased to 25mb for a few days (the usual is ~8mb), probably a compile error or something. And then some peeps with low-RAM routers ended up with inoperable routers. Therefore, this general check was put in place...


I still think checking that any one part is less than the maximum is the main thing needed; this can be set in the curl download command line of course. But a total size check after won't hurt one bit. And both user-configurable? Edit: oh yes, and a minimum line count afterwards to check everything downloaded ok...

Every sed or awk command I've put forward so far was the fastest solution of several options that I tested (real-world testing on r7800) :wink: Eg the sed line checks are actually about 20% faster than awk in a similar implementation. But for de-duping, awk was waaaay faster than any sed command I tried.


You are right.

Actually I think I have just found a way to use wget to probe file size before downloading:

wget --spider -S https://example.com 2>&1 | sed -ne '/Length/ {s/ *Content-Length: *\([0-9]\+\)/\1/ ; p}'

https://www.mail-archive.com/busybox@busybox.net/msg27925.html

So we can hopefully leverage this to dispense with curl.

Update: this won't work unfortunately with the busybox wget presently in OpenWrt.


Just gave the updated script with multi-list a test run, and again - nice work!
Everything is running as intended, and even with two huge lists the complete refresh takes under 30 seconds on the r7800.
I did revert back to using just oisd (ie single list) in the configuration, and that is running nicely also. I'll do more crack testing at some point but just wanted to report this so far.


I have reworked the update based on @dave14305's excellent point above about keeping things lean.

So rather than introduce curl to limit the downloaded file size, I have found a way to limit the downloaded file size using wget in combination with head.

The new download routine looks like this:

	blocklist_id=0
	for blocklist_url in ${blocklist_urls}
	do
		for retries in 1 2 3
		do
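			# cap each downloaded part at max_blocklist_file_part_size_KB by piping the wget output through head -c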
			wget "${blocklist_url}" -O- --timeout=2 2> /tmp/wget_err | head -c "${max_blocklist_file_part_size_KB}k" > "/tmp/blocklist.${blocklist_id}"
			if grep -q "Download completed" /tmp/wget_err
			then
				log_msg "Download of new blocklist file part from: ${blocklist_url} suceeded."
				blocklist_id=$((blocklist_id+1))
				continue 2
			else
				log_msg "Download of new blocklist file part from: ${blocklist_url} failed."
				sleep 5
			fi
		done
		log_msg "Exiting after three failed download attempts."
		rm -f /tmp/wget_err /tmp/blocklist.* 
		return
	done

New code here:


That's excellent. Good to know an oversize list blocker is in place, without having to use curl.

Should we implement a whitelist to be pretty much feature complete? It would be the last step after de-duping. Possibly a user-selectable file somewhere, whether stored on the router or usb etc. Any line in the blocklist containing a whitelist line will be deleted. I probably won't have a chance to play until the weekend though.

Yes nice idea! User can specify file location (e.g. /root/adblock-lean/whitelist) and then that file is read in and used to remove entries from the blocklist. Perhaps this could be merged with one of the existing processes like the duplicate removal, or alternatively just a new process.

adblock-lean - lean, and yet feature rich!

This has to be treated a little differently than the exact whole-line matches used for de-duping. Ie it needs to handle sub-domains. For example a user might want to whitelist "google.com", but then all sub-domains need to be whitelisted also, eg "calendar.google.com". I think because of this it might have to be a separate function, for now anyway. To make this work, any blocklist entries ending in google.com need to be removed. I'm not even sure how to do this yet lol, but will figure it out over the weekend.

:slight_smile: Has a lot of checks, functions and protections jammed into less than 250 lines of code for sure!

hello guys! I'm still super happy Adblock-Lean is running smoothly and reloads every day with the cron thing! I'm super happy, filtering is working and I enforced all the dns with the https-dns-proxy package so it really forces even those clients with dns hard-coded, like IoT google and alexa devices!

thank you!


Quick update on the Allowlist. It's getting closer, and with some help from the stackexchange forums.

Fwiw, an exact line match from the Allowlist is really easy. But doing a subdomain wildcard match with the list is a lot trickier.

@Lynx if you are happy to wait a bit longer, I'd rather get this correct first, instead of putting a temporary substandard part in.


Yeah let's wait. No rush.

And in the meantime, I've merged the multiple blocklist files commit from the testing branch into main:

Personally I'm now using the combined OISD big plus Hagezi pro list. Are you?

All seems fine for me in terms of general use; all adverts gone and everything still seems to work fine. Presumably this is a little more CPU heavy than just the OISD big list - after all, the combined list is much larger, but on my RT3200 it's no issue.

I'd be very curious about any findings on other devices.


@Lynx Got some awesome help from the stackoverflow community to properly implement an allowlist solution:

awk -F'/' 'NR==FNR { allow[$0]; next } { n=split($2,arr,"."); addr = arr[n]; for ( i=n-1; i>=1; i-- ) { addr = arr[i] "." addr; if ( addr in allow ) next } } 1' allowlist blocklist > output

Blocklist as per usual:

local=/randomsites.com/
local=/calendar.google.com/
local=/google.com/
local=/google.com.fake.com/
local=/fakegoogle.com/

Allowlist format:

google.com
othersites.com
etc

New blocklist output:

local=/randomsites.com/
local=/google.com.fake.com/
local=/fakegoogle.com/

As you can see, the allowlist google.com entry correctly removes google.com and its subdomains, ie calendar.google.com, while still leaving the fake (for example purposes) lines containing google substrings.

As it's using OpenWrt's awk (not gawk), it will need to output to another file (again), as in-place editing is only available in gawk.

I think before executing the command, we'll need to check both that the allowlist file exists and that its size is bigger than zero bytes. A zero-byte allowlist seems to empty out the whole blocklist file, while even just one byte in the file, whether a single space or a single carriage return, doesn't have undesirable effects.
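
Something like this, perhaps (a rough sketch; the paths and variable names are placeholders rather than whatever adblock-lean will actually use):

	# [ -s FILE ] is true only if FILE exists and is larger than zero bytes
	allowlist_path="/root/adblock-lean/allowlist"
	if [ -s "${allowlist_path}" ]
	then
		# write to a temporary file and move it back, since busybox awk has no in-place editing
		awk -F'/' 'NR==FNR { allow[$0]; next } { n=split($2,arr,"."); addr = arr[n]; for ( i=n-1; i>=1; i-- ) { addr = arr[i] "." addr; if ( addr in allow ) next } } 1' \
			"${allowlist_path}" /tmp/blocklist > /tmp/blocklist.tmp
		mv /tmp/blocklist.tmp /tmp/blocklist
	fi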

This solution is blazing fast for what it does! On r7800:
300k lines blocklist, 13 lines allowlist = 23.3 seconds
300k lines blocklist, 300k lines allowlist = 21.5 seconds

More allowlist lines is actually faster somehow!? And I don't think a much faster method exists, if at all.


Sounds good to me.

Are there any ways in which this might fail (aside from the zero bytes issue)?

@dave14305, or anyone else, any critical thoughts on the approach suggested by @Wizballs above?

I think users need to exercise some amount of discretion, ie don't go using massive blocklists and massive allowlists on low-RAM routers. But besides that, I found no cracks/issues in various test runs.

As an interesting alternative, there is a method which is very fast for small allowlists, but runtime grows linearly with allowlist size. This is due (in my simple understanding) to the awk above using arrays, while the grep/sed method below uses an iterative loop. Some runtime tests for the grep/sed method:
300k lines blocklist, 13 allowlist lines = 1.7 seconds (I mean, wow)
300k lines blocklist, 500 allowlist lines = 20.6 seconds
300k lines blocklist, 1000 allowlist lines = 40.4 seconds
300k lines blocklist, 2000 allowlist lines = 77.6 seconds

tmp=$(mktemp) && { sed 's|$|/|' allowlist > "$tmp" && grep -vwFf "$tmp" blocklist; rm -f "$tmp"; } > output

So there are some real options here:

  1. Use the always reliable, consistent runtime method of awk
  2. Check the allowlist line count. If under 500 lines use grep/sed, otherwise use awk
  3. Always use grep/sed and inform the user that anything over 500 allowlist lines is going to blow out processing time.

The majority of use cases will likely be under 500 lines, or probably even zero lines. But as always I think a 'bulletproof' method is important...
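
A rough sketch of option 2, reusing the two commands above (file names are placeholders):

	# pick the removal method based on how many lines the allowlist contains
	allowlist_line_count=$(wc -l < allowlist)
	if [ "${allowlist_line_count}" -lt 500 ]
	then
		# small allowlist: grep/sed is fastest
		tmp=$(mktemp) && { sed 's|$|/|' allowlist > "$tmp" && grep -vwFf "$tmp" blocklist; rm -f "$tmp"; } > output
	else
		# large allowlist: awk gives a consistent runtime regardless of allowlist size
		awk -F'/' 'NR==FNR { allow[$0]; next } { n=split($2,arr,"."); addr = arr[n]; for ( i=n-1; i>=1; i-- ) { addr = arr[i] "." addr; if ( addr in allow ) next } } 1' allowlist blocklist > output
	fi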


Fantastic work in identifying a very promising looking solution.

Absent any issues with the awk-based approach that anyone can identify, I'll code that up to further facilitate allow lists in our adblock-lean! This is clearly a feature that certain users will value.

I'm still using the hybrid list by the way. How about you? And have you identified any specific exceptions to add to an allow list for your specific use case?


I did try out OISD and Hagezi combined for a while, mostly to test the script etc (working well). I've gone back to OISD solo for now, with plans to add a cookie popup blocker. Looking at the Fanboy annoyance list, but he is only providing adblock or uBlock lists, not the dnsmasq-native syntax we require. Not sure if I should kindly ask him to add dnsmasq syntax, or run some conversion process somewhere else. No rush atm anyway.
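
If it comes to a conversion, something like this might work for the simple domain-only rules (just an untested idea, and the file names are placeholders) - rewriting adblock-style "||domain^" entries into dnsmasq "local=/domain/" syntax, and skipping any rules with wildcards, paths or exception syntax:

	sed -n 's/^||\([a-zA-Z0-9._-]*\)^$/local=\/\1\//p' fanboy-annoyance.txt > fanboy-annoyance-dnsmasq.txt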

So you are saying you aren't getting many/any broken sites running these two lists? Interesting... I might have to revisit and trial this again for myself. No broken sites due to dns blocking remains my priority for the household. But everyone has different needs/wants.

I'll run a handful of essential sites in the allowlist, including home automation such as lighting and CCTV, just to make sure those are never ever accidentally blocked. Not everyone will use the allowlist, but it will be great having it there for those that do want it.
