Not sure. I like and use htop though?
htop doesn't monitor free /tmp space that I know of, so I just monitored it via the LuCI status page...
And the winner is...... echo awk
Same memory usage as awk alone (approx 2.5 x the blocklist file size), but no increase at all in /tmp usage
@a-z Please ignore trying to use gawk inplace, it uses looooots of memory. Apologies.
Would you possibly mind real world testing echo awk with 23.05.2 and both blocklists causing you issues previously?
echo "$(awk '!seen[$0]++' /tmp/blocklist)" > /tmp/blocklist
I made the changes, but it hangs and didn't even show the output of the echo change.
Using the normal script with OISD and Hagezi Multi Pro, the RAM didn't rise too much, but I think htop
doesn't show all of the memory use; the 4 cores were busy removing duplicates when it happened.
It's not a complete hang, because I still have internet; it's just that the terminals are stuck, and if I try to reach the router via the LuCI GUI at 192.168.1.1 the page is unreachable.
After 5-7 minutes I was able to get back into the shell and into LuCI, and the adblock-lean service is running.
I think the new version of OpenWrt consumes more hardware resources than the previous one, and that is why this particular router struggles after downloading both lists and then deduplicating and cleaning the files.
Pity! I had hoped this would turn out to be our magic bullet.
I think it worked well, but in this particular case the recommendation would be to limit the blocklists. It works very well for me with Hagezi Light, which is less than 5 megabytes with around 60k domains, or Hagezi Multi Pro, which is around 8 megabytes with around 160k domains. I still have 20 megabytes free for other services. In any case, it was a good idea to leave the lists open and optional.
Perhaps later the process of organizing and joining multiple lists could be restructured, maybe splitting those steps into smaller pieces so that they don't consume all the resources of routers with limited hardware, even if it takes a little longer - but that would mean restructuring the script. I don't see it as urgent or a priority.
Perhaps what could be done is to expand the readme file with a note about routers with limited hardware, suggesting they try smaller and lighter lists.
Also I'll run
- WireGuard Client/Peer
- SQM for Guest Wifi
- DNS Encrypt to Hijack DNS to work with Adblock-Lean
- Travelmate
and I want to test them with Tailscale as well, so that will consume more RAM.
Thanks @a-z for testing these options even if they didn't work out. Good suggestion about adding e.g. a recommended low-RAM blocklist.
Hi, I just created an account to add some more options for those memory problems. I have some experience in shell scripting, and wrote a very simple version of adblock-lean with a minimal amount of memory usage. To eliminate the temp files, I did it with pipes.
You can even pipe shell functions. All files written to /tmp are compressed. Unfortunately dnsmasq can't read gz and ignores named pipes - so, at least the final config has to be there uncompressed.
As written, it is quite a simple script, mostly without error handling - not meant for real use, but to give some ideas on how to solve those memory issues.
On my Archer C7v4, with both default lists, it took about 100sec including download time.
Feel free to test and send me some feedback/questions - cheers!
Script
#!/bin/sh
set -o pipefail
# Fail, if one pipe member fails - not only on error on last pipe member
# ( false | true; echo $? ) # -> 0
# ( set -o pipefail; false | true; echo $? ) # -> 1
exit_on_err()
{
[ $1 -ne 0 ] && { echo "ERROR $1: $2" >&2 ; exit $1; }
echo "OK: $2" >&2
}
dl_url()
{
fn=$( echo "${1}" | tr -d "[/:.?&=]" )
echo "start DL of $1" >&2
# store download Output to a compressed file
uclient-fetch "${1}" -O- --timeout=2 2> /tmp/abt/uclient-fetch_err | head -c "10m" | gzip > /tmp/abt/${fn}.new.gz
res=$?
[ $res -eq 0 ] || echo "dl_url FAILED: $1"
if [ $res -eq 0 ]
then
mv /tmp/abt/${fn}.new.gz /tmp/abt/${fn}.gz
grep "^Download completed" /tmp/abt/uclient-fetch_err >&2
else
rm /tmp/abt/${fn}.new.gz
[ -f /tmp/abt/${fn}.gz ] && res=0
echo "Download failed" >&2
fi
exit_on_err $res "dl_url $1"
# download ok or backup file there - keep file and output to stdout
gunzip -ck /tmp/abt/${fn}.gz
}
dl_all_urls()
{
echo "start DL LIST" >&2
for url in https://big.oisd.nl/dnsmasq2 https://raw.githubusercontent.com/hagezi/dns-blocklists/main/dnsmasq/pro.txt
do
dl_url $url
done
echo "end DL LIST" >&2
}
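# extract the bare domain from "local=/domain/..." or "address=/domain/..." lines and lowercase it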
clean_list()
{
sed -n "s~^\(local\|address\)=/\([^/]*\)/.*~\2~p" | sed "y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/"
}
remove_duplicate()
{
echo "start dedup" >&2
awk '!seen[$0]++'
echo "end dedup" >&2
}
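# drop any domain that exactly matches a line in the allowlist, if such a file exists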
remove_allow()
{
echo "start remove allow" >&2
if [ -f /tmp/abt/allowlist ]
then
grep -F -x -v -f /tmp/abt/allowlist
else
cat
fi
echo "end remove allow" >&2
}
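# append the entries of a local blocklist file, if one exists, to the piped-in list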
add_blacklist()
{
echo "start add block" >&2
cat
[ -f /tmp/abt/blocklist ] && cat /tmp/abt/blocklist
echo "end add block" >&2
}
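# wrap each domain back into dnsmasq blocklist syntax: local=/domain/#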
format_final_list()
{
sed "s~^.*$~local=/&/#~"
}
echo "start OVERALL" >&2
dl_all_urls | clean_list | remove_duplicate | remove_allow | add_blacklist | format_final_list | gzip > /tmp/abt/dnsmasq.new.gz
res=$?
if [ $res -eq 0 ]
then
mv /tmp/abt/dnsmasq.new.gz /tmp/abt/dnsmasq.gz
gunzip -ck /tmp/abt/dnsmasq.gz > /tmp/dnsmasq.d/adblock-lean
/etc/init.d/dnsmasq restart
echo "finished OVERALL with success" >&2
else
echo "finished OVERALL with ERROR $res" >&2
fi
exit $res
Super cool! I really like the way you pipe those functions together. Does working in the compressed space not eat up more CPU cycles? So for devices with more memory the downside would be lengthier processing time?
That is a nifty idea! Did you by any chance check the total peak memory usage for both methods? Would be great to compare
Hi, yes, but that is the idea - save some RAM at the cost of more CPU work. We could add a config option like "RAM_VS_CPU",
e.g. 100 means try hard to save RAM, 0 means try hard to save CPU.
In each function we could add different code paths depending on what is most important for the user - with a default of saving RAM...
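A minimal sketch of that idea (the option name, threshold and helper function below are hypothetical, not part of the script above): one stage picks a compressed or an uncompressed code path based on the setting.

RAM_VS_CPU="${RAM_VS_CPU:-100}"   # hypothetical option: 100 = try hard to save RAM, 0 = try hard to save CPU

store_list()   # read a list on stdin and store it under /tmp/abt/$1
{
	if [ "$RAM_VS_CPU" -ge 50 ]
	then
		gzip > "/tmp/abt/${1}.gz"   # compressed in tmpfs: less RAM used, more CPU spent
	else
		cat > "/tmp/abt/${1}"       # uncompressed: more RAM (tmpfs) used, less CPU spent
	fi
}

Each stage of the pipeline could consult the same setting in the same way.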
So I need some more knowledge of how tmp files work. When a command like awk works on a temp file, doesn’t it work directly on that file rather than copying it into memory? If so, how does piping gunzip into awk save memory? I’m asking out of ignorance here.
So, currently, the line
awk '!seen[$0]++'
takes the most RAM, because it holds the uncompressed (edit) deduplicated (/edit) list once in RAM. I'm thinking about better solutions, but this could be tricky. The idea would be that there is a lot of redundancy in e.g.
ad.abc.xyz.com
and ad.xyz.us
and you could store ad and xyz only once in memory - but to put things back together later you need some extra bookkeeping, and that also costs RAM...
So, I think no - awk reads a file and it does not matter whether it is on tmp or not. It does its processing and writes its output - here we redirect that to a tmp file again, but you could also store it on a USB stick...
As far as I know, the tmp filesystem can use up to about 50% of the device's RAM. The rest is available for running programs like awk etc...
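For anyone who wants to watch this trade-off on their own router, two standard BusyBox commands show both sides of it:

df -h /tmp    # size and usage of tmpfs (typically capped at about 50% of total RAM)
free          # overall RAM usage - files kept in /tmp count against this too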
Is zram-swap not an option?
The awk seen command is using the most memory out of any of the functions used here. I'm not sure a compressed file will actually reduce the awk memory usage for this particular function, but it is worth checking total peak memory use both ways.
I believe awk creates a hash table in memory for the whole file, instead of doing a straight byte comparison, which is also why it's so much faster than pure byte-comparison methods.
I could try a sed method to remove duplicates only for low-RAM devices. But I'd have to check whether it even saves memory first. And it will be slower than awk by some margin.
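Another variant that might be worth measuring alongside the sed idea (just an illustration, not something from the script above; the file names are placeholders) is replacing the hash-based dedup with sort -u. On OpenWrt, sort is usually the BusyBox applet, which may still hold all lines in memory, unlike GNU sort which spills to temporary files, so only a real measurement will tell whether it helps.

gunzip -ck /tmp/abt/list.gz | sort -u | gzip > /tmp/abt/list.dedup.gz
# sort -u changes the line order, which should not matter for a dnsmasq blocklist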
Things will get better with it, but with some more effort on the code we could do even better than with zram.
seen is not a command, it is simply an array - store each line into the array, and if it is not already in the array, print the line - that is how this simple code works.
Instead of seen you could also write "S" or "some_array".
Yes correct, seen just happens to be the commonly used variable name for this array and function.
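For anyone following along, the one-liner does the same as this longhand awk program (only the pattern is golfed; the array name is arbitrary):

awk '{ if (!seen[$0]) print $0; seen[$0]++ }'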
Ah, I think I get the gunzip approach now. The point is that by working with pipes, gunzip doesn't uncompress the whole thing; it uncompresses chunk by chunk as the awk processing proceeds.
Yes - as far as I know, each process has a 4 kB input buffer and a 4 kB output buffer, plus maybe a 4 kB pipe buffer inside the OS - so roughly 12 kB in total per pipe in place.
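To make the streaming point concrete, here is that dedup step written as a stand-alone pipeline (file names are placeholders): gunzip and gzip only ever hold a few kB of buffered data at a time, so the only large in-memory structure is awk's table of unique lines.

gunzip -ck /tmp/abt/blocklist.gz | awk '!seen[$0]++' | gzip > /tmp/abt/blocklist.dedup.gz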