Optimization of adblock-lean

This is not possible because the download pipes the data into the further processing chain. When the data starts to flow, all variables are already set.

What we could do is try to deduce the format from the URL, by having a list of known URLs and their corresponding formats.

Ah, personally I would go for the have one list per list type efficiency variant, but this is because you explained the alternatives well, including the benefit.
If I try to imagine how a naive user might look at this, I would tend towards the single mixed list approach instead, as this would buy the project some tolerance against to-be-expected beginner mistakes, saving support time and frustration.
But maybe adblock-lean users are advanced enough for that not to matter much?

(I use knot as DNS resolver (going to the root servers and avoiding my ISP's filtered DNS), so I am using adblock* simply because that supports knot/kresd out of the box, and my turris omnia is fast enough and has enough storage and ram; that is why I never tried adblock-lean and have no real opinion...)

*) I am on record about some unease about sourcing out decisions about what to allow and what not to allow into my network to third parties, but all in all I think this unease might be unfounded, I have not run into any non working URL that I actually wanted/needed to use...

2 Likes

Thank you, I think this is super well framed.

OK @antonk, I'm leaning towards the split blocklist_urls variable approach, but it'd be nice if we could bring everyone onboard.

@Wizballs, @Ree9 and @dave14305 has the posts above increased the attractiveness of the split variable approach at all?

I'm curious where everyone stands now:

  • Use unified bloklist_urls variable
  • Use split blocklist_urls variable
  • Abstain because the pros and cons seem to balance out
  • Abstain as a token of gentlemanly diplomacy
0 voters

After seeing @Wizballs numbers, I re-ran my tests with both busybox sed and gnu sed. These numbers are different from earlier, as I ran gen_config before each set of tests, and then switched from tif.mini to tif. So earlier tests were light + tif, while these tests are pro + tif.

busybox sed averages 26.3s vs 30.3s, which is about 15% difference.

gnu sed averages 23.6s vs 24s, which is about 1.5% difference.

If we weren't already recommending users install coreutils-sort and gawk for performance increases, I would lean towards the faster processing branch for the 15% savings.

But since we're already recommending coreutils-sort and gawk for performance increases, I would lean towards adding sed to the recommended list, which then means both branches are close enough to be considered tied in my eyes. So as a tie-breaker I'd go with whatever is simpler for a user, which would be the unified blocklist_urls variable.

faster processing - busybox sed
27s 26s 26s avg 26.3

faster processing - gnu sed
23s 25s 23s avg 23.6

unified format - busybox sed
30s 31s 30s avg 30.3

unified format - gnu sed
24s 24s 24s avg 24

EDIT: I voted after submitting this message, and I see I'm the lone dissenter. Story of my life :rofl:. I'm not at all against the split blocklist_urls option, and don't expect that the average OpenWrt user will be confused by it.

EDIT 2: gnu sed gives a healthy performance increase with either option for me, so I think recommending it along with coreutils-sort and gawk would make sense.

3 Likes

Probably most people will go with either the default or the lists recommended in the README. Then for the rest of the users, we can mention how to check list compatibility, which is not that complicated.

I agree. That said, as long as GNU sed is not an official dependency, we need to take into account that not everyone is using it.

1 Like

This list looks like a raw domains list at first glance. But it is more than twice as large as the preferred raw domains list format, due to whether subdomains were removed or not.

Recognizing a list that has subdomains removed is very important to optimize for memory consumption, since this script does not implement expensive subdomain removal like Adblock and Adblock-fast.

2 Likes

At least for Hagezi lists, the correct lists are easy to recognize by the "-onlydomains" filename suffix, and by this line in the comments inside the list:
# Syntax: Domains (without subdomains)

Seems you have a mandate for the split variable @antonk!

2 Likes

Ye, trying to navigate through all the transition logic rn.

1 Like

For those coming from master branch that supported only dnsmasq couldn’t we just move all the URLs to the dnsmasq variable?

1 Like

Yes, but I want to also make it easy to test, so it should restore the URLs which are already in the correct config entry.

So latest revision implements config file migration to /etc/adblock-lean, and URLs conversion for Hagezi and OISD urls (the OISD ones needed just very little code so no reason not to include them).

The way it works:
I won't go into much detail about the implementation, but basically it's supposed to recognize dnsmasq-formatted URLs in blocklist_urls/allowlist_urls etc, and offer to automatically convert them to equivalent raw domain URLs. Specifically it recognizes Hagezi lists and oisd lists. Unrecognized lists will automatically get moved to relevant dnsmasq* config options. The tif_ips Hagezi URL will get converted and moved to blocklist_ipv4_urls. It should keep other URLs intact, like if you already have something in dnsmasq_blocklist_urls, these should stay after the conversion.

If the user doesn't approve URL migration, all their dnsmasq-formatted URLs will get moved to dnsmasq_* relevant options.

This was quite a messy job and took me a ton of time to get it right, and there might be some bugs yet.

Once the URL transition is finished, the file /etc/adblock-lean/.new_urls_format is created, and that is later used to avoid nagging the user again.

Of course, needs testing.

1 Like

This is great, and I look forward to testing / trying to break it later on :wink:

Don't know why but I'm sort of relieved this is done. It's good just to know it's in the right spot now lol.

Can't thankyou enough, yet again!

2 Likes

P.s. the default local allowlist and blocklist file paths were also changed to /etc/adblock-lean and that will be written in the new config file after the update. This will take effect either with new installs or when the previously defined paths do not actually exist. If you had another path pointing to an existing file, this path should survive the update.

If you have another idea of where these files should be located by default, please suggest.

P.p.s. currently the implementation runs a sed command on each URL which needs conversion. Notably this is not the most efficient way of doing this but consumes the least lines of code. I may still work some more on this code to replace the sed commands with shell builtins, to make it faster.

1 Like

Sed conversion is surely fast anyway tho? And will probably only run one on most setups. Just trying to save you some work from that perspective...

It's pretty fast as is, but here we're running it on every single url individually. I mean, it still won't take longer than 0.3 seconds in total even in a very slow machine. I guess it's just my perfectionism speaking. Anyway, the replacement native shell code is already implemented, along with some bugfixes. Should push soon'ish.

Also I'm going to travel today for about a week, so not sure I'll be available here during that time.

1 Like

Pushed the updated revision.

2 Likes

Awesome, have fun! Enjoy your time away. I'll use the time for testing etc.

Much deserved break if that's what you are doing.

2 Likes

Barely caught the flight :slight_smile:
Also I feel that you need a break more than myself