antonk
781
I had a second idea, but it turned out not to be viable, so for now that's it.
antonk
782
Actually, another idea is to switch to using raw domains lists and to change the order in which lists are processed.
Currently (counting only the heavyweight steps) we have:
- Case conversion (tr) - not sure how resource-intensive it is, but counting it anyway
- Sanitization + conversion to `local=/domain/` (sed)
- Allowlist entries removal (awk - this is the slow one)
- Rogue entries check (sed)
If we switch to raw lists, we could do this instead (sketched below):
- Case conversion (shorter lines to process)
- Sanitization (shorter lines to process)
- Allowlist entries removal (shorter lines to process and fewer internal awk commands)
- Rogue entries check (shorter lines to process)
- Conversion to `local=/domain/`
While this introduces a 5th processing step, it might well cut down on the processing time.
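A minimal sketch of that raw-list order as a single pipeline (the exact commands are illustrative assumptions, not adblock-lean's actual code):

```sh
tr 'A-Z' 'a-z' < raw_list.txt |            # 1. case conversion
  sed '/^[[:space:]]*$/d; /^#/d' |         # 2. sanitization (strip blanks/comments)
  awk 'NR==FNR { allow[$0]; next } !($0 in allow)' allowlist.txt - |  # 3. allowlist removal (exact matches only, for illustration)
  grep -E '^[a-z0-9_.-]+$' |               # 4. rogue entries check
  sed 's|^|local=/|; s|$|/|'               # 5. conversion to local=/domain/
```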
antonk
783
^ This idea doesn't work, unfortunately. dnsmasq just says:
Name does not resolve at line 1 of stdin
So it looks like there is a de facto line-length limit. It doesn't work even with shorter lists, for example with this one, which only has 338 lines:
https://raw.githubusercontent.com/hagezi/dns-blocklists/main/dnsmasq/native.winoffice.txt
Although with this one I got a different error:
bad option at line 2 of stdin
Which is odd, because that file only has one line (and a trailing newline).
Try piping the list of domains through xargs -n 20 and then adding the local=/ at the beginning, replacing spaces with /, etc. Keep increasing the number per line until you hit the limit.
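Spelled out, that suggestion amounts to something like this (the packing factor of 20 is just a starting point):

```sh
# pack 20 domains per line, then turn each line into one multi-domain
# entry; dnsmasq accepts several domains per local= line
xargs -n 20 < domains.txt | sed 's| |/|g; s|^|local=/|; s|$|/|'
```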
antonk
785
Trying the 2nd idea now. I got it to work (currently only with blocklists). I wanted to compare the performance, but there is an unexpected issue. It turns out that Hagezi lists with the same name have a vastly different number of entries in the 'domains' version vs the 'dnsmasq' version. For example, the 'pro' list has 165,381 entries in the dnsmasq version and 514,896 entries in the domains version. The 'tif' list also has a different length (703,115 vs 1,233,022). The good (?) news is that for the 'tif' list, the domains version is 1,233,022 / 703,115 ≈ 1.75 times larger, while its processing time is 12s vs 7s with the dnsmasq version, i.e. ≈ 1.71 times slower. So per entry, processing domains is probably a couple percent faster.
Although with 1s precision it's impossible to tell, of course.
I'll put this branch on my GitHub in a moment so you guys can test under more realistic conditions, with a better chance of seeing a performance difference. Although ideally we should come up with a list which has the same length in both the domains and the dnsmasq versions.
antonk
787
Brilliant! Exactly the same list.
So I've put the experimental version on my GitHub:
https://github.com/friendly-bits/adblock-lean/tree/faster_processing
If you wanna test, please get rid of any allowlists for now (the code is not ready to process them).
On my VM, the time is exactly the same, but then maybe it's just limited by my download speed (?). It's hard to believe that processing 13.1MiB of data takes the same time as 18.46MiB, even with the additional processing step. Or maybe that's just wishful thinking on my part.
Lynx
788
Should we support both list types and process accordingly?
Could you give example blocklist_url strings for the tests?
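Presumably something like this pair for the 'tif' list (hypothetical config lines - the `blocklist_urls` variable name and the dnsmasq URL are my guesses)?

```sh
# dnsmasq-format list:
blocklist_urls="https://raw.githubusercontent.com/hagezi/dns-blocklists/main/dnsmasq/tif.txt"
# raw-domains ("wildcard") counterpart:
blocklist_urls="https://raw.githubusercontent.com/hagezi/dns-blocklists/main/wildcard/tif-onlydomains.txt"
```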
antonk
789
Lynx
790
Regular code:
Processing time for blocklist generation and import: 0m:42s.
Wildcard variant:
Warning: Rogue element: '0.beer' identified originating in blocklist file part from: https://raw.githubusercontent.com/hagezi/dns-blocklists/main/wildcard/tif-onlydomains.txt.
Heh - beer is not always a good thing then!
antonk
791
My bad - should work in the updated revision.
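('0.beer' is a valid domain, by the way - DNS labels may start with a digit - so the check has to accept digit-leading labels. Just as an illustration, not the actual code, something along these lines:)

```sh
# hypothetical rogue-entry check accepting digit-leading labels like '0.beer'
grep -E '^([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)+[a-z0-9-]+$' blocklist_part
```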
antonk
792
Also, I got idea 1 to work when limiting it to 5 source lines per `local=` entry.
Apparently dnsmasq has a hard limit of somewhere around 1024 characters per line, and some lists have some very long domain names. So it would error out with 8 source lines per entry but passes with 5.
Again, I can't see a performance improvement on my VM, but it does cut the resulting (uncompressed) file size from 18.46MiB to 14.17MiB - about a 23% reduction. I'd guess it should also be somewhat faster on a real router. I'm waiting for you guys to test the performance with the other optimization before updating the code to include this one.
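For what it's worth, since the real constraint is the ~1024-character line limit rather than the number of source lines, packing by character budget would get closer to the limit safely even when long domain names show up. A rough awk sketch (the 1000-character budget is an assumption that leaves some headroom):

```sh
awk -v max=1000 '
  {
    # flush the current entry if adding this domain would exceed the budget
    if (len && len + length($0) + 1 > max) { print line "/"; len = 0 }
    if (!len) { line = "local=/" $0 } else { line = line "/" $0 }
    len = length(line)
  }
  END { if (len) print line "/" }' domains.txt
```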
Lynx
793
OK, it works now. I ran it twice and got the same processing time both times:
Processing time for blocklist generation and import: 0m:42s.
antonk
794
Could you try the updated revision? It combines 5 lines into 1.
Also, please try with GNU sed if you can, just to have the complete picture.
Lynx
795
New wildcard code without sed:
Processing time for blocklist generation and import: 0m:39s.
New wildcard code with sed installed:
Processing time for blocklist generation and import: 0m:32s.
And now existing code with sed installed:
Processing time for blocklist generation and import: 0m:40s.
So a big improvement, I think.
antonk
796
So I gather that we are getting an 8s reduction both with GNU sed and without (and GNU sed is faster by an additional 7s). That's in addition to the size reduction in the final blocklist file. Seems like progress, no?
We just need to test domain resolution performance now.
Also, I want to point out that this is a sketch, not really the final version. In the final version we'll probably need to do the conversion in `generate_and_process_blocklist_file()`, since otherwise `sort -u` is basically useless.
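To illustrate: once several domains are packed into one `local=` line, identical domains can end up in different groups, and `sort -u` no longer deduplicates them. So the dedup has to happen while the list is still one domain per line, roughly like this (illustrative commands, not the actual function body):

```sh
# inside generate_and_process_blocklist_file(): dedupe first, then pack
sort -u all_domains.txt | xargs -n 5 | sed 's| |/|g; s|^|local=/|; s|$|/|'
```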
Lynx
797
Definitely progress.
Am I testing the right way, in that with the new code I should be using the wildcard list?
If so: with busybox sed, the old code (regular list) gave 42s and the new wildcard code gave 39s - a 3s reduction.
But with GNU sed, the new wildcard code gives a big jump down to 32s.
antonk
798
Yes, you are.
And yes - for some reason I remembered you saying 47s with the current master and busybox sed, so my calculation above was wrong. It's a 3s reduction with the wildcard domains code + combining 5 lines into 1, and an additional 7s reduction with GNU sed and the new code.
More changes to the code will still follow, so the performance might change again. In the meantime, maybe you could put together some simple script to test domain resolution time? I suppose you could use an allowlist as the source of domains to test.
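Something like this crude loop might do as a starting point (the filename handling and the 127.0.0.1 server are assumptions, and busybox date only gives 1-second granularity):

```sh
#!/bin/sh
# time a lookup of each domain listed in $1 against the local dnsmasq
while read -r d; do
  [ -n "$d" ] || continue
  t0=$(date +%s)
  nslookup "$d" 127.0.0.1 > /dev/null 2>&1
  t1=$(date +%s)
  echo "$d: $((t1 - t0))s"
done < "$1"
```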
antonk
800
Maybe you need to test for domains which are especially slow to resolve (if that's a thing? I don't have enough background knowledge on this); if those are in the list, remove them to get a more representative result.