But there is still the blocking issue. Yes, it won't block at the beginning, but it will block when the pipe is full (65,536 bytes?).
My round robin idea wouldn't suffer from this issue though because downloaded data is always passed on for processing.
You are correct when saying that awk is much faster for certain tasks, and we are using awk in some processing steps. Calling a binary has its own overhead, though, so we are only doing this where it makes sense. Anyway, when it comes to coordinating I/O, awk is much less capable than shell as it is not really intended for this purpose.
Feel free to share your ideas. Also, if you can write an awk script, you are not necessarily significantly less skilled in programming than myself or @Lynx. We just probably have a bit more experience.
We could, at least in theory, if we wrote adblock-lean in C or Rust or another "real" programming language. Then we could write sophisticated processing methods which hop from one input to another. With shell, the processing tools available to us are mainly sed, grep and awk, which do not have such capabilities. Trying to split up the data stream and process it in chunks is an idea I tried, and it was actually slower rather than faster.
How about incorporating a technique along these lines:
Explanation here:
So I haven't studied this in detail but it seems that it's possible to have awk issue system calls to manage pipes and write to them.
See especially:
That implementation works around awk limitations by essentially piping the data through gzip into temporary files, then running sort on each of those files. I did implement and test a very similar idea, just not inside awk and avoiding the intermediate gzip and temporary files. It worked; the problem is that it was slower. Something like 18s vs 23s, so slower by about 30%. This could be useful to reduce the memory footprint of sort, however, as we discovered a few months ago, the largest memory hog in our processing chain is dnsmasq, not sort. If the device doesn't have enough free memory for a normal sort of the input then it definitely doesn't have enough free memory to load the final blocklist into dnsmasq. And if the memory temporarily gained by splitting up the sort is free and available to begin with, then I don't see any justification for paying for it with reduced performance.
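For reference, the general shape of that split-sort-merge idea is something like this (illustrative only, not the exact code I tested; paths and chunk size are made up, and sort -m may not be available in busybox sort):

# split the input into fixed-size chunks, sort each chunk in parallel,
# then merge the pre-sorted chunks cheaply with sort -m
IN=/tmp/combined_lists.txt
TMP=/tmp/sort_chunks
mkdir -p "$TMP"
split -l 200000 "$IN" "$TMP/chunk."
for f in "$TMP"/chunk.*; do
    sort "$f" > "$f.sorted" &
done
wait
sort -m "$TMP"/chunk.*.sorted > /tmp/merged_sorted.txt
rm -r "$TMP"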
I thought that technique could instead be leveraged to round-robin lines from a download across multiple processors, and then to output the processed lines to files corresponding with the downloads.
X downloaders and round robin to Y processors. Then pass on processed data to X files corresponding to the X downloads.
This way even a single file can be split up across multiple cores. And everything is balanced.
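To illustrate the splitter half, I'm imagining something along these lines (purely a sketch; the URL, FIFO paths and worker count are made up, and each FIFO needs a consumer already reading from it or the open blocks):

# one downloader feeding Y worker FIFOs, one line to each in turn
Y=4
i=1; while [ "$i" -le "$Y" ]; do mkfifo "/tmp/worker_$i.fifo"; i=$((i+1)); done
wget -qO- "https://example.com/blocklist.txt" |
awk -v Y="$Y" '{
    w = (NR - 1) % Y + 1
    print > ("/tmp/worker_" w ".fifo")    # awk keeps each FIFO open after the first write
}'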
Well, I don't know how to implement the round robin idea in shell code in a way that doesn't degrade the performance. Also, if this was possible to implement, it would only be useful for cases where the amount of downloaded data is large in comparison to the bandwidth. For adblocking, we are talking about, let's say, 50MiB of data max, which on a 100Mb/s (megabit) connection takes 4s to download if the servers allow (plus some connection-establishing overhead). Even if we assume that the effective bandwidth to the server is 1/4 of that already very conservative number, it's 16s download time. So at least for this application, I'm not seeing a good reason to seriously try a complex implementation for the sole purpose of shortening the download time. But if you want to implement this for fun then why not.
P.s. I suppose one could redirect data in awk, in chunks, and try to coordinate multiple such awk programs running in parallel, but I can't imagine a way to establish such coordination, especially to establish it reliably, considering that data flows through pipes as bytes, not as lines, and that the kernel only guarantees atomic delivery for very small chunks (see the man page I linked to above); if you send larger chunks then atomicity is not guaranteed, i.e. bytes sent from multiple sources may get mixed up (at least to my understanding). [edit: after re-reading the man page, I think the data will only get mixed up if not sent sequentially, but this probably implies that one process needs to wait for another process to finish sending its data before writing its own chunk to the pipe, which means that the data flow must be interrupted during synchronization]
All this before even considering the performance overhead of piping data through another processing step which does some logic on every line.
But doesn't this round robin technique help ensure maximum utilisation of cores even if there is only one file to download, or perhaps just one large file and two small files?
Since otherwise, with just one download file per core, the processing load will be mainly taken up by a single core, or at least unevenly spread.
In terms of implementation, would it not work to have X awk processes, one awk process for each download file, with each awk process passing lines round robin between Y processing processes, and then for the Y processes to do their usual processing except that results are passed to Y awk processes that share out lines back to X files, whereupon the compression occurs, with each file corresponding to the download file?
I realise I might be missing something with the above, but I wonder if splitting out and reconstructing in this way might offer a way to evenly distribute incoming data across processes that doesn't care about how many download files there are or what size they are. That is, overcome the issue that the resource allocation depends on the number and relative size of the download files, such that mostly only a single core gets utilised.
To give an example, say you have one 20MB blocklist and one 500KB blocklist. If we just allocate all of the first 20MB file to one core and all of the 500KB to the second core then the second core will hardly get used. And if there are more cores then we can't even use those.
Whereas by sharing out the lines across the cores we ensure all are used whatever the download list number and relative size.
The dance involving splitting out from X to Y and then merging back from Y to X is to retain the capability to discard a bad blocklist despite the continuous and parallel processing.
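And for the merge-back half, I imagine tagging each line with the index of its source download on the way in, so that a single awk can route processed lines back into per-download files (again just a sketch with made-up paths; the tab-separated "index, then domain" line format is an assumption):

# processed lines are assumed to arrive as "<download index><TAB><domain>"
awk -F '\t' '{
    out = "/tmp/processed_list_" $1 ".txt"   # one output file per original download
    print $2 > out
}' /tmp/processed_stream.fifo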
There are 2 separate things which this technique could theoretically help with, if it worked. One is reducing the download time - this one is insignificant because the files are typically small compared to the bandwidth. The other one is the processing time. I'm just saying this in order to have everything clear.
Now if we discard the download time gain as insignificant, we are left with the question: will splitting data into chunks, processing those chunks in parallel and then recombining the data reduce the processing time?
The answer to this is somewhat nuanced because there are different scenarios to consider, and different hardware will have a different number of cores and those cores will have different performance. In general, we can also address this in terms of bandwidth, in this case the bandwidth of each core while processing the data. That bandwidth depends on 2 things: the load (i.e. processing + overhead) and the core's capability to handle the load.
Now let's recognize that in the case of processing, just like in the case of download, the data is not endless. Any extra processing overhead will only be justifiable if the data volume compared to the processing bandwidth is high enough. For example, if the data consists of 4 bytes which take some nanoseconds for a single core to process, and the extra overhead is just launching 4 awk programs (ignoring the overhead internal to processing data with awk), then the performance will definitely be degraded just by that overhead. This limits the scenarios in which this sort of split can be useful in the first place.
Now consider my x86 CPU, as an example. It takes ~10s for it to process ~1.1M entries on one core with the current algorithm, fed from a single file. We are talking about the whole thing, from start to finish, including the ~85% of the time spent on processing stages which we can not parallelize (2nd processing stage, running dnsmasq --test
and loading the file into dnsmasq). If we calculate the remaining ~15% of that time, it's less than 2s. This includes overheads associated with our current scheduler and connection establishment overhead. So I think it's safe to say that the actual processing time (of one core) which any further parallelization would affect in this case is realistically less than 1s, and I can't tell how much less because of measurement precision. So I think we can agree that trying to implement a quite insanely tricky system in order to reduce this time is simply not worth it. With your router, if processing the currently largest file (i.e. the tif list) on a single core takes about 50s start to finish, then we are talking about ~7.5s of time spent on download + scheduler overhead + processing. So let's say that without connection establishment overhead and scheduler overhead, it's 6s. I would argue that here it's not worth it either. This puts another constraint on where parallelization at the level of data chunks will be worth implementing: sufficiently slow cores.
Now we also want a sufficiently large number of those slow cores. Generally, the extra processing overhead imposed by splitting up and recombining the data will add some minimum processing time, and if, for example, you have 16 cores then that time is better spent than if you have 2 cores. However, in practice, slow router CPUs normally have few cores, normally just 2. Moreover, in practice, routers with slow CPUs also normally have a limited RAM capacity, which imposes another constraint, and that constraint may be incompatible with the data volume vs processing bandwidth constraint (i.e. less RAM -> less data to process -> less worth it).
I'm writing all this to explain why I think that a case where such a technique would be justifiable and useful even theoretically may be very rare or may not even exist, before considering the practical aspects of the implementation.
I kinda have to do other things today so I'll leave my thoughts about the practical aspects for later.
In the meantime, here is a video which demonstrates a practical example of a more efficient single-threaded algorithm outperforming a less efficient multithreaded algorithm (be prepared for some f* words, hehe).
Edit: forgot to mention another constraint: single-core processing bandwidth needs to be sufficiently low compared to download bandwidth. For example, if your download bandwidth is 100kb/s (kilobits) then your core has all the time in the world to keep up with the download, even if the core is really, really slow. This is perhaps the least significant constraint of all because most people have fast connections, but I thought it was worth mentioning because it does matter in some cases.
From the peanut gallery:
a) How often do you need/want to run this?
b) Do you assume a typical router has nothing else it needs CPU cycles for?
c) Maybe the existing approach, while not perfect, is already "good enough"?
The existing approach in adblock-lean is definitely more than good enough; I'm pretty sure it's the fastest and most memory efficient domain blocking solution on OpenWrt right now. Even with a super crappy router one can now reach circa one million domains.
I think this degree of further optimisation is all just for fun; whether my router takes 50s or 30s to complete the processing required for blocklist updating at 5am or so doesn't make so much difference.
Proceed with the fun then; ignoring the peanut gallery is often a good idea
Maybe you have some ideas about the parallelisation issue though? All your ideas basically made cake-autorate what it is. The peanut gallery can be a very good source of protein!
All seems fair enough.
Did you already try parallelising the dnsmasq --test
aspect? Since this consumes so much processing time, wouldn't this be worth experimenting with?
dnsmasq --test
is the least time-consuming part of the 2nd processing stage, at least on my machine. I think it takes about 1s for a very large final list. zcatting, merging, sorting, converting the final list to dnsmasq format and then gzipping it takes longer, and loading the result into dnsmasq takes the longest (essentially running service dnsmasq restart
).
While dnsmasq --test
does not take much time, it does consume quite a bit of memory. Less than loading the list into dnsmasq but more than any other processing step. I've never seriously considered parallelizing this because, first, there's not much time to be gained there, and second, when parallelized, this has the potential to become the largest memory hog in the process, effectively raising the minimum acceptable memory bar.
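Roughly, the shape of that stage is as follows (a simplification for illustration, not the actual adblock-lean code; file names are made up and local=/.../ is just one possible dnsmasq blocking syntax):

zcat /tmp/part_*.gz |                      # decompress the per-list results
sort -u |                                  # merge and deduplicate
awk '{ print "local=/" $0 "/" }' |         # convert to a dnsmasq blocking format
gzip > /tmp/blocklist.conf.gz

gunzip -c /tmp/blocklist.conf.gz > /tmp/dnsmasq.d/blocklist.conf
dnsmasq --test -C /tmp/dnsmasq.d/blocklist.conf   # sanity-check the generated config
service dnsmasq restart                           # load the result into dnsmasq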
About the implementation of chunk processing parallelization.
Let's visualize our task this way: we have 255 digits in a certain order filling a line in our terminal from left to right. We need to move that line from the bottom of the screen to the top of the screen. Moving each digit takes some time. The digit which arrives first is the one written first, regardless of the order the digits were in at the beginning. Data integrity must not be disturbed, i.e. the resulting data must be the same as the initial data.
The simple approach: one worker moves one digit at a time. Nothing can go wrong here and there is no overhead.
The chunk-parallelized approach:
Let's say that we have workers A, B, C. Let's say that we decide that each worker will move 8 digits at a time. If more than 1 worker is moving digits simultaneously, obviously the resulting data will not be the same as initial data (since data is written in the order in which it arrives), so only 1 worker can write data at any time. Now the kernel puts another constraint: before starting to write any data, each worker must wait for the previous worker to stop writing data - otherwise data integrity can not be guaranteed.
If we rely on workers being somehow magically synchronized on their own, most certainly we get corrupted data as a result because currently there is nothing keeping them synchronized. So we need to implement a synchronization mechanism.
Now, how to implement it is a separate issue. Basically the only way I can see is to have each worker, before writing its chunk, run a system() command inside awk which launches a subshell that sleeps until it receives the synchronization signal, then write the chunk, exit the subshell and skip the lines belonging to the other workers.
Now who will send the synchronization signal? We can not do this on the scheduler level because the scheduler doesn't know how long each worker will take to process its current data chunk. So workers will have to pass the signal from one to another: by (again) running a system() command inside awk, reading the PIDs of the other workers, figuring out which worker to send the signal to and then sending that signal. These steps get added to our synchronization sequence as well.
None of those operations are free. Creating a subshell takes time, performing I/O (twice) on the filesystem takes time, sending a signal takes time, receiving a signal takes time, exiting the subshell takes time and skipping N lines takes time. This is your synchronization overhead right here, but it's just the beginning. The consequence is that, quite possibly, some or even most of the workers will be sleeping most of the time waiting for synchronization signals. And we still did not account for the overhead of making N extra binary calls (N is the number of workers), and the overhead of processing the same data N times rather than once (skipping lines still means copying those lines in memory before discarding them).
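Just to make the overhead concrete, each worker's write step would have to look something like this. It's a sketch only: I'm using a mkdir-based lock instead of the signal passing described above to keep it short, but the cost structure of subshell, filesystem I/O and waiting is the same kind of thing:

LOCK=/tmp/writer.lock
write_chunk() {
    # spin until the lock directory can be created, i.e. no other worker is writing
    until mkdir "$LOCK" 2>/dev/null; do
        sleep 1                      # every wait cycle here is pure overhead
    done
    cat "$1" > /tmp/output.fifo      # write this worker's processed chunk
    rmdir "$LOCK"                    # let the next worker in
}

And even then, a plain lock only prevents interleaving; keeping the chunks in their original order still requires passing the turn from worker to worker, which is exactly where the signal scheme and its overhead come in.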
Now what are the chances of this whole operation being actually faster than the simple single-threaded processing? I say that chances are somewhat less than 0. But you are welcome to try if you really want to.
Well yes that approach in terms of waiting for signals and so forth seems painful.
But Linux already ensures that no two processes can write to a FIFO at one time.
https://www.gnu.org/software/libc/manual/html_node/Pipe-Atomicity.html
Doesnât that help?
I don't have a specific implementation in mind, and I get that coming up with one isn't easy, but I'm just imagining that it may be possible to implement such a round robin concurrent approach in an efficient way by leveraging FIFOs. At least it seems worth exploring.
Did you consider the way that script I pasted above operated? Does it offer any helpful insights?
That link doesn't say that. Also, I just tested this:
mkfifo /tmp/1
wget google.com -O- > /tmp/1 & wget google.com -O- > /tmp/1 &
cat /tmp/1 > test.html
I inspected the resulting test.html file and the closing </html>
tag of the first download is on line 23 while the opening <html [...]>
tag of the second download is on line 12. Which means that the data got interleaved.
Also here is a direct quote from the man page:
POSIX.1 says that writes of less than PIPE_BUF bytes must be
atomic: the output data is written to the pipe as a contiguous
sequence. Writes of more than PIPE_BUF bytes may be nonatomic:
the kernel may interleave the data with data written by other
processes.
Which, to my understanding, implies that nothing is stopping multiple processes from writing to the same pipe simultaneously.
If you write chunks <= PIPE_BUF these should stay uninterrupted, no? Not sure that helps with your data though...
Yes but that would force processing in very small chunks and this doesn't actually remove the requirement of synchronization, while dramatically increasing synchronization frequency, and, as a consequence, the contribution of synchronization overhead.
(Also, lines do not align to a pre-defined length, so more processing of the lines would be required, and this would impose the requirement to synchronize not just timing but also which line each process should start its next chunk with, which obviously makes the synchronization much more tricky and costly.)
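For what it's worth, here is a quick way to see the line-level behaviour. A sketch only: in most shells each printf builtin call is a single small write well under PIPE_BUF, so individual lines should come out intact, but the ordering between the two writers is still arbitrary.

mkfifo /tmp/p
( i=0; while [ "$i" -lt 1000 ]; do printf 'A%s\n' "$i"; i=$((i+1)); done > /tmp/p ) &
( i=0; while [ "$i" -lt 1000 ]; do printf 'B%s\n' "$i"; i=$((i+1)); done > /tmp/p ) &
cat /tmp/p > /tmp/interleave_test.txt
wait
grep -c '^A' /tmp/interleave_test.txt    # expect 1000 intact A-lines
grep -c '^B' /tmp/interleave_test.txt    # expect 1000 intact B-lines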