These are more or less arbitrary numbers which @Wizballs came up with, but they seem to fit all memory capacities well. They are set in mk_preset_arrays.
So is target lines count a kind of desirable number of lines per blocklist file?
I am trying to make sense of:
Yes, exactly, the optimal final entries count. The code is sometimes calling them 'lines' for historical reasons
Hmm but why is the max file part size set like this:
Like that's the max size per file part, right?
The 'mini' preset is a special case since it includes 1 URL. So it makes sense to set part size equal to blocklist size.
My logic was that we want to have some headroom with the blocklist sizes. So the calculated max blocklist size is 25% larger than the target size (which is target lines count * 20 B).
Max part size is calculated the same way, except we do not multiply by 1.25. This effectively allows one of the included lists to be much larger than the others, as is the case with the 'large' preset, where TIF is fairly dominant and its size fluctuates a lot.
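For reference, the arithmetic described above could be sketched roughly like this. The function names and the BYTES_PER_ENTRY variable are made up for illustration and are not the actual code in mk_preset_arrays. It also assumes 1 KB = 1024 B and integer rounding, which may differ from what the script really does.

```sh
#!/bin/sh
# Sketch only: names below are hypothetical, not taken from adblock-lean.

# Assumed average size of one blocklist entry, per the discussion (20 B).
BYTES_PER_ENTRY=20

# target_size_KB <target_lines_cnt>
# Target blocklist size in KB (assumes 1 KB = 1024 B).
target_size_KB() {
	echo $(( $1 * BYTES_PER_ENTRY / 1024 ))
}

# calc_max_blocklist_size_KB <target_lines_cnt>
# Target size plus 25% headroom.
calc_max_blocklist_size_KB() {
	echo $(( $(target_size_KB "$1") * 125 / 100 ))
}

# calc_max_part_size_KB <target_lines_cnt>
# Same as the target size, without the 1.25 multiplier.
calc_max_part_size_KB() {
	target_size_KB "$1"
}
```

With a target of 100000 entries this gives a max part size of 1953 KB and a max blocklist size of 2441 KB. For a single-URL preset like 'mini', the part size would simply be set equal to the blocklist size instead.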
Nice. From what I have considered so far this all seems reasonable.
What happens if no preset is used?
These calculations only play out when running the setup and gen_config commands, where the user picks one of the pre-defined presets. Otherwise the user sets the values as they like.
Also the way it's currently implemented, presets are only meaningful while running these commands. After the values are set, the code never checks whether they correspond to any preset.
Certainly really nice to get these adjusted in a sensible way like this. Looks good to me.
Just need to update the readme then (with the auto-generated values) and we're good to merge.
Actually, I'm thinking: since the algorithm for these calculations is pretty straightforward and should be good for most use cases, does it make sense to make these settings optional and have an option like target_entries_count, which we then use to automatically calculate the values when these options are not set? On one hand, this adds yet another option. On the other hand, this allows the user to ignore 3 options which may not be very obvious to set.
I'm not so sure, from the perspective that users will probably care more about memory footprint. It's clear what it means to say "don't use more than 100MB", whereas a maximum number of lines (target lines?) seems a little more obscure, doesn't it? Also I wonder: maybe the label 'target' isn't so apt, because it's more an upper limit of sorts than a target, isn't it?
Should we perhaps rather just have a single value relating to maximum memory to allow adblock-lean to use? So maybe keep the maximum and calculate the maximum individual part size based on that somehow?
The tradeoff is between memory footprint and adblocking effectiveness, so these are the 2 parameters that a reasonable user should care about. Entries count represents the latter for this matter, in a way that seems straightforward, no?
I agree that the current preset-related code infers entries count from total available memory (by setting the *_cnt variables which you asked about above), so it does make sense to generalize this. However, I'm not sure that we actually can generalize this, because memory use is tricky business, and because while we have a fairly good idea of bytes per entry in our default Hagezi lists, other lists may have different statistics. Also, if we ask the user to specify max memory footprint, I think it may be awkward to communicate to them what we are actually doing with this figure, and the fact that (at least with the current code) we cannot guarantee that this figure will not be exceeded.
I do see that communicating the idea of target entries count may be not very straightforward, but I think that it's at least doable. Perhaps we could call it total_domains_count and add something like this in comments:
The total expected count of domains in the final blocklist.
If set, adblock-lean uses it to calculate appropriate values for
min_good_line_count, max_file_part_size_KB, max_blocklist_file_size_KB.
Otherwise, you need to fill in these options.
What do you think? I'm not locked on this idea and if you think it's too complicated or not useful, I'm fine with giving up on it.
Perhaps @ninjanoir78 could comment on the above idea from user's perspective.
Right now I'm using the flint2 mt6000 with a lot of space and memory so it is not a big deal for me.
I'm inclined to think specifying a number of lines is not as straightforward as setting a memory storage figure. I imagine that a user has a sense of "I have 100MB to spare", but doesn't have a clue about how many lines of a blocklist is too much.
@Wizballs any thoughts?
In principle, I don't mind this approach. I just can't see how to implement this in a way that would match user's expectations when they set this figure, for reasons I wrote above. The figures we currently have (Blocklist of N domains is the sweet spot for devices with X MB memory capacity) are just empirical data for a set of defined total memory sizes. I don't know how to come up with a generalized algorithm. We know the average bytes per entry for Hagezi lists, but we don't know how much memory this actually takes once processed and loaded into dnsmasq.
If you have an algorithm, then show me.
Maybe it's the red wine talking, but I see the merit in setting max blocklist size and part size based on preset, but I struggle to see how getting the user to set a maximum number of lines works, because all the user knows about is spare RAM and not numbers of lines.
First of all, setting this figure manually will be only relevant when the user wants to have a non-default set of URLs. For default presets, we calculate all this automatically anyway without asking the user.
Now if I was a user who wants to manually set their own URLs, I would think of it this way: I want to block as many domains as possible. Let's see how many I can fit into my available memory. Then I set total_blocklist_domains to an arbitrarily high number and experiment with different sets of URLs. Once I settle on a combination of URLs which is "just right", I know (based on adblock-lean output) how many domains are included in the final list. Then I use that value to set total_blocklist_domains and I don't need to bother with the other 3 options.
So this requires a bit of trial and error but is not so complicated, I think. With the current options, the user would still need to do the same trial and error, except they would need to change the values of 3 options, first at the start of the process and then at the end of the process, and they need to understand how to set each of these 3 options in an optimal way. So to me this seems less straightforward.
We could experiment a bit with blocklists of different sizes and check actual memory used by dnsmasq and see if we can come up with some way to infer used memory from domains count, and the other way around. If we can then perhaps we could implement your idea of user setting the maximum memory reserved for adblocking and the code using that value to calculate the figures for max downloaded part size and max blocklist size.
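A rough sketch of how such a measurement could be scripted, assuming a Linux system where /proc/<pid>/status exposes a VmRSS line. The helper names are hypothetical. RSS is only one approximation of "memory used by dnsmasq", so the resulting figure would be empirical at best.

```sh
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of adblock-lean.

# rss_kb_from_status <path-to-status-file>
# Prints the VmRSS value (in kB) from a /proc/<pid>/status style file.
# Returns 1 if no VmRSS line is found.
rss_kb_from_status() {
	while read -r key val rest; do
		case "$key" in
			VmRSS:) echo "$val"; return 0 ;;
		esac
	done < "$1"
	return 1
}

# bytes_per_domain <rss_before_kB> <rss_after_kB> <domains_cnt>
# Approximate memory cost of one blocklist entry, in bytes (integer math).
bytes_per_domain() {
	echo $(( ($2 - $1) * 1024 / $3 ))
}
```

The idea would be to call something like rss_kb_from_status "/proc/$(pidof dnsmasq)/status" once with an empty blocklist and once after loading a blocklist of known size, then feed both readings and the domains count to bytes_per_domain. Repeating this for several blocklist sizes would show whether the relationship is close enough to linear to infer memory use from domains count.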