Policy-Based-Routing (pbr) package discussion

I am sorry for the repeated messages, but I'm quite desperate as I'm out of ideas on what to do anymore, and it seems like I'm the only one experiencing this. I've extensively searched the internet, and the only remotely related issue I've found is:

But I can't make much sense of it.

Here are some other things I've tried so far:

  1. Setting a timeout using option nft_set_timeout '1h'. The idea was that perhaps too many IPs were causing a hardware bottleneck. However, this option didn't help; the same errors occur after a while, even with only a few IPs in the rules.

  2. Splitting my PBR rules so that no single rule contains more than 15 domains. This also didn't work.

  3. Turning off nft_set_auto_merge, but that didn't help either.

One thing I've noticed during this time is that restarting dnsmasq temporarily resolves the issue.

It looks like a DNSMasq problem on a 22.03 build, which is EOL, and it is related to passwall.
Furthermore, a lot of it is in Chinese, which is a language I do not read.

So there is not much I can do; I hope someone else can help you.

Just to be clear, I'm not using 22.03 with passwall. The only reason I mentioned that GitHub issue is that it is the only result I get when searching for these errors.

I'm using 23.05.5 with PBR 1.1.8-r10 and no passwall

From the log it's evident that it's the dnsmasq spamming. :wink: Are there any pbr activities in the logs prior to the spam? Like had pbr been recently restarted or did anything pbr-related happen before dnsmasq started having problems populating nft sets?

Do you know which domain resolves to 0.0.0.0 and why? I don't think it's a desired behaviour.

Also, the nft set in the status output is quite large. I can't recall what the nft set size limits are, but maybe you're nearing it. Have you tried splitting up that policy with the large list of domain names into multiple smaller policies?

PS. You can just use whatsapp.net instead of listing all third-level domains.

I don't recall seeing anything related to PBR or anything unusual in general before this issue starts, but I have set up remote logging now, so I will send you the full logs when it happens again.

Yes, I have pihole set up, and I assume some analytics domains get blocked, to which pihole responds with 0.0.0.0. I can change pihole to respond with NXDOMAIN. I don't think that's the issue tho, as I have seen it happen when there was no 0.0.0.0 in the rulesets, but I will still try it just to be sure.

Yes, I have tried splitting the domains into 10-15 domain groups, which did not fix the issue. I have also tried using the option nft_set_timeout '1h', so I never had more than 15-20 IP rules at once, but that also did not work.

The log indicates a problem with DNSMasq. The only thing I can find is that you add list dns '192.168.1.1' to all your interfaces.
That is not correct: this option sets the upstream DNS server for DNSMasq to use, but this is the address of DNSMasq itself, so in essence you create a loop.

Although this is wrong, DNSMasq will normally take care of it, so I do not think it is related to your problem, but it is better to remove it from all your interfaces.
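A rough way to find the affected interfaces is to filter the output of `uci show network`. This is only a sketch: the helper name `find_self_dns` is made up for illustration, and `lan` in the comments is an assumed interface name that may differ on your router.

```shell
# Print the name of every interface whose dns list contains the given address.
# Expects `uci show network` style lines on stdin, for example:
#   network.lan.dns='192.168.1.1'
find_self_dns() {
    addr="$1"
    # Keep only the dns entries, then those naming the address, then the section name.
    grep -E '^network\.[^.]+\.dns=' | grep -F "'$addr'" | cut -d. -f2
}

# Assumed usage on the router:
#   uci show network | find_self_dns 192.168.1.1
#   uci del_list network.lan.dns='192.168.1.1'   # repeat per listed interface
#   uci commit network && /etc/init.d/network reload
```

The filter only reports; the actual removal is left as a manual step so nothing changes until you have checked the list.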

Sorry, is it normal that from version 1.1.8-r6 to the latest, pbr needs to be restarted or reloaded before it works?

Thanks

It works for me, but you can try 1.1.8-r10, there are some subtle changes.

Thanks, I get the issue on 1.1.8-r10 as well.

If you want, here is the pbr status output on first boot: https://pastebin.com/j6D95hVu

Here is the status after a restart: https://pastebin.com/bJjReiq4

It seems that a lot of things are missing on nftset=

But that could be normal. The nftsets have to be filled by DNSMasq, and that happens only after DNSMasq gets queried. If you restart pbr or the router but not the client, the client still has its DNS cache and will not query DNSMasq, so the set is not filled.

See for some pointers about Domain Routing:

Some general focus points for Domain based routing:

You need to have dnsmasq-full installed to use nftsets (recommended); see the PBR README.
DNSMasq must be used as the DNS resolver, so the use of DNS hijacking needs special attention; see PBR DNS policies above.
The domains must first be resolved by DNSMasq before they are added to the set, so flush the DNS cache on router and client, or reboot both router and client.
It takes about a minute after Saving and Applying before services have restarted and routing is in place, so be patient!
Domain based PBR rules usually have to come first, so make sure those rules are on top in the GUI!
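The first and third points can be checked from the router's shell. The sketch below is an assumption-laden example, not part of pbr: `has_nftset_support` is a made-up helper that looks for `nftset` as a whole word in the compile-time option list dnsmasq prints, so that `no-nftset` does not count.

```shell
# Succeeds if the dnsmasq option list on stdin contains "nftset"
# as a whole word ("no-nftset" is not a match).
has_nftset_support() {
    grep -Eq '(^|[[:space:]])nftset([[:space:]]|$)'
}

# Assumed usage on the router:
#   dnsmasq --version | has_nftset_support && echo "nftset support present"
#   nslookup netflix.com 127.0.0.1      # make dnsmasq resolve a policy domain
#   nft list set inet fw4 <set name>    # then check that the set has elements
```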

Thanks, so how can I solve this without restarting pbr or reverting to 1.1.8-r4 with delay?

Not sure what you mean. After a reboot or restart (service pbr restart), the set is empty:

root@R7800-1:~# nft list set inet fw4 pbr_wan_4_dst_ip_cfg136ff5
table inet fw4 {
        set pbr_wan_4_dst_ip_cfg136ff5 {
                type ipv4_addr
                flags interval
                counter
                auto-merge
                comment "domain_via_wan"
        }
}

The set is filled once DNSMasq resolves the domain names. If I do an nslookup netflix.com (netflix.com is one of the domains) to make DNSMasq resolve them, the set is filled:

root@R7800-1:~# nft list set inet fw4 pbr_wan_4_dst_ip_cfg136ff5
table inet fw4 {
        set pbr_wan_4_dst_ip_cfg136ff5 {
                type ipv4_addr
                flags interval
                counter
                auto-merge
                comment "domain_via_wan"
                elements = { 52.214.181.141 counter packets 0 bytes 0, 54.170.196.176 counter packets 0 bytes 0,
                             54.246.79.9 counter packets 0 bytes 0 }
        }
}

For me this works the same for 1.1.8-r4 and 1.1.8-r10
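The empty-versus-filled check shown in the two listings can be scripted. This is a minimal sketch that relies only on the `elements = {` line visible in the output above; `set_is_populated` is a hypothetical helper name.

```shell
# Succeeds if the `nft list set` output on stdin contains an elements block.
set_is_populated() {
    grep -q 'elements = {'
}

# Assumed usage on the router (set name taken from the listing above):
#   nft list set inet fw4 pbr_wan_4_dst_ip_cfg136ff5 | set_is_populated \
#       && echo filled || echo empty
```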

Here are the logs of the issue starting to happen. Seems like dnsmasq crashes/restarts right after the error first happens :confused:

I always had a hard time understanding the DNS settings, so my solution was to shove 192.168.1.1 into every place that mentioned DNS :rofl:

For users in need of some user scripts, see:

for some inspiration :slight_smile:

After a restart or reload of the pbr service.

Thanks for your support, I'll revert to 1.1.8-r4.

Thanks for capturing the logs, it's not immediately obvious to me what's happening tho. Could it be the RAM limit/issue?

I can't link specific lines of the log, so I'll just copy/paste:

<30>Feb 15 09:28:43 OpenWrt dnsmasq[1]: 43732 192.168.1.104/39814 nftset add 4 inet fw4 pbr_wg_xray_4_dst_ip_cfg076ff5 157.240.227.61 g.whatsapp.net
<30>Feb 15 09:28:43 OpenWrt dnsmasq[1]: 43934 192.168.1.104/47542 nftset add 4 inet fw4 pbr_wg_xray_4_dst_ip_cfg076ff5 157.240.227.60 graph.whatsapp.com
<27>Feb 15 09:28:43 OpenWrt dnsmasq[14]: nftset inet fw4 pbr_wg_xray_4_dst_ip_cfg076ff5 netlink: Error: cache initialization failed: Protocol error
<27>Feb 15 09:28:43 OpenWrt dnsmasq[12]: nftset inet fw4 pbr_wg_xray_4_dst_ip_cfg076ff5 Error: Could not process rule: File exists
<30>Feb 15 09:29:33 OpenWrt dnsmasq[1]: 46662 192.168.1.104/53426 nftset add 4 inet fw4 pbr_wg_xray_4_dst_ip_cfg056ff5 151.101.129.140 reddit.com
<27>Feb 15 09:29:33 OpenWrt dnsmasq[42]: nftset inet fw4 pbr_wg_xray_4_dst_ip_cfg056ff5 netlink: Error: cache initialization failed: Resource temporarily unavailable
<30>Feb 15 09:29:38 OpenWrt dnsmasq[1]: started, version 2.90 cachesize 150
<30>Feb 15 09:29:38 OpenWrt dnsmasq[1]: compile time options: IPv6 GNU-getopt no-DBus UBus no-i18n no-IDN DHCP DHCPv6 no-Lua TFTP conntrack no-ipset nftset auth cryptohash DNSSEC no-ID loop-detect inotify dumpfile
<30>Feb 15 09:29:38 OpenWrt dnsmasq[1]: UBus support enabled: connected to system bus
<30>Feb 15 09:29:38 OpenWrt dnsmasq-dhcp[1]: DHCP, IP range 192.168.100.100 -- 192.168.100.249, lease time 1d
<30>Feb 15 09:29:38 OpenWrt dnsmasq-dhcp[1]: DHCP, IP range 192.168.1.100 -- 192.168.1.249, lease time 12h

This doesn't look like dnsmasq crashed, did you restart it manually or was it restarted thru cron or some other process?
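To pull only the relevant lines out of a longer remote log, filters along these lines may help. They are a sketch built from the message formats in the excerpt above; the function names are made up, and other dnsmasq versions may word these messages differently.

```shell
# Keep only the nftset error lines logged by dnsmasq (reads a log on stdin).
nftset_errors() {
    grep -E 'dnsmasq\[[0-9]+\]: nftset .*Error: '
}

# Keep only the lines that mark a dnsmasq (re)start.
dnsmasq_starts() {
    grep -E 'dnsmasq\[[0-9]+\]: started, version '
}

# Assumed usage, with remote.log standing in for your log file:
#   nftset_errors < remote.log | awk '{print $9}' | sort | uniq -c   # errors per set
#   dnsmasq_starts < remote.log                                      # restart times
```

Comparing the timestamps of the two outputs shows how long after the first error each restart happened.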

Well, I can't say what the RAM usage is when this error starts, but I have always seen RAM usage in this range, so I don't think it should be an issue.

root@OpenWrt:~# free -m
              total        used        free      shared  buff/cache   available
Mem:         249100       60464      149400         460       39236      144212
Swap:             0           0           0

No, I didn't touch dnsmasq (or any other setting) during that time, and the only active cron I have is this:

0 0 * * * scp /tmp/dhcp.leases root@192.168.1.101:/tmp/dhcp.leases

Are you aware of any way to turn on some kind of debug mode that would give more info about what caused the restart?

Outside of checking the OpenWrt dnsmasq wiki and/or dnsmasq init script for clues on how to enable additional debugging I have no idea.

It does feel like something starts to go awry (as indicated by netlink error messages) and then dnsmasq restarts.

Is there maybe something killing the ip tables that pbr is creating?

Once I get an opportunity, I will just nuke my router and do the most minimal setup possible, with only PBR and WireGuard, and see the results of that. Hopefully that will work fine, and then I can slowly add back the rest of my setup and see where the issue really is :woman_shrugging: :face_with_diagonal_mouth:

OpenWrt 24.10.0
PBR 1.1.8-r10
dnsmasq-full 2.90-r4
OpenVPN 2.6.12-r1

Hi. I've implemented PBR and it works well.

I have used traceroute on the client, a Mac connected to the router over wireless, to validate the policies.

I am using domain-based routing policies to route via the VPN, as well as a DNS routing policy routed via the VPN.

I have configured OpenWrt to issue Google's DNS servers (8.8.8.8 and 8.8.4.4) to all clients.

I have tried to use dnsmasq nft sets. I installed dnsmasq-full as per this recipe: https://docs.openwrt.melmac.net/pbr/#Howtoinstalldnsmasq-full and verified that it is installed and that dnsmasq is not.

However, when I switch the Resolver to dnsmasq.nftset, PBR no longer works: traceroute does not show the domains routing via the VPN.

I flushed the DNS cache on my client Mac, which is connected to the router's wireless.

Is this because I am not pointing my clients' DNS servers to the dnsmasq server on the router?

If this is the case, and I change back to the dnsmasq server on the router, will the PBR DNS policies still route the dnsmasq server's lookups via the VPN?

If my configuration, using Google DNS, should be working with dnsmasq.nftsets then I guess I've more debugging work to do.