I am running 22.03 on my RB5009UG. I've added a few firewall rules, and a WireGuard instance, but otherwise it's pretty much stock. After booting, routing is broken: DNS resolution works, but 'the internet' does not. After reloading the firewall, routing functionality is as it should be. Looks like somehow the firewall is loaded a bit too early?
I diffed the nftables ruleset (output of nft list ruleset) from before and after the reload, and put the result in a gist to get a colourised diff.
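For reference, a before/after comparison of this kind could be scripted roughly as follows. This is only a sketch: `snapshot_diff` is a hypothetical helper name, and it assumes `mktemp` and `diff` are available on the device.

```sh
#!/bin/sh
# Hypothetical helper: capture the output of a command before and after
# running a second command, then print a unified diff of the two captures.
#   $1: command that prints the state  (e.g. 'nft list ruleset')
#   $2: command that changes the state (e.g. 'fw4 -q reload')
snapshot_diff() {
    sd_before=$(mktemp) || return 2
    sd_after=$(mktemp) || { rm -f "$sd_before"; return 2; }
    sh -c "$1" > "$sd_before"
    sh -c "$2"
    sh -c "$1" > "$sd_after"
    # diff exits 0 when the captures are identical and 1 when they differ.
    sd_rc=0
    diff -u "$sd_before" "$sd_after" || sd_rc=$?
    rm -f "$sd_before" "$sd_after"
    return "$sd_rc"
}
```

On the router this would be called as `snapshot_diff 'nft list ruleset' 'fw4 -q reload'`, right after a boot that shows the problem.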
Ruleset:
# cat /etc/config/firewall
config defaults
option syn_flood '1'
option input 'ACCEPT'
option output 'ACCEPT'
option forward 'REJECT'
option drop_invalid '1'
config zone
option name 'lan'
list network 'lan'
option input 'ACCEPT'
option output 'ACCEPT'
option forward 'ACCEPT'
config zone
option name 'wan'
list network 'wan'
list network 'wan6'
option input 'REJECT'
option output 'ACCEPT'
option forward 'REJECT'
option masq '1'
option mtu_fix '1'
config forwarding
option src 'lan'
option dest 'wan'
config rule
option name 'Allow-DHCP-Renew'
option src 'wan'
option proto 'udp'
option dest_port '68'
option target 'ACCEPT'
option family 'ipv4'
config rule
option name 'Allow-Ping'
option src 'wan'
option proto 'icmp'
option icmp_type 'echo-request'
option family 'ipv4'
option target 'ACCEPT'
config rule
option name 'Allow-IGMP'
option src 'wan'
option proto 'igmp'
option family 'ipv4'
option target 'ACCEPT'
config rule
option name 'Allow-DHCPv6'
option src 'wan'
option proto 'udp'
option dest_port '546'
option family 'ipv6'
option target 'ACCEPT'
config rule
option name 'Allow-MLD'
option src 'wan'
option proto 'icmp'
option src_ip 'fe80::/10'
list icmp_type '130/0'
list icmp_type '131/0'
list icmp_type '132/0'
list icmp_type '143/0'
option family 'ipv6'
option target 'ACCEPT'
config rule
option name 'Allow-ICMPv6-Input'
option src 'wan'
option proto 'icmp'
list icmp_type 'echo-request'
list icmp_type 'echo-reply'
list icmp_type 'destination-unreachable'
list icmp_type 'packet-too-big'
list icmp_type 'time-exceeded'
list icmp_type 'bad-header'
list icmp_type 'unknown-header-type'
list icmp_type 'router-solicitation'
list icmp_type 'neighbour-solicitation'
list icmp_type 'router-advertisement'
list icmp_type 'neighbour-advertisement'
option limit '1000/sec'
option family 'ipv6'
option target 'ACCEPT'
config rule
option name 'Allow-ICMPv6-Forward'
option src 'wan'
option dest '*'
option proto 'icmp'
list icmp_type 'echo-request'
list icmp_type 'echo-reply'
list icmp_type 'destination-unreachable'
list icmp_type 'packet-too-big'
list icmp_type 'time-exceeded'
list icmp_type 'bad-header'
list icmp_type 'unknown-header-type'
option limit '1000/sec'
option family 'ipv6'
option target 'ACCEPT'
config rule
option name 'Allow-IPSec-ESP'
option src 'wan'
option dest 'lan'
option proto 'esp'
option target 'ACCEPT'
config rule
option name 'Allow-ISAKMP'
option src 'wan'
option dest 'lan'
option dest_port '500'
option proto 'udp'
option target 'ACCEPT'
config zone
option name 'wg'
option input 'ACCEPT'
option forward 'ACCEPT'
option output 'ACCEPT'
option network 'wg0'
option masq '1'
config forwarding
option src 'wg'
option dest 'lan'
config forwarding
option src 'lan'
option dest 'wg'
config rule
option src '*'
option target 'ACCEPT'
list proto 'udp'
option dest_port '8192'
option name 'Allow-Wireguard-Inbound'
config redirect
option target 'DNAT'
option src 'wan'
option dest 'lan'
list proto 'tcp'
option src_dport '4505'
option dest_ip '10.0.0.5'
option dest_port '4505'
option name 'Salt 1'
config redirect
option target 'DNAT'
option src 'wan'
option dest 'lan'
list proto 'tcp'
option src_dport '4506'
option dest_ip '10.0.0.5'
option dest_port '4506'
option name 'Salt 2'
I'm on 22.03 HEAD, compiled a few days ago, so it includes the latest fw4 bump, which happened around 9 days ago.
It's been over a month since I first set up OpenWrt on this device, so I'm not sure whether this is a regression. I will see what reverting 628d7917ea03a24de43a35fd90894cf8d5d62dc0 does.
No, not really. You need to debug the 20-firewall hotplug script, see if it is invoked at all for INTERFACE=wan and if yes, where it bails out exactly. Sprinkling logger statements will help.
#!/bin/sh
has_zone() {
	fw4 -q network "$INTERFACE" >/dev/null && return 0
	eval $(ubus call "network.interface.$INTERFACE" status | jsonfilter -e 'ZONE=@.data.zone')
	fw4 -q zone "$ZONE" >/dev/null
}
logger -t firewall "Interface up or update"
[ "$ACTION" = ifup -o "$ACTION" = ifupdate ] || exit 0
logger -t firewall "Interface update with IP addresses"
[ "$ACTION" = ifupdate -a -z "$IFUPDATE_ADDRESSES" -a -z "$IFUPDATE_DATA" ] && exit 0
/etc/init.d/firewall enabled || exit 0
logger -t firewall "Zone check"
has_zone || exit 0
logger -t firewall "Reloading firewall due to $ACTION of $INTERFACE ($DEVICE)"
fw4 -q reload
Which yields this:
# logread -e firewall
Mon May 30 22:38:52 2022 user.notice firewall: Interface up or update
Mon May 30 22:38:52 2022 user.notice firewall: Interface update with IP addresses
Mon May 30 22:38:52 2022 user.notice firewall: Zone check
Mon May 30 22:38:52 2022 user.notice firewall: Reloading firewall due to ifup of lan (br-lan)
I can say for sure that the PPPoE interface only comes up at 22:39:01, and there are no more firewall messages in the log after that, only before (the ones above).
That looks as if no hotplug event is triggered at all when PPPoE is established. Are there any other scripts that run prior to 20-firewall? Do you see dangling shell processes in ps www ?
So it seems the hotplug calls are indeed stuck. I suppose one of 20-ntpclient, 30-nlbwmon or 95-ddns is the culprit. One of these scripts does not return (it hangs), which prevents /sbin/hotplug-call from completing and causes events to pile up.
My first hunch would be the ddns script, due to the presence of a ddns script in your ps output.
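To see which handlers share the event queue with 20-firewall, the scripts can be listed in the order the shell glob expands them. A small sketch, assuming the handlers live in /etc/hotplug.d/iface and are run one after another in glob (lexical) order; `list_hotplug_scripts` is a made-up helper name, not an existing tool:

```sh
#!/bin/sh
# Hypothetical helper: print the handler scripts of one hotplug directory
# in the order a shell glob expands them (assumed to be the run order).
#   $1: directory to inspect, defaults to /etc/hotplug.d/iface
list_hotplug_scripts() {
    lhs_dir="${1:-/etc/hotplug.d/iface}"
    for lhs_script in "$lhs_dir"/*; do
        # An empty directory leaves the glob unexpanded, so test each entry.
        [ -f "$lhs_script" ] && basename "$lhs_script"
    done
    return 0
}
```

Comparing that list with the output of ps www should show which handler is still running and therefore holding up the events behind it.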
OK. I've deleted the 95-ddns hotplug script, but no difference AFAICT:
# logread -e firewall
Tue May 31 19:14:56 2022 user.notice firewall: Interface up or update
Tue May 31 19:14:56 2022 user.notice firewall: Interface update with IP addresses
Tue May 31 19:14:56 2022 user.notice firewall: Zone check
Tue May 31 19:14:56 2022 user.notice firewall: Reloading firewall due to ifup of lan (br-lan)
After discussing this with @jow, it turned out the ntpclient hotplug handler spawns an ntpclient command that never returns, which blocks all subsequent hotplug events.
Ntpclient is pulled in by the luci-app-ntpc package, I will be filing a bug against the former as requested.
I wonder whether hotplug should allow scripts to be configured as best-effort and time them out if they do not return within a user-configurable amount of time. Or, at the very least, whether there should be a monitoring process that adds a warning to the log when execution of a specific hotplug script has blocked for XX seconds?
(I suffered from the same issue on Turris OS for a long time. Not being able to figure out the root cause, the best fudge I came up with was to synthesize calls to the /etc/hotplug.d/iface scripts (where I believe the offending ntpclient script lives) from /etc/hotplug.d/net, which is rather crude.)
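As an illustration of the time-out idea, a handler could guard its own long-running command with a small wrapper. This is a sketch under assumptions, not something hotplug provides: `run_with_timeout` is a hypothetical name, the limit is taken as whole seconds, and the warning goes through `logger` with a fallback to stderr.

```sh
#!/bin/sh
# Hypothetical wrapper: run a command, send it SIGTERM if it is still
# running after roughly $1 seconds, and log a warning when that happens.
run_with_timeout() {
    rwt_limit="$1"
    shift
    "$@" &
    rwt_cmd=$!
    # Watchdog: poll once per second and stop early once the command is gone.
    (
        rwt_waited=0
        while [ "$rwt_waited" -lt "$rwt_limit" ]; do
            kill -0 "$rwt_cmd" 2>/dev/null || exit 0
            sleep 1
            rwt_waited=$((rwt_waited + 1))
        done
        kill -TERM "$rwt_cmd" 2>/dev/null
    ) >/dev/null 2>&1 &
    rwt_dog=$!
    rwt_status=0
    wait "$rwt_cmd" || rwt_status=$?
    # The command has finished, so the watchdog is no longer needed.
    kill "$rwt_dog" 2>/dev/null || true
    wait "$rwt_dog" 2>/dev/null || true
    # 143 = 128 + SIGTERM (15): the command was terminated, most likely by the watchdog.
    if [ "$rwt_status" -eq 143 ]; then
        logger -t hotplug "'$*' terminated after ${rwt_limit}s" 2>/dev/null ||
            echo "hotplug: '$*' terminated after ${rwt_limit}s" >&2
    fi
    return "$rwt_status"
}
```

A handler would then call something like `run_with_timeout 10 some_command` instead of `some_command`, so that a command which never returns delays the event queue by about ten seconds rather than blocking it indefinitely.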