[Workaround] GL-AR150: No DHCP if LAN cable is not plugged during boot

I just tried swapping the two commands in the init script. It changes nothing …
I also tried stopping and starting dnsmasq via /etc/rc.local instead of restarting it, which also changes nothing. Still, as soon as I plug in a LAN cable, log in to the router and restart dnsmasq, the dhcp-range line is added to the config.

I had to reconnect on my other machine before I got an IPv4 address, but it works without a cable.
My service starts at 80 and has a sleep before it stops the service.
I will try adding a one-second sleep to the dnsmasq init script.

What about this:

uci set dhcp.@dnsmasq[0].nonwildcard="0"
uci commit dhcp
service dnsmasq restart

--bind-dynamic is definitely tricky.
I'm not sure enabling it by default is a good idea.

Restarting dnsmasq while a LAN cable is plugged in fixes the problem without changing anything in the config. So changing a setting and restarting dnsmasq while connected via LAN proves nothing, as DHCPv4 will always work after that (unless turned off, of course) …

But also, sadly, changing the "nonwildcard" parameter changes nothing. Still no dhcp-range set in the config and no DHCPv4 after a reboot without LAN cable …

Still, this is the default config, and those OpenWrt guys probably had a good reason for it, hadn't they?

What is it with that "carrier" that the init script checks for? What does this mean? And why isn't it present with no LAN cable attached in a bridge of a physical ethernet port and a wifi device which definitely won't have a cable plugged?
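For reference, "carrier" is the link state the kernel reports for a network device, roughly "is there a live link on this port". As far as I understand, a bridge only reports carrier once at least one of its member ports is up, which would explain why br-lan can show no carrier early in boot, before Wi-Fi has come up. A minimal sketch for inspecting the flag, assuming the standard Linux sysfs layout; the function name `carrier_state` and the `SYSFS_NET` variable are made up for illustration:

```shell
# Hypothetical helper: print "true"/"false" for a device's carrier flag.
# SYSFS_NET is only overridable so the function can be tried off-device.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

carrier_state() {
	local dev="$1" flag
	# The kernel exposes carrier as 1 (link) or 0 (no link); the read
	# fails if the device does not exist or is administratively down.
	flag="$(cat "$SYSFS_NET/$dev/carrier" 2>/dev/null)" || flag=""
	case "$flag" in
		1) echo true ;;
		0) echo false ;;
		*) echo unknown ;;
	esac
}
```

On the router, `carrier_state br-lan` should roughly correspond to what the init script reads with `devstatus br-lan | jsonfilter -e @.carrier`, although that value comes from netifd rather than directly from sysfs.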

It looks like interface binding via nonwildcard affects only DNS, not DHCP.
Since netstat clearly shows that dnsmasq is not listening on the DHCP port, the only possible explanation lies within the dnsmasq init script.
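The two symptoms mentioned in this thread (no dhcp-range line in the generated config, nothing listening on the DHCP port) can be checked with something like the following. This is only a sketch: the helper names are invented, and the location of the generated config is an assumption that may differ between releases.

```shell
# Hypothetical helpers for checking dnsmasq's DHCP state.

# Succeeds if any of the given config files contains a dhcp-range line.
has_dhcp_range() {
	grep -q '^dhcp-range=' "$@" 2>/dev/null
}

# Reads `netstat -lnu` output on stdin; succeeds if a UDP socket is
# bound to port 67 (the DHCP server port).
listens_on_dhcp() {
	grep -Eq '^udp.*:67[[:space:]]'
}
```

Possible usage on the router, assuming the generated config lives under /var/etc: `has_dhcp_range /var/etc/dnsmasq.conf* || echo "no dhcp-range"` and `netstat -lnu | listens_on_dhcp || echo "not listening on port 67"`.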

I see, you have already traced it:
https://bugs.openwrt.org/index.php?do=details&task_id=2145#comment6113

I believe, hardware state dependency should be dropped, at least for nonwildcard=0.

If you know the offending line, why don’t you put a timeout loop before it?
(not real code)

counter = 30
while counter > 0 && jsonfilter -e @.carrier == false
    counter = counter - 1
    sleep 1
endwhile

If that works and results in a working system, your hunch about specific device timing/race conditions is probably spot on.

This was a good idea :wink:

I added the following code to the init script, right before the check that fails and should not:

counter=1
# Quote the command substitution so the test does not break on empty output
while [ "$(devstatus "$ifname" | jsonfilter -e @.carrier)" = 'false' ] && [ "$counter" -le 20 ]; do
	logger -t "test" "No carrier, waiting $counter secs"
	counter=$((counter + 1))
	sleep 1
done

which results in the following log upon a reboot without a LAN cable plugged:

...
Sun Feb 24 02:13:08 2019 daemon.info procd: - init complete -
Sun Feb 24 02:13:09 2019 user.notice test: No carrier, waiting 1 secs
Sun Feb 24 02:13:10 2019 user.notice test: No carrier, waiting 2 secs
Sun Feb 24 02:13:11 2019 user.notice test: No carrier, waiting 3 secs
Sun Feb 24 02:13:12 2019 user.notice test: No carrier, waiting 4 secs
Sun Feb 24 02:13:14 2019 user.notice test: No carrier, waiting 5 secs
Sun Feb 24 02:13:15 2019 user.notice test: No carrier, waiting 6 secs
Sun Feb 24 02:13:16 2019 user.notice test: No carrier, waiting 7 secs
Sun Feb 24 02:13:16 2019 kern.info kernel: [   36.119928] eth1: link up (1000Mbps/Full duplex)
Sun Feb 24 02:13:16 2019 kern.info kernel: [   36.123166] br-lan: port 1(eth1) entered blocking state
Sun Feb 24 02:13:16 2019 kern.info kernel: [   36.128318] br-lan: port 1(eth1) entered forwarding state
Sun Feb 24 02:13:16 2019 daemon.notice netifd: Network device 'eth1' link is up
Sun Feb 24 02:13:16 2019 kern.info kernel: [   36.136384] IPv6: ADDRCONF(NETDEV_CHANGE): br-lan: link becomes ready
Sun Feb 24 02:13:16 2019 daemon.notice netifd: bridge 'br-lan' link is up
Sun Feb 24 02:13:16 2019 daemon.notice netifd: Interface 'lan' has link connectivity
Sun Feb 24 02:13:21 2019 daemon.info dnsmasq[727]: exiting on receipt of SIGTERM
Sun Feb 24 02:13:21 2019 daemon.info dnsmasq[1273]: started, version 2.80 cachesize 150
Sun Feb 24 02:13:21 2019 daemon.info dnsmasq[1273]: DNS service limited to local subnets
Sun Feb 24 02:13:21 2019 daemon.info dnsmasq[1273]: compile time options: IPv6 GNU-getopt no-DBus no-i18n no-IDN DHCP no-DHCPv6 no-Lua TFTP no-conntrack no-ipset no-auth no-DNSSEC no-ID loop-detect inotify dumpfile
Sun Feb 24 02:13:21 2019 daemon.info dnsmasq-dhcp[1273]: DHCP, IP range 192.168.1.100 -- 192.168.1.249, lease time 12h
...

AND WORKING IPV4 DHCP! So this really seems to be some timing issue!
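For anyone who wants to reuse the wait loop, it could be wrapped in a small function along these lines. This is a sketch only, not what the init script ships with: `wait_for_carrier`, `WAIT_STEP` and the idea of passing the probe command as arguments are all invented here for illustration.

```shell
# Hypothetical helper: run the probe command ("$@" after the first
# argument) until it stops printing "false", or give up after MAX_TRIES.
# Usage: wait_for_carrier MAX_TRIES PROBE_COMMAND [ARGS...]
wait_for_carrier() {
	local tries="$1" n=0
	shift
	while [ "$("$@")" = "false" ]; do
		n=$((n + 1))
		[ "$n" -ge "$tries" ] && return 1   # still no carrier, timed out
		sleep "${WAIT_STEP:-1}"             # seconds between probes
	done
	return 0
}
```

In the init script this might be called as `probe() { devstatus "$1" | jsonfilter -e @.carrier; }` followed by `wait_for_carrier 20 probe "$ifname"`. Passing the probe in as a command keeps the loop itself testable without a router.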

About the "[Solved]" edit done by some mod: I don't think this is actually solved, this only can be some workaround as it's caused by the combination of the hardware and the current state of OpenWrt. It's not really solved until some patch really avoiding and/or fixing this problem is added upstream.

Cool, well that’s a good outcome in that it kind of confirms the suspicions.
I’d add this information to the bug report, and hope that a dev picks it up as a bug.

It would be interesting to know if it occurs on the original firmware from GL.iNet as well. If so, they may have a bit more of a vested interest in fixing it.

As a workaround, I am using this in init.d/dnsmasq:

reload_service() {
	procd_send_signal dnsmasq "$@"
	sleep 5
	rc_procd start_service "$@"
}

Well, this is also about timing inside the init script, so most probably, this is really a timing issue.

Okay, but still: what does dhcp_check have to wait for?

I don't have my AR150 with me, but I'm dying to know what it is waiting for.

I don't even know what devstatus does, but on git I found that it calls jshn.sh.

You said before that with the LAN cable unplugged, but with an IPv6 address or a static IPv4 address given …

So are you implying that the br-lan setup takes too long to complete before dhcp_check starts?

And is that problem number one, with the second problem being that something is crippled by this first action and something else then occurs that prevents DHCP from working on successive attempts to connect through LAN or WLAN?

When the cable is plugged in, procd sends a SIGTERM instead of a SIGHUP, which takes longer to react to, and that is what saves it. I do not think that is what is supposed to happen. My guess is that it only does that because of bad timing again, and I think that dhcp_check waits for the old dnsmasq to shut down.

This is definitely not what should happen! But maybe, the different kinds of signals are the core problem … or the way they are handled. I'm pretty sure the problem is not located where I put my workaround. I think it's more a dirty hack to get it to work. But at least, waiting for that carrier makes it work at all …

https://openwrt.org/docs/guide-user/base-system/dhcp_configuration#race_conditions_with_netifd

Well, this will probably "mask" the problem by skipping the dhcp_check() test (making it another workaround), but shouldn't dhcp_check() return successfully instead of failing?

So the question is not how to work around this issue but why it happens at all, isn't it?

Okay, I can confirm that setting this option works around the problem by skipping the check in question.

Don't get me wrong: This is definitely the best workaround until now, because one does not have to edit init files, but only set a config option!
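For readers landing here without following the link: as far as I can tell, the option in question is `force` on the DHCP pool section, which makes the init script skip dhcp_check() for that interface. A minimal sketch, assuming the pool section is named `lan` as in the default config (check the linked wiki section before relying on this):

```shell
# Assumption: the DHCP pool section is called 'lan' (OpenWrt default)
# and 'force' is the option the linked wiki section refers to.
uci set dhcp.lan.force='1'
uci commit dhcp
service dnsmasq restart
```

To test it properly, reboot without a LAN cable afterwards, since restarting dnsmasq while a cable is plugged in makes DHCPv4 work regardless.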

But the question is: Should this work without the option (so that we have another problem to be solved) or is the problem that this option should be enabled by default?

The point is that, imo, a fresh OpenWrt installation should simply work … so either way, this needs to be tracked down!

I would say "yes" to that -- that the underlying problem is likely that the test of interface "carrier" is failing unexpectedly.

Whether it's worth trying to resolve (and get the patches into master and 19.x), or just documenting the workaround is another question.

Because that check:

  • is redundant;
  • has a negative effect on fault tolerance;
  • should not be enabled by default.

Well, for a router that is advertised as being delivered with OpenWrt preinstalled, I would say one should fix it so that one would not have to say "OpenWrt preinstalled, but with a recent version, you have to do some workaround to get it to work" :wink: