Starlink DHCP vs OpenWrt: is this a bug?

Hey there, I'm coming to ask if I should file a bug report and if so, where.

There's a problem using OpenWRT with Starlink. Long story short: when the Starlink dish ("Dishy") reboots, it temporarily hands out an IP address of 192.168.100.100 to the router connected to it. That DHCP lease has a 5 second expiration. Once Dishy has finished configuring itself, it hands out a final IP address that is your real address, something in the CGNAT 100.64.0.0/10 block.

OpenWRT isn't ever getting this second address. Instead, it keeps renewing the 192.168.100.100 lease. A lot of routers seem to have this problem, so it could be a bug with Starlink's DHCP server. Or it could just be OpenWRT reacting badly to the weird super-short 5 second DHCP lease.

I finally caught my router in this bad state and got some logs that suggested OpenWRT was ignoring the 5 second lease expiration. Details on my blog at https://nelsonslog.wordpress.com/2021/04/07/openwrt-vs-starlink-dhcp-leases/

Is this a bug I should file against OpenWRT? If so, is it core, udhcpc (Busybox), or netifd?

I'm also open to any advice on how to further debug the problem. Unfortunately I didn't have tcpdump installed at the time so couldn't capture the DHCP traffic. Next time!

looks pretty well debugged to me...

this if anything ( there are a few tunables like max_wait/retry? that may assist if tuned right down )

a hotplug 'logwatch' script... that SIGs udhcpc or ifups when a few(192.x.x) renews are seen ||&& internal timer w max_loops should sort it pretty well...

WAN = CgNAT && exit also...
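
The two checks hinted at above (a 192.x placeholder versus a CGNAT address) could be sketched as small shell helpers. This is only an illustration: the function names are made up, and it assumes the placeholder lease always falls inside 192.168.100.0/24.

```shell
#!/bin/sh
# Hypothetical helpers for a Starlink watchdog script.
# Assumption: Dishy's temporary lease is always inside 192.168.100.0/24.

# Succeeds if the address is in 192.168.100.0/24.
is_placeholder() {
    case "$1" in
        192.168.100.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeeds if the address is in the CGNAT block 100.64.0.0/10,
# i.e. first octet 100 and second octet between 64 and 127.
is_cgnat() {
    first=${1%%.*}
    rest=${1#*.}
    second=${rest%%.*}
    [ "$first" = "100" ] || return 1
    [ "$second" -ge 64 ] 2>/dev/null && [ "$second" -le 127 ]
}
```

A watchdog would then exit early when `is_cgnat` succeeds for the WAN address and only act when `is_placeholder` does.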

Thanks for the reply! Sounds like you're encouraging me to file a bug with udhcpc. I couldn't find the tunables you referred to; are they command line options to udhcpc or something else?

Still some mysteries. Why does OpenWRT decide to treat it as a 122 second lease? What happens when it asks for the renew of the bad address; is Dishy allowing the renew? If so it may be Dishy has a bug too. It still seems like OpenWRT has enough information to get the new address on its own without any Dishy changes.

There are lots of folks using scripts to work around the problem by forcing a reconfigure if the IP address stays 192.168.100.* too long. I like the logwatch idea. I'll probably do that as a band-aid, but it'd be nice to find a real fix.

yup '--help'

#or you can look here at some glue
cat /lib/netifd/proto/dhcp.sh

I'm no expert on this... but in general you are contending with

  • udhcpc functionality ( and specified options ) which in essence means 'the dhcp protocol specification' or ideally something very close to it... ( in cases custom patches may be applied to facilitate edge cases )
  • netifd... or 'the watcher of udhcpc' which incorporates linkstate and general under(and-over)lying link management and configuration ( I pretty much 'logically' include hotplug and any additional application~system level interactions in this category also ... with fast occurring state changes hotplug[&&||system-bus-latency] is a good place to look for strange behavior )
  • as you say... the upstream device/s adherence to these specifications
  • interactions from the upstream physical link state or other properties

play around with '--help' and try a few options... if nobody answers your '122sec' question i'm sure you'll be able to figure it out based on the above areas of separation... ( netifd debug may also shed some light )... suspect you are dealing with two minor quirks at different places listed above...

The OpenWrt DHCP server uses 2 minutes as its minimum lease time. Most likely that is connected to the 2 minutes you see on the DHCP client.
However, Starlink should not be renewing the 192.168 IP if it is not working; it should hand out the 100.64 address instead.

I'd say that, if anything, it would be a bug / change request for upstream busybox. Its udhcpc client currently enforces a minimum lease time of 122 seconds, see https://git.busybox.net/busybox/tree/networking/udhcp/dhcpc.c#n1749

The comment /* timeout > 60 - ensures at least one unicast renew attempt */ implies that this lower limit ensures that other aspects of the client work properly, so blindly changing it to something lower might have other unintended consequences.

The best place to gather feedback would be the Busybox mailing list at https://busybox.net/lists.html

You could also try reporting a bug directly at https://bugs.busybox.net/ but I suppose bringing the question to the list will facilitate more feedback.

If you can tcpdump this interaction you'll see what happens. I suspect that Dishy is incorrectly allowing the renew.

In my opinion, because this is a very specific edge case, you're likely to get the best results by running a cron job every minute that checks whether you have a 192.168 address on the WAN and, if so, restarts the interface: ip addr show dev eth0.2 | grep -q 'inet 192\.168\.' && { ifdown wan; sleep 10; ifup wan; } or something similar. (The braces matter: without them the sleep and ifup run every minute regardless of the grep result.)
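
A slightly fuller sketch of that cron check follows, with the restart grouped so it only runs when the placeholder address is actually present. The script path, the `WAN_DEV`/`WAN_IF` variables, and the function names are all assumptions for illustration; adjust the device and interface names to your own setup.

```shell
#!/bin/sh
# Hypothetical watchdog, e.g. saved as /usr/local/bin/starlink-wan-check.sh.
# WAN_DEV and WAN_IF are assumptions; adjust them to your own setup.
WAN_DEV="${WAN_DEV:-eth0.2}"
WAN_IF="${WAN_IF:-wan}"

# Reads `ip -4 addr show` output on stdin; succeeds if any IPv4 address
# listed there starts with 192.168.
has_placeholder_addr() {
    grep -q 'inet 192\.168\.'
}

# Restart the WAN interface only when the placeholder address is present.
restart_if_placeholder() {
    if ip -4 addr show dev "$WAN_DEV" | has_placeholder_addr; then
        logger -t starlink-wan "placeholder address on $WAN_DEV, restarting $WAN_IF"
        ifdown "$WAN_IF"
        sleep 10
        ifup "$WAN_IF"
    fi
}

# Only act when called with "run", so the functions can be sourced and tested.
case "${1:-}" in
    run) restart_if_placeholder ;;
esac
```

With that in place, a crontab line such as `* * * * * /usr/local/bin/starlink-wan-check.sh run` would perform the check once a minute.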

That bug was reported years ago and they have not fixed it yet.

I had the same problem and I solved it with the luci-app-watchcat package.

  1. Install the package luci-app-watchcat
  2. Services -> Watchcat
  3. Mode: Restart Interface
  4. Period: 120s or 2m
  5. Interface: Your "ethx.x" (wan, wan6) interface.
  6. Save & Apply.

Don't change and don't use the other options.

Thanks for all the suggestions and advice. Next time the problem occurs, or when I have time to induce it, I'll capture a tcpdump to see for sure what's going on with DHCP. With that info in hand I think I'll have enough to file a proper bug report. I'm also hoping to help Starlink understand the problem. They officially tell their customers to use the Starlink-supplied router, which doesn't have this problem. But they have also said they are happy with users using their own routers, so I'm hoping they still care.

jow, thanks for finding the code in Busybox that enforces the 122 second lease minimum. I can totally see why they'd fudge that; super short leases must cause strange problems in some circumstances. And it's not the root cause of my problem; I could live with the router recovering from the bad state after 122 seconds instead of 5. But it's never recovering.

dlakelan, I agree that if Dishy is allowing the renew of the bad address then that's a Dishy bug. I want to capture the DHCP traffic to verify that's really what's happening.

O_o, do you happen to know where the bug was reported? Thanks for the watchcat suggestion, that looks like a fairly good generic service. Reading the code it uses a ping to 8.8.8.8 to decide if the network is up, then takes some action (like restart interface) if not. The support for just restarting an interface (instead of rebooting) is fairly new. That's a potential handy fix for all sorts of circumstances. The other option is to write a Starlink-specific fix like dlakelan is suggesting.

Whatever the end result, hopefully either OpenWRT will work with Starlink unmodified or else there'll be some very simple instructions / a single package to install to make it work right. I have a feeling this Starlink+OpenWRT combination is going to be popular. (Starlink's provided router is super limited.)

Maybe a new setting could be added to the wan interface, called WAN Monitoring or something like that, which pings a given host at a given interval and restarts the wan interface when the ping fails, something like the luci-app-watchcat package. Or find a better solution, I don't know.
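
As a rough sketch of what such a monitor would do (this is not how watchcat itself is implemented, and `PING_HOST`, `WAN_IF`, `FAIL_LIMIT` and `INTERVAL` are made-up names, not existing OpenWrt settings), the ping-then-restart loop could look like this:

```shell
#!/bin/sh
# Hypothetical WAN monitor. PING_HOST, WAN_IF, FAIL_LIMIT and INTERVAL are
# illustrative names, not existing OpenWrt settings.
PING_HOST="${PING_HOST:-8.8.8.8}"
WAN_IF="${WAN_IF:-wan}"
FAIL_LIMIT="${FAIL_LIMIT:-3}"   # consecutive failed probes before a restart
INTERVAL="${INTERVAL:-30}"      # seconds between probes

# Succeeds if the host answers at least one of three pings.
host_reachable() {
    ping -c 3 -W 2 "$PING_HOST" >/dev/null 2>&1
}

# Args: current failure count, probe result (0 = reachable).
# Prints the new failure count, or "restart" once the limit is reached.
next_state() {
    if [ "$2" -eq 0 ]; then
        echo 0
    elif [ $(( $1 + 1 )) -ge "$FAIL_LIMIT" ]; then
        echo restart
    else
        echo $(( $1 + 1 ))
    fi
}

# Probe forever, restarting the interface after FAIL_LIMIT failures in a row.
monitor() {
    fails=0
    while :; do
        if host_reachable; then probe=0; else probe=1; fi
        fails=$(next_state "$fails" "$probe")
        if [ "$fails" = "restart" ]; then
            ifdown "$WAN_IF"; sleep 10; ifup "$WAN_IF"
            fails=0
        fi
        sleep "$INTERVAL"
    done
}

# Only start the loop when called with "run".
case "${1:-}" in
    run) monitor ;;
esac
```

Requiring several failures in a row before restarting avoids bouncing the interface on a single dropped ping, which seems a reasonable default on a satellite link.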

question... why does 'Dishy' do this at all...

  • Is it a 'sense valid client' thing?
  • an allow admin thing similar to failsafe ping?
  • something else?

This behavior is not uncommon. My Vodafone Germany cable modem does exactly the same. Whenever it is not able to negotiate an upstream connection (or while it is in the process of doing so), it leases a 192.168.100.xxx IP address instead of the actual WAN IP.

In the system log, OpenWRT prints the lease time for every new WAN address lease it gets from the DHCP server (I am not sure if it is also visible in the LuCI system overview).
But how long a lease does the OpenWRT router think it has?
Because OpenWRT will not do anything until that timer counts down to 0.

DOCSIS modems will continue to respond on that 192.168.100.x address, to allow access to the modem's information/status pages. I seem to recall that this is mandated somewhere in the DOCSIS standards.
Is that the same for Dishy? Can you still access it via 192.168.100.xx after a CGNAT IP address has been assigned?

wulfy23, as jow says, the 192.168.100.100 address is handed out right when Dishy starts up. It takes a minute or so for Dishy to get its final CGNAT 100.64.x.x address. I've assumed that time is for it to acquire satellites and be told by Starlink's infrastructure what IP address to use.

flygarn12, OpenWRT thinks it has 122 seconds on the 192.168.100.100 lease (as displayed in the syslog). As discussed above, that's a minimum udhcpc enforces. The Luci view of the lease on the front page shows 5 seconds which is what Dishy actually told it. OpenWRT renews the lease after 61 seconds, so it's acting as if it really is a 122 second lease. I definitely think it's a problem if Dishy is allowing the renew, but I want to capture that with tcpdump to be sure that's really happening.
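
The arithmetic behind those numbers can be sketched as follows. This is an illustration, not the actual busybox code: the 122 second floor comes from the source linked earlier in the thread, and renewing at half the lease is an assumption that happens to match the 61 seconds seen in the log.

```shell
#!/bin/sh
# Sketch of the lease timing described above; not the actual busybox code.
MIN_LEASE=122   # minimum lease udhcpc enforces, per the busybox source

# Print the lease (in seconds) the client acts on for a server-offered lease.
effective_lease() {
    if [ "$1" -lt "$MIN_LEASE" ]; then
        echo "$MIN_LEASE"
    else
        echo "$1"
    fi
}

# Print the seconds until the first renew, assumed to be half the lease.
renew_after() {
    echo $(( $(effective_lease "$1") / 2 ))
}

effective_lease 5   # prints 122
renew_after 5       # prints 61
```

So Dishy's 5 second lease is treated as 122 seconds, and the first renew goes out long after the lease Dishy actually granted has expired.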

moeller0 is right, there's a Dishy status page at 192.168.100.1 (Dishy's own address; 192.168.100.100 is what it hands to the router). There's also a gRPC endpoint for some diagnostic data that folks have figured out how to access, and a nice mobile app from SpaceX that displays the status data. You can still access it once the CGNAT address is set up, but it requires adding a static route to get to it. (Come to think of it, that static route would be a second nice feature for an "OpenWRT for Starlink" package.)

I usually add an extra interface with a static IP in the range of the modem's network and put it in the WAN firewall zone. That automatically gives you a route and means you aren't dependent on DHCP. It might also be useful to add a firewall rule that just blocks DHCP responses with 192.168 addresses. You could use the u32 match.
