Claim: a given PC on boot is randomly hanging the router (Archer C7 v2). Possible or bizarre coincidence?

I know someone with an Archer C7 v2. He's been using OpenWRT (ath79) for a while on a busy wired and wireless network and has kept up with the updates, currently on the latest 19.07. He has cable Internet and it's a home network (no server).

I can't pinpoint when this started, but it was late last year or early this one, and it was infrequent enough at first that I didn't think much of it. But as the months wore on, he began to notice that it seemingly coincided with when he would power on a desktop Win10 PC. The obvious symptom: once in Windows, he would find that he didn't have connectivity.

He'd then ask around and find that everyone else on the network is also suddenly offline. Further, at that point the router can't be reached by Web browser or even WinSCP (Ethernet and WiFi, and the PC in question has a direct if lengthy connection to the router via Ethernet). The router then needs to be cold booted, so anything interesting in its log for that day is lost.

What is the frequency of this? Sometimes it won't happen for two or three weeks, but sometimes it'll happen two or three times in a week.

I'm at a complete loss, since even if I wanted to completely hang a router with a PC I wouldn't know how to do it. I don't even know if there is a list of possibilities or if this is an illusion or coincidence of some kind.

The only thing I know in the realm of possibility would be an IP conflict, but I checked that: the PC is on DHCP (automatic). Further, the MAC is reserved in the router so that he gets the same IP every time. But even two PCs with the same IP shouldn't take out a router.

Not sure what your goal with this post is, especially as you start it with "Claim" and then write "I know someone".
Are you looking for advice on how to troubleshoot such an issue, or do you just want to start some strange thread?

That was my way of expressing skepticism. This seems like borderline sci-fi to me, but my post, which apparently you didn't like very much, contains the facts as I know them.

If this isn't a coincidence, in what ways is OpenWRT vulnerable to being hung passively (i.e. without someone actively trying to do it by going into, say, terminal)?

Well, there are many ways to hang OpenWRT, e.g. a wifi crash or OOM, both of which can happen on a C7.
So my first test would be to connect a PC via Ethernet cable and keep two SSH sessions running, one tailing the logs and one checking the free memory, and see what happens when that Win10 PC is connected.

You might need a serial cable to see the console when the hang happens. Network SSH might work to show the initial memory spike (if there is one), but the connection would likely die before the actual hang/crash.
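
A rough sketch of what the memory-watching session could run, in case it helps. The helper names (mem_kb, sample_line) are made up for illustration; the only thing assumed about the router is a standard /proc/meminfo.

```shell
#!/bin/sh
# Sketch only: print timestamped free-memory samples.
# mem_kb and sample_line are hypothetical helper names, not OpenWrt tools.

mem_kb() {
    # $1 = field name (e.g. MemFree), $2 = meminfo file (default /proc/meminfo)
    awk -v key="$1:" '$1 == key { print $2; exit }' "${2:-/proc/meminfo}"
}

sample_line() {
    # $1 = meminfo file (optional); prints one timestamped sample
    echo "$(date '+%Y-%m-%d %H:%M:%S') MemFree=$(mem_kb MemFree "$1")kB"
}
```

In the first SSH session, `logread -f` follows the system log. In the second, after pasting the functions above, something like `while sleep 5; do sample_line; done` prints a sample every five seconds, so the last lines on screen show what memory looked like just before the connection died.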

In researching this earlier, I went through the "WiFi hangs" threads for this model, but there was no mention of Ethernet also dying in those threads, so it shouldn't be that problem. Still, to be on the safe side, weeks ago I switched it over to the preferred "non-ct" driver, which is better in the sense that it takes less memory. The router typically has about 65-70 MB free, which is more than with the ct driver.

This is an "everything hangs for everyone" problem, over both Ethernet and WiFi, and it includes http, ping, and ssh.

I wasn't aware that it had a serial interface, so scaring one of those cables up may be next. I like the idea of lying in wait for it with another PC, but the problem there is that weeks can go by with no problem despite the "problem" PC being used every day.

I have no idea if that router has one, but many routers do (usually hidden inside the chassis).

Such crashes do exist, e.g. on ipq806x routers like the R7800 there has been a kernel-crashing jumbo frame bug with earlier kernel versions (until 5.4, I think).

If the kernel crashes due to OOM (or something in the network stack), the network stack likely dies before (or simultaneously with) the OOM/kernel crash.

Some debugging ideas: Does the culprit PC connect via wired or wifi? Does the crash happen both ways? (Is the crash related to a specific network stack, or to the config related to that device...)

The good thing about that particular PC is that it's a desktop, so just Ethernet.

If it is OOM, it's something sudden and catastrophic. I get daily email reports from the router, and the free mem is remarkably stable. I've never seen it lower than about 45 MB, and since dumping ct it hasn't been near that. Still, this isn't to say that something isn't coming along and causing the memory to plummet, though a PC causing that by booting is weird.

On your last question, I've been over that PC and can't see anything network-wise to make it stand out. Vanilla Win10 with the NIC on "Automatic" in terms of acquiring the address. As mentioned earlier, that is one of a few PCs that uses the router's function (Network->DHCP and DNS->Static Leases) to reserve an address, but it's been that way for ages.

Not all devices show in the DHCP lease tables. I have an esxi server with a static IP address and it doesn't show up. I doubt that's the problem, because your problem device is using DHCP and wouldn't always receive the same IP address.

Are there any switches being used? Could it be a switching loop? Vlans?

To rule out an electrical issue, you could place a switch between the router and the critical PC, to isolate the router. Probably not your problem but maybe worth trying?

Also, if you should manage to get away with a warm reboot (without power cycling) the following:
cat /sys/kernel/debug/crashlog > last_crash_log
might recover something potentially useful from memory? And maybe one of the watchdog packages could be used to initiate a soft reboot (probably not if the router truly wedges hard)?
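
If it helps, here is a minimal sketch of keeping that crash log around automatically, e.g. called from a startup script. The function name and the /etc/crashlogs destination are assumptions for illustration; only the /sys/kernel/debug/crashlog path comes from the command above.

```shell
#!/bin/sh
# Sketch: copy the kernel crash log, if one exists, to flash so it
# survives later reboots. save_crashlog and /etc/crashlogs are assumed names.

save_crashlog() {
    # $1 = source file (default /sys/kernel/debug/crashlog)
    # $2 = destination directory (default /etc/crashlogs)
    src="${1:-/sys/kernel/debug/crashlog}"
    dst="${2:-/etc/crashlogs}"
    [ -r "$src" ] || return 1          # nothing to save after a clean boot
    mkdir -p "$dst" || return 1
    cat "$src" > "$dst/crashlog.$(date '+%Y%m%d-%H%M%S')"
}
```

Calling `save_crashlog` with no arguments uses the defaults. It returns non-zero when there is no crash log to read, so it should be harmless to run on every boot.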


Seems to have. Picture in

Not all devices show in the DHCP lease tables. I have an esxi server with a static IP address and it doesn't show up. I doubt that's the problem, because your problem device is using DHCP and wouldn't always receive the same IP address.

In this case, even though it's not a static assignment on the PC, it does get the same IP each time due to the MAC-based reservation in the router. It's sort of a cheat way to still use DHCP but remain consistent.

Are there any switches being used? Could it be a switching loop? Vlans?

Elsewhere on the network, but that PC is a straight shot to the router with a long cable. No Vlans.

To rule out an electrical issue, you could place a switch between the router and the critical PC, to isolate the router. Probably not your problem but maybe worth trying?

That would be easy to do and worth a try. I was thinking earlier about the length of the cable, but it's less than 100m so should be good. At this point though, I can't really eliminate anything until actually trying it.

And maybe one of the watchdog packages could be used to initiate a soft reboot (probably not if the router truly wedges hard)?

That's very interesting. Soft reboot as in doing something like "/etc/init.d/network restart" I assume. I'll definitely look into that, which seems to be via this:

If you go that route, the underlying assumption is that this is not a full router crash, but a network stack lockup that makes the router unreachable while it is still operational... To debug the issue, you should maybe (after detecting a dead network from the router side) run a few debug commands and store the output in flash, so that the logs are visible after the reboot.
Something in the style of: dmesg > /etc/dmesg.log ; logread > /etc/logread.log ; sync ; reboot
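
To make that idea a bit more concrete, here is a hedged sketch of a small script doing the check and the capture. Everything configurable in it is an assumption: the TARGET address stands in for some always-on LAN host, /etc/hanglogs is an arbitrary flash directory, and the three-failures threshold is picked out of thin air. It also only helps if the router itself is still alive when the network dies.

```shell
#!/bin/sh
# Sketch: if the router cannot ping a LAN host several times in a row,
# save logs to flash and reboot. All names and defaults are assumptions.

TARGET="${TARGET:-192.168.1.2}"     # hypothetical always-on LAN host
LOGDIR="${LOGDIR:-/etc/hanglogs}"   # on flash, so it survives the reboot
MAX_FAILS="${MAX_FAILS:-3}"         # consecutive failed probes before acting

capture_logs() {
    # $1 = directory to write into
    mkdir -p "$1" || return 1
    dmesg   > "$1/dmesg.log"   2>&1
    logread > "$1/logread.log" 2>&1
    free    > "$1/free.log"    2>&1
    sync
}

next_fail_count() {
    # $1 = current failure count, $2 = exit status of the last probe
    if [ "$2" -eq 0 ]; then
        echo 0                      # a success resets the counter
    else
        echo $(( $1 + 1 ))
    fi
}

watch_loop() {
    fails=0
    while sleep 60; do
        ping -c 1 -W 2 "$TARGET" >/dev/null 2>&1
        rc=$?
        fails=$(next_fail_count "$fails" "$rc")
        if [ "$fails" -ge "$MAX_FAILS" ]; then
            capture_logs "$LOGDIR"
            reboot
        fi
    done
}

# Only start the loop when asked explicitly, e.g. "sh netwatch.sh run".
if [ "${1:-}" = "run" ]; then
    watch_loop
fi
```

Saved under a hypothetical name like netwatch.sh, it could be started in the background with `sh netwatch.sh run &`. Requiring several consecutive failures is meant to avoid rebooting just because the target host was briefly off.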


I just noticed that Watchcat doesn't have custom capability like that, at least under 19.07. Would you happen to know of one that allows it?

How is the PC connected to the network? Wireless or wired ethernet? If wired ethernet, is it a direct onboard NIC, or is it through a USB-C docking hub device (or similar)?

It's wired using the Ethernet from the motherboard.

Does other network connectivity (i.e. between two hosts on the LAN) also experience issues at the same time as the C7 v2 appears to hang?

That's one thing I'm not sure about, but I'll try to find out. That would at least tell us that the switch part of the router is functioning. I'd be kind of surprised if it didn't work.

A broadcast storm can bring down the entire network. If the LAN also appears to be messed up, another switch (any standard unmanaged switch will be fine since there are no VLANs) can test that theory. It may not be the router, but rather that Windows machine causing a broadcast storm and thus crippling the entire network at L2.
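
One rough way to look for a storm from any Linux box on the LAN (or from the router over serial, if it is still responsive) is to watch how fast an interface's receive counter climbs. This is only a sketch: the function names are invented, and it counts all received packets rather than broadcasts specifically, so a big download will also push the number up.

```shell
#!/bin/sh
# Sketch: estimate received packets per second on one interface
# from the kernel's per-interface counters. Helper names are hypothetical.

rx_packets() {
    # $1 = interface name, $2 = sysfs root (default /sys/class/net)
    cat "${2:-/sys/class/net}/$1/statistics/rx_packets"
}

rate_pps() {
    # $1 = first count, $2 = second count, $3 = seconds between samples
    echo $(( ($2 - $1) / $3 ))
}

sample_rate() {
    # $1 = interface name, $2 = interval in seconds
    a=$(rx_packets "$1")
    sleep "$2"
    b=$(rx_packets "$1")
    rate_pps "$a" "$b" "$2"
}
```

For example, `sample_rate br-lan 5` on the router (br-lan being the usual LAN bridge name on OpenWrt) prints the average packets per second over five seconds. A figure that is wildly higher during an outage than on a normal day, with nobody transferring anything, would point toward a storm.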

Wouldn't that be interesting. I'll have to look that up. But if that were the case, would it make sense that rebooting the problem PC wouldn't resolve it (it definitely doesn't) but rebooting the router would?