WR802n isn't reachable after a few days/weeks, reboot helps

SerialConsole · February 24, 2022, 11:56pm

I'm using OpenWrt 21.02.1, r16325-88151b8303 on WR802n.
Sometimes, after a few weeks, some devices on the Wi-Fi network that the WR802n provides are no longer reachable. It has even happened that a device was reachable from one host but not from another. The WR802n itself stays reachable the whole time.

Once I did a DHCP renew and the device was reachable again, but about 24 hours later it was unreachable again.

After a reboot the device is reachable again, for at least about one week; then the next reboot is needed. dmesg has no new entries (the last entry is from about 35 seconds after reboot).

What can be done to debug this issue?

EDIT: I'm not using OpenWrt's DHCP, but a dedicated DHCP server.

anon12960265 · February 25, 2022, 12:01am

I think it seems to be related to my DHCP issue on the same release.

SerialConsole · February 25, 2022, 12:17am

Thanks for the fast reply. Sorry, I didn't mention that I'm using a dedicated DHCP server. I've updated the first post. I think you are referring to the post "Interface stopped distributing DHCP leases". Since you're using OpenWrt's DHCP, my problem is probably not related to your issue.

psherman · February 25, 2022, 12:24am

I'd highly recommend that you start by looking at the DHCP leases and whether they are expiring for some reason.

Reboot of what? the end device? or your WR802n? How long is your DHCP lease time (on the external DHCP server)? If I had to guess, we're talking a week for the leases (604800 seconds)?

Rebooting individual devices will force them to request a new DHCP lease. Rebooting your WR802n might do the same thing.... wifi would bounce, and any devices directly connected to the WR802n by ethernet would also have their connections bounce.

You may want to reduce your lease time to maybe a few hours to see if that causes the issue to manifest faster. This becomes a good thing because you can then troubleshoot more effectively.

It may be wise to set up an external syslog server and have both your DHCP server and OpenWrt send logs to that host -- this way you can see the patterns in the logs.

SerialConsole · February 25, 2022, 12:32am

I can connect to the devices via serial console. According to ifconfig, they still have their IP address.

"Reboot of what? the end device? or your WR802n?"

Sorry, I meant a reboot of the WR802n. The lease time is 10 days.
I have a ping watchdog for those devices, so I get notified when there are connection problems. As already written, I also had the problem that I couldn't connect to the device via SSH and couldn't ping it from my PC, while the Linux system running the ping watchdog was able to ping it. In fact, the watchdog did not lose a single packet to that device. So I don't think it is a DHCP issue.

"Rebooting individual devices will force them to request a new DHCP lease. Rebooting your WR802n might do the same thing.... wifi would bounce, and any devices directly connected to the WR802n by ethernet would also have their connections bounce."

The WR802n has only one LAN port, and that port is its uplink to the LAN. I'm using the WR802n as an access point only.

The DHCP server is a Fritzbox, which can't do much logging.

I can grab logs the next time the problem occurs. Which OpenWrt log files are helpful?

psherman · February 25, 2022, 12:34am

Maybe the easiest thing is to set your WR802n (if this is the device that is having issues) with a static IP and see if that resolves the issue. If it does, the problem is DHCP (somewhere -- could be server, could be client, or some strange situation where something else is eating the DHCP packets -- strange, but can happen).

SerialConsole · February 25, 2022, 12:35am

The WR802n units are reachable all the time. It is the Wi-Fi clients that aren't reachable anymore. I've updated the first post to make this clear. However, the problem must be at some logical layer, because the clients don't show any Wi-Fi reconnect.

psherman · February 25, 2022, 12:39am

Oh... I see.

Rebooting the WR802n will cause the wifi interface to bounce, thus causing the clients to need to reconnect and request a new lease.

So... next time you see this happen, try rebooting the client device(s) (any that are unreachable). If that solves the problem, it is DHCP client or DHCP server related. If it doesn't resolve it, it is possible that your OpenWrt WR802n is swallowing those DHCP packets.

tcpdump will be useful here. Again, working with a very short DHCP lease time might accelerate the onset of the problem and make troubleshooting easier/faster.

SerialConsole · February 25, 2022, 12:41am

I did a renew on a non-working client yesterday or the day before yesterday. Today almost all (or all) clients of the WR802n were unreachable, so it is probably not a DHCP issue, because it is very unlikely that all leases expire at the same time.

psherman · February 25, 2022, 12:45am

Actually, if the solution to date has been to reboot the WR802n, it is very likely the DHCP leases would expire at the same time. All of the wireless devices that connect to the WR802n would need to request a new lease when the wifi comes back up. This means they would all have fresh leases at t=0 (when wifi comes up) give or take several seconds. That would also mean that they would be looking to renew at the same general time as each other, and then if no renewal happens, they would all expire within the same general time frame.
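
The timing argument above can be checked with a little arithmetic. A minimal sketch, assuming the clients use the default DHCP timers from RFC 2131 (renew at 50% of the lease, rebind at 87.5%) and the 10-day lease mentioned earlier in the thread; the function name, client names, and start times are made up for illustration:

```python
from datetime import datetime, timedelta


def dhcp_timers(lease_start, lease_seconds, t1_fraction=0.5, t2_fraction=0.875):
    """Return (renew_at, rebind_at, expires_at) for a single lease.

    The default fractions are the RFC 2131 defaults for T1 and T2;
    a server may hand out different values.
    """
    lease = timedelta(seconds=lease_seconds)
    return (
        lease_start + lease * t1_fraction,
        lease_start + lease * t2_fraction,
        lease_start + lease,
    )


LEASE_SECONDS = 10 * 24 * 3600  # 10-day lease, as stated in the thread

# Hypothetical clients that all re-associate shortly after the AP reboot.
ap_back_up = datetime(2022, 2, 24, 12, 43, 45)
clients = {
    "client-a": ap_back_up,
    "client-b": ap_back_up + timedelta(seconds=4),
    "client-c": ap_back_up + timedelta(seconds=9),
}

for name, start in clients.items():
    renew_at, rebind_at, expires_at = dhcp_timers(start, LEASE_SECONDS)
    print(f"{name}: renew {renew_at}, rebind {rebind_at}, expires {expires_at}")
```

With leases that all start within seconds of each other, the renew, rebind, and expiry times also land within seconds of each other, which is the clustering described above.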

Did the renew succeed?

EDIT: I'm not trying to place blame on DHCP -- it just seems like something worth investigating.
Another thing -- what are the devices that are becoming unreachable? Are you able to read their logs or see connectivity details (i.e. is the DHCP lease still active or has it expired, is the wifi link running, etc.)? And are you able to test from those devices -- say ping from one device to another (that would be connected via the same wifi AP)?

SerialConsole · February 25, 2022, 12:52am

There are other devices connected to another WR802n and these devices weren't affected.

Renewing the lease via systemctl restart dhcpcd did succeed, but as already mentioned only until today, when all devices on that AP became unreachable. Before that I had always rebooted the whole WR802n, and it then worked for at least one week.

"Another thing -- what are the devices that are becoming unreachable? Are you able to read their logs or see connectivity details (i.e. is the DHCP lease still active or has it expired, is the wifi link running, etc.)? And are you able to test from those devices -- say ping from one device to another (that would be connected via the same wifi AP)?"

One device that becomes unreachable is a Raspberry Pi, so I have full logs available. That is where I can see that there is no connection drop, because cat /var/log/syslog.1 | grep 'carrier lost' shows nothing for the time of the problem. The only matches are:

Feb 24 12:42:14 raspberrypi dhcpcd[643]: wlan0: carrier lost
Feb 24 12:43:45 raspberrypi dhcpcd[643]: wlan0: carrier lost

At 12:42 I rebooted the faulty WR802n; the Raspberry Pi roamed to another WR802n, and at 12:43:45 it roamed back to the rebooted WR802n and stayed there.

As written above, I had the situation where it wasn't reachable from 192.168.178.20 (a wired device), but was reachable from 192.168.178.40 (a wired device that pings it once a minute to check connectivity).

Multiple WR802n units show the same behaviour: they need a reboot from time to time, but not all at the same time.

psherman · February 25, 2022, 12:55am

Since your WR802n remains reachable, let's see what is coming up in the logs (logread).

Identify the time(s) that the Pi became unreachable and look in the Pi's log for anything around that time that could affect connectivity. Then look at the WR802n logs and see whether they show anything interesting around that time.
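
Lining up the two logs around an outage can be done by hand, but a small filter makes it quicker. A minimal sketch, assuming the classic syslog timestamp prefix seen in the Pi's log lines in this thread (it carries no year, so one is supplied); `lines_near` and its parameters are hypothetical names, and logread output on the WR802n may use a different prefix that would need its own format string:

```python
from datetime import datetime, timedelta


def lines_near(log_lines, when, window_minutes=5, year=2022):
    """Yield log lines whose timestamp lies within +/- window_minutes of `when`."""
    window = timedelta(minutes=window_minutes)
    for line in log_lines:
        try:
            # Prefix such as "Feb 24 12:42:14" is 15 characters long.
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a parsable timestamp
        if abs(stamp - when) <= window:
            yield line


# The first two lines are taken from this thread; the third is invented.
pi_log = [
    "Feb 24 12:42:14 raspberrypi dhcpcd[643]: wlan0: carrier lost",
    "Feb 24 12:43:45 raspberrypi dhcpcd[643]: wlan0: carrier lost",
    "Feb 24 18:00:00 raspberrypi example[1]: unrelated entry",
]

for hit in lines_near(pi_log, datetime(2022, 2, 24, 12, 43, 0), window_minutes=1):
    print(hit)
```

Running the same filter over both logs with the same timestamp gives two short excerpts that can be compared side by side.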

SerialConsole · February 25, 2022, 8:10pm

This is the current situation: the Raspberry Pi's connection is stable from 192.168.178.40, with no ping drops. But from 192.168.178.20 there are disruptions about every minute.

--- 192.168.178.142 ping statistics ---
110 packets transmitted, 110 received, 0% packet loss, time 264ms
rtt min/avg/max/mdev = 1.587/13.383/95.395/21.500 ms

Reply from 192.168.178.142: bytes=32 time=2ms TTL=64
Reply from 192.168.178.142: bytes=32 time=5ms TTL=64
Reply from 192.168.178.20: Destination host unreachable.
Reply from 192.168.178.20: Destination host unreachable.
Reply from 192.168.178.20: Destination host unreachable.
Reply from 192.168.178.20: Destination host unreachable.
Reply from 192.168.178.20: Destination host unreachable.
Reply from 192.168.178.20: Destination host unreachable.
Reply from 192.168.178.20: Destination host unreachable.
Reply from 192.168.178.20: Destination host unreachable.
Reply from 192.168.178.20: Destination host unreachable.
Reply from 192.168.178.20: Destination host unreachable.
Reply from 192.168.178.20: Destination host unreachable.
Reply from 192.168.178.20: Destination host unreachable.
Reply from 192.168.178.20: Destination host unreachable.
Reply from 192.168.178.20: Destination host unreachable.
Reply from 192.168.178.142: bytes=32 time=2220ms TTL=64
Reply from 192.168.178.142: bytes=32 time=1ms TTL=64
Reply from 192.168.178.142: bytes=32 time=5ms TTL=64
Reply from 192.168.178.142: bytes=32 time=8ms TTL=64
Reply from 192.168.178.142: bytes=32 time=5ms TTL=64

Ping statistics for 192.168.178.142:
    Packets: Sent = 84, Received = 83, Lost = 1
    (1% loss),
Approximate round trip times in milli-seconds:
    Minimum = 1ms, Maximum = 2220ms, Average = 46ms
Control-C

So 15 packets lost / 15 seconds not reachable. 192.168.178.20 has no network connection problems at all. Internet access is working great, and all traffic goes via the same switch. I'm 99% sure it has something to do with the WR802n. There is no firewall active on 192.168.178.142.

logread does not show anything at that time.
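
The summary printed by the Windows ping reports only one lost packet, apparently because "unreachable" replies are counted as received, so counting the failed probes directly is more telling. A minimal sketch that measures runs of consecutive failures in saved ping output; `outage_runs` is a hypothetical helper, the patterns assume Windows-style ping lines in German or English wording, and the sample reproduces the shape of the excerpt above:

```python
import re

# Lines that indicate a failed probe (German and English Windows wording).
FAILED = re.compile(
    r"nicht erreichbar|unreachable|Zeitüberschreitung|timed out", re.IGNORECASE
)
# Lines that indicate a successful echo reply, e.g. "Zeit=2ms" or "time<1ms".
REPLY = re.compile(r"(?:Zeit|time)[=<]\d+\s*ms", re.IGNORECASE)


def outage_runs(ping_lines):
    """Return the length of every run of consecutive failed probes."""
    runs, current = [], 0
    for line in ping_lines:
        if FAILED.search(line):
            current += 1
        elif REPLY.search(line):
            if current:
                runs.append(current)
            current = 0
        # Anything else (blank lines, statistics) is ignored.
    if current:
        runs.append(current)
    return runs


# German-locale wording; English-locale lines match as well.
ok = "Antwort von 192.168.178.142: Bytes=32 Zeit=5ms TTL=64"
bad = "Antwort von 192.168.178.20: Zielhost nicht erreichbar."
sample = [ok, ok] + [bad] * 14 + [ok] * 5

print(outage_runs(sample))
```

At one probe per second, each run length is roughly the outage duration in seconds, so repeated captures can show whether the gaps recur on a fixed interval.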

psherman · February 25, 2022, 8:16pm

Can you draw a diagram? From a topology standpoint: where are 192.168.178.40, 192.168.178.142 and 192.168.178.20 with respect to the WR802n and the host that is performing the pings?

Does the host performing the pings have completely normal connectivity to everything other than .20?
What about .20 -- does it have a complete loss of connectivity, or can it connect to other things?

SerialConsole · February 25, 2022, 8:19pm

192.168.178.20 and 192.168.178.40 are connected directly to the switch. The switch is directly connected to the WR802n.

I connected from 192.168.178.20 to 192.168.178.40 via SSH while the ping to 192.168.178.142 was failing, so .20 has no connection problems at all.
The pings above were run in parallel on both machines.

psherman · February 25, 2022, 8:21pm

Sorry... could you draw the diagram for me (I find it much easier to understand visually).

SerialConsole · February 25, 2022, 8:24pm

Does that help?

psherman · February 25, 2022, 8:41pm

yes... thanks.

So 20 and 40 can ping each other without issues, but 142 cannot ping 20 and 40? I assume that 142 becomes entirely isolated from everything on the other side of the WR802n? Are there any other devices on the WR802n wireless side? Can 142 ping any of those? Can any of those ping 20 and 40?

SerialConsole · February 25, 2022, 8:49pm

yes

"... but 142 cannot ping 20 and 40?"

.40 pings .142 and this works without problems.

"Are there any other devices on the WR802n wireless side?"
Yes, some. I don't know which ones. The number also changes, because they roam between different WR802n units.

I just ran a ping from .142 to Google DNS (8.8.8.8). No pings were lost, but during that time the device was not pingable from .20 for about 20 seconds.

psherman · February 25, 2022, 8:52pm

This would suggest that the issue is either the switch or the .20 device. I don't really see how the WR802n could selectively pass traffic in this way (unless you have a bridge firewall setup, but even still, it would not be intermittent).

I have seen situations where a failing switch can cause intermittent packet loss only on/between specific ports. Can you replace the switch as a test?