System hiccups running OpenWrt 18.06.2 on TP-Link TL-WR940N v4

PrimeSuspect · June 19, 2019, 10:49am

I've run into a really strange problem that I'm hoping someone else has seen before or, better yet, can suggest how to fix.

As the title states, I'm running OpenWrt 18.06.2 on a TP-Link TL-WR940N v4 in access point mode. Once or twice every hour the router stops responding to ping, HTTP and SSH, which is pretty much all of the active services (I have dhcpd and dns disabled). None of the devices attached to the router have any connectivity problems; only the router's own IP is affected. I set up a cron job on the gateway the router is attached to so I could monitor this a little better, and here's an example of what I'm seeing.

Wed Jun 19 00:46:12 EDT 2019 ping timeout
Wed Jun 19 00:46:22 EDT 2019 http timeout
Wed Jun 19 00:46:37 EDT 2019 ssh timeout
Wed Jun 19 01:01:02 EDT 2019 01:01:02 up 5:47, load average: 0.00, 0.00, 0.00
Wed Jun 19 01:16:12 EDT 2019 ping timeout
Wed Jun 19 01:16:22 EDT 2019 http timeout
Wed Jun 19 01:16:37 EDT 2019 ssh timeout
Wed Jun 19 01:31:02 EDT 2019 01:31:02 up 6:17, load average: 0.00, 0.00, 0.00
Wed Jun 19 01:46:13 EDT 2019 ping timeout
Wed Jun 19 01:46:23 EDT 2019 http timeout
Wed Jun 19 01:46:38 EDT 2019 ssh timeout
Wed Jun 19 02:01:02 EDT 2019 02:01:02 up 6:47, load average: 0.00, 0.00, 0.00
Wed Jun 19 02:16:02 EDT 2019 02:16:02 up 7:02, load average: 0.00, 0.00, 0.00
Wed Jun 19 02:31:12 EDT 2019 ping timeout
Wed Jun 19 02:31:22 EDT 2019 http timeout
Wed Jun 19 02:31:37 EDT 2019 ssh timeout
Wed Jun 19 02:46:02 EDT 2019 02:46:02 up 7:32, load average: 0.01, 0.02, 0.00
Wed Jun 19 03:01:12 EDT 2019 ping timeout
Wed Jun 19 03:01:22 EDT 2019 http timeout
Wed Jun 19 03:01:37 EDT 2019 ssh timeout
Wed Jun 19 03:16:02 EDT 2019 03:16:02 up 8:02, load average: 0.00, 0.00, 0.00

So every 15 minutes I perform a ping and an HTTP request, then try to SSH in and run the uptime command. Oftentimes all of the requests time out. However, if I run another job that just pings the router's IP every 10-15 seconds, the same cron job succeeds every single time.

Tue Jun 18 20:16:02 EDT 2019 20:16:02 up 1:02, load average: 0.00, 0.00, 0.00
Tue Jun 18 20:31:02 EDT 2019 20:31:02 up 1:17, load average: 0.00, 0.00, 0.00
Tue Jun 18 20:46:02 EDT 2019 20:46:02 up 1:32, load average: 0.00, 0.00, 0.00
Tue Jun 18 21:01:02 EDT 2019 21:01:02 up 1:47, load average: 0.04, 0.01, 0.00
Tue Jun 18 21:16:03 EDT 2019 21:16:03 up 2:02, load average: 0.00, 0.00, 0.00
Tue Jun 18 21:31:02 EDT 2019 21:31:02 up 2:17, load average: 0.02, 0.01, 0.00
Tue Jun 18 21:46:02 EDT 2019 21:46:02 up 2:32, load average: 0.07, 0.01, 0.00
Tue Jun 18 22:01:02 EDT 2019 22:01:02 up 2:47, load average: 0.03, 0.02, 0.00

Not a single timeout. Does anyone have any ideas on what is happening and why? I'm more than happy to run any other tests and provide the results. Thanks in advance for any insights or help you may be able to provide.
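
For anyone who wants to reproduce the test: the HTTP and SSH parts of a check like this boil down to a TCP connect with a timeout. A minimal Python sketch follows. It is not the script that produced the logs above, and the function names, the 10-second timeout and the log wording are assumptions made for illustration.

```python
import socket
from datetime import datetime

def tcp_port_open(host, port, timeout=10.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections and unreachable hosts
        return False

def check_services(host, services, timeout=10.0):
    """Return one log line per service that could not be reached."""
    stamp = datetime.now().strftime("%a %b %d %H:%M:%S %Y")
    return [
        f"{stamp} {name} failed"
        for name, port in services.items()
        if not tcp_port_open(host, port, timeout)
    ]
```

Calling check_services("192.168.24.10", {"http": 80, "ssh": 22}) from a scheduled job would cover the two TCP services, assuming they listen on their default ports; the ping test needs a separate ICMP check.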

tmomas · June 19, 2019, 10:53am

TL-WR940N v4 is a 4/32 device. See https://openwrt.org/supported_devices/432_warning for details regarding the limitations of 4/32 devices.

Maybe you are running into RAM problems.
Do you see anything useful in the logs?

PrimeSuspect · June 19, 2019, 11:13am

I did run into storage issues when I tried installing additional packages (of course) and thought about compiling my own image, but I have other servers on my network that can perform all of the tasks I was planning to offload to the router, so I decided not to bother.

Right now it is really just serving as a wifi access point (i.e. connecting my wireless devices to my LAN). I have not seen any RAM issues to date. For instance, right now:

root@OpenWrt:~# free
             total       used       free     shared    buffers     cached
Mem:         27856      19472       8384        496       2048       6764
-/+ buffers/cache:      10660      17196
Swap:            0          0          0
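
As a sanity check on those numbers, the "-/+ buffers/cache" row is just the "Mem:" row with buffers and cache counted as available. A small Python sketch of that arithmetic, with the values copied from the output above (assumed to be in KiB, free's default unit); the variable names are only illustrative:

```python
# Values copied from the "Mem:" row of the free output above (assumed KiB).
total, used, free, shared, buffers, cached = 27856, 19472, 8384, 496, 2048, 6764

# Buffers and page cache can be reclaimed when processes need memory,
# which is why free reports a second row with them factored out.
used_by_processes = used - buffers - cached
effectively_free = free + buffers + cached

print(used_by_processes, effectively_free)  # 10660 17196
print(f"{effectively_free / total:.0%} of RAM is free or reclaimable")
```

Both results match the "-/+ buffers/cache" row, so well over half of the RAM is still available.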

I have not seen anything in the logs either. Because I have most everything disabled, the log file is pretty sparse after bootup.

It is just really strange that if I continually ping the router everything works perfectly, but if I don't, these hiccups occur. I'm not aware of any sleeping/hibernating process that would kick in, so I can't explain why this works, but it does.

mk24 · June 19, 2019, 1:11pm

I do see the uptime continues to increase, so the router is not crashing. Typically in an out-of-memory situation, the out-of-memory killer will start killing processes like dropbear and hostapd, and then the router will never accept an ssh connection or wifi connection respectively.

I would look at the client side (some sort of script; what happens if you run ping or ssh manually?) and the network in between. Maybe something is losing its ARP entry if you don't keep pinging.

PrimeSuspect · June 19, 2019, 1:48pm

Correct, the router is definitely not crashing. All of the processes are fine too as you surmised.

I have tried connecting manually during these "hiccups" and I get the same result as the monitoring script I wrote (timeouts). I don't know why only this router would be losing its ARP entry, but that actually would explain everything that is happening, so I will investigate and report back with my findings. Thank you for the suggestion.

PrimeSuspect · June 19, 2019, 6:52pm

Alright, mk24 was definitely onto something when he suggested it might be losing its ARP entry.

I modified the script to check the ARP table before it does the ping, HTTP and SSH tests. The gateway (where the monitoring script runs) never loses the router's ARP entry, but it does appear that the router may be losing the gateway's. Whenever I start getting timeouts to the router from the gateway, I can clear the ARP table and it instantly starts working again (this forces an ARP update).
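
A check like that can be as simple as looking for a complete entry for the peer's IP. Here is a rough Python sketch, assuming a Linux host where the kernel exposes the table at /proc/net/arp (the gateway here may expose it differently). This is not the actual monitoring script; the function name is illustrative, and the sample table is made up apart from the gateway's IP and MAC.

```python
ATF_COM = 0x2  # flag bit Linux sets when the entry has a resolved MAC address

def lookup_arp(table_text, ip):
    """Return the MAC for ip if the table holds a complete entry, else None."""
    for line in table_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue
        address, _hw_type, flags, mac = fields[:4]
        if address == ip and int(flags, 16) & ATF_COM:
            return mac
    return None

# Illustrative sample in /proc/net/arp layout (192.168.24.99 is a made-up,
# unresolved entry); on a live Linux system read /proc/net/arp instead.
sample = """IP address       HW type     Flags       HW address            Mask     Device
192.168.24.1     0x1         0x2         00:0d:b9:50:15:27     *        br-lan
192.168.24.99    0x1         0x0         00:00:00:00:00:00     *        br-lan
"""

print(lookup_arp(sample, "192.168.24.1"))   # 00:0d:b9:50:15:27
print(lookup_arp(sample, "192.168.24.99"))  # None
```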

I am seeing some dropped packets on the router's bridged LAN interface, so maybe some of the ARP packets are getting dropped:

br-lan    Link encap:Ethernet  HWaddr 50:C7:BF:2F:98:1D
          inet addr:192.168.24.10  Bcast:192.168.24.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2945 errors:0 dropped:407 overruns:0 frame:0
          TX packets:1232 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:367745 (359.1 KiB)  TX bytes:251950 (246.0 KiB)
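
To put the drop counter in proportion, here is a quick Python sketch that pulls the two counter lines out of the output above and compares dropped to received. The function name is illustrative, and the ratio is simply dropped divided by the packets counter, without assuming how the driver accounts for the two.

```python
import re

# The two counter lines from the ifconfig output above.
ifconfig_text = """
          RX packets:2945 errors:0 dropped:407 overruns:0 frame:0
          TX packets:1232 errors:0 dropped:0 overruns:0 carrier:0
"""

def drop_ratio(text, direction):
    """Return dropped/packets for direction 'RX' or 'TX'."""
    match = re.search(rf"{direction} packets:(\d+) errors:\d+ dropped:(\d+)", text)
    packets, dropped = int(match.group(1)), int(match.group(2))
    return dropped / packets if packets else 0.0

rx = drop_ratio(ifconfig_text, "RX")
tx = drop_ratio(ifconfig_text, "TX")
print(f"RX dropped: {rx:.1%}, TX dropped: {tx:.1%}")  # RX dropped: 13.8%, TX dropped: 0.0%
```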

The router is constantly asking for ARP updates for the gateway (the router is 192.168.24.10 and the gateway is of course 192.168.24.1):

10:12:12.669407 arp who-has 192.168.24.1 tell 192.168.24.10
10:12:12.669426 arp reply 192.168.24.1 is-at 00:0d:b9:50:15:27
10:13:17.789419 arp who-has 192.168.24.1 tell 192.168.24.10
10:13:17.789435 arp reply 192.168.24.1 is-at 00:0d:b9:50:15:27
10:14:23.869469 arp who-has 192.168.24.1 tell 192.168.24.10
10:14:23.869487 arp reply 192.168.24.1 is-at 00:0d:b9:50:15:27
10:15:30.989420 arp who-has 192.168.24.1 tell 192.168.24.10
10:15:30.989463 arp reply 192.168.24.1 is-at 00:0d:b9:50:15:27
10:16:37.149741 arp who-has 192.168.24.1 tell 192.168.24.10
10:16:37.149785 arp reply 192.168.24.1 is-at 00:0d:b9:50:15:27
10:17:41.229843 arp who-has 192.168.24.1 tell 192.168.24.10
10:17:41.229890 arp reply 192.168.24.1 is-at 00:0d:b9:50:15:27
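
The spacing between those requests can be worked out from the timestamps. A short Python sketch, using only the who-has timestamps copied from the capture above:

```python
from datetime import datetime

# Timestamps of the "who-has 192.168.24.1" requests in the capture above.
request_times = [
    "10:12:12.669407",
    "10:13:17.789419",
    "10:14:23.869469",
    "10:15:30.989420",
    "10:16:37.149741",
    "10:17:41.229843",
]

parsed = [datetime.strptime(t, "%H:%M:%S.%f") for t in request_times]
# Seconds between each request and the next one.
intervals = [(later - earlier).total_seconds() for earlier, later in zip(parsed, parsed[1:])]

for seconds in intervals:
    print(f"{seconds:.2f} s")
print(f"average: {sum(intervals) / len(intervals):.1f} s")  # average: 65.7 s
```

So the router re-requests the gateway's address roughly every 64 to 67 seconds.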

No other host on the network asks for ARP updates anywhere near that frequently. Most of the others keep their ARP entries for around 20 minutes before they re-ARP. I looked at the busybox man page for arp and it looks like I should be able to force an entry, but neither arp -s nor arp -d (to try to delete an entry) seemed to work for me. Can someone else check how often their OpenWrt router asks for ARP updates for its gateway?

My /etc/config/network is pretty straightforward:

config interface 'lan'
        option type 'bridge'
        option ifname 'eth1.1'
        option proto 'static'
        option ipaddr '192.168.24.10'
        option netmask '255.255.255.0'
        option delegate '0'
        option gateway '192.168.24.1'
        option broadcast '192.168.24.255'
        option dns '192.168.24.1'

config switch
        option name 'switch0'
        option reset '1'
        option enable_vlan '1'

config switch_vlan
        option device 'switch0'
        option vlan '1'
        option ports '1 2 3 4 0t'

So I now know exactly what is happening; I just don't know why.