LAN stops working every now and then

Not that long ago I bought a WR1043NDv4 and installed LEDE 17.01 stable on it. Everything seems to be working great except one thing: the LAN stops working every now and then. There is no internet and no local network. I can't access webpages or other devices on the same network.

I just had an incident a few minutes ago and to my surprise the logs are clean. Nothing unusual is happening there. The syslog and dmesg are clean, with no new unusual entries. During the outage I can't access services on other computers, networked hard drives or the router configuration page (LuCI). SSH doesn't respond either. The only indication it happened is less space on storage after logging into the router through SSH and checking df.

The outage seems to correct itself after a few minutes. Has anyone noticed something similar on their devices?

First check whether the involved clients still have their IP addresses and routes set, as this would be the most likely cause of problems (not renewing the DHCP lease correctly; if that actually is the case, testing more current development snapshots could be a way out).

Other than DHCP woes, the problems you describe generally shouldn't happen, as the router's switch should more or less be configured to pass everything on undisturbed between the LAN ports, so even if the router's OS got stuck, the switch should remain in its previously configured state... What could result in very similar behaviour would be either loops between LAN segments or faulty devices 'reflecting' some or most of the packets going by (e.g. years ago I had a switch exhibiting that behaviour whenever it lost power or was switched off), which could effectively drown the network in useless packets. Therefore it would make sense to reduce your network down to just two clients connected directly to your LAN ports, check if the problems are gone, and then carefully restore the whole networking infrastructure piece by piece, to eventually pinpoint a potentially broken participant (could be a client's network card, some rogue dumb switches or APs, or even a semi-broken (patch) cable). If you're familiar with a network sniffer like Wireshark, this might also help to cut down your experimenting.

Keep in mind that your device is relatively new to the market; the recent thread about NAT leakage on the TL-WR1043ND v4 could also imply a more general problem with initializing the particular switch chipset used in your router. Also give the VLAN setup a quick sanity check.
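One way to do the VLAN sanity check from an SSH session on the router is sketched below. It only prints the switch and VLAN sections that `uci` stores in /etc/config/network and the live state reported by `swconfig`. The device name `switch0` is an assumption (`swconfig list` shows the actual name), and the function name is made up for this example.

```shell
#!/bin/sh
# Sketch: print the switch/VLAN setup of an OpenWrt/LEDE router.
# "switch0" is an assumed device name; check `swconfig list` on your unit.
SWITCH_DEV="${SWITCH_DEV:-switch0}"

show_vlan_setup() {
    if command -v uci >/dev/null 2>&1; then
        # Switch and VLAN sections as stored in /etc/config/network
        uci show network | grep -E 'switch|vlan'
    else
        echo "uci not found (not an OpenWrt/LEDE system?)"
    fi
    if command -v swconfig >/dev/null 2>&1; then
        # Live switch state: per-VLAN port membership and link status
        swconfig dev "$SWITCH_DEV" show
    else
        echo "swconfig not found"
    fi
}

show_vlan_setup
```

On an untouched default configuration the VLAN sections should look the same before and during an outage; a difference between the two would point at the switch setup rather than at the clients.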

The router in question was working for less than 6h and DHCP leases are set to expire in 12h.

During the most recent bug occurrence only one device was connected to the network. It's an older netbook working 24/7. It was there on the previous router so I don't think it's the culprit. It is a 100Mb/s device so it's affected by this bug: https://forum.openwrt.org/t/tp-link-wr1043nd-v4-issue/
Can the bug discussed in the thread linked above cause this issue?

I think this was a LEDE problem in general and it was fixed by adding a rule to the firewall. I have this rule added in my configuration but the problem persists with or without it.

What do you mean? I didn't change the vlans. I have the default one untouched.
I can provide my configuration files here if this would help.

DHCP clients usually try to renew their lease after half the DHCP lease time.
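As a worked example of that rule: RFC 2131 gives default client timers of T1 (renew) at 0.5 of the lease time and T2 (rebind) at 0.875 of it, unless the server sends other values. The small sketch below (the function name is made up for this example) applies those defaults to the 12h lease mentioned above.

```shell
#!/bin/sh
# Default DHCP client timers per RFC 2131:
#   T1 (renew)  = 0.5   * lease time
#   T2 (rebind) = 0.875 * lease time
dhcp_timers() {
    lease=$1                 # lease time in seconds
    t1=$((lease / 2))        # client first tries to renew here
    t2=$((lease * 7 / 8))    # client falls back to rebinding here
    echo "lease=${lease}s renew(T1)=${t1}s rebind(T2)=${t2}s"
}

# 12h lease, as configured on the router in this thread
dhcp_timers $((12 * 3600))
# prints: lease=43200s renew(T1)=21600s rebind(T2)=37800s
```

So with a 12h lease a client would first try to renew after 6h and would start rebinding after 10.5h.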

So let's say there is a problem with dnsmasq. How do I fix this or at least confirm it?
Would you like to see my config?

As mentioned before, you need to check the situation on your connected clients first.

I will try to do that but it will be hard since the issue occurs at random times. Anything else I should check while I'm at it? All of my machines are running Windows by the way.

Also one other thing.
The DHCP expiration times are different for all of my computers, yet the issue is happening simultaneously on all of them.

It just happened.
It's not DHCP. After doing /etc/init.d/dhcp restart I got:

root@WR1043NDv4_LEDE:~# service dnsmasq restart
udhcpc: started, v1.25.1
udhcpc: sending discover
udhcpc: no lease, failing

WAN doesn't work either, yet WiFi is working fine, and I can connect to the router's LuCI page and through SSH. Ethernet clients don't get an IP from DHCP; it's as if there was no network. The switch is working: the computer detects a new cable and the LED on the router turns on when I connect a cable.

Logs again are clear. Don't know what to do next. Any ideas?

The idea was not to do anything on the router, but to check on the clients whether their IPs and routing tables were correctly renewed; restarting dnsmasq on the router doesn't help with that at all, nor does it encourage the clients to renew their DHCP lease.

I did check it and DHCP didn't work. I couldn't get an IP or any connection when using the Ethernet ports. I don't think it was DHCP because WiFi worked flawlessly and an IP was assigned properly there. Also the WAN port didn't work. I was unable to obtain a public IP. I don't know what that has to do with DHCP.

Either way I ended up replacing the router with a new unit. If this happens again I will be sure there is something wrong with the system.

So... I have the same problem with the replacement unit. Any ideas?

Actually the switch itself seems to be working in basic mode. I am able to access network storage connected through LAN, yet the network on the switch is completely dead.

Same problem here: LAN/WAN connections are up (there is a link), but no data is transferred. Wifi works, but the logs contain only failed PPPoE requests. I don't use DHCP (the service is disabled, too). And it seems to be more frequent since I moved from factory firmware (once every day). I also tried the latest OpenWrt build; same result... :S

Finally! Someone who shares the same problem. I was starting to think I was doing something wrong.

Since you don't use DHCP I guess we can rule out dnsmasq as a cause. When you say factory firmware do you mean the stock TP-Link firmware? I admit I haven't used it at all. As soon as I booted it up I installed LEDE onto it. If this issue is also present in original firmware then maybe there is some hardware issue?

I had a WR1043ND v1 before, running OpenWRT, and with the same settings it was working very well, no issues. I was going to test the stock firmware on v4 next but since you've already done it, I think I'll pass.

Did you maybe have a chance to use v3 before?

Yes, the stock TP-Link firmware seems a bit more stable for me (used it for a month with only 3-4 LAN lockups). After I had disabled every feature I could (IPv6, firewall, wifi...) and the problem still persisted, I moved to OpenWRT (then LEDE). Actually I found one guy at the OpenWrt forums with the same issue. TP-Link replaced his router, but that did not solve his problem. Someone said that this could also be a traffic-related problem (weird ISP configuration, oversized packets...), or that the problem is in the u-boot section, which is not replaced by custom firmware. I'll try to experiment with it over the weekend.
We used v3 units at work, and they were rock-solid; that's why I chose this box. I hope the problem can be solved in software.

If the problem is also present on factory firmware then I don't know what else there is to do. I actually was able to run the router for 28 days without incident when I initially installed LEDE. Right now I'm unable to get more than 3 days without WAN/LAN hangs. I also got the unit replaced and this also didn't solve the issue for me. The weird thing is that the logs and dmesg are perfectly clean. In my case there isn't a single entry reflecting the issue. I wrote a script that checks every 5 minutes if the internet is working and then restarts the network when WAN/LAN hangs.
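The script itself wasn't posted, so the following is only a minimal sketch of such a watchdog, not the poster's actual code. The probe address, the ping options, the restart command and the log tag are all assumptions chosen for illustration, and each can be overridden through the environment.

```shell
#!/bin/sh
# Connectivity watchdog sketch for LEDE/OpenWrt (illustrative, not the original script).
# Assumptions: probe host, ping options and restart command below.
PROBE_HOST="${PROBE_HOST:-8.8.8.8}"
PING_CMD="${PING_CMD:-ping -c 3 -W 2}"
RESTART_CMD="${RESTART_CMD:-/etc/init.d/network restart}"

check_and_recover() {
    # If the probe answers, there is nothing to do
    if $PING_CMD "$PROBE_HOST" >/dev/null 2>&1; then
        echo "ok"
        return 0
    fi
    # Leave a trace in the system log, since the hang itself logs nothing
    if command -v logger >/dev/null 2>&1; then
        logger -t lan-watchdog "probe to $PROBE_HOST failed, restarting network" 2>/dev/null
    fi
    $RESTART_CMD
    echo "restarted"
}

# Only act when called with "run", so the file can be sourced safely
if [ "${1:-}" = "run" ]; then
    check_and_recover
fi
```

Saved under a hypothetical path such as /root/lan-watchdog.sh, it could be scheduled every 5 minutes with a crontab entry like `*/5 * * * * /root/lan-watchdog.sh run`. The logger line also gives each outage a timestamp in the syslog, which may help to spot a pattern.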

About the traffic-related problem you mentioned: I don't think it's a problem with the ISP. My cable modem runs in modem mode, which means that the LEDE router is doing all the work. I also had the same ISP, same modem and same config on v1 and I didn't have this issue there.

You also mentioned the u-boot section. I'm not sure what this has to do with the problem since firstly the problem also occurs on the TP-Link firmware and secondly once LEDE boots, u-boot has nothing else to do. I don't think replacing u-boot sections is standard practice when installing LEDE, but I might be wrong. Maybe @jow or @hnyman can clarify this.

Did you maybe try to contact TP-Link support regarding this issue?

Maybe try replacing the power supply if you happen to have a suitable replacement available.

My replacement unit came with a new power supply.
I also don't think the power supply has changed since HW2.1 and everything worked fine there.

I'm thinking about returning the router to the store as a faulty model. BTW TP-Link still hasn't fixed the NAT leakage problem, so I can use this as grounds if the store tries to make trouble for me (which would be illegal for them to do).

Can you maybe share a link to the thread you mentioned?

It was Gluon, sorry (but it's based on OpenWRT too):