Actually, I have no idea. It always happens at random for me, though usually during the day, when there are three or four computers on the network. I haven't noticed any "hangs" at night when only the server was running. I have an older PC working as a server and pushing data 24/7. I also have a desktop PC and two laptops. I noticed that when all of them are in use, the "hang" occurs more often. I should probably mention that during the "hang" I always had at least two computers connected via Ethernet cable.
There was one incident, 28 days after the first LEDE install (coming from factory firmware). It was 2 am and only the server was running; my brother connected his laptop via Ethernet, and shortly after that the hang occurred.
Do you have any script or something that restarts your network when Internet fails?
I do. I have it write a log message to my USB drive so I know when the restart occurs.
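For anyone who wants to try something similar, here is a minimal sketch of such a watchdog. It is for illustration only and is not the script mentioned above. It assumes Python is available on the box (on a stock router a small shell script is more typical), that 8.8.8.8 is a suitable probe target, and that the USB drive is mounted at the hypothetical path /mnt/usb.

```python
import subprocess
import time
from datetime import datetime

PROBE_HOST = "8.8.8.8"               # assumption: any reliably reachable IP works
LOG_PATH = "/mnt/usb/watchdog.log"   # hypothetical mount point of the USB drive
FAIL_THRESHOLD = 3                   # consecutive failed probes before a restart
INTERVAL_SECONDS = 60                # pause between probes


def ping_ok(host: str) -> bool:
    """Return True if a single ping to host succeeds within 2 seconds."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def should_restart(history: list, threshold: int = FAIL_THRESHOLD) -> bool:
    """True only if the last `threshold` probes exist and all of them failed."""
    recent = history[-threshold:]
    return len(recent) == threshold and not any(recent)


def log_line(when: datetime, message: str) -> str:
    """Format one timestamped log entry."""
    return f"{when.isoformat(timespec='seconds')} {message}\n"


def main() -> None:
    """Probe forever; restart the network after repeated failures and log it."""
    history = []
    while True:
        history.append(ping_ok(PROBE_HOST))
        history = history[-FAIL_THRESHOLD:]
        if should_restart(history):
            with open(LOG_PATH, "a") as log:
                log.write(log_line(datetime.now(), "probes failed, restarting network"))
            subprocess.run(["/etc/init.d/network", "restart"])
            history = []
        time.sleep(INTERVAL_SECONDS)
```

Calling `main()` starts the loop. Requiring several consecutive failures avoids restarting the network over a single dropped ping, and the timestamps in the log make it possible to line up restarts with what the clients were doing at the time.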
Well, I also have this issue with my setup. I don't know if this is connected. I replaced the Ethernet cable and reinstalled the OS on the computer. It didn't solve the issue. I've never had this problem on v1.
I forgot that I have a second 24/7 client attached to the LAN ports: my own PC.
Installed SmokePing now. Maybe this will show something.
Did you get any response?
I received an e-mail from TP-Link Poland, Tech Department. I won't be publishing the content of the e-mail because it's in Polish, but they asked if I ever used the options "IP & Mac Binding" or "DoS Protection" in the Advanced>Advanced Security tab. Well, I didn't. Did any of you use them?
I don't think they understand that OpenWRT and LEDE also show this issue. It's funny, because I believe that the official firmware is based on OpenWRT Attitude Adjustment.
They said the unit probably has a hardware failure, and I should return it.
I did not touch those settings you mentioned (both are disabled by default).
For now I have installed the newest TP-Link "portugal" firmware, 3.17.9 Build 20170401 Rel.64459n (it also shows up if you choose Germany under regions), and am waiting for the problem to occur.
That's funny, since there are now 4 or 5 people with the same issue across Europe. Did you point that out to them? May I ask which country you are from and which TP-Link (Tech Department?) gave you that response?
@andreas asked before if I know how to reproduce this problem. Do you maybe have anything to add to what I already said?
Out of curiosity, how do you recover from the WAN/LAN "hang" on official firmware?
I know, and from what I heard, the replacement unit they got in return suffers from the same issue. I also sent them the links to these forum threads, and to the one I opened on the TP-Link forums:
For now I'm taking my time with the return; maybe we/they can find a solution, or I'll wait until they come out with v4-rev.2 hardware (which I will probably get as a replacement then). Btw, I'm from Hungary.
I really don't know how to reproduce the problem; it seems pretty random to me (maybe it depends on connection count: torrents, multiple clients...).
That's a good question.
I go to the box, and press the reset button.
I have exactly the same issue on my 2 TP-LINK Archer C7 AC1750 (CA) version routers. The freeze happened when another device was joining the network, either via WiFi or via cable. When it happened, all the devices lost their internet connection (I have 4 devices connected to the router). The devices were still pingable and still had access to the router; they just didn't have access to the internet. After 10-20 minutes the router would come back to life, and the internet connection came back on all the devices. syslog and dmesg were clean.
When it happens, browsers return a DNS error for well-known websites (e.g. www.google.com).
I suspected hostapd+dhcpd+dnsmasq caused the issue.
The issue happened on 2 identical boxes, so either there is a design defect in the power supply or the power supply is not related to the issue.
That doesn't seem exactly like the problem I'm having. You see, I can't ping the router or other devices if I'm connected via Ethernet. I also can't get an IP on the WAN interface or via the DHCP client. WiFi is working fine: I can ping other devices (not the Ethernet ones), access the router through SSH, etc.
I don't think my router ever recovered but I couldn't afford to wait 20 minutes to get my network back online.
Do you use DNSCrypt-Proxy?
If I were you, I would check whether this is just a DNS problem. Have you tried pinging IP addresses, like Google's DNS servers (8.8.8.8)?
If you haven't, then you should try it the next time this happens to you.
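A rough sketch of that check, run from an affected client while the "hang" is in progress. It assumes Google's public DNS at 8.8.8.8 accepts TCP connections on port 53 and uses www.google.com as the test name; the function names are made up for this example.

```python
import socket


def ip_reachable(ip: str = "8.8.8.8", port: int = 53, timeout: float = 3.0) -> bool:
    """Open a TCP connection to a literal IP, so no name lookup is involved."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def name_resolves(name: str = "www.google.com") -> bool:
    """Return True if the system resolver can look up the name."""
    try:
        socket.getaddrinfo(name, 80)
        return True
    except socket.gaierror:
        return False


def diagnose(ip_ok: bool, dns_ok: bool) -> str:
    """Combine both probes into a rough verdict."""
    if ip_ok and dns_ok:
        return "ok"
    if ip_ok:
        return "dns problem"        # routing works, only name lookups fail
    if dns_ok:
        return "inconclusive"       # e.g. cached answers, or the probe is blocked
    return "no connectivity"        # neither raw IP traffic nor DNS gets through
```

Usage would be `print(diagnose(ip_reachable(), name_resolves()))`. If it reports "dns problem", the WAN link itself is fine and the resolver (dnsmasq, DNSCrypt-Proxy, or the upstream DNS) is the place to look.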
Does renewing a lease cause a client to lose connection (say would a Roku stop streaming)?
I haven't noticed the "hang" being connected to DHCP.
@lacamester said his DHCP service (dnsmasq) was disabled entirely. A DHCP problem also doesn't explain the issue with the WAN port. If this were a DHCP issue, then only some computers would lose their connection at any one time, since they have different lease expiry times, and Internet over WiFi would not be working fine either. The WiFi clients are unaffected.
My SmokePing has now been running for 4 days without a single lost packet on the LAN. I can also say for sure that I never had any outages that lasted for more than 10 to 15 minutes, because one of my monitoring servers is behind the router. So, I am afraid, I cannot help debug this problem. It seems my router is not affected.
Update 2017-05-24 12:28: Added specs:
I am running LEDE Reboot 17.01.1 r3316-7eb58cf109, essentially default settings apart from: LAN is 10.0.0.0/24, router has static IPs, NAT leakage workaround, WLAN, odhcpd, and some IPv6 features are disabled.
Can you tell me which build you are running, and do you use SQM by any chance?
Sorry, I should have. I edited my original post. I run the router without added packages, and without special QoS.
So, I have used the latest TP-Link "portugal" firmware for more than a week now, and have had no issues or LAN/WAN stops whatsoever with it. It seems like it's a firmware issue after all (if it stays stable for a month, then I'll use it permanently).
But how does it affect LEDE then?
Would you mind trying to flash LEDE onto your device after some time to check if the issue has been resolved there too? Maybe there was an issue with u-boot or something?
I'm thinking of a kernel patch or some device-driver fixes (TP-Link uses the same GNU tools and daemons as LEDE/*WRT, according to the log entries).
I had an issue when I installed a brand new v4 with LEDE in a datacenter as a firewall between the servers and the rest of the network.
With default settings it did not pass any traffic towards the WAN port (if I remember correctly, the RX packet counter was growing but TX wasn't, or maybe the other way around).
Anyway, after lowering MTU to 1400, it immediately started working properly.
Maybe you could give that a try.
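Before changing anything, it may be worth checking whether packet size matters at all. The sketch below is only an illustration: it sends pings with the "don't fragment" flag set and a payload sized for a given MTU. It assumes a Linux client with iputils ping (the `-M do` option is not available in every ping implementation) and plain IPv4 without header options; the helper names are invented for this example.

```python
import subprocess

IP_HEADER = 20    # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def payload_for_mtu(mtu: int) -> int:
    """Largest ICMP payload that still fits into one packet of size `mtu`."""
    return mtu - IP_HEADER - ICMP_HEADER


def ping_command(host: str, mtu: int) -> list:
    """Build a one-shot ping with fragmentation prohibited (iputils syntax)."""
    return ["ping", "-c", "1", "-M", "do", "-s", str(payload_for_mtu(mtu)), host]


def mtu_passes(host: str, mtu: int) -> bool:
    """Return True if a packet of exactly `mtu` bytes gets a reply from host."""
    result = subprocess.run(
        ping_command(host, mtu),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

For an MTU of 1400 the payload works out to 1400 - 20 - 8 = 1372 bytes. If `mtu_passes(host, 1400)` succeeds while `mtu_passes(host, 1500)` fails for a host behind the WAN port, that would point towards the same kind of MTU problem described above.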
How do I lower MTU in LEDE?