Archer C7 dropping some/all IPv4 packets only for certain clients [SOLVED]

Hello all,

My house has numerous devices, including a Roku, an Amazon Fire Stick, several Android tablets, three Windows PCs (all Win7), and two iPhones. For the last few months -- and possibly much longer, since I may simply never have noticed -- two of the Windows PCs and both of the iPhones will randomly stop being able to send IPv4 data. Pings to the Archer C7 (which is my primary gateway) time out, as do nslookup calls or anything that tries to go to the internet via IPv4. IPv6, however, seems to continue to work (the original symptom was "why does Google work but nothing else sometimes?"). Then after a few minutes, IPv4 will magically start working again, and will continue to do so for some time. It's worth noting that these are all on wireless g/n, so only the 2.4GHz radio is involved. However, I have plenty of other 2.4GHz devices that work.

The router logs have not indicated anything interesting. DHCP requests are answered quickly, and the devices always show the IP assigned by the router. When I make very short lease intervals, the devices update at the expected rates. So, DHCP seems to be OK too.

What I've tried so far:

  1. Deleting ARP cache and other cached network data from the client devices
  2. Starting with a brand-spankin-new reinstall of Win7 (on one of the machines) to ensure it's not a settings issue
  3. Using a USB network adapter (with a different MAC, of course) to see if it was a MAC-related issue
  4. Updating to LEDE 17.01.2
  5. Disabling IPv6 on the client
  6. Disabling IPv6 on the router (by going to my LAN interface and disabling the IPv6 DHCP and DNS settings and setting IPv6 length assignment to disabled -- not sure if that's sufficient)
  7. Switching to hardwired ethernet (for the PCs only, of course)

None of these things have made any difference. In all cases, the router logs the incoming DHCP requests, assigns the addresses, and then, as far as I can tell (without really knowing where to look), just rejects all packets from the device for... a while. And then suddenly everything works fine.... And then it doesn't again. It's maddening that I haven't been able to find any pattern to this.

The unit is not under high load, but I even tried switching to a higher-rated power adapter in case something was dragging it down (the only other things I have this router doing are a DDNS call every few days and acting as an OpenVPN server so I can reach my music library from work, and yes, I tried disabling that as well).

Can anyone offer any other suggestions as to what I should be looking at or what might be going wrong?

Thanks!

You covered a lot of ground already. Especially since the problem persists over wired, you know you can eliminate RF problems.

Extra interfaces (tunnels, 3G, whatnot) being active and intermittently selected as a default route comes to mind.

I'd try running wireshark on the Win7 boxes (staying on wired for this) and see if they think they are sending data out the wired port, while the problem is manifesting, and ensure said packets are not encapsulated in a tunneling protocol.

Then if that looks OK, use dumpcap or tcpdump to check whether packets are arriving at the Archer and then whether it is forwarding them by specifying ingress/egress interfaces in turn. Important to be certain where those drops are happening.

Here is a wireshark packet capture running on the Dell laptop with a wired connection (the ip is .238). The last 4 or 8 pings at the bottom of the file worked (even though the capture says no response!), while all of the previous ones did not. I don't see any really obvious signs of something magically getting fixed in between, mostly just a lot of ARP packets, but I'm definitely not a wireshark expert of any sort.
I was hoping there might be a similar kind of tool I could run on the router to monitor all inbound traffic, but I can't get tcpdump (the only one I'm familiar with) to look at any other adapter aside from my public-facing ones. When I try the LAN, wireless or bridge adapters they all say they can't be put into monitor mode. And doing a generic "tcpdump host 192.168.0.238" yields nothing - so either the packets aren't making it to the router, or I'm not using the right tcpdump command to see them. How do you install dumpcap? It wasn't in the opkg list, and I can't imagine trying to compile something on the router itself.

And as one additional datapoint I disabled the stupid MS teredo IPv6 interface on the laptop to see if it would make a difference after noticing all of the packet spam in the pcap. Sadly, the problem remains.

dumpcap is just wireshark/tshark's UI-less backend for use when you want to limit your code surface exposure, or just need something really tiny. If you have tcpdump, that's good enough. I've never had an issue running it on LAN ports (you'd run it on the ethX interface that is the uplink from the embedded switch to the CPU).

It looks like when they fail, packets to 192.168.0.1 are using a Wistron destination MAC address, and when they succeed they are using a TP-Link destination MAC address. That would make me go looking for another device on the LAN which is stealing 192.168.0.1; one with a Wistron card.
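
For anyone wanting to spot this kind of thing without eyeballing a capture, here is a minimal Python sketch of the check described above: group (IP, MAC) pairs and flag any IP that shows up with more than one MAC. The function name `find_ip_conflicts` and the sample addresses are made up for illustration; how you extract the pairs from your capture is left out and would depend on your tooling.

```python
from collections import defaultdict


def find_ip_conflicts(observations):
    """Return {ip: sorted MACs} for every IP seen with more than one MAC.

    `observations` is any iterable of (ip, mac) string pairs, for example
    the destination IP and destination MAC of each captured frame.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        # Normalise case so "AA:..." and "aa:..." count as the same MAC.
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Placeholder data only: these MACs are invented, not taken from the capture.
sample = [
    ("192.168.0.1", "AA:AA:AA:00:00:01"),
    ("192.168.0.1", "bb:bb:bb:00:00:02"),
    ("192.168.0.1", "aa:aa:aa:00:00:01"),
    ("192.168.0.238", "cc:cc:cc:00:00:03"),
]
print(find_ip_conflicts(sample))
```

If the gateway address appears in the result, two different devices are answering for it, which matches the fail/succeed pattern in the capture.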

That is an amazingly good lead -- thanks so much! Now to start hunting down the zillion-plus IP-connected devices in my house...

Check the ARP tables for another address held by the same MAC, or by one close to it, for clues. Failing that, bisect by hub/wire: unplug half the network at a time until the conflict goes away.
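
A rough Python sketch of the ARP-table check, assuming you have pasted in text in the Linux `/proc/net/arp` layout (Windows `arp -a` output looks different and would need its own parser). Here "close" is taken to mean "shares the first three octets", i.e. the same vendor prefix; that is my interpretation, and the helper names and sample rows are invented for illustration.

```python
def parse_arp_table(text):
    """Parse /proc/net/arp-style text into (ip, mac, device) tuples."""
    entries = []
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) >= 6:
            # Columns: IP address, HW type, Flags, HW address, Mask, Device
            entries.append((fields[0], fields[3].lower(), fields[5]))
    return entries


def same_vendor(entries, suspect_mac):
    """Return entries whose MAC shares the first three octets with suspect_mac."""
    prefix = suspect_mac.lower()[:8]  # "aa:bb:cc" is 8 characters
    return [entry for entry in entries if entry[1].startswith(prefix)]


# Placeholder table: addresses and MACs are invented for the example.
ARP_TEXT = """IP address       HW type     Flags       HW address            Mask     Device
192.168.0.1      0x1         0x2         bb:bb:bb:00:00:02     *        br-lan
192.168.0.57     0x1         0x2         BB:BB:BB:00:00:09     *        br-lan
192.168.0.238    0x1         0x2         cc:cc:cc:00:00:03     *        br-lan
"""

for ip, mac, device in same_vendor(parse_arp_table(ARP_TEXT), "bb:bb:bb:00:00:02"):
    print(ip, mac, device)
```

Any second address that turns up with the suspect prefix is a good candidate for the device that is also answering as the gateway.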

Also, you can get problems like this mostly out of your hair by using a random 10.x /24 subnet for your home LAN instead of 192.168.x. Not 10.0.0 or 10.1.1 or 10.1.2 -- everyone uses those. Choose something unlikely.
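
If you would rather not pick by hand, a small Python sketch like this can choose one for you. It only skips the three prefixes named above; the function name `pick_subnet` and the `COMMON` set are made up for the example, and you may want to add other prefixes you know are in use (VPNs, work networks, and so on).

```python
import ipaddress
import random

# (second octet, third octet) pairs to avoid: 10.0.0.x, 10.1.1.x, 10.1.2.x
COMMON = {(0, 0), (1, 1), (1, 2)}


def pick_subnet(rng=random):
    """Return a random 10.x.y.0/24 network that is not in COMMON."""
    while True:
        second, third = rng.randrange(256), rng.randrange(256)
        if (second, third) not in COMMON:
            return ipaddress.ip_network(f"10.{second}.{third}.0/24")


print(pick_subnet())
```

The router would then typically take the .1 address in whatever network comes out.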

That's exactly what I did, though I was never able to tell exactly which device was causing the problem. We had a lightning strike a few weeks ago that reset/broke several devices in my home. I'm assuming that one of those is the culprit. I've changed to a completely random 10.x/24, and have basically rebooted every device one at a time so I can assign names to each MAC address to hopefully make this kind of thing easier to diagnose. I've also rotated my wifi passwords and added a new guest network on a separate firewall zone just in case there were bad actors involved. Thanks so much for your help and great suggestions!

No problem at all. If you could edit the thread subject to add [SOLVED] that would be great.