WRT1900ACS v2 - Losing internet connection randomly

abysso2 · May 28, 2020, 6:14am

Regarding the DHCP firewall rules: same here...

trendy · May 28, 2020, 7:34am

It's not clear which one this is. I am referring to the IP that your OpenWrt router uses as the gateway on the WAN interface, which is also the default gateway for your network.
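To see which address this is on the router itself, here is a minimal sketch that extracts the "via" address of the IPv4 default route. It assumes the usual ip route output format (default via <gateway> dev <device> ...); the function name default_gw is hypothetical, and it reads from stdin only so it can be tried on sample text.

```shell
# default_gw (hypothetical helper): print the gateway of the first default
# route found in `ip route` output read from stdin.
default_gw() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++)
            if ($i == "via") { print $(i + 1); exit }
    }'
}

# Usage on the router:
#   ip -4 route show | default_gw
```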

Morgoth · May 28, 2020, 11:29am

The problem occurred again a few minutes ago. The WAN interface had been connected for 2d 1h 9m 16s and the lease was due to expire in 4 days (4d 22h 50m 44s).

dmesg:

[46022.627061] ieee80211 phy0: Mac80211 start BA a4:45:19:19:2d:af
[46607.075744] ieee80211 phy0: radar detected by firmware
[46607.623240] ieee80211 phy0: channel switch is done
[46607.628058] ieee80211 phy0: change: 0x60
[46608.117127] ieee80211 phy1: Mac80211 start BA a4:45:19:19:2d:af
[46612.247855] ieee80211 phy0: Mac80211 start BA a4:45:19:19:2d:af
[46612.282072] ieee80211 phy0: Mac80211 start BA a4:45:19:19:2d:af
[46612.342072] ieee80211 phy0: Mac80211 start BA a4:45:19:19:2d:af
[46612.402062] ieee80211 phy0: Mac80211 start BA a4:45:19:19:2d:af
[46612.462065] ieee80211 phy0: Mac80211 start BA a4:45:19:19:2d:af
[46612.522065] ieee80211 phy0: Mac80211 start BA a4:45:19:19:2d:af
[46612.582069] ieee80211 phy0: Mac80211 start BA a4:45:19:19:2d:af
[46613.368079] ieee80211 phy0: Mac80211 start BA a4:45:19:19:2d:af
[46715.662384] ieee80211 phy1: Mac80211 start BA a4:45:19:19:2d:af
[53815.348931] ieee80211 phy1: Mac80211 start BA a4:45:19:19:2d:af
[64732.883841] ieee80211 phy0: Mac80211 start BA a4:45:19:19:2d:af
[77440.893488] ieee80211 phy0: Mac80211 start BA a4:45:19:19:2d:af
[110682.524963] ieee80211 phy0: Mac80211 start BA a4:45:19:19:2d:af
[132533.645448] ieee80211 phy1: Mac80211 start BA a4:45:19:19:2d:af

logread (for reference, I lost the connection a few minutes before 13:00 on May 28):

Thu May 28 10:17:08 2020 authpriv.info dropbear[9975]: Child connection from 192.168.0.160:50369
Thu May 28 10:17:11 2020 authpriv.notice dropbear[9975]: Password auth succeeded for 'myuser' from 192.168.0.160:50369
Thu May 28 10:17:50 2020 authpriv.notice sudo: myuser : TTY=pts/0 ; PWD=/home/myuser ; USER=root ; COMMAND=/bin/ash
Thu May 28 10:39:32 2020 daemon.err uhttpd[1764]: luci: accepted login on / for root from 192.168.0.160
Thu May 28 10:49:13 2020 authpriv.info dropbear[9975]: Exit (myuser): Disconnect received
Thu May 28 12:59:05 2020 daemon.err uhttpd[1764]: luci: accepted login on / for root from 192.168.0.160
Thu May 28 12:59:43 2020 authpriv.info dropbear[10602]: Child connection from 192.168.0.160:54109
Thu May 28 12:59:46 2020 authpriv.notice dropbear[10602]: Password auth succeeded for 'myuser' from 192.168.0.160:54109
Thu May 28 12:59:52 2020 authpriv.notice sudo: myuser : TTY=pts/0 ; PWD=/home/myuser ; USER=root ; COMMAND=/bin/ash

ping WAN IP: SUCCESS
ping ISP modem 'administration' IP: SUCCESS
ping default gateway: FAILED (note: I can't ping it even when my connection is working)
ping ISP DNS servers: FAILED
ping Google DNS: FAILED
ping Quad9 DNS: FAILED
ip route: routes are ok with the correct default gateway
netstat:
- the only established connection is for my SSH session.
- remaining are ports listening.
- no connection to outside the LAN
ifstatus wan: the only difference from this morning is the uptime
arp:
- the arp content is identical to what I got earlier this morning.
- the arp for the default gateway is present.

Once I had collected all that information, I ran ifup wan. The DHCP lease expiration went back to 7 days, but I still could not connect outside the LAN.

This time I didn't reboot the router. I only restarted the WAN interface, and the internet connection started working again.

Let me know if there is other information I should collect.
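The ping checks listed above can be scripted so they run the same way at every occurrence. A minimal sketch, assuming BusyBox or iputils ping (-c packet count, -W reply timeout in seconds); ping_report is an illustrative name and PING_CMD is a hypothetical override that exists only to make the function testable.

```shell
# ping_report (hypothetical helper): ping each target once and print
# "<target>: SUCCESS" or "<target>: FAILED".
ping_report() {
    for target in "$@"; do
        # PING_CMD is left unquoted on purpose so its words split into
        # a command plus options.
        if ${PING_CMD:-ping -c 1 -W 2} "$target" >/dev/null 2>&1; then
            echo "$target: SUCCESS"
        else
            echo "$target: FAILED"
        fi
    done
}

# Usage on the router, e.g. Quad9, Google and Cloudflare DNS:
#   ping_report 9.9.9.9 8.8.8.8 1.1.1.1
```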

abysso2 · May 28, 2020, 11:50am

@trendy:

Here is the config of my wan interface:

[image: WAN interface configuration]

Regarding my last post, the *194 is the inside interface and the *193 the outside interface. When I lose the connection to the internet (no ping to 1.1.1.1), I can still ping both the *194 and the *193 IP!

trendy · May 28, 2020, 1:48pm

Then it is not a DHCP issue.

This makes troubleshooting more difficult.
As a last resort, run tcpdump on the WAN interface when the problem reoccurs:
tcpdump -i eth1.2 -ev
Then try to ping and browse to see whether traffic is going out.
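Without a filter, -ev prints every packet, which is hard to read on a busy line. A narrower sketch, assuming the WAN device is eth1.2 as above: capture only ARP and ICMP while pinging from a second SSH session, to see whether echo requests leave and whether replies come back. The function name and the WAN_DEV override are hypothetical; add "or icmp6" to the filter to watch IPv6 pings as well.

```shell
# capture_probe_traffic (hypothetical helper): show ARP and ICMP on the
# WAN device. $1 is the number of packets after which tcpdump stops.
capture_probe_traffic() {
    # -n: no DNS lookups, -e: print link-level (MAC) headers,
    # -c: stop after N packets so the capture ends on its own.
    tcpdump -i "${WAN_DEV:-eth1.2}" -n -e -c "${1:-50}" 'arp or icmp'
}

# Usage: run capture_probe_traffic 100 in one SSH session and
# ping 1.1.1.1 in another.
```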

trendy · May 28, 2020, 1:50pm

Try traceroute or mtr to 1.1.1.1 to see where it stops.
However, if you can reach your ISP, the problem may well not be on your side.

anon50098793 · May 28, 2020, 2:06pm

hmmm... was that via:

ip neigh show | grep REACHABLE

You can trigger a tcpdump or your ping scripts from the connectiondown function:

#!/bin/bash

VERBOSE=1
LOOPDELAY=3
BACKOFFUP=3
BACKOFFSTALE=1 
STALEMAX="3"
WANREPAIRMAX=10000 #WANREPAIRMAX=10
###############
STALECOUNT="0"
DOWNCOUNT="0"
CHECKCOUNT="0"
WANREPAIRCOUNT=0
WANREPAIRSTATE="INIT"
WANSTATUS=
GWIP="`ip route | grep default | cut -d' ' -f3`"
if [ ! -z "$GWIP" ]; then  WANSTATUS="UP"; else WANSTATUS="DOWN"; fi
WANIF=$(uci show network | grep 'network.wan.ifname' | cut -d"'" -f2)
[ -z "$WANIF" ] && echo "unknown WANIF" && exit 0
WANIP=$(ip route show | grep default | cut -d' ' -f9)
WANIFAUTO="`ip neigh show | grep -w "$GWIP" | cut -d' ' -f3`"

echo "          WANIF: $WANIF@$WANIP>$GWIP"
sleep 2


# Refresh the default gateway IP and mark the WAN UP if one exists.
getgwip() {
	GWIP="`ip route | grep default | cut -d' ' -f3`"
	if [ ! -z "$GWIP" ]; then  WANSTATUS="UP"; else WANSTATUS="DOWN"; fi
}

# Ping the gateway (2 packets, 3 s deadline); success marks the WAN as UP.
checkup() {
	CHECKCOUNT=$((CHECKCOUNT + 1))
	echo -n "CHECKUP:"
	if ping -c 2 -w 3 "${GWIP}" >/dev/null 2>&1; then
		#echo " [ok]";
		WANSTATUS="UP"; return 0
	else
		#echo " [down]";
		WANSTATUS="DOWN"; return 1
	fi
}

# Log the outage and restart networking once, then stay in FIXING until
# the gateway shows up as REACHABLE again.
connectiondown() {
	if [ "$WANREPAIRSTATE" = "FIXING" ]; then
		echo "connection still down"
	else
		logger -t wan-mon "${WANIF} DOWN:`date +%Y%m%d-%H%M` ${DOWNCOUNT}"
		logger -t wan-mon "${WANIF} `date +%Y%m%d-%H%M` RESTARTING NETWORK..."
		/etc/init.d/network restart; sleep 7
		WANREPAIRCOUNT=$((WANREPAIRCOUNT + 1))
		WANREPAIRSTATE="FIXING" #REPAIR FAILED WAIT
	fi
	WANSTATUS="DOWN"
	showstats
}

connectionup() {
	logger -t wan-mon "${WANIF} UP:`date +%Y%m%d-%H%M` ${DOWNCOUNT}"
}


showstats() {
echo -n "${WANRESULT:-LEARNING}"
echo -n ":::"
echo -n "$WANIF@$WANIP>$GWIP [$WANSTATUS]"
echo -n " STALE[$STALECOUNT]"
echo -n " CHECK[$CHECKCOUNT]"
echo -n " DOWN[$DOWNCOUNT]"
echo -n " WANREPAIR[$WANREPAIRCOUNT]"
echo -n " REPAIRSTATE[$WANREPAIRSTATE]"
echo ""
}


# Main loop: check the gateway's neighbour (ARP) state every LOOPDELAY seconds.
while true; do

	if [ "$WANREPAIRCOUNT" -gt "$WANREPAIRMAX" ]; then
		echo "ENDING SCRIPT ::: WANREPAIRCOUNT: $WANREPAIRCOUNT > WANREPAIRMAX: $WANREPAIRMAX"
		exit 21
	fi

	getgwip

	if [ -z "$GWIP" ]; then
		connectiondown
	else
		WANSTATE="`ip neigh show dev $WANIF | grep -w "$GWIP"`"
		case $WANSTATE in
		*"FAILED"*)
			WANRESULT="FAILED"
			WANSTATUS="DOWN"
			if [ "$WANREPAIRSTATE" = "FIXING" ]; then
				echo "hmmm... fixing but still down"
			fi
		;;
		*"STALE"*)
			WANRESULT="STALE"
			STALECOUNT=$((STALECOUNT + 1))
		;;
		*"REACHABLE"*)
			WANRESULT="REACHABLE"
			STALECOUNT=0
			if [ "$WANREPAIRSTATE" = "FIXING" ]; then
				WANREPAIRSTATE="REPAIRED"
				logger -t wan-mon "${WANIF} UP:`date +%Y%m%d-%H%M` ${DOWNCOUNT} NETWORK RESTORED..."
			fi
			sleep $BACKOFFUP

		;;
		*"DELAY"*)
			WANRESULT="DELAY"
		;;
		*)
			WANRESULT="UNKNOWN"
			WANSTATUS="DOWN"
		;;
		esac

		showstats
		if [ "$VERBOSE" -gt 0 ]; then
			: #logger -t wan-mon-stat "`showstats`";
		fi

		if [ "$STALECOUNT" -gt "$STALEMAX" ]; then
			if checkup; then
				STALECOUNT=0
			else
				DOWNCOUNT=$((DOWNCOUNT + 1))
				connectiondown
			fi
		fi

		if [ "$WANSTATUS" = "DOWN" ]; then
			echo "STATE is DOWN at bottom loop"; sleep 1
			if checkup; then
				STALECOUNT=0
			else
				DOWNCOUNT=$((DOWNCOUNT + 1))
				connectiondown
			fi
		fi

		if [ "$STALECOUNT" -lt "$STALEMAX" ]; then sleep $BACKOFFSTALE; fi
	fi

	sleep $LOOPDELAY
done

echo "Exiting"
exit 0
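As a sketch of that idea: a hypothetical collect_diagnostics helper that could be called in the else branch of connectiondown(), just before /etc/init.d/network restart, so the broken state is written to a file before the restart clears it. LOG_DIR (default /mnt/sda1/wan-mon) is an assumed location on external storage, the extra ping targets are only examples, and GWIP comes from the script above. The function prints the path of the log it wrote.

```shell
# collect_diagnostics (hypothetical helper): snapshot routes, neighbours
# and a few pings into a timestamped log file, then print its path.
collect_diagnostics() {
    dir="${LOG_DIR:-/mnt/sda1/wan-mon}"   # assumed external-storage path
    mkdir -p "$dir" || return 1
    out="$dir/down-$(date +%Y%m%d-%H%M%S).log"
    {
        echo "== ip route"; ip route
        echo "== ip neigh"; ip neigh show
        # Same ping options as checkup(): 2 packets, 3 s deadline.
        for target in "$GWIP" 1.1.1.1 9.9.9.9; do
            if ping -c 2 -w 3 "$target" >/dev/null 2>&1; then
                echo "== ping $target: ok"
            else
                echo "== ping $target: FAILED"
            fi
        done
    } >"$out" 2>&1
    echo "$out"
}
```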

Morgoth · May 28, 2020, 4:27pm

Unfortunately, no. I simply ran the arp command and checked that the gateway's IP was there. I'll use the command you provided next time.
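To check one address rather than grepping the whole table, a small sketch that prints just the neighbour state (REACHABLE, STALE, DELAY, FAILED, ...) for a given IP. It assumes the state is the last field of the ip neigh line, which holds for ordinary dynamic entries in iproute2 and BusyBox output; neigh_state is a hypothetical name.

```shell
# neigh_state (hypothetical helper): read `ip neigh show` output on stdin
# and print the state of the entry whose address equals $1 exactly.
neigh_state() {
    awk -v ip="$1" '$1 == ip { print $NF; exit }'
}

# Usage on the router, with GWIP set to the WAN default gateway:
#   ip neigh show dev eth1.2 | neigh_state "$GWIP"
```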

Morgoth · May 28, 2020, 4:41pm

I'll check with my ISP whether there is a reason why I cannot ping the gateway. In the meantime, I'll add the tcpdump command to my script.

Barney · May 28, 2020, 7:45pm

Here you wrote that you do not have internet access.

Here you wrote that you do have internet access.

Both can't be true. Either you have internet access or you do not.

Please note that it can take up to 30 seconds for the WAN interface to be assigned a new IP.

The next time the failure appears, check the WAN IP before you run ifup wan, then run ifup wan, wait 1 minute and check the WAN IP again. If the old and new IPs differ, that confirms the theory from my earlier post.
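The procedure above can be sketched as a small script. It assumes the WAN device is eth1.2 and that OpenWrt's ifup command is available; the function names and the WAN_DEV and WAIT_SECONDS overrides are hypothetical.

```shell
# wan_ipv4 (hypothetical helper): print the first IPv4 address found in
# `ip -4 addr show dev <if>` output on stdin, without the prefix length.
wan_ipv4() {
    awk '$1 == "inet" { sub(/\/.*/, "", $2); print $2; exit }'
}

# check_ip_after_ifup (hypothetical helper): record the WAN IP, run
# ifup wan, wait (60 s by default) and report whether the IP changed.
check_ip_after_ifup() {
    dev="${WAN_DEV:-eth1.2}"
    before=$(ip -4 addr show dev "$dev" | wan_ipv4)
    ifup wan
    sleep "${WAIT_SECONDS:-60}"
    after=$(ip -4 addr show dev "$dev" | wan_ipv4)
    if [ "$before" = "$after" ]; then
        echo "unchanged: $before"
    else
        echo "changed: $before -> $after"
    fi
}
```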

Morgoth · May 28, 2020, 9:00pm

However, that is exactly what happens every time. Just before 13:00 I lost the internet connection. By no internet connection I mean that the computer I was using (MacBook Pro, wired connection) couldn't load any website in a browser (I use www.perdu.com for testing) or ping public DNS servers. I couldn't connect to anything outside the LAN, hence no internet connection.

As I mentioned previously, all LAN connections were working. From the router I could ping my public IPv4 address (82.64.x.x) on the WAN interface, but I still couldn't reach the internet. The IP is static in the sense that my ISP always gives me the same IPv4 address: a static DHCP lease.

The WAN interface still had the 82.64.x.x IP when I executed ifup wan. The lease was renewed immediately and I kept the same 82.64.x.x IP. My internet connection was restored only after restarting the WAN interface. The modem was up the whole time and didn't require any action.

Next time it happens I will also test a few IPv6 addresses to see whether they behave the same way; I'd like to know if both the WAN and WAN6 interfaces are affected. I'll also capture the output of ip a to see all the IPs assigned on the router. I'll post back when I have updates.

Finally, just to be sure I'll contact my ISP again tomorrow to ask them if they see anything suspicious on my line or the modem itself.

Barney · May 28, 2020, 9:42pm

This disproves my suspicion. My assumption was that the modem changes the IP and the Linksys router keeps operating with the old, invalid IP.

If you always get the same IP from the modem, ifup wan doesn't change anything.

Restarting the WAN interface from LuCI is the same as running ifup wan from an SSH command line.

So far I have no further ideas. To me, your problem lies either in the modem or with your ISP.

If the problem does not occur when you run your modem in router mode, then I suspect the modem's bridge mode is the cause of your trouble.

Morgoth · May 29, 2020, 7:11pm

It happened again today, same as yesterday: no 'connection' outside the LAN. Unfortunately there was nothing new in logread or dmesg.

But I found something interesting this time: IPv6 still works when it happens. For example, I could not ping the IPv4 address of forum.openwrt.org (139.59.210.197), but pinging its IPv6 address 2a03:b0c0:3:d0::168b:9001 works.

The same goes for other destinations. For example, I can ping the AdGuard DNS IPv6 address (2a00:5a60::ad2:ff) but not its IPv4 address (176.103.130.131). Same thing with the Google and Quad9 DNS servers.

So the internet connection is clearly still up, but some traffic (IPv4) doesn't get through.

This time I didn't have time to run tcpdump. My script was still pinging and tracerouting some IPs when things started to work again, without me restarting anything. I would say it took about 10-15 minutes for the connection to recover. Next time I'll start tcpdump first and let it run in the background while the other tests are running.

Edit: when I mention tcpdump here, I mean the command suggested previously, which captures only packets going out of wan (tcpdump -i eth1.2 -ev).
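The IPv4 versus IPv6 comparison above can be run as one check per service. A minimal sketch, assuming ping and ping6 are both available (as BusyBox applets on OpenWrt builds with IPv6; worth checking on your image); compare_families is a hypothetical name and PING4/PING6 are hypothetical overrides used for testing.

```shell
# compare_families (hypothetical helper): $1 is an IPv4 address and $2
# the IPv6 address of the same service; report which family answers.
compare_families() {
    if ${PING4:-ping -c 1 -W 2} "$1" >/dev/null 2>&1; then v4=ok; else v4=FAILED; fi
    if ${PING6:-ping6 -c 1 -W 2} "$2" >/dev/null 2>&1; then v6=ok; else v6=FAILED; fi
    echo "IPv4 $1: $v4 / IPv6 $2: $v6"
    [ "$v4" = FAILED ] && [ "$v6" = ok ] && echo "Only IPv4 is failing"
    return 0
}

# Usage with the forum.openwrt.org addresses from this post:
#   compare_families 139.59.210.197 2a03:b0c0:3:d0::168b:9001
```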

trendy · May 29, 2020, 7:54pm

That might choke your router. The example I gave earlier will capture all packets going out of wan.

Morgoth · May 29, 2020, 8:36pm

Of course. I edited my last reply to clarify my 'intentions'. When I mention tcpdump, I actually use the command you suggested (tcpdump -i eth1.2 -ev). I'll also add the -w flag to record the capture so I can look at it afterwards. Hopefully the next round of data collection will be usable.

trendy · May 30, 2020, 10:04am

This might be worse, as it will fill the flash.

Morgoth · May 30, 2020, 11:04am

Nothing to worry about: I have an SSD connected to the router with about 200 GB of free space. That should be enough.
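Even with plenty of space, a capture left running for days can be bounded. A minimal sketch using tcpdump's rotating-buffer options: -C sets the file size in millions of bytes and -W the number of files, after which tcpdump starts overwriting from the first file. The mount point /mnt/sda1/captures, the defaults (20 files of 100 MB) and the function name are assumptions.

```shell
# start_ring_capture (hypothetical helper): record WAN traffic into a
# fixed-size ring of numbered pcap files in the background; prints the PID.
start_ring_capture() {
    dir="${CAPTURE_DIR:-/mnt/sda1/captures}"   # assumed SSD mount point
    mkdir -p "$dir" || return 1
    tcpdump -i "${WAN_DEV:-eth1.2}" -w "$dir/wan.pcap" \
        -C "${FILE_MB:-100}" -W "${FILE_COUNT:-20}" >/dev/null 2>&1 &
    echo $!
}

# Usage: pid=$(start_ring_capture); stop it later with kill "$pid".
```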

mattbatt · June 2, 2020, 1:26pm

I'm trying to figure out whether this is similar to my issue: openwrt.org/t/replug-ethernet-everyday/55686/2
Does unplugging the Ethernet cable from your modem and plugging it back in reconnect the network to the internet?
I am also not able to connect to my modem.
And it happens during heavy internet use, on random days with no pattern.

Morgoth · June 2, 2020, 7:53pm

I didn't try that. So far I either rebooted or restarted the network interface.

Morgoth · June 11, 2020, 5:17pm

I'm back: a new occurrence of the problem today. Hopefully the data I collected will make sense.

ip neigh show all: the default gateway for my internet access (82.64.x.x) shows as REACHABLE.
ping IPv4 addresses: they all fail.
- my ISP DNS servers
- 9.9.9.9
- 8.8.8.8
- 1.1.1.1
ping IPv6 addresses: OK
- 2a00:5a60::ad1:ff
- 2a03:b0c0:3:d0::168b:9001
I captured a tcpdump (tcpdump -i eth1.2 -ev). Unfortunately my skills here are limited and I don't know exactly what to look for (44k lines).
grep -c on a few strings that seem to appear most often:
- ACK: 37219
- TCP segment of a reassembled PDU: 11190
- Application Data: 4724
One IP appears in the tcpdump more often than any other: of the 44k lines, 31532 are for 68.65.192.73 alone (CrashPlan). Indeed, CrashPlan is running on my MacBook Pro.
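Instead of running grep -c for one string at a time, a sketch that ranks the IPv4 addresses appearing most often in a saved text dump of tcpdump output. It counts occurrences rather than lines (the same thing unless an address appears twice on one line), the IPv4 pattern is deliberately loose, and top_talkers is a hypothetical name.

```shell
# top_talkers (hypothetical helper): $1 is a text file of tcpdump output,
# $2 the number of addresses to show (default 10). Prints "count address".
top_talkers() {
    grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' "$1" |
        sort | uniq -c | sort -rn | head -n "${2:-10}" |
        awk '{ print $1, $2 }'
}

# Usage: top_talkers wan-dump.txt 5
```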
I can ping IPv6 addresses but not IPv4 ones. I think it's safe to say that my internet connection is working, but only for IPv6 traffic.
Could it be a form of network congestion, like CrashPlan "flooding" my WAN connection?
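One way to test the congestion idea is to measure how much traffic actually crosses the WAN device while the problem is happening. A minimal sketch that samples the kernel's per-interface byte counters under /sys/class/net/<device>/statistics twice and prints the average rate; eth1.2, the 5 s interval, the function name wan_rate and the STATS_ROOT test override are assumptions.

```shell
# wan_rate (hypothetical helper): $1 device (default eth1.2), $2 seconds
# (default 5). Prints average rx/tx throughput in kbit/s over the interval.
wan_rate() {
    dev="${1:-eth1.2}"; secs="${2:-5}"
    [ "$secs" -gt 0 ] || return 1
    stats="${STATS_ROOT:-/sys/class/net}/$dev/statistics"
    rx1=$(cat "$stats/rx_bytes"); tx1=$(cat "$stats/tx_bytes")
    sleep "$secs"
    rx2=$(cat "$stats/rx_bytes"); tx2=$(cat "$stats/tx_bytes")
    # bytes -> bits (* 8) -> kbit (/ 1000) -> per second (/ secs)
    echo "rx $(( (rx2 - rx1) * 8 / 1000 / secs )) kbit/s, tx $(( (tx2 - tx1) * 8 / 1000 / secs )) kbit/s"
}
```

Comparing a reading taken during an outage with one taken while everything works should show whether the uplink is actually saturated.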