OK, here is a dump from the 10 minutes test that I ran.
All the timestamps in the log and the switch output are 100% correct and syncronized, since the switch is also configured to send the logs to the router's syslog-ng.
1) I did a small modification of mwan3track in order to keep a track of the Packet loss towards each mwan3 trackable targets.
[root@RTR-TEST:~]# diff /rom/usr/sbin/mwan3track /usr/sbin/mwan3track
--- /rom/usr/sbin/mwan3track
+++ /usr/sbin/mwan3track
@@ -13,15 +13,23 @@
track_ips=$(echo $* | cut -d ' ' -f 9-99)
host_up_count=0
lost=0
+PLOSS_DEFAULT="100" #Default Packet loss rate in %
+PLOSS_THRESHOLD="5" #Packet loss rate in %, that triggers host condition change (up|down)
while true; do
for track_ip in $track_ips; do
- ping -I $2 -c $4 -W $5 -s 4 -q $track_ip &> /dev/null
- if [ $? -eq 0 ]; then
+ #ping -I $2 -c $4 -W $5 -s 4 -q $track_ip &> /dev/null
+ PING_RESULT=$($(which ping) -I $2 -c $4 -W $5 -s 64 -q $track_ip | grep "packet loss") &> /dev/null
+ PLOSS=$(echo "${PING_RESULT}" | sed -e 's/^.*received, //' -e 's/^.*errors, //' -e 's/% packet loss.*//')
+ if [ -z "${PLOSS}" ]; then PLOSS=${PLOSS_DEFAULT}; fi
+
+ #if [ $? -eq 0 ]; then
+ if [ "${PLOSS}" -le "${PLOSS_THRESHOLD}" ]; then
let host_up_count++
else
let lost++
+ logger -t mwan3track -p notice "Tracked host issue detected on Interface $1 ($2): $track_ip -> ${PLOSS}% packet loss"
fi
done
[root@RTR-TEST:~]#
2) The mwan3 interface configuration
[root@RTR-TEST:~]# cat /etc/config/mwan3
config interface 'wan1'
option enabled '1'
list track_ip '46.40.125.1'
list track_ip '212.73.140.119'
list track_ip '8.8.8.8'
list track_ip '208.67.222.222'
list track_ip '208.67.220.220'
option reliability '3'
option count '5'
option timeout '2'
option interval '20'
option down '1'
option up '3'
config interface 'wan2'
option enabled '1'
list track_ip '46.249.80.1'
list track_ip '87.120.130.66'
list track_ip '8.8.8.8'
list track_ip '208.67.222.222'
list track_ip '208.67.220.220'
option reliability '3'
option count '5'
option timeout '2'
option interval '20'
option down '1'
option up '3'
3) Test started - the switchport facing the ISP1 is administratively disabled (switch logging can be seen in the router's log as mentioned).
sw1#
sw1#conf t
Enter configuration commands, one per line. End with CNTL/Z.
sw1(config)#int fa 0/1
sw1(config-if)#shut
sw1(config-if)#^Z
sw1#sh int desc | i WAN
Fa0/1 admin down down WAN1_/24Mbps/
Fa0/2 up up WAN2_/45Mbps/
sw1#sh clock
20:02:17.252 EEST Mon May 11 2015
sw1#
sw1#
sw1#conf t
Enter configuration commands, one per line. End with CNTL/Z.
sw1(config)#int fa 0/1
sw1(config-if)#no shut
sw1(config-if)#^Z
sw1#sh int desc | i WAN
Fa0/1 up up WAN1_/24Mbps/
Fa0/2 up up WAN2_/45Mbps/
sw1#sh clock
20:08:58.666 EEST Mon May 11 2015
sw1#
4) After WAN1 interruption you can see that there are several trackable hosts that seems pingable via the disables link (wan1), but the ping failes via the operational link (wan2).
It even happened 2 times for those 10 minutes that the wan1 was detected to go UP and down again after a while. Through the whole period the switch port handling the link to ISP1 (t.e. wan1) was down, so no ping should ever pass via wan1(eth0.2).
There are several failed pings via wan2 (that fully operational at this moment) on a random bases (I suppose)
May 11 20:02:13 sw1 2001: May 11 20:02:12 EEST: %SYS-5-CONFIG_I: Configured from console by dir2cas on vty0
May 11 20:02:14 sw1 2002: May 11 20:02:13 EEST: %LINK-5-CHANGED: Interface FastEthernet0/1, changed state to administratively down
May 11 20:02:15 sw1 2003: May 11 20:02:14 EEST: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/1, changed state to down
May 11 20:02:20 RTR-TEST mwan3track: Tracked host issue detected on Interface wan1 (eth0.2): 208.67.220.220 -> 100% packet loss
May 11 20:02:32 RTR-TEST mwan3track: Tracked host issue detected on Interface wan2 (eth0.3): 208.67.220.220 -> 100% packet loss
May 11 20:02:38 RTR-TEST mwan3track: Tracked host issue detected on Interface wan1 (eth0.2): 46.40.125.1 -> 100% packet loss
May 11 20:02:42 RTR-TEST mwan3track: Tracked host issue detected on Interface wan2 (eth0.3): 8.8.8.8 -> 100% packet loss
May 11 20:03:08 RTR-TEST mwan3track: Tracked host issue detected on Interface wan1 (eth0.2): 208.67.222.222 -> 100% packet loss
May 11 20:03:14 RTR-TEST mwan3track: Tracked host issue detected on Interface wan1 (eth0.2): 8.8.8.8 -> 100% packet loss
May 11 20:03:20 RTR-TEST mwan3track: Tracked host issue detected on Interface wan2 (eth0.3): 208.67.222.222 -> 100% packet loss
May 11 20:03:24 RTR-TEST mwan3track: Tracked host issue detected on Interface wan1 (eth0.2): 46.40.125.1 -> 100% packet loss
May 11 20:03:24 RTR-TEST mwan3track: Interface wan1 (eth0.2) is offline
May 11 20:03:24 RTR-TEST mwan3: ifdown interface wan1 (eth0.2)
May 11 20:03:26 RTR-TEST mwan3track: Tracked host issue detected on Interface wan2 (eth0.3): 8.8.8.8 -> 100% packet loss
May 11 20:03:28 RTR-TEST root: stopping ntpclient
May 11 20:04:10 RTR-TEST mwan3track: Tracked host issue detected on Interface wan1 (eth0.2): 46.40.125.1 -> 100% packet loss
May 11 20:04:10 RTR-TEST mwan3track: Lost 20 ping(s) on interface wan1 (eth0.2)
May 11 20:04:52 RTR-TEST mwan3track: Tracked host issue detected on Interface wan1 (eth0.2): 46.40.125.1 -> 100% packet loss
May 11 20:04:52 RTR-TEST mwan3track: Lost 5 ping(s) on interface wan1 (eth0.2)
May 11 20:05:34 RTR-TEST mwan3track: Tracked host issue detected on Interface wan1 (eth0.2): 46.40.125.1 -> 100% packet loss
May 11 20:05:34 RTR-TEST mwan3track: Lost 5 ping(s) on interface wan1 (eth0.2)
May 11 20:05:35 RTR-TEST mwan3track: Interface wan1 (eth0.2) is online
May 11 20:05:35 RTR-TEST mwan3: ifup interface wan1 (eth0.2)
May 11 20:05:38 RTR-TEST firewall: Reloading firewall due to ifup of wan1 (eth0.2)
May 11 20:05:41 RTR-TEST mwan3track: Tracked host issue detected on Interface wan1 (eth0.2): 208.67.220.220 -> 100% packet loss
May 11 20:05:52 RTR-TEST mwan3track: Tracked host issue detected on Interface wan1 (eth0.2): 212.73.140.119 -> 100% packet loss
May 11 20:05:55 RTR-TEST mwan3track: Tracked host issue detected on Interface wan1 (eth0.2): 46.40.125.1 -> 100% packet loss
May 11 20:05:56 RTR-TEST mwan3track: Interface wan1 (eth0.2) is offline
May 11 20:06:01 RTR-TEST mwan3track: Tracked host issue detected on Interface wan2 (eth0.3): 208.67.222.222 -> 100% packet loss
May 11 20:06:11 RTR-TEST mwan3track: Tracked host issue detected on Interface wan2 (eth0.3): 87.120.130.66 -> 100% packet loss
May 11 20:06:41 RTR-TEST mwan3: ifdown interface wan1 (eth0.2)
May 11 20:06:45 RTR-TEST mwan3track: Tracked host issue detected on Interface wan2 (eth0.3): 208.67.222.222 -> 60% packet loss
May 11 20:07:26 RTR-TEST mwan3track: Tracked host issue detected on Interface wan1 (eth0.2): 46.40.125.1 -> 100% packet loss
May 11 20:07:26 RTR-TEST mwan3track: Lost 20 ping(s) on interface wan1 (eth0.2)
May 11 20:08:09 RTR-TEST mwan3track: Tracked host issue detected on Interface wan1 (eth0.2): 46.40.125.1 -> 100% packet loss
May 11 20:08:09 RTR-TEST mwan3track: Lost 5 ping(s) on interface wan1 (eth0.2)
May 11 20:08:51 RTR-TEST mwan3track: Tracked host issue detected on Interface wan1 (eth0.2): 46.40.125.1 -> 100% packet loss
May 11 20:08:51 RTR-TEST mwan3track: Lost 5 ping(s) on interface wan1 (eth0.2)
May 11 20:08:51 RTR-TEST mwan3track: Interface wan1 (eth0.2) is online
May 11 20:08:51 RTR-TEST mwan3: ifup interface wan1 (eth0.2)
May 11 20:08:55 sw1 2004: May 11 20:08:54 EEST: %SYS-5-CONFIG_I: Configured from console by dir2cas on vty0
May 11 20:08:55 RTR-TEST firewall: Reloading firewall due to ifup of wan1 (eth0.2)
May 11 20:08:56 sw1 2005: May 11 20:08:55 EEST: %LINK-3-UPDOWN: Interface FastEthernet0/1, changed state to up
May 11 20:08:56 sw1 2006: May 11 20:08:56 EEST: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/1, changed state to up
May 11 20:08:58 RTR-TEST mwan3track: Tracked host issue detected on Interface wan1 (eth0.2): 208.67.220.220 -> 100% packet loss
May 11 20:09:08 RTR-TEST mwan3track: Tracked host issue detected on Interface wan1 (eth0.2): 8.8.8.8 -> 100% packet loss
May 11 20:09:17 RTR-TEST mwan3track: Tracked host issue detected on Interface wan1 (eth0.2): 46.40.125.1 -> 100% packet loss
May 11 20:09:17 RTR-TEST mwan3track: Interface wan1 (eth0.2) is offline
May 11 20:09:25 RTR-TEST root: starting ntpclient
May 11 20:09:27 RTR-TEST mwan3: ifdown interface wan1 (eth0.2)
May 11 20:09:30 RTR-TEST root: stopping ntpclient
May 11 20:10:11 RTR-TEST mwan3track: Lost 15 ping(s) on interface wan1 (eth0.2)
May 11 20:11:31 RTR-TEST mwan3track: Interface wan1 (eth0.2) is online
May 11 20:11:32 RTR-TEST mwan3: ifup interface wan1 (eth0.2)
May 11 20:11:35 RTR-TEST firewall: Reloading firewall due to ifup of wan1 (eth0.2)
May 11 20:11:36 RTR-TEST root: starting ntpclient
5) You can see that even after the switch port towards ISP1 was brought up which should lead to a fully operational wan1, mwan3 detects a link fault and after more than 2 minutes restores the link again as it should.
As a conclusion, I am currently not able to identify the root cause. It is related either with the conntrack package/mechanism or with mwan3 marking rules.
Any ideas would be fairly helful.
BR, dir2cas