Mwan3 stops working suddenly

I have a Linksys WRT1200AC (LEDE Reboot 17.01.4) running mwan3 (2.0.2-1) and sqm-scripts (1.1.3-2), and it is causing me problems.

Load balancing and failover generally work fine, but twice now the Internet connection has stopped working almost completely. Only an already established inbound PPTP forwarding stayed alive.
No outgoing connection was possible from this router. I tried pinging via the respective WAN interfaces without success.

In the end, running "mwan3 restart" fixed the issue; a "/etc/init.d/sqm restart" beforehand did not change anything.

root@fw1:~/opkg-upgrade# /etc/init.d/sqm restart
SQM: Stopping SQM on eth1.301
SQM: Starting SQM script: layer_cake.qos on eth1.301, in: 6144 Kbps, out: 6144 Kbps
SQM: layer_cake.qos was started on eth1.301 successfully
SQM: Stopping SQM on eth1.302
SQM: Starting SQM script: layer_cake.qos on eth1.302, in: 51200 Kbps, out: 30720 Kbps
SQM: layer_cake.qos was started on eth1.302 successfully
SQM: Stopping SQM on br-lan_guest
SQM: Starting SQM script: layer_cake.qos on br-lan_guest, in: 3072 Kbps, out: 6144 Kbps
SQM: layer_cake.qos was started on br-lan_guest successfully
SQM: Stopping SQM on br-lan_mobile
SQM: Starting SQM script: layer_cake.qos on br-lan_mobile, in: 6144 Kbps, out: 16384 Kbps
SQM: layer_cake.qos was started on br-lan_mobile successfully
root@fw1:~/opkg-upgrade# mwan3 status
Interface status:
 interface wan1 is offline and tracking is active
 interface wan2 is error and tracking is active

Current ipv4 policies:
balanced:
 unreachable

wan1_only:
 unreachable

wan1_wan2:
 unreachable

wan2_only:
 unreachable

wan2_wan1:
 unreachable


Current ipv6 policies:
balanced:
 unreachable

wan1_only:
 unreachable

wan1_wan2:
 unreachable

wan2_only:
 unreachable

wan2_wan1:
 unreachable


Directly connected ipv4 networks:
 224.0.0.0/3
 83.65.96.210
 10.1.0.0/16
 83.65.96.212
 192.168.10.255
 127.0.0.0/8
 127.0.0.1
 10.102.0.0/16
 192.168.12.0
 83.65.96.223
 83.65.96.213
 192.168.9.0
 127.0.0.0
 83.65.96.208
 192.168.11.255
 192.168.8.0/24
 83.65.96.211
 10.101.0.0/16
 192.168.9.1
 192.168.0.9
 192.168.9.254
 10.102.0.121
 83.65.96.208/28
 10.102.0.1
 192.168.10.0/24
 10.102.0.122
 192.168.0.254
 192.168.10.0
 10.100.0.0/16
 192.168.0.0
 192.168.10.254
 192.168.12.254
 192.168.9.0/24
 192.168.11.254
 192.168.12.255
 192.168.0.0/21
 192.168.12.0/24
 127.255.255.255
 192.168.7.255
 192.168.9.255
 192.168.11.0
 192.168.11.0/24

Directly connected ipv6 networks:
 2002:5a98:956e:4::/64
 fe80::/64
 2002:5a98:956e::/64

Active ipv4 user rules:
  698 48860 - wan1_wan2  udp  --  *      *       0.0.0.0/0            0.0.0.0/0            multiport sports 0:65535 multiport dports 1190:1209
 3185  161K - wan1_wan2  tcp  --  *      *       192.168.0.25         0.0.0.0/0            multiport sports 0:65535 multiport dports 25,465,587
    0     0 - wan1_wan2  tcp  --  *      *       192.168.0.26         0.0.0.0/0            multiport sports 0:65535 multiport dports 25,465,587
79119 9813K - wan1_only  all  --  *      *       0.0.0.0/0            195.58.160.194
79062 9808K - wan1_only  all  --  *      *       0.0.0.0/0            195.58.161.122
80921   10M - wan1_wan2  all  --  *      *       0.0.0.0/0            8.8.8.8
 130K 8056K - wan2_wan1  all  --  *      *       0.0.0.0/0            0.0.0.0/0

Active ipv6 user rules:
    0     0 - wan1_wan2  udp      *      *       ::/0                 ::/0                 multiport sports 0:65535 multiport dports 1190:1209
 1253  204K - wan2_wan1  all      *      *       ::/0                 ::/0
root@fw1:~/opkg-upgrade# mwan3 restart
SQM: Stopping SQM on eth1.301
SQM: Starting SQM script: layer_cake.qos on eth1.301, in: 6144 Kbps, out: 6144 Kbps
SQM: layer_cake.qos was started on eth1.301 successfully
SQM: Stopping SQM on eth1.302
SQM: Starting SQM script: layer_cake.qos on eth1.302, in: 51200 Kbps, out: 30720 Kbps
SQM: layer_cake.qos was started on eth1.302 successfully

eth1.301 = WAN1 (public IP on this IF)
eth1.302 = WAN2 (private IP on this IF with NAT to another OpenWRT router connected to LTE modem)

The same thing happened two weeks ago; that time we rebooted to get out of the situation.
Since then everything was fine, but it happened again last night.

Do you have any idea where this problem could come from, or how I can locate its source?

In the meantime I created a script as a workaround; it restarts mwan3 when the Internet connection is lost.

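#!/bin/sh

# Purpose:
# Restarts the mwan3 service if there is no Internet connection available.
# Up to 5 single pings to 8.8.8.8, each with a 5 s deadline, check connectivity.

tries=0
# POSIX test ([ ]) instead of [[ ]], which plain /bin/sh does not guarantee
while [ "$tries" -lt 5 ]
do
        # One successful reply is enough: the connection is up, nothing to do
        if /bin/ping -w 5 -c 1 8.8.8.8 >/dev/null
        then
                exit 0
        fi
        tries=$((tries+1))
done

# All 5 attempts failed
mwan3 restart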
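
A possible refinement, as a minimal sketch only: instead of relying solely on ping, the watchdog could also look at what mwan3 itself reports. The helper names count_online and all_down are hypothetical, and the "is online" wording is an assumption mirroring the "is offline" / "is error" lines in the status output above.

```shell
#!/bin/sh
# Sketch: count interfaces that mwan3 itself reports as online.
# Assumption: a healthy interface is printed as
# "interface <name> is online ...", by analogy with the
# "is offline" / "is error" lines shown in the status output.

# Reads "mwan3 status" text on stdin, prints the number of online interfaces.
count_online() {
        # grep -c prints 0 but exits non-zero when nothing matches,
        # so "|| true" keeps the function's exit status at 0.
        grep -c 'interface .* is online' || true
}

# Reads "mwan3 status" text on stdin; succeeds when no interface is online.
all_down() {
        [ "$(count_online)" -eq 0 ]
}
```

It could be used as "mwan3 status | all_down && mwan3 restart", either in place of or in addition to the ping loop.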

Hi, that is a very strange issue. Have you tried listing your routing table when the problem appears,
and checking whether any mwan process is still running?
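
Since the problem is rare, it may help to capture that state automatically before any restart. The following is only a rough sketch: the function name snapshot_state and the log path are made up for illustration, and it assumes the usual ip, iptables and pgrep tools are installed on the router.

```shell
#!/bin/sh
# Sketch: append a snapshot of routing and mwan3 state to a log file.
# The function name and the default log path are illustrative only;
# they are not part of mwan3.

snapshot_state() {
        out="${1:-/tmp/mwan3-debug.log}"
        {
                echo "=== snapshot $(date) ==="
                # Each command is run only if it exists on this system.
                for cmd in "mwan3 status" "ip rule show" "ip route show" \
                           "iptables -t mangle -S" "pgrep -fl mwan3track"; do
                        echo "--- $cmd ---"
                        if command -v "${cmd%% *}" >/dev/null 2>&1; then
                                # Unquoted on purpose: split into command and arguments
                                $cmd 2>&1 || echo "(exit status $?)"
                        else
                                echo "(not installed)"
                        fi
                done
        } >>"$out"
}
```

Calling snapshot_state right before "mwan3 restart" in the workaround script would leave a record of the policy rules, routes and running tracker processes at the moment of failure.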

This was the routing table when it happened, before I issued the mwan3 restart:

root@fw1:~# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         lte1.mgmt.ctb.c 0.0.0.0         UG    20     0        0 eth1.302
default         83-65-96-209.st 0.0.0.0         UG    30     0        0 eth1.301
10.1.0.0        10.102.0.121    255.255.0.0     UG    0      0        0 tun1
10.100.0.0      10.102.0.121    255.255.0.0     UG    0      0        0 tun1
10.101.0.0      10.102.0.121    255.255.0.0     UG    0      0        0 tun1
10.102.0.0      10.102.0.121    255.255.0.0     UG    0      0        0 tun1
10.102.0.1      10.102.0.121    255.255.255.255 UGH   0      0        0 tun1
10.102.0.121    *               255.255.255.255 UH    0      0        0 tun1
83.65.96.208    *               255.255.255.240 U     30     0        0 eth1.301
192.168.0.0     *               255.255.248.0   U     0      0        0 br-lan_ctb
192.168.8.0     lte1.mgmt.ctb.c 255.255.255.0   UG    20     0        0 eth1.302
192.168.9.0     *               255.255.255.0   U     20     0        0 eth1.302
192.168.9.1     *               255.255.255.255 UH    20     0        0 eth1.302
192.168.10.0    *               255.255.255.0   U     0      0        0 br-lan_mgmt
192.168.11.0    *               255.255.255.0   U     0      0        0 br-lan_guest
192.168.12.0    *               255.255.255.0   U     0      0        0 br-lan_mobile

Unfortunately I did not check whether the mwan process was running, but "mwan3 status" was working at that moment. Does this perhaps imply that the mwan process was still running?

Is that tun1 an OpenVPN link? Try searching for mwan3+openvpn; there are some known problems with OpenVPN, and I think I've seen shadowsocks reports too.

https://github.com/openwrt/packages/search?q=mwan3+openvpn&type=Issues

Yes it is an OpenVPN tun adapter.
I have "LEDE Reboot 17.01.4" running on this router.

Have you tried 17.01.5 or 18.06.0?

No, I did not.