PPPoE disconnected and won't re-connect successfully unless backup/restore

My router often disconnects suddenly and won't auto reconnect anymore. This has happened a few times. I tried to find out the pattern, but it seems to be random.

Major error I saw is pppoe waiting for PADO packet time out.

Sat Jan 22 11:46:11 2022 daemon.warn pppd[16890]: Timeout waiting for PADO packets
Sat Jan 22 11:46:11 2022 daemon.err pppd[16890]: Unable to complete PPPoE Discovery

The only workaround is to backup/restore the configs. This also makes me think it's not really the ppp issue. As I always backup when having this issue and restore immediately. It worked normally after the restore.

I have tried the following but didn't work:

  • Power cycle the router
  • Power cycle the modem
  • Restart network by /etc/init.d/network restart
  • Clone MAC or not makes no difference when having the issue
  • Also tried to tune the LCP threshold/interval

Any clue why? Or are there any other logs or ways to figure out why?

Information:
Device: Netgear Nighthawk X4S R7800
Version: 21.02 (compiled from this source)
Syslogs:

Sat Jan 22 11:45:56 2022 daemon.info pppd[16890]: Plugin pppoe.so loaded.
Sat Jan 22 11:45:56 2022 daemon.info pppd[16890]: PPPoE plugin from pppd 2.4.9
Sat Jan 22 11:45:56 2022 daemon.notice pppd[16890]: pppd 2.4.9 started by root, uid 0
Sat Jan 22 11:46:08 2022 daemon.info dnsmasq[22498]: read /etc/hosts - 4 addresses
Sat Jan 22 11:46:08 2022 daemon.info dnsmasq[22498]: read /tmp/hosts/dhcp.cfg01411c - 9 addresses
Sat Jan 22 11:46:08 2022 daemon.info dnsmasq[22498]: read /tmp/hosts/odhcpd - 4 addresses
Sat Jan 22 11:46:08 2022 daemon.info dnsmasq-dhcp[22498]: read /etc/ethers - 0 addresses
Sat Jan 22 11:46:11 2022 daemon.warn pppd[16890]: Timeout waiting for PADO packets
Sat Jan 22 11:46:11 2022 daemon.err pppd[16890]: Unable to complete PPPoE Discovery
Sat Jan 22 11:46:11 2022 daemon.info pppd[16890]: Exit.
Sat Jan 22 11:46:11 2022 daemon.notice netifd: Interface 'wan' is now down
Sat Jan 22 11:46:11 2022 daemon.notice netifd: Interface 'wan' is disabled
Sat Jan 22 11:46:11 2022 kern.info kernel: [221404.084413] ipq806x-gmac-dwmac 37200000.ethernet eth0: Link is Down
Sat Jan 22 11:46:11 2022 kern.info kernel: [221404.086734] dwmac1000: Master AXI performs any burst length
Sat Jan 22 11:46:11 2022 kern.info kernel: [221404.089837] ipq806x-gmac-dwmac 37200000.ethernet eth0: No Safety Features support found
Sat Jan 22 11:46:11 2022 kern.info kernel: [221404.106661] ipq806x-gmac-dwmac 37200000.ethernet eth0: IEEE 1588-2008 Advanced Timestamp supported
Sat Jan 22 11:46:11 2022 kern.info kernel: [221404.106782] ipq806x-gmac-dwmac 37200000.ethernet eth0: registered PTP clock
Sat Jan 22 11:46:11 2022 kern.info kernel: [221404.116156] ipq806x-gmac-dwmac 37200000.ethernet eth0: configuring for fixed/rgmii link mode
Sat Jan 22 11:46:11 2022 daemon.notice netifd: Interface 'wan' is enabled
Sat Jan 22 11:46:11 2022 daemon.notice netifd: Interface 'wan' is setting up now
Sat Jan 22 11:46:11 2022 kern.info kernel: [221404.123288] ipq806x-gmac-dwmac 37200000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off
Sat Jan 22 11:46:11 2022 kern.info kernel: [221404.139286] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Sat Jan 22 11:46:11 2022 kern.info kernel: [221404.139529] IPv6: ADDRCONF(NETDEV_CHANGE): eth0.2: link becomes ready
Sat Jan 22 11:46:11 2022 daemon.err insmod: module is already loaded - slhc
Sat Jan 22 11:46:11 2022 daemon.err insmod: module is already loaded - ppp_generic
Sat Jan 22 11:46:11 2022 daemon.err insmod: module is already loaded - pppox
Sat Jan 22 11:46:11 2022 daemon.err insmod: module is already loaded - pppoe
Sat Jan 22 11:46:11 2022 daemon.notice netifd: wan (17120): uci: Entry not found
Sat Jan 22 11:46:11 2022 daemon.info pppd[17165]: Plugin pppoe.so loaded.
Sat Jan 22 11:46:11 2022 daemon.info pppd[17165]: PPPoE plugin from pppd 2.4.9
Sat Jan 22 11:46:11 2022 daemon.notice pppd[17165]: pppd 2.4.9 started by root, uid 0
Sat Jan 22 11:46:26 2022 daemon.warn pppd[17165]: Timeout waiting for PADO packets
Sat Jan 22 11:46:26 2022 daemon.err pppd[17165]: Unable to complete PPPoE Discovery
Sat Jan 22 11:46:26 2022 daemon.info pppd[17165]: Exit.
Sat Jan 22 11:46:27 2022 daemon.notice netifd: Interface 'wan' is now down
Sat Jan 22 11:46:27 2022 daemon.notice netifd: Interface 'wan' is disabled
Sat Jan 22 11:46:27 2022 kern.info kernel: [221419.494363] ipq806x-gmac-dwmac 37200000.ethernet eth0: Link is Down
Sat Jan 22 11:46:27 2022 kern.info kernel: [221419.496664] dwmac1000: Master AXI performs any burst length
Sat Jan 22 11:46:27 2022 kern.info kernel: [221419.499802] ipq806x-gmac-dwmac 37200000.ethernet eth0: No Safety Features support found
Sat Jan 22 11:46:27 2022 kern.info kernel: [221419.516608] ipq806x-gmac-dwmac 37200000.ethernet eth0: IEEE 1588-2008 Advanced Timestamp supported
Sat Jan 22 11:46:27 2022 kern.info kernel: [221419.516765] ipq806x-gmac-dwmac 37200000.ethernet eth0: registered PTP clock
Sat Jan 22 11:46:27 2022 kern.info kernel: [221419.524548] ipq806x-gmac-dwmac 37200000.ethernet eth0: configuring for fixed/rgmii link mode
Sat Jan 22 11:46:27 2022 kern.info kernel: [221419.535965] ipq806x-gmac-dwmac 37200000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off
Sat Jan 22 11:46:27 2022 kern.info kernel: [221419.540566] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Sat Jan 22 11:46:27 2022 daemon.notice netifd: Interface 'wan' is enabled
Sat Jan 22 11:46:27 2022 daemon.notice netifd: Interface 'wan' is setting up now
Sat Jan 22 11:46:27 2022 daemon.err insmod: module is already loaded - slhc
Sat Jan 22 11:46:27 2022 daemon.err insmod: module is already loaded - ppp_generic
Sat Jan 22 11:46:27 2022 daemon.err insmod: module is already loaded - pppox
Sat Jan 22 11:46:27 2022 daemon.err insmod: module is already loaded - pppoe
Sat Jan 22 11:46:27 2022 daemon.notice netifd: wan (17483): uci: Entry not found

Try the official stable version, or you should seek advice from the maintainer. It is hard to troubleshoot custom versions.

I have switched to an official stable version: OpenWrt 21.02.3 r16554-1d4dea6d4f
The random down problem is still there. I enabled debug log for ppp, there isn't any log from ppp when the issue happens. So I think it's probably not the ppp related issue. When the internet went down, I couldn't ping my ISP gateway, or any other WAN IP. LAN IP works. Found nothing special in logs, any suggestion how to debug?

BTW, reboot router or /etc/init.d/network restart will also recover the internet.

Have you tried the keepalive option?

https://openwrt.org/docs/guide-user/network/wan/wan_interface_protocols#protocol_pppoe_ppp_over_ethernet

1 Like

Thanks. Yes, I did. I have tried a couple of different settings, such as always alive, or very big numbers. I'm currently setting 15000, 3600. Doesn't seem to be helpful. And I think if this is the issue, I should see LCP message in the ppp debug log right? In fact I didn't.
Initially I thought it's about ppp issues because I don't have the internet and I couldn't ping ISP gateway successfully. But now I'm questioning if it's related ppp or some other issues.

If you don't see anything useful in the logs, you should probably ask your ISP for assistance.

BTW does restarting only the wan interface (ifup wan) help?
As a workaround you could try watchcat.
There are options to restart an interface or reboot the device if a ping to a specific host fails.

1 Like

Today I had the problem again. I had a deep dive and found that it was because the CPU port went down unexpectedly. I believe it's the port 6 for my R7800. And I have tried which exact command fixes the issue. It turned out to be:

swconfig dev switch0 load network

However, I wasn't able to figure out why. There are kernel logs such as: (i remember there is also message about invalid peer id but unfortunately it got flushed)
EDITED: seems like ath10k is the wifi driver so I don't think it's related?

[12286.122543] ath10k_pci 0001:01:00.0: wmi: fixing invalid VHT TX rate code 0xff
[13951.610960] ath10k_pci 0001:01:00.0: htt tx: fixing invalid VHT TX rate code 0xff

Here is the switch status output when the issue happens:

root@Openwrt:~# swconfig dev switch0 show
Global attributes:
        enable_vlan: 1
        ar8xxx_mib_poll_interval: 0
        ar8xxx_mib_type: 0
        enable_mirror_rx: 0
        enable_mirror_tx: 0
        mirror_monitor_port: 0
        mirror_source_port: 0
        arl_age_time: 300
        arl_table: address resolution table
Port 0: MAC [replaced_MAC_port0]
Port 1: MAC [replaced_MAC_port1]
Port 4: MAC [replaced_MAC_port4]
Port 5: MAC [replaced_MAC_port5_1]
Port 5: MAC [replaced_MAC_port5_2]

        igmp_snooping: 0
        igmp_v3: 0
Port 0:
        mib: ???
        enable_eee: ???
        igmp_snooping: 0
        vlan_prio: 0
        pvid: 0
        link: port:0 link:up speed:1000baseT full-duplex
Port 1:
        mib: ???
        enable_eee: 0
        igmp_snooping: 0
        vlan_prio: 0
        pvid: 1
        link: port:1 link:up speed:1000baseT full-duplex txflow rxflow auto
Port 2:
        mib: ???
        enable_eee: 0
        igmp_snooping: 0
        vlan_prio: 0
        pvid: 1
        link: port:2 link:down
Port 3:
        mib: ???
        enable_eee: 0
        igmp_snooping: 0
        vlan_prio: 0
        pvid: 1
        link: port:3 link:down
Port 4:
        mib: ???
        enable_eee: 0
        igmp_snooping: 0
        vlan_prio: 0
        pvid: 1
        link: port:4 link:up speed:1000baseT full-duplex txflow rxflow auto
Port 5:
        mib: ???
        enable_eee: 0
        igmp_snooping: 0
        vlan_prio: 0
        pvid: 2
        link: port:5 link:up speed:1000baseT full-duplex txflow rxflow auto
Port 6:
        mib: ???
        enable_eee: ???
        igmp_snooping: 0
        vlan_prio: 0
        pvid: 0
        link: port:6 link:down
VLAN 1:
        vid: 1
        ports: 1 2 3 4 6t
VLAN 2:
        vid: 2
        ports: 0t 5

Port 6 is a CPU, however it seems to be connected to VLAN1 which is your lan.

1 Like

Yes, I think the VLAN settings are correct. And I'm keeping it as default. The problem is that the link went down suddenly. And the only way I can recover it is the network load from swconfig command

Here is the output when everything works fine:

Port 6:
        mib: ???
        enable_eee: ???
        igmp_snooping: 0
        vlan_prio: 0
        pvid: 0
        link: port:6 link:up speed:1000baseT full-duplex

It would make more sense if you lose access to the device than having problems with your internet connection, but who knows...

The other port connected to the CPU (Port 0) looks fine and you should be able to use it for the connection to vlan1 (without using Port 6 at all).

If you want to confirm this, here is a cron job that will check the status of Port 6 every minute and run swconfig dev switch0 load network if the port goes down. You can check how often this happens by running logread -e port6

cat << "EOF" >> /etc/crontabs/root
* * * * * if swconfig dev switch0 port 6 show | grep -q link\:down; then swconfig dev switch0 load network; logger -t port6 Port 6 down; fi
EOF
uci set system.@system[0].cronloglevel="9"
uci commit system
/etc/init.d/cron restart
2 Likes