"Timeout waiting for PADO packets" until reboot

I have a TPLink TL-WDR3600 router. It connects via Ethernet to a VDSL2 modem which connects to a residential phone line. The ISP is Plusnet. The WAN link is via PPPoE, it's configured correctly, it always connects first time. I've had this setup since 2016.

The WAN link can timeout if I max out the connection for a sustained period (I can't do this reliably), or if the ISP side goes silent on me. I've already added config option keepalive '5 5' to the WAN interface (and option pppd_options debug)

When the PPPoE connection time outs, pppd sends PADT and hangs up. When pppd is restarted, it always alerts "Timeout waiting for PADO packets". It keeps doing this until you reboot the router. After rebooting, it always connects on first attempt.

If the WAN link goes down due to timeout, no amount of ifdown wan && ifup wan will re-establish the WAN link until reboot. All attempts fail with "Timeout waiting for PADO packets". Waiting 5, 10 or even 30 minutes between ifdown wan and ifup wan does not fix the problem. You still get a PADO timeout! But the moment you reboot, just seconds later, it works immediately.

In normal circumstances (i.e. there has not been a connection timeout), you can run ifdown wan && ifup wan and it re-establishes the WAN link without any PADO packet timeouts

I've also tried rmmoding the PPPoE kernel modules (pppoe pppox ppp_async ppp_generic slhc) between ifdown and ipup, in case the problem is some kind of lingering state held in these modules, but that does not fix the problem either. If there's a connection timeout, you can ifdown wan, you can rmmod all these modules, you can wait 30 minutes - you will still get a PADO timeout when you ifup wan, and you will still get a connection seconds later once you reboot.

There is a bug filed for this by someone else, but it's marked "unconfirmed" and "very low priority".

How can I diagnose this bug further?

I'd like to get to the root cause of this problem, and if it's in the kernel, get it fixed.

Workaround

I wrote a script to reboot immediately if the error appears:

$ cat /etc/hotplug.d/iface/00-reboot-on-pado-timeout
# if router has been up for at least 5 minutes...
[ `sed 's/[. ].*//' /proc/uptime` -gt 300 ] && {
    # reboot if any "Timeout waiting for PADO packets" log messages seen
    [ `/sbin/logread -e 'Timeout waiting for PADO packets' | wc -l` -gt 1 ] && /sbin/reboot

}

Sample logs

Example of WAN link failure due to timeout, failure to re-establish.

Fri Feb 19 09:03:08 2021 daemon.info pppd[1444]: No response to 5 echo-requests
Fri Feb 19 09:03:08 2021 daemon.notice pppd[1444]: Serial link appears to be disconnected.
Fri Feb 19 09:03:08 2021 daemon.info pppd[1444]: Connect time 239.4 minutes.
Fri Feb 19 09:03:08 2021 daemon.info pppd[1444]: Sent 2985390766 bytes, received 729613003 bytes.
Fri Feb 19 09:03:08 2021 daemon.notice netifd: Network device 'pppoe-wan' link is down
Fri Feb 19 09:03:08 2021 daemon.notice netifd: Network alias 'pppoe-wan' link is down
Fri Feb 19 09:03:08 2021 daemon.debug pppd[1444]: Script /lib/netifd/ppp-down started (pid 3501)
Fri Feb 19 09:03:08 2021 daemon.debug pppd[1444]: sent [LCP TermReq id=0x2 "Peer not responding"]
Fri Feb 19 09:03:08 2021 daemon.notice netifd: Interface 'wan' has lost the connection
Fri Feb 19 09:03:08 2021 daemon.debug pppd[1444]: Script /lib/netifd/ppp-down finished (pid 3501), status = 0x1
Fri Feb 19 09:03:11 2021 daemon.debug pppd[1444]: sent [LCP TermReq id=0x3 "Peer not responding"]
Fri Feb 19 09:03:14 2021 daemon.notice pppd[1444]: Connection terminated.
Fri Feb 19 09:03:14 2021 daemon.debug pppd[1444]: Send PPPOE Discovery V1T1 PADT session 0xff5 length 28
Fri Feb 19 09:03:14 2021 daemon.debug pppd[1444]:  dst f8:13:08:40:d3:d5  src c0:4a:00:2c:d9:fa
Fri Feb 19 09:03:14 2021 daemon.debug pppd[1444]:  [host-uniq  00 00 05 a4] [AC-cookie  dc 8a 2d 68 eb 52 65 0b 7a dd 09 12 42 a5 97 32]
Fri Feb 19 09:03:14 2021 daemon.info pppd[1444]: Sent PADT
Fri Feb 19 09:03:14 2021 daemon.notice pppd[1444]: Modem hangup
Fri Feb 19 09:03:14 2021 daemon.info pppd[1444]: Exit.
Fri Feb 19 09:03:14 2021 daemon.notice netifd: Interface 'wan' is now down
Fri Feb 19 09:03:14 2021 daemon.notice netifd: Interface 'wan' is disabled
Fri Feb 19 09:03:14 2021 daemon.notice netifd: Interface 'wan' is enabled
Fri Feb 19 09:03:14 2021 daemon.notice netifd: Interface 'wan' is setting up now
Fri Feb 19 09:03:14 2021 daemon.err insmod: module is already loaded - slhc
Fri Feb 19 09:03:14 2021 daemon.err insmod: module is already loaded - ppp_generic
Fri Feb 19 09:03:14 2021 daemon.err insmod: module is already loaded - pppox
Fri Feb 19 09:03:14 2021 daemon.err insmod: module is already loaded - pppoe
Fri Feb 19 09:03:14 2021 daemon.info pppd[3602]: Plugin rp-pppoe.so loaded.
Fri Feb 19 09:03:14 2021 daemon.info pppd[3602]: RP-PPPoE plugin version 3.8p compiled against pppd 2.4.7
Fri Feb 19 09:03:14 2021 daemon.notice pppd[3602]: pppd 2.4.7 started by root, uid 0
Fri Feb 19 09:03:14 2021 daemon.debug pppd[3602]: Send PPPOE Discovery V1T1 PADI session 0x0 length 12
Fri Feb 19 09:03:14 2021 daemon.debug pppd[3602]:  dst ff:ff:ff:ff:ff:ff  src c0:4a:00:2c:d9:fa
Fri Feb 19 09:03:14 2021 daemon.debug pppd[3602]:  [service-name] [host-uniq  00 00 0e 12]
Fri Feb 19 09:03:19 2021 daemon.debug pppd[3602]: Send PPPOE Discovery V1T1 PADI session 0x0 length 12
Fri Feb 19 09:03:19 2021 daemon.debug pppd[3602]:  dst ff:ff:ff:ff:ff:ff  src c0:4a:00:2c:d9:fa
Fri Feb 19 09:03:19 2021 daemon.debug pppd[3602]:  [service-name] [host-uniq  00 00 0e 12]
Fri Feb 19 09:03:24 2021 daemon.debug pppd[3602]: Send PPPOE Discovery V1T1 PADI session 0x0 length 12
Fri Feb 19 09:03:24 2021 daemon.debug pppd[3602]:  dst ff:ff:ff:ff:ff:ff  src c0:4a:00:2c:d9:fa
Fri Feb 19 09:03:24 2021 daemon.debug pppd[3602]:  [service-name] [host-uniq  00 00 0e 12]
Fri Feb 19 09:03:29 2021 daemon.warn pppd[3602]: Timeout waiting for PADO packets
Fri Feb 19 09:03:29 2021 daemon.err pppd[3602]: Unable to complete PPPoE Discovery
Fri Feb 19 09:03:29 2021 daemon.info pppd[3602]: Exit.

In good times when there is no problem with the WAN connection, ifdown wan && sleep 10 && ifup wan gives this:

Sat Feb 20 00:14:29 2021 daemon.info pppd[1447]: Terminating on signal 15
Sat Feb 20 00:14:29 2021 daemon.info pppd[1447]: Connect time 0.7 minutes.
Sat Feb 20 00:14:29 2021 daemon.info pppd[1447]: Sent 768857 bytes, received 649183 bytes.
Sat Feb 20 00:14:29 2021 daemon.notice netifd: Network device 'pppoe-wan' link is down
Sat Feb 20 00:14:29 2021 daemon.debug pppd[1447]: Script /lib/netifd/ppp-down started (pid 1937)
Sat Feb 20 00:14:29 2021 daemon.debug pppd[1447]: sent [LCP TermReq id=0x2 "User request"]
Sat Feb 20 00:14:29 2021 daemon.debug pppd[1447]: rcvd [LCP TermAck id=0x2]
Sat Feb 20 00:14:29 2021 daemon.notice pppd[1447]: Connection terminated.
Sat Feb 20 00:14:29 2021 daemon.debug pppd[1447]: Send PPPOE Discovery V1T1 PADT session 0x1d2b length 28
Sat Feb 20 00:14:29 2021 daemon.debug pppd[1447]:  dst f8:13:08:40:d3:d5  src c0:4a:00:2c:d9:fa
Sat Feb 20 00:14:29 2021 daemon.debug pppd[1447]:  [host-uniq  00 00 05 a7] [AC-cookie  dc 8a 2d 68 eb 52 65 0b 7a dd 09 12 42 a5 97 32]
Sat Feb 20 00:14:29 2021 daemon.info pppd[1447]: Sent PADT
Sat Feb 20 00:14:29 2021 daemon.debug pppd[1447]: Waiting for 1 child processes...
Sat Feb 20 00:14:29 2021 daemon.debug pppd[1447]:   script /lib/netifd/ppp-down, pid 1937
Sat Feb 20 00:14:29 2021 daemon.debug pppd[1447]: Script /lib/netifd/ppp-down finished (pid 1937), status = 0x1
Sat Feb 20 00:14:29 2021 daemon.info pppd[1447]: Exit.
Sat Feb 20 00:14:29 2021 daemon.notice netifd: Interface 'wan' is now down
Sat Feb 20 00:14:29 2021 daemon.notice netifd: Interface 'wan' is disabled
Sat Feb 20 00:14:29 2021 daemon.notice netifd: VLAN 'eth0.2' link is down
Sat Feb 20 00:14:29 2021 daemon.notice netifd: Interface 'wan' has link connectivity loss
...
Sat Feb 20 00:14:39 2021 daemon.notice netifd: Interface 'wan' is enabled
Sat Feb 20 00:14:39 2021 daemon.notice netifd: VLAN 'eth0.2' link is up
Sat Feb 20 00:14:39 2021 daemon.notice netifd: Interface 'wan' has link connectivity
Sat Feb 20 00:14:39 2021 daemon.notice netifd: Interface 'wan' is setting up now
Sat Feb 20 00:14:39 2021 daemon.err insmod: module is already loaded - slhc
Sat Feb 20 00:14:39 2021 daemon.err insmod: module is already loaded - ppp_generic
Sat Feb 20 00:14:39 2021 daemon.err insmod: module is already loaded - pppox
Sat Feb 20 00:14:39 2021 daemon.err insmod: module is already loaded - pppoe
Sat Feb 20 00:14:39 2021 daemon.info pppd[2028]: Plugin rp-pppoe.so loaded.
Sat Feb 20 00:14:39 2021 daemon.info pppd[2028]: RP-PPPoE plugin version 3.8p compiled against pppd 2.4.7
Sat Feb 20 00:14:39 2021 daemon.notice pppd[2028]: pppd 2.4.7 started by root, uid 0
Sat Feb 20 00:14:39 2021 daemon.debug pppd[2028]: Send PPPOE Discovery V1T1 PADI session 0x0 length 12
Sat Feb 20 00:14:39 2021 daemon.debug pppd[2028]:  dst ff:ff:ff:ff:ff:ff  src c0:4a:00:2c:d9:fa
Sat Feb 20 00:14:39 2021 daemon.debug pppd[2028]:  [service-name] [host-uniq  00 00 07 ec]
Sat Feb 20 00:14:39 2021 daemon.debug pppd[2028]: Recv PPPOE Discovery V1T1 PADO session 0x0 length 52
Sat Feb 20 00:14:39 2021 daemon.debug pppd[2028]:  dst c0:4a:00:2c:d9:fa  src f8:13:08:40:d3:d5
Sat Feb 20 00:14:39 2021 daemon.debug pppd[2028]:  [service-name] [AC-name acc-aln20.eh-don] [host-uniq  00 00 07 ec] [AC-cookie  dc 8a 2d 68 eb 52 65 0b 7a dd 09 12 42 a5 97 32]
Sat Feb 20 00:14:39 2021 daemon.debug pppd[2028]: Send PPPOE Discovery V1T1 PADR session 0x0 length 32
Sat Feb 20 00:14:39 2021 daemon.debug pppd[2028]:  dst f8:13:08:40:d3:d5  src c0:4a:00:2c:d9:fa
Sat Feb 20 00:14:39 2021 daemon.debug pppd[2028]:  [service-name] [host-uniq  00 00 07 ec] [AC-cookie  dc 8a 2d 68 eb 52 65 0b 7a dd 09 12 42 a5 97 32]
Sat Feb 20 00:14:39 2021 daemon.debug pppd[2028]: Recv PPPOE Discovery V1T1 PADS session 0x1d37 length 12
Sat Feb 20 00:14:39 2021 daemon.debug pppd[2028]:  dst c0:4a:00:2c:d9:fa  src f8:13:08:40:d3:d5
Sat Feb 20 00:14:39 2021 daemon.debug pppd[2028]:  [service-name] [host-uniq  00 00 07 ec]
Sat Feb 20 00:14:39 2021 daemon.debug pppd[2028]: PADS: Service-Name: ''
Sat Feb 20 00:14:39 2021 daemon.info pppd[2028]: PPP session is 7479
Sat Feb 20 00:14:39 2021 daemon.warn pppd[2028]: Connected to f8:13:08:40:d3:d5 via interface eth0.2
Sat Feb 20 00:14:39 2021 daemon.debug pppd[2028]: using channel 2
Sat Feb 20 00:14:39 2021 kern.info kernel: [   80.579823] pppoe-wan: renamed from ppp0
Sat Feb 20 00:14:39 2021 daemon.info pppd[2028]: Renamed interface ppp0 to pppoe-wan
Sat Feb 20 00:14:39 2021 daemon.info pppd[2028]: Using interface pppoe-wan
Sat Feb 20 00:14:39 2021 daemon.notice pppd[2028]: Connect: pppoe-wan <--> eth0.2
Sat Feb 20 00:14:39 2021 daemon.debug pppd[2028]: sent [LCP ConfReq id=0x1 <mru 1492> <magic 0xfcef8687>]
Sat Feb 20 00:14:39 2021 daemon.debug pppd[2028]: rcvd [LCP ConfReq id=0x17 <mru 1492> <auth chap MD5> <magic 0x11fe2551>]
Sat Feb 20 00:14:39 2021 daemon.debug pppd[2028]: sent [LCP ConfAck id=0x17 <mru 1492> <auth chap MD5> <magic 0x11fe2551>]
Sat Feb 20 00:14:39 2021 daemon.debug pppd[2028]: rcvd [LCP ConfAck id=0x1 <mru 1492> <magic 0xfcef8687>]
Sat Feb 20 00:14:39 2021 daemon.debug pppd[2028]: sent [LCP EchoReq id=0x0 magic=0xfcef8687]
[...chap auth...]
Sat Feb 20 00:14:39 2021 daemon.debug pppd[2028]: rcvd [LCP EchoRep id=0x0 magic=0x11fe2551]
Sat Feb 20 00:14:40 2021 daemon.debug pppd[2028]: rcvd [CHAP Success id=0x1 "CHAP authentication success"]
Sat Feb 20 00:14:40 2021 daemon.info pppd[2028]: CHAP authentication succeeded: CHAP authentication success
Sat Feb 20 00:14:40 2021 daemon.notice pppd[2028]: CHAP authentication succeeded
Sat Feb 20 00:14:40 2021 daemon.notice pppd[2028]: peer from calling number F8:13:08:40:D3:D5 authorized
Sat Feb 20 00:14:40 2021 daemon.debug pppd[2028]: sent [IPCP ConfReq id=0x1 <addr 0.0.0.0> <ms-dns1 0.0.0.0> <ms-dns2 0.0.0.0>]
Sat Feb 20 00:14:40 2021 daemon.debug pppd[2028]: sent [IPV6CP ConfReq id=0x1 <addr fe80::4030:fda2:e257:4e52>]
Sat Feb 20 00:14:40 2021 daemon.debug pppd[2028]: rcvd [IPCP ConfReq id=0x1f <addr 172.16.14.227>]
Sat Feb 20 00:14:40 2021 daemon.debug pppd[2028]: sent [IPCP ConfAck id=0x1f <addr 172.16.14.227>]
Sat Feb 20 00:14:40 2021 daemon.debug pppd[2028]: rcvd [IPCP ConfNak id=0x1 <addr 147.147.154.104> <ms-dns1 212.159.6.9> <ms-dns2 212.159.6.10>]
Sat Feb 20 00:14:40 2021 daemon.debug pppd[2028]: sent [IPCP ConfReq id=0x2 <addr 147.147.154.104> <ms-dns1 212.159.6.9> <ms-dns2 212.159.6.10>]
Sat Feb 20 00:14:40 2021 daemon.debug pppd[2028]: rcvd [IPCP ConfAck id=0x2 <addr 147.147.154.104> <ms-dns1 212.159.6.9> <ms-dns2 212.159.6.10>]
Sat Feb 20 00:14:40 2021 daemon.notice pppd[2028]: local  IP address 147.147.154.104
Sat Feb 20 00:14:40 2021 daemon.notice pppd[2028]: remote IP address 172.16.14.227
Sat Feb 20 00:14:40 2021 daemon.notice pppd[2028]: primary   DNS address 212.159.6.9
Sat Feb 20 00:14:40 2021 daemon.notice pppd[2028]: secondary DNS address 212.159.6.10
Sat Feb 20 00:14:40 2021 daemon.debug pppd[2028]: Script /lib/netifd/ppp-up started (pid 2048)
Sat Feb 20 00:14:40 2021 daemon.notice netifd: Network device 'pppoe-wan' link is up
Sat Feb 20 00:14:40 2021 daemon.notice netifd: Interface 'wan6' is enabled
Sat Feb 20 00:14:40 2021 daemon.notice netifd: Network alias 'pppoe-wan' link is up
Sat Feb 20 00:14:40 2021 daemon.notice netifd: Interface 'wan6' has link connectivity
Sat Feb 20 00:14:40 2021 daemon.notice netifd: Interface 'wan6' is setting up now
Sat Feb 20 00:14:40 2021 daemon.notice netifd: Interface 'wan' is now up
Sat Feb 20 00:14:40 2021 daemon.debug pppd[2028]: Script /lib/netifd/ppp-up finished (pid 2048), status = 0x1

start downgrading(or upgrading) until it's fixed...
max out the verbosity on daemons and proto handlers... (or packet level inspections/comparisons)

looks pretty well confirmed now...

2 Likes

I have this or very similar on a Netgear WNDR3700v4 device. It's driving the family crazy - but more importantly me too! Hopefully our logs align and I'm not thread hijacking!

In my case it does recover after about 5 minutes if I leave it alone. I've read that some ISPs will clear down a connection after a period of time, so maybe this is why mine recovers but yours doesn't.

Two other quirks that may or may not be helpful in understanding what is happening:

  1. I can't reach the device on the LAN side RJ45 ports. It's as if the physical or logical part of the switch has locked up, and may be preventing all RJ45 traffic on LAN and WAN.
  2. If I unplug the WAN cable and put it back in then things come back to life. Why I do not know, unless this is kicking some hardware level trigger that fixes point 1.

After a very bad period a week or two ago I though I'd fixed it last week by physically shifting the modem which sits between phone line and router. I assumed it was a bent / stressed cable - but it happened again yesterday.

I'll happily join any attempt to figure out what is happening.

Mon Jun 28 12:56:20 2021 daemon.debug pppd[1390]: rcvd [LCP EchoReq id=0x5a magic=0x125e9e37]
Mon Jun 28 12:56:20 2021 daemon.debug pppd[1390]: sent [LCP EchoRep id=0x5a magic=0xfa78f8a1]
...
Mon Jun 28 12:56:31 2021 daemon.debug pppd[1390]: sent [LCP EchoReq id=0x65 magic=0xfa78f8a1]
Mon Jun 28 12:56:32 2021 daemon.debug pppd[1390]: sent [LCP EchoReq id=0x66 magic=0xfa78f8a1]
Mon Jun 28 12:56:33 2021 daemon.debug pppd[1390]: sent [LCP EchoReq id=0x67 magic=0xfa78f8a1]
Mon Jun 28 12:56:34 2021 daemon.debug pppd[1390]: sent [LCP EchoReq id=0x68 magic=0xfa78f8a1]
Mon Jun 28 12:56:35 2021 daemon.debug pppd[1390]: sent [LCP EchoReq id=0x69 magic=0xfa78f8a1]
Mon Jun 28 12:56:36 2021 daemon.info pppd[1390]: No response to 5 echo-requests
Mon Jun 28 12:56:36 2021 daemon.notice pppd[1390]: Serial link appears to be disconnected.
Mon Jun 28 12:56:36 2021 daemon.info pppd[1390]: Connect time 491.9 minutes.
Mon Jun 28 12:56:36 2021 daemon.info pppd[1390]: Sent 320957870 bytes, received 1973876693 bytes.
Mon Jun 28 12:56:36 2021 daemon.notice netifd: Network device 'pppoe-wan' link is down
...
Mon Jun 28 12:59:13 2021 daemon.debug pppd[2325]:  [service-name] [host-uniq  00 00 09 15]
Mon Jun 28 12:59:43 2021 daemon.warn pppd[2325]: Timeout waiting for PADO packets
Mon Jun 28 12:59:43 2021 daemon.err pppd[2325]: Unable to complete PPPoE Discovery
Mon Jun 28 12:59:43 2021 daemon.info pppd[2325]: Exit.
...
Mon Jun 28 13:02:14 2021 daemon.debug pppd[2442]:  [service-name] [host-uniq  00 00 09 8a]
Mon Jun 28 13:02:14 2021 daemon.debug pppd[2442]: Recv PPPOE Discovery V1T1 PADO session 0x0 length 50

Have you tried /etc/init.d/network reload (or restart)?

No idea if they will work, but it might indicate what level of thing needs looking at if it does.

I'm going to adopt your script with that in to see if it helps me. I've created it in the same place as you. Does hotplug run it automatically or do I need to do something with cron and / or other config files?

EDIT: Reading https://openwrt.org/docs/guide-user/base-system/hotplug I assume it will be executed automatically. I did a chmod +x /etc/hotplug.d/iface/00-reboot-on-pado-timeout in case it needs that.

I've seen similar behaviour even in 14.07 .

At least in the case I observed it was like the WAN port was stuck.

@kyz
When this behaviour happens monitor what is sent/received on the WAN.

1 Like

Just out of curiosity, what happens (next time it goes down) if you invoke ifdown wan, wait ~5 minutes and then invoke ifup wan again? Some ISPs might not be too happy if you reconnect too quickly (in their opinion) - and constantly trying to re-connect might get you blacklisted for a few minutes.

1 Like

From what I've read about Plusnet in UK they don't mind the retry, but don't clear down the old connection for a few minutes. This is confusing as the logs show a PADT being sent, which should shut it down "nicely". Unless it doesn't get there, which would be consistent with the physical port dying.

For me the biggest mystery is why physically unplugging the WAN cable on the router and putting it back in gets things going again. Unless this makes the modem kick something upstream at the ISP it is making something in the router settle itself, such as the hardware, device driver, or network stack. I've no idea which.

1 Like

How would this be done? On OpenWRT, or with something like a PC running Wireshark on a (non-switching) hub with the modem and WAN port? I could probably set that up.

Thanks for all the suggestions so far.

Yes, it runs it whenever there's an interface state change. The script is pretty brutal, it just reboots if it ever sees the error message in the logs (with a grace period of 3 minutes so you have a chance to log in and disable it should it go haywire and put the router in a reboot loop)

What I would normally do is run ssh root@openwrt tcpdump -w- -i eth0 not tcp port 22 >dump.pcap which streams a dump of all ethernet traffic to my computer (excepting the ssh traffic which it's creating). This needs the tcpdump package installed on the router.

I also experimented with kmod-fs-cifs and cifsmount packages to mount a local smb share and write tcpdump output there autonomously, but that didn't work reliably for me.

that is useful...

"my bicycle A always has problems"
"my bicycle A always runs fine"

load and conditions relevant?

seems to me any 'provider' / ppp symptoms in this instance are subject to the above...
(and that it probably is a predominantly soc/dsl chip driver/interaction thing that more than likely is known to some if not already discussed elsewhere on the forum... if I were you guys i'd be picking through lantiq commits or whatnot and/or comparing driver source trees)

random disconnected from PPPOE. after that it was spamming "timeout waiting for PADO packets".
OpenWRT was reporting sending the right packets but not receiving anything (that's exactly what the message "timeout waiting for PADO packets" was suggesting it was the case)
dump done with a device between OpenWRT and ISP equipment was showing that nothing was sent from the OpenWRT WAN port.
the OpenWRT WAN led was blinking as it was sending the right thing.

only a reboot was fixing it. it has been observed on 3 devices with distinct SoC, but it wasn't that often on 2 of them. issue was happening anywhere between 3 min to 90 days of uptime.
I failed to replicate it with a local PPPOE server, but then again I didn't run that local PPPOE server more than 7 days...

I totaly failed to replicate it with X86. it was happening with Mips only (i only have mips routers (Qualcomm Atheros)...).
it was totaly random not related to cpu load, network load (it never happend with high loads to begin with).

L.E.: i'm no longer capable to replicate the situation that was causing it cause i changed isp and with the local PPPOE server i can't really trigger it in a resonable amount of time...
the cure to the problem while i was still having the isp that was triggering the problem was to use x86...

L.E. 2: there was nothing usefull in the logs. no crash or anything like that. there were also no signs of oom situation.

L.E. 3: to be honest at some point i started to believe that either there is something in particular with how the cable, my router, old isp equipment interact and is causing the wan port to hang or that i'm dealing with a bug that went on in several Atheros/Qualcomm SoCs (possible if they didn't designed them from 0).
because i wanted to cure the disconnect (so i'm not kicked out of the games cause that's what was happening) and i realise I can't really do it (because i highly doubted I can prevent the disconnect from happening) so i just gave up at some point (was kinda pointless for me to figure out how to get the wan port working after it hanged, that wasn't what i wanted to fix :stuck_out_tongue: ).

L.E. 4: isp equipment - my switch - openwrt router. when the issue was starting I could restart "my switch" no matter how many times i wanted it wasn't going to fix the problem.

1 Like

Maybe we have different manifestations of the same underlying problem, but as I mention above I lose wired connectivity on the LAN side as well. I'm not sure if wi-fi links stop as well - that seems less proven.

I'll take a look at my hub idea.

Well I put an old hub in there, then realised it was a 10Mb device which probably wouldn't cheer the family up much! I've now put a good quality 100Mb switch between modem and router, which won't let me watch traffic but does seem to be making throughput faster and more stable. Possibly imagining what I want to see...

Quick update...

The switch between modem and router didn't fix things.

ISP got involved and started tweaking things, including a "rejumpering" to different ports in the street cabinet and / or telephone exchange.

I wanted to give them an "industry standard" reference point to see if things were more stable, so flashed the router that was running OpenWRT back to stock TP-LINK firmware. I used it for a week or so and it was very good - no noticeable dropouts in the day and nothing showing in the PPP logs that it makes available.

I also took up the opportunity to get an ISP supplied router, and that has been in place for the last week. Again it has been super stable.

Now I'm confident that the line is stable I'm going to flash the TP-LINK device to OpenWRT again and see if it remains up. If not I'll be reasonably comfortable saying that the problem is somewhere in OpenWRT.

Hi, I am having the same problem, but it's very rare.

Only reboot works.

Tried: wait for auto recovery, reconnect the ISP cable, kill the pppd process, restart dsl_control and network, change interface mac and other configurations.

Didn't tried yet to send debug and terminate signals to pppd.

Couldn't capture packeds from tcpdump.

Did not inspected pppd source and ppp.sh.

Now testing:

I don't believe this is a problem with pppoe implementation in OpenWRT.

I think the problem is either with the code controling the switch/wan port or a harware bug that actually affects multiple SoCs (might not even be possible to fix it without a disconnect in pppoe case due to pppoe nature (way less friendly to disruptions compare to DHCP)).
You can test it relative simple. When you start to get the Timeout waiting for PADO packets unplug the wan cable, change the wan protocol to DHCP and connect it to the lan of another router that has DHCP active. If you get an ip via DHCP on wan it's not really a problem with the code regarding the wan port, if you get no ip and there is nothing in dmesg (talking here about a crash log or some message being spammed) most likely the port is stuck.

I'm no longer having the ISP I could easier trigger it and with a local PPPOE server it's not that easy (not much of a help when maybe it triggers random 1 time in 5+ days...)

What makes me be almost be sure it's the code related the switch/wan port is the fact that with an x86 build it never triggered in my case.