PPPoE WAN link spurious drops: how to diagnose?

I have a PoE-powered MikroTik hEX S serving my gigabit fibre connection. The lan5 port goes to the ONT with PoE passthrough, with a 12V PoE splitter at the far end to power the ONT, which is in my garage.
Over the past couple of days I've started noticing spurious connection drops at apparently random times of day.
To eliminate PoE power problems I tried powering the ONT with its wall wart, but the drops are still happening.

What seems to be happening is that the router detects the lan5 ethernet drop, which then tears down the whole PPPoE stack:

Wed May  7 19:15:22 2025 daemon.notice netifd: Network device 'lan5' link is down
Wed May  7 19:15:22 2025 kern.info kernel: [644730.706394] mt7530-mdio mdio-bus:1f lan5: Link is Down
Wed May  7 19:15:22 2025 daemon.notice netifd: 8021q 'lan5.40' link is down
Wed May  7 19:15:22 2025 daemon.notice netifd: Interface 'wan' has link connectivity loss

These events seem to be over in seconds, 5 seconds in this case:

Wed May  7 19:15:27 2025 daemon.notice netifd: Network device 'lan5' link is up

but still quite annoying as e.g. adblock-fast reloads all the blocklists, and so DNS service is out for a bit.
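To put numbers on how long each drop lasts without eyeballing timestamps, something like the following could be run over a saved copy of the log. It is only a sketch: it assumes the netifd message wording and the timestamp layout shown above, and the `outages` helper name is made up for illustration.

```python
from datetime import datetime

# Two lines copied from the log above; in practice, read a saved logread dump.
LOG = """\
Wed May  7 19:15:22 2025 daemon.notice netifd: Network device 'lan5' link is down
Wed May  7 19:15:27 2025 daemon.notice netifd: Network device 'lan5' link is up
"""

def parse_ts(line):
    # The first five whitespace-separated fields form the timestamp.
    fields = line.split()[:5]
    return datetime.strptime(" ".join(fields), "%a %b %d %H:%M:%S %Y")

def outages(log_text, device="lan5"):
    """Return (down_time, seconds_down) for each down/up pair of `device`."""
    down_marker = f"Network device '{device}' link is down"
    up_marker = f"Network device '{device}' link is up"
    results = []
    went_down = None
    for line in log_text.splitlines():
        if down_marker in line:
            went_down = parse_ts(line)
        elif up_marker in line and went_down is not None:
            seconds = (parse_ts(line) - went_down).total_seconds()
            results.append((went_down, seconds))
            went_down = None
    return results

for start, secs in outages(LOG):
    print(f"{start:%Y-%m-%d %H:%M:%S} down for {secs:.0f}s")
```

Run over several days of logs, this gives a list of drop times and durations that is easier to compare against time of day or traffic.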

This connection has been rock solid for the past 5 years or so, so this kerfuffle is new. Up until a couple of weeks ago I used an EdgeRouter-X plumbed the same way (though no PoE passthrough), so I guess the problem most likely has to do with the switch.

My main question is what I can do to diagnose this to a cause (aside from staring at the existing logs).
I figure the problem has to be one of:

  1. The router.
  2. The connection to the ONT.
    • The patch cable from the router to the patch panel.
    • The run from the patch panel to the garage.
    • The patch cable from the garage outlet to the ONT.
  3. The ONT.
  4. A network issue on the ISP side.

In my experience these MT7621A devices are rock solid, so I think it has to be wiring or the ONT.
For the wiring, as I don't have any gear for testing it, I guess I'm down to wiggle-and-replace?
Is there any reasonably priced cable testing equipment I could pick up?

Is there any logging I could enable to see precisely why the interface is going down?

The problem with the ONT is that aside from its blinkenlichten, I don't know that it has any kind of diagnostic interface on my side. ... time passes ... Well aktually, it does have a fixed IP I can talk to. Alas there's not much there unless I hook up to the serial port.
Point a camera at it? Other ideas?

Add
list pppd_options 'debug'
https://openwrt.org/docs/guide-user/network/wan/wan_interface_protocols#protocol_pppoe_ppp_over_ethernet
Then check the log, note the offered/negotiated options, and try to get your side to match the provider's.

So the hypothesis would be that the ONT or the ISP is dropping the ethernet connection because of mismatched PPP options?

Your log messages do not decisively show whether it is chicken or egg.

Thanks, I don't know much about PPPoE nor GPON nor my ISP's (EBOX) network. Their support used to be quite approachable on DSLReports, but alas it seems they've curled up into a Bell-shaped ball now :frowning:.

It does seem from the logs that the Ethernet link is dropping first? As a matter of fact, I get the same exact sequence of log statements when I unplug the patch cable from the router.
I was hoping there'd be some kind of metric I could pull from all the way down at the physical layer that might allow me to diagnose problems down to there.
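One place such metrics do exist on Linux is sysfs: each interface has counters under /sys/class/net/<iface>/, including how often the carrier has changed and how many receive errors were seen. A rough sketch of polling them follows. It assumes Python is installed on the router, which a stock OpenWrt image usually does not have (the same files can simply be read with cat), and whether a DSA switch port like lan5 reports meaningful error counts there is something I have not verified.

```python
from pathlib import Path

# Standard Linux sysfs counters, relative to /sys/class/net/<iface>/.
COUNTERS = [
    "carrier_changes",
    "statistics/rx_errors",
    "statistics/rx_crc_errors",
    "statistics/rx_frame_errors",
    "statistics/tx_errors",
]

def read_counters(iface, root="/sys/class/net"):
    """Return {counter_name: int or None} for one interface."""
    base = Path(root) / iface
    values = {}
    for name in COUNTERS:
        try:
            values[name] = int((base / name).read_text().strip())
        except (OSError, ValueError):
            # Missing or unreadable on this platform.
            values[name] = None
    return values

if __name__ == "__main__":
    for name, value in read_counters("lan5").items():
        print(f"{name}: {value}")
```

CRC or frame errors that climb between drops would point towards cabling; a carrier_changes count that rises with no errors at all would fit the far end dropping the link.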

Reading up on GPON, I speculate that the ONT would only make it look like the Ethernet link dropped when it loses the optical carrier. Seeing as how GPON downstream is effectively a broadcast, I find it hard to believe this would affect me alone.

So far I've tried replacing the patch cable from the router with no change. It's hard to believe that the run from upstairs to the garage would have suffered a fault and everything in the garage was undisturbed when the problem started.

You have to enable pppd debug to see which happens first: whether a bad PPP packet comes first and triggers a line reset, or the Ethernet port on one of the devices glitches.
I don't have a crystal ball to guess.

So I enabled the debug option and AFAICT there's nothing controversial going on during connection. I do see a PAP authentication attempt that fails, followed by a CHAP authentication attempt that succeeds. I guess it would be expedient at least to disable PAP?

I also see mru negotiation for 1492 bytes.
Now, stupidly, I changed two things at a time. I'd previously bumped the MTU on the interface to 1500, which in turn bumps the MTU on the :40 VLAN interface to 1508. I suspect that reverting to defaults (1492) has cleared up the spurious drops, as the most recent session has lasted 48+ hours.
I'm going to play with this, slowly, going back to the PoE config and then I'll try to bump the MTU, see what happens.

I'm morbidly^Wacademically curious to know how an MTU of 1508 on the non-VLAN interface leads to spurious connection drops. If I can figure out what crashes the ONT maybe I can work around it.

Thanks so much for your help!

Let it run continuously for a few days and check speed/latency. Then either raise the log buffer size or disable debug; the change will kick in at the next router reboot/upgrade.
Thanks for the detailed report.

Many ONTs do not support an MTU above 1500 bytes on their Ethernet ports.

An ONT Ethernet port MTU of 1500 bytes means the OpenWrt device's Ethernet port MTU is effectively also 1500 bytes, so the pppoe-wan interface in OpenWrt can only have an MTU of 1492 bytes or lower.

You can ask your provider whether they can support 1500-byte frames over PPPoE; technically it is just a few checkboxes on their side. Any recent gigabit Ethernet adapter handles 1600-byte frames.

My ONT talks on VLAN 40, but 802.1Q is effectively a prefix to a regular Ethernet frame?
... time passes ...
Oh, I need to read up on this lots more, I don't think I understand PPPoE encapsulation at all for one thing, nor any other part of this, for that matter.

My understanding was that a 1508 MTU on the base interface would allow the 802.1Q prefix AND a regular Ethernet frame with 1500 content bytes. Looks like PPPoE carries an additional 8 bytes of header overhead. Now I comprehend nothing, as I don't see how 1508 less 16 allows a pmtu of 1500.
Guess I'd better do some captures to try and see what's what.
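For what it's worth, the arithmetic seems to work out if the 802.1Q tag is not counted against the MTU. As far as I understand it, on Linux the MTU covers only the Ethernet payload, the 4-byte VLAN tag is treated as part of the Ethernet header, and PPPoE takes 8 bytes out of the payload (a 6-byte PPPoE header plus a 2-byte PPP protocol field). So only 8 bytes come off 1508, not 16. A small worked sketch, with the function names invented for illustration:

```python
ETH_HEADER = 14    # destination MAC + source MAC + EtherType
VLAN_TAG = 4       # 802.1Q tag, part of the header, not of the MTU
PPPOE_HEADER = 6   # version/type, code, session ID, length
PPP_PROTOCOL = 2   # PPP protocol field
FCS = 4            # frame check sequence

def ppp_mtu(eth_mtu):
    """Largest IP packet that fits in one PPPoE frame for a given Ethernet MTU."""
    return eth_mtu - PPPOE_HEADER - PPP_PROTOCOL

def frame_bytes(eth_mtu, tagged=True):
    """Size of a full-MTU frame on the wire, excluding preamble."""
    return ETH_HEADER + (VLAN_TAG if tagged else 0) + eth_mtu + FCS

for eth_mtu in (1500, 1508):
    print(f"eth mtu {eth_mtu}: ppp mtu {ppp_mtu(eth_mtu)}, "
          f"tagged frame {frame_bytes(eth_mtu)} bytes")
# eth mtu 1500: ppp mtu 1492, tagged frame 1522 bytes
# eth mtu 1508: ppp mtu 1500, tagged frame 1530 bytes
```

So a 1508 MTU on the Ethernet side is what a PPP MTU of 1500 requires, and it means the ONT's Ethernet port has to accept tagged frames of up to 1530 bytes, which it may not.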

Thanks y'all.

In the meantime I've re-enabled PoE passthrough to power the ONT.
So far so good. I don't expect these drops are due to any kind of a power problem.

Son of a diddly!

May  9 00:12:39 OpenWrt pppd[2972]: Sent 1432358204 bytes, received 230826890 bytes.
May  9 09:10:18 OpenWrt pppd[19416]: Sent 1432882492 bytes, received 181923825 bytes.
May  9 11:34:52 OpenWrt pppd[22442]: Sent 1432423740 bytes, received 36073876 bytes.
May  9 11:58:39 OpenWrt pppd[1939]: Sent 1432161596 bytes, received 11167089 bytes.
May  9 15:40:51 OpenWrt pppd[7952]: Sent 1432489276 bytes, received 104313725 bytes.
May  9 17:46:03 OpenWrt pppd[24659]: Sent 1432227132 bytes, received 60601848 bytes.
May 10 10:17:47 OpenWrt pppd[2777]: Sent 1432292668 bytes, received 508886525 bytes.
May 12 13:42:49 OpenWrt pppd[429]: Sent 1432161596 bytes, received 1261454460 bytes.
May 12 19:43:53 OpenWrt pppd[23927]: Sent 1432292668 bytes, received 219627148 bytes.
May 13 00:11:31 OpenWrt pppd[24654]: Sent 1432423740 bytes, received 118624881 bytes.
May 13 13:02:16 OpenWrt pppd[11415]: Sent 1432161596 bytes, received 732170206 bytes.

There's a little bit of commonality there. Seeing as how my wan interface (lan5 in my case) always drops first, I wonder if the ONT is set up to reset/reboot after about 1.43 GB (roughly 1.33 GiB) sent?
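To look at that commonality a bit more closely, here is a quick sketch that pulls the Sent figures out of the lines above and compares them. The only input is the log as pasted; the regex assumes pppd's "Sent N bytes" wording as it appears there.

```python
import re

LOG = """\
May  9 00:12:39 OpenWrt pppd[2972]: Sent 1432358204 bytes, received 230826890 bytes.
May  9 09:10:18 OpenWrt pppd[19416]: Sent 1432882492 bytes, received 181923825 bytes.
May  9 11:34:52 OpenWrt pppd[22442]: Sent 1432423740 bytes, received 36073876 bytes.
May  9 11:58:39 OpenWrt pppd[1939]: Sent 1432161596 bytes, received 11167089 bytes.
May  9 15:40:51 OpenWrt pppd[7952]: Sent 1432489276 bytes, received 104313725 bytes.
May  9 17:46:03 OpenWrt pppd[24659]: Sent 1432227132 bytes, received 60601848 bytes.
May 10 10:17:47 OpenWrt pppd[2777]: Sent 1432292668 bytes, received 508886525 bytes.
May 12 13:42:49 OpenWrt pppd[429]: Sent 1432161596 bytes, received 1261454460 bytes.
May 12 19:43:53 OpenWrt pppd[23927]: Sent 1432292668 bytes, received 219627148 bytes.
May 13 00:11:31 OpenWrt pppd[24654]: Sent 1432423740 bytes, received 118624881 bytes.
May 13 13:02:16 OpenWrt pppd[11415]: Sent 1432161596 bytes, received 732170206 bytes.
"""

sent = [int(n) for n in re.findall(r"Sent (\d+) bytes", LOG)]
base = min(sent)

for value in sent:
    diff = value - base
    # Express each value as base + k * 65536 + remainder.
    print(f"{value}  {value / 2**30:.3f} GiB  "
          f"k={diff // 65536}  remainder={diff % 65536}")
```

If I've done this right, every Sent value is the smallest one plus an exact multiple of 65536, and some values repeat exactly even for sessions that received very different amounts. That might mean the figure is not a genuine count of bytes sent at the moment pppd logs it, which would weaken the reset-at-1.3G theory, but that is a guess on my part.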

Try to max out the numbers on fast.com; you can push 10 gigs through there.
In general, proceed to plan B: leave debug on and set the log buffer to 1024 kB or more. Check a day later that you still have the whole day's worth of logs.