ICMP drops when RTT changes

Hi community,

I'm running OpenWrt 23.05.0 on a Banana Pi BPI-R3.

My WAN connection is FTTH GPON at 2.5 Gbit/s, and the Banana Pi is connected to the ONT through a 2.5 Gbit/s RJ45 SFP adapter.

I occasionally notice that my connection becomes unresponsive (e.g., web calls go silent for a few seconds), while the PPPoE connection to my ISP remains up during these events.

To monitor this, I enabled the collectd ping plugin to continuously check reachability of 8.8.8.8 and 1.1.1.1.

After a few days of data collection, I noticed that occasional ICMP drops occur simultaneously on both targets. Since both targets are affected at the same time, I attribute the problem to the connection itself rather than to either target.
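The reasoning above (loss on all targets at once points at the shared path, loss on a single target points at that target or its route) can be checked mechanically on exported ping data. This is only a minimal sketch: the data layout, the function names and the sample values are assumptions made for illustration and are not taken from collectd or from the graphs in this thread.

```python
from typing import Dict, List, Optional

# One entry per probe interval: None marks a lost probe, a float is the RTT in ms.
Samples = Dict[str, List[Optional[float]]]


def simultaneous_drops(samples: Samples) -> List[int]:
    """Return the interval indexes at which every target lost its probe."""
    if not samples:
        return []
    length = min(len(series) for series in samples.values())
    return [
        i for i in range(length)
        if all(series[i] is None for series in samples.values())
    ]


def isolated_drops(samples: Samples) -> Dict[str, List[int]]:
    """Return, per target, the intervals where only that target lost its probe."""
    length = min(len(series) for series in samples.values()) if samples else 0
    result: Dict[str, List[int]] = {name: [] for name in samples}
    for i in range(length):
        lost = [name for name, series in samples.items() if series[i] is None]
        if len(lost) == 1:
            result[lost[0]].append(i)
    return result


# Hypothetical samples, invented for this example.
example: Samples = {
    "8.8.8.8": [4.1, 4.0, None, 6.2, 6.1, None],
    "1.1.1.1": [3.9, 4.0, None, 6.0, None, None],
}

print(simultaneous_drops(example))  # intervals lost on every target
print(isolated_drops(example))      # intervals lost on one target only
```

If nearly all losses land in the first list, the shared part of the path (router, access line, first ISP hop) is the more likely suspect.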

This is the graph for the last 24 hours.

I have noticed that, most of the time, the drops coincide with a change in RTT for both targets. The difference in RTT is clearly visible and the same for both targets, so I suspect that something changes for the whole Internet connection.
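To put a number on "the drops coincide with a change in RTT", one rough approach is to compare the median RTT shortly before and shortly after each lost probe. The sketch below assumes a plain list of RTT samples per target; the function name, the window size, the 1 ms threshold and the sample values are all arbitrary choices for illustration.

```python
from statistics import median
from typing import List, Optional, Tuple


def rtt_shift_at_drops(
    series: List[Optional[float]],
    window: int = 3,
    min_shift_ms: float = 1.0,
) -> List[Tuple[int, float]]:
    """Return (index, shift in ms) for each lost probe where the median RTT
    of the following samples differs from the median of the preceding ones
    by at least min_shift_ms. None marks a lost probe."""
    shifts = []
    for i, value in enumerate(series):
        if value is not None:
            continue
        before = [v for v in series[max(0, i - window):i] if v is not None]
        after = [v for v in series[i + 1:i + 1 + window] if v is not None]
        if not before or not after:
            continue  # not enough data around this drop
        delta = median(after) - median(before)
        if abs(delta) >= min_shift_ms:
            shifts.append((i, round(delta, 2)))
    return shifts


# Hypothetical RTT samples in ms, invented for this example.
example = [4.1, 4.0, 4.2, None, 6.2, 6.1, 6.0, None, 6.1, 6.2, 6.0]

print(rtt_shift_at_drops(example))
```

Running the same function on each target and comparing the reported indexes and shifts shows whether the step in RTT is really identical across targets.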

I am trying to troubleshoot this problem, but I don't know how to tell whether the problem lies with my ISP or my router.

There is nothing notable in the logs during these events.

CPU and memory usage are very low.

Temperatures are stable.

There seem to be no transmit errors on the interfaces. To be honest, I don't know how to check for that on OpenWrt, but no errors are reported on the ONT web management interface.
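On the router side, the Linux kernel exposes per-interface counters in /proc/net/dev, which can be viewed with `cat /proc/net/dev`. The sketch below parses that file and lists only the non-zero error and drop counters. It assumes Python is available (it is not part of a default OpenWrt image), or that the file contents are copied to another machine; the interface name and numbers in the sample are invented.

```python
from typing import Dict

# Column order of /proc/net/dev after the "iface:" prefix.
RX_FIELDS = ["bytes", "packets", "errs", "drop", "fifo", "frame", "compressed", "multicast"]
TX_FIELDS = ["bytes", "packets", "errs", "drop", "fifo", "colls", "carrier", "compressed"]


def parse_proc_net_dev(text: str) -> Dict[str, Dict[str, int]]:
    """Parse the contents of /proc/net/dev into {iface: {counter: value}}."""
    counters: Dict[str, Dict[str, int]] = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        values = [int(v) for v in data.split()]
        if len(values) != 16:
            continue  # unexpected layout, skip rather than guess
        stats = {f"rx_{k}": v for k, v in zip(RX_FIELDS, values[:8])}
        stats.update({f"tx_{k}": v for k, v in zip(TX_FIELDS, values[8:])})
        counters[name.strip()] = stats
    return counters


def error_summary(counters: Dict[str, Dict[str, int]]) -> Dict[str, Dict[str, int]]:
    """Keep only non-zero error/drop counters, and only interfaces that have any."""
    keys = ("rx_errs", "rx_drop", "tx_errs", "tx_drop", "tx_carrier")
    summary = {}
    for iface, stats in counters.items():
        nonzero = {k: stats[k] for k in keys if stats[k]}
        if nonzero:
            summary[iface] = nonzero
    return summary


# Hypothetical sample; on the router, read the real file instead:
#   text = open("/proc/net/dev").read()
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo: 1000 10 0 0 0 0 0 0 1000 10 0 0 0 0 0 0
  eth1: 987654 4321 0 3 0 0 0 12 123456 2345 2 0 0 0 1 0
"""

print(error_summary(parse_proc_net_dev(SAMPLE)))
```

The counters are cumulative since boot, so it is more useful to take a snapshot before and after a drop event and compare the two than to look at absolute values.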

PON RX power seems good (-15.99 dBm).

Do you have any advice for me?

Thanks.

Use mtr on your router to figure out the first IP address outside your router. Then set up an ICMP probe to that address and check whether that shows the same drops...

Thank you @moeller0 for your advice.

I added an ICMP probe to my default gateway, but I get the same result as with the other IPs: the same drops and the same difference in RTT.

These are the last 24 hours (default gateway public IP hidden for privacy).

This time I noticed that there were fewer failed probes to Google than to the default gateway and Cloudflare. I had never seen any difference between the results before, so I assume this difference is not due to an actual lack of connectivity but is a random occurrence.

So the backbone of my ISP seems to be ruled out. I think I need to narrow the analysis down to either my router or the L2 transport in the GPON infrastructure.

I am thinking of using the built-in 1 Gbit/s interface instead of the SFP transceiver for a few days to rule out problems with the latter.

If even that doesn't change anything, I won't know what else to try.

Further advice is welcome.

Update:

I tried connecting the built-in 1 Gbit/s interface instead of the 2.5 Gbit/s SFP, and nothing changed.