Troubleshooting/Fixing a randomly unstable network connection

Hello

I hope I am not posting in the wrong forum/category, but I would appreciate any kind of help. I know enough about Windows and software in general, but I do not have much experience troubleshooting this kind of network issue.

My device:
I am using the latest stable OpenWRT 22.03.3 on a Raspberry Pi Compute Module 4 Rev 1.0 (to be precise, the "reRouter CM4 1432" with dual Gigabit Ethernet ports, sold by seeed studio).

My setup:
I live in an apartment complex, and my internet is provided by a building-wide fibre service. My apartment is connected via a single Gigabit Ethernet cable; no modem or login is needed for the connection, and there is nothing I can configure. That cable is plugged into the OpenWRT router, which connects to an unmanaged Netgear 8-port switch. From the switch, cables run to an Ethernet outlet in each room of the apartment, and some of those outlets feed further unmanaged switches, WiFi access points and ultimately devices (PC, game consoles, etc.).

To illustrate:
[source Ethernet cable] <-> [OpenWRT router] <-> [unmanaged Netgear 8-port switch GS308] <-> [additional switches/access points] <-> [Devices (consoles/PC/...)]

The issue:
I have sudden and completely random connection interruptions. This is most evident in games, for example FIFA23 on PS5. While navigating menus the game often freezes (as if waiting for a response from the server). During gameplay, even though everything is completely smooth and the pre-match screen shows a ping of 4-8ms, the game suddenly shows the error "connection lost to server" without showing any connection trouble icons beforehand. After being thrown back to the main menu I can reconnect immediately without any problem, but the experience is of course very frustrating.
There are days when it does not happen at all, and then sometimes it happens twice within 15 minutes, at random times throughout the day, so it is unlikely to be a capacity/bottleneck issue.

My settings:
The OpenWRT installation is vanilla, with the first Ethernet port (eth1) configured as WAN (DHCP client) and the second Ethernet port configured as LAN (bridge br-lan). The only additional software I have installed is Smart Queue Management (SQM) QoS, configured as per the Wiki. The issue occurred both before and after I installed the SQM package, and also after I tried adjusting the SQM settings.

The question:
What can I do to troubleshoot this issue? I am not sure if this is an issue that can be solved via OpenWRT configuration, or if the issue is with the provided source internet connection. But I need a starting point/evidence to even know what this issue could possibly be.

Sorry for the long post, any kind of help would be really great. Thanks!

The first thing I'd recommend is working out whether the problem is occurring at the router or on the LAN, or whether it is happening on your uplink. To do this, set up persistent pings on your computer to the following destinations:

  • the router
  • another device on your LAN
  • a site on the internet (I often use Google or the Google DNS server 8.8.8.8)

The pattern of interruptions will often provide some clues as to where to look/troubleshoot.
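
If you would rather leave the pings running unattended, here is a minimal sketch for the OpenWrt shell (or any Linux box) that keeps one ping running per target and writes every reply, with a timestamp, to its own log file. The function names, the /tmp log location and the example addresses 192.168.1.1 and 192.168.1.50 are my assumptions, not values from your setup, so substitute your own targets. I am also assuming your ping prints each reply as it arrives when its output is piped, which is what I would expect but have not verified on your build.

```shell
#!/bin/sh
# Sketch only: timestamped ping logs, one background logger per target.
# All names, paths and example addresses are assumptions; adjust them.

# Prefix every line read from stdin with the current date and time.
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done
}

# Ping one target until interrupted, appending stamped output to a log file.
log_ping() {
    target=$1
    logfile=$2
    ping "$target" | stamp_lines >> "$logfile"
}

# Start one background logger per target; logs go to <logdir>/ping-<target>.log
start_ping_logs() {
    logdir=$1
    shift
    for target in "$@"; do
        log_ping "$target" "$logdir/ping-$target.log" &
    done
}

# Example (not run automatically):
#   start_ping_logs /tmp 192.168.1.1 192.168.1.50 8.8.8.8
```

Keep in mind that /tmp on OpenWrt lives in RAM, so the logs disappear on reboot and should not be left to grow indefinitely. `killall ping` stops the loggers again.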


Also run a traceroute to find the IP of the first router on the other end of the cable from your apartment, and ping from OpenWrt to there. Check the logs to be sure the Ethernet ports are staying up at full speed (1000 Mb/s).
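
A rough way to do the traceroute step from the OpenWrt shell, assuming a `traceroute` that accepts `-n` (numeric output) is available there. The helper name `hop_address` is made up for this example; it only pulls the address of a given hop out of the traceroute output. Run from the router itself, hop 1 should be the device on the other end of the WAN cable; run from a PC behind the router, hop 1 is the router and hop 2 is the upstream device.

```shell
#!/bin/sh
# Sketch only: extract one hop's address from `traceroute -n` output.
# hop_address is a hypothetical helper name, not an OpenWrt command.

# Print the address reported for hop number $1 in traceroute output on stdin.
# Prints nothing if that hop did not answer (shown as "*").
hop_address() {
    awk -v hop="$1" '$1 == hop && $2 != "*" { print $2; exit }'
}

# Example (not run automatically), using 8.8.8.8 as the destination:
#   gw=$(traceroute -n 8.8.8.8 | hop_address 1)
#   ping -c 20 "$gw"
```

If the first hop prints nothing, that device is probably not answering traceroute probes, and the next hop that does answer is the closest thing you can ping.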


Thank you both for the reply.
One additional note: the issue also happens when I remove the OpenWRT router from the network chain and let the equipment at the other end of my internet source assign IPs directly to my local devices. Therefore I think I can rule the router out as a source of the issue.
The other end of my internet source cable is very likely some industry/business-grade fiber terminal somewhere in the basement of the apartment complex.

A couple of questions:

  • I know how to ping from a Windows command prompt, but not how to automate it and log the results over a certain time period. What's the best way to do this? Is there a package to do this from OpenWRT?

  • What does an interruption look like in a log?

  • What would it look like in a log if the ports do not stay up at full speed?

I guess these are rather simple tasks for people familiar with networking, so apologies for asking rather basic stuff. Really appreciate the help!

If the landlord's system is just NATing all the apartments into a single fiber line, without SQM, there's going to be poor performance.

If the wan is connected to device eth0, use logread | grep eth0 to see any log entries related to the eth0 port. There really shouldn't be any after the initial up.
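
To check the current link state rather than the log history, something like the sketch below should work on most drivers. It reads the standard Linux sysfs files `carrier` (1 means link up) and `speed` (in Mb/s) for an interface. The function names are invented for this example, and the second helper assumes a dropped link is logged by netifd as "Network device '<name>' link is down", by analogy with the "link is up" message; I have not confirmed that exact wording on your build.

```shell
#!/bin/sh
# Sketch only: report link state for an interface and count link drops.
# Function names are hypothetical; the "link is down" text is an assumption.

# Print carrier and negotiated speed for interface $1.
# $2 optionally overrides the sysfs root (only useful for testing).
link_status() {
    iface=$1
    root=${2:-/sys/class/net}
    carrier=$(cat "$root/$iface/carrier" 2>/dev/null)
    speed=$(cat "$root/$iface/speed" 2>/dev/null)
    echo "$iface carrier=${carrier:-unknown} speed=${speed:-unknown}"
}

# Count "link is down" events for interface $1 in log text on stdin.
count_link_down() {
    grep -c "Network device '$1' link is down"
}

# Examples (not run automatically):
#   link_status eth0
#   logread | count_link_down eth0
```

A healthy gigabit port should report carrier=1 and speed=1000, and the count of link drops since boot should be 0. A speed of 100 or a non-zero count would point at the cable or at the port on either end.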

@mk24
Thanks.
I don't have the details of the hardware of the internet connection, but it is very likely something like those lines used in larger office buildings, so it's (supposedly) not just a simple line from the landlord.

I ran the command you mentioned (and learned how to open a dropbear instance in the process):

root@OpenWrt:~# logread | grep eth1
Thu Jan 12 23:06:34 2023 kern.info kernel: [    9.647954] 8021q: adding VLAN 0 to HW filter on device eth1
Thu Jan 12 23:06:34 2023 user.notice SQM: Starting SQM script: piece_of_cake.qos on eth1, in: 450000 Kbps, out: 450000 Kbps
Thu Jan 12 23:06:34 2023 daemon.notice procd: /etc/rc.d/S50sqm: SQM: Starting SQM script: piece_of_cake.qos on eth1, in: 450000 Kbps, out: 450000 Kbps
Thu Jan 12 23:06:35 2023 user.notice SQM: piece_of_cake.qos was started on eth1 successfully
Thu Jan 12 23:06:35 2023 daemon.notice procd: /etc/rc.d/S50sqm: SQM: piece_of_cake.qos was started on eth1 successfully
Thu Jan 12 23:06:36 2023 daemon.notice netifd: Network device 'eth1' link is up
Thu Jan 12 23:06:36 2023 kern.info kernel: [   11.846589] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
Thu Jan 12 23:06:38 2023 user.notice SQM: Stopping SQM on eth1
Thu Jan 12 23:06:38 2023 user.notice SQM: Starting SQM script: piece_of_cake.qos on eth1, in: 450000 Kbps, out: 450000 Kbps
Thu Jan 12 23:06:38 2023 user.notice SQM: piece_of_cake.qos was started on eth1 successfully
Thu Jan 12 23:06:38 2023 user.notice firewall: Reloading firewall due to ifup of WAN (eth1)

I guess there is no issue here?

Sorry to bump this thread, would appreciate any help with these:

Thanks!

IIRC, a persistent ping on Windows just needs the -t argument (I don't use Windows, so I may be wrong here)...

You can just wait until there is an issue and then open multiple command windows with pings as described. If the problem is predictable (i.e. periodic or frequent enough to see a few times a day), simply open the windows and keep the pings running in the background.
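
As for what an interruption looks like: in Windows ping output, lost replies show up as "Request timed out." lines. In Linux or OpenWrt ping output, each reply carries a sequence number (`seq=` or `icmp_seq=`), and a lost packet shows up as a jump in that number. Here is a small sketch that scans saved ping output for such jumps; the function name `find_ping_gaps` and the log path in the example are made up for illustration.

```shell
#!/bin/sh
# Sketch only: report jumps in the ping sequence number.
# find_ping_gaps is a hypothetical helper name.

# Read ping output on stdin and print one line for every gap in the
# sequence numbers, together with the first reply seen after the gap.
find_ping_gaps() {
    awk '
        match($0, /seq=[0-9]+/) {
            seq = substr($0, RSTART + 4, RLENGTH - 4) + 0
            if (seen && seq > prev + 1)
                printf "lost %d packet(s) before: %s\n", seq - prev - 1, $0
            prev = seq
            seen = 1
        }
    '
}

# Example (not run automatically), assuming ping output was saved to a file:
#   find_ping_gaps < /tmp/ping-8.8.8.8.log
```

If the saved lines carry timestamps, the report also tells you when each gap ended. Gaps that appear at the same moment for the internet target but not for a LAN target would point at the uplink rather than at your own network.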