[Solved] Packet loss using OpenWrt on a fiber connection

No, you're right, you only need it if you don't get good latency under load. I don't know what the pfSense default is, but it may well be managing queues by default. Anyway, my point was that the speed of your WAN connection alone is not enough information to conclude that QoS is unnecessary. Telling the OP to simply turn off QoS because they have 500 Mbit fiber isn't sound advice unless the OP tests latency under realistic conditions and finds that it's fine without QoS. In my experience it's not hard to make a 500 Mbps fiber connection stall a VoIP call if you do anything remotely demanding, such as uploading a large file to a cloud provider, downloading an entire movie or a Debian ISO, or running apt-get upgrade against one of the CDN-based Debian mirrors. The key is whether you connect to someone with even more bandwidth than you have, like a cloud storage service connected in the datacenter at 10G/40G/100G, which will clog a 500 Mbit to 1 Gbps link without breaking a sweat.
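One way to run the latency-under-load test described above is to compare the average ping round-trip time at idle with the average measured while a large download is running. A minimal sketch, assuming a POSIX shell with ping and wget available; the target host, the download URL and the helper names avg_rtt and latency_under_load are hypothetical placeholders, not part of any existing tool:

```shell
#!/bin/sh
# avg_rtt: read ping output on stdin and print the average RTT in ms.
# Handles both the BusyBox summary ("round-trip min/avg/max = a/b/c ms")
# and the iputils summary ("rtt min/avg/max/mdev = a/b/c/d ms").
avg_rtt() {
    sed -n 's#.*= [0-9.]*/\([0-9.]*\)/.*#\1#p'
}

# Hypothetical placeholders: choose a nearby host and a large file on a fast CDN.
TARGET=${TARGET:-1.1.1.1}
BIG_FILE_URL=${BIG_FILE_URL:-https://example.com/large.iso}

latency_under_load() {
    idle=$(ping -c 20 "$TARGET" | avg_rtt)
    # Start a bulk download in the background and ping while it runs.
    wget -q -O /dev/null "$BIG_FILE_URL" &
    dl_pid=$!
    loaded=$(ping -c 20 "$TARGET" | avg_rtt)
    kill "$dl_pid" 2>/dev/null   # stop the download if it is still running
    echo "idle avg RTT: ${idle} ms, loaded avg RTT: ${loaded} ms"
}
```

Run latency_under_load from a LAN client; if the loaded average is many times the idle average, the link is bufferbloated and queue management is worth trying regardless of the nominal line speed.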

My VoIP adapter hangs off a switch that is connected to the router through a bonded LAG, and a fileserver with bonded NICs attached to that switch can saturate the 2 Gbit/s bond. So now not only do I need QoS at the WAN to prioritize VoIP packets, I also need to tag VoIP and video streaming packets with DSCP on the LAN and do QoS in the switches; otherwise someone who hammers the file server by writing out a couple of gigabytes of data will destroy latency for everyone on the LAN for 5 to 10 seconds. That's even more true if someone is simultaneously streaming video or downloading something from the internet.
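For LAN-side marking like this, it helps to know the standard DSCP code points and how they show up as the old IPv4 TOS byte that many switches and capture tools display. A small sketch, assuming the class choices suggested by RFC 4594 (EF for telephony, AF31 for multimedia streaming, CS1 for low-priority bulk data); the helper names dscp_value and tos_byte are hypothetical:

```shell
#!/bin/sh
# dscp_value: map a DSCP class name to its 6-bit code point.
# AFxy code points follow 8*x + 2*y; CSn code points are 8*n.
dscp_value() {
    case "$1" in
        EF)   echo 46 ;;  # Expedited Forwarding: voice media
        AF41) echo 34 ;;  # Assured Forwarding class 4, low drop precedence
        AF31) echo 26 ;;  # Assured Forwarding class 3, low drop precedence: streaming
        CS1)  echo 8 ;;   # Class Selector 1: low-priority bulk data
        CS0)  echo 0 ;;   # default / best effort
        *)    echo "unknown DSCP class: $1" >&2; return 1 ;;
    esac
}

# tos_byte: DSCP occupies the upper six bits of the former TOS byte,
# so the byte value is the code point shifted left by two bits.
tos_byte() {
    dscp=$(dscp_value "$1") || return 1
    echo $(( dscp << 2 ))
}
```

For example, tos_byte EF prints 184 (0xB8). On a router that uses iptables, a rule such as iptables -t mangle -A POSTROUTING -p udp --dport 5060 -j DSCP --set-dscp-class EF would apply such a marking; the port match there is only a placeholder, since SIP signaling uses 5060 while RTP media usually runs on other ports.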

Shockingly, problems don't go away when your connection gets faster, in part because everyone else's connections are also getting faster :slight_smile:

But none of that is relevant to helping the OP fix the packet loss at idle. Turning off QoS is unlikely to solve that problem. Packet loss at idle is usually caused by a hardware buffer filling up before the interrupt handler services it, by cable problems, or by similar low-level issues. @Conno, we still don't have an answer on whether the packet loss occurs when the router itself is pinging, or only when a device on the LAN pings through the router. If it only happens through the router, it could be an issue with your LAN switch hardware or driver.
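To answer that question, the same ping test can be run once on the router and once on a LAN client against the same target, and the loss figures compared. A minimal sketch, assuming a POSIX shell on both machines; loss_pct and check_loss are hypothetical helpers and the target address is a placeholder:

```shell
#!/bin/sh
# loss_pct: read ping output on stdin and print the packet-loss percentage.
# Matches both BusyBox ("..., 10% packet loss") and iputils
# ("..., 10% packet loss, time 99123ms") summary lines.
loss_pct() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Run check_loss once on the router and once on a LAN machine.
# Loss only on the LAN run points at the LAN switch or its driver;
# loss on both runs points at the WAN side of the router.
TARGET=${TARGET:-1.1.1.1}
check_loss() {
    echo "$(uname -n): $(ping -c 100 "$TARGET" | loss_pct)% loss to $TARGET"
}
```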


Nice to see this discussion cleared up some things about how TCP works. I was also under the impression it wouldn't matter as long as you don't saturate your maximum bandwidth, but in fact it does. That was my reason for giving SQM on OpenWrt a try, and it seems to do a very good job of reducing bufferbloat. Now let's see if I can find a way to eliminate the packet loss.
I'll do some testing off hours to see if the packet loss also appears when pinging from the LuCI web interface. Currently I'm using the Asus RT-AC68U, which has near-perfect bufferbloat scores and no packet loss, so the LAN side of my network does not seem to be the problem. I'll post results asap. Thanks everybody!
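For anyone wanting to try the same thing, SQM on OpenWrt comes from the sqm-scripts package (luci-app-sqm adds the LuCI page). The following is only a sketch, assuming the WAN runs over PPPoE so the device to shape is pppoe-wan; the section name wan_sqm and the rates are hypothetical placeholders, to be set to roughly 90% of the measured line speed in kbit/s:

```shell
#!/bin/sh
# Sketch: configure a cake-based SQM instance on the PPPoE WAN device.
# Assumes sqm-scripts is installed: opkg update && opkg install sqm-scripts
# Section name "wan_sqm" and the rates below are hypothetical placeholders.
uci set sqm.wan_sqm=queue
uci set sqm.wan_sqm.enabled='1'
uci set sqm.wan_sqm.interface='pppoe-wan'
uci set sqm.wan_sqm.download='450000'   # kbit/s, about 90% of 500 Mbit/s
uci set sqm.wan_sqm.upload='450000'     # kbit/s
uci set sqm.wan_sqm.qdisc='cake'
uci set sqm.wan_sqm.script='piece_of_cake.qos'
uci commit sqm
/etc/init.d/sqm restart
```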

As long as you don't saturate your bandwidth, it doesn't (or shouldn't) matter unless something is very broken.

@dlakelan
I don't run pfSense, so I'm not sure what you're trying to prove. Also, your arguments are full of assumptions that don't apply in the real world. If you can't carry a VoIP call on your LAN without QoS, you have something seriously broken. I'd buy that if you used a 10 Mbit switch (which doesn't exist, but hopefully you get the idea), but otherwise something is seriously broken. I honestly think you haven't evaluated most of the scenarios you describe and are just throwing out assumptions. QoS has its place, no doubt, but not in the way you describe it.

My experience and @Conno's experience differ from your assertion. That's all. Let's stick to the topic, which is helping @Conno figure out why he has ~10% packet loss even at idle. Loss at idle can't be caused by SQM or PPPoE running out of CPU cycles, and his test of turning off SQM has already shown that it doesn't affect the packet loss.

I think I've found the problem. The Linksys WRT1900ACS would not see my Genexis media converter (Genexis OCG 1012 EU 1) on the WAN side. I've found out that this is a common problem with the WRT1900ACS.

To work around this, I put a managed switch (Netgear GS108E) between the Genexis media converter and the Linksys WRT1900ACS. The switch port connected to the modem is set to PVID 6 (tagged), and the port connected to the WRT1900ACS is set to PVID 6 (untagged). Now the WRT1900ACS does recognise the Genexis on the WAN port, but 10% packet loss appears in this setup.

When I replace the WRT1900ACS with my Asus RT-AC68U (the managed switch stays between the Genexis and the Asus), I also get 10% packet loss. As soon as I remove the switch and connect the Asus directly to the Genexis, the packet loss is gone.

I think it has to do with the fact that I have to connect to my fiber ISP (KPN in the Netherlands) over a PPPoE tunnel. This PPPoE tunnel must be the first layer before the VLAN tag is removed. Could this be the problem?
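For reference, on a PPPoE-over-VLAN link the 802.1Q tag is the outer layer: each Ethernet frame carries the VLAN tag (VLAN 6 here), and the PPPoE session header sits inside the tagged frame. The hypothetical helper below decodes the EtherTypes at the start of a frame given as hex, purely to make that ordering visible; the sample frame in the usage line is made up:

```shell
#!/bin/sh
# describe_frame: print the layers found in the first bytes of an Ethernet
# frame given as hex (spaces/colons allowed, at least 36 hex digits).
# Layout: dst MAC (6 bytes), src MAC (6 bytes), EtherType/TPID (2 bytes),
# and for an 802.1Q frame: TCI (2 bytes) followed by the inner EtherType.
describe_frame() {
    hex=$(printf '%s' "$1" | tr -d ' :' | tr 'A-F' 'a-f')
    ethertype=$(printf '%s' "$hex" | cut -c25-28)
    if [ "$ethertype" = "8100" ]; then
        tci=$(printf '%s' "$hex" | cut -c29-32)
        # The VLAN ID is the low 12 bits of the tag control information.
        echo "802.1Q VLAN $(( 0x$tci & 0x0fff ))"
        ethertype=$(printf '%s' "$hex" | cut -c33-36)
    fi
    case "$ethertype" in
        8863) echo "PPPoE discovery" ;;
        8864) echo "PPPoE session" ;;
        0800) echo "IPv4" ;;
        *)    echo "ethertype 0x$ethertype" ;;
    esac
}
```

Calling describe_frame "ffffffffffff 001122334455 8100 0006 8864 1100" prints "802.1Q VLAN 6" followed by "PPPoE session", i.e. the VLAN tag has to be handled first and the PPPoE session is carried inside it.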

Please forgive my limited knowledge of VLAN and PPPoE technology.

On the WRT, run the following to get the device name of the underlying interface ($INT):
ifstatus wan | grep -e '"device":'

Then run:
opkg update ; opkg install ethtool

ethtool $INT # replace $INT with the real device name

and potentially
ethtool --show-eee $INT
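The steps above can be wrapped so the device name does not have to be copied by hand. A minimal sketch, assuming ethtool is installed and that the first "device" key in the ifstatus JSON is the underlying interface (as opposed to "l3_device", which for PPPoE is pppoe-wan); wan_device and show_wan_link are hypothetical helper names. On OpenWrt, ifstatus wan | jsonfilter -e '@.device' should give the same value.

```shell
#!/bin/sh
# wan_device: read `ifstatus wan` JSON on stdin and print the first "device"
# value (e.g. eth1.6). "l3_device" is not matched because the pattern
# requires a double quote directly before "device".
wan_device() {
    sed -n 's/.*"device": *"\([^"]*\)".*/\1/p' | head -n 1
}

show_wan_link() {
    INT=$(ifstatus wan | wan_device)
    [ -n "$INT" ] || { echo "could not determine WAN device" >&2; return 1; }
    ethtool "$INT"               # link mode, speed, duplex, autonegotiation
    ethtool --show-eee "$INT"    # Energy-Efficient Ethernet state
}
```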

I would not be surprised if the issue between the WRT and the Genexis is simply some sort of auto-configuration gone awry.

This is what I get:

Settings for eth1.6:
        Supported ports: [ MII ]
        Supported link modes:   1000baseX/Full
        Supported pause frame use: No
        Supports auto-negotiation: No
        Supported FEC modes: Not reported
        Advertised link modes:  1000baseX/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Advertised FEC modes: Not reported
        Speed: 1000Mb/s
        Duplex: Full
        Port: MII
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Link detected: yes

But LuCI says the WAN port is not connected, and I don't get an IP address from the ISP.

This suggests that you should not be using eth1.6 as your WAN, but eth1 itself, since your switch port is untagged for VLAN 6.
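If you want to try that suggestion, the change can be made with uci. This is a sketch only, assuming OpenWrt 19.07-era syntax, where the WAN interface's underlying device is set with the ifname option (newer releases use device instead), and a logical interface named wan; since the WRT1900ACS routes its WAN port through the internal switch, the right device name may also depend on the switch VLAN setup.

```shell
#!/bin/sh
# Sketch: run PPPoE on eth1 directly instead of on the VLAN device eth1.6.
# Assumes 19.07-style config (option ifname); newer releases use "device".
uci show network.wan            # inspect the current settings first
uci set network.wan.ifname='eth1'
uci commit network
/etc/init.d/network restart
```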

EDIT: also check the EEE (Energy-Efficient Ethernet) settings as @moeller0 suggested; perhaps the link is bouncing up and down electrically because it tries to save power too aggressively.

Maybe you could post your /etc/config/network (after redacting any PPPoE credentials).
Also, it seems the WRT1900ACS routes its WAN port through its internal switch (see: https://openwrt.org/toh/linksys/wrt_ac_series), so rather than querying eth1.6, we need to figure out how to get diagnostics out of the switch...
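A quick way to share /etc/config/network without leaking the PPPoE login is to blank out the credential values first. A minimal sketch, assuming the credentials are stored in option username and option password lines (the option names used by the pppoe protocol); redact_uci is a hypothetical helper:

```shell
#!/bin/sh
# redact_uci: read a UCI config on stdin and replace the values of
# "option username" and "option password" with 'REDACTED'.
# Uses only POSIX basic regular expressions, so BusyBox sed works too.
redact_uci() {
    sed -e "s/^\([[:space:]]*option[[:space:]]\{1,\}username[[:space:]]\{1,\}\).*/\1'REDACTED'/" \
        -e "s/^\([[:space:]]*option[[:space:]]\{1,\}password[[:space:]]\{1,\}\).*/\1'REDACTED'/"
}
```

Usage: redact_uci < /etc/config/network, then paste the output.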

I've found the problem: I had to turn off IGMP snooping on the switch. Everything is working as expected now. Thank you all!

So this clearly seems like a bug, though, because IGMP snooping shouldn't cause this. The question is where to file it.
