So to me this looks as though the packet loss occurs right "after" the router, which is consistent with my previous observations. The question is: why? Again, if I connect directly to the cable modem, everything seems to be fine (tested for over 20 minutes at a time..).
Correction: there seems to be packet loss at the first step already. I overlooked that since latency seemed normal.
As for SQM: I currently do not have SQM installed/enabled. I wanted as little interference as possible. Right now everything you see is done with today's snapshot and almost no customization.
Ooops, sorry, yes, mtr not mrt... thanks for figuring that out and fixing it.
I agree. Also the Wrst delays look crazy: 3065 ms, so about 3 seconds, to the first upstream host (probably somewhere at/above the CMTS level). Was the testing done over WiFi or over a wired connection?
With an interval of 100 ms some packet loss is expected, as most intermediate devices are configured to deprioritize and rate-limit ICMP processing. So it seems the -i 0.1 was too aggressive; -i 0.5 might be a better way to probe things.
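To make reading such a report less manual, here is a minimal sketch that scans the text output of mtr's report mode (-r) and prints only the hops whose loss or worst-case latency crosses a threshold. The function name mtr_flag_hops and both thresholds are made up for illustration, and it assumes the usual report layout, in which Loss% is the seventh column from the right and Wrst the second from the right.

```sh
#!/bin/sh
# Hypothetical helper: reads `mtr -r` text output on stdin and prints hops
# whose loss (percent) or worst-case latency (ms) exceeds the given limits.
mtr_flag_hops() {
    # $1 = loss threshold in percent, $2 = Wrst threshold in milliseconds
    awk -v loss="$1" -v wrst="$2" '
        /^ *[0-9]+\./ {
            # Count columns from the right, since the host part may contain
            # spaces (for example with -b, which prints name and IP).
            l = $(NF-6) + 0      # Loss% column, "45.0%" becomes 45
            w = $(NF-1) + 0      # Wrst column
            if (l > loss || w > wrst)
                printf "hop %d loss=%.1f%% wrst=%.1fms\n", $1 + 0, l, w
        }'
}

# Possible usage (target taken from the thread, thresholds arbitrary):
#   mtr -r -i 0.5 -c 300 192.168.100.1 | mtr_flag_hops 10 1000
```

Loss that shows up only at an intermediate hop and not at the hops behind it is usually just that hop rate-limiting its ICMP replies, so the flagged lines still need to be read in context.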
Good, that removes one possible culprit (and also bad, because the only thing I have some experience in is SQM).
I wonder whether you have access to the modem itself (192.168.100.1)? In that case running an mtr against that IP would also be interesting.
Mmmh, since this is a cable link maybe the issue is IPv4 related, could you also try: sudo mtr -ezb6 -i 1.0 -c 300 2001:4860:4860::8888
Depending on how your IPv4 is implemented this might make a difference...
I reran the tests. Packet loss goes down to about 45% for the router, 20% to 30% for the next hop and 0% to 10% for the remaining hops. Latency on the other hand does not improve and is still around 3000 ms for the first upstream host in the worst case.
The modem exposes a web interface with some status info which I could share. There are no configuration options available. I ran mtr against it though and the results are basically the same as above (first hop after the gateway). In a quick test there was little to no packet loss for the modem itself, though very high latencies in the worst case. The router on the other hand showed high packet loss at very short intervals but consistently low latencies.
My connection is IPv4 only so this doesn't work. I have not changed anything about it in the last 7 or 8 years or so. Also the cable modem is from the stone age but used to work very well until recently. Actually it still seems to work, as long as there is no OpenWrt in between. I didn't call my ISP yet. They usually are like "Have you tried connecting your computer directly to the modem?", which in my case actually solves the problem...
Appendix (cable modem status info):
Modem Serial Number
Cable Modem MAC Address
Receive Power Level
Transmit Power Level
Cable Modem Status
Signal to Noise Ratio
Cable Modem Status
Cable Modem IP Address
Sun Sep 06 15:33:52 2020
Time Since Last Reset
9 days 07h:29m:24s
Cable Modem Certificate
The data shown in the table below provides information about the customer premise equipment (CPE) connected to your cable modem.
Connected to    MAC Address          IP Address
Ethernet        00:00:00:00:00:00    --.--.--.--
Ethernet        C4:6E:xF:73:3F:xE    --.--.--.--
Ethernet        C4:6E:xF:73:3F:xF    220.127.116.11xx
Ethernet        C8:5B:76:06:4C:xx    95.91.59.xx
Try the same test with a laptop connected directly to the modem. How is it, the same?
I usually run mtr together with tcpdump and count packets to see the pattern,
e.g. start tcpdump:
tcpdump -i ethX -s0 -n -w /tmp/mtr.pcap
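One way to do the counting afterwards, as a rough sketch: read the capture back with tcpdump -n -r and tally the ICMP message types. The helper name count_icmp is hypothetical, and the patterns assume tcpdump's usual one-line text format for ICMP (echo request, echo reply, time exceeded in-transit). If mtr is probing over UDP or TCP instead of ICMP, the patterns would need to change.

```sh
#!/bin/sh
# Hypothetical helper: reads tcpdump text output on stdin and counts the ICMP
# message types that an ICMP-based mtr run produces.
count_icmp() {
    awk '
        /ICMP echo request/  { req++ }   # probes sent
        /ICMP echo reply/    { rep++ }   # answers from the final target
        /ICMP time exceeded/ { ttl++ }   # answers from intermediate hops
        END {
            printf "requests=%d replies=%d time_exceeded=%d\n", req, rep, ttl
        }'
}

# Possible usage, after stopping the capture started above:
#   tcpdump -n -r /tmp/mtr.pcap icmp | count_icmp
```

If the number of requests leaving the interface is clearly higher than replies plus time-exceeded messages coming back, the loss is happening beyond the capturing machine rather than inside it.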
The problem could also be due to a flaky Ethernet port... maybe.
Try connecting the router's WAN port to your laptop and set static IPs on both sides, just for testing.
It is substantially better. Running mtr against the modem gives little to no packet loss, depending on the interval. Latency is much more stable, even though there are some rare spikes up to about 500 ms.
When using the router, I frequently get a straight "no route to host" for 2 or 3 seconds when pinging the modem or running mtr against it. When directly connecting to the modem, I did not observe such outages yet. It's quite bizarre.
That seems possible, but then again the problems are (almost) gone or at least much, much better when directly connecting to the modem. Could be a faulty uplink port on the router, but why would that break? It sits in its corner and just does its thing. Also, shaking it a bit during tests does nothing. For now I'm inclined to blame either the modem or OpenWrt.
Okay, that IMHO rules out the ISP/cable-segment as most likely cause of the issue, something you knew all along.
Yes, it seems I was barking up the wrong tree, sorry.
Next question, do you see anything that might be related to ethernet drop outs in either logread or dmesg output (make sure to redact sensitive information if you post excerpts of that data here in the forum, like passwords and/or your IP address)?
That sounds like there might be multiple issues at work here, but let's try to figure out the 3 second stalls first
Yep, I'm gravitating towards some weird modem issue though. Something which in my case is still the ISP's business in some way.
Looks completely normal to me. logread lists my ssh-login as the last entry, while ping/mtr gives the usual weirdness at the same time. dmesg is also completely silent 35 seconds after a router reboot. Nothing to report there.
Mmmh, I wonder what $DEVICE you get from ifstatus wan
and what
ethtool $DEVICE
ethtool --show-pause $DEVICE
ethtool --show-eee $DEVICE
return? (Replace $DEVICE with whatever you found as the device from the ifstatus call.) In case ethtool is not already installed on the router, run opkg update ; opkg install ethtool first.
I vaguely remember running into EEE (energy efficient ethernet) issues some years ago, where an EEE-enabled router did not harmonize well with older pre-EEE gear until I disabled EEE for the router's port connected to the old device, but my memory is a bit dim and this could lead nowhere fast. Then again, your modem being old might make this at least worth a try.
Something like ethtool --set-eee $DEVICE eee off should help to disable this, assuming it is enabled in the first place...
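The three queries above can be bundled into one small function, sketched here under a few assumptions: the name probe_link is made up, and the idea of also querying the parent of a VLAN sub-interface (eth0 for eth0.2) is an addition for convenience, since the link-level settings live on the physical device.

```sh
#!/bin/sh
# Hypothetical helper: run the plain, pause and EEE queries of ethtool against
# a device and, if it is a VLAN sub-interface such as eth0.2, against its
# parent device as well.
probe_link() {
    dev="$1"
    parent="${dev%%.*}"              # eth0.2 becomes eth0, eth0 stays eth0
    devices="$dev"
    if [ "$parent" != "$dev" ]; then
        devices="$devices $parent"
    fi
    for d in $devices; do
        for opt in "" --show-pause --show-eee; do
            echo "== ethtool ${opt:+$opt }$d =="
            # $opt is intentionally unquoted so the empty case adds no argument
            ethtool $opt "$d" 2>&1
        done
    done
}

# Possible usage on the router, assuming jsonfilter is available there:
#   probe_link "$(ifstatus wan | jsonfilter -e '@.device')"
```

Drivers that do not implement a query simply report that it is not supported, which is itself useful information.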
The device is "eth0.2" (used to be eth0 in earlier releases, I believe? But that's the OpenWrt default now, some virtual device). Both eee and pause return "not supported", for eth0.2 and eth0 respectively. Also, trying to set anything doesn't work, with the same message.
Nice idea though, I would never have thought about something like this.
Mmmh, too bad, but that brings me to a new idea potentially worth testing: if you have an Ethernet switch you could borrow, try to place it between modem and router so that mismatches between their Ethernet devices might hopefully be worked around. This is primarily intended as a testing vehicle and not necessarily as a permanent solution (but it probably is not going to help anyway).
I put an old switch (10/100 Mbit, probably older than the modem) between modem and router. Initially things seemed more stable, but that was probably coincidence. Still dropouts while pinging the modem, still the occasional very high latency with mtr.
I removed the switch again and instead replaced the cable between router and modem. No effect, for better or worse, either.
What is interesting though is that the maximum latency I observe with mtr is always around 3000ms, sometimes 2980, sometimes 3060, but never anything much more or less. Also I do not see a fixed pattern. Sometimes everything seems smooth for 3, 4, 5 minutes and then again there are several seconds of "outage" (including "host unreachable") at a time.
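Since the outages come at irregular times, it can help to let a long ping run and summarize it afterwards instead of watching it live. The following is only a sketch: the name ping_gaps is hypothetical, and it assumes reply lines of the form "... bytes from ...: icmp_seq=N ... time=T ms" (iputils) or "seq=N" (busybox). Probes lost before the very first reply are not reported.

```sh
#!/bin/sh
# Hypothetical helper: reads ping output on stdin, reports runs of missing
# sequence numbers and the worst round-trip time seen.
ping_gaps() {
    awk '
        # Only real replies count; "Destination Host Unreachable" lines
        # also carry a sequence number but contain no "bytes from".
        /bytes from/ && match($0, /seq=[0-9]+/) {
            seq = substr($0, RSTART + 4, RLENGTH - 4) + 0
            if (seen && seq > last + 1)
                printf "lost seq %d-%d (%d probes)\n", last + 1, seq - 1, seq - last - 1
            if (!seen || seq > last)
                last = seq
            seen = 1
            if (match($0, /time=[0-9.]+/)) {
                t = substr($0, RSTART + 5, RLENGTH - 5) + 0
                if (t > worst) worst = t
            }
        }
        END { printf "worst rtt: %.1f ms\n", worst }'
}

# Possible usage (300 probes at the default 1 second interval):
#   ping -c 300 192.168.100.1 | ping_gaps
```

Comparing the length and spacing of the reported gaps across several runs should show whether the roughly 3 second stalls follow any pattern at all.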
not sure if you got my hint correctly, if I understand correctly, you have
modem - router(owrt) - lan, right?
ok, forget modem for a moment, put together this network for testing
laptop(eth) - router(eth) with the same uplink cable
and do testing this way, do you get the same pkt drops? same rate etc?
I connected the laptop directly to the router, using the same cable (not the same port though...) and everything looked normal.
I reran the test you suggested above. I connected the laptop directly to the cable modem. This time I let mtr run for 5 minutes (300 probes against the modem itself, 1 second interval) and voila: unstable connection. Again some packet drops, again around 3000 ms worst-case latency.
I am sorry for the confusion above, previous (shorter) tests didn't reveal the issue. Also I ran some other tests earlier today (upstream test to Twitch, looking at dropped frames due to network issues) and while they gave me issues within 2 minutes with modem <=> router <=> laptop, they ran 10-15 minutes without a hiccup with modem <=> laptop. I have no idea what's going on.
My current thinking is that my router (and OpenWrt!) is probably not the (main?!) problem. I will call my ISP tomorrow and see what they suggest. If you have any other ideas, I am of course curious. Also I will keep this thread updated and correct my posts above as soon as I am definitely sure about what is happening here. Everything being smooth with old LEDE builds seems to be a temporal coincidence at this point.
So I think it looks like there is something already between router and modem that performs sub-optimally.
That something could be one or more of:
The router's OS
The router's wan port
The network cable between router and modem
The modem's LAN port.
The modem's OS
Potential issues between modem and CMTS that "knock" out the modem for longer periods of time
Might make sense to re-test these one by one, no? If you are dead certain about any of these just skip that test...
The router OS
Easy to test, by either switching back to an older release of OpenWrt or even the stock firmware by the manufacturer, or by temporarily testing a different router.
Your laptop test seems to be equivalent to this.
The router's wan port
Reconfigure the router to use one of its LAN ports for wan, requires a bit of VLAN trickery but nothing insurmountable. Alternatively use a known good replacement router. Also follow @rpmoomin's idea and replace the MODEM with the laptop and repeat your test, router-WAN -> ethernet cable -> laptop (you probably need to configure both router-wan and laptop for static IP addresses outside of your internal network address range).
The network cable between router and modem
Just use a known good ethernet cable instead, might make sense to test both cross-over and straight through cables just in case there is an issue with auto-MDI-X.
The modem's LAN port & The modem's OS
The easiest test for these would be a replacement modem, which given the switch to DOCSIS 3.1 might be a good idea independent of whether your current modem is broken or not.
Thanks again for all your suggestions. I called my ISP just now and to my great surprise they immediately agreed to send a replacement for the old modem. I'd assume that the new device is reasonably up to date and also DOCSIS 3.1 compatible, we will see. I will report back, hotline guy estimated 1-2 days.
In the meantime, let me reply to your last points.
Regarding the router OS: That was my initial guess, but having tested the last 18.06 release, every single stable 19.x release and a recent snapshot, and having tested the modem without any involvement of the router, I am now doubtful about OpenWrt being the issue. Of course if the issues persist with the new device, I will resort to stock firmware for final clarification.
Regarding the router's WAN port: Again, this seems unlikely to me, but I will come back to this if everything else fails. I was already rattling and shaking the port (or rather the cable) slightly and you'd expect to see at least some effect if there are mechanical issues or faulty electronics. Also the issues are very similar or indeed the same when connecting my laptop directly to the modem. This is something I completely missed in the beginning due to good (or rather bad?) luck and which initially led me to believe that OpenWrt/the router is the problem.
Regarding the network cable: This I already did yesterday, with no apparent effect.
Regarding the modem's LAN port and OS: This is what we are left with and which I hope to be able to confirm/rule out within the next 2 or 3 days. Maybe it is even conceivable that the electronics degraded somewhat in the past 10 or 15 years? This is speculation of course, but we will see how the new device performs.
Once again, thank you so much for your support! As I said above, I will report back soon and hopefully clear OpenWrt's name in this case.
I am back with an update. It took me some time because the replacement device only arrived here yesterday evening.
To make it short: everything seems to work as expected. OpenWrt seems to be innocent! In a few hours of testing, no packet loss has been observed. The only thing I changed was the cable modem (actually a cable router in bridge mode now). Cables, OpenWrt-router and everything else stayed the same. That seems to pretty conclusively point to the old modem as the culprit. The new device apparently is DOCSIS 3.1 compatible but I am not quite sure which mode is actually used.
For now I am quite happy. I might be back with a few questions concerning optimal SQM-settings in my scenario, especially link layer settings but that might be stuff for another thread maybe? In case moeller0 has any advice straight away, here is the setup:
Downlink 32000 kbps, uplink 2000 kbps, cable modem. The advertised and measured values pretty much match up. For now 31500, 1900, cake, piece of cake seems to work well, but if there is room for improvement, I'm always up for it.
So finally thank you all for your suggestions. Lots of these might be helpful in general to diagnose local network issues.
Make sure to replace YOURWANINTERFACEHERE with the name of your wan interface (the output of ifstatus wan should tell you).
This will account for the correct per-packet overhead (18 bytes) and minimal packet size (64 bytes) for a typical cable link.
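As a small worked example of what those two numbers do, assuming the shaper first adds the overhead to each packet and then rounds anything still smaller than the minimum up to it (which is how cake's overhead and mpu parameters are documented to interact); the function name accounted_size is made up:

```sh
#!/bin/sh
# Hypothetical helper: size in bytes that the shaper charges for one packet
# on a link with 18 bytes per-packet overhead and a 64 byte minimum size.
accounted_size() {
    size=$(( $1 + 18 ))          # add the per-packet overhead
    if [ "$size" -lt 64 ]; then
        size=64                  # enforce the minimal packet size
    fi
    echo "$size"
}

# A 40 byte TCP ACK is charged as 64 bytes, a 1500 byte packet as 1518:
#   accounted_size 40
#   accounted_size 1500
```

The relative correction is largest for small packets, which is why getting it wrong mostly shows up under ACK-heavy or VoIP-like traffic.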
It will also use ack-filtering on egress, which seems to be the right thing to do on bursty links like DOCSIS.
The "nat dual-xxxhost" stanzas, each for its direction, configure per-internal-IP fairness, in which all concurrently active machines get an equal share of the available bandwidth (that is, each machine gets all the bandwidth if nobody else sends or receives anything). This often works very well for home links, as it confines the bad consequences of, say, running a torrent client mainly to the machine running the client, while other machines work reasonably well in spite of ongoing background torrenting. If you dislike that, simply remove these keywords from the config file. (The nat keyword is required so that the dual-xxxhost options can actually see the correct IP addresses.)
The ingress keyword helps to make up for the fact that for download traffic our shaper sits on the sub-optimal end of the actual bottleneck, and it makes the shaper less dependent on the number of concurrent flows.
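The config file this explanation refers to is not part of the thread as reproduced here, so the following is only a reconstruction sketch built from the options named above. The option names are assumed to follow the sqm-scripts layout of /etc/config/sqm, the helper name sqm_stanza is hypothetical, and the interface and rates are parameters so that nothing needs to be replaced by hand.

```sh
#!/bin/sh
# Hypothetical helper: print an sqm queue section for a cable link.
# $1 = WAN interface, $2 = download in kbit/s, $3 = upload in kbit/s
sqm_stanza() {
    cat <<EOF
config queue
    option enabled '1'
    option interface '$1'
    option download '$2'
    option upload '$3'
    option qdisc 'cake'
    option script 'piece_of_cake.qos'
    option qdisc_advanced '1'
    option qdisc_really_really_advanced '1'
    option iqdisc_opts 'nat dual-dsthost ingress'
    option eqdisc_opts 'nat dual-srchost ack-filter'
    option linklayer 'ethernet'
    option overhead '18'
    option linklayer_advanced '1'
    option tcMPU '64'
    option linklayer_adaptation_mechanism 'default'
EOF
}

# Possible usage with the rates mentioned earlier in the thread:
#   sqm_stanza eth0.2 31500 1900
```

The output is meant to be reviewed and compared against the existing /etc/config/sqm before anything is copied over, not applied blindly.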