Having a somewhat strange issue with openwrt + sqm + openvpn.
I'm running an OpenVPN tunnel on a client behind the router (not on the router), and running SQM on the router. The router is running 19.07.3. SQM is configured with piece_of_cake.qos as recommended, with all the defaults except the downstream and upstream bandwidth. SQM itself is working great.
The router is a DIR-825 and has 64 MB of memory. When not under any particular load, it has around 34 MB free, so it's not like it's exactly memory constrained.
The VPN under low usage also works fine. If I do something that puts the whole thing under high load, all hell breaks loose. On the VPN client, I see lots of:
openvpn[96884]: Authenticate/Decrypt packet error: bad packet ID (may be a replay): [ #83487 ] -- see the man page entry for --no-replay and --replay-window for more info or silence this warning with --mute-replay-warnings
On the router, memory usage starts going up and up, chews through all of the available memory, and eventually the OOM killer comes in, kills everything, and the router reboots.
Various notes of weirdness:
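To put numbers on how fast the memory disappears, something like the following could be left running in an SSH session during a test. It is only a sketch: the function names are made up for illustration, and it assumes the usual `/proc/meminfo` layout. It also prints the `Slab` and `SUnreclaim` fields, since packet buffers held by the kernel do not show up as process memory.

```shell
# Print one field (in kB) from a meminfo-style file; defaults to /proc/meminfo.
# $1 = field name (e.g. MemFree), $2 = optional file, useful for testing.
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# Print one timestamped line with free memory and kernel slab usage.
log_mem_once() {
    mem_file="${1:-/proc/meminfo}"
    echo "$(date +%s) MemFree=$(meminfo_kb MemFree "$mem_file")kB Slab=$(meminfo_kb Slab "$mem_file")kB SUnreclaim=$(meminfo_kb SUnreclaim "$mem_file")kB"
}

# Usage on the router (Ctrl-C to stop):
#   while true; do log_mem_once; sleep 1; done
```

If `MemFree` drops while `SUnreclaim` climbs by a similar amount, that would point at kernel allocations rather than any userspace process.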
Disabling SQM removes the problem (both the duplicate messages and the memory usage), though of course leaves me with a laggy network. But it's stable under load and the router doesn't crash.
Switching openvpn from UDP transport to TCP transport also fixes the issue (the workaround I have gone with for now).
So:
openvpn-udp + SQM = boom
openvpn-tcp + SQM = no problem
openvpn-udp + No SQM = no problem
openvpn-tcp + no SQM = not tested, but presumably no problem
No idea what to do with this, anyone have any ideas or want any more info? I can reproduce it reliably.
65% idle, doesn't seem to be under particularly high CPU load, just memory.
That's under the failure scenario, I didn't look at the CPU load on the working scenarios, since, well it works.
The CPU of the 825 isn't state of the art, but since it is just forwarding packets without encrypting them, it should be fine. Do you experience the same if you try to download/upload something big without OpenVPN from the same host? UDP and TCP.
What is the internet connection speed and what have you configured in sqm? uci export sqm
Yeah I don't think it's CPU related, it seems even under heavy traffic to be mostly idle.
The connection is only 16 Mbit down and 2 Mbit up, so really it shouldn't even be straining that CPU at all (and it's not). UCI output:
As for testing without OpenVPN, I tried torrenting a Linux ISO (no idea whether that was using TCP or UDP, possibly a mix), and yes, with SQM on it blew up. With SQM off it was fine. It was also fine doing a direct HTTPS download at the same bit rate.
So I guess that removes openvpn as a component and we're down to:
SQM + bittorrent = boom
SQM + http download = no boom
No SQM = no boom
SQM doesn't seem to be adjusted properly. You are not supposed to configure the raw internet speed in the upload/download options. Also, did you take the overhead value into consideration, or is it also random?
Read here for more details how to configure properly and some basic troubleshooting.
While those settings are far from optimal, they should only negatively affect his latency or speed. They shouldn't cause SQM to break down. So we should probably fix those breakdown issues first before we optimize overhead and speed settings.
SQM was configured using LuCI (which is what the page you linked me recommends you do). That same page also discusses setting the upload and download speed, which is what I did. If you mean what I set them to, the actual connection is about 20/2.5, so the limits of 16/2 seem appropriate. In any case, if I set them lower, to say 14/1.5 (which I tried at one point), it still blows up.
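For reference, the arithmetic behind those numbers as a quick shell sketch (the `shaper_pct` helper is just an illustrative name, not part of SQM):

```shell
# Percentage of the measured line rate that a shaper setting represents.
# $1 = configured rate, $2 = measured rate (same units for both).
shaper_pct() {
    awk -v cfg="$1" -v line="$2" 'BEGIN { printf "%.0f\n", cfg / line * 100 }'
}

shaper_pct 16 20     # download: prints 80
shaper_pct 2 2.5     # upload:   prints 80
```

So 16/2 is 80% of the line rate in both directions, and the 14/1.5 attempt works out to 70% down and 60% up.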
As for the overhead, it's a cable modem connection so I set the type to Ethernet, and the overhead to 22. Which is also what the linked page says to do.
For the basic troubleshooting on that page, most of it is just generic information commands (like ifstatus wan isn't going to change during this test). Catting the sqm config file is just going to show you the uci output I already pasted.
Logread shows nothing printed to syslog. The debug version of starting SQM shows it setting things up (which it seems to do correctly), but this isn't a problem with it trying to configure the interface, just what happens once it's up. tc -d qdisc shows it appears to be set up correctly (the qdiscs are all where I'd expect to see them based on where the SQM commands put them).
tc -s qdisc shows the ingress qdisc using a total of 204k of memory with a backlog of eighteen packets with a total of 17k of space. So that doesn't seem to be eating 30+Mb of memory.
I agree stopping it from exploding is far more important than getting ideal latency, though I'm curious what is far from optimal about those settings. When it's not exploding in a huge ball of fire, it actually seems to perform quite well. I can have an HTTP download pegging my connection and still have decent latency in games.
Without SQM if there's something pegging the connection like that, my ISP's buffers helpfully add about 10 seconds of latency....
So SQM itself seems very nice ... if it could avoid the whole running the router out of memory and exploding thing.
Have you tried fq_codel with simple.qos or simplest.qos? Do these cause SQM to blow up as well? I guess it might just simply be too little memory for what you're asking from the device, but I am not entirely sure on that.
It would be interesting to see the output of 'tc -s qdisc' from before a torrent test, and from an ongoing test (obviously before it goes bad). I also wonder whether there is a crashlog (cat /sys/kernel/debug/crashlog) visible after the crash?
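A rough sketch of how the two snapshots could be captured to files so they can be pasted later. The `snapshot_qdisc` name, the file naming, and the `TC` override variable are all just illustrative choices:

```shell
# Save a labelled snapshot of the qdisc statistics and print the file name.
# $1 = label (e.g. idle, loaded), $2 = output directory (default /tmp).
# TC can be set to use a different tc binary; it defaults to plain "tc".
snapshot_qdisc() {
    snap_file="${2:-/tmp}/qdisc-$1-$(date +%s).txt"
    ${TC:-tc} -s qdisc > "$snap_file" 2>&1
    echo "$snap_file"
}

# Usage on the router:
#   snapshot_qdisc idle      # before starting the torrent
#   snapshot_qdisc loaded    # while the test is running
#   cat /sys/kernel/debug/crashlog   # after a reboot, if the file exists
```

Note that /tmp lives in RAM on OpenWrt, so the files need to be copied off the router before it reboots.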
I tried it with simplest.qos just now and it still blew up.
I'll note from looking at the output of "tc qdisc" that simplest.qos seems to have the exact same configuration for ingress that piece_of_cake.qos did. The only thing changing the script did was change the egress qdisc; the ingress is the same.
As for not being enough memory? I dunno maybe. Seems strange though that cake is configured for "using a maximum of 4Mb" and tc -s qdisc shows it's doing just that, and yet SOMETHING is eating 35Mb of memory.
I can post the full "tc -s" output if someone wants to tell me how to use spoiler tags or something similar, since if I just post them they are, well, long. But the relevant portions as far as memory usage goes:
Idle:
backlog 0b 0p requeues 0
memory used: 271808b of 4Mb
Will crash in 5 seconds if I don't kill download:
backlog 18081b 18p requeues 0
memory used: 214272b of 4Mb
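For anyone who wants to pull just those numbers out of a long `tc -s qdisc` dump, a small sketch follows. It assumes the "memory used" value is printed in plain bytes with a `b` suffix, as in the lines above; tc may print other units for larger values, which this does not handle. The function names are made up for illustration.

```shell
# Print every cake "memory used" byte count found in tc output on stdin.
cake_mem_used() {
    sed -n 's/^ *memory used: \([0-9][0-9]*\)b of .*/\1/p'
}

# Sum all cake instances and print the total in whole kB (rounded down).
cake_mem_total_kb() {
    cake_mem_used | awk '{ s += $1 } END { printf "%d\n", s / 1024 }'
}

# Usage on the router:
#   tc -s qdisc | cake_mem_total_kb
```

Comparing that total against the drop in free memory shows how much of the loss the qdisc itself accounts for.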
As for the crash log, I don't have one handy right now. I'd have to actually let it die (for these tests I've been saving it before it crashes, still with ~25 MB of memory chewed up), but I looked at one before. It just shows the OOM killer kicking in, not finding any particularly large process, killing one anyway, then rinse and repeat until everything is dead and it reboots.
If you think the output might be useful I can let it explode and get one.
Yes, please do! Both of them. To hide these in the output, simply click the cog icon on the top right of the editor window and select "Hide Details", then add your text. Please frame your pasted tc output with a line only containing three backticks "```" to get fixed font formatting.
Summary
This text will be hidden, and also formatted with a fixed width font, suitable for most console output...
Also, it would be good to repeat the test while you are logged into your router, run top -d 1, look at the % idle, and note the minima you see (yes, that is not a terribly precise measurement, but it might tell us something).
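If watching the screen for the minimum is awkward, a sketch like this could pick it out of batch-mode output instead. It assumes BusyBox's `top` prints a summary line of the form `CPU: ... NN% idle ...`; the `min_idle` name is only illustrative.

```shell
# Read top's batch output on stdin and print the lowest "% idle" value seen.
min_idle() {
    awk '$1 == "CPU:" {
             for (i = 2; i <= NF; i++)
                 if ($i == "idle") {
                     v = $(i - 1)        # the field before "idle", e.g. "41%"
                     sub(/%/, "", v)
                     v += 0              # force numeric comparison
                     if (!seen || v < min) { min = v; seen = 1 }
                 }
         }
         END { if (seen) print min }'
}

# Usage on the router (60 one-second samples, then prints the minimum):
#   top -b -d 1 -n 60 | min_idle
```

The fixed sample count matters: if top is interrupted with Ctrl-C, awk is killed too and never gets to print the result.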
With top running the lowest idle I saw was 41% during the "eat all memory" phase. Once the OOM killer started its spree top did get off one last gasp showing 19% idle. Though by then there were only 5 processes left on the poor router at all so I suspect some of that CPU was eaten by the OOM killer itself.
As for the crash log, now I can't get it to reboot. It'll use up so much memory that the OOM killer trashes half the processes on the router and things like logread start failing, but it sort of limps along half dead without rebooting now, not that that's much better. But as a substitute, here's some OOM killer gobbledygook (which is mostly what the crashlog was the time I saw one).
And by here I mean in next post because there seems to be a character limit.