Edgerouter X SFP random reboots with recent master build

For a couple of months I have been building from the master branch more or less regularly and updating my Edgerouter X SFP. This worked fine until now. I built an image from master df9a62a08584920e3147edec2fd92fce6cf8d77c, where I only added some packages I need, as usual. However, this time the router reboots randomly: first within 5 minutes of uptime after the update, then after roughly an hour, and so on and so forth. For now I have reverted to my build from master f4e3ff5b075bbab279bd06a7d3e0d9c950ee098c, which is rock solid.
As I couldn't find anything: are there known issues? Do others experience this? Maybe in the context of the recent switch to kernel 5.10? What info (logs etc.) can I provide for investigation? As the router reboots randomly, which also clears all logs, I have no clue where to look for crash logs.
I would be grateful for any info and help!

My ER-X (not the SFP version) has been up 4 days so far (with no reboots or any other issues) running SNAPSHOT r18353-444b4ea4a4 with the 5.10.83 kernel.

I updated my APs ~12 hours ago with SNAPSHOT r18371-5a4685cfa2 using 5.10.87 - no reboot issues, but not a lot of runtime either. The APs are ipq8064 and ipq40xx all-in-ones - not MT7621, so not particularly relevant to your ER-X SFP.

I download the pre-compiled binaries - I'm not building my own.

Might be worth trying "make dirclean" if you haven't already. There are sometimes issues with code being built using different toolchain versions.

Thanks for your reply and your feedback!
I have one suspect in mind: I had enabled hardware and software flow offload. Do you have those enabled on your Edgerouter, too?

Thanks for that hint, I haven't done that recently. I will do so, rebuild, and report back.

I do not. I've turned them on for some brief iperf3 testing in all combinations with and without SQM, but my conclusion was they did nothing: no impact on throughput, and no impact on CPU utilization. So I left them off.

https://forum.openwrt.org/t/throughput-drop-after-upgrading-to-21-02-mt7621/114888/7?u=eginnc

You have two alternatives for this.

  1. In the log settings you can define a local file to be written to, but that will wear out the router's flash memory.
  2. Set up a log server, for example a Raspberry Pi with a hard drive, that saves all the logs.
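
As a rough sketch of both alternatives, the options below go into the existing `config system` section of `/etc/config/system`. The file path and the address `192.168.1.10` are only placeholders for illustration (the address stands in for whatever IP the Raspberry Pi has on your LAN), and the Pi is assumed to already run a syslog daemon that accepts remote messages on UDP port 514.

```
config system
	# Alternative 1: write the log to a local file.
	# A path on flash survives a reboot but wears the flash;
	# a path under /tmp is RAM and is lost on reboot.
	option log_file '/root/system.log'
	option log_size '64'

	# Alternative 2: forward the log to a remote syslog server.
	# 192.168.1.10 is a placeholder address for the Raspberry Pi.
	option log_ip '192.168.1.10'
	option log_port '514'
	option log_proto 'udp'
```

After editing the file, `/etc/init.d/log restart` should make logd pick up the new settings. Normally you would pick only one of the two alternatives.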

I've done a new build after cleaning up using make dirclean. Running fine so far.

With the new build I again had several reboots. I then disabled flow offloading, both software and hardware. Since then there have been no more reboots, and the router has been running for 17 hours now. It seems that flow offloading leads to crashes and reboots with kernel 5.10; this was not the case with kernel 5.4.
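
For anyone who wants to try the same workaround, this is roughly what the relevant part of `/etc/config/firewall` looks like with both kinds of offloading turned off. It is a sketch that assumes the stock `defaults` section; any other options already in that section stay as they are.

```
config defaults
	# Software flow offloading off
	option flow_offloading '0'
	# Hardware flow offloading off
	option flow_offloading_hw '0'
```

The same switches are available in LuCI under the firewall settings. After editing the file by hand, `/etc/init.d/firewall restart` applies the change.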

Thanks for that hint. I set it up accordingly and will watch whether I can find anything in the logs when testing with flow offloading. Let's hope that the interesting log entries arrive at the Pi in time, before the router reboots.

Thank you for sharing that result. I won't experiment with turning it back on then.

The ER-X is our main home gateway. Random reboots would not help my already dubious family network credibility next time I announce I'm "fixing the internet" with the latest snapshot.

Using a master snapshot is like rolling the dice every time, so you really aren't going for stability to begin with.

Can't you simply use 21.02, which is stable?

More humor was intended in my comment than probably came through in the words.

I've actually found snapshot to be astonishingly well behaved over the years, with some memorable exceptions requiring a rollback to stable - so I have also personally experienced why I would be better off following your advice to stick with stable :wink:

I have four Archer C7 v3.2, also MT7621AT. Like you, I have been running custom builds for some time now.

One of them is running as a router, and I add some stuff to the build such as Wireguard and DynamicDNS.

For the others, running as access points, I do a "light build" that removes firewall/iptables, dnsmasq and a few other things.

When upgrading to a recent build I did on Dec 18th (r18371-5a4685cfa2) which already included kernel 5.10.87, I noticed the same reboot issue you described in the router, but not in the access points. Since I'm using software/hardware offload on the router, this seems to explain the issue.

Two days ago I reverted to a previous build I had (r18298-8261b85844) with kernel 5.4.162, and the problem was solved; it has been running rock solid so far.

So it does seem that kernel 5.10 is causing some sort of regression with MT7621AT and software/hardware offload. I will open a topic in the Developers section pointing to this thread so perhaps a dev can take a look at this.

OK, I just posted a more generic report about this issue in the Developers forum section: