EdgeRouter X Crash with 18.06.4

Problem Summary

Since upgrading to 18.06.4 I am seeing a periodic crash on my EdgeRouter X.

  • The router would just become unresponsive (to DHCP requests, ssh, etc).

Reproducible?

  • around once or twice a week

Description

Today, I got lucky and managed to SSH in before it completely hung up its shoes.

Environment

Architecture : MediaTek MT7621 ver:1 eco:3
Firmware Version : OpenWrt 18.06.4 r7808-ef686b7292 / LuCI openwrt-18.06 branch (git-19.170.32094-4d6d8bc)
Kernel: 4.14.131
  • This is the standard sysupgrade from the OpenWrt website (not my own compiled build).

Packages I have installed after upgrade:

luci-app-adblock
ddns-scripts
luci-mod-ddns
luci-mod-rpc
luci-app-openvpn openvpn-openssl
luci-proto-wireguard luci-app-wireguard wireguard kmod-wireguard wireguard-tools
banip - 0.1.4-1

(let me know if you need the full list).

Logs

Here is what I managed to get from logread before it went unresponsive:

Mon Jul 22 13:54:44 2019 daemon.notice odhcpd[930]: Got DHCPv6 request
Mon Jul 22 13:54:44 2019 daemon.warn odhcpd[930]: DHCPV6 REQUEST IA_NA from 00010001233b78ad38539cc351bc on br-lan: ok fdd0:43a3:e55c::e99/128
Mon Jul 22 13:56:10 2019 kern.err kernel: [98924.496203] INFO: rcu_sched self-detected stall on CPU
Mon Jul 22 13:56:10 2019 kern.err kernel: [98924.506572] 	2-...: (1 GPs behind) idle=866/140000000000001/0 softirq=2472365/2472367 fqs=2998
Mon Jul 22 13:56:10 2019 kern.err kernel: [98924.516181] INFO: rcu_sched detected stalls on CPUs/tasks:
Mon Jul 22 13:56:10 2019 kern.err kernel: [98924.523900]
Mon Jul 22 13:56:10 2019 kern.err kernel: [98924.523925] 	2-...: (1 GPs behind) idle=866/140000000000001/0 softirq=2472365/2472367 fqs=2998
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.534811]  (t=6003 jiffies g=1045463 c=1045462 q=162)
Mon Jul 22 13:56:10 2019 kern.err kernel: [98924.537926]
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.555218] NMI backtrace for cpu 2
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.565597] (detected by 3, t=6006 jiffies, g=1045463, c=1045462, q=162)
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.568718] CPU: 2 PID: 1221 Comm: kworker/2:1 Not tainted 4.14.131 #0
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.601919] Workqueue: events 0x80237e28
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.609715] Stack : 00000000 00000000 804baa10 8fc0dd24 00000000 00000000 00000000 00000000
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.626357]         00000000 00000000 00000000 00000000 00000000 00000001 8fc0dce0 53261662
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.643000]         8fc0dd78 00000000 00000000 00003ae0 00000038 804835d8 00000008 00000000
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.659641]         00000000 80530000 00092f3f 00000000 8fc0dcc0 00000000 80550000 00000002
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.676283]         80534480 8052c0ac 000000e0 80530000 00000003 8029b5a8 00000008 80590008
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.692925]         ...
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.697783] Call Trace:
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.697802] [<804835d8>] 0x804835d8
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.709579] [<8029b5a8>] 0x8029b5a8
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.716508] [<80010090>] 0x80010090
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.723437] [<80010098>] 0x80010098
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.730366] [<8046c57c>] 0x8046c57c
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.737296] [<80071294>] 0x80071294
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.744226] [<80473474>] 0x80473474
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.751156] [<8000cf90>] 0x8000cf90
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.758085] [<8000cf90>] 0x8000cf90
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.765013] [<80473560>] 0x80473560
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.771943] [<80084b48>] 0x80084b48
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.778875] [<80083fa0>] 0x80083fa0
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.785851] [<80087518>] 0x80087518
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.792783] [<800980fc>] 0x800980fc
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.799715] [<8031e700>] 0x8031e700
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.806645] [<8035125c>] 0x8035125c
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.813574] [<80077928>] 0x80077928
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.820506] [<80071c40>] 0x80071c40
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.827437] [<80252ed0>] 0x80252ed0
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.834367] [<80252d7c>] 0x80252d7c
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.841297] [<80252f3c>] 0x80252f3c
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.848226] [<80071c40>] 0x80071c40
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.855156] [<804899e4>] 0x804899e4
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.862085] [<80251f0c>] 0x80251f0c
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.869018] [<8000b4e8>] 0x8000b4e8
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98924.875944]
Mon Jul 22 13:56:10 2019 kern.info kernel: [98924.878913] Sending NMI from CPU 3 to CPUs 2:
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98928.436380] NMI backtrace for cpu 2
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98928.443396] CPU: 2 PID: 1221 Comm: kworker/2:1 Not tainted 4.14.131 #0
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98928.456444] Workqueue: events 0x80237e28
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98928.464347] task: 8fdc0640 task.stack: 8e7dc000
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98928.473426] $ 0   : 00000000 00000001 00000000 0000000a
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98928.484044] $ 4   : 00000004 000c0000 00966c35 00966c35
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98928.494705] $ 8   : 0000ffff ffff0000 d9f2f27f 00000002
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98928.505412] $12   : 00000000 00000000 ffffffff 00004f22
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98928.516008] $16   : 8a297f7c 81231340 8052c1b8 000c0000
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98928.526637] $20   : 8058b362 8f0d6ab0 000000bc 0000001f
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98928.537263] $24   : 3b9aca00 8000ce94
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98928.547895] $28   : 8e7dc000 8e7dddf0 8f17fc00 8006a5f0
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98928.558533] Hi    : 0000000a
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98928.564328] Lo    : 66666669
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98928.570147] epc   : 8006a68c 0x8006a68c
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98928.577952] ra    : 8006a5f0 0x8006a5f0
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98928.585774] Status: 11007c03	KERNEL EXL IE
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98928.594525] Cause : 50800400 (ExcCode 00)
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98928.602654] PrId  : 0001992f (MIPS 1004Kc)
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98928.610922] CPU: 2 PID: 1221 Comm: kworker/2:1 Not tainted 4.14.131 #0
Mon Jul 22 13:56:10 2019 kern.warn kernel: [98928.624048] Workqueue: events 0x80237e28
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.632000] Stack : 00000000 00000000 804baa10 8fc0dd64 00000000 00000000 00000000 00000000
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.648828]         00000000 00000000 00000000 00000000 00000000 00000001 8fc0dd20 53261662
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.665643]         8fc0ddb8 00000000 00000000 000045c0 00000038 804835d8 00000008 00000000
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.682460]         00000000 80530000 000985b0 00000000 8fc0dd00 00000000 80550000 00000002
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.699281]         804c1fcc 8051f4e0 804bfbd8 80530000 00000003 8029b5a8 00000008 80590008
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.716104]         ...
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.721026] Call Trace:
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.721097] [<804835d8>] 0x804835d8
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.732971] [<8029b5a8>] 0x8029b5a8
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.739960] [<80010090>] 0x80010090
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.746948] [<80010098>] 0x80010098
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.753937] [<8046c57c>] 0x8046c57c
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.760923] [<80010154>] 0x80010154
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.767914] [<80473454>] 0x80473454
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.774940] [<8000d0a4>] 0x8000d0a4
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.781897] [<8000d0b4>] 0x8000d0b4
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.788894] [<8009f764>] 0x8009f764
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.795883] [<80058ab8>] 0x80058ab8
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.802905] [<80015590>] 0x80015590
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.809918] [<800727e0>] 0x800727e0
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.816952] [<80072914>] 0x80072914
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.823950] [<800729b8>] 0x800729b8
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.830939] [<8031e700>] 0x8031e700
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.837923] [<8035125c>] 0x8035125c
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.844912] [<80076ca0>] 0x80076ca0
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.851898] [<80071c40>] 0x80071c40
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.858882] [<80071c40>] 0x80071c40
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.865868] [<80252ed0>] 0x80252ed0
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.872855] [<80252d7c>] 0x80252d7c
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.879849] [<80252f3c>] 0x80252f3c
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.886838] [<80071c40>] 0x80071c40
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.893829] [<804899e4>] 0x804899e4
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.900817] [<80251f0c>] 0x80251f0c
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.907821] [<8000b4e8>] 0x8000b4e8
Mon Jul 22 13:56:11 2019 kern.warn kernel: [98928.914793]
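
For anyone skimming a trace like the one above, here is a rough sketch of how it could be boiled down to the essentials. It is only an illustration: `summarise_stalls` is a made-up helper name, and it assumes the log text arrives on stdin in the same format as the logread output shown here.

```sh
# Summarise RCU stall reports from log text on stdin (hypothetical helper).
summarise_stalls() {
    awk '
        # Count both the self-detected and the cross-CPU stall reports.
        /rcu_sched self-detected stall/ || /rcu_sched detected stalls/ { stalls++ }
        {
            # The backtrace header names the task after "Comm:".
            for (i = 1; i < NF; i++)
                if ($i == "Comm:") tasks[$(i + 1)]++
        }
        END {
            printf "stall reports: %d\n", stalls
            for (t in tasks)
                printf "task %s seen in %d trace header(s)\n", t, tasks[t]
        }
    '
}
```

On the router this could be run as `logread | summarise_stalls` while the box is still reachable.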

  • When I ran htop I could see that 2 of the 4 cores were pegged at 100% the whole time. However, it didn't show any particular process in the list using more than 1-3% CPU. Strange.
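
Busy cores with no busy process usually mean the time is being spent in the kernel (system, irq or softirq) and not in userspace. A minimal sketch for checking that, assuming two snapshots of `/proc/stat` saved a few seconds apart; `cpu_delta` and the file paths are illustrative names, not an existing tool:

```sh
# Print, per CPU, how the time between two /proc/stat snapshots was spent.
# Usage: cpu_delta BEFORE_FILE AFTER_FILE   (hypothetical helper)
cpu_delta() {
    awk '
        # Per-CPU lines only (cpu0, cpu1, ...), not the aggregate "cpu" line.
        /^cpu[0-9]+ / {
            # Fields: name user nice system idle iowait irq softirq steal ...
            total = 0
            for (i = 2; i <= NF && i <= 9; i++) total += $i
            if (NR == FNR) {
                # First file: remember the baseline counters.
                t0[$1] = total; u0[$1] = $2; s0[$1] = $4
                q0[$1] = $7; sq0[$1] = $8
            } else if ($1 in t0) {
                dt = total - t0[$1]
                if (dt <= 0) next
                printf "%s user=%d%% system=%d%% irq=%d%% softirq=%d%%\n", $1,
                    100 * ($2 - u0[$1]) / dt, 100 * ($4 - s0[$1]) / dt,
                    100 * ($7 - q0[$1]) / dt, 100 * ($8 - sq0[$1]) / dt
            }
        }
    ' "$1" "$2"
}
```

Example use: `cat /proc/stat > /tmp/a; sleep 5; cat /proc/stat > /tmp/b; cpu_delta /tmp/a /tmp/b`. A high system or softirq share on the affected cores would be consistent with the kworker seen in the trace, though it does not prove the cause.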

Any ideas about what might be causing this? Thanks for any help!

Hardware acceleration is known to cause issues if you have that enabled.
Have you tried using snapshots?
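
In case it helps, here is a minimal sketch for turning both offloading options off from the shell. It assumes offloading is controlled by the `flow_offloading` and `flow_offloading_hw` options in the firewall defaults section; `disable_flow_offloading` is just an illustrative wrapper name.

```sh
# Switch off software and hardware flow offloading via uci (sketch).
disable_flow_offloading() {
    if ! command -v uci >/dev/null 2>&1; then
        echo "uci not found; run this on the OpenWrt device" >&2
        return 1
    fi
    # Quoted so the shell does not treat [0] as a glob pattern.
    uci set 'firewall.@defaults[0].flow_offloading=0'
    uci set 'firewall.@defaults[0].flow_offloading_hw=0'
    uci commit firewall
}
```

After running it, restart the firewall (for example with `/etc/init.d/firewall restart`) or reboot so the change takes effect.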


Good point. No I haven't tried that thanks!


@fbradyirl Did turning off offloading features fix your issues for good?

Just started seeing similar issues on my (mission-critical) ER-X hardware. Just wondering if there is anything else.

Thank you for starting this thread.

Yes. But I also downgraded my firmware as the last dot release never crashes for me.

Thanks for replying.

Will leave my ER-X at 18.06.4 (H/W + S/W Offloading OFF) and report here if this issue comes back. (All my other boxes are stable on 18.06.4)

One other thing I noticed is that I never get any IPv6 addresses on the WAN with S/W + H/W offloading enabled on this router.

Any solution for that?
Yesterday one EdgeRouter with 18.06.5 crashed (offloading not enabled).
It's in a critical environment, so is 18.06.03 more stable?

Never seen this in 18.06.2

[SOLVED]
Answering myself:

Problem solved: the device was simply overheating.
The EdgeRouter seems to be a little sensitive to high temperatures.


That is ... curious. Being in the market for the ERX myself, do you mind telling how much stress you put on the device, and how you positioned/buried it? It really should not be able to overheat in just regular operation.

I have one. It's been very dependable. They come with a small heatsink on the CPU though there is space for a larger one. I'm still using the stock one.

There was another thread where someone plugged a non-POE device into port 4 without turning the GPIO pin off. I've done that too. That shorts out the power circuit and problems will occur.


I have a lot of these little EdgeRouters (for VPN-related things etc.).
Rock solid except for HEAT.
This is ONLY about the ambient temperature, not any load inside OpenWrt.
If you have a primary router (Cisco etc.) that produces heat, putting the EdgeRouter on top of it is not a good idea.

This little beast really needs cool air all around.

The official hardware specs state:
Operating Temperature -10 to 45° C (14 to 113° F)
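
As a quick sanity check on those figures, the two scales agree under F = C × 9/5 + 32. A small sketch, with `c_to_f` and `in_operating_range` as made-up helper names and the limits taken from the spec line quoted above:

```sh
# Convert whole degrees Celsius to Fahrenheit.
# Integer arithmetic: exact for multiples of 5, truncated otherwise.
c_to_f() {
    echo $(( $1 * 9 / 5 + 32 ))
}

# Succeeds if an ambient temperature in whole degrees C is within -10..45.
in_operating_range() {
    [ "$1" -ge -10 ] && [ "$1" -le 45 ]
}
```

For example, `c_to_f -10` gives 14 and `c_to_f 45` gives 113, matching the quoted range. This says nothing about the temperature inside the case, which will be higher than the ambient reading.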


Ah, alright. This is generally true for pretty much every router except for maybe the really low-specced ones. Even my Lantiqs act up when not reasonably well ventilated (especially their modem chipset.) Thanks for the elaboration.

