Hey all,
I’ve been running OpenWrt x86 in the following configuration:
Hardware
CPU: i5-10210U
NIC(s): 6x Intel I225-V 2.5G
RAM: 2x8GB 2666 MHz Samsung
HD: Samsung 1TB 980 NVMe
Brand: Yanling (Protectli OEM)
Software
Software version: 23.03.5
1 interface configured as WAN
5 interfaces configured as a bridged virtual switch
Adblock
Unbound
CAKE SQM
Collectd graphs
Latest Intel microcode package
Issue:
The router crashes randomly under sustained load. I usually test it by streaming 4K YouTube videos to multiple clients on different LAN ports. It crashes anywhere from a couple of hours to a day or so into the test.
I’ve been monitoring it using ncat and have caught a couple of suspect things in the logs:
This error appears on boot at random (not every time):
Hardware event. This is not a software error.
CPU 0 BANK 1
MISC 86 ADDR 34afec0
TIME 1687213342 Mon Jun 19 17:22:22 2023
MCG status:
MCi status:
Error overflow
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
SRAR
MCA: Data CACHE Level-1 Write Error
STATUS ff80000000000124 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 142 Step 12
This one appears after streaming 4K video for a while (it usually precedes the crash, but not always):
Hardware event. This is not a software error.
CPU 1 BANK 0 TSC 3ffd0cf796e6
ADDR 1ffff810ec637
TIME 1687142516 Sun Jun 18 21:41:56 2023
MCG status:
MCi status:
Corrected error
Error enabled
MCi_ADDR register valid
MCA: Instruction CACHE Level-1 Instruction-Fetch Error
STATUS 9400004000040150 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 142 Step 12
SOCKET 0 APIC 2 microcode f4
Both errors shown above are the output as decoded by mcelog.
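For anyone who wants to cross-check the decoded flags against the raw STATUS values, here is a rough Python sketch that pulls the architectural flag bits out of an MCi_STATUS value. It assumes the bit layout from the machine-check chapter of the Intel SDM (and that the S and AR bits are meaningful, which is only true on CPUs that support software error recovery). The function name decode_mci_status is made up for illustration; it is not part of mcelog.

```python
# Sketch: decode the architectural flag bits of an IA32_MCi_STATUS value.
# Bit positions are assumed from the Intel SDM machine-check chapter.
# decode_mci_status is a hypothetical helper, not an mcelog API.

FLAGS = [
    (63, "VAL"),    # register contents valid
    (62, "OVER"),   # error overflow
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting enabled
    (59, "MISCV"),  # MCi_MISC register valid
    (58, "ADDRV"),  # MCi_ADDR register valid
    (57, "PCC"),    # processor context corrupt
    (56, "S"),      # signaled (only defined with software error recovery)
    (55, "AR"),     # action required (only defined with software error recovery)
]


def decode_mci_status(status: int) -> dict:
    """Return the set flag names plus the two 16-bit error code fields."""
    flags = [name for bit, name in FLAGS if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,              # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


if __name__ == "__main__":
    # The two STATUS values from the logs above.
    for raw in (0xFF80000000000124, 0x9400004000040150):
        info = decode_mci_status(raw)
        print(f"{raw:016x}: flags={info['flags']} "
              f"mca=0x{info['mca_error_code']:04x} "
              f"model=0x{info['model_specific_code']:04x}")
```

Run against the two values above, the flag lists line up with what mcelog printed: the boot-time error has OVER, UC, PCC, S and AR set (S and AR together being the SRAR class), while the streaming error has only VAL, EN and ADDRV, i.e. a corrected error.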
This message also popped up in the logs. After it did, CPU temperature and usage spiked for quite a while (6-7 hours), but the router did not crash and eventually returned to normal:
Advanced->Power & Performance->CPU - Power Management Control->HwP Lock = Disabled (but not necessary since not changing anything at runtime)
Troubleshooting:
I know that some of the above errors suggest the hardware may be at fault, but I’ve done the following:
Multiple runs of the memtest suite, all flawless
Multiple varieties of OCCT stress tests, none of which generated any Windows hardware errors. These tests absolutely slam the CPU and caches, power supply, memory, and temps in a bunch of different ways, so I feel an actual CPU/hardware defect would have shown up here.
Looking for any help or advice.