X86 crashing issue

chuckt94 · June 23, 2023, 6:57pm

Hey all,

I’ve been running OpenWrt x86 in the following configuration:

Hardware
CPU: i5-10210U
NIC(s): 6x Intel I225-V 2.5G
RAM: 2x8GB 2666 MHz Samsung
HD: Samsung 1TB 980 NVMe
Brand: Yanling (Protectli OEM)

Software
Software version: 23.03.5
1 interface configured as WAN
5 interfaces configured as a bridged virtual switch
Adblock
Unbound
CAKE SQM
Collectd graphs
Latest Intel microcode package

Issue:

The router crashes randomly after long periods of sustained load. I usually test it by streaming 4K YouTube videos to multiple clients on different LAN ports. It will crash after anywhere from a couple of hours to a day or so.

I’ve been monitoring it using ncat and I’ve caught a couple of suspect things in the logs:

This error shows up randomly on boot (not every time):

Hardware event. This is not a software error.
CPU 0 BANK 1 
MISC 86 ADDR 34afec0 
TIME 1687213342 Mon Jun 19 17:22:22 2023
MCG status:
MCi status:
Error overflow
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
SRAR
MCA: Data CACHE Level-1 Write Error
STATUS ff80000000000124 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 142 Step 12

This one after streaming 4K video for a while (usually precedes the crash, but not always):

Hardware event. This is not a software error.
CPU 1 BANK 0 TSC 3ffd0cf796e6 
ADDR 1ffff810ec637 
TIME 1687142516 Sun Jun 18 21:41:56 2023
MCG status:
MCi status:
Corrected error
Error enabled
MCi_ADDR register valid
MCA: Instruction CACHE Level-1 Instruction-Fetch Error
STATUS 9400004000040150 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 142 Step 12
SOCKET 0 APIC 2 microcode f4

Both errors shown above are the outputs as decoded by mcelog.
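For anyone who wants to sanity-check the decoding by hand, here is a minimal Python sketch that pulls the flag bits out of the two STATUS values above. The bit positions follow the architectural IA32_MCi_STATUS layout as I understand it (the same positions the Linux kernel uses for its MCI_STATUS_* masks). The function name and dictionary keys are made up for illustration, and the corrected-error count field is only meaningful on CPUs that report it.

```python
# Flag bits in the upper part of IA32_MCi_STATUS: (bit position, short name).
STATUS_FLAGS = [
    (63, "VAL"),    # register holds a valid error
    (62, "OVER"),   # error overflow: an earlier error was overwritten
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting enabled
    (59, "MISCV"),  # MCi_MISC register valid
    (58, "ADDRV"),  # MCi_ADDR register valid
    (57, "PCC"),    # processor context corrupt
    (56, "S"),      # signaled; together with AR this is "SRAR"
    (55, "AR"),     # action required
]


def decode_mci_status(status):
    """Split a raw MCi_STATUS value into flag names and code fields.

    Hypothetical helper for illustration, not part of mcelog.
    """
    return {
        "flags": [name for bit, name in STATUS_FLAGS if (status >> bit) & 1],
        "corrected_count": (status >> 38) & 0x7FFF,      # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_code": status & 0xFFFF,                     # bits 15:0
    }


if __name__ == "__main__":
    # The two STATUS values from the mcelog output in this post.
    for raw in (0xFF80000000000124, 0x9400004000040150):
        print(hex(raw), decode_mci_status(raw))
```

The first value has UC, PCC, S and AR set, which lines up with the "Uncorrected error", "Processor context corrupt" and "SRAR" lines in the first dump. The second has only VAL, EN and ADDRV set, which is consistent with a corrected error.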

This message popped up in the logs. After it did, CPU temperature and usage spiked for quite a while (6-7 hours), but the router did not crash and eventually returned to normal:

Advanced->Power & Performance->CPU - Power Management Control->HwP Lock = Disabled (but not necessary since not changing anything at runtime)

Troubleshooting:

I know that some of the above errors seem like the hardware may be the issue, but I’ve done the following:

Multiple runs of memtest suite that were flawless

Multiple varieties of OCCT stress tests that didn’t generate any Windows hardware errors. These tests absolutely slam the CPU and caches, power supply, memory, and temps in a bunch of different ways, so I feel like they would have shown an actual CPU/hardware defect here.

Looking for any help or advice.

frollic · June 23, 2023, 7:47pm

I'd start by running with only one DIMM; if it still crashes, swap them.

frank92735 · June 24, 2023, 9:29am

Just a shot in the dark here, but the i225v is why I didn't buy that generation of mini-PCs.

[quote="chuckt94, post:1, topic:163859"]
NIC(s): 6x intel i225v 2.5G
[/quote]

The I225-V is known to be crashy at 2.5Gbit. I know cable internet providers are still distributing modems with PUMA devices for 1Gbit connections, and those seem stable.

You could force them to 1Gbit for a test?
You could try disabling all power throttling to encourage the i225v to stay up at all times?
Or roll back the OpenWrt firmware?
Not sure if you are running on bare metal or in a container - try both?
Try a different power supply? Make sure you know what you are doing here or you could fry the mini-pc. I presume you do but not everybody does.

A quick search on the error code:

https://bbs.archlinux.org/viewtopic.php?id=103227

You could try and select a lower speed for the RAM to run cooler and perhaps more stable?

https://ark.intel.com/content/www/us/en/ark/products/195436/intel-core-i510210u-processor-6m-cache-up-to-4-20-ghz.html

Disable Turbo Mode to run cooler and more stable?

I don't think you can underclock/undervolt it, because Intel locked most of its chips down.

https://www.reddit.com/r/intel/comments/pmn6um/how_to_unlock_i510210u/

chuckt94 · June 24, 2023, 9:23pm

I’ll definitely give this a shot this weekend to rule out.

I did run multiple memtest suites with both sticks in and each stick separately, but stranger things have happened!

chuckt94 · June 24, 2023, 9:29pm

The only thing is, I’ve seen a lot of people in the wild running i225 based devices and not reporting this type of behavior.

As far as the temp, it stays at a good temp during operation typically.

I’ll try messing around with the turbo boost settings.

I also thought about enabling hwp_dynamic_boost, as SpeedStep is enabled in the BIOS.
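For reference, both turbo and hwp_dynamic_boost can be inspected at runtime through the intel_pstate sysfs directory, without a trip into the BIOS. Below is a rough Python sketch of that, assuming the intel_pstate driver is loaded and Python is available on the box (it is not on a stock OpenWrt image, where a plain `echo 1 > .../no_turbo` does the same thing). The helper names are invented for illustration, and hwp_dynamic_boost only exists when HWP is active.

```python
from pathlib import Path

# Default sysfs directory of the intel_pstate driver on Linux.
PSTATE_DIR = Path("/sys/devices/system/cpu/intel_pstate")


def read_knob(name, base=PSTATE_DIR):
    """Return the current value of a knob as a string, or None if it is absent."""
    try:
        return (Path(base) / name).read_text().strip()
    except OSError:
        # Driver not loaded, knob not supported, or no permission to read it.
        return None


def write_knob(name, value, base=PSTATE_DIR):
    """Write a value to a knob (needs root on a real system)."""
    (Path(base) / name).write_text(f"{value}\n")


if __name__ == "__main__":
    # Only reads by default, so this is safe to run as-is.
    for knob in ("status", "no_turbo", "hwp_dynamic_boost"):
        print(knob, "=", read_knob(knob))
    # Uncomment to disable turbo until the next reboot:
    # write_knob("no_turbo", 1)
```

Writing 1 to no_turbo disables turbo and writing 1 to hwp_dynamic_boost enables the boost; neither setting survives a reboot, so it is a cheap way to test before changing anything permanently.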

M10 · June 24, 2023, 9:40pm

When I read that there were some problems with those Intel 2.5-gig NICs, I decided to go the 10-gig way, and I haven't had any problems with Intel / Mellanox / Broadcom / Chelsio NICs. Now I'm testing 40-gig on Intel x710 based cards, with zero problems too. Intel made crappy 2.5-gig NICs, and after 4 revisions they are still the same crap.

Mogician0123 · September 26, 2023, 8:06am

I'm experiencing the same thing. Have you solved the problem? Also, where did you get the log? My syslog showed nothing when it crashed.

mindwolf · October 4, 2023, 7:18pm

It seems like a hardware failure in the core to me that needs to be replaced.

frollic · October 4, 2023, 8:26pm

Feel free to explain how you replace a CPU core...

bluewavenet · October 5, 2023, 10:46am

Careful use of a Dremel? /S

efahl · October 5, 2023, 3:19pm

You need kmod-screw-driver and the core-bolt utils.

bluewavenet · October 5, 2023, 7:30pm

Of course yes. I forgot that dremel-utils contains a proprietary binary blob.
Shame, because the dremel-core-extractor and dremel-core-injector commands are very efficient in cases such as this.

frollic · October 5, 2023, 7:36pm

(c) BCM ??

bluewavenet · October 5, 2023, 7:39pm

We know what you mean and you may well be correct...
Feel free to join in the fun

bluewavenet · October 5, 2023, 7:41pm

(C) Albert J. Dremel

mindwolf · October 7, 2023, 7:52pm

The entire CPU must be replaced, or in this instance, the whole circuit board as the processor is most likely soldered to the board.

A total of 7 posts that take up more time being a smart ass than actually providing support to the OP. I thought all the warez kiddies grew up?

lleachii · October 7, 2023, 7:59pm

Wow, until I read your post about people being "smart", I thought you were honestly suggesting the OP replace a built-in CPU.

I guess the suggestions made in this thread were lost in context.

mindwolf · October 7, 2023, 8:14pm

Maybe so...

The suggestion of testing/replacing the DIMM is incorrect in two ways:

It's clearly not the RAM, as shown by the overheating and the errors.
DIMM is for larger form factors such as desktops and servers. SODIMM is for laptops/notebooks, mini PCs, etc. If we're being technical lol.


chuckt94 · October 15, 2023, 11:25pm

Sorry for being MIA.

Haven’t had a lot of chance to test stuff. But I highly doubt that it’s a memory or CPU issue.

I heavily stressed the CPU with a variety of CPU stress tests that do all sorts of variable loads, starts and stops, thermal stress, different cores/caches etc, and it didn’t pop any detectable errors.

For the memory I did multiple runs of memtest with both sticks in, one stick out, swapped spots, etc, and no errors there.

I haven’t had a lot of time to devote to this thing, but am planning on messing with it some this week. Maybe try some different router distros to see if I could replicate the behavior.

chuckt94 · October 15, 2023, 11:27pm

I was streaming the error log to a laptop and capturing it there.
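For anyone wanting to do something similar, here is a minimal Python sketch of the laptop side. It assumes the router is set up to send its log to the laptop over UDP (for example via the log_ip, log_port and log_proto options in OpenWrt's system config) rather than the exact ncat setup used here. Port 5140, the file name router.log, and the function names are arbitrary placeholders.

```python
import socket


def capture_udp_log(sock, out, max_messages=None):
    """Append each datagram received on sock to the open text file out.

    Runs until max_messages datagrams have arrived, or forever if it is None.
    """
    received = 0
    while max_messages is None or received < max_messages:
        data, _addr = sock.recvfrom(65535)
        out.write(data.decode("utf-8", errors="replace").rstrip("\n") + "\n")
        # Flush after every message so the last lines before a crash hit the disk.
        out.flush()
        received += 1
    return received


def main(port=5140, path="router.log"):
    """Listen on the given UDP port and append everything to path (Ctrl-C to stop)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("0.0.0.0", port))  # assumed port; must match the router side
        with open(path, "a", encoding="utf-8") as out:
            capture_udp_log(sock, out)
```

Calling main() on the laptop and leaving it running means the last messages sent before a crash are kept even when the router's own log is lost on reboot.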