X86 freezes randomly and at reboot

Hello folks,

I have an Intel J1900 mini-PC here, running OpenWrt 19.07 [I know, old :) ]

Of late, every 2-3 days this box freezes up.
It also freezes on a few reboots.

The power & port LEDs remain ON, but apart from that everything else stops working.

Even pressing the power button gets no response. The only way to reboot from this state is to pull the power cord.

I am planning to plug in a VGA monitor to check what's going on, but I suspect the VGA output will be frozen too.

In a reboot that ends in a freeze, it seems to start up fine; I almost get in via ssh [it asks for the password], but then it just hangs there.

What could be the possible reasons?
Any way to fix this?

Sort of sounds like a hardware issue, so you'd better plug in that monitor and see if it shows anything... Maybe make a boot USB with Memtest86+ and run that?

Both suggestions are worth testing; another would be to add intel_idle.max_cstate=1 as a kernel parameter in GRUB. Background: many Bay Trail-D systems are plagued by a power delivery bug when waking up from deeper C-states, which makes the system crash because the CPU doesn't get enough power quickly enough, letting the voltage drop below the threshold needed to operate. This is a hardware bug, but how often it happens varies widely, depending on the mainboard design and also on how often the kernel enters deeper C-states and how quickly it wants to wake up. This kernel parameter limits the CPU to the shallowest idle state (C1), keeping it out of the deeper C-states, in the hope of avoiding the error condition.

Depending on your board, it may 'fix' the crashes, help a bit, or do nothing at all; you can only test it (and the cause of your issue might be something completely different altogether).
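
For anyone who wants to script the change rather than edit the file by hand, here is a minimal sketch of what "adding the parameter" amounts to: appending it to each `linux` line of the GRUB config, then checking the running kernel's command line after a reboot. The helper names and the sample config below are made up for illustration; the location of your real grub.cfg and the contents of its `linux` line will differ, so treat both as assumptions and keep a backup.

```python
PARAM = "intel_idle.max_cstate=1"

# Illustrative sample only; a real grub.cfg will have different entries and arguments.
SAMPLE_CFG = '''menuentry "OpenWrt" {
	linux /boot/vmlinuz root=/dev/sda2 rootwait console=tty0 noinitrd
}
'''


def add_kernel_param(cfg_text: str, param: str) -> str:
    """Append `param` to every `linux` line of a GRUB config, unless already present."""
    out = []
    for line in cfg_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("linux ") and param not in stripped.split():
            line = line.rstrip() + " " + param
        out.append(line)
    # Preserve the trailing newline of the input, if it had one.
    return "\n".join(out) + ("\n" if cfg_text.endswith("\n") else "")


def param_is_active(cmdline: str, param: str) -> bool:
    """Check whether `param` appears in a kernel command line string."""
    return param in cmdline.split()


if __name__ == "__main__":
    print(add_kernel_param(SAMPLE_CFG, PARAM))
    # After rebooting, the running kernel's command line can be read from /proc/cmdline.
    try:
        with open("/proc/cmdline") as f:
            print("active:", param_is_active(f.read(), PARAM))
    except OSError:
        print("no /proc/cmdline here (not Linux?)")
```

The function is idempotent, so running it twice on the same file does not add the parameter a second time.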

If you have run this x86 rig 24/7 for 4 years, on top of the actual age of the hardware, I seriously doubt a config change will make any difference now, if it has suddenly started to show signs of critical hardware failure just as its design life has been reached.

The easiest thing, and pretty much the only thing, you can change in that machine is the hard disk and/or PSU.
Maybe the RAM, if you can actually find something that old that fits the memory socket.

But after that it is all or nothing…

Maybe, maybe not. I (and many others on bugzilla.kernel.org) have been seeing this on and off on many Bay Trail-D devices, and the frequency of these issues varies a lot (from basically never to multiple times a day, depending on kernel, current workload, etc.). It sadly isn't an exact science when it comes to this hardware bug, as it really depends on the exact transitions between C-state 5+ and wakeup calls (graphics demands are also part of the story) - so really subtle differences in usage patterns (versions of kernel and userspace) might make or break it.

But yes, the oldest of these devices are passing the one-decade mark of their service life, so hardware damage/deterioration can't be ruled out either - but the above is an easy test.

The issue occurs both while it's in use and while it's idle.
So in my particular case, it's difficult to suspect sleep states.
Thanks for passing on this valuable info for some other day.

That is not at all at odds with the issue described above; the issue is with transitions between (and out of) C-states, something the CPU does all the time (pretty much regardless of whether it is idle or under load).
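
If you want to see this for yourself, the Linux cpuidle sysfs interface exposes per-state entry counters under /sys/devices/system/cpu/cpuN/cpuidle/stateM/ (the `name` and `usage` files). A rough sketch, assuming Python is available on the box (it is not installed by default on OpenWrt) and that the cpuidle directory exists for cpu0; the function names are mine, not from any tool:

```python
import time
from pathlib import Path

CPUIDLE_DIR = "/sys/devices/system/cpu/cpu0/cpuidle"


def read_cstate_usage(cpuidle_dir=CPUIDLE_DIR):
    """Return {state name: number of times entered} for one CPU."""
    usage = {}
    for state in sorted(Path(cpuidle_dir).glob("state*")):
        name = (state / "name").read_text().strip()
        usage[name] = int((state / "usage").read_text())
    return usage


def entries_between(before, after):
    """Per-state difference between two read_cstate_usage() snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}


if __name__ == "__main__":
    if Path(CPUIDLE_DIR).is_dir():
        first = read_cstate_usage()
        time.sleep(1)
        second = read_cstate_usage()
        # Even on a "busy" system, expect the counters to move within one second.
        print(entries_between(first, second))
    else:
        print("cpuidle sysfs directory not found")
```

Run it once while the box is idle and once under load; if the deeper states still show entries in both cases, that matches the point above that these transitions happen constantly.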

Are disk errors covered by fsck, if it doesn't report any errors?

In the frozen state, even the power button doesn't work.
Can that be a RAM issue?
Does the CPU need RAM to process the power button?

Also, I am able to hit the issue about once in 5 reboots, whereas if I leave it on, it takes 2-3 days to hit. Does that provide any debugging clue? What happens after a reboot that is special?

And I did cross-check: it does come up fine for 2-3 seconds, DHCP serves an IP, the router responds to ping, and then it hangs.

Disk errors are at best covered by SMART data, if the disk actually starts and is alive.

No. But the POST checks for basic signs of life in the CPU, RAM and HDD. If they are dead, the POST will fail and the BIOS will not even start the boot.

But if the 5VSB voltage or ACPI circuitry is broken in any way, the power switch won’t work.