Troubleshoot memory issue?

RacerRoses · January 8, 2020, 7:39pm

We have a fleet of about 100 OpenWRT devices. Across the entire fleet we observe about 5 unexpected per day. We have streaming logs set up via rsyslog but we don't see any messages from before the restart.

As far as I can tell we are compiling OpenWRT with KERNEL_CRASHLOG enabled, but I don't see anything in /sys/kernel/debug/crashlog. Other advice on the forum suggests compiling with the default configuration but that's not practical for us.

The absence of other evidence makes me suspect it's a memory issue. Do you have any tips on how to detect this, or debug unexpected crashes in general? Logging onto 50 boxes and hoping to see something in the terminal traceback is not great. We could sample the memory every 5-10 seconds and log it? If the memory is growing slowly over time we could see that but not if the memory spikes all at once. We could deploy an OOM killer to our development OpenWRT instances and see if anything gets killed due to OOM?
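For what it's worth, the sampling idea could be sketched roughly like this. It assumes BusyBox `awk` and `logger` are on the image, that the kernel exposes `MemAvailable` in /proc/meminfo, and that rsyslog forwards the messages off the box. The `memwatch` tag, the function names and the 5-second default are made up for illustration.

```shell
#!/bin/sh
# Sketch only: periodically log MemAvailable so the remote syslog server
# keeps the last value seen before an unexpected reboot.
# The function names, the "memwatch" tag and the default interval are
# illustrative choices, not OpenWrt defaults.

# Print the MemAvailable value (in kB) from a meminfo-format file.
# Defaults to /proc/meminfo when no file is given.
mem_available_kb() {
    awk '/^MemAvailable:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Log the current value every $1 seconds (default 5) until killed.
watch_memory() {
    interval="${1:-5}"
    while :; do
        logger -t memwatch "MemAvailable=$(mem_available_kb)kB"
        sleep "$interval"
    done
}
```

Running `watch_memory 5 &` from a startup script would start the sampler. As noted above, this only catches gradual growth; a spike between two samples would still go unseen.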

Is there some way to write a file to disk in the event the system is about to crash due to OOM?

Thanks for your help. This is my first post; I hope I followed the rules. Please let me know if I should have posted this in a different forum.

lleachii · January 8, 2020, 10:45pm

@RacerRoses, welcome to the community!

I understood this to mean "unexpected reboots" - because I would see it as a blessing if 5 devices appear daily.

I haven't seen that. I have seen suggestions of installing the official firmware to determine if the issue persists (for testing purposes).

So, can you tell us what packages you've compiled/installed into your custom firmware?
Also, can you tell us the make/model of the crashing device?
Is the same device rebooting 5 times, or is it 5 different machines?
- If it is different machines, is there anything in their hardware or software that they have in common and that sets them apart from the others?

Which file?

fantom-x · January 8, 2020, 11:01pm

Did you try Configuring the OOM Killer? You could use kernel.panic to delay the reboot and then log in to the affected box to maybe see what happened.

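There are some parameters that could help you control the behaviour. According to the values below, an OOM should not cause a reboot (vm.panic_on_oom = 0), but an oops would (kernel.panic_on_oops = 1, with a reboot 3 seconds after the panic)...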

sysctl -a 2>&1 | grep panic
kernel.panic = 3
kernel.panic_on_oops = 1
kernel.panic_on_rcu_stall = 0
kernel.panic_on_warn = 0
vm.panic_on_oom = 0
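
To check these values across many boxes without reading the output by eye, something along these lines might work. It is a rough sketch: `sysctl_value` and `describe_panic_timeout` are hypothetical helper names, and the parsing assumes the `key = value` layout shown above.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, and the input is assumed
# to be "sysctl -a"-style lines of the form "key = value".

# Print the value of the key given as $1, reading sysctl output on stdin.
sysctl_value() {
    awk -F' = ' -v key="$1" '$1 == key { print $2 }'
}

# Describe what a given kernel.panic value means for reboot behaviour:
# 0 waits forever, a negative value reboots at once, a positive value
# reboots after that many seconds.
describe_panic_timeout() {
    case "$1" in
        0)  echo "panic: halt and wait (no automatic reboot)" ;;
        -*) echo "panic: reboot immediately" ;;
        *)  echo "panic: reboot after $1 s" ;;
    esac
}
```

For example, `describe_panic_timeout "$(sysctl -a 2>/dev/null | sysctl_value kernel.panic)"` would summarise the current setting, and `sysctl -w kernel.panic=60` would lengthen the delay until the next boot. After a panic the box probably won't answer on the network, so a longer delay most likely only helps if a serial console is attached.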