File corruption being seen in OpenWrt

StuA · February 2, 2022, 4:02pm

Hi,

We're using a squashfs+f2fs filesystem on a Gateworks Newport platform and are seeing file corruption that causes our application to fail to run. As there is no access to a power-off button, our units are switched off by pulling the power. Could this be causing the corruption? We regularly write to internal config (JSON) files, so if the sudden shutdown happens at the wrong time I guess that could cause our problem. Any suggestions about the best way to prevent this?

I suppose we could keep a last known good copy of the config file, available should the live file become corrupted? Then add a recovery step that detects corruption of the live file on power-up and falls back to the last good one?

Thanks for any advice,
Stuart
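
The fallback idea above can be sketched roughly as follows. This is only an illustration, assuming the application is written in Python (the thread does not say) and that a backup copy is kept next to the live file. The names `load_config`, `live_path` and `backup_path` are hypothetical.

```python
import json


def load_config(live_path, backup_path):
    """Return the parsed live config, or the backup if the live file is unusable.

    Hypothetical helper: both paths are assumptions, not part of any OpenWrt API.
    """
    try:
        with open(live_path, "r", encoding="utf-8") as f:
            return json.load(f)
    except (OSError, ValueError):
        # A missing file raises OSError; a truncated, empty or zero-filled
        # file raises json.JSONDecodeError, which is a subclass of ValueError.
        pass
    # Fall back to the last known good copy. If this also fails, the
    # exception propagates so the caller can decide what to do.
    with open(backup_path, "r", encoding="utf-8") as f:
        return json.load(f)
```

Parsing only catches structural damage. A file that is valid JSON but stale or semantically wrong would need an extra check, such as a checksum or version field.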

trendy · February 2, 2022, 4:32pm

By default, OpenWrt does not write anything to flash except permanent configuration changes, which should survive a reboot. Logs and other temporary data are written to tmpfs, which lives in RAM.
I don't know much about your application, but you might consider using a different storage medium to avoid corrupting the internal flash. NFS might be a solution if you don't want to use USB flash.

StuA · February 2, 2022, 4:53pm

Thanks for the response trendy.
We have onboard eMMC providing our storage on an Overlay squashfs+f2fs system.
There are quite a lot of application files and local config kept on this storage, and these are actively written to. We are also writing application logs to a microSD card which is mounted at startup.
We were thinking the fsync() call might be a way to ensure any changes are immediately written and not left hanging in the overlay after a write call in the application, but my understanding of squashfs etc. is limited.

Cheddoleum · February 2, 2022, 5:08pm

If you're running arbitrary applications you need to at least execute the "poweroff" command before pulling the power. Depending on the device this will not necessarily cause a system shutdown but it will sync the disks and shut down applications that are running via the init system.

StuA · February 8, 2022, 2:14pm

Thanks Cheddoleum,

Unfortunately we do not have any external soft button we can hook a poweroff into.
Does anyone know when the sync is triggered? It is obviously not done immediately after every write, or else we wouldn't see the corruption (or at least not as often).
Would it be wise to force a sync within the application's write operation to reduce the time any changed files are in limbo?
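
For illustration, forcing the sync from inside the application might look like the sketch below, again assuming Python. `write_and_sync` is a hypothetical name. `os.fsync()` is the standard wrapper around the fsync() call mentioned earlier in the thread.

```python
import os


def write_and_sync(path, data):
    """Write bytes to path and block until the kernel reports them flushed.

    Hypothetical helper for illustration only.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's userspace buffer to the kernel
        os.fsync(f.fileno())  # ask the kernel to flush this file to the device
```

This shortens the window in which data sits only in the page cache, but it does not close it. Opening with `"wb"` truncates the existing file first, so a power cut between the truncate and the fsync can still leave an empty or partial file.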

eduperez · February 8, 2022, 5:30pm

Is this issue a file corruption or a filesystem corruption?

As a general "best practice", I would advise always writing to a temporary file and renaming it over the original after closing. This way, you always keep a "good" version of the file.
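
A minimal sketch of the write-then-rename pattern, assuming Python on Linux. `atomic_write_json` is a hypothetical name. The rename does not flush data by itself, so the sketch calls fsync on the temporary file before the rename and on the containing directory afterwards.

```python
import json
import os
import tempfile


def atomic_write_json(path, obj):
    """Replace path with a JSON dump of obj via a temp file and rename.

    Hypothetical helper; assumes a POSIX filesystem where rename is atomic.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be in the same directory (same filesystem) so the
    # rename cannot degrade into a copy.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())    # file contents reach storage first
        os.replace(tmp_path, path)  # atomic switch from old to new
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Flush the directory entry so the rename itself survives power loss.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A reader then sees either the complete old file or the complete new file. Note that `mkstemp` creates the file with mode 0600, so the permissions may need adjusting if other users read the config.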

StuA · February 8, 2022, 5:47pm

Hi Eduperez,

It's definitely just file corruption; the rest of the system is fine.
One of our JSON config files is regularly written to by one of our microservices, and it is this file that seems to have the problem.
All the other unchanged files are ok, even in the same directory.

Would this two-step process force a sync to happen? I suppose the rename is quick, and if there's a power-off the worst that happens is that some recent changes are lost, not corruption. But the overlay/sync problem might still exist.

Cheddoleum · February 8, 2022, 6:05pm

Well, the purpose of "poweroff" in this case is mostly just to shut down your application; I was thinking you could run it over SSH or via the API. Pulling the power plug on any application that does frequent or routine disk writes is going to be a problem on any platform unless you plan for it, first and foremost in your choice of filesystem and its configuration options, and secondarily in the design of the application itself.

I suspect you're not in a position to go down that road in the near term, so I'd suggest you explore ways you can stop your application before pulling the plug. You can shut down via the web UI, you can send remote commands via SSH or the HTTP API, scripted or noninteractively. It all depends on your use case.

But however you do it, by far the easiest way to solve your problem is to put in place a step to stop your application normally before pulling the plug.

eduperez · February 9, 2022, 4:29pm

The point is that renaming a file is atomic: either it happens completely or not at all, whereas a write to a file can always be interrupted in the middle of an operation.

So, my proposal is to rework your application around this fact, and make it resilient against interruptions.

Another option is to write to a new file on each update, then use a symlink to point to the latest version. No matter what happens or when, your symlink will always point to the latest valid file.
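
The symlink variant could be sketched like this, assuming Python on Linux. `publish_version` and the `<name>.<version>` file naming are hypothetical. An existing symlink cannot be overwritten in place, so the sketch creates a temporary link and renames it over the old one. The guarantee only holds if the new file is fsynced before the link is switched.

```python
import json
import os


def publish_version(directory, link_name, obj, version):
    """Write obj to a new versioned file, then atomically repoint the symlink.

    Hypothetical helper; the naming scheme is an assumption for illustration.
    """
    target = f"{link_name}.{version}"
    with open(os.path.join(directory, target), "w", encoding="utf-8") as f:
        json.dump(obj, f)
        f.flush()
        os.fsync(f.fileno())       # new file is on storage before it is published
    tmp_link = os.path.join(directory, f".{link_name}.tmp")
    if os.path.lexists(tmp_link):  # leftover from an earlier interrupted run
        os.unlink(tmp_link)
    os.symlink(target, tmp_link)   # relative target, resolved inside directory
    # rename() acts on the link itself, replacing the old symlink atomically.
    os.replace(tmp_link, os.path.join(directory, link_name))
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)           # persist the directory update
    finally:
        os.close(dir_fd)
```

Old versions accumulate with this scheme, so a real application would also need to prune them, keeping at least the file the symlink currently points to.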