USB-SSD not waking (spinning) up quick enough, causing read/write-errors

nicommander · November 4, 2019, 1:09pm

Hello

I am using a D-Link DIR-860L with a Kingston USB3.0 Case which has a Kingston SATA-SSD Drive installed. This particular USB-Case sends the SATA-Drive to Sleep after 1 Minute without any access.
That is good so far, because most of the time the Drive is sleeping and this way it saves about 90% of the Energy.

(Before anybody suggests that I use a different usb case and send the drive to sleep using any openwrt command - that is not working - I tried it allready - thats why I use it this way).

When I access the Drive is wakes up and can be used.
Unfortunately sometimes this does not happen quick enough.
Then I get errors like the following in the Log:
[ 1070.035054] EXT4-fs error (device sda1): __ext4_new_inode:1120: comm smbd: failed to insert inode 22020103: doubly allocated?
(this was a writing attempt)

But it can also happen with reading attempts, then it says something like "directory /sda1/xxx can not be accessed"

The actual entry in the error log can be different but the cause is most likely just that the drive does not respond quick enough because it is sleeping. Reading Errors on a file showing the file as corrupt can be solved by simply rebooting the machine. Then the file is ok again. This shows that the drive and file is ok, just the machine is the problem.

The Filesystem is fine too and shows no errors.

I remember that I read a lot about too long spinning up times of harddisks causing similar problems. But today I can not find them anymore.
Maybe somebody has an Idea how to tell the openwrt that this drive needs a little longer to respond (spin up) and that this would be ok.
How do I do that ?

Thanks
Nico

eduperez · November 4, 2019, 2:00pm

I am afraid I have questions not answers...

Do SSD disks sleep down at all? They obviously cannot spin down...
Even if they sleep down, recovery time should be instantaneous, there should be no delay at all.
Assuming they sleep down, and assuming there is a delay, it should be way shorter than the delay on a spinning HDD, and I have never seen such errors on spinning HDDs.
And, if the that delay is causing those errors, I would expect to see "I/O timeout" errors, not filesystem errors.
Rebooting the device prompts EXT4 to do a filesystem check, and that could be fixing the filesystem errors silently.
All this leads me to the grand question: are you sure it's the energy saving that is causing those errors? could it be that you are experiencing filesystem corruption because of something else?

nicommander · November 4, 2019, 2:28pm

Yes thats correct, but:
The SATA Controller is able to send a sleep commend to a SATA Drive.
The Drive then does what it is programmed to do with this command.
(This is what makes it so complicated here, because I can not send such SATA Commend through the USB-Connection)
I don't know all about it, but I can tell that the LED Light on the USB Case turns OFF after 1 Minute of not using it and then also the Power Consumption goes down. When Data is needed it turns ON again.
So its perfect. I did use this combination for years on my old router. Never had such problem.

2.+3. Yes wake-up-time is way shorter than spin-up-time. But since the openwrt device does not know the drive is sleeping it expects response within no time. According to the log, the wakeup-call and error are less 0.8sec apart. this might be too short.

I did see lots of different errors in the log. Once the drive is awake there are no problems at all.
Its no solution to repair what goes wrong. It would be better to avoid the problem at all.
Yes I am sure. I checked the filesystem and all. Files were unable to play and reported to be currupted. This would not be repaired by a reboot of the machine.

anon50098793 · November 4, 2019, 6:11pm

test this with another service and see if similar corruption/fs issues persist...

maybe scp after the drive has gone to sleep

eduperez · November 4, 2019, 6:47pm

Could you post some logs, please?

nicommander · November 5, 2019, 8:23am

I tried that with

miniDLNA with TV as Mediaplayer
first file accessed (which wakes the drive) cannot be played and is reported corrupt. When I reboot and choose this file second it works
samba / network share
first read or write attempt fails
if I reboot and try with woken drive it works

What does scp mean ?

nicommander · November 5, 2019, 8:35am

What do you want to see ?
The booting shows no errors at all, except that the minidlna can not access the drive because that is mounted later and the minidlna then restartet.

The error itself shows different log entries.
So the actual error entry is not much helping I think.
It can be like this:
[ 1070.035054] EXT4-fs error (device sda1): __ext4_new_inode:1120: comm smbd: failed to insert inode 22020103: doubly allocated?
(writing attempt)

The only useful thing I saw was the connection between the problem (unplayable file), the error-log-entry and the fact that the previous entry in the error-log is always a wakeup-call to the drive. The time gap between the wakeup-call and the actual error line is less than 0.8sec, which I think is a little short.

The wakeup-call is logged like this:
usb 2-1: reset SuperSpeed USB device number 2 using xhci-mtk
this line always comes when the drive is accessed in sleep-mode and needs to be woken.
This actually indicates that the openwrt knows that the drive is sleeping or that it does that because the drive did not respond to a previous access-call. However it does not give the drive much time after that.
After less than 0.8sec it reports an error which can be totally different.
Only the time gap is nearly the same.

anon50098793 · November 5, 2019, 9:43am

Its a command that will allow you to read/write to the drive remotely... ( without smb or dnla )

You need to isolate the drive from the services... disable minidnla... disable samba.... and try with something else once the drive is asleep.

Also set your debug level higher...

eduperez · November 5, 2019, 9:58am

What I want to see is this:

nicommander · November 5, 2019, 9:59am

Wouldn't it just be possible to somehow tell the openwrt to wait longer for drive responses ?
I think then just everything will be fine.

I know I read about this before when people reported a similar problem with (real) HDDs not spinning up fast enough and the system did not detect them or had slimilar reading problems.

I think it would not matter how I access the drive or which application or service I use.
I could try on the command line to do some stuff and maybe it will cause a different error entry
but the cause is still the same. Drive not responding in time.

How do I change the Log-Level ?
Maybe I can find out more with that.

anon50098793 · November 5, 2019, 10:15am

98% of solving a problem is defining it....

OpenWrt is an ecosystem consisting of;

1) VENDOR HARDWARE ( internal and external )
2) VENDOR FIRMWARE ( internal and external )
3) The Linux Kernel
4) Linux Kernel Drivers ( usb / ext4 )
5) Runtime Binaries ( init / ubus / netifd )
6) Application Binaries ( samba / minidnla )
7) Userland Scripts and Utilities ( shell / hdparm / etc. )

So which part of OpenWrt did you want to "tell"?

nicommander · November 5, 2019, 10:16am

At first there were some Filesystem errors which showed on booting.
I repaired these. They were most likely caused by failed writing attempts.
After that the booting was fine.

Accessing the drive showed different errors like this one:
[ 1070.035054] EXT4-fs error (device sda1): __ext4_new_inode:1120: comm smbd: failed to insert inode 22020103: doubly allocated?
or another one (which I did not copy)
there is said something like "cannot access directory sda/dlna/something"
This was a data access to a directory listing which failed. The directory was then reportet to be empty (which it actually was not)
there were also some other errors which I do not remember exactly.
(If really needed I can copy them the next time I see them)
The error is only different because I am accessing the drive differently.
Like read a file, write a file, read file listing of directory etc.
The cause is still the same, the drive does not respond in time.

As I said in another reply there are so far 2 important log lines.
1st. is a wakeup call to the drive which looks like this:
usb 2-1: reset SuperSpeed USB device number 2 using xhci-mtk
this actually is weird because the system can not know that the drive is in sleep mode.
either it does know or this line is already a (second) attempt to access the drive to maybe reset it or something because a previous attempt failed (or was not answered in time).

2nd. is the error line with whatever content which shows less than 0.8sec after the first line.
This might indicate that the drive did not respond in time.

I do not know if there was another access attempt before this usb reset attempt.
As soon as I do access data, the LED on the USB Case goes on. And then either
it works (response came quick enough) or it does not (response came too late - error is reported)

I did use the same USB case before on another router.
There this was all working well because the router did wait for the drive.
Waiting means maybe 1 or 2sec not more.

nicommander · November 5, 2019, 10:21am

I know that its a complex structure.
The part I want to tell must be a system part which handles the access to drives or to usb devices.

Maybe we can find out which part reports the problem first.
Like "who knows first must be responsible"

Would this be possible be setting the log level higher ?
How do I do that ?

anon50098793 · November 5, 2019, 10:24am

FWIW... there is a fair chance your issue is between 4 and 5...

This happens to be where hotplug resides...

Lookup some articles on hotplug and loglevels... when the drive puts itself to sleep... it should notify the kernel...

it looks as if something is broken at this stage of events. but first it's wise to exclude applications ...

nicommander · November 5, 2019, 10:29am

I did have a different problem before, I installed the
opkg install block-mount
package to make the usb work and this detected the drive again as soon as it went to sleep.
(the sleep was still handled by the usb case not the system)
Anytime the case send the drive to sleep, the block mount package detected a new(?) USB device and woke it up again just a sec later.
thats why I removed this package.

Then I needed to mount the drive manually, but thats ok.
So now there is most likely no such hot plug function anymore.

nicommander · November 5, 2019, 10:39am

Please tell me how to do that.
So that I can collect some new data here

eduperez · November 5, 2019, 11:09am

Copy-pasting the actual error messages would have been one hundred times more informative for us, and one thousand times less time-consuming for you.

I suspect we are focusing on the drive sleeping, but I would like to explore other possibilities, starting with the USB controller going to sleep.

nicommander · November 5, 2019, 11:29am

I did not copy all error messages since they did not point to a specific problem.
I am a programmer since 20 years and I have seen similar things before.
When the error message is totally different each time then thats not the problem but something else is.
Thats why the error message says something like "cannot access directory" but it actually can.
Same with writing or access files.

Therefore my focus is on the usb / sleep possibility.
But since the USB Case handles that sleeping mode function itself,
I am not sure if the system gets informed of that at all.

When I still had the block mount package installed it redetected the usb drive as soon as it went to sleep.
And by that waking it up again each time it went to sleep. So no sleep possible at all.
Maybe the system is informed when sleeping mode is enabled. I don't know.

on wakeup there is an entry in the log which is:
usb 2-1: reset SuperSpeed USB device number 2 using xhci-mtk
It is possible that this is done because the system knows the drive is sleeping and wakes it up with that (but then why not wait for it to wake up...? - this could take 5-10sec or more ...)
or
This is done because the drive did not respond (usually it should within notime) and the system just tries to reset it to make it work again. After this reset usb entry in the log, the next entry is an error (of any kind) following not more than 0.8sec after it. So there is no waiting.

Still it is strange that even if the system knows that the drive is in sleep, then why not wait for it to wake up ?

tmomas · November 5, 2019, 11:30am

Can you please copy & paste a complete log, instead of atomar snippets?

nicommander · November 5, 2019, 11:36am

Before that it would be good to raise the log level to see more that happens in that moment.
How do I do that ?