[Solved] WD MyBook Live Duo / two disks

No real idea either. I think procd/init gets stuck waiting for something to happen (i.e. blocked io) between:

s = service_alloc(name);

and

ret = service_update(s, tb, add);

I think it must be somewhere in service_update() just before the call to rc() in

rc(s->name, "running");

Because otherwise we would see this message to appear on the MBL DUO:

  [    6.210242] procd: running /etc/init.d/ubus running

As for the urandom messages:

 [    6.225896] random: ubusd: uninitialized urandom read (4 bytes read)

These are generated by the kernel's random.c:
https://elixir.bootlin.com/linux/v4.14.54/source/drivers/char/random.c#L1868

And I think the reason why they show up in the bootlog is because ubusd uses /dev/urandom for random id generation in ubus_alloc_id().

I don't think the problem is in ubusd, since otherwise we would simply see a timeout after 50 seconds. Something must be blocking in procd and because of this the procd daemon does no longer ping the watchdog.... And hence the MBL DUO reboots.

One more strange thing (two disks, untouched openwrt):

[    8.780355] eth0: link is up, 1000 FDX, pause enabled
[    8.785520] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[    8.791385] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 1024.543024] random: crng init done

It seems that the random process is not initialized properly in the boot process. This message always comes a few minutes later. I haven't seen such a thing on any other device.

edit:
There is delay on other platforms as well (x86 virtual machine with 18.06-rc1), but much less:

[    5.533512] kmodloader: done loading kernel modules from /etc/modules.d/*
[   15.060086] random: crng init done
[   15.070062] random: 1 urandom warning(s) missed due to ratelimiting

If you could compile such a kernel, I would be happy to test it...

I played around with x86 virtual machine with two disks. Created two identical disks, and did the initial setup (network, dhcp, new partitions, etc.), like I would do on MBL Duo. I know it's not necessarily the same error, but after a few successful reboot, the machine suddenly started to boot off the second disk. :open_mouth: Weird... could there be any problem with disk enumeration?

Peeking into the detailed kernel logs, I found that during device creation the /sbin/hotplug-call script runs several times. In a successful boot these log entries are in pairs: launched/finished.

[    4.614312] procd: Launched hotplug exec instance, pid=554
...
[    4.773213] procd: Finished hotplug exec instance, pid=554

In a failed boot there is one process without "finished" entry.

[    6.410719] procd: Launched hotplug exec instance, pid=594

If anyone is interested I can attach these detailed logs, but they are ~50kb each...

hm, kinda. The x86 image uses either root=UUID (for ext4) or rootfs=squashfs to discover the rootfs on all partitions and disks (and this can be racy). On the MBLs this is currently fixed to root=/dev/sda2. So there's a slight difference there.

hm, could it be that the problem is the "concurrent access"? I.e. access to both disks at the same time? If so, you should be able to reproduce the bug in the failsafe by loading the drives on the same time... ie:

# dd if=/dev/sda of=/dev/null &
# dd if=/dev/sdb of=/dev/null &

Well, I pushed a patch to the ML:

apm821xx: attempt to fix sata access freezes. It would be somewhat comical, if all that was necessary was to set a few bits. But oh well...
Anyway, once the patch gets picked up by master, the buildbots will automatically create images.

Thank you very much for all your efforts. Can i try your patch somehow? I am willing to try it. Do i need to go through the image generation process?
I am currently on 17.01.4 and experiencing the same issues as the other forum members.

Thanks

The phase1 builder just finished the process of making an -snapshot / -master image.

http://phase1.builds.lede-project.org/builders/apm821xx%2Fsata/builds/911

you can get the image from http://downloads.lede-project.org/snapshots/targets/apm821xx/sata/
(make sure the date is 16th July! in case you hit some sort of cache).

Oh, make no mistake. It's an "attempt". See apm821xx: attempt to fix sata access freezes.

so, it will depend on whenever it magically fixes the problem or not. If it does then it will get ported to all the active releases [17.01.6 (if there is such a release) and 18.06].

Hello,

I can report that your fix worked! Boots up fine with two drives even with more partitions, and no more reboot on concurrent disk access!

Thanks again!

Edit: One issue I found but I think is irrelevant with your fix is that when I edit mount points through luci, openwrt will freeze and needs a hard restart

1 Like

Well, I was about to say, that it would be nice to have some "Tested-by: " tags. But...

Ok, do you think you can write down what you did? I can certainly give it a try on the MBL Single too.

To replicate the issue with the mount points, do the following:
install luci
go to System->Mount points
add
select /dev/sda3 by uuid
mount to /mnt/data
save and apply

The applying configuration timer times out, I am thrown out of ssh, I can still ping the device but not ssh no http!
More testing shows that this issue existed on 17.01.4 as well

Still testing your fix by the way, everything is working smoothly

1 Like

Oh boy.

Well, the good news is: I quickly found this PR for luci. It talks about this very same issue

Default behaviour of changes to fstab (Mount Points) was
to use /etc/init.d/fstab restart, however this unmounts
filesystems via block umount which can cause the device
to fail, so replace the initscript call with an exec
of 'block mount'.

The bad news: The PR has been abandoned. I mean you can manually apply the patch

and try it again. At least for the MBL Single it did the trick.

You are correct once again! I made the change and it worked

Thanks

As silly as it sounds, at the moment, I don't have the two unused disks to test on my MBL Duo.

What I do have is a single spare disk and nothing short of awesome news: It seems your patch unleashed the full power of the SATA port. Where I was previously hitting a really hard limit at around 82 MB/s for reading and 27 MB/s for writing, I am now getting this:

root@OpenWrt:/mnt# time dd if=/dev/zero of=tempfile bs=1M count=1024
1024+0 records in
1024+0 records out
real    0m 13.65s
user    0m 0.01s
sys     0m 11.89s

root@OpenWrt:/mnt# time dd if=tempfile of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
real    0m 8.41s
user    0m 0.01s
sys     0m 4.70s

This means: 121 MB/s reading and 75 MB/s writing!

Seems like our collective assumption -- that the original MBL firmware's significantly better read/write performance was owed to the nonstandard 64 kB block size -- was wrong. It was the SATA driver all along.

Edit: The drive is a WD Green WD10EARX taken from an older MBL Single. I repeated the test a few times with even larger files to rule out any caching, I'm still seeing the same great performance. OpenWrt is now completely on par with the original MBL firmware's performance.

well, I left a comment about it:

you can maybe do the same.

well, I can add that to the 18.06 commit message.

1 Like
root@RED:/mnt/data# time dd if=/dev/zero of=tempfile bs=1M count=1024
1024+0 records in
1024+0 records out
real    0m 15.38s
user    0m 0.01s
sys     0m 11.19s

root@RED:/mnt/data# time dd if=tempfile of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
real    0m 6.71s
user    0m 0.00s
sys     0m 5.36s

Thank you for confirming my findings. Your results correspond to 152 MB/s reading, 66 MB/s writing (which seems a tad low, I wonder if your drive is still lazy init'ing the file system).

In any case, I posted the patches to 18.06 and 17.01 (no idea if there will be a 17.01.6 or not) with a updated commit message.

https://patchwork.ozlabs.org/project/openwrt/list/?submitter=72473

Also "discord being discord":

Let others join the conversation

This topic is clearly important to you – you've posted more than 20% of the replies here.

Are you sure you’re providing adequate time for other people to share their points of view, too?

1 Like