The SD card becomes read-only mounted at random times during heavy writing. I've reproduced this SD card crash at least 15 times, and logread
always showed usb 1-1.1: reset high-speed USB device number 3 using ehci-platform
.
To enable USB storage logging, I compiled OpenWrt 22.03.4 on a desktop at home with a small patch to config/Config-kernel.in
:
config KERNEL_USB_STORAGE_DEBUG
bool
default y
Here are a few notable lines from the configuration:
CONFIG_TARGET_ath79_generic_DEVICE_glinet_gl-ar750=y
CONFIG_PCAP_HAS_USB=y
CONFIG_PACKAGE_kmod-usbmon=y
The SD card has an ext4 partition that is mounted at /dev/sda1
.
Then, I started tcpdump -i usbmon1 -w - | nc 192.168.123.456 1234
in the background. This captures USB packets and sends them to my desktop.
A fio
benchmark reproduced the issue, this writes files as fast as the disk allows.
Expand the `fio` output
root@OpenWrt ~ # rm /mnt/sda1/fio/*; fio --ioengine=sync --fallocate=none --rw=write --filesize=1m-9m --nrfiles=11000 --openfiles=16 --directory=/mnt/sda1/fio --bs=1M --file_service_type=sequential --name=a --eta-newline=1m --runtime=1d --time_based
a: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=sync, iodepth=1
fio-3.34
Starting 1 thread
a: Laying out IO files (11000 files / total 49900MiB)
Jobs: 1 (f=1): [W(1)][0.1%][w=14.0MiB/s][w=13 IOPS][eta 23h:58m:59s]
Jobs: 1 (f=1): [W(1)][0.1%][w=12.0MiB/s][w=12 IOPS][eta 23h:57m:59s]
Jobs: 1 (f=1): [W(1)][0.2%][w=7175KiB/s][w=7 IOPS][eta 23h:56m:59s]
Jobs: 1 (f=1): [W(1)][0.3%][w=8183KiB/s][w=7 IOPS][eta 23h:55m:58s]
Jobs: 1 (f=1): [W(1)][0.3%][w=13.0MiB/s][w=13 IOPS][eta 23h:54m:58s]
Jobs: 1 (f=1): [W(1)][0.4%][w=9206KiB/s][w=8 IOPS][eta 23h:53m:58s]
Jobs: 1 (f=1): [W(1)][0.5%][w=8192KiB/s][w=8 IOPS][eta 23h:52m:58s]
Jobs: 1 (f=1): [W(1)][0.6%][w=12.0MiB/s][w=12 IOPS][eta 23h:51m:58s]
Jobs: 1 (f=1): [W(1)][0.6%][w=6144KiB/s][w=6 IOPS][eta 23h:50m:58s]
Jobs: 1 (f=1): [W(1)][0.7%][w=14.0MiB/s][w=14 IOPS][eta 23h:49m:58s]
Jobs: 1 (f=1): [W(1)][0.8%][w=9206KiB/s][w=8 IOPS][eta 23h:48m:58s]
Jobs: 1 (f=1): [W(1)][0.8%][w=8175KiB/s][w=7 IOPS][eta 23h:47m:58s]
Jobs: 1 (f=1): [W(1)][0.9%][w=12.0MiB/s][w=11 IOPS][eta 23h:46m:58s]
Jobs: 1 (f=1): [W(1)][1.0%][w=7160KiB/s][w=6 IOPS][eta 23h:45m:57s]
Jobs: 1 (f=1): [W(1)][1.0%][w=14.0MiB/s][w=13 IOPS][eta 23h:44m:57s]
Jobs: 1 (f=1): [W(1)][1.1%][w=9197KiB/s][w=8 IOPS][eta 23h:43m:57s]
Jobs: 1 (f=1): [W(1)][1.2%][w=14.0MiB/s][w=14 IOPS][eta 23h:42m:57s]
Jobs: 1 (f=1): [W(1)][1.3%][w=7160KiB/s][w=6 IOPS][eta 23h:41m:57s]
Jobs: 1 (f=1): [W(1)][1.3%][w=11.0MiB/s][w=10 IOPS][eta 23h:40m:56s]
Jobs: 1 (f=1): [W(1)][1.4%][w=11.0MiB/s][w=11 IOPS][eta 23h:39m:57s]
Jobs: 1 (f=1): [W(1)][1.5%][w=13.0MiB/s][w=12 IOPS][eta 23h:38m:57s]
Jobs: 1 (f=1): [W(1)][1.5%][w=10.9MiB/s][w=10 IOPS][eta 23h:37m:57s]
Jobs: 1 (f=1): [W(1)][1.6%][w=7153KiB/s][w=6 IOPS][eta 23h:36m:57s]
Jobs: 1 (f=1): [W(1)][1.7%][w=11.0MiB/s][w=10 IOPS][eta 23h:35m:56s]
Jobs: 1 (f=1): [W(1)][1.7%][w=6137KiB/s][w=5 IOPS][eta 23h:34m:56s]
Jobs: 1 (f=1): [W(1)][1.8%][w=8192KiB/s][w=8 IOPS][eta 23h:33m:56s]
fio: io_u error on file /mnt/sda1/fio/a.0.3721: Read-only file system: write offset=3244032, buflen=950272
fio: pid=7639, err=30/file:io_u.c:1889, func=io_u error, error=Read-only file system
a: (groupid=0, jobs=1): err=30 (file:io_u.c:1889, func=io_u error, error=Read-only file system): pid=7639: Thu Apr 20 12:15:08 2023
write: IOPS=10, BW=10.5MiB/s (11.1MB/s)(16.5GiB/1596981msec); 0 zone resets
clat (msec): min=15, max=671, avg=92.46, stdev=68.28
lat (msec): min=16, max=672, avg=93.62, stdev=68.11
clat percentiles (msec):
| 1.00th=[ 33], 5.00th=[ 41], 10.00th=[ 45], 20.00th=[ 57],
| 30.00th=[ 66], 40.00th=[ 71], 50.00th=[ 77], 60.00th=[ 80],
| 70.00th=[ 85], 80.00th=[ 94], 90.00th=[ 159], 95.00th=[ 234],
| 99.00th=[ 405], 99.50th=[ 439], 99.90th=[ 584], 99.95th=[ 600],
| 99.99th=[ 659]
bw ( KiB/s): min= 2035, max=24576, per=100.00%, avg=10912.92, stdev=3699.94, samples=3159
iops : min= 1, max= 24, avg=10.28, stdev= 3.66, samples=3159
lat (msec) : 20=0.15%, 50=13.75%, 100=68.29%, 250=13.48%, 500=4.04%
lat (msec) : 750=0.28%
cpu : usr=0.60%, sys=19.52%, ctx=151604, majf=13, minf=0
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,16847,0,0 short=0,1,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=10.5MiB/s (11.1MB/s), 10.5MiB/s-10.5MiB/s (11.1MB/s-11.1MB/s), io=16.5GiB (17.7GB), run=1596981-1596981msec
Disk stats (read/write):
sda: ios=2/149864, merge=0/3683, ticks=42/1532370, in_queue=1532413, util=98.06%
root@OpenWrt ~ #
This is a snippet of the logread
output, see the full log.
Thu Apr 20 12:14:56 2023 kern.debug kernel: [77784.155898] xfer 13 bytes
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.658754] Status code 0; transferred 13/13
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.658772] -- transfer complete
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.658780] Bulk status result = 0
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.658791] Bulk Status S 0x53425355 T 0x2f2cea R 122880 Stat 0x1
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.658801] -- transport indicates command failure
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.658808] -- unexpectedly short transfer
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.658815] Issuing auto-REQUEST_SENSE
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.658833] Bulk Command S 0x43425355 T 0x2f2ceb L 18 F 128 Trg 0 LUN 0 CL 6
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.658840] xfer 31 bytes
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.658944] Status code -32; transferred 0/31
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.658956] clearing endpoint halt for pipe 0xc0010300
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.658968] rq=01 rqtype=02 value=0000 index=02 len=0
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.659559] result = 0
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.659570] Bulk command transfer result=2
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.659579] -- auto-sense failure
Thu Apr 20 12:15:02 2023 kern.info kernel: [77789.778086] usb 1-1.1: reset high-speed USB device number 3 using ehci-platform
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.930975] usb_reset_device returns 0
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.930996] scsi cmd done, result=0x70000
Notice how the first two log lines have a time gap of 5.5 seconds, that is anomalous. The driver code that logged auto-sense failure
is: https://elixir.bootlin.com/linux/v5.10.161/source/drivers/usb/storage/transport.c#L759.
This is the usbmon capture at the same period, see the .pcap file for the rest. It seems like the driver never waits for the 'sense response data', it sends CLEAR FEATURE Request
while I expected to see URB_BULK in
instead (though I have no expertise in the SCSI protocol).
What I already tried:
-
Checking for bad blocks returned none
Expand
e2fsck -cc
outputroot@OpenWrt ~ # e2fsck -cc /dev/sda1 e2fsck 1.46.5 (30-Dec-2021) Checking for bad blocks (non-destructive read-write test) Testing with random pattern: done /dev/sda1: Updating bad block inode. Pass 1: Checking inodes, blocks, and sizes Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information /dev/sda1: ***** FILE SYSTEM WAS MODIFIED ***** /dev/sda1: 11013/3866624 files (20.8% non-contiguous), 13098623/15449344 blocks root@OpenWrt ~ # dumpe2fs -b /dev/sda1 dumpe2fs 1.46.5 (30-Dec-2021) root@OpenWrt ~ #
-
Using a different SD card from another brand did not help
How would you troubleshoot this issue further? Could it be a Linux driver bug, a misbehaving USB controller, or something else?