USB storage driver resets at random when writing SD

The SD card becomes read-only mounted at random times during heavy writing. I've reproduced this SD card crash at least 15 times, and logread always showed usb 1-1.1: reset high-speed USB device number 3 using ehci-platform.

To enable USB storage logging, I compiled OpenWrt 22.03.4 on a desktop at home with a small patch to config/Config-kernel.in:

config KERNEL_USB_STORAGE_DEBUG
    bool
    default y

Here are a few notable lines from the configuration:

CONFIG_TARGET_ath79_generic_DEVICE_glinet_gl-ar750=y
CONFIG_PCAP_HAS_USB=y
CONFIG_PACKAGE_kmod-usbmon=y

The SD card has an ext4 partition that is mounted at /dev/sda1.

Then, I started tcpdump -i usbmon1 -w - | nc 192.168.123.456 1234 in the background. This captures USB packets and sends them to my desktop.

A fio benchmark reproduced the issue, this writes files as fast as the disk allows.

Expand the `fio` output
root@OpenWrt ~ # rm /mnt/sda1/fio/*; fio --ioengine=sync --fallocate=none --rw=write --filesize=1m-9m --nrfiles=11000 --openfiles=16 --directory=/mnt/sda1/fio --bs=1M --file_service_type=sequential --name=a --eta-newline=1m --runtime=1d --time_based
a: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=sync, iodepth=1
fio-3.34
Starting 1 thread
a: Laying out IO files (11000 files / total 49900MiB)
Jobs: 1 (f=1): [W(1)][0.1%][w=14.0MiB/s][w=13 IOPS][eta 23h:58m:59s]
Jobs: 1 (f=1): [W(1)][0.1%][w=12.0MiB/s][w=12 IOPS][eta 23h:57m:59s]
Jobs: 1 (f=1): [W(1)][0.2%][w=7175KiB/s][w=7 IOPS][eta 23h:56m:59s]
Jobs: 1 (f=1): [W(1)][0.3%][w=8183KiB/s][w=7 IOPS][eta 23h:55m:58s]
Jobs: 1 (f=1): [W(1)][0.3%][w=13.0MiB/s][w=13 IOPS][eta 23h:54m:58s]
Jobs: 1 (f=1): [W(1)][0.4%][w=9206KiB/s][w=8 IOPS][eta 23h:53m:58s]
Jobs: 1 (f=1): [W(1)][0.5%][w=8192KiB/s][w=8 IOPS][eta 23h:52m:58s]
Jobs: 1 (f=1): [W(1)][0.6%][w=12.0MiB/s][w=12 IOPS][eta 23h:51m:58s]
Jobs: 1 (f=1): [W(1)][0.6%][w=6144KiB/s][w=6 IOPS][eta 23h:50m:58s]
Jobs: 1 (f=1): [W(1)][0.7%][w=14.0MiB/s][w=14 IOPS][eta 23h:49m:58s]
Jobs: 1 (f=1): [W(1)][0.8%][w=9206KiB/s][w=8 IOPS][eta 23h:48m:58s]
Jobs: 1 (f=1): [W(1)][0.8%][w=8175KiB/s][w=7 IOPS][eta 23h:47m:58s]
Jobs: 1 (f=1): [W(1)][0.9%][w=12.0MiB/s][w=11 IOPS][eta 23h:46m:58s]
Jobs: 1 (f=1): [W(1)][1.0%][w=7160KiB/s][w=6 IOPS][eta 23h:45m:57s]
Jobs: 1 (f=1): [W(1)][1.0%][w=14.0MiB/s][w=13 IOPS][eta 23h:44m:57s]
Jobs: 1 (f=1): [W(1)][1.1%][w=9197KiB/s][w=8 IOPS][eta 23h:43m:57s]
Jobs: 1 (f=1): [W(1)][1.2%][w=14.0MiB/s][w=14 IOPS][eta 23h:42m:57s]
Jobs: 1 (f=1): [W(1)][1.3%][w=7160KiB/s][w=6 IOPS][eta 23h:41m:57s]
Jobs: 1 (f=1): [W(1)][1.3%][w=11.0MiB/s][w=10 IOPS][eta 23h:40m:56s]
Jobs: 1 (f=1): [W(1)][1.4%][w=11.0MiB/s][w=11 IOPS][eta 23h:39m:57s]
Jobs: 1 (f=1): [W(1)][1.5%][w=13.0MiB/s][w=12 IOPS][eta 23h:38m:57s]
Jobs: 1 (f=1): [W(1)][1.5%][w=10.9MiB/s][w=10 IOPS][eta 23h:37m:57s]
Jobs: 1 (f=1): [W(1)][1.6%][w=7153KiB/s][w=6 IOPS][eta 23h:36m:57s]
Jobs: 1 (f=1): [W(1)][1.7%][w=11.0MiB/s][w=10 IOPS][eta 23h:35m:56s]
Jobs: 1 (f=1): [W(1)][1.7%][w=6137KiB/s][w=5 IOPS][eta 23h:34m:56s]
Jobs: 1 (f=1): [W(1)][1.8%][w=8192KiB/s][w=8 IOPS][eta 23h:33m:56s]
fio: io_u error on file /mnt/sda1/fio/a.0.3721: Read-only file system: write offset=3244032, buflen=950272
fio: pid=7639, err=30/file:io_u.c:1889, func=io_u error, error=Read-only file system

a: (groupid=0, jobs=1): err=30 (file:io_u.c:1889, func=io_u error, error=Read-only file system): pid=7639: Thu Apr 20 12:15:08 2023
  write: IOPS=10, BW=10.5MiB/s (11.1MB/s)(16.5GiB/1596981msec); 0 zone resets
    clat (msec): min=15, max=671, avg=92.46, stdev=68.28
     lat (msec): min=16, max=672, avg=93.62, stdev=68.11
    clat percentiles (msec):
     |  1.00th=[   33],  5.00th=[   41], 10.00th=[   45], 20.00th=[   57],
     | 30.00th=[   66], 40.00th=[   71], 50.00th=[   77], 60.00th=[   80],
     | 70.00th=[   85], 80.00th=[   94], 90.00th=[  159], 95.00th=[  234],
     | 99.00th=[  405], 99.50th=[  439], 99.90th=[  584], 99.95th=[  600],
     | 99.99th=[  659]
   bw (  KiB/s): min= 2035, max=24576, per=100.00%, avg=10912.92, stdev=3699.94, samples=3159
   iops        : min=    1, max=   24, avg=10.28, stdev= 3.66, samples=3159
  lat (msec)   : 20=0.15%, 50=13.75%, 100=68.29%, 250=13.48%, 500=4.04%
  lat (msec)   : 750=0.28%
  cpu          : usr=0.60%, sys=19.52%, ctx=151604, majf=13, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,16847,0,0 short=0,1,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=10.5MiB/s (11.1MB/s), 10.5MiB/s-10.5MiB/s (11.1MB/s-11.1MB/s), io=16.5GiB (17.7GB), run=1596981-1596981msec

Disk stats (read/write):
  sda: ios=2/149864, merge=0/3683, ticks=42/1532370, in_queue=1532413, util=98.06%
root@OpenWrt ~ #

This is a snippet of the logread output, see the full log.

Thu Apr 20 12:14:56 2023 kern.debug kernel: [77784.155898] xfer 13 bytes
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.658754] Status code 0; transferred 13/13
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.658772] -- transfer complete
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.658780] Bulk status result = 0
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.658791] Bulk Status S 0x53425355 T 0x2f2cea R 122880 Stat 0x1
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.658801] -- transport indicates command failure
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.658808] -- unexpectedly short transfer
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.658815] Issuing auto-REQUEST_SENSE
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.658833] Bulk Command S 0x43425355 T 0x2f2ceb L 18 F 128 Trg 0 LUN 0 CL 6
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.658840] xfer 31 bytes
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.658944] Status code -32; transferred 0/31
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.658956] clearing endpoint halt for pipe 0xc0010300
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.658968] rq=01 rqtype=02 value=0000 index=02 len=0
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.659559] result = 0
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.659570] Bulk command transfer result=2
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.659579] -- auto-sense failure
Thu Apr 20 12:15:02 2023 kern.info kernel: [77789.778086] usb 1-1.1: reset high-speed USB device number 3 using ehci-platform
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.930975] usb_reset_device returns 0
Thu Apr 20 12:15:02 2023 kern.debug kernel: [77789.930996] scsi cmd done, result=0x70000

Notice how the first two log lines have a time gap of 5.5 seconds, that is anomalous. The driver code that logged auto-sense failure is: https://elixir.bootlin.com/linux/v5.10.161/source/drivers/usb/storage/transport.c#L759.

This is the usbmon capture at the same period, see the .pcap file for the rest. It seems like the driver never waits for the 'sense response data', it sends CLEAR FEATURE Request while I expected to see URB_BULK in instead (though I have no expertise in the SCSI protocol).

What I already tried:

  • Checking for bad blocks returned none

    Expand e2fsck -cc output
    root@OpenWrt ~ # e2fsck -cc /dev/sda1
    e2fsck 1.46.5 (30-Dec-2021)
    Checking for bad blocks (non-destructive read-write test)
    Testing with random pattern: done
    /dev/sda1: Updating bad block inode.
    Pass 1: Checking inodes, blocks, and sizes
    Pass 2: Checking directory structure
    Pass 3: Checking directory connectivity
    Pass 4: Checking reference counts
    Pass 5: Checking group summary information
    
    /dev/sda1: ***** FILE SYSTEM WAS MODIFIED *****
    /dev/sda1: 11013/3866624 files (20.8% non-contiguous), 13098623/15449344 blocks
    
    root@OpenWrt ~ # dumpe2fs -b /dev/sda1
    dumpe2fs 1.46.5 (30-Dec-2021)
    root@OpenWrt ~ #
    
  • Using a different SD card from another brand did not help

How would you troubleshoot this issue further? Could it be a Linux driver bug, a misbehaving USB controller, or something else?

I would try to reproduce on vanilla Linux.

Bad SD card, in performance, not the storage itself.

That could indeed rule out OpenWRT specific patches and such, though I'm not sure how to compile vanilla Linux for this router.

Interestingly, running the same fio benchmark on this router targeting a USB flash drive (instead of an SD card) also resulted in usb 1-1.2: reset high-speed USB device number 4 using ehci-platform though the other logs differed much, and the device got unmounted instead of read-only mounted, so perhaps it is unrelated.

Don't.
Use some live Linux image, they should have all the packages required to perform the same kind of benchmark on the card.
You want one running roughly the same kernel as Openwrt.

The failure using a flash drive would however indicate it's not SD/flash related.

1 Like

The following experiments with running the fio command with different hardware (but the same SD card) did not reproduce the error:

  • GEMBIRD MicroSD card reader 1908:0226 plugged in via the USB type A port of the gl-ar750 router

  • Live Linux ISO (Alpine with kernel 5.10.176-0-lts) in a laptop with built-in SD card reader

So, the issue might be specific to gl-ar750's internal SD card reader: 05e3:0752 Genesys Logic STORAGE DEVICE.

Hi TjeuKayim, I'm GL.iNet technical support Hoff, we test lots of SD cards, but don't repeat this problem, can you return this device and your SD to us? we will do some deep test and then try to fix this problem. if you can, we can give your a new free AR750 and buy a new SD card to you. Thank you very much.

1 Like

Hi Hoff Xin,
Thank you so much for looking into this issue.

In my post, I forgot to mention that you either need to use an fio version after 3.30 (I used 3.34) or leave out the --runtime=1d --time_based flags, to increase the chances of getting the errors. Also, the SD cards I tested with were a 64 GB card and a Samsung 128 GB card, perhaps the size matters.

As you proposed, I want to return the router and SD card to the factory, so that you can investigate the problem.

Hi TjeuKayim, can you give me the fio3.34.ipk file? we have speeded much time to test, but we can't repeater your problem, please sent it to me via email(my email is hoff.xin@gl-inet.com), it is easy and fast way to communicate, and I will give your my address to back your AR750 and SD crad and I will give your AR750 and SD crad refund.

Our respective contacts already exchanged details and the router + SD have been sent back for further investigation. Thanks for proposing to have them returned. This has been an interesting issue and I look forward to hearing what you discover.