Invisible 16 TiB block device limit

{
        "kernel": "6.6.119",
        "hostname": "wrt",
        "system": "ARMv7 Processor rev 1 (v7l)",
        "model": "Linksys WRT1200AC",
        "board_name": "linksys,wrt1200ac",
        "rootfs_type": "squashfs",
        "release": {
                "distribution": "OpenWrt",
                "version": "24.10.5",
                "revision": "r29087-d9c5716d1d",
                "target": "mvebu/cortexa9",
                "description": "OpenWrt 24.10.5 r29087-d9c5716d1d",
                "builddate": "1766005702"
        }
}

When connecting a 20 TB (WD) or a 26 TB (Seagate) USB hard drive to OpenWrt, I ran into a sneaky 16 TiB block device limit. The device is detected with its full size, but the kernel simply won't read or write beyond 16 TiB (LBA 34359738368). The only hint is a complaint about a corrupted backup GPT. Otherwise it will happily mount a larger filesystem and probably silently corrupt it. This is independent of the filesystem used.

I don't know if this issue is specific to OpenWrt or to this particular platform. I don't think this is expected behavior. If there were supposed to be a hard limit, the kernel would say so outright. It's using 16-byte commands over USB, so it should support 64-bit LBAs, right?
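A quick sanity check of those numbers (a sketch; the cutoff LBA is the one observed above, and the 4 KiB page size is the standard ARMv7 page size):

```python
# First LBA the kernel refuses to touch, as observed above.
SECTOR = 512                      # logical sector size of these drives
cutoff_lba = 34359738368
cutoff_bytes = cutoff_lba * SECTOR

assert cutoff_lba == 2**35
assert cutoff_bytes == 16 * 2**40     # exactly 16 TiB
# With 4 KiB pages, that is exactly a 32-bit page index:
assert cutoff_bytes // 4096 == 2**32
```

The cutoff is suspiciously exact: not a drive geometry artifact, but a clean power of two.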

ext4 does not really support bigger filesystems; try xfs instead.

Is the CPU 32-bit or 64-bit? If 32-bit, then addressing might be limited.

That's not correct: ext4 has a file size limit of 16 TiB with the default logical block size of 4 KiB, not a volume size limit, which should be 1 EiB.
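Those two ext4 limits fall out of its on-disk field widths (a sketch; logical block numbers are 32-bit, physical block numbers are 48-bit):

```python
BLOCK = 4096                      # default ext4 block size
max_file_size = 2**32 * BLOCK     # 32-bit logical block numbers per file
max_volume    = 2**48 * BLOCK     # 48-bit physical block numbers per volume

assert max_file_size == 16 * 2**40    # 16 TiB file size limit
assert max_volume == 2**60            # 1 EiB volume size limit
```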

However, the issue is totally filesystem-independent, and we are operating on 512-byte LBAs. The OpenWrt Linux kernel simply can't read/write the USB block device (e.g. /dev/sda) past 16 TiB. The kernel reports the correct size for the block device (instead of just 16 TiB), yet gdisk reports the secondary GPT as "corrupt", because the kernel simply cannot access it.

dd won't read past 16 TiB and just truncates the output to 0+0 records. The kernel is otherwise totally unaware of this limitation: it mounts larger filesystems; it just happens that it cannot access partitions and files beyond the limit.

But the kernel shouldn't just silently drop and truncate read/write requests beyond an arbitrary LBA limit while silently corrupting data? Why am I seeing an arbitrary 35-bit limit here (34359738368 = 2^35), when READ(16) supports 64-bit LBA addressing and the kernel believes it does too, while not actually doing it?
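READ(16) itself is indeed not the bottleneck; its 64-bit LBA field has enormous headroom over the observed 35-bit cutoff (a sketch of the gap):

```python
SECTOR = 512
read16_max_lbas = 2**64           # 64-bit LBA field in READ(16)/WRITE(16)
observed_lbas   = 2**35           # where the kernel actually stops

assert observed_lbas * SECTOR == 16 * 2**40     # 16 TiB
# The command set could address 2**29 times more than the kernel delivers:
assert read16_max_lbas // observed_lbas == 2**29
```

So whatever imposes the limit sits above the SCSI command layer.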

Is it an interface or a filesystem problem?

dd if=/dev/sda bs=1M skip=16777216 count=1 of=/dev/null

Maybe it supports petabytes, but the badblocks tool does not...

My guess would be that your 20 TB and 26 TB drives are Advanced Format (AF) drives. You might need to align the sector boundaries. If you just dd from an image file, you will be misaligned by default.

Are these native USB disks, or are you using an external USB-SATA converter? If the latter, are you sure it can handle >16 TiB?

It is a bug to pretend support that doesn't exist and silently corrupt data. None of the block device or filesystem code knows anything about that limitation. The drives connect and show up with their full size; the kernel doesn't report a 16 TiB block device if that is all it can support, it reports the full size. It just quietly discards data. That is clearly a bug.

"This patch series is running in production now on some NAS devices from a very popular NAS brand."

I assume the patch has never made it upstream?

Thank you.

Try to read a block past 16 TiB and come back?
Is it really silent loss?

Your situation so far:

  1. Your CPU is 32-bit. Your memory addressing is constrained by that fact.
  2. Your sector boundaries are misaligned
  3. Someone suggests you might be using USB. So now you are also “double-translating” USB into SATA and then into 512e.

There are plenty of possible bugs or translation errors in your setup rather than in ext4 itself.

There are no grounds for that opinion.

The post pointing to the solution has been deleted, but thankfully I got it via my e-mail subscription to this thread. It explained that data written past 16 TiB on the block device is silently discarded on ARMv7 Linux. Any (!) filesystem is naturally corrupted afterwards, as the kernel is silently discarding data.

The above is the reason why, in my tests, gdisk didn't come back with an error message when writing the backup GPT at the end of the block device, which is the only write I tried after seeing suspicious dmesg log messages about the GPT. I didn't dare to mount any filesystem read/write (sparing me the headache of having to deal with a corrupted >20 TB volume).

The on-disk GPT (written by another x86_64 system) is just fine; however, the ARMv7 Linux kernel build told gdisk that it wrote the backup copy at the end of the disk and actually didn't. The write() call did not return -1 to gdisk. This is "pretend support".

The source of this grave data-corrupting mistake lies in upstream code. It's not an OpenWrt bug.

The (deleted) solution explained it all: Linux on ARMv7 doesn't support block devices larger than 16 TiB AT ALL and silently corrupts data if you dare to connect one. The latter is a bug; not supporting something is not.

It's a bug because a kernel on a platform with a hard 16 TiB limit for all block devices should not mount larger filesystems there. It should just tell you that it doesn't support storage devices this large and bail out.

The hard 16 TiB limit (4 KiB pages * 2^32) applies to any device regardless of sector size, including devices connected via SATA. The Linksys device in question actually has a native SATA port, so there is no need to blame USB for anything. The ARMv7 Linux kernel doesn't support SATA storage larger than 16 TiB there either.
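The sector-size independence follows directly from the limit being a page-count limit rather than a sector-count limit (a sketch, assuming a 32-bit page cache index and 4 KiB pages):

```python
PAGE = 4096
limit = PAGE * 2**32              # 32-bit page index * 4 KiB pages
assert limit == 16 * 2**40        # 16 TiB, regardless of sector size

# The same byte limit just maps to different LBA counts:
assert limit // 512 == 2**35      # 512e drives
assert limit // 4096 == 2**32     # 4Kn drives
```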

This issue is solved (by knowing the cause, not by having a solution, as the latter is a patch from a proprietary NAS vendor for its own devices). There is still a data corruption bug that needs to be fixed upstream. There is no need to hunt USB storage ghosts that don't exist.


(Year 2010, 32-bit)

I have read that ‘solution’ too. It pointed to this blog as justification, and although it connects the maximum size of block devices to the number of memory pages, it is actually talking about memory page size.

I highly doubt that connection is actually there. I never heard of it before, and I know of Arm32 devices that can handle (RAID) block devices bigger than 16 TiB.

Further, it leaves some questions. AFAIK the Linux file cache is what it suggests: it caches files, not the underlying block device. Of course the underlying block device itself is also a file, but in most cases it's not entirely cached.

Further, you could read this as the OS being able to handle a total of 16 TiB of address space, physical memory and block devices combined. How would that work? Two 8 TB disks connected, and when I add a 2 TB disk, boom, corruption?

I fail to see why a CPU being 32-bit has any impact on the size of the HD you can read. Yes, there can be implementation issues on 32-bit, but it is not fundamental at all. You can make a 32-bit operating system that allows reading any size of HD.

Yes, I have tested an Armada 385 and the badblocks limit is at 16 TiB. The rest of the tested 18 TB disk was not available to the tools, but visible in the system. I could fill the whole 16 TiB space with a GPT partition table. That was on 512e; I did not try 4Kn sectors.

XFS doesn't help a bit if the underlying ARMv7 platform doesn't support block devices larger than 16 TiB. RHEL never officially supported ARMv7.

I know it works on x86, but that doesn't help here. It's an ARMv7 platform issue. It's the kind of stuff you use x86 for, not because it is the superior ISA, but to circumvent exotic platform limitations like this one.

With 20/26 TB data center HDDs (inside USB enclosures) I simply surpassed the "that little OpenWrt box can do this" use case; I just wish it had failed more gracefully instead of catching me off-guard with data corruption. I will probably migrate the use case to x86(_64) at some point in the future if I don't find a usable fix.

The blog leads to the actual LKML discussion of a patch from the NAS vendor Synology, which probably never made it upstream. The NAS vendor obviously needed it, because the issue affects all block devices, including internal SATA drive bays. A NAS that silently corrupts all drives 18 TB and above is not going to be a very popular product.

The mailing list discussion is convincing enough that the ARMv7 platform needs bigger architectural changes to get past the 16 TiB limit. But what needs to be patched immediately is that the limit must be enforced and communicated to users, including to upper layers like filesystem drivers, which have no clue that it even exists, because the block cache just behaves like a black hole above 16 TiB.

The ARMv7 kernel reports a block device at a size it cannot handle and doesn't even make noise about not being able to.

They might have the patch set mentioned there. The gist I got from it is that you need larger pages to support anything beyond 16 TiB on ARMv7. The patch probably got rejected because it adds some overhead elsewhere.
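That matches the arithmetic: with a 32-bit page index, only a larger page size moves the ceiling (a sketch; the 16 KiB and 64 KiB figures are the usual larger-page candidates, not confirmed contents of that patch set):

```python
def byte_limit(page_size, index_bits=32):
    """Maximum addressable block device size with a fixed-width page index."""
    return page_size * 2**index_bits

assert byte_limit(4096) == 16 * 2**40     # 4 KiB pages  -> 16 TiB
assert byte_limit(16384) == 64 * 2**40    # 16 KiB pages -> 64 TiB
assert byte_limit(65536) == 256 * 2**40   # 64 KiB pages -> 256 TiB
```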

Scarily, nobody cared about actually recognizing the existing kernel limit and enforcing it correctly.

Linux has a block cache, and it does cache block devices; it was implemented this way to perform better with floppy disk drives and live CD-ROM drives. Raw (uncached) device support was removed in Linux 5.14.

A simple larger-than-16-TiB block device suffices to get you in trouble on ARMv7.

Does the kernel and/or gdisk complain about the corrupt backup GPT at the end of the 18 TB disk space?

Now 32-bit turns into ARMv7; I guess next will be endianness... Stop typing AI Bible-sized posts and test on your own system. Might need real coreutils instead of BusyBox.

You still did not test anything.

Sorry, I don't remember exactly. I limited the partition span to 16 TiB and they worked.