Created attachment 1688599 [details]
systemtap to reproduce issue

Description of problem:

When the kernel ioctls BLKIOOPT and BLKIOMIN return invalid values, mkfs.xfs
will create a filesystem with bogus sunit/swidth which cannot be mounted. In
particular, the BLKIOMIN (minimum IO size) is larger than BLKIOOPT (optimal
IO size), resulting in sunit larger than swidth.

Prior to the kernel patch:
    xfs: catch bad stripe alignment configurations
filesystems created in this way could be mounted, so the invalid settings
were not an issue at mount time.

Version-Release number of selected component (if applicable):
RHEL 7 xfsprogs: xfsprogs-4.5.0-20.el7.x86_64

How reproducible:
Easy; see steps below.

Steps to Reproduce:

Create a large file to hold the filesystem:
# truncate -s 9276062957568 /var/tmp/test_fs

Set up a loop device:
# losetup --find --show /var/tmp/test_fs
/dev/loop0

In my case, the device I'm modifying is /dev/loop0, so the major/minor
numbers are 7:0:
# ls -l /dev/loop0
brw-rw----. 1 root disk 7, 0 May 14 12:43 /dev/loop0

Modify the top of the attached systemtap with the appropriate major/minor:
@define target_dev_major %( 7 %)
@define target_dev_minor %( 0 %)

Start the systemtap; it will output messages to both the console running
systemtap and the kernel logs:
# stap -g blkid_ioctls-2.stp 2>&1
loaded systemtap to modify device(7:0) optimal IO size to 262144 and minimum IO size to 1048576

Create the filesystem:
# mkfs.xfs -f /dev/loop0

The systemtap modifies the ioctl return values to the bogus values:
mkfs.xfs(29834): blkdev_ioctl(cmd: 4728 - BLKIOMIN) result is 512 - changing to 1048576
mkfs.xfs(29834): blkdev_ioctl(cmd: 4729 - BLKIOOPT) result is 0 - changing to 262144

mkfs.xfs completes with bogus sunit and swidth values:
meta-data=/dev/loop0             isize=512    agcount=32, agsize=70770688 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=2264662016, imaxpct=5
         =                       sunit=256    swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The filesystem cannot be mounted:
# mount /dev/loop0 /mnt/tmp
mount: mount /dev/loop0 on /mnt/tmp failed: Structure needs cleaning

with the following kernel messages:
[630936.031317]  [<ffffffff9febeca6>] worker_thread+0x126/0x3c0
[630936.031321]  [<ffffffff9febeb80>] ? manage_workers.isra.26+0x2a0/0x2a0
[630936.031326]  [<ffffffff9fec5b61>] kthread+0xd1/0xe0
[630936.031330]  [<ffffffff9fec5a90>] ? insert_kthread_work+0x40/0x40
[630936.031336]  [<ffffffffa0593df7>] ret_from_fork_nospec_begin+0x21/0x21
[630936.031340]  [<ffffffff9fec5a90>] ? insert_kthread_work+0x40/0x40
[630936.031444] XFS (loop0): SB validate failed with error -117.

Specifying sunit and swidth values at mount time does not work (mount still
fails with the same error).
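As a side note, the same values that mkfs.xfs obtains via the BLKIOMIN/BLKIOOPT ioctls can be queried up front with blockdev, so an affected device can be spotted before mkfs is ever run. A rough sketch only (the device path and the action taken on a mismatch are just examples, not part of the reproducer):

# Query the minimum and optimal IO sizes through the same ioctls mkfs.xfs uses.
min=$(blockdev --getiomin /dev/loop0)
opt=$(blockdev --getioopt /dev/loop0)
# A nonzero optimal size that is smaller than the minimum size is the bogus
# combination that leads to sunit > swidth.
if [ "$opt" -gt 0 ] && [ "$min" -gt "$opt" ]; then
    echo "suspicious IO sizes on /dev/loop0: minimum=$min optimal=$opt" >&2
fi

With the systemtap above active, this should report the injected values (minimum=1048576, optimal=262144).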
Actual results:
The filesystem created cannot be mounted.

Expected results:
mkfs.xfs either returns an error when sunit/swidth values are invalid, or
changes them to sane defaults.

Additional info:
The filesystem can be 'corrected' using xfs_db to modify the swidth in each
superblock:

# i=0 ; pat='^unit = ([0-9]+)$'
# while true ; do
      unit=$(xfs_db -x -c "sb $i" -c "p unit" /dev/loop0 2>&1)
      [[ $unit =~ $pat ]] || break
      unit=${BASH_REMATCH[1]}
      xfs_db -x -c "sb $i" -c "write width $unit" /dev/loop0 >/dev/null 2>&1
      i=$(($i+1))
  done

# mount /dev/loop0 /mnt/tmp
# xfs_info /mnt/tmp
meta-data=/dev/loop0             isize=512    agcount=32, agsize=70770688 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=2264662016, imaxpct=5
         =                       sunit=256    swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
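The rewritten values can also be read back read-only with xfs_db to confirm that every superblock copy was updated before attempting the mount. A sketch, reusing the same loop-over-superblocks idea as the fix above (the awk field extraction assumes the "width = N" output format produced by the "p" command):

# Walk the superblock copies read-only and print unit/width for each one;
# stop when "sb $i" no longer yields a valid superblock, as in the fix above.
i=0 ; pat='^unit = ([0-9]+)$'
while true ; do
    unit=$(xfs_db -r -c "sb $i" -c "p unit" /dev/loop0 2>&1)
    [[ $unit =~ $pat ]] || break
    width=$(xfs_db -r -c "sb $i" -c "p width" /dev/loop0 | awk '{print $3}')
    echo "sb $i: unit=${BASH_REMATCH[1]} width=$width"
    i=$(($i+1))
done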
Created attachment 1688616 [details]
metadump of 1 GiB filesystem created with the process outlined

The sparse file doesn't need to be anywhere near the size used in the
reproducer; 1 GiB is sufficient. Here's a metadump of a 1 GiB filesystem
created while running the systemtap:

# mkfs.xfs -f /dev/loop0
meta-data=/dev/loop0             isize=512    agcount=8, agsize=32512 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=260096, imaxpct=25
         =                       sunit=256    swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=624, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
OK, so I think we need a few things here:

1) mkfs.xfs should probably self-validate using the same tests the kernel
   performs, if it doesn't already do this upstream (see the sketch below).

2) The customer needs a safe path forward to be able to mount their
   existing fs.

3) We may need to evaluate the commit which is catching and causing this,
   if other customers are likely to have otherwise benign fs geometry that
   will cause them to fail to mount after an upgrade...
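For point 1, even without porting the upstream validation code, the basic invariant (sunit must not exceed swidth whenever a stripe geometry is set) could be checked against mkfs.xfs's own computed geometry using its dry-run mode. A rough sketch, assuming the -N output format shown in the description; the wrapper itself is purely illustrative, not an existing tool:

# Dry-run mkfs: -N prints the computed geometry without writing anything.
geom=$(mkfs.xfs -N -f /dev/loop0 | grep -oE 'sunit=[0-9]+ +swidth=[0-9]+')
sunit=${geom#sunit=}; sunit=${sunit%% *}
swidth=${geom##*swidth=}
# Refuse to create the filesystem if the computed geometry is self-inconsistent.
if [ -n "$geom" ] && [ "$swidth" -ne 0 ] && [ "$sunit" -gt "$swidth" ]; then
    echo "invalid geometry: sunit=$sunit > swidth=$swidth, not running mkfs" >&2
else
    mkfs.xfs -f /dev/loop0
fi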
As a general note on how problematic this could be: if this is the root filesystem, the customer will not be able to bring up the system, and it may not be immediately clear that the issue is due to a change in RHEL 7.8. Frank already linked another case, but this is the second known case that support has seen within a week.
Upstream mkfs validates the values and fails noisily and unhelpfully, but at least it won't create a filesystem that then fails to mount. The upstream validations probably have too many dependencies to bring directly to the RHEL7 codebase, but we can still implement the same checks.

As for recovery for RHEL7.8 users: mount -o <options> won't work to fix the geometry; we can't get past reading the superblock in order to make the changes in the kernel. repair won't fix it either, because it considers all the superblocks to be invalid. I'm afraid this will take xfs_db surgery to fix.

So: the safest thing to do would be to use xfs_db to rewrite the swidth to match the existing sunit; that should lead to pretty much identical behavior as was seen when mounted under the older kernel.

And then I think we need to get a kernel bug opened, with a zstream request, and revert the commit that is validating this geometry, because it has turned an invalid but otherwise harmless geometry condition into a much more serious problem; the tradeoff for catching corruption seems like the wrong one to make in this case.

Can support folks get a kernel bug opened? We'll get the revert sent ASAP and ack it for zstream.
A temporary workaround is to boot into the older kernel, as well. I'd be happy to help review a KB article to be sure we have the best, most accurate guidance for the next customer to hit this...
(In reply to Eric Sandeen from comment #7)
> A temporary workaround is to boot into the older kernel, as well.
> 
> I'd be happy to help review a KB article to be sure we have the best, most
> accurate guidance for the next customer to hit this...

I just finished putting this together:

RHEL 7.8: XFS filesystem can not be mounted after a kernel upgrade
https://access.redhat.com/solutions/5075561
From one customer:

A little update on my end: I've managed to update the FW of the raid card to
the latest version available. Buried in the release notes there are these two
issues fixed:

- Cannot create or mount xfs filesystem using xfsprogs 4.19.x kernel 4.20
  (SCGCQ02027889)
- xfs_info command run on an XFS file system created on a VD of strip size 1M
  shows sunit and swidth as 0 (SCGCQ02056038)

After the update the IO sizes changed (for the same array - w/o touching the
RAID settings):

-- OLD FW --
# cat /sys/block/sdb/queue/optimal_io_size
262144
# cat /sys/block/sdb/queue/minimum_io_size
1048576

-- NEW FW --
# cat /sys/block/sdb/queue/optimal_io_size
262144
# cat /sys/block/sdb/queue/minimum_io_size
262144

With these values a new XFS filesystem created from scratch has
sunit=64/swidth=64. Too bad this firmware didn't exist when we put the
machines in production.

It is still not known how pervasive this will be for customers, but there is hope that storage vendors will have patches for the VPD page discontinuity.
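For anyone wanting to check whether their hardware reports the same inconsistent combination, the sysfs attributes quoted above can be compared across all block devices. A small, purely illustrative sketch:

# Flag any block device whose reported minimum IO size exceeds a nonzero
# optimal IO size - the combination seen with the old firmware above.
for q in /sys/block/*/queue; do
    min=$(cat "$q/minimum_io_size")
    opt=$(cat "$q/optimal_io_size")
    if [ "$opt" -gt 0 ] && [ "$min" -gt "$opt" ]; then
        echo "${q%/queue}: minimum_io_size=$min > optimal_io_size=$opt"
    fi
done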
Thanks, that reinforces the notion that maybe we should set swidth==sunit with xfs_db to get customers running again.

Unfortunately, firmware updates won't help customers who have already created their filesystems; the invalid values are now permanently on disk in the filesystem superblock.

I think we should get a kernel bug for this and push a revert of the validation to zstream ASAP; the risk/reward doesn't seem worth it.
(In reply to Eric Sandeen from comment #10)
> Thanks, that reinforces the notion that maybe we should set swidth==sunit
> with xfs_db to get customers running again.
> 
> Unfortunately, firmware updates won't help customers who have already
> created their filesystems; the invalid values are now permanently on disk
> in the filesystem superblock.
> 
> I think we should get a kernel bug for this and push a revert of the
> validation to zstream ASAP; the risk/reward doesn't seem worth it.

Understood. Revert BZ is Red Hat BZ 1836292.
Thinking about this some more, I'm not sure it's critical to change mkfs, if we revert the kernel commit. These filesystems have obviously been working fine for customers until the oddity got detected. Changing mkfs behavior now may not actually be desirable.
(In reply to Eric Sandeen from comment #14)
> Thinking about this some more, I'm not sure it's critical to change mkfs,
> if we revert the kernel commit. These filesystems have obviously been
> working fine for customers until the oddity got detected. Changing mkfs
> behavior now may not actually be desirable.

This makes sense. The problem appears limited to a handful of RAID cards which now have updated firmware fixing the issue. After reverting the kernel commit, the only remaining problem case for filesystems created with these bogus settings will be filesystems created on RHEL 7 and then moved to RHEL 8 (or later) systems. I don't know how frequently this occurs/will occur, and we have a method for 'correcting' filesystems with this problem, so this seems manageable.
Hm, fair point about the RHEL8 migration... although many of those filesystems already exist (as we've seen), so changing mkfs now could reduce but won't eliminate that problem. I think Carlos is considering a patch for upstream where at least the values can be changed via mount options, as a slightly better recovery path. We could also consider making it less fatal upstream and in RHEL8, if it makes sense to do so.
Just for the record, here's what RHEL 8 mkfs.xfs looks like in this situation.

# truncate -s 1G /var/tmp/test_fs
# losetup --find --show /var/tmp/test_fs
/dev/loop0
# mkfs.xfs -f /dev/loop0
meta-data=/dev/loop0             isize=512    agcount=8, agsize=32512 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=260096, imaxpct=25
         =                       sunit=256    swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=1536, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
SB stripe unit sanity check failed
Metadata corruption detected at 0x560a50628d68, xfs_sb block 0x0/0x200
libxfs_writebufr: write verifer failed on xfs_sb bno 0x0/0x200
cache_node_purge: refcount was 1, not zero (node=0x560a51cc75b0)
SB stripe unit sanity check failed
Metadata corruption detected at 0x560a50628d68, xfs_sb block 0x3f800/0x200
libxfs_writebufr: write verifer failed on xfs_sb bno 0x3f800/0x200
SB stripe unit sanity check failed
Metadata corruption detected at 0x560a50628d68, xfs_sb block 0x7f000/0x200
libxfs_writebufr: write verifer failed on xfs_sb bno 0x7f000/0x200
SB stripe unit sanity check failed
Metadata corruption detected at 0x560a50628d68, xfs_sb block 0xfe000/0x200
libxfs_writebufr: write verifer failed on xfs_sb bno 0xfe000/0x200
SB stripe unit sanity check failed
Metadata corruption detected at 0x560a50628d68, xfs_sb block 0x13d800/0x200
libxfs_writebufr: write verifer failed on xfs_sb bno 0x13d800/0x200
SB stripe unit sanity check failed
Metadata corruption detected at 0x560a50628d68, xfs_sb block 0x17d000/0x200
libxfs_writebufr: write verifer failed on xfs_sb bno 0x17d000/0x200
SB stripe unit sanity check failed
Metadata corruption detected at 0x560a50628d68, xfs_sb block 0x0/0x200
libxfs_writebufr: write verifer failed on xfs_sb bno 0x0/0x200
SB stripe unit sanity check failed
Metadata corruption detected at 0x560a50628d68, xfs_sb block 0x1bc800/0x200
libxfs_writebufr: write verifer failed on xfs_sb bno 0x1bc800/0x200
SB stripe unit sanity check failed
Metadata corruption detected at 0x560a50628d68, xfs_sb block 0xbe800/0x200
libxfs_writebufr: write verifer failed on xfs_sb bno 0xbe800/0x200
releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!found dirty buffer (bulk) on free list!bad magic number
bad magic number
Metadata corruption detected at 0x560a50628d68, xfs_sb block 0x0/0x200
libxfs_writebufr: write verifer failed on xfs_sb bno 0x0/0x200
releasing dirty buffer (bulk) to free list!

# mount /dev/loop0 /mnt/tmp
mount: /mnt/tmp: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error.

So RHEL 8 is ugly, but at least it won't create a filesystem which can't be mounted. It probably calls for a sanity test of the values returned by the ioctls and an appropriate error message, but doesn't suffer from this bug in this way.
FWIW, VMware virtual machines seem to be affected, so it's not limited to just a handful of RAID cards (comment #15).
(In reply to Dominik Mierzejewski from comment #19)
> FWIW, VMware virtual machines seem to be affected, so it's not limited to
> just a handful of RAID cards (comment #15).

The associated KB article has been updated to clarify that the problem occurs when the minimum IO size exceeds the optimal IO size while the optimal IO size is greater than zero. Checking whether the stripe unit exceeds the stripe width is another check that can be done.
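As an illustration of that second check, mounted XFS filesystems can be audited ahead of a kernel upgrade by pulling the data-section sunit/swidth out of xfs_info. A sketch only; the grep pattern assumes the output format shown earlier in this bug:

# Print the data-section stripe geometry of every mounted XFS filesystem so
# any sunit-greater-than-swidth case can be spotted before upgrading.
for m in $(findmnt -t xfs -n -o TARGET); do
    echo "$m: $(xfs_info "$m" | grep -oE 'sunit=[0-9]+ +swidth=[0-9]+')"
done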
Closing this BZ. This particular issue won't be fixed in the RHEL 7 time frame, and resolution of the issues preventing filesystem mounts will be done in the relevant kernel bugzilla - Red Hat BZ 1836292.