Bug 1835947 - mkfs.xfs can create a filesystem with bogus sunit/swidth values if the kernel reports bogus BLKIOOPT and BLKIOMIN
Summary: mkfs.xfs can create a filesystem with bogus sunit/swidth values if the kernel...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: xfsprogs
Version: 7.8
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: rc
Assignee: Eric Sandeen
QA Contact: Zorro Lang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-14 19:09 UTC by Frank Sorenson
Modified: 2020-07-30 22:02 UTC
CC List: 19 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-24 19:30:01 UTC
Target Upstream Version:


Attachments
systemtap to reproduce issue (1.29 KB, text/plain)
2020-05-14 19:09 UTC, Frank Sorenson
metadump of 1 GiB filesystem created with the process outlined (6.32 KB, application/x-bzip)
2020-05-14 19:25 UTC, Frank Sorenson


Links
Red Hat Knowledge Base (Solution) 5075561 (Last Updated: 2020-05-19 07:40:55 UTC)

Description Frank Sorenson 2020-05-14 19:09:00 UTC
Created attachment 1688599 [details]
systemtap to reproduce issue

Description of problem:

	When the kernel ioctls BLKIOOPT and BLKIOMIN return invalid values, mkfs.xfs will create a filesystem with bogus sunit/swidth which cannot be mounted.

	In particular, the BLKIOMIN (minimum IO size) is larger than BLKIOOPT (optimal IO size), resulting in sunit larger than swidth.
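
	The kernel-reported values can be inspected from sysfs before running mkfs.xfs; the example below is illustrative (device name hypothetical) and shows the inverted relationship that triggers the bug, i.e. a minimum IO size larger than a non-zero optimal IO size:

	# cat /sys/block/sdb/queue/minimum_io_size
	1048576
	# cat /sys/block/sdb/queue/optimal_io_size
	262144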


Prior to the kernel patch:
	xfs: catch bad stripe alignment configurations 

filesystems created in this way could be mounted, so the invalid settings were not an issue at mount time.


Version-Release number of selected component (if applicable):

	RHEL 7 xfsprogs xfsprogs-4.5.0-20.el7.x86_64


How reproducible:

	easy; see steps below



Steps to Reproduce:

create a large file to hold the filesystem:
	# truncate -s 9276062957568 /var/tmp/test_fs

set up a loop device:

	# losetup --find --show /var/tmp/test_fs
	/dev/loop0


In my case, the device I'm modifying is /dev/loop0, so the major/minor numbers are 7:0:
	# ls -l /dev/loop0
	brw-rw----. 1 root disk 7, 0 May 14 12:43 /dev/loop0

modify the top of the attached systemtap with the appropriate major/minor:
	@define target_dev_major %( 7 %)
	@define target_dev_minor %( 0 %)
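
the major/minor numbers can also be read directly with stat if preferred (%t/%T print them in hex), for example:
	# stat -c '%t %T' /dev/loop0
	7 0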

start the systemtap; it will output messages to both the console running systemtap and the kernel logs:
	# stap -g blkid_ioctls-2.stp 2>&1
	loaded systemtap to modify device(7:0) optimal IO size to 262144 and minimum IO size to 1048576

create the filesystem
	# mkfs.xfs -f /dev/loop0

the systemtap modifies the ioctl return values to the bogus values:
	mkfs.xfs(29834): blkdev_ioctl(cmd: 4728 - BLKIOMIN) result is 512 - changing to 1048576
	mkfs.xfs(29834): blkdev_ioctl(cmd: 4729 - BLKIOOPT) result is 0 - changing to 262144

mkfs.xfs completes with bogus sunit and swidth values:
	meta-data=/dev/loop0             isize=512    agcount=32, agsize=70770688 blks
	         =                       sectsz=512   attr=2, projid32bit=1
	         =                       crc=1        finobt=0, sparse=0
	data     =                       bsize=4096   blocks=2264662016, imaxpct=5
	         =                       sunit=256    swidth=64 blks
	naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
	log      =internal log           bsize=4096   blocks=521728, version=2
	         =                       sectsz=512   sunit=8 blks, lazy-count=1
	realtime =none                   extsz=4096   blocks=0, rtextents=0

the filesystem cannot be mounted:
	mount /dev/loop0 /mnt/tmp
	mount: mount /dev/loop0 on /mnt/tmp failed: Structure needs cleaning

with the following kernel messages:
	[630936.031317]  [<ffffffff9febeca6>] worker_thread+0x126/0x3c0
	[630936.031321]  [<ffffffff9febeb80>] ? manage_workers.isra.26+0x2a0/0x2a0
	[630936.031326]  [<ffffffff9fec5b61>] kthread+0xd1/0xe0
	[630936.031330]  [<ffffffff9fec5a90>] ? insert_kthread_work+0x40/0x40
	[630936.031336]  [<ffffffffa0593df7>] ret_from_fork_nospec_begin+0x21/0x21
	[630936.031340]  [<ffffffff9fec5a90>] ? insert_kthread_work+0x40/0x40
	[630936.031444] XFS (loop0): SB validate failed with error -117.

specifying sunit and swidth values at mount time does not work (mount still fails with the same error).
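
for reference, the override attempt looks like the following (the sunit/swidth mount options are given in 512-byte sector units; the values here are illustrative, corresponding to 256 filesystem blocks):
	# mount -o sunit=2048,swidth=2048 /dev/loop0 /mnt/tmp
	mount: mount /dev/loop0 on /mnt/tmp failed: Structure needs cleaning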


Actual results:

	the filesystem created cannot be mounted


Expected results:

	mkfs.xfs should either return an error when the sunit/swidth values are invalid, or change them to sane defaults


Additional info:

	the filesystem can be 'corrected' using xfs_db to modify the swidth in each sb

	# i=0 ; pat='^unit = ([0-9]+)$'
	# while true ; do
		unit=$(xfs_db -x -c "sb $i" -c "p unit" /dev/loop0 2>&1)
		[[ $unit =~ $pat ]] || break
		unit=${BASH_REMATCH[1]}
		xfs_db -x -c "sb $i" -c "write width $unit" /dev/loop0 >/dev/null 2>&1
		i=$(($i+1))
	done
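
	to verify before mounting, the unit and width fields can be read back read-only (expected output shown for illustration):

	# xfs_db -r -c "sb 0" -c "p unit" -c "p width" /dev/loop0
	unit = 256
	width = 256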


	# mount /dev/loop0 /mnt/tmp
	# xfs_info /mnt/tmp
	meta-data=/dev/loop0             isize=512    agcount=32, agsize=70770688 blks
	         =                       sectsz=512   attr=2, projid32bit=1
	         =                       crc=1        finobt=0 spinodes=0
	data     =                       bsize=4096   blocks=2264662016, imaxpct=5
	         =                       sunit=256    swidth=256 blks
	naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
	log      =internal               bsize=4096   blocks=521728, version=2
	         =                       sectsz=512   sunit=8 blks, lazy-count=1
	realtime =none                   extsz=4096   blocks=0, rtextents=0

Comment 3 Frank Sorenson 2020-05-14 19:25:37 UTC
Created attachment 1688616 [details]
metadump of 1 GiB filesystem created with the process outlined

The sparse file doesn't need to be anywhere near the size used in the reproducer.  1 GiB is sufficient.

Here's a metadump of a 1 GiB filesystem created while running the systemtap:

# mkfs.xfs -f /dev/loop0
meta-data=/dev/loop0             isize=512    agcount=8, agsize=32512 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=260096, imaxpct=25
         =                       sunit=256    swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=624, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Comment 4 Eric Sandeen 2020-05-14 19:33:25 UTC
ok so I think we need a few things here:

1) mkfs.xfs should probably self-validate using the same tests the kernel performs
   (if it doesn't already do this upstream)
2) the customer needs a safe path forward to be able to mount their existing fs, and
3) we may need to evaluate the commit which is catching and causing this, if other
   customers are likely to have otherwise benign fs geometry that will cause them to
   fail to mount after an upgrade...
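
For 3), a rough way to spot an at-risk filesystem ahead of time (illustrative commands; the values shown are taken from the reproducer above, and a sunit larger than swidth is the problematic geometry) is to check the data section's sunit/swidth with xfs_info on a mounted fs, or read the superblock fields read-only with xfs_db on an unmounted device:

# xfs_info /mnt/tmp | grep swidth
         =                       sunit=256    swidth=64 blks
# xfs_db -r -c "sb 0" -c "p unit" -c "p width" /dev/loop0
unit = 256
width = 64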

Comment 5 Jacob Shivers 2020-05-14 21:54:25 UTC
As a general note on how problematic this could be: if this is the root filesystem, then customers will not be able to bring up the system,
and it may not be immediately clear that the issue is due to a change in RHEL 7.8.

Frank already linked another case, but this is the second known case that support has seen within a week.

Comment 6 Eric Sandeen 2020-05-14 22:13:35 UTC
Upstream mkfs validates the values, and fails noisily and unhelpfully, but at least it won't create a filesystem that then fails to mount.
However, the upstream validations probably have too many dependencies to bring directly to the RHEL7 codebase, but we can still implement the same checks.

As for recovery for RHEL7.8 users:

mount -o <options> won't work to fix the geometry; we can't get past reading the superblock in order to make the changes in the kernel.

repair won't fix it either, because it considers all the superblocks to be invalid.  I'm afraid this will take xfs_db surgery to fix.

So:

The safest thing to do would be to use xfs_db to rewrite the swidth to match the existing sunit; that should lead to pretty much identical behavior as was seen when mounted under the older kernel.


And then, I think we need to get a kernel bug opened, with a zstream request, and revert the commit that is validating this geometry because it has turned an invalid but otherwise harmless geometry condition into a much more serious problem; the tradeoff for catching corruption seems like the wrong one to make, in this case.

Can support folks get a kernel bug opened, and we'll get the revert sent ASAP and ack it for zstream?

Comment 7 Eric Sandeen 2020-05-14 22:19:04 UTC
A temporary workaround is to boot into the older kernel, as well.

I'd be happy to help review a KB article to be sure we have the best, most accurate guidance for the next customer to hit this...

Comment 8 Jacob Shivers 2020-05-14 22:22:57 UTC
(In reply to Eric Sandeen from comment #7)
> A temporary workaround is to boot into the older kernel, as well.
> 
> I'd be happy to help review a KB article to be sure we have the best, most
> accurate guidance for the next customer to hit this...

I just finished putting this together:

RHEL 7.8: XFS filesystem can not be mounted after a kernel upgrade
https://access.redhat.com/solutions/5075561

Comment 9 Jacob Shivers 2020-05-15 13:57:47 UTC
From one customer:


A little update on my end: I've managed to update the FW of the raid card to the latest version available.
Buried in the release notes there are these two issues fixed:

- Cannot create or mount xfs filesystem using xfsprogs 4.19.x kernel 4.20(SCGCQ02027889)
- xfs_info command run on an XFS file system created on a VD of strip size 1M shows sunit and swidth as 0(SCGCQ02056038)

After the update the io sizes changed (for the same array - w/o touching the RAID settings):

-- OLD FW --
# cat /sys/block/sdb/queue/optimal_io_size
262144
# cat /sys/block/sdb/queue/minimum_io_size
1048576

-- NEW FW --
# cat /sys/block/sdb/queue/optimal_io_size
262144
# cat /sys/block/sdb/queue/minimum_io_size
262144

With these values a new XFS filesystem created from scratch has sunit=64/swidth=64.
Too bad this firmware didn't exist when we put the machines in production.


It is still not known how pervasive this will be for customers, but there is hope that storage vendors will have patches for the VPD page discontinuity.

Comment 10 Eric Sandeen 2020-05-15 14:32:52 UTC
Thanks, that reinforces the notion that maybe we should set swidth==sunit with xfs_db to get customers running again.

Unfortunately, firmware updates won't help customers who have already created their filesystems and the invalid values are now permanently on disk in the filesystem superblock.

I think we should get a kernel bug for this and push a revert of the validation to zstream ASAP, the risk/reward doesn't seem worth it.

Comment 11 Jacob Shivers 2020-05-15 15:08:27 UTC
(In reply to Eric Sandeen from comment #10)
> Thanks, that reinforces the notion that maybe we should set swidth==sunit
> with xfs_db to get customers running again.
> 
> Unfortunately, firmware updates won't help customers who have already
> created their filesytems and the invalid values are now permanently on disk
> in the filesystem superblock.
> 
> I think we should get a kernel bug for this and push a revert of the
> validation to zstream ASAP, the risk/reward doesn't seem worth it.

Understood.
Revert BZ is BZ1836292

Comment 14 Eric Sandeen 2020-05-15 22:02:40 UTC
Thinking about this some more, I'm not sure it's critical to change mkfs, if we revert the kernel commit.  These filesystems have obviously been working fine for customers until the oddity got detected.  Changing mkfs behavior now may not actually be desirable.

Comment 15 Frank Sorenson 2020-05-16 22:47:34 UTC
(In reply to Eric Sandeen from comment #14)
> Thinking about this some more, I'm not sure it's critical to change mkfs, if
> we revert the kernel commit.  These filesystems have obviously been working
> fine for customers until the oddity got detected.  Changing mkfs behavior
> now may not actually be desirable.

This makes sense.

The problem appears limited to a handful of RAID cards which now have an updated firmware fixing the issue.

After reverting the kernel commit, the only remaining exposure for filesystems created with these bogus settings is when a filesystem created on RHEL 7 is moved to a RHEL 8 (or later) system.  I don't know how frequently this occurs/will occur, and we have a method for 'correcting' filesystems with this problem, so this seems manageable.

Comment 16 Eric Sandeen 2020-05-17 02:16:38 UTC
Hm, fair point about the RHEL8 migration... although, many of those filesystems already exist (as we've seen) so changing mkfs now could reduce but won't eliminate that problem.

I think Carlos is considering a patch for upstream where at least the values can be changed via mount options, as a slightly better recovery path.
We could also consider making it less fatal upstream and in RHEL8, if it makes sense to do so.

Comment 17 Frank Sorenson 2020-05-18 14:51:23 UTC
Just for the record, here's what RHEL 8 mkfs.xfs looks like in this situation.

# truncate -s 1G /var/tmp/test_fs
# losetup --find --show /var/tmp/test_fs
/dev/loop0

# mkfs.xfs -f /dev/loop0
meta-data=/dev/loop0             isize=512    agcount=8, agsize=32512 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=260096, imaxpct=25
         =                       sunit=256    swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=1536, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
SB stripe unit sanity check failed
Metadata corruption detected at 0x560a50628d68, xfs_sb block 0x0/0x200
libxfs_writebufr: write verifer failed on xfs_sb bno 0x0/0x200
cache_node_purge: refcount was 1, not zero (node=0x560a51cc75b0)
SB stripe unit sanity check failed
Metadata corruption detected at 0x560a50628d68, xfs_sb block 0x3f800/0x200
libxfs_writebufr: write verifer failed on xfs_sb bno 0x3f800/0x200
SB stripe unit sanity check failed
Metadata corruption detected at 0x560a50628d68, xfs_sb block 0x7f000/0x200
libxfs_writebufr: write verifer failed on xfs_sb bno 0x7f000/0x200
SB stripe unit sanity check failed
Metadata corruption detected at 0x560a50628d68, xfs_sb block 0xfe000/0x200
libxfs_writebufr: write verifer failed on xfs_sb bno 0xfe000/0x200
SB stripe unit sanity check failed
Metadata corruption detected at 0x560a50628d68, xfs_sb block 0x13d800/0x200
libxfs_writebufr: write verifer failed on xfs_sb bno 0x13d800/0x200
SB stripe unit sanity check failed
Metadata corruption detected at 0x560a50628d68, xfs_sb block 0x17d000/0x200
libxfs_writebufr: write verifer failed on xfs_sb bno 0x17d000/0x200
SB stripe unit sanity check failed
Metadata corruption detected at 0x560a50628d68, xfs_sb block 0x0/0x200
libxfs_writebufr: write verifer failed on xfs_sb bno 0x0/0x200
SB stripe unit sanity check failed
Metadata corruption detected at 0x560a50628d68, xfs_sb block 0x1bc800/0x200
libxfs_writebufr: write verifer failed on xfs_sb bno 0x1bc800/0x200
SB stripe unit sanity check failed
Metadata corruption detected at 0x560a50628d68, xfs_sb block 0xbe800/0x200
libxfs_writebufr: write verifer failed on xfs_sb bno 0xbe800/0x200
releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!found dirty buffer (bulk) on free list!bad magic number
bad magic number
Metadata corruption detected at 0x560a50628d68, xfs_sb block 0x0/0x200
libxfs_writebufr: write verifer failed on xfs_sb bno 0x0/0x200
releasing dirty buffer (bulk) to free list![root@vm9 tmp]# 


# mount /dev/loop0 /mnt/tmp
mount: /mnt/tmp: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error.


So RHEL 8 is ugly, but at least it won't create a filesystem which can't be mounted.  mkfs.xfs there probably still calls for a sanity test of the values returned by the ioctls and a clearer error message, but it doesn't suffer from this bug in this way.

Comment 19 Dominik Mierzejewski 2020-05-19 11:39:05 UTC
FWIW, VMware virtual machines seem to be affected, so it's not limited to just a handful of RAID cards (comment #15).

Comment 23 Jacob Shivers 2020-05-20 18:42:44 UTC
(In reply to Dominik Mierzejewski from comment #19)
> FWIW, VMware virtual machines seem to be affected, so it's not limited to
> just a handful of RAID cards (comment #15).

The associated KB article has been updated to clarify that the problem occurs when the minimum IO size exceeds the optimal IO size while the optimal IO size is greater than zero.
Checking whether the stripe unit exceeds the stripe width is another check that can be done.

Comment 24 Frank Sorenson 2020-05-24 19:30:01 UTC
Closing this BZ.  This particular issue won't be fixed in the RHEL 7 time frame, and resolution of the issues preventing filesystem mounts will be done in the relevant kernel bugzilla - bz 1836292

