+++ This bug was initially created as a clone of Bug #1605222 +++

NOTE: bug#1605222 is focused on fixing the fact that issuing IO to the request-based DM multipath target on top of the NVMe device in my testbed causes:

kernel: blk_cloned_rq_check_limits: over max segments limit.
kernel: device-mapper: multipath: Failing path 259:7.

(if others hit this issue I can easily make Ming's fix available)

The following occurs with latest v4.18-rc3 and v4.18-rc6 and also occurs with v4.15.

When corruption occurs from this test it also destroys the DOS partition table (created during step 0 below).. yeah, corruption is _that_ bad. Almost like the corruption is temporal (recently accessed regions of the NVMe device)?

Anyway: I stumbled onto rampant corruption when using request-based DM multipath on top of an NVMe device (not exclusive to a particular drive either, it happens to NVMe devices from multiple vendors). But the corruption only occurs if the request-based multipath IO is issued to an NVMe device in parallel with other IO issued to the _same_ underlying NVMe device by the DM cache target.

See the topology detailed below (at the very end of this comment).. basically all 3 devices that are used to create a DM cache device need to be backed by the same NVMe device (via partitions or linear volumes).

Again, using request-based DM multipath for dm-cache's "slow" device is _required_ to reproduce. Not 100% clear why really... other than request-based DM multipath builds large IOs (due to merging).

--- Additional comment from Mike Snitzer on 2018-07-20 10:14:09 EDT ---

To reproduce this issue using device-mapper-test-suite:

0) Partition an NVMe device. First primary partition of at least 5GB, second primary partition of at least 48GB.
NOTE: larger partitions (e.g. 1: 50GB 2: >= 220GB) can be used to reproduce XFS corruption much quicker.

1) Create a request-based multipath device on top of an NVMe device, e.g.:

#!/bin/sh
modprobe dm-service-time
DEVICE=/dev/nvme1n1p2
SIZE=`blockdev --getsz $DEVICE`
echo "0 $SIZE multipath 2 queue_mode mq 0 1 1 service-time 0 1 2 $DEVICE 1000 1" | dmsetup create nvme_mpath

# Just a note for how to fail/reinstate a path:
# dmsetup message nvme_mpath 0 "fail_path $DEVICE"
# dmsetup message nvme_mpath 0 "reinstate_path $DEVICE"

2) Check out device-mapper-test-suite from my github repo:

git clone git://github.com/snitm/device-mapper-test-suite.git
cd device-mapper-test-suite
git checkout -b devel origin/devel

3) Follow device-mapper-test-suite's README.md to get it all set up.

4) Configure /root/.dmtest/config with something like:

profile :nvme_shared do
  metadata_dev '/dev/nvme1n1p1'
  #data_dev '/dev/nvme1n1p2'
  data_dev '/dev/mapper/nvme_mpath'
end

default_profile :nvme_shared

------
NOTE: the configured 'metadata_dev' gets carved up by device-mapper-test-suite to provide both the dm-cache metadata device and the "fast" data device. The configured 'data_dev' is used for dm-cache's "slow" data device.

5) Run the test:

# tail -f /var/log/messages &
# time dmtest run --suite cache -n /split_large_file/

6) If the multipath device failed the lone NVMe path you'll need to reinstate the path before the next iteration of your test, e.g.
(from #1 above): dmsetup message nvme_mpath 0 "reinstate_path $DEVICE"

--- Additional comment from Mike Snitzer on 2018-07-20 10:39:02 EDT ---

(In reply to Mike Snitzer from comment #5)
> Just a theory but: the XFS corruption may be due to the DM cache metadata
> being corrupted by how bufio is building the IOs it sends down to NVMe (so
> XFS is redirected to the "slow" device for a block when it should've gone to
> the "fast" device, or vice-versa).
...
> I'll leave the DM multipath on NVMe for the "slow" device but switch to
> using a different (non-NVMe) device for "fast" and metadata. I'll still use
> a really fast device, e.g. simulate pmem using ram.

Testing with different NVMe devices doesn't result in corruption.

I also decided to go a different way to test my theory: if DM cache metadata corruption were occurring due to malformed IO issued to NVMe by bufio, it _should_ happen regardless of which "slow" device is used. So I switched from request-based to bio-based DM multipath for the "slow" device: cannot hit corruption no matter what I try.

So it seems pretty clear something is still wrong with request-based DM multipath on top of NVMe... sadly we don't have any negative check in blk-core, NVMe or elsewhere to offer any clue :(

--- Additional comment from Mike Snitzer on 2018-07-20 12:02:45 EDT ---

(In reply to Mike Snitzer from comment #6)
> So it seems pretty clear something is still wrong with request-based DM
> multipath on top of NVMe... sadly we don't have any negative check in
> blk-core, NVMe or elsewhere to offer any clue :(

Building on this comment: "Anyway, the fact that I'm getting this corruption on multiple different NVMe drives: I am definitely concerned that this BZ is due to a bug somewhere in NVMe core (or block core code that is specific to NVMe)."

I'm left thinking that request-based DM multipath is somehow causing NVMe's SG lists or other infrastructure to be "wrong", and that is resulting in corruption.

I get corruption on the dm-cache metadata device (which is theoretically unrelated, since it is a separate device from the "slow" dm-cache data device) if the dm-cache "slow" data device is backed by request-based dm-multipath on top of NVMe (a partition from the _same_ NVMe device that is used by the dm-cache metadata device).

Basically I'm back to thinking NVMe is corrupting the data due to the IO pattern or the nature of the cloned requests dm-multipath is issuing, and that it is causing corruption to other NVMe partitions on the same parent NVMe device. Certainly that is a concerning hypothesis, but I'm not seeing much else that would explain this weird corruption.

If I don't use the same NVMe device (with multiple partitions) for _all_ 3 sub-devices that dm-cache needs, I don't see the corruption. It is almost like the mix of IO issued for DM cache's metadata device (on nvme1n1p1 via dm-linear) and "fast" device (also on nvme1n1p1 via a dm-linear volume), in conjunction with IO issued by request-based DM multipath to NVMe for the "slow" device (on nvme1n1p2), is triggering NVMe to respond negatively.
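One way to probe this hypothesis is to compare the queue limits that blk_cloned_rq_check_limits validates (max segments and max sectors) on the multipath device against the underlying NVMe device. The following is only a rough sketch, not something from the original report; it assumes the nvme_mpath map from step 1 above and the nvme1n1 device from the topology below:

#!/bin/sh
# Sketch: compare the sysfs queue limits checked by blk_cloned_rq_check_limits
# between the dm-mpath device and the underlying NVMe device.
MPATH_KNAME=$(lsblk -dno KNAME /dev/mapper/nvme_mpath)   # e.g. dm-2
for dev in nvme1n1 $MPATH_KNAME; do
    echo "== /sys/block/$dev/queue =="
    grep . /sys/block/$dev/queue/max_segments \
           /sys/block/$dev/queue/max_segment_size \
           /sys/block/$dev/queue/max_sectors_kb
done

If the multipath device advertises larger limits than the NVMe device beneath it, merged clones can exceed what the NVMe queue accepts, which is consistent with the "over max segments limit" message quoted at the top.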
But this same observation can be made on completely different hardware using 2 totally different NVMe devices:

testbed1: Intel Corporation Optane SSD 900P Series (2700)
testbed2: Samsung Electronics Co Ltd NVMe SSD Controller 171X (rev 03)

Which is why it feels like some bug in Linux (be it dm-rq.c, blk-core.c, blk-merge.c or the common NVMe driver).

Topology before starting the device-mapper-test-suite test:

# lsblk /dev/nvme1n1
NAME           MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme1n1        259:1    0 745.2G  0 disk
├─nvme1n1p2    259:5    0 695.2G  0 part
│ └─nvme_mpath 253:2    0 695.2G  0 dm
└─nvme1n1p1    259:4    0    50G  0 part

Topology during the device-mapper-test-suite test:

# lsblk /dev/nvme1n1
NAME                      MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme1n1                   259:1    0 745.2G  0 disk
├─nvme1n1p2               259:5    0 695.2G  0 part
│ └─nvme_mpath            253:2    0 695.2G  0 dm
│   └─test-dev-458572     253:5    0    48G  0 dm
│     └─test-dev-613083   253:6    0    48G  0 dm   /root/snitm/git/device-mapper-test-suite/kernel_builds
└─nvme1n1p1               259:4    0    50G  0 part
  ├─test-dev-126378       253:4    0     4G  0 dm
  │ └─test-dev-613083     253:6    0    48G  0 dm   /root/snitm/git/device-mapper-test-suite/kernel_builds
  └─test-dev-652491       253:3    0    40M  0 dm
    └─test-dev-613083     253:6    0    48G  0 dm   /root/snitm/git/device-mapper-test-suite/kernel_builds

Pruning that tree a bit (removing the dm-cache device 253:6) for clarity:

# lsblk /dev/nvme1n1
NAME                  MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme1n1               259:1    0 745.2G  0 disk
├─nvme1n1p2           259:5    0 695.2G  0 part
│ └─nvme_mpath        253:2    0 695.2G  0 dm
│   └─test-dev-458572 253:5    0    48G  0 dm
└─nvme1n1p1           259:4    0    50G  0 part
  ├─test-dev-126378   253:4    0     4G  0 dm
  └─test-dev-652491   253:3    0    40M  0 dm

40M device is the dm-cache "metadata" device
4G device is the dm-cache "fast" data device
48G device is the dm-cache "slow" data device
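For completeness, the bio-based multipath variant mentioned in the 10:39 comment differs from the step 1 table only in its queue_mode. The snippet below is a sketch rather than the exact table used; the map name is a placeholder:

#!/bin/sh
# Sketch: same single-path service-time layout as step 1, but with
# queue_mode bio so the multipath target is bio-based instead of request-based.
modprobe dm-service-time
DEVICE=/dev/nvme1n1p2
SIZE=`blockdev --getsz $DEVICE`
echo "0 $SIZE multipath 2 queue_mode bio 0 1 1 service-time 0 1 2 $DEVICE 1000 1" | dmsetup create nvme_mpath_bio

With this table in place of nvme_mpath (and the dmtest config pointed at it), the corruption could not be reproduced, which is what isolates the problem to the request-based path.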
Doesn't seem to be specific to NVMe. I reproduced this on a system with an Intel NVMe card, but also with a 25GB scsi_debug device configured as follows:

modprobe scsi_debug dev_size_mb=25600

Command (m for help): n
Partition type:
   p   primary (0 primary, 0 extended, 4 free)
   e   extended
Select (default p): p
Partition number (1-4, default 1): 1
First sector (2048-52428799, default 2048):
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-52428799, default 52428799): +5G
Partition 1 of type Linux and of size 5 GiB is set

Command (m for help): n
Partition type:
   p   primary (1 primary, 0 extended, 3 free)
   e   extended
Select (default p): p
Partition number (2-4, default 2):
First sector (10487808-52428799, default 10487808):
Using default value 10487808
Last sector, +sectors or +size{K,M,G} (10487808-52428799, default 52428799):
Using default value 52428799
Partition 2 of type Linux and of size 20 GiB is set

Command (m for help): p

Disk /dev/sdb: 26.8 GB, 26843545600 bytes, 52428800 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 32768 bytes
Disk label type: dos
Disk identifier: 0x1707677c

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048    10487807     5242880   83  Linux
/dev/sdb2        10487808    52428799    20970496   83  Linux

#!/bin/sh
modprobe dm-service-time
modprobe dm-cache-smq
DEVICE=/dev/sdb2
SIZE=`blockdev --getsz $DEVICE`
echo "0 $SIZE multipath 2 queue_mode rq 0 1 1 service-time 0 1 2 $DEVICE 1000 1" | dmsetup create sdb_mpath

[root@rhel-storage-61 ~]# more .dmtest/config
profile :sdb_shared do
  metadata_dev '/dev/sdb1'
  #data_dev '/dev/sdb2'
  data_dev '/dev/mapper/sdb_mpath'
end

default_profile :sdb_shared

diff --git a/lib/dmtest/tests/cache/io_use_tests.rb b/lib/dmtest/tests/cache/io_use_tests.rb
index 1ada8ad..e68a6a8 100644
--- a/lib/dmtest/tests/cache/io_use_tests.rb
+++ b/lib/dmtest/tests/cache/io_use_tests.rb
@@ -91,7 +91,7 @@ class IOUseTests < ThinpTestCase
       :cache_size => gig(4),
       #:cache_size => gig(46), # would like to get up to 512GB to match customer but...
-      :data_size => gig(48),
+      :data_size => gig(19),
       ##:data_size => gig(210),
       #:io_mode => :writeback,
       :io_mode => :writethrough,
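For anyone scripting the same reproduction, the interactive fdisk session above can be replaced with a non-interactive equivalent. The parted invocation below is only a sketch; it assumes the scsi_debug device shows up as /dev/sdb and uses the same sector boundaries as the fdisk output:

#!/bin/sh
# Sketch: non-interactive equivalent of the fdisk session above,
# assuming the scsi_debug device appears as /dev/sdb.
modprobe scsi_debug dev_size_mb=25600
parted -s /dev/sdb mklabel msdos
parted -s /dev/sdb mkpart primary 2048s 10487807s      # ~5 GiB, matches /dev/sdb1
parted -s /dev/sdb mkpart primary 10487808s 52428799s  # ~20 GiB, matches /dev/sdb2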
(In reply to Ewan D. Milne from comment #1)
> Doesn't seem to be specific to NVMe. I reproduced this on a system with an
> Intel NVMe card, but also with a 25GB scsi_debug device configured as
> follows:

Yeap, thanks for doing that test Ewan.

It is clear that generic_make_request() isn't used for request-based DM cloned request submission, and that request-based DM simply cannot be layered on conventional DOS partitions as is.

Closing this bug as WONTFIX. I may circle back to hardening the kernel to either support this configuration or at least prevent it (e.g. drivers/md/dm-mpath.c:multipath_ctr() could check whether the underlying device is a partition).
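Until such a check exists in the kernel, the unsupported layout is easy to spot from userspace. The following is a hypothetical sanity check, not part of any existing tool; $DEVICE is whatever path device is listed in the hand-rolled dmsetup table:

#!/bin/sh
# Sketch: warn if a hand-rolled multipath map sits directly on a partition,
# which is the unsupported layout this bug was closed over.
DEVICE=/dev/nvme1n1p2
if [ "$(lsblk -dno TYPE $DEVICE)" = "part" ]; then
    echo "WARNING: $DEVICE is a partition; layering request-based dm-multipath on a partition is unsupported"
fi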
Hello Mike,

So I need to make sure everybody is aware that we have many customers using fdisk on a multipath device and creating a single (DOS) type partition.

They then take this and either create a PV on it for LVM, or add it to ASM as a raw device, or even run mkfs on the mpath*p1.

It has worked forever, and it has always been a single partition, but it is still a partition.

What part am I not understanding about generic request-based DM and partitions, given that it has not seen major corruption until all of this showed up of late?

Thanks
Laurence
(In reply to loberman from comment #3)
> Hello Mike
>
> So I need to make sure everybody is aware that we have many customers using
> fdisk on a multipath device and creating a single (DOS) type partition.
>
> They then take this and either create a PV on it for LVM, or add it to ASM
> as a raw device, or even run mkfs on the mpath*p1.
>
> It has worked forever, and it has always been a single partition, but it is
> still a partition.
>
> What part am I not understanding about generic request-based DM and
> partitions, given that it has not seen major corruption until all of this
> showed up of late?

If you look at Mike's test, the multipath device itself is on top of the partition. Multipath is not running on the whole device. The multipath tools do not allow this, and never have. The only way you can do this is to manually create the device table, like Mike is doing.

Customers who multipath the whole device and then create a partition on it (which will get mapped to a kpartx device on top of multipath) will not encounter this bug.

> Thanks
> Laurence
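To make the distinction concrete, the supported stack looks roughly like the sketch below: multipath (normally assembled by multipathd) owns the whole device, and the partition is a kpartx map on top of the multipath device rather than underneath it. Device and map names here are illustrative, and kpartx partition naming can vary by configuration:

#!/bin/sh
# Supported layout (illustrative names):
#   sda            whole device (one path of the multipath map)
#   └─mpatha       multipath map on the whole device
#     └─mpatha1    kpartx partition map on top of multipath
#
# e.g. once multipathd has created mpatha:
kpartx -a /dev/mapper/mpatha    # map partitions from the DOS label onto the multipath device
mkfs.xfs /dev/mapper/mpatha1    # filesystem goes on the partition map, not on sdaN directly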
OK, that makes perfect sense. That was my foolish misunderstanding.

Thanks
Laurence
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days