Bug 1607527 - data corruption with 'split' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
Summary: data corruption with 'split' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Mike Snitzer
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-07-23 16:16 UTC by Mike Snitzer
Modified: 2023-09-14 04:31 UTC (History)
CC List: 42 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1605222
Environment:
Last Closed: 2018-07-25 15:12:08 UTC
Type: Bug
Embargoed:



Description Mike Snitzer 2018-07-23 16:16:20 UTC
+++ This bug was initially created as a clone of Bug #1605222 +++

NOTE: bug#1605222 is focused on fixing the fact that issuing IO to the request-based DM multipath target on top of the NVMe device in my testbed causes:
 kernel: blk_cloned_rq_check_limits: over max segments limit.
 kernel: device-mapper: multipath: Failing path 259:7.
(if others hit this issue I can easily make Ming's fix available)

The following occurs with recent kernels (v4.18-rc3 and v4.18-rc6) and also with v4.15.  When corruption occurs from this test it also destroys the DOS partition table (created during step 0 below); yeah, the corruption is _that_ bad.  It is almost as if the corruption is temporal (hitting recently accessed regions of the NVMe device)?

Anyway: I stumbled onto rampant corruption when using request-based DM multipath on top of an NVMe device (not exclusive to a particular drive either; it happens with NVMe devices from multiple vendors).  But the corruption only occurs if the request-based multipath IO is issued to an NVMe device in parallel to other IO issued to the _same_ underlying NVMe device by the DM cache target.  See the topology detailed below (at the very end of this comment); basically, all 3 devices that are used to create a DM cache device need to be backed by the same NVMe device (via partitions or linear volumes).

Again, using request-based DM multipath for dm-cache's "slow" device is _required_ to reproduce.  It is not 100% clear why, other than that request-based DM multipath builds large IOs (due to merging).


--- Additional comment from Mike Snitzer on 2018-07-20 10:14:09 EDT ---

To reproduce this issue using device-mapper-test-suite:

0) Partition an NVMe device.  The first primary partition should be at least 5GB, the second primary partition at least 48GB.
NOTE: larger partitions (e.g. 1: 50GB, 2: >= 220GB) can be used to reproduce the XFS corruption much more quickly.
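
For reference, an equivalent layout (the larger one from the note above) can be created non-interactively; a minimal sketch, assuming /dev/nvme1n1 is the test device -- adjust the device name and sizes to your testbed, and note this wipes the existing partition table:

parted -s /dev/nvme1n1 mklabel msdos \
  mkpart primary 1MiB 50GiB \
  mkpart primary 50GiB 100%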

1) create a request-based multipath device on top of an NVMe device, e.g.:

#!/bin/sh

modprobe dm-service-time

DEVICE=/dev/nvme1n1p2
SIZE=`blockdev --getsz $DEVICE`

echo "0 $SIZE multipath 2 queue_mode mq 0 1 1 service-time 0 1 2 $DEVICE 1000 1" | dmsetup create nvme_mpath

# Just a note for how to fail/reinstate path:
# dmsetup message nvme_mpath 0 "fail_path $DEVICE"
# dmsetup message nvme_mpath 0 "reinstate_path $DEVICE"

2) checkout device-mapper-test-suite from my github repo:

git clone git://github.com/snitm/device-mapper-test-suite.git
cd device-mapper-test-suite
git checkout -b devel origin/devel

3) follow device-mapper-test-suite's README.md to get it all set up
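
The setup essentially amounts to installing the suite's Ruby dependencies; a rough sketch, assuming the repository uses Bundler (defer to the README if your checkout differs):

# gem install bundler
# bundle install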

4) Configure /root/.dmtest/config with something like:

profile :nvme_shared do
   metadata_dev '/dev/nvme1n1p1'
   #data_dev '/dev/nvme1n1p2'
   data_dev '/dev/mapper/nvme_mpath'
end

default_profile :nvme_shared

------
NOTE: configured 'metadata_dev' gets carved up by device-mapper-test-suite to provide both the dm-cache's metadata device and the "fast" data device.  The configured 'data_dev' is used for dm-cache's "slow" data device.

5) run the test:
# tail -f /var/log/messages &
# time dmtest run --suite cache -n /split_large_file/

6) If the multipath device failed the lone NVMe path, you'll need to reinstate the path before the next iteration of the test, e.g. (from step 1 above):
 dmsetup message nvme_mpath 0 "reinstate_path $DEVICE"


--- Additional comment from Mike Snitzer on 2018-07-20 10:39:02 EDT ---

(In reply to Mike Snitzer from comment #5)

> Just a theory but: the XFS corruption may be due to the DM cache metadata
> being corrupted by how bufio is building the IOs it sends down to NVMe (so
> XFS is redirected to the "slow" device for a block when it should've gone to
> the "fast" device, or vice-versa).
...
> I'll leave the DM multipath on NVMe for the "slow" but switch to using a
> different (non NVMe) device for "fast" and metadata.  I'll still use a
> really fast device, e.g. simulate pmem using ram.

Testing with different NVMe devices doesn't result in corruption.

I also decided to test my theory a different way: if DM cache metadata corruption were occurring due to malformed IO issued to NVMe by bufio, it _should_ happen regardless of which "slow" device is used.

So I switched from request-based to bio-based DM multipath for the "slow" device: I cannot hit the corruption no matter what I try.
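
For reference, the switch amounts to changing the queue_mode feature in the table from step 1 of the reproducer; a sketch, using the same $DEVICE and $SIZE as before:

echo "0 $SIZE multipath 2 queue_mode bio 0 1 1 service-time 0 1 2 $DEVICE 1000 1" | dmsetup create nvme_mpath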

So it seems pretty clear something is still wrong with request-based DM multipath on top of NVMe... sadly we don't have any negative check in blk-core, NVMe, or elsewhere to offer any clue :(

--- Additional comment from Mike Snitzer on 2018-07-20 12:02:45 EDT ---

(In reply to Mike Snitzer from comment #6)

> So it seems pretty clear something is still wrong with request-based DM
> multipath on top of NVMe... sadly we don't have any negative check in
> blk-core, NVMe, or elsewhere to offer any clue :(

Building on this comment:

"Anyway, fact that I'm getting this corruption on multiple different NVMe drives: I am definitely concerned that this BZ is due to a bug somewhere in NVMe core (or block core code that is specific to NVMe)."

I'm left thinking that request-based DM multipath is somehow causing NVMe's SG lists or other infrastructure to be "wrong", and that this is resulting in corruption.  I get corruption on the dm-cache metadata device (which is theoretically unrelated, as it is a separate device from the "slow" dm-cache data device) if the dm-cache "slow" data device is backed by request-based dm-multipath on top of NVMe (a partition from the _same_ NVMe device that is used by the dm-cache metadata device).

Basically I'm back to thinking NVMe is corrupting the data due to the IO pattern or nature of the cloned requests dm-multipath is issuing.  And it is causing corruption to other NVMe partitions on the same parent NVMe device.  Certainly that is a concerning hypothesis but I'm not seeing much else that would explain this weird corruption.

If I don't use the same NVMe device (with multiple partitions) for _all_ 3 sub-devices that dm-cache needs, I don't see the corruption.  It is almost as if the mix of IO issued by DM cache's metadata device (on nvme1n1p1 via dm-linear) and "fast" device (also on nvme1n1p1 via a dm-linear volume), in conjunction with IO issued by request-based DM multipath to NVMe for the "slow" device (on nvme1n1p2), is triggering NVMe to respond negatively.  But this same observation can be made on completely different hardware using 2 totally different NVMe devices:
testbed1: Intel Corporation Optane SSD 900P Series (2700)
testbed2: Samsung Electronics Co Ltd NVMe SSD Controller 171X (rev 03)

Which is why it feels like a bug somewhere in Linux (be it dm-rq.c, blk-core.c, blk-merge.c, or the common NVMe driver).

topology before starting the device-mapper-test-suite test:

# lsblk /dev/nvme1n1
NAME           MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme1n1        259:1    0 745.2G  0 disk
├─nvme1n1p2    259:5    0 695.2G  0 part
│ └─nvme_mpath 253:2    0 695.2G  0 dm
└─nvme1n1p1    259:4    0    50G  0 part

topology during the device-mapper-test-suite test:

# lsblk /dev/nvme1n1
NAME                    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme1n1                 259:1    0 745.2G  0 disk
├─nvme1n1p2             259:5    0 695.2G  0 part
│ └─nvme_mpath          253:2    0 695.2G  0 dm
│   └─test-dev-458572   253:5    0    48G  0 dm
│     └─test-dev-613083 253:6    0    48G  0 dm   /root/snitm/git/device-mapper-test-suite/kernel_builds
└─nvme1n1p1             259:4    0    50G  0 part
  ├─test-dev-126378     253:4    0     4G  0 dm
  │ └─test-dev-613083   253:6    0    48G  0 dm   /root/snitm/git/device-mapper-test-suite/kernel_builds
  └─test-dev-652491     253:3    0    40M  0 dm
    └─test-dev-613083   253:6    0    48G  0 dm   /root/snitm/git/device-mapper-test-suite/kernel_builds

pruning that tree a bit (removing the dm-cache device 253:6) for clarity:

# lsblk /dev/nvme1n1
NAME                    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme1n1                 259:1    0 745.2G  0 disk
├─nvme1n1p2             259:5    0 695.2G  0 part
│ └─nvme_mpath          253:2    0 695.2G  0 dm
│   └─test-dev-458572   253:5    0    48G  0 dm
└─nvme1n1p1             259:4    0    50G  0 part
  ├─test-dev-126378     253:4    0     4G  0 dm
  └─test-dev-652491     253:3    0    40M  0 dm

40M device is dm-cache "metadata" device
4G device is dm-cache "fast" data device
48G device is dm-cache "slow" data device
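
For orientation, the stack dmtest assembles corresponds to a dm-cache table of roughly this shape (illustrative only; the block size, io mode, and policy shown are assumptions based on the modules and test config used here, not the exact table dmtest generates):

# 0 <origin size in sectors> cache <metadata dev> <fast dev> <slow dev> \
#   <block size> <#features> <features> <policy> <#policy args>
0 100663296 cache 253:3 253:4 253:5 512 1 writethrough smq 0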

Comment 1 Ewan D. Milne 2018-07-24 18:19:34 UTC
Doesn't seem to be specific to NVMe.  I reproduced this on a system with an
Intel NVMe card, but also with a 25GB scsi_debug device configured as follows:

modprobe scsi_debug dev_size_mb=25600

fdisk /dev/sdb
Command (m for help): n
Partition type:
   p   primary (0 primary, 0 extended, 4 free)
   e   extended
Select (default p): p
Partition number (1-4, default 1): 1
First sector (2048-52428799, default 2048): 
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-52428799, default 52428799): +5G
Partition 1 of type Linux and of size 5 GiB is set

Command (m for help): n
Partition type:
   p   primary (1 primary, 0 extended, 3 free)
   e   extended
Select (default p): p
Partition number (2-4, default 2): 
First sector (10487808-52428799, default 10487808): 
Using default value 10487808
Last sector, +sectors or +size{K,M,G} (10487808-52428799, default 52428799): 
Using default value 52428799
Partition 2 of type Linux and of size 20 GiB is set

Command (m for help): p

Disk /dev/sdb: 26.8 GB, 26843545600 bytes, 52428800 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 32768 bytes
Disk label type: dos
Disk identifier: 0x1707677c

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048    10487807     5242880   83  Linux
/dev/sdb2        10487808    52428799    20970496   83  Linux

#!/bin/sh

modprobe dm-service-time
modprobe dm-cache-smq

DEVICE=/dev/sdb2
SIZE=`blockdev --getsz $DEVICE`

echo "0 $SIZE multipath 2 queue_mode rq 0 1 1 service-time 0 1 2 $DEVICE 1000 1" | dmsetup create sdb_mpath

[root@rhel-storage-61 ~]# more .dmtest/config
profile :sdb_shared do
   metadata_dev '/dev/sdb1'
   #data_dev '/dev/sdb2'
   data_dev '/dev/mapper/sdb_mpath'
end

default_profile :sdb_shared

diff --git a/lib/dmtest/tests/cache/io_use_tests.rb b/lib/dmtest/tests/cache/io_use_tests.rb
index 1ada8ad..e68a6a8 100644
--- a/lib/dmtest/tests/cache/io_use_tests.rb
+++ b/lib/dmtest/tests/cache/io_use_tests.rb
@@ -91,7 +91,7 @@ class IOUseTests < ThinpTestCase
                         :cache_size => gig(4),
                         #:cache_size => gig(46),
                         # would like to get up to 512GB to match customer but...
-                        :data_size => gig(48),
+                        :data_size => gig(19),
                         ##:data_size => gig(210),
                         #:io_mode => :writeback,
                         :io_mode => :writethrough,

Comment 2 Mike Snitzer 2018-07-25 15:12:08 UTC
(In reply to Ewan D. Milne from comment #1)
> Doesn't seem to be specific to NVMe.  I reproduced this on a system with an
> Intel NVMe card, but also with a 25GB scsi_debug device configured as
> follows:

Yep, thanks for doing that test, Ewan.

It is clear that generic_make_request() is not used for request-based DM cloned request submission, so the partition remapping it would normally perform never happens.  In other words, request-based DM simply cannot be layered on conventional DOS partitions as-is.

Closing this bug as WONTFIX.  I may circle back to hardening the kernel to either support this or at least prevent the configuration (e.g. drivers/md/dm-mpath.c:multipath_ctr() could check whether the device is a partition).
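
For reference, a rough sketch of the kind of guard that could be added (hypothetical code, not from any kernel tree; in v4.18-era kernels a partition is recognizable by a non-zero bdev->bd_partno, and later kernels grew a bdev_is_partition() helper):

/*
 * Hypothetical hardening sketch, e.g. called from multipath_ctr()/parse_path()
 * in drivers/md/dm-mpath.c once dm_get_device() has resolved the path's
 * block device.  Not actual kernel code.
 */
static int reject_partition_path(struct dm_target *ti, struct block_device *bdev)
{
	if (bdev->bd_partno) {	/* non-zero for partitions (v4.18-era) */
		ti->error = "request-based multipath on a partition is not supported";
		return -EINVAL;
	}
	return 0;
}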

Comment 3 loberman 2018-07-25 15:17:07 UTC
Hello Mike

So I need to make sure everybody is aware that we have many customers using fdisk on a multipath device and creating a single (DOS) type partition.

They then take this and either create a PV on it for LVM, add it to ASM as a raw device, or even run mkfs on the mpath*p1 device.

It's worked forever, and it's always been a single partition, but it is still a partition.

What part am I not understanding about request-based DM and partitions, given that we had not seen major corruption until all of this showed up recently?

Thanks
Laurence

Comment 4 Ben Marzinski 2018-07-25 16:06:33 UTC
(In reply to loberman from comment #3)
> Hello Mike
> 
> So I need to make sure everybody is aware that we have many customers using
> fdisk on a multipath device and creating a single (DOS) type partition.
> 
> They then take this and either create a PV on it for LVM or add it to ASM as
> a raw device or even run mkfs on the mpath*p1
> 
> It's worked forever, and it's always been a single partition, but it is
> still a partition.
> 
> What part am I not understanding about request-based DM and partitions,
> given that we had not seen major corruption until all of this showed up
> recently?

If you look at Mike's test, the multipath device itself is on top of the partition.  Multipath is not running on the whole device.  The multipath tools do not allow this, and never have; the only way you can do this is to manually create the device table, like Mike is doing.  Customers who multipath the whole device and then create a partition on it (which will get mapped to a kpartx device on top of multipath) will not encounter this bug.
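
To illustrate the difference (device names here are just examples):

# Unsupported (this reproducer): dm-multipath created directly on a partition
#   nvme1n1 -> nvme1n1p2 -> nvme_mpath (request-based dm-multipath)
#
# Supported (what multipathd + kpartx set up): multipath on the whole device,
# with the partition mapped on top of it
#   nvme1n1 -> mpatha (request-based dm-multipath) -> mpathap1
kpartx -a /dev/mapper/mpatha    # creates /dev/mapper/mpathap1 for partition 1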


> Thanks
> Laurence

Comment 5 loberman 2018-07-25 16:08:42 UTC
OK, that makes perfect sense.
That was my foolish misunderstanding.

Thanks
Laurence

Comment 6 Red Hat Bugzilla 2023-09-14 04:31:54 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

