Bug 1798631

Summary: possible xfs corruption of snapshot volume taken of a writecached filesystem
Product: Red Hat Enterprise Linux 8
Component: lvm2
lvm2 sub component: Cache Logical Volumes
Version: 8.2
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: medium
Target Milestone: rc
Target Release: 8.0
Fixed In Version: lvm2-2.03.08-2.el8
Doc Type: No Doc Update
Type: Bug
Last Closed: 2020-04-28 16:59:23 UTC
Reporter: Corey Marthaler <cmarthal>
Assignee: David Teigland <teigland>
QA Contact: cluster-qe <cluster-qe>
Docs Contact: Marek Suchánek <msuchane>
CC: agk, heinzm, jbrassow, lmanasko, lmiksik, mcsontos, mpatocka, msnitzer, pasik, prajnoha, teigland, zkabelac

Description Corey Marthaler 2020-02-05 17:25:38 UTC
Description of problem:

I've reproduced this twice so far in roughly two dozen attempts. I created the snapshot volumes "larger" than the origin volumes due to the re-encryption of the origin volume. I'll continue to attempt to boil this down, but for now let me know if anyone sees an operation where I'm blatantly shooting myself in the foot. This exact scenario always worked with dm-cache and 90%+ of the time with writecache, so I don't think I'm doing anything too crazy.

SCENARIO - [cache_origin_rename_in_between_luks_encryption_operations]
Create a writecache volume with fs data, encrypt origin, and then rename the cache origin volume (pool rename not supported) in between re-encryption stack operations


*** Cache info for this scenario ***
*  origin (slow):  /dev/sdb1
*  pool (fast):    /dev/sdf1
************************************

Adding "slow" and "fast" tags to corresponding pvs
Create origin (slow) volume
lvcreate --wipesignatures y  -L 4G -n rename_orig_A writecache_sanity @slow

Create cache data and cache metadata (fast) volumes
lvcreate  -L 4G -n rename_pool_A writecache_sanity @fast

Deactivate both fast and slow volumes before conversion to write cache
Create writecached volume by combining the cache pool (fast) and origin (slow) volumes
lvconvert --yes --type writecache --cachesettings ' low_watermark=42 high_watermark=77 writeback_jobs=2173 autocommit_blocks=1091 autocommit_time=1354' --cachevol writecache_sanity/rename_pool_A writecache_sanity/rename_orig_A
Activating volume: rename_orig_A

Encrypting rename_orig_A volume
cryptsetup luksFormat /dev/writecache_sanity/rename_orig_A
cryptsetup luksOpen /dev/writecache_sanity/rename_orig_A luks_rename_orig_A
Placing an xfs filesystem on origin volume
Mounting origin volume

Writing files to /mnt/rename_orig_A
Checking files on /mnt/rename_orig_A

syncing before snap creation...
Making 1st snapshot of origin volume
lvcreate  -s /dev/writecache_sanity/rename_orig_A -c 64 -n snap1 -L 4296704
Renaming cache ORIGIN volume...
lvrename writecache_sanity/rename_orig_A writecache_sanity/rename_orig_B
Mounting 1st snap volume
cryptsetup luksOpen /dev/writecache_sanity/snap1 luks_snap1
Checking files on /mnt/snap1

Writing files to /mnt/rename_orig_A

syncing before snap creation...
(ONLINE) Re-encrypting rename_orig_A volume
cryptsetup reencrypt --resilience none --active-name luks_rename_orig_A

Making 2nd snapshot of origin volume
lvcreate  -s /dev/writecache_sanity/rename_orig_B -c 64 -n snap2 -L 4296704
Mounting 2nd snap volume
cryptsetup luksOpen /dev/writecache_sanity/snap2 luks_snap2
Checking files on /mnt/rename_orig_A
Checking files on /mnt/snap2
Can not stat qdiihtdlbbvxycvnyjlwphytajrgfmcyskhagknjnkjpogd: Structure needs cleaning
checkit verify failed

Feb  4 11:45:49 host-083 qarshd[3079]: Running cmdline: /usr/tests/sts-rhel8.2/bin/checkit -w /mnt/snap2 -f /tmp/checkit_origin_1 -v
Feb  4 11:46:11 host-083 restraintd[1461]: *** Current Time: Tue Feb 04 11:46:11 2020 Localwatchdog at:  * Disabled! *
Feb  4 11:46:13 host-083 kernel: XFS (dm-16): Metadata CRC error detected at xfs_dir3_data_read_verify+0xc2/0xe0 [xfs], xfs_dir3_data block 0x39ea8
Feb  4 11:46:13 host-083 kernel: XFS (dm-16): Unmount and run xfs_repair
Feb  4 11:46:13 host-083 kernel: XFS (dm-16): First 128 bytes of corrupted metadata buffer:
Feb  4 11:46:13 host-083 kernel: 00000000: df 01 4e 5d 42 83 e8 b9 06 aa 90 4d 3f fb e2 3d  ..N]B......M?..=
Feb  4 11:46:13 host-083 kernel: 00000010: 68 94 57 9b 53 58 7a 13 11 42 06 06 5b a7 64 e2  h.W.SXz..B..[.d.
Feb  4 11:46:13 host-083 kernel: 00000020: 94 e7 01 ec 40 95 5a 84 dc 27 e2 85 b3 63 2d a4  ....@.Z..'...c-.
Feb  4 11:46:13 host-083 kernel: 00000030: 95 58 09 ff f8 ce 72 74 f7 7f b0 45 c1 07 cd b3  .X....rt...E....
Feb  4 11:46:13 host-083 kernel: 00000040: 2b 90 43 60 ce 7a c4 9c 9b c3 fe 32 fa c3 0d 45  +.C`.z.....2...E
Feb  4 11:46:13 host-083 kernel: 00000050: e4 a8 e9 a6 ec c1 8a 4b c1 56 cd 39 fd 87 9d 93  .......K.V.9....
Feb  4 11:46:13 host-083 kernel: 00000060: 27 78 56 ea 7d ab 48 fb 90 de 2c 91 06 4a de 78  'xV.}.H...,..J.x
Feb  4 11:46:13 host-083 kernel: 00000070: da 9a da 82 9d e9 3e 21 ac f9 e6 c9 0d 53 c1 ca  ......>!.....S..
Feb  4 11:46:13 host-083 kernel: XFS (dm-16): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x39ea8 len 8 error 74


[root@host-083 ~]# df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/luks_rename_orig_A   4.0G  778M  3.3G  20% /mnt/rename_orig_A
/dev/mapper/luks_snap1           4.0G  421M  3.6G  11% /mnt/snap1
/dev/mapper/luks_snap2           4.0G  778M  3.3G  20% /mnt/snap2


[root@host-083 ~]# ls -l /mnt/rename_orig_A | wc -l
2401
[root@host-083 ~]# ls -l /mnt/snap1 | wc -l
1201
[root@host-083 ~]# ls -l /mnt/snap2 | wc -l
ls: reading directory '/mnt/snap2': Structure needs cleaning
958


[root@host-083 ~]# umount /mnt/snap2
[root@host-083 ~]# xfs_repair /dev/mapper/luks_snap2
[...]
entry "ppyeoxxlwwomxtfqv" at block 12 offset 920 in directory inode 128 references free inode 709403
        clearing inode number in entry at offset 920...
entry "hmanrjtwiwspdtpxuqlucanjjjavwvueiwscgoyrothtpar" at block 12 offset 952 in directory inode 128 references free inode 709404
        clearing inode number in entry at offset 952...
entry "kcsxbukaaudxniyxnfvkawsxxr" at block 12 offset 1016 in directory inode 128 references free inode 709405
        clearing inode number in entry at offset 1016...
entry "bdekkgy" at block 12 offset 1056 in directory inode 128 references free inode 709406
        clearing inode number in entry at offset 1056...
entry "ldbkrbvngmcqidcqocai" at block 12 offset 1080 in directory inode 128 references free inode 709407
        clearing inode number in entry at offset 1080...
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
Metadata CRC error detected at 0x56145067c919, xfs_dir3_data block 0x39ea8/0x1000
corrupt block 10 in directory inode 128: junking block
Metadata CRC error detected at 0x56145067c919, xfs_dir3_data block 0x44330/0x1000
corrupt block 11 in directory inode 128: junking block
rebuilding directory inode 128
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected inode 556862, moving to lost+found
disconnected inode 556863, moving to lost+found
disconnected inode 596673, moving to lost+found
[...]
disconnected inode 633594, moving to lost+found
disconnected inode 633595, moving to lost+found
disconnected inode 633596, moving to lost+found
disconnected inode 633597, moving to lost+found
disconnected inode 633598, moving to lost+found
disconnected inode 633599, moving to lost+found
disconnected inode 709417, moving to lost+found
Phase 7 - verify and correct link counts...
done


[root@host-083 ~]# mount /dev/mapper/luks_snap2 /mnt/snap2
mount: /mnt/snap2: wrong fs type, bad option, bad superblock on /dev/mapper/luks_snap2, missing codepage or helper program, or other error.




Version-Release number of selected component (if applicable):
4.18.0-173.el8.x86_64

kernel-4.18.0-173.el8    BUILT: Fri Jan 24 06:02:03 CST 2020
lvm2-2.03.07-1.el8    BUILT: Mon Dec  2 00:09:32 CST 2019
lvm2-libs-2.03.07-1.el8    BUILT: Mon Dec  2 00:09:32 CST 2019
device-mapper-1.02.167-1.el8    BUILT: Mon Dec  2 00:09:32 CST 2019
device-mapper-libs-1.02.167-1.el8    BUILT: Mon Dec  2 00:09:32 CST 2019
device-mapper-event-1.02.167-1.el8    BUILT: Mon Dec  2 00:09:32 CST 2019
device-mapper-event-libs-1.02.167-1.el8    BUILT: Mon Dec  2 00:09:32 CST 2019
device-mapper-persistent-data-0.8.5-3.el8    BUILT: Wed Nov 27 07:05:21 CST 2019
cryptsetup-2.2.2-1.el8    BUILT: Mon Nov 18 06:06:33 CST 2019
cryptsetup-libs-2.2.2-1.el8    BUILT: Mon Nov 18 06:06:33 CST 2019



How reproducible:
seldom

Comment 1 Corey Marthaler 2020-02-05 17:29:12 UTC
Here are the specific commands being run:


# set up
pvcreate /dev/sd[bc]1
vgcreate writecache_sanity  /dev/sd[bc]1
pvchange --addtag slow /dev/sdc1
pvchange --addtag fast /dev/sdb1
lvcreate --wipesignatures y -L 4G -n rename_orig_A writecache_sanity @slow
lvcreate -L 4G -n rename_pool_A writecache_sanity @fast
lvchange -an writecache_sanity/rename_orig_A writecache_sanity/rename_pool_A
lvconvert --yes --type writecache --cachesettings ' low_watermark=42 high_watermark=77 writeback_jobs=2173 autocommit_blocks=1091 autocommit_time=1354' --cachevol writecache_sanity/rename_pool_A writecache_sanity/rename_orig_A
lvchange -vvvv -ay writecache_sanity/rename_orig_A
echo Str0ngP455w0rd### | cryptsetup luksFormat /dev/writecache_sanity/rename_orig_A
sync
echo Str0ngP455w0rd### | cryptsetup luksOpen /dev/writecache_sanity/rename_orig_A luks_rename_orig_A
cryptsetup -v status luks_rename_orig_A
mkfs.xfs -f /dev/mapper/luks_rename_orig_A
mkdir -p /mnt/rename_orig_A
mount /dev/mapper/luks_rename_orig_A /mnt/rename_orig_A

# i/o
/usr/tests/sts-rhel8.2/bin/checkit -w /mnt/rename_orig_A -f /tmp/checkit_origin_1 -n 1200
/usr/tests/sts-rhel8.2/bin/checkit -w /mnt/rename_orig_A -f /tmp/checkit_origin_1 -v
sync
lvs --noheadings -o lv_size --units k writecache_sanity/rename_orig_A
lvs --noheadings -o lv_attr writecache_sanity/rename_orig_A

# snapshot
lvcreate -s /dev/writecache_sanity/rename_orig_A -c 64 -n snap1 -L 4296704
lvs --noheadings -o lv_attr writecache_sanity/snap1

# origin rename
lvrename writecache_sanity/rename_orig_A writecache_sanity/rename_orig_B

# snap usage
echo Str0ngP455w0rd### | cryptsetup luksOpen /dev/writecache_sanity/snap1 luks_snap1
cryptsetup -v status luks_snap1
mkdir -p /mnt/snap1
mount -o nouuid /dev/mapper/luks_snap1 /mnt/snap1
/usr/tests/sts-rhel8.2/bin/checkit -w /mnt/snap1 -f /tmp/checkit_origin_1 -v
/usr/tests/sts-rhel8.2/bin/checkit -w /mnt/rename_orig_A -f /tmp/checkit_origin_2 -n 1200
sync

# reencrypt the origin
echo Str0ngP455w0rd### | cryptsetup reencrypt --resilience none --active-name luks_rename_orig_A
cryptsetup -v status luks_rename_orig_A
lvs --noheadings -o lv_attr writecache_sanity/rename_orig_B

# create another snap
lvcreate -s /dev/writecache_sanity/rename_orig_B -c 64 -n snap2 -L 4296704
lvs --noheadings -o lv_attr writecache_sanity/snap2
echo Str0ngP455w0rd### | cryptsetup luksOpen /dev/writecache_sanity/snap2 luks_snap2
cryptsetup -v status luks_snap2
sync
mkdir -p /mnt/snap2
mount -o nouuid /dev/mapper/luks_snap2 /mnt/snap2
/usr/tests/sts-rhel8.2/bin/checkit -w /mnt/rename_orig_A -f /tmp/checkit_origin_1 -v
/usr/tests/sts-rhel8.2/bin/checkit -w /mnt/snap2 -f /tmp/checkit_origin_1 -v

Comment 2 Corey Marthaler 2020-02-05 20:20:01 UTC
Actually, is just having snapshots of writecache origin volumes the issue here?


SCENARIO - [simple_writecache_snap_merge]
Create snaps of cache origin with fs data, verify data on snaps, change data on origin, merge data back to origin, verify origin data

*** Cache info for this scenario ***
*  origin (slow):  /dev/sdc1
*  pool (fast):    /dev/sde1
************************************

Adding "slow" and "fast" tags to corresponding pvs
Create origin (slow) volume
lvcreate --wipesignatures y  -L 4G -n corigin writecache_sanity @slow

Create cache data and cache metadata (fast) volumes
lvcreate  -L 2G -n pool writecache_sanity @fast

Deactivate both fast and slow volumes before conversion to write cache
Create writecached volume by combining the cache pool (fast) and origin (slow) volumes
lvconvert --yes --type writecache --cachesettings ' low_watermark=37 high_watermark=93 writeback_jobs=2567 autocommit_blocks=2257 autocommit_time=2912' --cachevol writecache_sanity/pool writecache_sanity/corigin

Placing an xfs filesystem on origin volume
Mounting origin volume

Writing files to /mnt/corigin

Checking files on /mnt/corigin


Making a snapshot of the origin volume, mounting, and verifying original data
lvcreate  -s /dev/writecache_sanity/corigin -c 64 -n merge -L 500M

+++ Mounting and verifying snapshot merge data +++
Checking files on /mnt/merge
Can not stat byqckoqikfbgprmlmkmyvprgug: Structure needs cleaning

Feb  5 14:11:49 hayes-02 qarshd[37391]: Running cmdline: mount -o nouuid /dev/writecache_sanity/merge /mnt/merge
Feb  5 14:11:49 hayes-02 kernel: XFS (dm-5): Mounting V5 Filesystem
Feb  5 14:11:49 hayes-02 kernel: XFS (dm-5): Starting recovery (logdev: internal)
[516010.515258] XFS (dm-5): Ending recovery (logdev: internal)
Feb  5 14:11:49 hayes-02 kernel: XFS (dm-5): Ending recovery (logdev: internal)
Feb  5 14:11:50 hayes-02 systemd[1]: Started qarsh Per-Connection Server (10.15.80.223:41534).
Feb  5 14:11:50 hayes-02 qarshd[37403]: Talking to peer ::ffff:10.15.80.223:41534 (IPv6)
Feb  5 14:11:50 hayes-02 qarshd[37403]: Running cmdline: /usr/tests/sts-rhel8.2/bin/checkit -w /mnt/merge -f /tmp/original.1199439 -v
Feb  5 14:11:50 hayes-02 kernel: XFS (dm-5): Metadata corruption detected at xfs_da3_node_read_verify+0x8b/0x160 [xfs], xfs_da3_node block 0x1b430 
Feb  5 14:11:50 hayes-02 kernel: XFS (dm-5): Unmount and run xfs_repair
Feb  5 14:11:50 hayes-02 kernel: XFS (dm-5): First 128 bytes of corrupted metadata buffer:
Feb  5 14:11:50 hayes-02 kernel: 00000000: 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35  5555555555555555
Feb  5 14:11:50 hayes-02 kernel: 00000010: 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35  5555555555555555
Feb  5 14:11:50 hayes-02 kernel: 00000020: 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35  5555555555555555
Feb  5 14:11:50 hayes-02 kernel: 00000030: 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35  5555555555555555
Feb  5 14:11:50 hayes-02 kernel: 00000040: 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35  5555555555555555
Feb  5 14:11:50 hayes-02 kernel: 00000050: 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35  5555555555555555
Feb  5 14:11:50 hayes-02 kernel: 00000060: 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35  5555555555555555
Feb  5 14:11:50 hayes-02 kernel: 00000070: 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35  5555555555555555
Feb  5 14:11:50 hayes-02 kernel: XFS (dm-5): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x1b430 len 8 error 



[root@hayes-02 ~]# lvs -a -o +devices
  LV               VG                Attr       LSize   Pool        Origin           Data%  Meta%  Move Log Cpy%Sync Convert Devices          
  corigin          writecache_sanity owi-aoC---   4.00g [pool_cvol] [corigin_wcorig]                                         corigin_wcorig(0)
  [corigin_wcorig] writecache_sanity owi-aoC---   4.00g                                                                      /dev/sdc1(0)     
  merge            writecache_sanity swi-aos--- 500.00m             corigin          0.45                                    /dev/sdb1(0)     
  [pool_cvol]      writecache_sanity Cwi-aoC---   2.00g                                                                      /dev/sde1(0)     

[root@hayes-02 ~]# df -h
Filesystem                             Size  Used Avail Use% Mounted on
/dev/mapper/writecache_sanity-corigin  4.0G  417M  3.6G  11% /mnt/corigin
/dev/mapper/writecache_sanity-merge    4.0G  417M  3.6G  11% /mnt/merge

[root@hayes-02 ~]# lvs -a -o +devices
  LV               VG                Attr       LSize   Pool        Origin           Data%  Meta%  Move Log Cpy%Sync Convert Devices          
  corigin          writecache_sanity owi-aoC---   4.00g [pool_cvol] [corigin_wcorig]                                         corigin_wcorig(0)
  [corigin_wcorig] writecache_sanity owi-aoC---   4.00g                                                                      /dev/sdc1(0)     
  merge            writecache_sanity swi-aos--- 500.00m             corigin          0.45                                    /dev/sdb1(0)     
  [pool_cvol]      writecache_sanity Cwi-aoC---   2.00g                                                                      /dev/sde1(0)

Comment 3 David Teigland 2020-02-05 20:30:41 UTC
It doesn't make sense to do anything with the origin LV while it has a writecache; it's not a valid view of the data.  Will need to find the place to stop that.

Comment 4 David Teigland 2020-02-05 20:51:00 UTC
This is confusing because of the naming/terms being used.  In your example the main/original LV is named "corigin", but the word corigin also implies a role/position of the LV which is what I thought you meant.

Say we have two linear LVs, "main" and "fast".  In that state neither of them has the role of a cache origin.

Once fast is attached to main to start caching, then we have LVs: "main", "fast" and "main_wcorig".  In that state, "main_wcorig" is the cache origin.  In your example, you're taking a snapshot of "main" (the top level LV which doesn't really have a term), not "main_wcorig" (the writecache origin in lvm terms).

Given all that, it does conceptually make sense to take a snapshot of "main", but technically the way the snapshot target does i/o in the kernel may not be compatible with having writecache under it.  Will need to investigate that.
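
For reference, a minimal sketch of how to see that split on the command line (using the VG name from this report; lv_role and lv_layout are standard lvs reporting fields):

# list all LVs, including hidden sub-LVs, with their role and layout
lvs -a -o lv_name,lv_role,lv_layout,devices writecache_sanity
# the hidden sub-LVs (the *_wcorig writecache origin and the *_cvol cachevol)
# appear here alongside the top-level LV that the snapshot is taken of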

Comment 5 David Teigland 2020-02-05 21:04:08 UTC
> In that state, "main_wcorig" is the cache origin.
> In your example, you're taking a snapshot of "main" (the top level LV which
> doesn't really have a term), not "main_wcorig" (the writecache origin in lvm
> terms).

Now that I think about it, lvm uses the word "origin" everywhere for everything so I think I'll give up trying to make sense of that word.

Comment 6 David Teigland 2020-02-05 21:29:08 UTC
I did a similar sort of test but without any cachesettings (so using defaults), and didn't see a problem.  When I tried using the same cachesettings you have above I did get an xfs corruption, but after removing the snapshot.  Could you try without changing any settings?

Comment 7 Corey Marthaler 2020-02-06 15:55:15 UTC
> Once fast is attached to main to start caching, then we have LVs: "main",
> "fast" and "main_wcorig".  In that state, "main_wcorig" is the cache origin.
> In your example, you're taking a snapshot of "main" (the top level LV which
> doesn't really have a term), not "main_wcorig" (the writecache origin in lvm
> terms).

I'll try to make sure we have tests for both scenarios; however, once the volume is writecached, your only choice is to snap "main".


### origin "slow" snap attempt of *existing* writecache
[root@hayes-02 ~]# lvs -a -o +devices
  LV               VG                Attr       LSize   Pool        Origin           Data%  Meta%  Move Log Cpy%Sync Convert Devices          
  corigin          writecache_sanity owi-aoC---   4.00g [pool_cvol] [corigin_wcorig]                                         corigin_wcorig(0)
  [corigin_wcorig] writecache_sanity owi-aoC---   4.00g                                                                      /dev/sdc1(0)     
  [pool_cvol]      writecache_sanity Cwi-aoC---   2.00g                                                                      /dev/sde1(0)     

[root@hayes-02 ~]# lvcreate -s writecache_sanity/corigin_wcorig -n snap_attempt -L 100M
  Snapshots of hidden volumes are not supported.



### origin "slow" snap attempt *prior* to becoming writecache
[root@hayes-02 ~]# lvcreate --wipesignatures y  -L 4G -n corigin writecache_sanity @slow
  Logical volume "corigin" created.
[root@hayes-02 ~]# lvcreate  -L 2G -n pool writecache_sanity @fast
  Logical volume "pool" created.
[root@hayes-02 ~]# lvcreate -s writecache_sanity/corigin -n snap_attempt -L 100M
  Logical volume "snap_attempt" created.
[root@hayes-02 ~]# lvchange -an writecache_sanity

[root@hayes-02 ~]# lvconvert --yes --type writecache --cachevol writecache_sanity/pool writecache_sanity/corigin
  Logical volume writecache_sanity/corigin now has write cache.
[root@hayes-02 ~]# lvs -a -o +devices
  LV               VG                Attr       LSize   Pool        Origin           Data%  Meta%  Move Log Cpy%Sync Convert Devices          
  corigin          writecache_sanity owi---C---   4.00g [pool_cvol] [corigin_wcorig]                                         corigin_wcorig(0)
  [corigin_wcorig] writecache_sanity owi---C---   4.00g                                                                      /dev/sdc1(0)     
  [pool_cvol]      writecache_sanity Cwi---C---   2.00g                                                                      /dev/sde1(0)     
  snap_attempt     writecache_sanity swi---s--- 100.00m             corigin                                                  /dev/sdb1(0)

Comment 8 Corey Marthaler 2020-02-06 16:49:28 UTC
I can reliably hit this even with default cache settings. Here's a "boiled down" set of cmds:

pvchange --addtag slow /dev/sdf1
pvchange --addtag fast /dev/sde1
lvcreate --wipesignatures y -L 4G -n corigin writecache_sanity @slow
lvcreate -L 2G -n pool writecache_sanity @fast 
lvchange -an writecache_sanity/corigin writecache_sanity/pool
lvconvert --yes --type writecache --cachevol writecache_sanity/pool writecache_sanity/corigin
lvchange -ay writecache_sanity/corigin 
mkfs.xfs -f /dev/writecache_sanity/corigin
mkdir -p /mnt/corigin
mount /dev/writecache_sanity/corigin /mnt/corigin
/usr/tests/sts-rhel8.2/bin/checkit -w /mnt/corigin -f /tmp/original.1217267 -n 1200
/usr/tests/sts-rhel8.2/bin/checkit -w /mnt/corigin -f /tmp/original.1217267 -v
lvcreate -s /dev/writecache_sanity/corigin -n merge -L 1G
mkdir -p /mnt/merge
mount -o nouuid /dev/writecache_sanity/merge /mnt/merge 
/usr/tests/sts-rhel8.2/bin/checkit -w /mnt/merge -f /tmp/original.1217267 -v

Feb  6 10:33:15 hayes-02 kernel: XFS (dm-5): Metadata corruption detected at xfs_da3_node_read_verify+0x8b/0x160 [xfs], xfs_da3_node block 0x2f5b0
Feb  6 10:33:15 hayes-02 kernel: XFS (dm-5): Unmount and run xfs_repair
Feb  6 10:33:15 hayes-02 kernel: XFS (dm-5): First 128 bytes of corrupted metadata buffer:
Feb  6 10:33:15 hayes-02 kernel: 00000000: 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35  5555555555555555
Feb  6 10:33:15 hayes-02 kernel: 00000010: 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35  5555555555555555
Feb  6 10:33:15 hayes-02 kernel: 00000020: 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35  5555555555555555
Feb  6 10:33:15 hayes-02 kernel: 00000030: 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35  5555555555555555
Feb  6 10:33:15 hayes-02 kernel: 00000040: 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35  5555555555555555
Feb  6 10:33:15 hayes-02 kernel: 00000050: 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35  5555555555555555
Feb  6 10:33:15 hayes-02 kernel: 00000060: 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35  5555555555555555
Feb  6 10:33:15 hayes-02 kernel: 00000070: 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35  5555555555555555
Feb  6 10:33:15 hayes-02 kernel: XFS (dm-5): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x2f5b0 len 8 error 117

Comment 9 David Teigland 2020-02-06 17:32:43 UTC
Added a check to prevent snapshots of writecache until we figure this out:
https://sourceware.org/git/?p=lvm2.git;a=commit;h=ffea7daec3d09fb4315104bb346cfa9d6db6f9e9

Mikulas, do you have any ideas about possible problems with putting a snapshot on top of writecache?

Comment 10 Mikuláš Patočka 2020-02-07 10:58:08 UTC
I don't know how writecache could interfere with a snapshot.

Comment 11 Mikuláš Patočka 2020-02-12 15:25:59 UTC
I have found a bug in dm-writecache that could cause crash or data corruption if the writecache device was suspended.

Try this patch:
https://www.redhat.com/archives/dm-devel/2020-February/msg00106.html

Comment 15 Corey Marthaler 2020-02-27 17:08:22 UTC
The following scratch kernel does *not* fix this snap (with inactive origin and pool conversion) issue.

[root@hayes-02 ~]# uname -ar
Linux hayes-02.lab.msp.redhat.com 4.18.0-184.el8.test.x86_64 #1 SMP Wed Feb 26 14:24:15 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
[root@hayes-02 ~]# rpm -qa | grep kernel
kernel-4.18.0-184.el8.test.x86_64
kernel-modules-4.18.0-184.el8.test.x86_64
kernel-modules-extra-4.18.0-184.el8.test.x86_64
kernel-tools-libs-4.18.0-184.el8.test.x86_64
kernel-tools-4.18.0-184.el8.test.x86_64
kernel-core-4.18.0-184.el8.test.x86_64

Create writecached volume by combining the cache pool (fast) and origin (slow) volumes
lvconvert --yes --type writecache --cachesettings ' low_watermark=42 high_watermark=90 writeback_jobs=1637 autocommit_blocks=2916 autocommit_time=1414' --cachevol writecache_sanity/pool writecache_sanity/cworigin
Placing an xfs filesystem on origin volume
Mounting origin volume

Writing files to /mnt/cworigin

Checking files on /mnt/cworigin

Making a snapshot of the origin volume, mounting, and verifying original data
lvcreate  -s /dev/writecache_sanity/cworigin -c 64 -n merge -L 500M
+++ Mounting and verifying snapshot merge data +++
Checking files on /mnt/merge
Can not stat bbulnrywkusukrbcdjxkbnwpkfgvbcoffjqueprqmmla: Structure needs cleaning

Feb 27 11:03:35 hayes-02 qarshd[7485]: Running cmdline: /usr/tests/sts-rhel8.2/bin/checkit -w /mnt/merge -f /tmp/origv
Feb 27 11:03:35 hayes-02 kernel: XFS (dm-5): Metadata corruption detected at xfs_da3_node_read_verify+0x8b/0x160 [xfs 
Feb 27 11:03:35 hayes-02 kernel: XFS (dm-5): Unmount and run xfs_repair
Feb 27 11:03:35 hayes-02 kernel: XFS (dm-5): First 128 bytes of corrupted metadata buffer:
Feb 27 11:03:35 hayes-02 kernel: 00000000: 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d  MMMMMMMMMMMMMMMM
Feb 27 11:03:35 hayes-02 kernel: 00000010: 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d  MMMMMMMMMMMMMMMM
Feb 27 11:03:35 hayes-02 kernel: 00000020: 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d  MMMMMMMMMMMMMMMM
Feb 27 11:03:35 hayes-02 kernel: 00000030: 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d  MMMMMMMMMMMMMMMM
Feb 27 11:03:35 hayes-02 kernel: 00000040: 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d  MMMMMMMMMMMMMMMM
Feb 27 11:03:35 hayes-02 kernel: 00000050: 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d  MMMMMMMMMMMMMMMM
Feb 27 11:03:35 hayes-02 kernel: 00000060: 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d  MMMMMMMMMMMMMMMM
Feb 27 11:03:35 hayes-02 kernel: 00000070: 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d 4d  MMMMMMMMMMMMMMMM
Feb 27 11:03:35 hayes-02 kernel: XFS (dm-5): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x55138 len 8 er7

Comment 16 Corey Marthaler 2020-02-27 21:38:11 UTC
These bugs still exist with the latest msnitzer kernel (4.18.0-178.el8.bz1798631.x86_64) as well.

Comment 17 David Teigland 2020-03-05 16:56:54 UTC
Mikulas, it looks like the patch in comment 11 didn't resolve the issue.  We're trying to decide whether attaching writecache to an active LV should be disabled in lvm for 8.2, or whether you think a kernel fix is still possible.

Comment 18 Corey Marthaler 2020-03-05 22:08:40 UTC
Looks like this can be avoided by providing a non-default block size (512) at conversion time. I'll remove the Testblocker flag since we can work around this.


# Use the default block size

[root@hayes-03 ~]# lvs -a -o +devices
  LV                 VG                Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices     
  cache_during_fs_io writecache_sanity -wi------- 2.00g                                                     /dev/sdn1(0)
  cworigin           writecache_sanity -wi-ao---- 4.00g                                                     /dev/sdk1(0)

[root@hayes-03 ~]# pvscan
  PV /dev/sdb1   VG writecache_sanity   lvm2 [446.62 GiB / 446.62 GiB free]
  PV /dev/sdk1   VG writecache_sanity   lvm2 [<1.82 TiB / 1.81 TiB free]
  PV /dev/sdl1   VG writecache_sanity   lvm2 [<1.82 TiB / <1.82 TiB free]
  PV /dev/sdn1   VG writecache_sanity   lvm2 [<1.82 TiB / <1.82 TiB free]
  Total: 4 [5.89 TiB] / in use: 4 [5.89 TiB] / in no VG: 0 [0   ]

[root@hayes-03 ~]# dmsetup table writecache_sanity/cworigin
0 8388608 linear 8:161 2048

[root@hayes-03 ~]#  xfs_info /dev/writecache_sanity/cworigin
meta-data=/dev/mapper/writecache_sanity-cworigin isize=512    agcount=4, agsize=262144 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=1048576, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# Use default block_size, cause corruption
[root@hayes-03 ~]# lvconvert --yes --type writecache --cachesettings ' low_watermark=26 high_watermark=69 writeback_jobs=2649 autocommit_blocks=1616 autocommit_time=2443' --cachevol writecache_sanity/cache_during_fs_io writecache_sanity/cworigin
  Logical volume writecache_sanity/cworigin now has write cache.

Mar  5 15:57:06 hayes-03 kernel: device-mapper: writecache: I/O is not aligned, sector 4214122, size 1024, block size 6
Mar  5 15:57:06 hayes-03 kernel: XFS (dm-0): metadata I/O error in "xlog_iodone" at daddr 0x404d6a len 64 error 5
Mar  5 15:57:06 hayes-03 kernel: XFS (dm-0): xfs_do_force_shutdown(0x2) called from line 1272 of file fs/xfs/xfs_log.c2
Mar  5 15:57:06 hayes-03 kernel: XFS (dm-0): Log I/O Error Detected. Shutting down filesystem
Mar  5 15:57:06 hayes-03 kernel: XFS (dm-0): Please unmount the filesystem and rectify the problem(s)

[root@hayes-03 ~]#  dmsetup table writecache_sanity/cworigin
0 8388608 writecache s 253:2 253:1 4096 10 high_watermark 69 low_watermark 26 writeback_jobs 2649 autocommit_blocks 1616 autocommit_time 2443
[root@hayes-03 ~]#  xfs_info /dev/writecache_sanity/cworigin
/mnt/cworigin: Input/output error





# Start over, and provide a non default block size of 512

[root@hayes-03 ~]# df -h
Filesystem                              Size  Used Avail Use% Mounted on
/dev/mapper/writecache_sanity-cworigin  4.0G  416M  3.6G  11% /mnt/cworigin

[root@hayes-03 ~]# lvs -a -o +devices
  LV                 VG                Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices     
  cache_during_fs_io writecache_sanity -wi------- 2.00g                                                     /dev/sdn1(0)
  cworigin           writecache_sanity -wi-ao---- 4.00g                                                     /dev/sdk1(0)

[root@hayes-03 ~]# pvscan
  PV /dev/sdb1   VG writecache_sanity   lvm2 [446.62 GiB / 446.62 GiB free]
  PV /dev/sdk1   VG writecache_sanity   lvm2 [<1.82 TiB / 1.81 TiB free]
  PV /dev/sdl1   VG writecache_sanity   lvm2 [<1.82 TiB / <1.82 TiB free]
  PV /dev/sdn1   VG writecache_sanity   lvm2 [<1.82 TiB / <1.82 TiB free]
  Total: 4 [5.89 TiB] / in use: 4 [5.89 TiB] / in no VG: 0 [0   ]

[root@hayes-03 ~]# dmsetup table writecache_sanity/cworigin
0 8388608 linear 8:161 2048

[root@hayes-03 ~]# xfs_info /dev/writecache_sanity/cworigin
meta-data=/dev/mapper/writecache_sanity-cworigin isize=512    agcount=4, agsize=262144 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=1048576, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# Provide 512 block_size, *no* corruption
[root@hayes-03 ~]# lvconvert --yes --type writecache --cachesettings 'block_size=512 low_watermark=26 high_watermark=69 writeback_jobs=2649 autocommit_blocks=1616 autocommit_time=2443' --cachevol writecache_sanity/cache_during_fs_io writecache_sanity/cworigin
  Logical volume writecache_sanity/cworigin now has write cache.

[root@hayes-03 ~]#  dmsetup table writecache_sanity/cworigin
0 8388608 writecache s 253:2 253:1 512 10 high_watermark 69 low_watermark 26 writeback_jobs 2649 autocommit_blocks 1616 autocommit_time 2443

[root@hayes-03 ~]#  xfs_info /dev/writecache_sanity/cworigin
meta-data=/dev/mapper/writecache_sanity-cworigin isize=512    agcount=4, agsize=262144 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=1048576, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Comment 19 David Teigland 2020-03-05 22:35:48 UTC
We had previously only thought of this issue as it applies to mounting the file system.  In that case xfs won't mount; the user can then detach the writecache and go back to using the fs without it.  We didn't recognize that attaching the writecache to an active LV with a mounted fs would not give us that simple error path to catch and avoid this combination.  The man page notes regarding this, as it relates to mounting, are copied below.  The question now is how easily lvm could detect this invalid combination when the user attempts to attach writecache to an active LV, so that we can refuse.  Without an easy solution for that, we should disable attaching writecache to active LVs.


   dm-writecache block size

       The dm-writecache block size can be 4096 bytes (the default), or 512
       bytes.  The default 4096 has better performance and should be used
       except when 512 is necessary for compatibility.  The dm-writecache
       block size is specified with --cachesettings block_size=4096|512 when
       caching is started.

       When a file system like xfs already exists on the main LV prior to
       caching, and the file system is using a block size of 512, then the
       writecache block size should be set to 512.  (The file system will
       likely fail to mount if writecache block size of 4096 is used in this
       case.)

       Check the xfs sector size while the fs is mounted:

       $ xfs_info /dev/vg/main
       Look for sectsz=512 or sectsz=4096

       The writecache block size should be chosen to match the xfs sectsz
       value.

       It is also possible to specify a sector size of 4096 to mkfs.xfs when
       creating the file system.  In this case the writecache block size of
       4096 can be used.
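
As a rough sketch of the workflow those man page notes describe (reusing the VG/LV names from comment 18; the sectsz value reported by xfs_info decides which block_size to pass):

# check the sector size of the existing xfs fs (look for sectsz=512 or 4096)
xfs_info /dev/writecache_sanity/cworigin | grep sectsz

# if sectsz=512, attach the writecache with a matching 512-byte block size
lvconvert --yes --type writecache --cachesettings 'block_size=512' --cachevol writecache_sanity/cache_during_fs_io writecache_sanity/cworigin

# alternatively, create the fs with a 4096-byte sector size up front so the
# default writecache block_size of 4096 is safe
mkfs.xfs -f -s size=4096 /dev/writecache_sanity/cworigin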

Comment 20 Mikuláš Patočka 2020-03-06 14:48:18 UTC
> The question now is how easily lvm could detect this invalid combination when
> the user attempts to attach writecache to an active LV so that we can refuse.
> Without an easy solution for that we should disable attaching writecache to
> active LVs.

libblkid (starting with version 2.35) can detect the filesystem block size for the majority of filesystems. So we should backport it to RHEL8 and use it in lvm to select the default block size when attaching writecache or integrity to an existing logical volume.
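
On the command line that detection would look roughly like this (a sketch only; it assumes the installed blkid is new enough that a low-level probe reports the filesystem block size, e.g. as a BLOCK_SIZE tag):

# probe the fs on the LV before attaching writecache; a new enough libblkid
# reports the fs block size, which lvm could use to pick block_size=512 or 4096
blkid -p -o export /dev/writecache_sanity/cworigin | grep -i block_size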

Comment 21 David Teigland 2020-03-06 16:36:00 UTC
This bug has become confusing with different bugs/issues:

1. bug when taking a snapshot of a writecache LV.  There was a kernel patch for this which probably works, but we disabled writecache snapshots in lvm based on the uncertainty caused by the other points.  We could probably reenable writecache snapshots, but I don't know if we want to take the time to revisit that.

2. bug when adding writecache to an active LV.  This can be avoided by users doing the right thing, but it's so easy to do the wrong thing, and the consequences are severe enough, that we should disable adding writecache to an active LV for 8.2.

3. bug 1630978 should be used to enable active conversions to writecache in 8.3 by using libblkid to select the correct block size.

Which issues should this bug be used for?

Comment 22 Corey Marthaler 2020-03-06 19:29:45 UTC
I think this bug should be used for #1. We move this to ON_QA and verify that in 8.2 snapshots of writecache volumes are not allowed. 

We then create an RFE in 8.3 (or whenever) to enable them again. 

For #2, we should be using/commenting on bug 1808012. I wrongly added comment #18 (and then subsequently #19) to this bug when it should have been added to 1808012. If we turn off active write cache conversion, then bug 1808012 can be used to verify that it's no longer allowed, and similar to this bug, we can open a new one in 8.3 to enable it again.

Comment 23 David Teigland 2020-03-09 16:16:32 UTC
Using this bug to disable snapshots of writecache LVs in 8.2.

Comment 26 Corey Marthaler 2020-03-10 16:57:14 UTC
Verifying what is mentioned in comments #22 and #23, that snapshots are *not* allowed in the latest rpms. This is *not* a verification that the corruption has been fixed.

kernel-4.18.0-187.el8    BUILT: Fri Mar  6 22:09:03 CST 2020

lvm2-2.03.08-2.el8    BUILT: Mon Feb 24 11:21:38 CST 2020
lvm2-libs-2.03.08-2.el8    BUILT: Mon Feb 24 11:21:38 CST 2020

device-mapper-1.02.169-2.el8    BUILT: Mon Feb 24 11:21:38 CST 2020
device-mapper-libs-1.02.169-2.el8    BUILT: Mon Feb 24 11:21:38 CST 2020
device-mapper-event-1.02.169-2.el8    BUILT: Mon Feb 24 11:21:38 CST 2020
device-mapper-event-libs-1.02.169-2.el8    BUILT: Mon Feb 24 11:21:38 CST 2020


[root@hayes-01 ~]# lvs -a -o +devices
  LV                VG                Attr       LSize Pool        Origin            Data%  Meta% Devices            
  cworigin          writecache_sanity Cwi-aoC--- 4.00g [pool_cvol] [cworigin_wcorig] 18.81        cworigin_wcorig(0)         
  [cworigin_wcorig] writecache_sanity owi-aoC--- 4.00g                                            /dev/sdb1(0)               
  [pool_cvol]       writecache_sanity Cwi-aoC--- 2.00g                                            /dev/sdd1(0)               

[root@hayes-01 ~]# lvcreate  -s /dev/writecache_sanity/cworigin -c 64 -n merge -L 500M
  Snapshots of writecache are not supported.

Comment 28 Mikuláš Patočka 2020-04-23 17:46:36 UTC
Perhaps this bug was already fixed with this patch: https://www.redhat.com/archives/dm-devel/2020-April/msg00159.html

Could someone recheck this?

Comment 30 errata-xmlrpc 2020-04-28 16:59:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:1881

Comment 32 Red Hat Bugzilla 2023-09-15 00:21:11 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days