Bug 2203265

Summary: protection against splitting off mirror legs during up-conversion appears to be missing now when sync is close to completion on fast NVMe devices
Product: Red Hat Enterprise Linux 9
Component: lvm2
Sub component: Mirroring and RAID
Version: 9.3
Status: NEW
Severity: medium
Priority: high
Keywords: Triaged
Target Milestone: rc
Hardware: x86_64
OS: Linux
Type: Bug
Assignee: Heinz Mauelshagen <heinzm>
Reporter: Corey Marthaler <cmarthal>
QA Contact: cluster-qe <cluster-qe>
CC: agk, heinzm, jbrassow, msnitzer, prajnoha, zkabelac

Description Corey Marthaler 2023-05-11 16:49:40 UTC
Description of problem:
This sort of operation used to be protected against, which makes me believe this is a regression.

[root@grant-03 ~]# lvcreate --yes -n fs_to_mirror -L 1.5G mirror_sanity
  Logical volume "fs_to_mirror" created.
[root@grant-03 ~]# lvs -a -o +devices
  LV           VG            Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
  fs_to_mirror mirror_sanity -wi-a-----    1.50g                                                     /dev/nvme0n1p1(0)

[root@grant-03 ~]# lvconvert --type mirror -m 1 mirror_sanity/fs_to_mirror &
[1] 1560199
[root@grant-03 ~]#   
  Logical volume mirror_sanity/fs_to_mirror being converted.
  mirror_sanity/fs_to_mirror: Converted: 3.12%
  ABORTING: Mirror percentage check failed.

[1]+  Exit 5                  lvconvert --type mirror -m 1 mirror_sanity/fs_to_mirror

<WHILE ABOVE CMD IS RUNNING>
[root@grant-03 ~]# lvconvert --splitmirrors 1 --name new mirror_sanity/fs_to_mirror
  Logical volume mirror_sanity/fs_to_mirror converted.

[root@grant-03 ~]# lvs -a -o +devices
  LV           VG            Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
  fs_to_mirror mirror_sanity -wi-a-----    1.50g                                                     /dev/nvme0n1p1(0)
  new          mirror_sanity -wi-a-----    1.50g                                                     /dev/nvme1n1p1(0)


May 11 18:31:15 grant-03 systemd[1]: Started LVM2 poll daemon.
May 11 18:31:16 grant-03 dmeventd[1559481]: mirror_sanity-fs_to_mirror is now in-sync.
May 11 18:31:20 grant-03 dmeventd[1559481]: No longer monitoring mirror device mirror_sanity-fs_to_mirror for events.
May 11 18:31:30 grant-03 lvmpolld[1560234]: W: #011LVPOLL: PID 1560237: STDERR: '  ABORTING: Mirror percentage check failed.'
May 11 18:31:30 grant-03 lvmpolld[1560234]: LVMPOLLD: lvm2 cmd (PID 1560237) failed (retcode: 5)
May 11 18:32:30 grant-03 systemd[1]: lvm2-lvmpolld.service: Deactivated successfully.
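
For reference, the actual sync state during the race can be watched from a second terminal with lvs (not part of the original run; copy_percent is the field behind the Cpy%Sync column):

# watch -n 0.2 'lvs -a -o lv_name,lv_attr,copy_percent,devices mirror_sanity'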


Version-Release number of selected component (if applicable):
kernel-5.14.0-284.11.1.el9_2    BUILT: Wed Apr 12 04:16:55 PM CEST 2023
lvm2-2.03.21-1.el9    BUILT: Fri Apr 21 02:33:33 PM CEST 2023
lvm2-libs-2.03.21-1.el9    BUILT: Fri Apr 21 02:33:33 PM CEST 2023


How reproducible:
Every time

Comment 1 Corey Marthaler 2023-05-11 17:05:34 UTC
This appears to only happen with fast NVMe devices. I haven't been able to reproduce this with SSDs. Also, if I bump the size of the mirror way up with the NVMe devices, then I will see the "Unable to split mirror mirror_sanity/fs_to_mirror that is not in-sync." messages again.

So this appears to be a timing issue where LVM thinks the sync is complete, or close to complete, but it's not really.
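
A minimal reproducer sketch, condensed from the transcript in the description (it assumes the same mirror_sanity VG on two fast NVMe PVs; all names are taken from this report):

# lvcreate --yes -n fs_to_mirror -L 1.5G mirror_sanity
# lvconvert --type mirror -m 1 mirror_sanity/fs_to_mirror &
# lvconvert --splitmirrors 1 --name new mirror_sanity/fs_to_mirror   <- issued while the conversion is still syncing

Expected: "Unable to split mirror mirror_sanity/fs_to_mirror that is not in-sync."
Observed on fast NVMe: the split succeeds and the background conversion aborts with "ABORTING: Mirror percentage check failed."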

Comment 2 Heinz Mauelshagen 2023-05-22 12:34:01 UTC
(In reply to Corey Marthaler from comment #1)
> This appears to only happen with fast nvme devices. I haven't been able to
> reproduce this with SSDs. Also, if I bump the size of the mirror way up with
> the nvme devices, then I will see the "Unable to split mirror
> mirror_sanity/fs_to_mirror that is not in-sync." messages again. 
> 
> So this appears to be a timing issue where LVM thinks the sync is complete,
> or close to complete, but it's not really.

So it seems.

FYI: we've seen similar issues with dm-raid on (small) LVs backed by fast PVs, which were fixed a few years ago.

No reproducer in VM here (virtio-scsi -> host thin -> host PCIe4 NVMe):

# lvconvert -y --ty mirror -m1 t/t&lvconvert -y --splitmirrors 1 --name n t/t
[1] 5089
  Logical volume t/t being converted.
  Unable to split mirror t/t that is not in-sync.
  t/t: Converted: 75.00%

# lvs
  LV   VG        Attr       LSize   Pool Origin Data%  Meta%  Move Log      Cpy%Sync Convert
  root rhel_vm45 -wi-ao---- <17.00g                                                         
  swap rhel_vm45 -wi-ao----   2.00g                                                         
  t    t         mwi-a-m--- 128.00m                                [t_mlog] 100.00          

# lvconvert -m0 t/t
  t/t: Converted: 100.00%
  Logical volume t/t converted.
[root@vm45 ~]# lvs
  LV   VG        Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root rhel_vm45 -wi-ao---- <17.00g                                                    
  swap rhel_vm45 -wi-ao----   2.00g                                                    
  t    t         -wi-a----- 128.00m                                                    

# lvconvert -y --ty mirror -m1 t/t&sleep 0.1;lvconvert -y --splitmirrors 1 --name n t/t
[2] 5404
  Logical volume t/t being converted.
[1]   Done                    lvconvert -y --ty mirror -m1 t/t
  Unable to split mirror t/t that is not in-sync.
  t/t: Converted: 71.88%