Created attachment 1663111 [details]
logs

Description of problem:

Live storage migration in RHV is broken with libvirt 6.0 because of locking changes on the destination image chain.

The flow is:

- VM running with one (or more) disks
- Vdsm creates a snapshot on the source storage
- Vdsm creates an empty image chain replicating the original disk on the destination storage
- Vdsm calls virDomainBlockCopy with the destination chain, using these flags:

    flags = (libvirt.VIR_DOMAIN_BLOCK_COPY_SHALLOW |
             libvirt.VIR_DOMAIN_BLOCK_COPY_REUSE_EXT |
             libvirt.VIR_DOMAIN_BLOCK_COPY_TRANSIENT_JOB)
    self._dom.blockCopy(drive.name, destxml, flags=flags)

- Libvirt starts a mirror job, mirroring the active layer to both the source and the destination storage
- Vdsm copies the rest of the chain from source to destination storage using qemu-img convert
- qemu-img convert fails the copy with:

    qemu-img: /rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/66e23991-773c-41e1-bbc9-651cd50b0220: error while converting raw: Failed to get "write" lock
    Is another process using the image [/rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/66e23991-773c-41e1-bbc9-651cd50b0220]?

- The operation fails because qemu is holding a read lock on the destination image chain.

In previous versions there was no such issue, so I assume qemu did not take these locks. I guess this is a side effect of the blockdev changes.

Version-Release number of selected component (if applicable):

    # rpm -q libvirt-daemon qemu-kvm
    libvirt-daemon-6.0.0-4.module+el8.2.0+5642+838f3513.x86_64
    qemu-kvm-4.2.0-9.module+el8.2.0+5699+b5331ee5.x86_64

How reproducible:
100%

Steps to Reproduce:
1. In RHV, move a disk to file-based storage while a VM is running

Actual results:
The operation fails because qemu-img convert cannot take a write lock on the base volume on the destination storage.
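The flow above can be sketched in Python as follows. This is a minimal illustration, not vdsm's actual code: the destination `<disk>` XML and the helper name are assumptions, and the flag values are spelled out locally so the sketch is self-contained (they mirror libvirt's virDomainBlockCopyFlags enum).

```python
# Minimal sketch of the vdsm blockCopy call, not vdsm's actual code.
# Flag values mirror libvirt's virDomainBlockCopyFlags enum.
VIR_DOMAIN_BLOCK_COPY_SHALLOW = 1 << 0        # copy only the top image
VIR_DOMAIN_BLOCK_COPY_REUSE_EXT = 1 << 1      # reuse the pre-created destination image
VIR_DOMAIN_BLOCK_COPY_TRANSIENT_JOB = 1 << 2  # job not persisted across libvirtd restarts


def start_block_copy(dom, drive_name, dest_path):
    # Illustrative destination <disk> XML pointing at the pre-created top
    # volume of the replicated chain on the destination storage.
    destxml = ("<disk type='file'>"
               "<driver name='qemu' type='qcow2'/>"
               "<source file='{}'/>"
               "</disk>").format(dest_path)
    flags = (VIR_DOMAIN_BLOCK_COPY_SHALLOW |
             VIR_DOMAIN_BLOCK_COPY_REUSE_EXT |
             VIR_DOMAIN_BLOCK_COPY_TRANSIENT_JOB)
    dom.blockCopy(drive_name, destxml, flags=flags)
    return flags, destxml
```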
Expected results:
qemu does not hold read locks on the destination image chain during blockCopy. The read locks should only be taken at the point we pivot to the destination chain.

To debug this I added a sleep in vdsm before starting the copy, and collected the VM XML and lslocks output after blockCopy was issued, but before starting qemu-img convert.

This is the VM XML before starting blockCopy:

<disk type='block' device='disk' snapshot='no'>
  <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='native' iothread='1'/>
  <source dev='/rhev/data-center/mnt/blockSD/91630622-c645-4397-a9fe-9ddf26690500/images/11aae027-23a3-4c56-a1cb-4883c323e708/c1b7a23d-76f0-438e-87a6-4adbbfcafd13' index='1'>
    <seclabel model='dac' relabel='no'/>
  </source>
  <backingStore type='block' index='3'>
    <format type='qcow2'/>
    <source dev='/rhev/data-center/mnt/blockSD/91630622-c645-4397-a9fe-9ddf26690500/images/11aae027-23a3-4c56-a1cb-4883c323e708/66e23991-773c-41e1-bbc9-651cd50b0220'>
      <seclabel model='dac' relabel='no'/>
    </source>
    <backingStore/>
  </backingStore>
  <target dev='vda' bus='virtio'/>
  <serial>11aae027-23a3-4c56-a1cb-4883c323e708</serial>
  <boot order='1'/>
  <alias name='ua-11aae027-23a3-4c56-a1cb-4883c323e708'/>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
</disk>

This is the VM XML after starting blockCopy, before starting qemu-img convert:

<disk type='block' device='disk' snapshot='no'>
  <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='native' iothread='1'/>
  <source dev='/rhev/data-center/mnt/blockSD/91630622-c645-4397-a9fe-9ddf26690500/images/11aae027-23a3-4c56-a1cb-4883c323e708/f4c29b2d-ba32-425d-8348-791cd7ffe3be' index='4'>
    <seclabel model='dac' relabel='no'/>
  </source>
  <backingStore type='block' index='1'>
    <format type='qcow2'/>
    <source dev='/rhev/data-center/mnt/blockSD/91630622-c645-4397-a9fe-9ddf26690500/images/11aae027-23a3-4c56-a1cb-4883c323e708/c1b7a23d-76f0-438e-87a6-4adbbfcafd13'>
      <seclabel model='dac' relabel='no'/>
    </source>
    <backingStore type='block' index='3'>
      <format type='qcow2'/>
      <source dev='/rhev/data-center/mnt/blockSD/91630622-c645-4397-a9fe-9ddf26690500/images/11aae027-23a3-4c56-a1cb-4883c323e708/66e23991-773c-41e1-bbc9-651cd50b0220'>
        <seclabel model='dac' relabel='no'/>
      </source>
      <backingStore/>
    </backingStore>
  </backingStore>
  <mirror type='file' file='/rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/f4c29b2d-ba32-425d-8348-791cd7ffe3be' format='qcow2' job='copy' ready='yes'>
    <format type='qcow2'/>
    <source file='/rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/f4c29b2d-ba32-425d-8348-791cd7ffe3be' index='5'>
      <seclabel model='dac' relabel='no'/>
    </source>
    <backingStore type='file' index='6'>
      <format type='qcow2'/>
      <source file='/rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/c1b7a23d-76f0-438e-87a6-4adbbfcafd13'>
        <seclabel model='dac' relabel='no'/>
      </source>
      <backingStore type='file' index='7'>
        <format type='raw'/>
        <source file='/rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/66e23991-773c-41e1-bbc9-651cd50b0220'>
          <seclabel model='dac' relabel='no'/>
        </source>
        <backingStore/>
      </backingStore>
    </backingStore>
  </mirror>
  <target dev='vda' bus='virtio'/>
  <serial>11aae027-23a3-4c56-a1cb-4883c323e708</serial>
  <boot order='1'/>
  <alias name='ua-11aae027-23a3-4c56-a1cb-4883c323e708'/>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
</disk>

These are the locks taken by qemu at this point:

qemu-kvm 2335 POSIX 192.1K READ 0 100 100 /rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/c1b7a23d-76f0-438e-87a6-4adbbfcafd13
qemu-kvm 2335 POSIX 192.1K READ 0 201 201 /rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/c1b7a23d-76f0-438e-87a6-4adbbfcafd13
qemu-kvm 2335 POSIX 192.1K READ 0 203 203 /rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/c1b7a23d-76f0-438e-87a6-4adbbfcafd13
qemu-kvm 2335 POSIX 6G READ 0 201 201 /rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/66e23991-773c-41e1-bbc9-651cd50b0220
qemu-kvm 2335 POSIX 6G READ 0 203 203 /rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/66e23991-773c-41e1-bbc9-651cd50b0220

This is the image chain on the destination storage (nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed):

f4c29b2d-ba32-425d-8348-791cd7ffe3be
c1b7a23d-76f0-438e-87a6-4adbbfcafd13
66e23991-773c-41e1-bbc9-651cd50b0220

qemu-kvm holds read locks on:

c1b7a23d-76f0-438e-87a6-4adbbfcafd13
66e23991-773c-41e1-bbc9-651cd50b0220

The qemu-img convert command is:

qemu-img convert -p -t none -T none -f qcow2 \
    /rhev/data-center/mnt/blockSD/91630622-c645-4397-a9fe-9ddf26690500/images/11aae027-23a3-4c56-a1cb-4883c323e708/66e23991-773c-41e1-bbc9-651cd50b0220 \
    -O raw \
    /rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/66e23991-773c-41e1-bbc9-651cd50b0220

The operation fails because qemu-img convert cannot take a write lock while qemu-kvm is holding a read lock on the same image:

Failed to get "write" lock
Is another process using the image [/rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/66e23991-773c-41e1-bbc9-651cd50b0220]?
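The conflict can also be demonstrated outside qemu-img: qemu implements its image locking with POSIX record locks on small byte offsets within the image file (the 100/201/203 ranges in the lslocks output above). The sketch below probes whether the exclusive lock qemu-img would need can currently be taken. It is an illustration of the locking mechanism only, not part of vdsm, and `probe_write_lock` and its default byte offset are assumptions based on the lslocks output.

```python
import fcntl
import os


def probe_write_lock(path, byte=100):
    """Try to take (and immediately release) an exclusive POSIX record lock
    on one of the byte offsets qemu uses for image locking (assumed here to
    be byte 100, per the lslocks output).  Returns False if another process
    holds a conflicting lock, mirroring qemu-img's
    'Failed to get "write" lock' failure."""
    fd = os.open(path, os.O_RDWR)
    try:
        try:
            # Non-blocking exclusive lock on a single byte.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 1, byte)
        except OSError:
            return False  # a conflicting lock is held by another process
        fcntl.lockf(fd, fcntl.LOCK_UN, 1, byte)
        return True
    finally:
        os.close(fd)
```

Note that POSIX record locks never conflict within the same process, so this probe is only meaningful against locks held by other processes (such as qemu-kvm).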
These are the QMP commands adding the destination images:

# grep QEMU_MONITOR_SEND_MSG libvirtd.log | grep blockdev-add | grep 66e23991-773c-41e1-bbc9-651cd50b0220
2020-02-14 11:45:48.804+0000: 18433: info : qemuMonitorSend:996 : QEMU_MONITOR_SEND_MSG: mon=0x7f6ac4026910 msg={"execute":"blockdev-add","arguments":{"driver":"file","filename":"/rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/66e23991-773c-41e1-bbc9-651cd50b0220","aio":"native","node-name":"libvirt-7-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"},"id":"libvirt-424"}

# grep QEMU_MONITOR_SEND_MSG libvirtd.log | grep blockdev-add | grep c1b7a23d-76f0-438e-87a6-4adbbfcafd13
2020-02-14 11:45:48.810+0000: 18433: info : qemuMonitorSend:996 : QEMU_MONITOR_SEND_MSG: mon=0x7f6ac4026910 msg={"execute":"blockdev-add","arguments":{"driver":"file","filename":"/rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/c1b7a23d-76f0-438e-87a6-4adbbfcafd13","aio":"native","node-name":"libvirt-6-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"},"id":"libvirt-426"}

Can we control locking in these commands?

There is another issue - it looks like there is no write lock on the top image in the destination chain:

/rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/f4c29b2d-ba32-425d-8348-791cd7ffe3be
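On the question of controlling locking in these commands: qemu's file driver does expose a 'locking' option ("auto"/"on"/"off") in blockdev-add, although libvirt does not set it in the commands above. The sketch below rebuilds the first logged command with locking disabled, purely to illustrate the knob; disabling locking also gives up qemu's protection against concurrent writers, so this is not a recommendation, and whether libvirt should use this option here is exactly the open question.

```python
import json

# The first blockdev-add command from the log above, with the file driver's
# 'locking' option (part of qemu's BlockdevOptionsFile) added and set to
# "off".  Illustrative only.
cmd = {
    "execute": "blockdev-add",
    "arguments": {
        "driver": "file",
        "filename": "/rhev/data-center/mnt/nfs1:_export_2/"
                    "55255570-983a-4d82-907a-19b964abf7ed/images/"
                    "11aae027-23a3-4c56-a1cb-4883c323e708/"
                    "66e23991-773c-41e1-bbc9-651cd50b0220",
        "aio": "native",
        "node-name": "libvirt-7-storage",
        "cache": {"direct": True, "no-flush": False},
        "auto-read-only": True,
        "discard": "unmap",
        "locking": "off",   # added: skip qemu's image locking on this node
    },
    "id": "libvirt-424",
}
msg = json.dumps(cmd)
```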
How this bug affects RHV:

Moving disks to another storage domain while the VM is running is not possible. Users need to stop the VM to move the disk, which can take significant time (minutes to hours) with large disks. This is also a regression compared with the previous version (RHEL 7.7).
As this would effectively disable one of the important RHV features, moving this to the 8.2 release. Peter, if you know of any way LSM (Live Storage Migration) can be made to work, please share it with Nir.
This breaks because RHV abused a quirk in qemu's handling of the backing chain, which was previously opened only when attempting to finish the job. Libvirt's new handling expects the backing store to be present right away. I'll try to see whether we can move the opening of the backing chain to a later point to preserve the previous functionality.
When does oVirt establish the backing files of the copied image? Are they present when calling virDomainBlockCopy? Is the metadata in the images correct? We need to know this because we need to inspect the backing chain, and this effectively bakes those semantics into our API.
(In reply to Peter Krempa from comment #4)

> When does oVirt establish the backing files of the copied image?

Before calling virDomainBlockCopy.

> Are they present when calling virDomainBlockCopy?

Yes.

> Is the metadata correct in the images?

Yes. Basically we replicate the image chain on the destination storage, using the same image formats and sizes as the original chain. At this point all the images are empty.

If the VM is using a template disk, the template disk must already exist on the destination storage, so this image will be complete at this point. We don't copy templates as part of live storage migration.

The base volume in the chain may have a different format than the source. For example, the source may be a raw sparse file on file storage, and the destination may be a qcow2 thin disk on block storage.

Note that RHV may want to change the destination chain in the future, for example collapsing all the images into a single image, but we don't support this yet.

Then we call virDomainBlockCopy() with the top volume on the destination storage, and libvirt/qemu start to mirror the active layer from source to destination. At this point libvirt can query the image chain to get its structure.

Then we start to copy the disks from source to destination using qemu-img convert. We copy the volumes starting with the base and ending with the last volume before the active layer. We never access the top image on the destination, so qemu can take a write lock on it at the point virDomainBlockCopy() is called.

When the copy is done we check the state of the block copy job, and when it is ready we pivot to the new chain. So at the point we abort and pivot, we no longer access the disks.
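The copy-then-pivot sequence described above can be sketched as follows. This is an illustration, not vdsm's code: `copy_chain_and_pivot` and its arguments are hypothetical, the qemu-img invocation mirrors the one used in this report, and the abort flag value mirrors libvirt's virDomainBlockJobAbortFlags enum. Vdsm's check that the job is ready before pivoting is omitted here.

```python
import subprocess

# Value mirrors libvirt's virDomainBlockJobAbortFlags enum.
VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT = 1 << 1


def convert_cmd(src, dst, dst_fmt):
    # The same qemu-img convert invocation vdsm uses in this report.
    return ["qemu-img", "convert", "-p", "-t", "none", "-T", "none",
            "-f", "qcow2", src, "-O", dst_fmt, dst]


def copy_chain_and_pivot(dom, drive_name, chain, run=subprocess.check_call):
    """chain: (src_path, dst_path, dst_format) tuples for every volume below
    the active layer, ordered base first.  The active layer itself is
    mirrored by the block copy job, so it is never converted here."""
    for src, dst, fmt in chain:
        run(convert_cmd(src, dst, fmt))
    # Once the block job reports ready, switch the VM to the new chain.
    dom.blockJobAbort(drive_name, flags=VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)
```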
The main problem, and the reason why the behaviour changed in the first place, is that to integrate with -blockdev in qemu we started asking qemu to open the file. Currently it's required to open it right away when starting the block job, as the backing chain must be installed as the backing images of the target of the copy.

We can arguably move that code right before the 'pivot/block-job-complete' step, but that requires us to install the backing files later.

I discussed this issue briefly with Kevin Wolf. He suggested using qemu's snapshot API, which is stable. Unfortunately that didn't really work out. I've also tried using the not-yet-supported-by-upstream 'x-blockdev-reopen', which worked well in my crude fix. Unfortunately that wouldn't be an upstream-ready solution for now, even if I clean up the patches.

I've thus filed bug 1805143 against qemu which tries to summarize this problem in qemu's language.
A demonstration of the above-mentioned fix was posted upstream for review:

https://www.redhat.com/archives/libvir-list/2020-February/msg01125.html
Fixed upstream by:

cc7868a8b3 qemu: blockcopy: Allow late opening of the backing chain of a shallow copy
d6498be165 qemu: capabilities: Introduce QEMU_CAPS_BLOCKDEV_SNAPSHOT_ALLOW_WRITE_ONLY
c2b6bc7377 qemuDomainBlockCopyCommon: Record updated flags to block job
a89ba6524c qemuDomainBlockPivot: Move check prior to executing the pivot steps
7c7fda15f3 qemu: capabilities: Update qemu-5.0.0 capabilities for x86_64
In the previous test I made a mistake - I used the wrong VM when doing blockcopy. Changed to the following test steps and the result is PASS.

Version:
libvirt-6.0.0-14.module+el8.2.0+6069+78a1cb09.x86_64

1. Make 3 snapshots for the vm disk vda

[root@dell-per740xd-11 inc_bkup]# virsh domblklist checker
 Target   Source
------------------------------------------------------
 vda      /var/lib/libvirt/images/vda.full.backup.s3

[root@dell-per740xd-11 inc_bkup]# qemu-img info -U --backing-chain /var/lib/libvirt/images/vda.full.backup.s3 | egrep "^image|^backing file:"
image: /var/lib/libvirt/images/vda.full.backup.s3
backing file: /var/lib/libvirt/images/vda.full.backup.s2
image: /var/lib/libvirt/images/vda.full.backup.s2
backing file: /var/lib/libvirt/images/vda.full.backup.s1
image: /var/lib/libvirt/images/vda.full.backup.s1
backing file: /var/lib/libvirt/images/vda.full.backup.copy
image: /var/lib/libvirt/images/vda.full.backup.copy

[root@dell-per740xd-11 inc_bkup]# virsh snapshot-list checker
 Name   Creation Time               State
---------------------------------------------------
 s1     2020-03-27 03:18:17 -0400   disk-snapshot
 s2     2020-03-27 03:18:56 -0400   disk-snapshot
 s3     2020-03-27 03:19:15 -0400   disk-snapshot

2. Prepare a target image for blockcopy

[root@dell-per740xd-11 inc_bkup]# qemu-img create -f qcow2 /tmp/vda.copy 10G
Formatting '/tmp/vda.copy', fmt=qcow2 size=10737418240 cluster_size=65536 lazy_refcounts=off refcount_bits=16

3. Start the shallow blockcopy

[root@dell-per740xd-11 inc_bkup]# virsh blockcopy checker vda /tmp/vda.copy --shallow --reuse-external --transient-job
Block Copy started

[root@dell-per740xd-11 inc_bkup]# virsh blockjob checker vda
Block Copy: [100 %]

4. Do qemu-img convert for the images in the backing chain

[root@dell-per740xd-11 inc_bkup]# qemu-img convert -p -t none -T none -f qcow2 /var/lib/libvirt/images/vda.full.backup.copy -O qcow2 /tmp/vda.base.copy
    (100.00/100%)
[root@dell-per740xd-11 inc_bkup]# qemu-img convert -p -t none -T none -f qcow2 /var/lib/libvirt/images/vda.full.backup.s1 -O raw /tmp/s1
    (100.00/100%)
[root@dell-per740xd-11 inc_bkup]# qemu-img convert -p -t none -T none -f qcow2 /var/lib/libvirt/images/vda.full.backup.s2 -O qcow2 /tmp/s2
    (100.00/100%)

5. The mirror and the active layer image cannot be converted, but I think this is expected.

[root@dell-per740xd-11 inc_bkup]# qemu-img convert -p -t none -T none -f qcow2 /tmp/vda.copy -O qcow2 /tmp/copy.copy
qemu-img: Could not open '/tmp/vda.copy': Failed to get "consistent read" lock
Is another process using the image [/tmp/vda.copy]?

[root@dell-per740xd-11 inc_bkup]# qemu-img convert -p -t none -T none -f qcow2 /var/lib/libvirt/images/vda.full.backup.s3 -O qcow2 /tmp/s3
qemu-img: Could not open '/var/lib/libvirt/images/vda.full.backup.s3': Failed to get shared "write" lock
Is another process using the image [/var/lib/libvirt/images/vda.full.backup.s3]?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2017