Created attachment 1663111 [details]
logs

Description of problem:

Live storage migration in RHV is broken with libvirt 6.0 because of locking changes on the destination image chain.

The flow is:

- VM running with one (or more) disks
- Vdsm creates a snapshot on the source storage
- Vdsm creates an empty image chain replicating the original disk on the destination storage
- Vdsm calls virDomainBlockCopy with the destination chain, using these flags:

    flags = (libvirt.VIR_DOMAIN_BLOCK_COPY_SHALLOW |
             libvirt.VIR_DOMAIN_BLOCK_COPY_REUSE_EXT |
             libvirt.VIR_DOMAIN_BLOCK_COPY_TRANSIENT_JOB)
    self._dom.blockCopy(drive.name, destxml, flags=flags)

- Libvirt starts a mirror job, mirroring the active layer to both the source and the destination storage
- Vdsm copies the rest of the chain from source to destination storage using qemu-img convert
- qemu-img convert fails the copy with:

    qemu-img: /rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/66e23991-773c-41e1-bbc9-651cd50b0220: error while converting raw: Failed to get "write" lock
    Is another process using the image [/rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/66e23991-773c-41e1-bbc9-651cd50b0220]?

- The operation fails because qemu is holding a read lock on the destination image chain.

In previous versions there was no such issue, so I assume qemu did not take these locks. I guess this is a side effect of the blockdev changes.

Version-Release number of selected component (if applicable):

    # rpm -q libvirt-daemon qemu-kvm
    libvirt-daemon-6.0.0-4.module+el8.2.0+5642+838f3513.x86_64
    qemu-kvm-4.2.0-9.module+el8.2.0+5699+b5331ee5.x86_64

How reproducible:
100%

Steps to Reproduce:
1. In RHV, move a disk to file-based storage while a VM is running

Actual results:
The operation fails because qemu-img convert cannot take a write lock on the base volume on the destination storage.
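The flow above can be sketched in Python as follows. This is a minimal illustration, not vdsm's actual code: the destination `<disk>` XML and the helper name are assumptions, and the flag values are spelled out locally so the sketch is self-contained (they mirror libvirt's virDomainBlockCopyFlags enum).

```python
# Minimal sketch of the vdsm blockCopy call, not vdsm's actual code.
# Flag values mirror libvirt's virDomainBlockCopyFlags enum.
VIR_DOMAIN_BLOCK_COPY_SHALLOW = 1 << 0        # copy only the top image
VIR_DOMAIN_BLOCK_COPY_REUSE_EXT = 1 << 1      # reuse the pre-created destination image
VIR_DOMAIN_BLOCK_COPY_TRANSIENT_JOB = 1 << 2  # job not persisted across libvirtd restarts


def start_block_copy(dom, drive_name, dest_path):
    # Illustrative destination <disk> XML pointing at the pre-created top
    # volume of the replicated chain on the destination storage.
    destxml = ("<disk type='file'>"
               "<driver name='qemu' type='qcow2'/>"
               "<source file='{}'/>"
               "</disk>").format(dest_path)
    flags = (VIR_DOMAIN_BLOCK_COPY_SHALLOW |
             VIR_DOMAIN_BLOCK_COPY_REUSE_EXT |
             VIR_DOMAIN_BLOCK_COPY_TRANSIENT_JOB)
    dom.blockCopy(drive_name, destxml, flags=flags)
    return flags, destxml
```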
Expected results:
qemu does not hold read locks on the destination image chain during blockCopy. The read locks should only be taken at the point we pivot to the destination chain.

To debug this I added a sleep in vdsm before starting the copy, and collected the VM XML and lslocks output after blockCopy was issued, but before starting qemu-img convert.

This is the VM XML before starting blockCopy:

<disk type='block' device='disk' snapshot='no'>
  <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='native' iothread='1'/>
  <source dev='/rhev/data-center/mnt/blockSD/91630622-c645-4397-a9fe-9ddf26690500/images/11aae027-23a3-4c56-a1cb-4883c323e708/c1b7a23d-76f0-438e-87a6-4adbbfcafd13' index='1'>
    <seclabel model='dac' relabel='no'/>
  </source>
  <backingStore type='block' index='3'>
    <format type='qcow2'/>
    <source dev='/rhev/data-center/mnt/blockSD/91630622-c645-4397-a9fe-9ddf26690500/images/11aae027-23a3-4c56-a1cb-4883c323e708/66e23991-773c-41e1-bbc9-651cd50b0220'>
      <seclabel model='dac' relabel='no'/>
    </source>
    <backingStore/>
  </backingStore>
  <target dev='vda' bus='virtio'/>
  <serial>11aae027-23a3-4c56-a1cb-4883c323e708</serial>
  <boot order='1'/>
  <alias name='ua-11aae027-23a3-4c56-a1cb-4883c323e708'/>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
</disk>

This is the VM XML after starting blockCopy, before starting qemu-img convert:

<disk type='block' device='disk' snapshot='no'>
  <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='native' iothread='1'/>
  <source dev='/rhev/data-center/mnt/blockSD/91630622-c645-4397-a9fe-9ddf26690500/images/11aae027-23a3-4c56-a1cb-4883c323e708/f4c29b2d-ba32-425d-8348-791cd7ffe3be' index='4'>
    <seclabel model='dac' relabel='no'/>
  </source>
  <backingStore type='block' index='1'>
    <format type='qcow2'/>
    <source dev='/rhev/data-center/mnt/blockSD/91630622-c645-4397-a9fe-9ddf26690500/images/11aae027-23a3-4c56-a1cb-4883c323e708/c1b7a23d-76f0-438e-87a6-4adbbfcafd13'>
      <seclabel model='dac' relabel='no'/>
    </source>
    <backingStore type='block' index='3'>
      <format type='qcow2'/>
      <source dev='/rhev/data-center/mnt/blockSD/91630622-c645-4397-a9fe-9ddf26690500/images/11aae027-23a3-4c56-a1cb-4883c323e708/66e23991-773c-41e1-bbc9-651cd50b0220'>
        <seclabel model='dac' relabel='no'/>
      </source>
      <backingStore/>
    </backingStore>
  </backingStore>
  <mirror type='file' file='/rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/f4c29b2d-ba32-425d-8348-791cd7ffe3be' format='qcow2' job='copy' ready='yes'>
    <format type='qcow2'/>
    <source file='/rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/f4c29b2d-ba32-425d-8348-791cd7ffe3be' index='5'>
      <seclabel model='dac' relabel='no'/>
    </source>
    <backingStore type='file' index='6'>
      <format type='qcow2'/>
      <source file='/rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/c1b7a23d-76f0-438e-87a6-4adbbfcafd13'>
        <seclabel model='dac' relabel='no'/>
      </source>
      <backingStore type='file' index='7'>
        <format type='raw'/>
        <source file='/rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/66e23991-773c-41e1-bbc9-651cd50b0220'>
          <seclabel model='dac' relabel='no'/>
        </source>
        <backingStore/>
      </backingStore>
    </backingStore>
  </mirror>
  <target dev='vda' bus='virtio'/>
  <serial>11aae027-23a3-4c56-a1cb-4883c323e708</serial>
  <boot order='1'/>
  <alias name='ua-11aae027-23a3-4c56-a1cb-4883c323e708'/>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
</disk>

These are the locks taken by qemu at this point:

qemu-kvm 2335 POSIX 192.1K READ 0 100 100 /rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/c1b7a23d-76f0-438e-87a6-4adbbfcafd13
qemu-kvm 2335 POSIX 192.1K READ 0 201 201 /rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/c1b7a23d-76f0-438e-87a6-4adbbfcafd13
qemu-kvm 2335 POSIX 192.1K READ 0 203 203 /rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/c1b7a23d-76f0-438e-87a6-4adbbfcafd13
qemu-kvm 2335 POSIX 6G READ 0 201 201 /rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/66e23991-773c-41e1-bbc9-651cd50b0220
qemu-kvm 2335 POSIX 6G READ 0 203 203 /rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/66e23991-773c-41e1-bbc9-651cd50b0220

This is the image chain on the destination storage (nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed):

f4c29b2d-ba32-425d-8348-791cd7ffe3be
c1b7a23d-76f0-438e-87a6-4adbbfcafd13
66e23991-773c-41e1-bbc9-651cd50b0220

qemu-kvm holds read locks on:

c1b7a23d-76f0-438e-87a6-4adbbfcafd13
66e23991-773c-41e1-bbc9-651cd50b0220

The qemu-img convert command is:

qemu-img convert -p -t none -T none -f qcow2 \
    /rhev/data-center/mnt/blockSD/91630622-c645-4397-a9fe-9ddf26690500/images/11aae027-23a3-4c56-a1cb-4883c323e708/66e23991-773c-41e1-bbc9-651cd50b0220 \
    -O raw \
    /rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/66e23991-773c-41e1-bbc9-651cd50b0220

The operation fails because qemu-img convert cannot take a write lock while qemu-kvm is holding a read lock on the same image:

Failed to get "write" lock
Is another process using the image [/rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/66e23991-773c-41e1-bbc9-651cd50b0220]?
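The conflict can also be demonstrated outside qemu-img: qemu implements its image locking with POSIX record locks on small byte offsets within the image file (the 100/201/203 ranges in the lslocks output above). The sketch below probes whether the exclusive lock qemu-img would need can currently be taken. It is an illustration of the locking mechanism only, not part of vdsm, and `probe_write_lock` and its default byte offset are assumptions based on the lslocks output.

```python
import fcntl
import os


def probe_write_lock(path, byte=100):
    """Try to take (and immediately release) an exclusive POSIX record lock
    on one of the byte offsets qemu uses for image locking (assumed here to
    be byte 100, per the lslocks output).  Returns False if another process
    holds a conflicting lock, mirroring qemu-img's
    'Failed to get "write" lock' failure."""
    fd = os.open(path, os.O_RDWR)
    try:
        try:
            # Non-blocking exclusive lock on a single byte.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 1, byte)
        except OSError:
            return False  # a conflicting lock is held by another process
        fcntl.lockf(fd, fcntl.LOCK_UN, 1, byte)
        return True
    finally:
        os.close(fd)
```

Note that POSIX record locks never conflict within the same process, so this probe is only meaningful against locks held by other processes (such as qemu-kvm).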
These are the QMP commands adding the destination images:

# grep QEMU_MONITOR_SEND_MSG libvirtd.log | grep blockdev-add | grep 66e23991-773c-41e1-bbc9-651cd50b0220
2020-02-14 11:45:48.804+0000: 18433: info : qemuMonitorSend:996 : QEMU_MONITOR_SEND_MSG: mon=0x7f6ac4026910 msg={"execute":"blockdev-add","arguments":{"driver":"file","filename":"/rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/66e23991-773c-41e1-bbc9-651cd50b0220","aio":"native","node-name":"libvirt-7-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"},"id":"libvirt-424"}

# grep QEMU_MONITOR_SEND_MSG libvirtd.log | grep blockdev-add | grep c1b7a23d-76f0-438e-87a6-4adbbfcafd13
2020-02-14 11:45:48.810+0000: 18433: info : qemuMonitorSend:996 : QEMU_MONITOR_SEND_MSG: mon=0x7f6ac4026910 msg={"execute":"blockdev-add","arguments":{"driver":"file","filename":"/rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/c1b7a23d-76f0-438e-87a6-4adbbfcafd13","aio":"native","node-name":"libvirt-6-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"},"id":"libvirt-426"}

Can we control locking in these commands?

There is another issue - it looks like there is no write lock on the top image in the destination chain:

/rhev/data-center/mnt/nfs1:_export_2/55255570-983a-4d82-907a-19b964abf7ed/images/11aae027-23a3-4c56-a1cb-4883c323e708/f4c29b2d-ba32-425d-8348-791cd7ffe3be
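On the question of controlling locking in these commands: qemu's file driver does expose a 'locking' option ("auto"/"on"/"off") in blockdev-add, although libvirt does not set it in the commands above. The sketch below rebuilds the first logged command with locking disabled, purely to illustrate the knob; disabling locking also gives up qemu's protection against concurrent writers, so this is not a recommendation, and whether libvirt should use this option here is exactly the open question.

```python
import json

# The first blockdev-add command from the log above, with the file driver's
# 'locking' option (part of qemu's BlockdevOptionsFile) added and set to
# "off".  Illustrative only.
cmd = {
    "execute": "blockdev-add",
    "arguments": {
        "driver": "file",
        "filename": "/rhev/data-center/mnt/nfs1:_export_2/"
                    "55255570-983a-4d82-907a-19b964abf7ed/images/"
                    "11aae027-23a3-4c56-a1cb-4883c323e708/"
                    "66e23991-773c-41e1-bbc9-651cd50b0220",
        "aio": "native",
        "node-name": "libvirt-7-storage",
        "cache": {"direct": True, "no-flush": False},
        "auto-read-only": True,
        "discard": "unmap",
        "locking": "off",   # added: skip qemu's image locking on this node
    },
    "id": "libvirt-424",
}
msg = json.dumps(cmd)
```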
How this bug affects RHV:

Moving disks to another storage domain while the VM is running is not possible. Users need to stop the VM to move the disk, which can take significant time (minutes to hours) with large disks. This is also a regression compared with the previous version (RHEL 7.7).
As this would effectively disable one of the important RHV features, moving this to the 8.2 release. Peter, if you know of any way LSM (Live Storage Migration) can be made to work, please share it with Nir.
This breaks because RHV abused a quirk in qemu's handling of the backing chain, which was previously opened only when attempting to finish the job. Libvirt's new handling expects the backing store to be present right away. I'll try to see whether we can move the opening of the backing chain to a later point to preserve the previous functionality.
When does oVirt establish the backing files of the copied image? Are they present when calling virDomainBlockCopy? Is the metadata in the images correct? We need to know this because we need to inspect the backing chain, and this effectively bakes those semantics into our API.
(In reply to Peter Krempa from comment #4)

> When does oVirt establish the backing files of the copied image?

Before calling virDomainBlockCopy.

> Are they present when calling virDomainBlockCopy?

Yes.

> Is the metadata correct in the images?

Yes. Basically we replicate the image chain on the destination storage, using the same image formats and sizes as the original chain. At this point all the images are empty.

If the VM is using a template disk, the template disk must already exist on the destination storage, so this image will be complete at this point. We don't copy templates as part of live storage migration.

The base volume in the chain may have a different format than the source. For example, the source may be a raw sparse file on file storage, and the destination may be a qcow2 thin disk on block storage.

Note that RHV may want to change the destination chain in the future, for example collapsing all the images into a single image, but we don't support this yet.

Then we call virDomainBlockCopy() with the top volume on the destination storage, and libvirt/qemu start to mirror the active layer from source to destination. At this point libvirt can query the image chain to get its structure.

Then we start to copy the disks from source to destination using qemu-img convert. We copy the volumes starting with the base and ending with the last volume before the active layer. We never access the top image on the destination, so qemu can take a write lock on it at the point virDomainBlockCopy() is called.

When the copy is done we check the state of the block copy job, and when it is ready we pivot to the new chain. So at the point we abort and pivot, we no longer access the disks.
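The copy-then-pivot sequence described above can be sketched as follows. This is an illustration, not vdsm's code: `copy_chain_and_pivot` and its arguments are hypothetical, the qemu-img invocation mirrors the one used in this report, and the abort flag value mirrors libvirt's virDomainBlockJobAbortFlags enum. Vdsm's check that the job is ready before pivoting is omitted here.

```python
import subprocess

# Value mirrors libvirt's virDomainBlockJobAbortFlags enum.
VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT = 1 << 1


def convert_cmd(src, dst, dst_fmt):
    # The same qemu-img convert invocation vdsm uses in this report.
    return ["qemu-img", "convert", "-p", "-t", "none", "-T", "none",
            "-f", "qcow2", src, "-O", dst_fmt, dst]


def copy_chain_and_pivot(dom, drive_name, chain, run=subprocess.check_call):
    """chain: (src_path, dst_path, dst_format) tuples for every volume below
    the active layer, ordered base first.  The active layer itself is
    mirrored by the block copy job, so it is never converted here."""
    for src, dst, fmt in chain:
        run(convert_cmd(src, dst, fmt))
    # Once the block job reports ready, switch the VM to the new chain.
    dom.blockJobAbort(drive_name, flags=VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)
```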
The main problem, and the reason why the behaviour changed in the first place, is that to integrate with -blockdev in qemu we started asking qemu to open the file. Currently it's required to open it right away when starting the block job, as the backing chain must be installed as the backing images of the target of the copy.

We can arguably move that code right before the 'pivot/block-job-complete' step, but that requires us to install the backing files later.

I discussed this issue briefly with Kevin Wolf. He suggested using qemu's snapshot API, which is stable. Unfortunately that didn't really work out. I've also tried using the not-yet-supported-by-upstream 'x-blockdev-reopen', which worked well in my crude fix. Unfortunately that wouldn't be an upstream-ready solution for now, even if I clean up the patches.

I've thus filed bug 1805143 against qemu which tries to summarize this problem in qemu's language.
A demonstration of the above-mentioned fix was posted upstream for review:

https://www.redhat.com/archives/libvir-list/2020-February/msg01125.html
Fixed upstream by:

cc7868a8b3 qemu: blockcopy: Allow late opening of the backing chain of a shallow copy
d6498be165 qemu: capabilities: Introduce QEMU_CAPS_BLOCKDEV_SNAPSHOT_ALLOW_WRITE_ONLY
c2b6bc7377 qemuDomainBlockCopyCommon: Record updated flags to block job
a89ba6524c qemuDomainBlockPivot: Move check prior to executing the pivot steps
7c7fda15f3 qemu: capabilities: Update qemu-5.0.0 capabilities for x86_64
In the previous test I made a mistake - I used the wrong VM when doing blockcopy. Changed to the following test steps and the result is PASS.

Version:
libvirt-6.0.0-14.module+el8.2.0+6069+78a1cb09.x86_64

1. Make 3 snapshots for the vm disk vda

[root@dell-per740xd-11 inc_bkup]# virsh domblklist checker
 Target   Source
------------------------------------------------------
 vda      /var/lib/libvirt/images/vda.full.backup.s3

[root@dell-per740xd-11 inc_bkup]# qemu-img info -U --backing-chain /var/lib/libvirt/images/vda.full.backup.s3 | egrep "^image|^backing file:"
image: /var/lib/libvirt/images/vda.full.backup.s3
backing file: /var/lib/libvirt/images/vda.full.backup.s2
image: /var/lib/libvirt/images/vda.full.backup.s2
backing file: /var/lib/libvirt/images/vda.full.backup.s1
image: /var/lib/libvirt/images/vda.full.backup.s1
backing file: /var/lib/libvirt/images/vda.full.backup.copy
image: /var/lib/libvirt/images/vda.full.backup.copy

[root@dell-per740xd-11 inc_bkup]# virsh snapshot-list checker
 Name   Creation Time               State
---------------------------------------------------
 s1     2020-03-27 03:18:17 -0400   disk-snapshot
 s2     2020-03-27 03:18:56 -0400   disk-snapshot
 s3     2020-03-27 03:19:15 -0400   disk-snapshot

2. Prepare a target image for blockcopy

[root@dell-per740xd-11 inc_bkup]# qemu-img create -f qcow2 /tmp/vda.copy 10G
Formatting '/tmp/vda.copy', fmt=qcow2 size=10737418240 cluster_size=65536 lazy_refcounts=off refcount_bits=16

3. Start the shallow blockcopy

[root@dell-per740xd-11 inc_bkup]# virsh blockcopy checker vda /tmp/vda.copy --shallow --reuse-external --transient-job
Block Copy started

[root@dell-per740xd-11 inc_bkup]# virsh blockjob checker vda
Block Copy: [100 %]

4. Do qemu-img convert for the images in the backing chain

[root@dell-per740xd-11 inc_bkup]# qemu-img convert -p -t none -T none -f qcow2 /var/lib/libvirt/images/vda.full.backup.copy -O qcow2 /tmp/vda.base.copy
    (100.00/100%)
[root@dell-per740xd-11 inc_bkup]# qemu-img convert -p -t none -T none -f qcow2 /var/lib/libvirt/images/vda.full.backup.s1 -O raw /tmp/s1
    (100.00/100%)
[root@dell-per740xd-11 inc_bkup]# qemu-img convert -p -t none -T none -f qcow2 /var/lib/libvirt/images/vda.full.backup.s2 -O qcow2 /tmp/s2
    (100.00/100%)

5. The mirror and the active layer image cannot be converted, but I think this is expected.

[root@dell-per740xd-11 inc_bkup]# qemu-img convert -p -t none -T none -f qcow2 /tmp/vda.copy -O qcow2 /tmp/copy.copy
qemu-img: Could not open '/tmp/vda.copy': Failed to get "consistent read" lock
Is another process using the image [/tmp/vda.copy]?

[root@dell-per740xd-11 inc_bkup]# qemu-img convert -p -t none -T none -f qcow2 /var/lib/libvirt/images/vda.full.backup.s3 -O qcow2 /tmp/s3
qemu-img: Could not open '/var/lib/libvirt/images/vda.full.backup.s3': Failed to get shared "write" lock
Is another process using the image [/var/lib/libvirt/images/vda.full.backup.s3]?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2017