Bug 1818655 - Failed to do block commit in rhev4.4 after VM migration
Summary: Failed to do block commit in rhev4.4 after VM migration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: libvirt
Version: 8.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 8.0
Assignee: Peter Krempa
QA Contact: chhu
URL:
Whiteboard: libvirt_RHV_INT
Depends On: 1820016
Blocks:
 
Reported: 2020-03-30 02:23 UTC by chhu
Modified: 2021-01-14 09:06 UTC
CC List: 12 users

Fixed In Version: libvirt-6.0.0-16.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-05 09:59:02 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
xml-startvm (1.32 KB, text/plain), 2020-03-30 02:27 UTC, chhu
xml-before-migrate (2.41 KB, text/plain), 2020-03-30 02:28 UTC, chhu
xml-after-migrate (2.41 KB, text/plain), 2020-03-30 02:29 UTC, chhu
xml-deleted-s3 (2.04 KB, text/plain), 2020-03-30 02:30 UTC, chhu
backing-chain and libvirtd, vdsm logs (1.10 MB, application/gzip), 2020-03-30 02:35 UTC, chhu


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:2017 0 None None None 2020-05-05 09:59:45 UTC

Description chhu 2020-03-30 02:23:22 UTC
Description of problem:
In RHV 4.4, deleting a snapshot after VM migration fails with the error:
"libvirt.libvirtError: Requested operation is not valid: can't keep relative backing relationship"
(In RHV, deleting a snapshot of a running VM is performed as a live merge, i.e. vdsm calls virDomainBlockCommit(); that is where this error is raised, see step 6 below.)

Version-Release number of selected component (if applicable):
libvirt-daemon-kvm-6.0.0-14.module+el8.2.0+6069+78a1cb09.x86_64
qemu-kvm-core-4.2.0-15.module+el8.2.0+6029+618ef2ec.x86_64
kernel: 4.18.0-190.el8.x86_64(source) 4.18.0-187.el8.x86_64(target)

How reproducible:
100%

Steps to Reproduce:
1. Start VM on host A with glusterfs disk,
   the xml is in file: xml-startvm

2. Create snapshots for the VM: s1 (without memory), s2, s3
   the xml is in file: xml-before-migrate
--------------------------------------------------------------------
    <disk type='file' device='disk' snapshot='no'>
      <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='threads'/>
      <source file='/rhev/data-center/mnt/glusterSD/*.243:_meili-gv0/68a803e5-fdb5-4c57-a461-d233b205b94a/images/b62c20eb-370c-4a6a-b7a4-d84ef60b1bb9/4a3c3fda-6ec6-4e04-9eec-0435d90c49f1' index='5'>
        <seclabel model='dac' relabel='no'/>
      </source>
      <backingStore type='file' index='4'>
        <format type='qcow2'/>
        <source file='/rhev/data-center/mnt/glusterSD/*.243:_meili-gv0/68a803e5-fdb5-4c57-a461-d233b205b94a/images/b62c20eb-370c-4a6a-b7a4-d84ef60b1bb9/f007375f-4082-4649-8512-f161498bc1f2'>
          <seclabel model='dac' relabel='no'/>
        </source>
        <backingStore type='file' index='3'>
          <format type='qcow2'/>
          <source file='/rhev/data-center/mnt/glusterSD/*.243:_meili-gv0/68a803e5-fdb5-4c57-a461-d233b205b94a/images/b62c20eb-370c-4a6a-b7a4-d84ef60b1bb9/e6c30bfc-c9d5-450b-9b62-171d2235e95e'>
            <seclabel model='dac' relabel='no'/>
          </source>
          <backingStore type='file' index='1'>
            <format type='raw'/>
            <source file='/rhev/data-center/mnt/glusterSD/*.243:_meili-gv0/68a803e5-fdb5-4c57-a461-d233b205b94a/images/b62c20eb-370c-4a6a-b7a4-d84ef60b1bb9/ab737847-8486-4265-b6f9-fc44f42c1cf5'>
              <seclabel model='dac' relabel='no'/>
            </source>
            <backingStore/>
          </backingStore>
        </backingStore>
      </backingStore>
------------------------------------------------------------------------
   backing chain is in file: backing-chain-before-migrate

3. Migrate the VM to host B successfully; note that the disk indexes in the xml have changed.
   the xml is in file: xml-after-migrate
--------------------------------------------------------------------
    <disk type='file' device='disk' snapshot='no'>
      <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='threads'/>
      <source file='/rhev/data-center/mnt/glusterSD/*.243:_meili-gv0/68a803e5-fdb5-4c57-a461-d233b205b94a/images/b62c20eb-370c-4a6a-b7a4-d84ef60b1bb9/4a3c3fda-6ec6-4e04-9eec-0435d90c49f1' index='1'>
        <seclabel model='dac' relabel='no'/>
      </source>
      <backingStore type='file' index='2'>
        <format type='qcow2'/>
        <source file='/rhev/data-center/mnt/glusterSD/*.243:_meili-gv0/68a803e5-fdb5-4c57-a461-d233b205b94a/images/b62c20eb-370c-4a6a-b7a4-d84ef60b1bb9/f007375f-4082-4649-8512-f161498bc1f2'>
          <seclabel model='dac' relabel='no'/>
        </source>
        <backingStore type='file' index='3'>
          <format type='qcow2'/>
          <source file='/rhev/data-center/mnt/glusterSD/*.243:_meili-gv0/68a803e5-fdb5-4c57-a461-d233b205b94a/images/b62c20eb-370c-4a6a-b7a4-d84ef60b1bb9/e6c30bfc-c9d5-450b-9b62-171d2235e95e'>
            <seclabel model='dac' relabel='no'/>
          </source>
          <backingStore type='file' index='4'>
            <format type='raw'/>
            <source file='/rhev/data-center/mnt/glusterSD/*.243:_meili-gv0/68a803e5-fdb5-4c57-a461-d233b205b94a/images/b62c20eb-370c-4a6a-b7a4-d84ef60b1bb9/ab737847-8486-4265-b6f9-fc44f42c1cf5'>
              <seclabel model='dac' relabel='no'/>
            </source>
            <backingStore/>
          </backingStore>
        </backingStore>
      </backingStore>
------------------------------------------------------------
   backing chain is in file: backing-chain-after-migrate 

5. Delete snapshot s3 successfully
   the xml after deleting s3 is in file: xml-deleted-s3
   backing chain is in file: backing-chain-deleted-s3

6. Try to delete s1; it fails with the following error in vdsm.log:
--------------------------------------------------------------------
 ERROR (jsonrpc/5) [virt.vm] (vmId='4dcf9d4e-b65b-4e1a-8852-d44cd229911d') Live merge failed (job: 618668b9-8213-447d-a198-f314e5ebc38a) (vm:5344)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5342, in merge
    bandwidth, flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 823, in blockCommit
    if ret == -1: raise libvirtError ('virDomainBlockCommit() failed', dom=self)
libvirt.libvirtError: Requested operation is not valid: can't keep relative backing relationship
-------------------------------------------------------------------


Actual results:
In step 6, deleting snapshot s1 fails.

Expected results:
In step 6, snapshot s1 is deleted successfully.

Additional info:
- libvirtd and vdsm log

Comment 1 chhu 2020-03-30 02:27:05 UTC
Created attachment 1674595 [details]
xml-startvm

Comment 2 chhu 2020-03-30 02:28:13 UTC
Created attachment 1674596 [details]
xml-before-migrate

Comment 3 chhu 2020-03-30 02:29:36 UTC
Created attachment 1674597 [details]
xml-after-migrate

Comment 4 chhu 2020-03-30 02:30:25 UTC
Created attachment 1674598 [details]
xml-deleted-s3

Comment 5 chhu 2020-03-30 02:35:35 UTC
Created attachment 1674599 [details]
backing-chain and libvirtd, vdsm logs

Comment 6 yisun 2020-03-30 03:26:55 UTC
The snapshot index change is caused by the fix for Bug 1451398 - [RFE] Add index for the active layer in disk chain.
I checked the logs, and vdsm used the correct index numbers:
in both the vdsm log and the libvirtd log, 'top' and 'base' are set to index=3 and index=4, which is correct per the "xml-after-migrate" attachment in comment 0.
Vdsm log
2020-03-29 21:56:53,282-0400 INFO  (jsonrpc/5) [virt.vm] (vmId='4dcf9d4e-b65b-4e1a-8852-d44cd229911d') Starting merge with jobUUID='618668b9-8213-447d-a198-f314e5ebc38a', original chain=ab737847-8486-4265-b6f9-fc44f42c1cf5 < e6c30bfc-c9d5-450b-9b62-171d2235e95e < f007375f-4082-4649-8512-f161498bc1f2 (top), disk='sda', base='sda[4]', top='sda[3]', bandwidth=0, flags=8 (vm:5338)
2020-03-29 21:56:53,283-0400 ERROR (jsonrpc/5) [virt.vm] (vmId='4dcf9d4e-b65b-4e1a-8852-d44cd229911d') Live merge failed (job: 618668b9-8213-447d-a198-f314e5ebc38a) (vm:5344)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5342, in merge
    bandwidth, flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 823, in blockCommit
    if ret == -1: raise libvirtError ('virDomainBlockCommit() failed', dom=self)
libvirt.libvirtError: Requested operation is not valid: can't keep relative backing relationship

Libvirtd log
2020-03-30 01:56:53.283+0000: 881438: debug : virThreadJobSet:94 : Thread 881438 (virNetServerHandleJob) is now running job remoteDispatchDomainBlockCommit
2020-03-30 01:56:53.283+0000: 881438: debug : virDomainBlockCommit:10517 : dom=0x7ffac4007420, (VM: name=lmn4, uuid=4dcf9d4e-b65b-4e1a-8852-d44cd229911d), disk=sda, base=sda[4], top=sda[3], bandwidth=0, flags=0x8
2020-03-30 01:56:53.283+0000: 881438: debug : qemuDomainObjBeginJobInternal:9754 : Starting job: job=modify agentJob=none asyncJob=none (vm=0x7ffac80308e0 name=lmn4, current job=none agentJob=none async=none)
2020-03-30 01:56:53.283+0000: 881438: debug : qemuDomainObjBeginJobInternal:9803 : Started job: modify (async=none vm=0x7ffac80308e0 name=lmn4)
2020-03-30 01:56:53.283+0000: 881438: debug : qemuDomainBlockCommit:18876 : Requested operation is not valid: can't keep relative backing relationship
 
So this does not seem to be related to the index change.

The reporter also helped to confirm that the issue only happens after migration; if the snapshots are created and deleted on the source host, nothing goes wrong.

My guess is that something goes wrong after migration on the target host, similar to https://bugzilla.redhat.com/show_bug.cgi?id=1461303 "libvirt does not load the data necessary to keep the relative relationship"?
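
For reference, flags=8 (0x8) in the vdsm and libvirtd excerpts above is VIR_DOMAIN_BLOCK_COMMIT_RELATIVE, so the failing call is roughly what virsh issues with --keep-relative. A sketch of the equivalent command (illustrative only; the VM name 'lmn4' and the sda[3]/sda[4] targets are taken from the logs above):

# rough virsh equivalent of the failing vdsm blockCommit call (sketch)
virsh blockcommit lmn4 sda --top sda[3] --base sda[4] --keep-relative --wait --verbose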

Comment 7 Peter Krempa 2020-03-30 06:57:10 UTC
So the problem is that after migration we no longer load the relative paths from the images, because the images are now specified in the XML.
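
A minimal sketch of what that means in practice (hypothetical /gmount paths, output abbreviated): the relative name only lives in the qcow2 header, while the migrated domain XML carries absolute <source> paths, so there is nothing relative for libvirt to preserve unless it re-reads the image metadata.

# the qcow2 header records the backing file by the (relative) name it was created with
qemu-img info /gmount/d
#   ...
#   backing file: c (actual path: /gmount/c)
# the domain XML after migration, by contrast, only carries absolute paths:
#   <source file='/gmount/d' index='1'/>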

Comment 9 Peter Krempa 2020-03-30 14:48:10 UTC
Fixed upstream by:

commit 2ace7a87a8aced68c2504fd4dd4e2df4302c3eeb
Author: Peter Krempa <pkrempa>
Date:   Mon Mar 30 11:18:37 2020 +0200

    qemuDomainSnapshotDiskPrepareOne: Don't load the relative path with blockdev
    
    Since we are refreshing the relative paths when doing the blockjobs we
    no longer need to load them upfront when doing the snapshot.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

commit ffc6249c79dbf980d116af7c7ed20222538a7c1c
Author: Peter Krempa <pkrempa>
Date:   Mon Mar 30 11:18:32 2020 +0200

    qemu: block: Support VIR_DOMAIN_BLOCK_COMMIT/PULL/REBASE_RELATIVE with blockdev
    
    Preservation of the relative relationship requires us to load the
    backing store strings from the disk images. With blockdev we stopped
    detecting the backing chain if it's specified in the XML so the relative
    links were not loaded at that point. To preserve the functionality from
    the pre-blockdev without accessing the backing chain unnecessarily
    during VM startup we must refresh the relative links when relative
    block commit or block pull is requested.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1818655
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>
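
For QE reference, the relative-flag code paths touched by these commits correspond to virsh's --keep-relative option on block commit and block pull. A sketch with placeholder domain and image names (not taken from this bug):

virsh blockcommit demo-vm vda --top /images/s2.qcow2 --base /images/s1.qcow2 --keep-relative --wait --verbose
virsh blockpull demo-vm vda --base /images/s1.qcow2 --keep-relative --wait --verbose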

Comment 14 chhu 2020-04-08 07:11:48 UTC
Tried to verify on libvirt-6.0.0-16.el8 but hit Bug 1820016; verification is blocked by that bug.

Comment 15 chhu 2020-04-09 12:22:41 UTC
Test on packages:
libvirt-daemon-kvm-6.0.0-17.module+el8.2.0+6257+0d066c28.x86_64
qemu-kvm-4.2.0-17.module+el8.2.0+6129+b14d477b.x86_64
kernel: 4.18.0-193.el8.x86_64
vdsm-4.40.5-1.el8ev.x86_64

Test steps:
1. Start the vm on host A, create s1 (without memory), s2, s3, migrate the vm to host B, then delete s3, s1, s2 successfully.
2. For the running vm, create s1 (without memory), s2, s3 (without memory), s4, s5, then delete s3, s1, s5, s4, s2 successfully;
   create s1, s2 (without memory), s3, delete s1 successfully, migrate the vm from host B to host A, log in to the vm, touch a file,
   clone s3, migrate to host A, delete s2, create s4, then delete s4 and s3 successfully.

Set the bug status to VERIFIED

Comment 17 errata-xmlrpc 2020-05-05 09:59:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2017

Comment 18 yisun 2020-05-06 12:31:52 UTC
To make sure this can be covered with pure libvirt, I reproduced it in a pure-libvirt environment as follows:
0.
[root@lenovo-sr630-10 files]# rpm -qa | grep libvirt-6
libvirt-6.0.0-14.module+el8.2.0+6069+78a1cb09.x86_64

1. Prepare a gluster server
# more /etc/glusterfs/glusterd.vol
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket,rdma
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
    option transport.socket.read-fail-log off
    option rpc-auth-allow-insecure on
end-volume

# service glusterd restart
Stopping glusterd:                                         [  OK  ]
Starting glusterd:                                         [  OK  ]
# mkdir /br1
# chmod -R 777 /br1
# setenforce 0
# iptables -F

On gluster server A:
# gluster peer probe 10.66.82.249
peer probe: success.

# gluster peer status
Number of Peers: 1

Hostname: 10.66.82.249
Uuid: 40f4b505-0765-4a6b-906b-db68c078c1dd
State: Peer in Cluster (Connected)

# gluster volume create gluster-vol1 10.66.85.212:/br1 10.66.82.249:/br1 force
volume create: gluster-vol1: success: please start the volume to access data
(To create an RDMA connection instead, the volume can be created with: gluster volume create gluster-vol1 transport rdma 10.66.85.212:/br1 10.66.82.249:/br1 force)

# gluster volume set gluster-vol1 server.allow-insecure on
volume set: success

# gluster volume info
 
Volume Name: gluster-vol1
Type: Distribute
Volume ID: 2d4e6867-231a-48e7-821a-c4c253241044
Status: Created
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 10.66.85.212:/br1
Brick2: 10.66.82.249:/br1
Options Reconfigured:
server.allow-insecure: on

# gluster volume start gluster-vol1
volume start: gluster-vol1: success

# gluster volume status
Status of volume: gluster-vol1
Gluster process                        Port    Online    Pid
------------------------------------------------------------------------------
Brick 10.66.85.212:/br1                    49152    Y    22917
Brick 10.66.82.249:/br1                    49152    Y    7408
NFS Server on localhost                    2049    Y    22931
NFS Server on 10.66.82.249                2049    Y    7423

Set nfs.disable=on on gluster server A:

# gluster volume set gluster-vol1 nfs.disable on
# gluster volume info gluster-vol1 | grep nfs.disable
nfs.disable: on



2. Mount the gluster dir on the 2 test hosts:
# mount -t glusterfs 10.66.85.212:/gluster-vol1 /gmount/
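
Both hosts need the volume mounted at the same path so the migrated VM can open the same image files. A quick sanity check (sketch; exact output depends on the host):

# df -hT /gmount
(expected: a fuse.glusterfs filesystem backed by 10.66.85.212:/gluster-vol1 on both hosts)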

3. Prepare the image chain
root@yisun-test1 /gmount 08:17:56$ qemu-img create -f qcow2 a 10M
Formatting 'a', fmt=qcow2 size=10485760 cluster_size=65536 lazy_refcounts=off refcount_bits=16
root@yisun-test1 /gmount 08:18:04$ qemu-img create -f qcow2 -o backing_fmt=qcow2 -b a b
Formatting 'b', fmt=qcow2 size=10485760 backing_file=a backing_fmt=qcow2 cluster_size=65536 lazy_refcounts=off refcount_bits=16
root@yisun-test1 /gmount 08:18:14$ qemu-img create -f qcow2 -o backing_fmt=qcow2 -b b c
Formatting 'c', fmt=qcow2 size=10485760 backing_file=b backing_fmt=qcow2 cluster_size=65536 lazy_refcounts=off refcount_bits=16
root@yisun-test1 /gmount 08:18:19$ qemu-img create -f qcow2 -o backing_fmt=qcow2 -b c d
Formatting 'd', fmt=qcow2 size=10485760 backing_file=c backing_fmt=qcow2 cluster_size=65536 lazy_refcounts=off refcount_bits=16
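
Since b, c and d were created from inside /gmount with bare -b names, the backing-file references stored in the qcow2 headers are relative, which is exactly what --keep-relative is meant to preserve. This can be double-checked before starting the vm (sketch; expected output abbreviated):

qemu-img info --backing-chain /gmount/d | grep 'backing file:'
expected output (sketch): relative names resolved against the image directory, i.e.
backing file: c (actual path: /gmount/c)
backing file: b (actual path: /gmount/b)
backing file: a (actual path: /gmount/a)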


4. Use image 'a' as vm's disk
root@yisun-test1 /gmount 08:19:59$ virsh dumpxml ys | awk '/<disk/,/<\/disk/'
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/gmount/a'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
    </disk>

root@yisun-test1 /gmount 08:20:01$ virsh start ys
Domain ys started


5. Create 3 external snapshots for vm
root@yisun-test1 /gmount 08:20:11$ virsh snapshot-create-as --reuse-external --disk-only --no-metadata ys --diskspec vda,file=/gmount/b
Domain snapshot 1588767647 created
root@yisun-test1 /gmount 08:20:47$ virsh snapshot-create-as --reuse-external --disk-only --no-metadata ys --diskspec vda,file=/gmount/c
Domain snapshot 1588767671 created
root@yisun-test1 /gmount 08:21:11$ virsh snapshot-create-as --reuse-external --disk-only --no-metadata ys --diskspec vda,file=/gmount/d
Domain snapshot 1588767673 created

6. Now the vm's disk xml on the source host is as follows:
root@yisun-test1 /gmount 08:21:38$ virsh dumpxml ys | awk '/<disk/,/<\/disk/'
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/gmount/d' index='4'/>
      <backingStore type='file' index='3'>
        <format type='qcow2'/>
        <source file='/gmount/c'/>
        <backingStore type='file' index='2'>
          <format type='qcow2'/>
          <source file='/gmount/b'/>
          <backingStore type='file' index='1'>
            <format type='qcow2'/>
            <source file='/gmount/a'/>
            <backingStore/>
          </backingStore>
        </backingStore>
      </backingStore>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
    </disk>

7. Migrate the vm to target host
root@yisun-test1 /gmount 08:21:43$ virsh migrate ys qemu+ssh://lenovo-sr630-10.lab.eng.pek2.redhat.com/system --live --undefinesource --persistent
root.eng.pek2.redhat.com's password:


8. Now the disk xml on the target host is as follows:
[root@lenovo-sr630-10 files]# virsh dumpxml ys | awk '/<disk/,/<\/disk/'
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/gmount/d' index='1'/>
      <backingStore type='file' index='2'>
        <format type='qcow2'/>
        <source file='/gmount/c'/>
        <backingStore type='file' index='3'>
          <format type='qcow2'/>
          <source file='/gmount/b'/>
          <backingStore type='file' index='4'>
            <format type='qcow2'/>
            <source file='/gmount/a'/>
            <backingStore/>
          </backingStore>
        </backingStore>
      </backingStore>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
    </disk>

9. Do a blockcommit with --keep-relative; the error happens:
[root@lenovo-sr630-10 files]# virsh blockcommit ys vda --top vda[3] --base vda[4] --verbose --wait --keep-relative
error: Requested operation is not valid: can't keep relative backing relationship
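
For comparison, on a build containing the fix (libvirt-6.0.0-16.el8 or later) the same blockcommit is expected to complete without this error, and the overlay that gets rebased should keep a relative backing-file name. A sketch of the follow-up check (expected, not captured output):

qemu-img info /gmount/c | grep 'backing file:'
expected after committing b (vda[3]) into a (vda[4]):
backing file: a (actual path: /gmount/a)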

