Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1957776

Summary: Merge operation fails after live storage migration.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Roman Hodain <rhodain>
Component: vdsm
Assignee: Nir Soffer <nsoffer>
Status: CLOSED ERRATA
QA Contact: Pavol Brilla <pbrilla>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.4.5
CC: eshenitz, lleistne, lsurette, lsvaty, nsoffer, srevivo, ycui
Target Milestone: ovirt-4.4.7
Keywords: Rebase, TestOnly
Target Release: 4.4.7
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: vdsm-4.40.70.3
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-22 15:08:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1955667
Bug Blocks: 1985906

Description Roman Hodain 2021-05-06 13:05:45 UTC
Description of problem:
Merge operation fails after live storage migration, due to:

merge error=invalid argument: invalid path /rhev/data-center/mnt/blockSD/9142d041-995f-483e-8f47-5bab695a83df/images/30e3deb8-2b1a-4aa8-8512-4c64ce0fc3f8/d2b417cf-2a36-430d-808a-4b10a4cdb284 not assigned to domain

Version-Release number of selected component (if applicable):
4.4.5

How reproducible:
Once.

Steps to Reproduce:
Not clear. Most probably a race condition combined with high load on the storage, causing the volumes to go from mirroring to desynchronized and back.

Actual results:
The merge fails.

Expected results:
The merge succeeds.

Additional info:

So the engine executes the replica finish:
2021-04-30 09:46:53,273+02 START VmReplicateDiskFinishVDSCommand(                
  HostName = xha-195                                                             
  VmReplicateDiskParameters:{                                                    
    hostId='be900dfe-9761-43ab-9fb4-3bf5dfbf02a7'                                
    vmId='38058432-4200-486d-8738-c12d6127a790'                                  
    storagePoolId='224c9107-4439-4b45-812f-6c88453cb8aa'                         
    srcStorageDomainId='646767db-0cd4-44d6-9db4-deae4968c7cc'                    
    targetStorageDomainId='9142d041-995f-483e-8f47-5bab695a83df'                 
    imageGroupId='30e3deb8-2b1a-4aa8-8512-4c64ce0fc3f8'                          
    imageId='d2b417cf-2a36-430d-808a-4b10a4cdb284'})                             
  log id: 62380c5d                                                               

This operation finishes successfully:

2021-04-30 09:47:15,497+02 FINISH, VmReplicateDiskFinishVDSCommand, return: , log id: 62380c5d

But vdsm actually gets an error when trying to deactivate the image on the source domain. This error is ignored.

2021-04-30 09:47:15,484+0200 INFO  (jsonrpc/0) [vdsm.api] FINISH teardownImage error=Cannot deactivate Logical Volume: 'error=General Storage Exception: ("5 [] [\'  Logical volume 646767db-0cd4-44d6-9db4-deae496
8c7cc/727f9b97-2cf0-42ea-b97e-a9f872d817ee in use.\', \'  Logical volume 646767db-0cd4-44d6-9db4-deae4968c7cc/d2b417cf-2a36-430d-808a-4b10a4cdb284 in use.\']\\n646767db-0cd4-44d6-9db4-deae4968c7cc/[\'727f9b97-2c
f0-42ea-b97e-a9f872d817ee\', \'d2b417cf-2a36-430d-808a-4b10a4cdb284\']",), users={\'/dev/646767db-0cd4-44d6-9db4-deae4968c7cc/727f9b97-2cf0-42ea-b97e-a9f872d817ee\': [], \'/dev/646767db-0cd4-44d6-9db4-deae4968c7
cc/d2b417cf-2a36-430d-808a-4b10a4cdb284\': []}' from=::ffff:172.30.233.100,35736, flow_id=55ddb9aa-9d58-4133-9407-a5df6424bd9d, task_id=e8208eb6-4031-4e71-95a4-f61b1c94aaa5 (api:52)
2021-04-30 09:47:15,485+0200 ERROR (jsonrpc/0) [storage.TaskManager.Task] (Task='e8208eb6-4031-4e71-95a4-f61b1c94aaa5') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1208, in _setLVAvailability
    changelv(vg, lvs, ("--available", available))
  File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1201, in changelv
    raise se.StorageException("%d %s %s\n%s/%s" % (rc, out, err, vg, lvs))
vdsm.storage.exception.StorageException: General Storage Exception: ("5 [] ['  Logical volume 646767db-0cd4-44d6-9db4-deae4968c7cc/727f9b97-2cf0-42ea-b97e-a9f872d817ee in use.', '  Logical volume 646767db-0cd4-4
4d6-9db4-deae4968c7cc/d2b417cf-2a36-430d-808a-4b10a4cdb284 in use.']\n646767db-0cd4-44d6-9db4-deae4968c7cc/['727f9b97-2cf0-42ea-b97e-a9f872d817ee', 'd2b417cf-2a36-430d-808a-4b10a4cdb284']",)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-171>", line 2, in teardownImage
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 3338, in teardownImage
    dom.deactivateImage(imgUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/blockSD.py", line 1328, in deactivateImage
    lvm.deactivateLVs(self.sdUUID, volUUIDs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1747, in deactivateLVs
    _setLVAvailability(vgName, toDeactivate, "n")
  File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1214, in _setLVAvailability
    raise se.CannotDeactivateLogicalVolume(str(e), users)
vdsm.storage.exception.CannotDeactivateLogicalVolume: Cannot deactivate Logical Volume: 'error=General Storage Exception: ("5 [] [\'  Logical volume 646767db-0cd4-44d6-9db4-deae4968c7cc/727f9b97-2cf0-42ea-b97e-a
9f872d817ee in use.\', \'  Logical volume 646767db-0cd4-44d6-9db4-deae4968c7cc/d2b417cf-2a36-430d-808a-4b10a4cdb284 in use.\']\\n646767db-0cd4-44d6-9db4-deae4968c7cc/[\'727f9b97-2cf0-42ea-b97e-a9f872d817ee\', \'
d2b417cf-2a36-430d-808a-4b10a4cdb284\']",), users={\'/dev/646767db-0cd4-44d6-9db4-deae4968c7cc/727f9b97-2cf0-42ea-b97e-a9f872d817ee\': [], \'/dev/646767db-0cd4-44d6-9db4-deae4968c7cc/d2b417cf-2a36-430d-808a-4b10
a4cdb284\': []}'
2021-04-30 09:47:15,485+0200 INFO  (jsonrpc/0) [storage.TaskManager.Task] (Task='e8208eb6-4031-4e71-95a4-f61b1c94aaa5') aborting: Task is aborted: 'value=Cannot deactivate Logical Volume: \'error=General Storage
 Exception: ("5 [] [\\\'  Logical volume 646767db-0cd4-44d6-9db4-deae4968c7cc/727f9b97-2cf0-42ea-b97e-a9f872d817ee in use.\\\', \\\'  Logical volume 646767db-0cd4-44d6-9db4-deae4968c7cc/d2b417cf-2a36-430d-808a-4
b10a4cdb284 in use.\\\']\\\\n646767db-0cd4-44d6-9db4-deae4968c7cc/[\\\'727f9b97-2cf0-42ea-b97e-a9f872d817ee\\\', \\\'d2b417cf-2a36-430d-808a-4b10a4cdb284\\\']",), users={\\\'/dev/646767db-0cd4-44d6-9db4-deae4968
c7cc/727f9b97-2cf0-42ea-b97e-a9f872d817ee\\\': [], \\\'/dev/646767db-0cd4-44d6-9db4-deae4968c7cc/d2b417cf-2a36-430d-808a-4b10a4cdb284\\\': []}\' abortedcode=552' (task:1190)
2021-04-30 09:47:15,486+0200 ERROR (jsonrpc/0) [storage.Dispatcher] FINISH teardownImage error=Cannot deactivate Logical Volume: 'error=General Storage Exception: ("5 [] [\'  Logical volume 646767db-0cd4-44d6-9d
b4-deae4968c7cc/727f9b97-2cf0-42ea-b97e-a9f872d817ee in use.\', \'  Logical volume 646767db-0cd4-44d6-9db4-deae4968c7cc/d2b417cf-2a36-430d-808a-4b10a4cdb284 in use.\']\\n646767db-0cd4-44d6-9db4-deae4968c7cc/[\'7
27f9b97-2cf0-42ea-b97e-a9f872d817ee\', \'d2b417cf-2a36-430d-808a-4b10a4cdb284\']",), users={\'/dev/646767db-0cd4-44d6-9db4-deae4968c7cc/727f9b97-2cf0-42ea-b97e-a9f872d817ee\': [], \'/dev/646767db-0cd4-44d6-9db4-deae4968c7cc/d2b417cf-2a36-430d-808a-4b10a4cdb284\': []}' (dispatcher:83)
...
2021-04-30 09:47:15,496+0200 INFO  (jsonrpc/0) [api.virt] FINISH diskReplicateFinish return={'status': {'code': 0, 'message': 'Done'}} from=::ffff:172.30.233.100,35736, flow_id=55ddb9aa-9d58-4133-9407-a5df6424bd9d, vmId=38058432-4200-486d-8738-c12d6127a790 (api:54)
2021-04-30 09:47:15,496+0200 INFO  (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call VM.diskReplicateFinish took more than 1.00 seconds to succeed: 22.22 (__init__:316)
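
The "in use" failure above can be confirmed from the host by looking at the LV attribute bits. A minimal standalone sketch, not vdsm code, assuming the lvs binary is available and using the source-domain VG name from the log:

    # Hypothetical helper: list LVs of a VG that are still open
    # (6th lv_attr character is 'o'), i.e. "lvchange -an" will fail on them.
    import subprocess

    def open_lvs(vg_name):
        out = subprocess.check_output(
            ["lvs", "--noheadings", "-o", "lv_name,lv_attr", vg_name],
            universal_newlines=True)
        opened = []
        for line in out.splitlines():
            fields = line.split()
            if len(fields) == 2 and fields[1][5] == "o":
                opened.append(fields[0])
        return opened

    # With the source storage domain from this bug, the two volumes
    # still held open by qemu are reported:
    print(open_lvs("646767db-0cd4-44d6-9db4-deae4968c7cc"))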

The interesting part:

We get a libvirt event that the sync is ready even before we call the replica finish:

2021-04-30 09:45:31,605+0200 INFO  (libvirt/events) [virt.vm] (vmId='38058432-4200-486d-8738-c12d6127a790') Block job (untracked) type COPY for drive vda is ready (vm:5636)

We disable drive monitoring and do the pivot, but just before that we get the ready event again:

2021-04-30 09:46:54,071+0200 INFO  (jsonrpc/0) [virt.vm] (vmId='38058432-4200-486d-8738-c12d6127a790') Disabling drive monitoring
2021-04-30 09:46:54,085+0200 INFO  (libvirt/events) [virt.vm] (vmId='38058432-4200-486d-8738-c12d6127a790') Block job (untracked) type COPY for drive vda is ready (vm:5636)
2021-04-30 09:46:54,102+0200 INFO  (libvirt/events) [virt.vm] (vmId='38058432-4200-486d-8738-c12d6127a790') Block job (untracked) type COPY for drive vda has completed (vm:5627)
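
For reference, these READY/COMPLETED notifications are delivered through libvirt's block job event API. A minimal standalone sketch of how such events are received (illustrative only, not the vdsm code path; the connection URI and callback are assumptions):

    import libvirt

    def block_job_cb(conn, dom, disk, job_type, status, _opaque):
        # job_type is VIR_DOMAIN_BLOCK_JOB_TYPE_COPY for a blockCopy mirror.
        # status is VIR_DOMAIN_BLOCK_JOB_READY when the mirror is in sync,
        # and VIR_DOMAIN_BLOCK_JOB_COMPLETED after an abort/pivot finishes.
        print(dom.name(), disk, job_type, status)

    libvirt.virEventRegisterDefaultImpl()
    conn = libvirt.open("qemu:///system")
    conn.domainEventRegisterAny(
        None, libvirt.VIR_DOMAIN_EVENT_ID_BLOCK_JOB_2, block_job_cb, None)

    while True:
        libvirt.virEventRunDefaultImpl()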

So I assume that the volumes got desynchronized due to IO load. 

We then try to deactivate the volumes located on the source storage domain, which fails because the volumes are still open. The subsequent merge fails with:

2021-04-30 09:47:21,789+0200 INFO  (jsonrpc/5) [virt.livemerge] Job 483af0b4-195d-4ff3-947c-02bd251a9dd3 switching state from INIT to EXTEND (livemerge:166)
2021-04-30 09:47:21,798+0200 ERROR (jsonrpc/5) [api] FINISH merge error=invalid argument: invalid path /rhev/data-center/mnt/blockSD/9142d041-995f-483e-8f47-5bab695a83df/images/30e3deb8-2b1a-4aa8-8512-4c64ce0fc3f8/d2b417cf-2a36-430d-808a-4b10a4cdb284 not assigned to domain (api:134)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 124, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/API.py", line 767, in merge
    drive, baseVolUUID, topVolUUID, bandwidth, jobUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5644, in merge
    driveSpec, baseVolUUID, topVolUUID, bandwidth, jobUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/livemerge.py", line 297, in merge
    self._start_extend(drive, job, base_info, top_info)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/livemerge.py", line 431, in _start_extend
    capacity, alloc, physical = self._vm.getExtendInfo(drive)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 1243, in getExtendInfo
    capacity, alloc, physical = self._dom.blockInfo(drive.path)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 839, in blockInfo
    if ret is None: raise libvirtError ('virDomainGetBlockInfo() failed', dom=self)
libvirt.libvirtError: invalid argument: invalid path /rhev/data-center/mnt/blockSD/9142d041-995f-483e-8f47-5bab695a83df/images/30e3deb8-2b1a-4aa8-8512-4c64ce0fc3f8/d2b417cf-2a36-430d-808a-4b10a4cdb284 not assigned to domain

The volume /rhev/data-center/mnt/blockSD/9142d041-995f-483e-8f47-5bab695a83df/images/30e3deb8-2b1a-4aa8-8512-4c64ce0fc3f8/d2b417cf-2a36-430d-808a-4b10a4cdb284 is the volume from the destination domain.

Since then we periodically get:
2021-04-30 09:48:22,181+0200 ERROR (periodic/3) [virt.vm] (vmId='38058432-4200-486d-8738-c12d6127a790') Unable to get watermarks for drive vda: invalid argument: invalid path /rhev/data-center/mnt/blockSD/9142d041-995f-483e-8f47-5bab695a83df/images/30e3deb8-2b1a-4aa8-8512-4c64ce0fc3f8/d2b417cf-2a36-430d-808a-4b10a4cdb284 not assigned to domain (vm:1312)
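
The watermark query that keeps failing boils down to virDomainGetBlockInfo() on the drive path. A minimal standalone sketch, with the VM UUID and path taken from the log above:

    import libvirt

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByUUIDString("38058432-4200-486d-8738-c12d6127a790")

    path = ("/rhev/data-center/mnt/blockSD/9142d041-995f-483e-8f47-5bab695a83df"
            "/images/30e3deb8-2b1a-4aa8-8512-4c64ce0fc3f8"
            "/d2b417cf-2a36-430d-808a-4b10a4cdb284")

    try:
        capacity, allocation, physical = dom.blockInfo(path)
    except libvirt.libvirtError as e:
        # Fails here because libvirt forgot the destination disk after the
        # failed pivot, so the path is "not assigned to domain".
        print("blockInfo failed:", e)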

So it seems that the pivot did not finish correctly, as the LV was still open, and the XML dump shows that the image is still on the original SD:

2021-04-30 09:48:38,914+0200 INFO  (jsonrpc/6) [api.host] FINISH dumpxmls return=
...
<disk type='block' device='disk' snapshot='no'>
      <driver name='qemu' type='qcow2' cache='none' error_policy='enospace' io='native' iothread='1'/>
      <source dev='/rhev/data-center/mnt/blockSD/646767db-0cd4-44d6-9db4-deae4968c7cc/images/30e3deb8-2b1a-4aa8-8512-4c64ce0fc3f8/d2b417cf-2a36-430d-808a-4b10a4cdb284' index='6'>
        <seclabel model='dac' relabel='no'/>
      </source>
      <backingStore type='block' index='4'>
        <format type='qcow2'/>
        <source dev='/rhev/data-center/mnt/blockSD/646767db-0cd4-44d6-9db4-deae4968c7cc/images/30e3deb8-2b1a-4aa8-8512-4c64ce0fc3f8/727f9b97-2cf0-42ea-b97e-a9f872d817ee'>
          <seclabel model='dac' relabel='no'/>
        </source>
        <backingStore/>
      </backingStore>
      <target dev='vda' bus='virtio'/>
      <serial>30e3deb8-2b1a-4aa8-8512-4c64ce0fc3f8</serial>
      <boot order='1'/>
      <alias name='ua-30e3deb8-2b1a-4aa8-8512-4c64ce0fc3f8'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</disk>
...
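
A quick way to confirm which storage domain the disk still points to is to read the <source dev=.../> element of the vda disk from the domain XML. A minimal sketch, assuming the full domain XML (e.g. from virsh dumpxml) is available as a string xml_desc:

    import xml.etree.ElementTree as ET

    def vda_source_dev(xml_desc):
        root = ET.fromstring(xml_desc)
        for disk in root.findall("./devices/disk"):
            target = disk.find("target")
            if target is not None and target.get("dev") == "vda":
                return disk.find("source").get("dev")
        return None

    # Here it still returns the path under the source SD
    # (646767db-0cd4-44d6-9db4-deae4968c7cc), so the pivot to the
    # destination SD never happened.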

I am not sure why libvirt reported 

    type COPY for drive vda has completed

I believe we were supposed to call blockJobAbort() with the libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT flag.
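
For reference, pivoting a finished block copy is done with virDomainBlockJobAbort() plus the pivot flag, while a plain abort cancels the mirror and keeps the guest on the source image. A minimal standalone sketch (illustrative only, not the vdsm code path):

    import libvirt

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByUUIDString("38058432-4200-486d-8738-c12d6127a790")

    # Switch the guest to the copy destination once the job reported READY:
    dom.blockJobAbort("vda", libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)

    # blockJobAbort("vda", 0) would instead cancel the mirror and keep the
    # guest on the original source volume.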

Comment 3 Eyal Shenitzky 2021-05-10 09:24:31 UTC
Nir, please have a look.
Is this caused by the recent live merge changes?

Comment 4 Eyal Shenitzky 2021-05-10 12:20:51 UTC
Keeping this bug as a tracker for the libvirt fix.
It should be tested once libvirt provides the fix.

Comment 5 Nir Soffer 2021-05-10 13:11:22 UTC
(In reply to Eyal Shenitzky from comment #3)
This looks like a duplicate of bug 1951507.

Libvirt is confused by unexpected state changes reported by qemu, and tries to abort the live
storage migration instead of completing it. It forgets the target disk, so after this failure
any operation (e.g. setting a block threshold, extending the disk, live merge) on this VM
will fail.

In this state we expect to see the error below every ~2 seconds:

2021-04-30 09:48:22,181+0200 ERROR (periodic/3) [virt.vm] (vmId='38058432-4200-486d-8738-c12d6127a790') Unable to get watermarks for drive vda: invalid argument: invalid path /rhev/data-center/mnt/blockSD/9142d041-995f-483e-8f47-5bab695a83df/images/30e3deb8-2b1a-4aa8-8512-4c64ce0fc3f8/d2b417cf-2a36-430d-808a-4b10a4cdb284 not assigned to domain (vm:1312)

Comment 8 Eyal Shenitzky 2021-06-10 10:20:47 UTC
VDSM already requires libvirt 7.0.0-14, which contains the fix for this bug.

Comment 9 Sandro Bonazzola 2021-06-11 11:32:57 UTC
$ git tag --contains 60ed11bd6dd8c2bea9968808baabd6a455010e0d
v4.40.70.3

Comment 14 Pavol Brilla 2021-07-07 10:45:42 UTC
# yum deplist vdsm | grep libvirt-daemon-kvm
  dependency: libvirt-daemon-kvm >= 7.0.0-14

Comment 19 errata-xmlrpc 2021-07-22 15:08:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) [ovirt-4.4.7]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2864