Bug 1957776
| Summary: | Merge operation fails after live storage migration. | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Roman Hodain <rhodain> |
| Component: | vdsm | Assignee: | Nir Soffer <nsoffer> |
| Status: | CLOSED ERRATA | QA Contact: | Pavol Brilla <pbrilla> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.4.5 | CC: | eshenitz, lleistne, lsurette, lsvaty, nsoffer, srevivo, ycui |
| Target Milestone: | ovirt-4.4.7 | Keywords: | Rebase, TestOnly |
| Target Release: | 4.4.7 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | vdsm-4.40.70.3 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-07-22 15:08:31 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1955667 | | |
| Bug Blocks: | 1985906 | | |
Nir, please have a look. Is this caused by the recent live merge changes?

Keeping this bug as a tracker for the libvirt fix. Should be tested when libvirt provides the fix.

(In reply to Eyal Shenitzky from comment #3)
This looks like a duplicate of bug 1951507. Libvirt is confused by unexpected state changes reported by qemu, and tries to abort the live storage migration instead of completing it. It forgets the target disk, so after this failure any operation (e.g. setting a block threshold, extending the disk, live merge) will fail for this VM. In this state we expect to see the error below every ~2 seconds:

2021-04-30 09:48:22,181+0200 ERROR (periodic/3) [virt.vm] (vmId='38058432-4200-486d-8738-c12d6127a790') Unable to get watermarks for drive vda: invalid argument: invalid path /rhev/data-center/mnt/blockSD/9142d041-995f-483e-8f47-5bab695a83df/images/30e3deb8-2b1a-4aa8-8512-4c64ce0fc3f8/d2b417cf-2a36-430d-808a-4b10a4cdb284 not assigned to domain (vm:1312)

VDSM already requires libvirt 7.0.0-14, which contains the fix for this bug.

$ git tag --contains 60ed11bd6dd8c2bea9968808baabd6a455010e0d
v4.40.70.3

# yum deplist vdsm | grep libvirt-daemon-kvm
  dependency: libvirt-daemon-kvm >= 7.0.0-14

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV RHEL Host (ovirt-host) [ovirt-4.4.7]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2864
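For reference, the "not assigned to domain" state described in the comments can be checked directly against libvirt. The following is a minimal sketch, assuming libvirt-python is available on the host; the VM UUID and drive path are taken from the logs quoted above. This is not vdsm code, just the same virDomainGetBlockInfo call that vdsm's periodic drive monitor performs.

import libvirt

VM_UUID = "38058432-4200-486d-8738-c12d6127a790"
DRIVE_PATH = (
    "/rhev/data-center/mnt/blockSD/9142d041-995f-483e-8f47-5bab695a83df"
    "/images/30e3deb8-2b1a-4aa8-8512-4c64ce0fc3f8"
    "/d2b417cf-2a36-430d-808a-4b10a4cdb284"
)

conn = libvirt.open("qemu:///system")
try:
    dom = conn.lookupByUUIDString(VM_UUID)
    try:
        # blockInfo() returns [capacity, allocation, physical]; on an affected
        # VM it raises libvirtError: "invalid path ... not assigned to domain".
        capacity, allocation, physical = dom.blockInfo(DRIVE_PATH)
        print("capacity=%d allocation=%d physical=%d"
              % (capacity, allocation, physical))
    except libvirt.libvirtError as e:
        print("blockInfo failed as described in this bug: %s" % e)
finally:
    conn.close()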
Description of problem:

Merge operation fails after live storage migration, due to merge error:

invalid argument: invalid path /rhev/data-center/mnt/blockSD/9142d041-995f-483e-8f47-5bab695a83df/images/30e3deb8-2b1a-4aa8-8512-4c64ce0fc3f8/d2b417cf-2a36-430d-808a-4b10a4cdb284 not assigned to domain

Version-Release number of selected component (if applicable):
4.4.5

How reproducible:
Once.

Steps to Reproduce:
Not clear. Most probably a race condition and high load on storage, causing the volumes to go from mirroring to desynchronized and back.

Actual results:
The merge fails.

Expected results:
The merge succeeds.

Additional info:

The engine executes the replica finish:

2021-04-30 09:46:53,273+02 START VmReplicateDiskFinishVDSCommand( HostName = xha-195 VmReplicateDiskParameters:{ hostId='be900dfe-9761-43ab-9fb4-3bf5dfbf02a7' vmId='38058432-4200-486d-8738-c12d6127a790' storagePoolId='224c9107-4439-4b45-812f-6c88453cb8aa' srcStorageDomainId='646767db-0cd4-44d6-9db4-deae4968c7cc' targetStorageDomainId='9142d041-995f-483e-8f47-5bab695a83df' imageGroupId='30e3deb8-2b1a-4aa8-8512-4c64ce0fc3f8' imageId='d2b417cf-2a36-430d-808a-4b10a4cdb284'}) log id: 62380c5d

This operation finishes with success:

2021-04-30 09:47:15,497+02 FINISH, VmReplicateDiskFinishVDSCommand, return: , log id: 62380c5d

But vdsm actually gets an error when trying to deactivate the image on the source domain. This error is ignored:

2021-04-30 09:47:15,484+0200 INFO (jsonrpc/0) [vdsm.api] FINISH teardownImage error=Cannot deactivate Logical Volume: 'error=General Storage Exception: ("5 [] [\' Logical volume 646767db-0cd4-44d6-9db4-deae4968c7cc/727f9b97-2cf0-42ea-b97e-a9f872d817ee in use.\', \' Logical volume 646767db-0cd4-44d6-9db4-deae4968c7cc/d2b417cf-2a36-430d-808a-4b10a4cdb284 in use.\']\\n646767db-0cd4-44d6-9db4-deae4968c7cc/[\'727f9b97-2cf0-42ea-b97e-a9f872d817ee\', \'d2b417cf-2a36-430d-808a-4b10a4cdb284\']",), users={\'/dev/646767db-0cd4-44d6-9db4-deae4968c7cc/727f9b97-2cf0-42ea-b97e-a9f872d817ee\': [], \'/dev/646767db-0cd4-44d6-9db4-deae4968c7cc/d2b417cf-2a36-430d-808a-4b10a4cdb284\': []}' from=::ffff:172.30.233.100,35736, flow_id=55ddb9aa-9d58-4133-9407-a5df6424bd9d, task_id=e8208eb6-4031-4e71-95a4-f61b1c94aaa5 (api:52)

2021-04-30 09:47:15,485+0200 ERROR (jsonrpc/0) [storage.TaskManager.Task] (Task='e8208eb6-4031-4e71-95a4-f61b1c94aaa5') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1208, in _setLVAvailability
    changelv(vg, lvs, ("--available", available))
  File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1201, in changelv
    raise se.StorageException("%d %s %s\n%s/%s" % (rc, out, err, vg, lvs))
vdsm.storage.exception.StorageException: General Storage Exception: ("5 [] [' Logical volume 646767db-0cd4-44d6-9db4-deae4968c7cc/727f9b97-2cf0-42ea-b97e-a9f872d817ee in use.', ' Logical volume 646767db-0cd4-44d6-9db4-deae4968c7cc/d2b417cf-2a36-430d-808a-4b10a4cdb284 in use.']\n646767db-0cd4-44d6-9db4-deae4968c7cc/['727f9b97-2cf0-42ea-b97e-a9f872d817ee', 'd2b417cf-2a36-430d-808a-4b10a4cdb284']",)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-171>", line 2, in teardownImage
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 3338, in teardownImage
    dom.deactivateImage(imgUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/blockSD.py", line 1328, in deactivateImage
    lvm.deactivateLVs(self.sdUUID, volUUIDs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1747, in deactivateLVs
    _setLVAvailability(vgName, toDeactivate, "n")
  File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1214, in _setLVAvailability
    raise se.CannotDeactivateLogicalVolume(str(e), users)
vdsm.storage.exception.CannotDeactivateLogicalVolume: Cannot deactivate Logical Volume: 'error=General Storage Exception: ("5 [] [\' Logical volume 646767db-0cd4-44d6-9db4-deae4968c7cc/727f9b97-2cf0-42ea-b97e-a9f872d817ee in use.\', \' Logical volume 646767db-0cd4-44d6-9db4-deae4968c7cc/d2b417cf-2a36-430d-808a-4b10a4cdb284 in use.\']\\n646767db-0cd4-44d6-9db4-deae4968c7cc/[\'727f9b97-2cf0-42ea-b97e-a9f872d817ee\', \'d2b417cf-2a36-430d-808a-4b10a4cdb284\']",), users={\'/dev/646767db-0cd4-44d6-9db4-deae4968c7cc/727f9b97-2cf0-42ea-b97e-a9f872d817ee\': [], \'/dev/646767db-0cd4-44d6-9db4-deae4968c7cc/d2b417cf-2a36-430d-808a-4b10a4cdb284\': []}'

2021-04-30 09:47:15,485+0200 INFO (jsonrpc/0) [storage.TaskManager.Task] (Task='e8208eb6-4031-4e71-95a4-f61b1c94aaa5') aborting: Task is aborted: 'value=Cannot deactivate Logical Volume: \'error=General Storage Exception: ("5 [] [\\\' Logical volume 646767db-0cd4-44d6-9db4-deae4968c7cc/727f9b97-2cf0-42ea-b97e-a9f872d817ee in use.\\\', \\\' Logical volume 646767db-0cd4-44d6-9db4-deae4968c7cc/d2b417cf-2a36-430d-808a-4b10a4cdb284 in use.\\\']\\\\n646767db-0cd4-44d6-9db4-deae4968c7cc/[\\\'727f9b97-2cf0-42ea-b97e-a9f872d817ee\\\', \\\'d2b417cf-2a36-430d-808a-4b10a4cdb284\\\']",), users={\\\'/dev/646767db-0cd4-44d6-9db4-deae4968c7cc/727f9b97-2cf0-42ea-b97e-a9f872d817ee\\\': [], \\\'/dev/646767db-0cd4-44d6-9db4-deae4968c7cc/d2b417cf-2a36-430d-808a-4b10a4cdb284\\\': []}\' abortedcode=552' (task:1190)

2021-04-30 09:47:15,486+0200 ERROR (jsonrpc/0) [storage.Dispatcher] FINISH teardownImage error=Cannot deactivate Logical Volume: 'error=General Storage Exception: ("5 [] [\' Logical volume 646767db-0cd4-44d6-9db4-deae4968c7cc/727f9b97-2cf0-42ea-b97e-a9f872d817ee in use.\', \' Logical volume 646767db-0cd4-44d6-9db4-deae4968c7cc/d2b417cf-2a36-430d-808a-4b10a4cdb284 in use.\']\\n646767db-0cd4-44d6-9db4-deae4968c7cc/[\'727f9b97-2cf0-42ea-b97e-a9f872d817ee\', \'d2b417cf-2a36-430d-808a-4b10a4cdb284\']",), users={\'/dev/646767db-0cd4-44d6-9db4-deae4968c7cc/727f9b97-2cf0-42ea-b97e-a9f872d817ee\': [], \'/dev/646767db-0cd4-44d6-9db4-deae4968c7cc/d2b417cf-2a36-430d-808a-4b10a4cdb284\': []}' (dispatcher:83)

...

2021-04-30 09:47:15,496+0200 INFO (jsonrpc/0) [api.virt] FINISH diskReplicateFinish return={'status': {'code': 0, 'message': 'Done'}} from=::ffff:172.30.233.100,35736, flow_id=55ddb9aa-9d58-4133-9407-a5df6424bd9d, vmId=38058432-4200-486d-8738-c12d6127a790 (api:54)

2021-04-30 09:47:15,496+0200 INFO (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call VM.diskReplicateFinish took more than 1.00 seconds to succeed: 22.22 (__init__:316)
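The "in use" failures above come from lvchange refusing to deactivate LVs that qemu still holds open. As a diagnostic aid, here is a minimal sketch (not vdsm's lvm module) that checks the LVM "open" flag for the two source-domain LVs named in the error, assuming the VG and LV names taken from the log above:

import subprocess

VG = "646767db-0cd4-44d6-9db4-deae4968c7cc"
LVS = [
    "727f9b97-2cf0-42ea-b97e-a9f872d817ee",
    "d2b417cf-2a36-430d-808a-4b10a4cdb284",
]

for lv in LVS:
    # lv_attr looks like "-wi-ao----"; the 6th character is 'o' while the
    # device is open (e.g. still attached to the qemu process).
    attr = subprocess.check_output(
        ["lvs", "--noheadings", "-o", "lv_attr", "%s/%s" % (VG, lv)],
        universal_newlines=True,
    ).strip()
    is_open = len(attr) > 5 and attr[5] == "o"
    print("%s/%s: lv_attr=%s open=%s" % (VG, lv, attr, is_open))
    # While the flag is 'o', "lvchange -an" fails with
    # "Logical volume ... in use.", which is the error ignored above.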
The interesting part: we get the libvirt event that the sync is ready even before we call the replica finish:

2021-04-30 09:45:31,605+0200 INFO (libvirt/events) [virt.vm] (vmId='38058432-4200-486d-8738-c12d6127a790') Block job (untracked) type COPY for drive vda is ready (vm:5636)

We disable drive monitoring and do the pivot, but just before that we get the ready event again:

2021-04-30 09:46:54,071+0200 INFO (jsonrpc/0) [virt.vm] (vmId='38058432-4200-486d-8738-c12d6127a790') Disabling drive monitoring
2021-04-30 09:46:54,085+0200 INFO (libvirt/events) [virt.vm] (vmId='38058432-4200-486d-8738-c12d6127a790') Block job (untracked) type COPY for drive vda is ready (vm:5636)
2021-04-30 09:46:54,102+0200 INFO (libvirt/events) [virt.vm] (vmId='38058432-4200-486d-8738-c12d6127a790') Block job (untracked) type COPY for drive vda has completed (vm:5627)

So I assume that the volumes got desynchronized due to IO load. We then try to deactivate the volumes located on the source storage domain, and that fails as the volumes are still open.

The consequent merge fails with:

2021-04-30 09:47:21,789+0200 INFO (jsonrpc/5) [virt.livemerge] Job 483af0b4-195d-4ff3-947c-02bd251a9dd3 switching state from INIT to EXTEND (livemerge:166)
2021-04-30 09:47:21,798+0200 ERROR (jsonrpc/5) [api] FINISH merge error=invalid argument: invalid path /rhev/data-center/mnt/blockSD/9142d041-995f-483e-8f47-5bab695a83df/images/30e3deb8-2b1a-4aa8-8512-4c64ce0fc3f8/d2b417cf-2a36-430d-808a-4b10a4cdb284 not assigned to domain (api:134)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 124, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/API.py", line 767, in merge
    drive, baseVolUUID, topVolUUID, bandwidth, jobUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5644, in merge
    driveSpec, baseVolUUID, topVolUUID, bandwidth, jobUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/livemerge.py", line 297, in merge
    self._start_extend(drive, job, base_info, top_info)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/livemerge.py", line 431, in _start_extend
    capacity, alloc, physical = self._vm.getExtendInfo(drive)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 1243, in getExtendInfo
    capacity, alloc, physical = self._dom.blockInfo(drive.path)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 839, in blockInfo
    if ret is None: raise libvirtError ('virDomainGetBlockInfo() failed', dom=self)
libvirt.libvirtError: invalid argument: invalid path /rhev/data-center/mnt/blockSD/9142d041-995f-483e-8f47-5bab695a83df/images/30e3deb8-2b1a-4aa8-8512-4c64ce0fc3f8/d2b417cf-2a36-430d-808a-4b10a4cdb284 not assigned to domain
The volume /rhev/data-center/mnt/blockSD/9142d041-995f-483e-8f47-5bab695a83df/images/30e3deb8-2b1a-4aa8-8512-4c64ce0fc3f8/d2b417cf-2a36-430d-808a-4b10a4cdb284 is the volume from the destination domain.

Since then we periodically get:

2021-04-30 09:48:22,181+0200 ERROR (periodic/3) [virt.vm] (vmId='38058432-4200-486d-8738-c12d6127a790') Unable to get watermarks for drive vda: invalid argument: invalid path /rhev/data-center/mnt/blockSD/9142d041-995f-483e-8f47-5bab695a83df/images/30e3deb8-2b1a-4aa8-8512-4c64ce0fc3f8/d2b417cf-2a36-430d-808a-4b10a4cdb284 not assigned to domain (vm:1312)

So it seems that the pivot did not finish correctly as the LV was still open, and the xml dump shows that the image is still on the original SD:

2021-04-30 09:48:38,914+0200 INFO (jsonrpc/6) [api.host] FINISH dumpxmls return=
...
<disk type=\'block\' device=\'disk\' snapshot=\'no\'>
  <driver name=\'qemu\' type=\'qcow2\' cache=\'none\' error_policy=\'enospace\' io=\'native\' iothread=\'1\'/>
  <source dev=\'/rhev/data-center/mnt/blockSD/646767db-0cd4-44d6-9db4-deae4968c7cc/images/30e3deb8-2b1a-4aa8-8512-4c64ce0fc3f8/d2b417cf-2a36-430d-808a-4b10a4cdb284\' index=\'6\'>
    <seclabel model=\'dac\' relabel=\'no\'/>
  </source>
  <backingStore type=\'block\' index=\'4\'>
    <format type=\'qcow2\'/>
    <source dev=\'/rhev/data-center/mnt/blockSD/646767db-0cd4-44d6-9db4-deae4968c7cc/images/30e3deb8-2b1a-4aa8-8512-4c64ce0fc3f8/727f9b97-2cf0-42ea-b97e-a9f872d817ee\'>
      <seclabel model=\'dac\' relabel=\'no\'/>
    </source>
    <backingStore/>
  </backingStore>
  <target dev=\'vda\' bus=\'virtio\'/>
  <serial>30e3deb8-2b1a-4aa8-8512-4c64ce0fc3f8</serial>
  <boot order=\'1\'/>
  <alias name=\'ua-30e3deb8-2b1a-4aa8-8512-4c64ce0fc3f8\'/>
  <address type=\'pci\' domain=\'0x0000\' bus=\'0x00\' slot=\'0x05\' function=\'0x0\'/>
</disk>
...

I am not sure why libvirt reported that the type COPY job for drive vda has completed. I believe we were supposed to call libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT.
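As a rough illustration of the pivot step mentioned in the last sentence, here is a minimal sketch using libvirt-python directly; it assumes the VM UUID and drive name from this report and is not vdsm's actual replication/live merge code:

import libvirt

VM_UUID = "38058432-4200-486d-8738-c12d6127a790"
DISK = "vda"

conn = libvirt.open("qemu:///system")
try:
    dom = conn.lookupByUUIDString(VM_UUID)

    # blockJobInfo returns an empty dict when there is no job on the disk,
    # otherwise a dict with 'type', 'bandwidth', 'cur' and 'end'.
    info = dom.blockJobInfo(DISK, 0)
    if not info or info["cur"] != info["end"]:
        raise RuntimeError("copy job on %s is not ready to pivot: %r" % (DISK, info))

    # Pivot (complete) the block copy so the VM switches to the destination
    # chain; calling blockJobAbort without the PIVOT flag would instead cancel
    # the copy and leave the VM on the source volume.
    dom.blockJobAbort(DISK, libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)
finally:
    conn.close()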