+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1502352 +++
======================================================================

Description of problem:

If a block job fails on the libvirtd side, vdsm keeps re-running the live-merge cleanup thread until the VM goes down.

A merge operation starts at the backend to merge volumes ded1883f and be4a4d85:

2017-10-12 14:56:21,244+0530 INFO (jsonrpc/6) [vdsm.api] START merge(driveSpec={u'poolID': u'00000001-0001-0001-0001-000000000311', u'volumeID': u'ded1883f-94de-4510-8ddd-7ecc432e4009', u'domainID': u'8257cf14-d88d-4e4e-998c-9f8976dac2a2', u'imageID': u'7455de38-1df1-4acd-b07c-9dc2138aafb3'}, baseVolUUID=u'be4a4d85-d7e6-4725-b7f5-90c9d935c336', topVolUUID=u'ded1883f-94de-4510-8ddd-7ecc432e4009', bandwidth=u'0', jobUUID=u'26702cd2-6666-4eca-86dd-ac072f0c4478') from=::ffff:10.74.130.115,58346, flow_id=cafd88cf-d091-4a0d-95aa-645fcc714955 (api:46)

The block job terminates on the libvirtd side and vdsm detects it correctly:

2017-10-12 14:56:49,824+0530 DEBUG (periodic/0) [virt.vm] (vmId='c1ed19b5-e876-4ec0-8067-837bed214d48') Checking job 26702cd2-6666-4eca-86dd-ac072f0c4478 (vm:4760)
2017-10-12 14:56:49,840+0530 INFO (periodic/0) [virt.vm] (vmId='c1ed19b5-e876-4ec0-8067-837bed214d48') Libvirt job 26702cd2-6666-4eca-86dd-ac072f0c4478 was terminated (vm:4799)

The chain is detected correctly in both libvirtd and vdsm:

2017-10-12 14:56:50,043+0530 DEBUG (merge/c1ed19b5) [virt.vm] (vmId='c1ed19b5-e876-4ec0-8067-837bed214d48') vdsm chain: ['ded1883f-94de-4510-8ddd-7ecc432e4009', 'be4a4d85-d7e6-4725-b7f5-90c9d935c336'], libvirt chain: ['be4a4d85-d7e6-4725-b7f5-90c9d935c336', 'ded1883f-94de-4510-8ddd-7ecc432e4009'] (vm:5008)
2017-10-12 14:56:50,043+0530 INFO (merge/c1ed19b5) [vdsm.api] START imageSyncVolumeChain(sdUUID=u'8257cf14-d88d-4e4e-998c-9f8976dac2a2', imgUUID=u'7455de38-1df1-4acd-b07c-9dc2138aafb3', volUUID=u'ded1883f-94de-4510-8ddd-7ecc432e4009', newChain=['be4a4d85-d7e6-4725-b7f5-90c9d935c336', 'ded1883f-94de-4510-8ddd-7ecc432e4009']) from=internal, task_id=b67ea7bd-0073-4977-9612-bdb0d303906a (api:46)
2017-10-12 14:56:50,142+0530 INFO (merge/c1ed19b5) [storage.Image] Current chain=be4a4d85-d7e6-4725-b7f5-90c9d935c336 < ded1883f-94de-4510-8ddd-7ecc432e4009 (top) (image:1266)
2017-10-12 14:56:50,144+0530 INFO (merge/c1ed19b5) [vdsm.api] FINISH imageSyncVolumeChain return=None from=internal, task_id=b67ea7bd-0073-4977-9612-bdb0d303906a (api:52)

However, after syncVolumeChain, teardown_top_volume is executed, which fails because the image is still in use by the VM:
    self.vm._syncVolumeChain(self.drive)
    if self.doPivot:
        self.vm.enableDriveMonitor()
    self.teardown_top_volume()

2017-10-12 14:56:55,007+0530 DEBUG (merge/c1ed19b5) [storage.Misc.excCmd] FAILED: <err> = ' Logical volume 8257cf14-d88d-4e4e-998c-9f8976dac2a2/ded1883f-94de-4510-8ddd-7ecc432e4009 in use.\n'; <rc> = 5 (commands:93)
2017-10-12 14:56:55,009+0530 ERROR (merge/c1ed19b5) [root] Unhandled exception (utils:374)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 368, in wrapper
    return f(*a, **kw)
  File "/usr/share/vdsm/virt/vm.py", line 5287, in run
    self.teardown_top_volume()
  File "/usr/share/vdsm/virt/vm.py", line 5270, in teardown_top_volume
    self.job['topVolume'])
  File "/usr/share/vdsm/storage/blockSD.py", line 918, in teardownVolume
    lvm.deactivateLVs(self.sdUUID, [volUUID])
  File "/usr/share/vdsm/storage/lvm.py", line 1304, in deactivateLVs
    _setLVAvailability(vgName, toDeactivate, "n")
  File "/usr/share/vdsm/storage/lvm.py", line 843, in _setLVAvailability
    raise error(str(e))
CannotDeactivateLogicalVolume: Cannot deactivate Logical Volume: ('General Storage Exception: ("5 [] [\' Logical volume 8257cf14-d88d-4e4e-998c-9f8976dac2a2/ded1883f-94de-4510-8ddd-7ecc432e4009 in use.\']\\n8257cf14-d88d-4e4e-998c-9f8976dac2a2/[u\'ded1883f-94de-4510-8ddd-7ecc432e4009\']",)',)
2017-10-12 14:56:55,028+0530 ERROR (merge/c1ed19b5) [root] Unhandled exception (utils:374)

I think we should only call teardown_top_volume if doPivot is set.

vdsm keeps retrying this until the VM goes down:

2017-10-16 06:01:10,915+0530 DEBUG (periodic/3) [virt.vm] (vmId='c1ed19b5-e876-4ec0-8067-837bed214d48') Checking job 26702cd2-6666-4eca-86dd-ac072f0c4478 (vm:4760)
2017-10-16 06:01:10,915+0530 INFO (periodic/3) [virt.vm] (vmId='c1ed19b5-e876-4ec0-8067-837bed214d48') Previous job 26702cd2-6666-4eca-86dd-ac072f0c4478 cleanup thread failed, retrying (vm:4817)

Version-Release number of selected component (if applicable):
rhevm-4.1.6.2-0.1.el7.noarch
vdsm-api-4.19.31-1.el7ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Start a live merge.
2. During the live merge, abort the block job manually from libvirtd.

Actual results:
vdsm does not perform the cleanup correctly when the job fails in libvirtd.

Expected results:
vdsm should detect the failure correctly and correct the chains.

Additional info:

(Originally by Nijin Ashok)
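As a rough sketch of the change suggested above (reconstructed from the snippet quoted in the description, not the authoritative vdsm patch), teardown_top_volume() would move under the doPivot check so an aborted block job no longer tries to deactivate an LV the VM still holds open:

    self.vm._syncVolumeChain(self.drive)
    if self.doPivot:
        self.vm.enableDriveMonitor()
        # Deactivate the top LV only after a successful pivot; if the block
        # job was aborted on the libvirt side, the VM still uses the volume
        # and lvchange fails with "Logical volume ... in use".
        self.teardown_top_volume()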
Nijin, can you explain how to manually kill a job using libvirtd?
To save some time waiting for Nijin, I will reply to the NEEDINFO.

You have to run:

    virsh blockjob <vm_name> sda --abort

(You need virsh credentials to run the abort.)

You may want to get the info of the job before aborting, to make sure the job started. This is done by running:

    virsh -r blockjob <vm_name> sda --info

(No credentials are required here.)
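For reference, the same check and abort can also be driven from the libvirt Python bindings. A minimal sketch; the VM name 'my-vm' and the disk target 'sda' are assumptions:

    import libvirt

    # Read-write connection: aborting a block job modifies domain state.
    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('my-vm')   # hypothetical VM name

    # Equivalent of "virsh -r blockjob <vm_name> sda --info":
    # prints the current block job info for the disk, if any.
    print(dom.blockJobInfo('sda', 0))

    # Equivalent of "virsh blockjob <vm_name> sda --abort".
    dom.blockJobAbort('sda', 0)

    conn.close()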
--------------------------------------
Tested with the following code:
----------------------------------------
rhevm-4.1.7.6-0.1.el7.noarch
vdsm-4.19.36-1.el7ev.x86_64

Tested with the following scenario:

Steps to Reproduce:
1. Start a live merge.
2. During the live merge, abort the block job manually from libvirtd.

Actual results:
vdsm detected the libvirt job failure and no error is shown in the vdsm log.

Expected results:

Moving to VERIFIED!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:3139