Bug 1502352 - Cleanup thread for live merge executed continuously if a block job failed in libvirtd side
Summary: Cleanup thread for live merge executed continuously if a block job failed in libvirtd side
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.1.6
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.0
Assignee: Ala Hino
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks: 1506161
 
Reported: 2017-10-16 02:13 UTC by nijin ashok
Modified: 2021-03-11 17:56 UTC
CC List: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, if a block job was manually aborted during a live merge, the merge operation failed and vdsm retried the cleanup indefinitely. In this release, vdsm detects the failure and handles the cleanup correctly.
Clone Of:
: 1506161
Environment:
Last Closed: 2018-05-15 17:52:46 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-




Links
Red Hat Knowledge Base (Solution) 3218311 - last updated 2017-11-20 04:48:37 UTC
Red Hat Product Errata RHEA-2018:1489 - last updated 2018-05-15 17:53:52 UTC
oVirt gerrit 83128 (master, MERGED): live-merge: Don't teardown the top volume if libvirt block commit job fails - last updated 2021-02-04 17:46:44 UTC

Description nijin ashok 2017-10-16 02:13:49 UTC
Description of problem:

If a block job fails for any reason on the libvirtd side, vdsm will continuously execute the live merge cleanup thread until the VM goes down.

A merge operation starts at the backend to merge volumes ded1883f and be4a4d85.

2017-10-12 14:56:21,244+0530 INFO  (jsonrpc/6) [vdsm.api] START merge(driveSpec={u'poolID': u'00000001-0001-0001-0001-000000000311', u'volumeID': u'ded1883f-94de-4510-8ddd-7ecc432e4009', u'domainID': u'8257cf14-d88d-4e4e-998c-9f8976dac2a2', u'imageID': u'7455de38-1df1-4acd-b07c-9dc2138aafb3'}, baseVolUUID=u'be4a4d85-d7e6-4725-b7f5-90c9d935c336', topVolUUID=u'ded1883f-94de-4510-8ddd-7ecc432e4009', bandwidth=u'0', jobUUID=u'26702cd2-6666-4eca-86dd-ac072f0c4478') from=::ffff:10.74.130.115,58346, flow_id=cafd88cf-d091-4a0d-95aa-645fcc714955 (api:46)

The block job terminated on the libvirtd side, and vdsm detected it correctly.

2017-10-12 14:56:49,824+0530 DEBUG (periodic/0) [virt.vm] (vmId='c1ed19b5-e876-4ec0-8067-837bed214d48') Checking job 26702cd2-6666-4eca-86dd-ac072f0c4478 (vm:4760)
2017-10-12 14:56:49,840+0530 INFO  (periodic/0) [virt.vm] (vmId='c1ed19b5-e876-4ec0-8067-837bed214d48') Libvirt job 26702cd2-6666-4eca-86dd-ac072f0c4478 was terminated (vm:4799)

The volume chain was detected correctly in both libvirtd and vdsm.

2017-10-12 14:56:50,043+0530 DEBUG (merge/c1ed19b5) [virt.vm] (vmId='c1ed19b5-e876-4ec0-8067-837bed214d48') vdsm chain: ['ded1883f-94de-4510-8ddd-7ecc432e4009', 'be4a4d85-d7e6-4725-b7f5-90c9d935c336'], libvirt chain: ['be4a4d85-d7e6-4725-b7f5-90c9d935c336', 'ded1883f-94de-4510-8ddd-7ecc432e4009'] (vm:5008)

2017-10-12 14:56:50,043+0530 INFO  (merge/c1ed19b5) [vdsm.api] START imageSyncVolumeChain(sdUUID=u'8257cf14-d88d-4e4e-998c-9f8976dac2a2', imgUUID=u'7455de38-1df1-4acd-b07c-9dc2138aafb3', volUUID=u'ded1883f-94de-4510-8ddd-7ecc432e4009', newChain=['be4a4d85-d7e6-4725-b7f5-90c9d935c336', 'ded1883f-94de-4510-8ddd-7ecc432e4009']) from=internal, task_id=b67ea7bd-0073-4977-9612-bdb0d303906a (api:46)

2017-10-12 14:56:50,142+0530 INFO  (merge/c1ed19b5) [storage.Image] Current chain=be4a4d85-d7e6-4725-b7f5-90c9d935c336 < ded1883f-94de-4510-8ddd-7ecc432e4009 (top)  (image:1266)

2017-10-12 14:56:50,144+0530 INFO  (merge/c1ed19b5) [vdsm.api] FINISH imageSyncVolumeChain return=None from=internal, task_id=b67ea7bd-0073-4977-9612-bdb0d303906a (api:52)


However, after syncVolumeChain, teardown_top_volume was executed, which fails because the image is still in use by the VM.

        self.vm._syncVolumeChain(self.drive)
        if self.doPivot:
            self.vm.enableDriveMonitor()
        self.teardown_top_volume()


2017-10-12 14:56:55,007+0530 DEBUG (merge/c1ed19b5) [storage.Misc.excCmd] FAILED: <err> = '  Logical volume 8257cf14-d88d-4e4e-998c-9f8976dac2a2/ded1883f-94de-4510-8ddd-7ecc432e4009 in use.\n'; <rc> = 5 (commands:93)
2017-10-12 14:56:55,009+0530 ERROR (merge/c1ed19b5) [root] Unhandled exception (utils:374)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 368, in wrapper
    return f(*a, **kw)
  File "/usr/share/vdsm/virt/vm.py", line 5287, in run
    self.teardown_top_volume()
  File "/usr/share/vdsm/virt/vm.py", line 5270, in teardown_top_volume
    self.job['topVolume'])
  File "/usr/share/vdsm/storage/blockSD.py", line 918, in teardownVolume
    lvm.deactivateLVs(self.sdUUID, [volUUID])
  File "/usr/share/vdsm/storage/lvm.py", line 1304, in deactivateLVs
    _setLVAvailability(vgName, toDeactivate, "n")
  File "/usr/share/vdsm/storage/lvm.py", line 843, in _setLVAvailability
    raise error(str(e))
CannotDeactivateLogicalVolume: Cannot deactivate Logical Volume: ('General Storage Exception: ("5 [] [\'  Logical volume 8257cf14-d88d-4e4e-998c-9f8976dac2a2/ded1883f-94de-4510-8ddd-7ecc432e4009 in use.\']\\n8257cf14-d88d-4e4e-998c-9f8976dac2a2/[u\'ded1883f-94de-4510-8ddd-7ecc432e4009\']",)',)
2017-10-12 14:56:55,028+0530 ERROR (merge/c1ed19b5) [root] Unhandled exception (utils:374)
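
For reference, that the top volume's LV is still held open by the running VM can be confirmed on the host with lvs (using the VG/LV names from the log above; the sixth character of the lv_attr field is 'o' while a device has the volume open):

    lvs -o lv_name,lv_attr 8257cf14-d88d-4e4e-998c-9f8976dac2a2/ded1883f-94de-4510-8ddd-7ecc432e4009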

I think we should only do teardown_top_volume if doPivot is set.
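
A minimal sketch of that suggestion, based on the cleanup code quoted above (an illustration only, not the merged patch; the change merged upstream, gerrit 83128, instead avoids tearing down the top volume when the libvirt block commit job fails):

        self.vm._syncVolumeChain(self.drive)
        if self.doPivot:
            self.vm.enableDriveMonitor()
            # Tear down the top volume only when a pivot was requested.
            # When the libvirt block job failed (as in this bug), the VM
            # still holds the volume open, deactivating the LV raises
            # CannotDeactivateLogicalVolume, and the cleanup thread is
            # retried until the VM goes down.
            self.teardown_top_volume()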

vdsm keeps retrying this cleanup until the VM goes down.


2017-10-16 06:01:10,915+0530 DEBUG (periodic/3) [virt.vm] (vmId='c1ed19b5-e876-4ec0-8067-837bed214d48') Checking job 26702cd2-6666-4eca-86dd-ac072f0c4478 (vm:4760)
2017-10-16 06:01:10,915+0530 INFO  (periodic/3) [virt.vm] (vmId='c1ed19b5-e876-4ec0-8067-837bed214d48') Previous job 26702cd2-6666-4eca-86dd-ac072f0c4478 cleanup thread failed, retrying (vm:4817)


Version-Release number of selected component (if applicable):

rhevm-4.1.6.2-0.1.el7.noarch
vdsm-api-4.19.31-1.el7ev.noarch

How reproducible:

100%

Steps to Reproduce:

1. Start a live merge.

2. During the live merge, abort the block job manually on the libvirtd side (see the virsh example below).
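
For example, the running block job can be listed and aborted with virsh (the domain name and disk target below are placeholders):

    virsh blockjob <vm-name> vda --info
    virsh blockjob <vm-name> vda --abort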


Actual results:

vdsm does not perform the cleanup correctly if the job fails on the libvirtd side.

Expected results:

vdsm should detect the failure correctly and correct the chains.

Additional info:

Comment 5 Lilach Zitnitski 2017-11-14 08:23:03 UTC
--------------------------------------
Tested with the following code:
----------------------------------------
ovirt-engine-4.2.0-0.0.master.20171112130303.git8bc889c.el7.centos.noarch
vdsm-4.20.6-62.gitd3023e4.el7.centos.x86_64

Tested with the following scenario:

Steps to Reproduce:
1. Start a live merge.
2. During the live merge, abort the block job manually on the libvirtd side.

Actual results:
vdsm detected the libvirt job failure and no errors were shown in the vdsm log.

Expected results:

Moving to VERIFIED!

Comment 11 errata-xmlrpc 2018-05-15 17:52:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1489

Comment 12 Franta Kust 2019-05-16 13:06:29 UTC
BZ<2>Jira Resync

Comment 13 Daniel Gur 2019-08-28 13:13:30 UTC
sync2jira

Comment 14 Daniel Gur 2019-08-28 13:17:44 UTC
sync2jira

