Bug 1506161 - [downstream clone - 4.1.7] Cleanup thread for live merge executed continuously if a block job failed on the libvirtd side
Summary: [downstream clone - 4.1.7] Cleanup thread for live merge executed continuously...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.1.6
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.1.7
Target Release: ---
Assignee: Ala Hino
QA Contact: Lilach Zitnitski
URL:
Whiteboard:
Depends On: 1502352
Blocks:
 
Reported: 2017-10-25 09:59 UTC by rhev-integ
Modified: 2021-05-01 16:26 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1502352
Environment:
Last Closed: 2017-11-07 17:29:21 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2017:3139 0 normal SHIPPED_LIVE VDSM bug fix and enhancement update 4.1.7 2017-11-07 22:22:40 UTC
oVirt gerrit 83128 0 master MERGED live-merge: Don't teardown the top volume if libvirt block commit job fails 2020-07-30 14:55:11 UTC
oVirt gerrit 83265 0 ovirt-4.1 MERGED live-merge: Don't teardown the top volume if libvirt block commit job fails 2020-07-30 14:55:11 UTC

Description rhev-integ 2017-10-25 09:59:14 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1502352 +++
======================================================================

Description of problem:

If a block job fails on the libvirt side, VDSM continuously re-executes the live merge cleanup thread until the VM goes down.

A merge operation is started from the backend to merge volumes ded1883f and be4a4d85.

2017-10-12 14:56:21,244+0530 INFO  (jsonrpc/6) [vdsm.api] START merge(driveSpec={u'poolID': u'00000001-0001-0001-0001-000000000311', u'volumeID': u'ded1883f-94de-4510-8ddd-7ecc432e4009', u'domainID': u'8257cf14-d88d-4e4e-998c-9f8976dac2a2', u'imageID': u'7455de38-1df1-4acd-b07c-9dc2138aafb3'}, baseVolUUID=u'be4a4d85-d7e6-4725-b7f5-90c9d935c336', topVolUUID=u'ded1883f-94de-4510-8ddd-7ecc432e4009', bandwidth=u'0', jobUUID=u'26702cd2-6666-4eca-86dd-ac072f0c4478') from=::ffff:10.74.130.115,58346, flow_id=cafd88cf-d091-4a0d-95aa-645fcc714955 (api:46)

The block job terminates on the libvirt side and VDSM detects it correctly.

2017-10-12 14:56:49,824+0530 DEBUG (periodic/0) [virt.vm] (vmId='c1ed19b5-e876-4ec0-8067-837bed214d48') Checking job 26702cd2-6666-4eca-86dd-ac072f0c4478 (vm:4760)
2017-10-12 14:56:49,840+0530 INFO  (periodic/0) [virt.vm] (vmId='c1ed19b5-e876-4ec0-8067-837bed214d48') Libvirt job 26702cd2-6666-4eca-86dd-ac072f0c4478 was terminated (vm:4799)

The chain is detected correctly on both the libvirt and VDSM sides.

2017-10-12 14:56:50,043+0530 DEBUG (merge/c1ed19b5) [virt.vm] (vmId='c1ed19b5-e876-4ec0-8067-837bed214d48') vdsm chain: ['ded1883f-94de-4510-8ddd-7ecc432e4009', 'be4a4d85-d7e6-4725-b7f5-90c9d935c336'], libvirt chain: ['be4a4d85-d7e6-4725-b7f5-90c9d935c336', 'ded1883f-94de-4510-8ddd-7ecc432e4009'] (vm:5008)

2017-10-12 14:56:50,043+0530 INFO  (merge/c1ed19b5) [vdsm.api] START imageSyncVolumeChain(sdUUID=u'8257cf14-d88d-4e4e-998c-9f8976dac2a2', imgUUID=u'7455de38-1df1-4acd-b07c-9dc2138aafb3', volUUID=u'ded1883f-94de-4510-8ddd-7ecc432e4009', newChain=['be4a4d85-d7e6-4725-b7f5-90c9d935c336', 'ded1883f-94de-4510-8ddd-7ecc432e4009']) from=internal, task_id=b67ea7bd-0073-4977-9612-bdb0d303906a (api:46)

2017-10-12 14:56:50,142+0530 INFO  (merge/c1ed19b5) [storage.Image] Current chain=be4a4d85-d7e6-4725-b7f5-90c9d935c336 < ded1883f-94de-4510-8ddd-7ecc432e4009 (top)  (image:1266)

2017-10-12 14:56:50,144+0530 INFO  (merge/c1ed19b5) [vdsm.api] FINISH imageSyncVolumeChain return=None from=internal, task_id=b67ea7bd-0073-4977-9612-bdb0d303906a (api:52)


However, after syncVolumeChain, the cleanup thread's run() unconditionally calls teardown_top_volume, which fails because the image is still in use by the VM.

        self.vm._syncVolumeChain(self.drive)
        if self.doPivot:
            self.vm.enableDriveMonitor()
        self.teardown_top_volume()


2017-10-12 14:56:55,007+0530 DEBUG (merge/c1ed19b5) [storage.Misc.excCmd] FAILED: <err> = '  Logical volume 8257cf14-d88d-4e4e-998c-9f8976dac2a2/ded1883f-94de-4510-8ddd-7ecc432e4009 in use.\n'; <rc> = 5 (commands:93)
2017-10-12 14:56:55,009+0530 ERROR (merge/c1ed19b5) [root] Unhandled exception (utils:374)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 368, in wrapper
    return f(*a, **kw)
  File "/usr/share/vdsm/virt/vm.py", line 5287, in run
    self.teardown_top_volume()
  File "/usr/share/vdsm/virt/vm.py", line 5270, in teardown_top_volume
    self.job['topVolume'])
  File "/usr/share/vdsm/storage/blockSD.py", line 918, in teardownVolume
    lvm.deactivateLVs(self.sdUUID, [volUUID])
  File "/usr/share/vdsm/storage/lvm.py", line 1304, in deactivateLVs
    _setLVAvailability(vgName, toDeactivate, "n")
  File "/usr/share/vdsm/storage/lvm.py", line 843, in _setLVAvailability
    raise error(str(e))
CannotDeactivateLogicalVolume: Cannot deactivate Logical Volume: ('General Storage Exception: ("5 [] [\'  Logical volume 8257cf14-d88d-4e4e-998c-9f8976dac2a2/ded1883f-94de-4510-8ddd-7ecc432e4009 in use.\']\\n8257cf14-d88d-4e4e-998c-9f8976dac2a2/[u\'ded1883f-94de-4510-8ddd-7ecc432e4009\']",)',)
2017-10-12 14:56:55,028+0530 ERROR (merge/c1ed19b5) [root] Unhandled exception (utils:374)

I think we should only do teardown_top_volume if doPivot is set.
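
A minimal sketch of that suggestion, based on the snippet quoted above (illustration only; the fix actually merged for this bug, "live-merge: Don't teardown the top volume if libvirt block commit job fails", may differ in detail):

        # Illustrative sketch, not the merged vdsm patch: tear down the top
        # volume only after a successful pivot. If the libvirt block job was
        # aborted, the VM still holds the top LV open, so deactivating it
        # raises CannotDeactivateLogicalVolume and the cleanup thread fails
        # and retries forever.
        self.vm._syncVolumeChain(self.drive)
        if self.doPivot:
            self.vm.enableDriveMonitor()
            self.teardown_top_volume()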

VDSM keeps retrying this until the VM goes down.


2017-10-16 06:01:10,915+0530 DEBUG (periodic/3) [virt.vm] (vmId='c1ed19b5-e876-4ec0-8067-837bed214d48') Checking job 26702cd2-6666-4eca-86dd-ac072f0c4478 (vm:4760)
2017-10-16 06:01:10,915+0530 INFO  (periodic/3) [virt.vm] (vmId='c1ed19b5-e876-4ec0-8067-837bed214d48') Previous job 26702cd2-6666-4eca-86dd-ac072f0c4478 cleanup thread failed, retrying (vm:4817)


Version-Release number of selected component (if applicable):

rhevm-4.1.6.2-0.1.el7.noarch
vdsm-api-4.19.31-1.el7ev.noarch

How reproducible:

100%

Steps to Reproduce:

1. Start a live merge.

2. During the live merge, abort the block job manually on the libvirt side.


Actual results:

VDSM does not perform the cleanup correctly if the job failed on the libvirt side.

Expected results:

vdsm should detect the failure correctly and correct the chains.

Additional info:

(Originally by Nijin Ashok)

Comment 5 Lilach Zitnitski 2017-11-05 08:11:17 UTC
Nijin, can you explain how to manually kill a job using libvirtd?

Comment 6 Ala Hino 2017-11-05 09:59:15 UTC
To save some time waiting for Nijin, I will reply to the NEEDINFO.

You have to run: virsh blockjob <vm_name> sda --abort
(You need virsh credentials to run the abort)

You may want to get the info of the job before aborting, to make sure the job started. This is done by running: virsh -r blockjob <vm_name> sda --info
(No credentials are required here)

Comment 7 Lilach Zitnitski 2017-11-06 12:23:41 UTC
--------------------------------------
Tested with the following code:
----------------------------------------
rhevm-4.1.7.6-0.1.el7.noarch
vdsm-4.19.36-1.el7ev.x86_64

Tested with the following scenario:

Steps to Reproduce:
1. Start a live merge.
2. During the live merge, abort the block job manually on the libvirt side.

Actual results:
VDSM detected the libvirt job failure and no errors are shown in the vdsm log.

Expected results:

Moving to VERIFIED!

Comment 9 errata-xmlrpc 2017-11-07 17:29:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3139

Comment 10 Daniel Gur 2019-08-28 13:15:26 UTC
sync2jira

Comment 11 Daniel Gur 2019-08-28 13:21:13 UTC
sync2jira

