Bug 1422670 - During a Live Merge, VDSM still saw a block job, even though libvirt's block job had completed.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.1.5
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.0
Target Release: 4.2.0
Assignee: Ala Hino
QA Contact: Natalie Gavrielov
URL:
Whiteboard:
Depends On: 1418280
Blocks: 1506157
 
Reported: 2017-02-15 20:00 UTC by Gordon Watson
Modified: 2021-09-09 12:08 UTC
CC: 13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, a live merge could fail during the merge status, delete volume, or verify delete volume steps. The live merge operation now does not fail until those steps have completed successfully.
Clone Of:
Clones: 1506157
Environment:
Last Closed: 2018-05-15 17:50:23 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
VDSM Log from SPM (4.53 MB, text/plain), 2017-09-07 19:43 UTC, Frank DeLorey
GUI showing failed task (571.87 KB, image/png), 2017-09-21 17:03 UTC, Frank DeLorey


Links
Red Hat Knowledge Base (Solution) 2930801, last updated 2017-02-15 20:21:41 UTC
Red Hat Product Errata RHEA-2018:1489, last updated 2018-05-15 17:52:06 UTC
oVirt gerrit 83128, MERGED: live-merge: Don't teardown the top volume if libvirt block commit job fails, last updated 2021-02-03 15:56:59 UTC

Description Gordon Watson 2017-02-15 20:00:50 UTC
Description of problem:

A live merge never completed, leaving a block job still visible to VDSM, whereas libvirt's block job had actually completed.

Since the live merge had failed on the engine side, subsequent attempts to retry it failed with "Cannot add block job <uuid>. A block job with id <uuid> already exists for image <uuid>".

The symptoms were:

- "Pivot completed" was not reported in the vdsm logs.
- "libvirt chain" was not reported in the vdsm logs.
- "The XML update has been completed" was not reported in the vdsm logs.
- 'virsh -r domblklist' showed that the volume that was active prior to the merge was still listed as the active volume.
- 'lsof' showed that this volume was no longer open, and therefore the parent volume was now the real active volume.
- 'virsh -r blockjob' showed that there were no active block jobs.
- 'vdsClient -s 0 getAllVmStats' showed that VDSM still saw an active block job.
- The RHEV database still contained all of the images that existed prior to the merge.
- One was in an "illegal" state.
- The volume metadata for the original active volume was marked as "ILLEGAL".


So, the merge had actually completed, and the 'qemu-kvm' process was no longer writing to the previous active volume, but 'libvirt' had not been updated to reflect this, even though the block job had completed.
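
For cross-checking in cases like this, here is a minimal sketch of how libvirt's view can be confirmed from Python, assuming the libvirt Python bindings are installed; the domain and disk names are illustrative. It mirrors 'virsh -r blockjob <vm> <disk> --info':

import libvirt

# Read-only connection, the same view 'virsh -r' uses.
conn = libvirt.openReadOnly('qemu:///system')
dom = conn.lookupByName('example-vm')  # illustrative domain name

# blockJobInfo() returns an empty dict when no block job is active
# on the given disk.
info = dom.blockJobInfo('vda', 0)
print(info or 'No current block job for vda')

conn.close()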



Version-Release number of selected component (if applicable):

RHV 4.0.3
RHV-H 4.0 for 7.2
  vdsm-4.18.11-1.el7
  libvirt-1.2.17-13.el7_2.5


How reproducible:

Not reproducible.



Comment 4 Tal Nisan 2017-02-19 17:04:45 UTC
Ala, this looks similar to the pivot issue you just worked on. Can you have a look, please?

Comment 5 Ala Hino 2017-02-19 19:25:16 UTC
Seems to be a duplicate of BZ 1414067. Gordon, can you please confirm?

Comment 8 Marina Kalinin 2017-02-21 20:11:11 UTC
4.0.z bugzilla for logging: bz#1418280.

Comment 10 Frank DeLorey 2017-09-07 19:41:22 UTC
Customer hit this with 4.1.5 and RHEL 7.4 hosts.

vdsm-4.19.28-1.el7ev.x86_64 
qemu-kvm-rhev-2.9.0-16.el7_4.3.x86_64 
libvirt-client-3.2.0-14.el7_4.2.x86_64
rhevm-4.1.5.2-0.1.el7.noarch 

- Below is some of the VM and disk image related data from the database:

               vm_guid                |          vm_name           
--------------------------------------+----------------------------
 3d820a7d-29e4-40df-aea0-2c74f239fc98 | c7-test-thinkdisk-multiple


                vds_id                |             vds_name             
--------------------------------------+----------------------------------
 a7a854bb-ca59-4bcc-8080-5e6a4803c842 | lasdc-vnode-09.lxi.mosaic451.com




### Disk image 963677f1-1366-4445-a94d-2b0bfa62841f is in "Illegal" status in the database:


          vm_name           |              device_id               |    alias     |            disk_alias            |                  lv                  | size |               parentid               | imagestatus | volume_type | volume_format |          storage_domain_id           | storage_name
----------------------------+--------------------------------------+--------------+----------------------------------+--------------------------------------+------+--------------------------------------+-------------+-------------+---------------+--------------------------------------+--------------
 c7-test-thinkdisk-multiple | aa35c389-6a39-48d6-a55b-4281ef4a928a | virtio-disk0 | c7-test-thindisk_Disk1           | 7d096715-fd9e-446f-8d37-5b7fc513291b |   20 | 963677f1-1366-4445-a94d-2b0bfa62841f |           1 |           2 |             4 | 8863c919-bccb-461f-a366-71eba6f13838 | Dell2TB
 c7-test-thinkdisk-multiple | aa35c389-6a39-48d6-a55b-4281ef4a928a | virtio-disk0 | c7-test-thindisk_Disk1           | 963677f1-1366-4445-a94d-2b0bfa62841f |   20 | 00000000-0000-0000-0000-000000000000 |           4 |           2 |             4 | 8863c919-bccb-461f-a366-71eba6f13838 | Dell2TB
 c7-test-thinkdisk-multiple | baf7a0ea-4727-400b-b089-28e7e080e637 | virtio-disk1 | c7-test-thinkdisk-multiple_Disk1 | b2d05d2b-3d99-4c74-a511-090e90bec17d |  100 | 00000000-0000-0000-0000-000000000000 |           1 |           2 |             4 | 8863c919-bccb-461f-a366-71eba6f13838 | Dell2TB
(3 rows)



- The VM disk volume merge operation still failed:
###################################################


2017-09-06 02:27:37,677Z INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-6) [a4bc2686-9112-47ff-af1d-3a1b93d0330c] START, MergeVDSCommand(HostName = lasdc-vnode-09.lxi.mosaic451.com, MergeVDSCommandParameters:{runAsync='true', hostId='a7a854bb-ca59-4bcc-8080-5e6a4803c842', vmId='3d820a7d-29e4-40df-aea0-2c74f239fc98', storagePoolId='00000001-0001-0001-0001-000000000311', storageDomainId='8863c919-bccb-461f-a366-71eba6f13838', imageGroupId='aa35c389-6a39-48d6-a55b-4281ef4a928a', imageId='7d096715-fd9e-446f-8d37-5b7fc513291b', baseImageId='963677f1-1366-4445-a94d-2b0bfa62841f', topImageId='7d096715-fd9e-446f-8d37-5b7fc513291b', bandwidth='0'}), log id: 58bca14b
2017-09-06 02:27:39,014Z ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-6) [a4bc2686-9112-47ff-af1d-3a1b93d0330c] Failed in 'MergeVDS' method
2017-09-06 02:27:39,018Z ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-5-thread-6) [a4bc2686-9112-47ff-af1d-3a1b93d0330c] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM lasdc-vnode-09.lxi.mosaic451.com command MergeVDS failed: Merge failed
2017-09-06 02:27:39,018Z INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-6) [a4bc2686-9112-47ff-af1d-3a1b93d0330c] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand' return value 'StatusOnlyReturn [status=Status [code=52, message=Merge failed]]'


2017-09-06 02:27:39,018Z ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-6) [a4bc2686-9112-47ff-af1d-3a1b93d0330c] Command 'MergeVDSCommand(HostName = lasdc-vnode-09.lxi.mosaic451.com, MergeVDSCommandParameters:{runAsync='true', hostId='a7a854bb-ca59-4bcc-8080-5e6a4803c842', vmId='3d820a7d-29e4-40df-aea0-2c74f239fc98', storagePoolId='00000001-0001-0001-0001-000000000311', storageDomainId='8863c919-bccb-461f-a366-71eba6f13838', imageGroupId='aa35c389-6a39-48d6-a55b-4281ef4a928a', imageId='7d096715-fd9e-446f-8d37-5b7fc513291b', baseImageId='963677f1-1366-4445-a94d-2b0bfa62841f', topImageId='7d096715-fd9e-446f-8d37-5b7fc513291b', bandwidth='0'})' execution failed: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52



- The reason for this failure is that VDSM still considers an earlier block job to be running:
############################################################################

2017-09-06 02:27:47,423+0000 ERROR (jsonrpc/1) [virt.vm] (vmId='3d820a7d-29e4-40df-aea0-2c74f239fc98') Cannot add block job 5ed7a1d8-acaa-4154-9e33-5a7489efcfe5.  A block job with id 77e3ad6a-2b46-4a9b-a401-e7fc93e96b24 already exists for image aa35c389-6a39-48d6-a55b-4281ef4a928a (vm:4646)
2017-09-06 02:27:47,423+0000 ERROR (jsonrpc/1) [virt.vm] (vmId='3d820a7d-29e4-40df-aea0-2c74f239fc98') A block job is already active on this disk (vm:4867)
2017-09-06 02:27:47,423+0000 INFO  (jsonrpc/1) [vdsm.api] FINISH merge return={'status': {'message': 'Merge failed', 'code': 52}} from=::ffff:100.64.64.9,54598, flow_id=a4bc2686-9112-47ff-af1d-3a1b93d0330c (api:52)
2017-09-06 02:27:47,424+0000 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call VM.merge failed (error 52) in 0.34 seconds (__init__:539)
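
The refusal comes from VDSM's one-block-job-per-image bookkeeping. A minimal sketch of that invariant (illustrative names, not VDSM's actual internals):

class BlockJobTracker(object):
    # Illustrative sketch: at most one tracked block job per image,
    # which is why a single stale entry blocks all retries of the merge.
    def __init__(self):
        self._jobs = {}  # image UUID -> job UUID

    def track(self, image_id, job_id):
        existing = self._jobs.get(image_id)
        if existing is not None:
            raise RuntimeError(
                "Cannot add block job %s. A block job with id %s "
                "already exists for image %s" % (job_id, existing, image_id))
        self._jobs[image_id] = job_id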


- It still repeatedly tries to synchronize the volume chain:
#######################################################

2017-09-06 02:27:47,639+0000 INFO  (periodic/1) [virt.vm] (vmId='3d820a7d-29e4-40df-aea0-2c74f239fc98') Previous job 77e3ad6a-2b46-4a9b-a401-e7fc93e96b24 cleanup thread failed, retrying (vm:4773)
2017-09-06 02:27:47,863+0000 INFO  (merge/3d820a7d) [virt.vm] (vmId='3d820a7d-29e4-40df-aea0-2c74f239fc98') Synchronizing volume chain after live merge (job 77e3ad6a-2b46-4a9b-a401-e7fc93e96b24) (vm:5238)

2017-09-06 02:28:02,635+0000 INFO  (periodic/2) [virt.vm] (vmId='3d820a7d-29e4-40df-aea0-2c74f239fc98') Previous job 77e3ad6a-2b46-4a9b-a401-e7fc93e96b24 cleanup thread failed, retrying (vm:4773)
2017-09-06 02:28:02,882+0000 INFO  (merge/3d820a7d) [virt.vm] (vmId='3d820a7d-29e4-40df-aea0-2c74f239fc98') Synchronizing volume chain after live merge (job 77e3ad6a-2b46-4a9b-a401-e7fc93e96b24) (vm:5238)

2017-09-06 02:28:17,643+0000 INFO  (periodic/2) [virt.vm] (vmId='3d820a7d-29e4-40df-aea0-2c74f239fc98') Previous job 77e3ad6a-2b46-4a9b-a401-e7fc93e96b24 cleanup thread failed, retrying (vm:4773)
2017-09-06 02:28:17,853+0000 INFO  (merge/3d820a7d) [virt.vm] (vmId='3d820a7d-29e4-40df-aea0-2c74f239fc98') Synchronizing volume chain after live merge (job 77e3ad6a-2b46-4a9b-a401-e7fc93e96b24) (vm:5238)

2017-09-06 02:42:17,659+0000 INFO  (periodic/2) [virt.vm] (vmId='3d820a7d-29e4-40df-aea0-2c74f239fc98') Previous job 77e3ad6a-2b46-4a9b-a401-e7fc93e96b24 cleanup thread failed, retrying (vm:4773)
2017-09-06 02:42:17,883+0000 INFO  (merge/3d820a7d) [virt.vm] (vmId='3d820a7d-29e4-40df-aea0-2c74f239fc98') Synchronizing volume chain after live merge (job 77e3ad6a-2b46-4a9b-a401-e7fc93e96b24) (vm:5238)




- VDSM was restarted at "2017-09-05 15:57":
###############################################


2017-09-05 15:57:16,666+0000 INFO  (MainThread) [vds] (PID: 19428) I am the actual vdsm 4.19.28-1.el7ev lasdc-vnode-09.lxi.mosaic451.com (3.10.0-693.1.1.el7.x86_64) (vdsm:145)


- But it still fails with similar errors:
########################################

2017-09-05 15:57:32,376+0000 INFO  (merge/3d820a7d) [virt.vm] (vmId='3d820a7d-29e4-40df-aea0-2c74f239fc98') Synchronizing volume chain after live merge (job 77e3ad6a-2b46-4a9b-a401-e7fc93e96b24) (vm:5238)
2017-09-05 15:57:32,386+0000 INFO  (merge/3d820a7d) [vdsm.api] START imageSyncVolumeChain(sdUUID=u'8863c919-bccb-461f-a366-71eba6f13838', imgUUID=u'aa35c389-6a39-48d6-a55b-4281ef4a928a', volUUID=u'7d096715-fd9e-446f-8d37-5b7fc513291b', newChain=['963677f1-1366-4445-a94d-2b0bfa62841f', '7d096715-fd9e-446f-8d37-5b7fc513291b']) from=internal, task_id=f4971cc1-9b21-41af-bd6c-b98a36adb39b (api:46)




2017-09-05 15:57:32,546+0000 INFO  (merge/3d820a7d) [storage.Image] Current chain=963677f1-1366-4445-a94d-2b0bfa62841f < 7d096715-fd9e-446f-8d37-5b7fc513291b (top)  (image:1266)
2017-09-05 15:57:32,546+0000 INFO  (merge/3d820a7d) [vdsm.api] FINISH imageSyncVolumeChain return=None from=internal, task_id=f4971cc1-9b21-41af-bd6c-b98a36adb39b (api:52)
2017-09-05 15:57:32,547+0000 INFO  (merge/3d820a7d) [storage.LVM] Deactivating lvs: vg=8863c919-bccb-461f-a366-71eba6f13838 lvs=[u'7d096715-fd9e-446f-8d37-5b7fc513291b'] (lvm:1303)

2017-09-05 15:57:47,678+0000 INFO  (merge/3d820a7d) [storage.Image] Current chain=963677f1-1366-4445-a94d-2b0bfa62841f < 7d096715-fd9e-446f-8d37-5b7fc513291b (top)  (image:1266)
2017-09-05 15:57:47,678+0000 INFO  (merge/3d820a7d) [vdsm.api] FINISH imageSyncVolumeChain return=None from=internal, task_id=e5eea8a5-c10d-43a0-95f0-55b0070b7ff0 (api:52)
2017-09-05 15:57:47,679+0000 INFO  (merge/3d820a7d) [storage.LVM] Deactivating lvs: vg=8863c919-bccb-461f-a366-71eba6f13838 lvs=[u'7d096715-fd9e-446f-8d37-5b7fc513291b'] (lvm:1303)

2017-09-05 15:57:42,388+0000 ERROR (merge/3d820a7d) [root] Unhandled exception (utils:374)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 368, in wrapper
    return f(*a, **kw)
  File "/usr/share/vdsm/virt/vm.py", line 5242, in run
    self.teardown_top_volume()
  File "/usr/share/vdsm/virt/vm.py", line 5225, in teardown_top_volume
    self.job['topVolume'])
  File "/usr/share/vdsm/storage/blockSD.py", line 918, in teardownVolume
    lvm.deactivateLVs(self.sdUUID, [volUUID])
  File "/usr/share/vdsm/storage/lvm.py", line 1304, in deactivateLVs
    _setLVAvailability(vgName, toDeactivate, "n")
  File "/usr/share/vdsm/storage/lvm.py", line 843, in _setLVAvailability
    raise error(str(e))
CannotDeactivateLogicalVolume: Cannot deactivate Logical Volume: ('General Storage Exception: ("5 [] [\'  Logical volume 8863c919-bccb-461f-a366-71eba6f13838/7d096715-fd9e-446f-8d37-5b7fc513291b in use.\']\\n8863c919-bccb-461f-a366-71eba6f13838/[u\'7d096715-fd9e-446f-8d37-5b7fc513291b\']",)',)
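
This teardown-while-in-use failure is what the eventual fix addresses (gerrit 83128: "live-merge: Don't teardown the top volume if libvirt block commit job fails"). A sketch of that direction, with hypothetical names rather than the merged patch:

def maybe_teardown_top_volume(job, teardown):
    # Sketch only: tear down the top volume only after the libvirt block
    # commit job succeeded, so an LV still held open by qemu is never
    # deactivated ("Logical volume ... in use").
    if not job.get('commitSucceeded'):  # hypothetical job field
        return  # leave the volume active; cleanup will be retried later
    teardown(job['topVolume'])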


# virsh -r domblklist c7-test-thinkdisk-multiple 
Target     Source
------------------------------------------------
hdc        -
vda        /rhev/data-center/00000001-0001-0001-0001-000000000311/8863c919-bccb-461f-a366-71eba6f13838/images/aa35c389-6a39-48d6-a55b-4281ef4a928a/7d096715-fd9e-446f-8d37-5b7fc513291b
vdb        /rhev/data-center/00000001-0001-0001-0001-000000000311/8863c919-bccb-461f-a366-71eba6f13838/images/baf7a0ea-4727-400b-b089-28e7e080e637/b2d05d2b-3d99-4c74-a511-090e90bec17d

# virsh -r blockjob c7-test-thinkdisk-multiple vda --info 
No current block job for vda

# virsh -r blockjob c7-test-thinkdisk-multiple vdb --info 
No current block job for vdb

[root@lasdc-vnode-09 ~]# su vdsm -s /bin/sh -c 'qemu-img info --backing-chain /rhev/data-center/00000001-0001-0001-0001-000000000311/8863c919-bccb-461f-a366-71eba6f13838/images/aa35c389-6a39-48d6-a55b-4281ef4a928a/7d096715-fd9e-446f-8d37-5b7fc513291b'
image: /rhev/data-center/00000001-0001-0001-0001-000000000311/8863c919-bccb-461f-a366-71eba6f13838/images/aa35c389-6a39-48d6-a55b-4281ef4a928a/7d096715-fd9e-446f-8d37-5b7fc513291b
file format: qcow2
virtual size: 20G (21474836480 bytes)
disk size: 0
cluster_size: 65536
backing file: 963677f1-1366-4445-a94d-2b0bfa62841f (actual path: /rhev/data-center/00000001-0001-0001-0001-000000000311/8863c919-bccb-461f-a366-71eba6f13838/images/aa35c389-6a39-48d6-a55b-4281ef4a928a/963677f1-1366-4445-a94d-2b0bfa62841f)
backing file format: qcow2
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

image: /rhev/data-center/00000001-0001-0001-0001-000000000311/8863c919-bccb-461f-a366-71eba6f13838/images/aa35c389-6a39-48d6-a55b-4281ef4a928a/963677f1-1366-4445-a94d-2b0bfa62841f
file format: qcow2
virtual size: 20G (21474836480 bytes)
disk size: 0
cluster_size: 65536
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

[root@lasdc-vnode-09 ~]# su vdsm -s /bin/sh -c '/rhev/data-center/00000001-0001-0001-0001-000000000311/8863c919-bccb-461f-a366-71eba6f13838/images/baf7a0ea-4727-400b-b089-28e7e080e637/b2d05d2b-3d99-4c74-a511-090e90bec17d'
sh: /rhev/data-center/00000001-0001-0001-0001-000000000311/8863c919-bccb-461f-a366-71eba6f13838/images/baf7a0ea-4727-400b-b089-28e7e080e637/b2d05d2b-3d99-4c74-a511-090e90bec17d: Permission denied
[root@lasdc-vnode-09 ~]# su vdsm -s /bin/sh -c 'qemu-img info --backing-chain /rhev/data-center/00000001-0001-0001-0001-000000000311/8863c919-bccb-461f-a366-71eba6f13838/images/baf7a0ea-4727-400b-b089-28e7e080e637/b2d05d2b-3d99-4c74-a511-090e90bec17d'
image: /rhev/data-center/00000001-0001-0001-0001-000000000311/8863c919-bccb-461f-a366-71eba6f13838/images/baf7a0ea-4727-400b-b089-28e7e080e637/b2d05d2b-3d99-4c74-a511-090e90bec17d
file format: qcow2
virtual size: 100G (107374182400 bytes)
disk size: 0
cluster_size: 65536
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

[root@lasdc-vnode-07 ~]# vdsClient -s 0 getAllTasksInfo
Not SPM
[root@lasdc-vnode-07 ~]# vdsClient -s 0 getAllTasksStatuses
Not SPM

[root@lasdc-vnode-08 ~]# vdsClient -s 0 getAllTasksInfo
e8e2f8cc-af7d-43a7-afb2-f6ff1859ec30 :
	 verb = copyImage
	 id = e8e2f8cc-af7d-43a7-afb2-f6ff1859ec30

[root@lasdc-vnode-08 ~]# vdsClient -s 0 getAllTasksStatuses
{'status': {'message': 'OK', 'code': 0}, 'allTasksStatus': {'e8e2f8cc-af7d-43a7-afb2-f6ff1859ec30': {'message': 'e8e2f8cc-af7d-43a7-afb2-f6ff1859ec30', 'code': 411, 'taskID': 'e8e2f8cc-af7d-43a7-afb2-f6ff1859ec30', 'taskResult': 'cleanSuccess', 'taskState': 'finished'}}}


[root@lasdc-vnode-10 ~]# vdsClient -s 0 getAllTasksInfo
Not SPM
[root@lasdc-vnode-10 ~]# vdsClient -s 0 getAllTasksStatuses
Not SPM


[root@lasdc-rhvm ~]# /usr/share/ovirt-engine/dbscripts/engine-psql.sh -c "select * from job;"
                job_id                |    action_type    |                              description                               | status  |               owner_id               | visible |        start_time         | end_time |      last_update_time      |            correlation_id            | is_external | is_auto_cleared | engine_session_seq_id
--------------------------------------+-------------------+------------------------------------------------------------------------+---------+--------------------------------------+---------+---------------------------+----------+----------------------------+--------------------------------------+-------------+-----------------+-----------------------
 f97b58b3-d495-486c-83ca-84cf76a01c26 | AddVmFromSnapshot | Creating VM LASDC1709060703 from Snapshot Active VM in Cluster Default | STARTED | 5981fe56-014c-02ff-03e5-00000000033f | t       | 2017-09-06 07:04:05.05+00 |          | 2017-09-06 07:04:16.999+00 | baa68d02-2efd-4096-9b51-b06fefae6553 | f           | t               |                   288
(1 row)


[root@lasdc-rhvm ~]# /usr/share/ovirt-engine/dbscripts/engine-psql.sh -c "select * from async_tasks;"
 task_id | action_type | status | result | step_id | command_id | started_at | storage_pool_id | task_type | vdsm_task_id | root_command_id | user_id
---------+-------------+--------+--------+---------+------------+------------+-----------------+-----------+--------------+-----------------+---------
(0 rows)

Comment 11 Frank DeLorey 2017-09-07 19:43:43 UTC
Created attachment 1323448 [details]
VDSM Log from SPM

Comment 12 Frank DeLorey 2017-09-21 17:03:34 UTC
Created attachment 1329085 [details]
GUI showing failed task

Comment 13 Germano Veit Michel 2017-10-11 05:35:12 UTC
Let me add my two cents, as I saw something that seems to be the same issue and may help here.

Initially:
1. Snapshot merge fails due to BZ 1414067.
2. Support manually aborts the block job on libvirt using virsh --abort (see the sketch after this list).
3. vdsm/libvirt are restarted.
4. There are no more running block jobs on libvirt.
5. The customer attempts to delete the snapshot again and it fails: VDSM thinks there are block jobs in progress, but there are none (confirmed on libvirt).
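
For reference, the manual abort in step 2, sketched with the libvirt Python bindings (domain and disk names are illustrative):

import libvirt

conn = libvirt.open('qemu:///system')  # read-write: aborting changes state
dom = conn.lookupByName('example-vm')  # illustrative domain name

# Equivalent to 'virsh blockjob example-vm vda --abort': cancel any
# active block job on the disk.
dom.blockJobAbort('vda', 0)

conn.close()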

On restart, VDSM reads the VM recovery file (var/run/vdsm/uuid.recovery) to repopulate the VM's jobs and then periodically tries to run updateVmJobs for the VM. updateVmJobs calls queryBlockJobs to sync with libvirt, and that fails with an exception:

periodic/3::ERROR::2017-10-07 14:01:39,434::executor::232::Executor::(_execute_task) Unhandled exception in Task(callable=<BlockjobMonitor vm=c1cf23bb-b65d-4999-9e3c-fe0d95f1daf0 at 0x21cb590>, timeout=7.5)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 230, in _execute_task
    task.callable()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 303, in __call__
    self._execute()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 374, in _execute
    self._vm.updateVmJobs()
  File "/usr/share/vdsm/virt/vm.py", line 4506, in updateVmJobs
    self._vmJobs = self.queryBlockJobs()
  File "/usr/share/vdsm/virt/vm.py", line 4529, in queryBlockJobs
    drive = self._findDriveByUUIDs(storedJob['disk'])
  File "/usr/share/vdsm/virt/vm.py", line 3248, in _findDriveByUUIDs
    raise LookupError("No such drive: '%s'" % drive)
LookupError: No such drive: '{'imageID': u'0b6bde2d-2eea-4076-9ca6-2c97e88b2eac', 'domainID': u'd142a364-6593-4786-9124-ba7722293830', 'volumeID': u'94819807-6eac-4e69-b22a-9a4f8f543d84', 'poolID': u'58c66ccd-02d9-00cc-0324-0000000002ea'}'

The VM is not using this volume anymore and there is no block job any longer; we cleaned that up in steps 2-3. Apparently this exception prevents the VmJobs from being updated at all! So VDSM's VmJobs data is old, stuck, not being updated.

So the VmJobs for the VM are not updated when the VM recovery file contains stale garbage. Therefore VDSM still sees block jobs that don't exist, and it does not allow new block jobs to be started, because trackBlockJob on VM.Merge thinks those block jobs actually exist. And it also never syncs with libvirt on the actual jobs (none).

See:

import pickle

# Load the pickled VM recovery file that vdsm restores its state from.
with open("var/run/vdsm/uuid.recovery", "rb") as f:
    vm_recovery = pickle.load(f)
print vm_recovery["_blockJobs"]

$ ./load.py
...
u'43691c6b-8ad9-4060-a0b9-187fa0a9ed47': {
   'blockJobType': 'commit', 
   'topVolume': u'94819807-6eac-4e69-b22a-9a4f8f543d84', 
   'strategy': 'commit', 
   'jobID': u'43691c6b-8ad9-4060-a0b9-187fa0a9ed47',
   'disk': {'imageID': u'0b6bde2d-2eea-4076-9ca6-2c97e88b2eac', 
   'domainID': u'd142a364-6593-4786-9124-ba7722293830', 
   'volumeID': u'94819807-6eac-4e69-b22a-9a4f8f543d84', 
   'poolID': u'58c66ccd-02d9-00cc-0324-0000000002ea'}, 
   'baseVolume': u'3642dde8-8d2f-4a76-ba8e-a35237100499'}, 
}
...

Even if this can only be reached as a result of BZ 1414067, perhaps this exception should be handled better, allowing VDSM to actually sync with libvirt regarding the VM's block jobs? That would give better resilience.
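
Something like this minimal sketch, assuming hypothetical helper names rather than VDSM's real API:

def query_block_jobs(vm):
    # Sketch: tolerate stale recovery-file entries instead of letting one
    # LookupError abort the whole update pass.
    jobs = {}
    for job_id, stored in list(vm.stored_jobs.items()):  # hypothetical attr
        try:
            drive = vm.find_drive_by_uuids(stored['disk'])  # hypothetical
        except LookupError:
            # The drive is gone: leftover garbage from the recovery file.
            # Untrack it so new merges are not blocked forever.
            vm.untrack_block_job(job_id)  # hypothetical helper
            continue
        jobs[job_id] = vm.query_job_from_libvirt(drive)  # hypothetical
    return jobs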

Does this make sense?

Comment 18 Natalie Gavrielov 2018-01-01 12:07:11 UTC
Verified using builds:
rhvm-4.2.0.2-0.1.el7.noarch
vdsm-4.20.9.3-1.el7ev.x86_64

Passed automation: live merge test plan (storage types NFS and iSCSI).

Comment 23 Ala Hino 2018-05-15 08:53:38 UTC
Billy,

The doc text addition isn't correct.
Is there a way to restore the original doc text?

Comment 24 errata-xmlrpc 2018-05-15 17:50:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1489

Comment 25 Franta Kust 2019-05-16 13:05:26 UTC
BZ->Jira Resync

