Bug 1349735

Summary: blockJobAbort times out during live merge, vm left writing to two images (mirror)
Product: Red Hat Enterprise Virtualization Manager
Component: vdsm
Version: 3.6.6
Status: CLOSED DUPLICATE
Severity: urgent
Priority: urgent
Hardware: x86_64
OS: Linux
Reporter: Germano Veit Michel <gveitmic>
Assignee: Ala Hino <ahino>
QA Contact: Aharon Canan <acanan>
CC: ahino, amureini, bazulay, ebenahar, gveitmic, lsurette, mkalinin, srevivo, tnisan, ycui, ykaul, ylavi
oVirt Team: Storage
Type: Bug
Last Closed: 2016-07-17 11:40:40 UTC

Description Germano Veit Michel 2016-06-24 06:46:46 UTC
Description of problem:

Right to the point:

-> VM with 4 disks, all with 2 images in chain.
-> Remove snapshot, top images merged into parent
-> Merge completes successfully for 2 disks; the other 2 are apparently stuck

This is about a live merge where 2 out of 4 disks are stuck in the "mirror" phase because both of their blockJobAbort commands timed out and were never sent again.

I am not sure whether the recovery flows are expected to handle this; I am afraid they are not, as adding another block job for the same disk will possibly fail.

Version-Release number of selected component (if applicable):
vdsm-4.17.26-0.el7ev.noarch
engine is 3.6.6

How reproducible:
0% after a couple of tries. The customer has already hit this twice.

Actual results:

Disk 1:

vdsm.log.43.xz:Thread-10367908::INFO::2016-06-16 18:40:37,313::vm::5228::virt.vm::(tryPivot) vmId=`6edf9fb0-5d51-4f13-a623-fe1aa18a9681`::Requesting pivot to complete active layer commit (job 55a9a807-23fe-4c9c-9cdc-d0ff9ec069f7)

vdsm.log.43.xz:Thread-10367908::INFO::2016-06-16 18:41:12,918::vm::5241::virt.vm::(tryPivot) vmId=`6edf9fb0-5d51-4f13-a623-fe1aa18a9681`::Pivot completed (job 55a9a807-23fe-4c9c-9cdc-d0ff9ec069f7)

.... all good ....

Disk 4:

vdsm.log.43.xz:Thread-10368049::INFO::2016-06-16 18:41:15,484::vm::5228::virt.vm::(tryPivot) vmId=`6edf9fb0-5d51-4f13-a623-fe1aa18a9681`::Requesting pivot to complete active layer commit (job dcdd8131-c9e2-4c04-b4d3-8cb9b2d2e2ef)

vdsm.log.43.xz:Thread-10368049::INFO::2016-06-16 18:42:09,861::vm::5241::virt.vm::(tryPivot) vmId=`6edf9fb0-5d51-4f13-a623-fe1aa18a9681`::Pivot completed (job dcdd8131-c9e2-4c04-b4d3-8cb9b2d2e2ef)

.... all good ....

Disk 2 and 3 are stuck like this:

Thread-10368051::INFO::2016-06-16 18:41:15,491::vm::5228::virt.vm::(tryPivot) vmId=`6edf9fb0-5d51-4f13-a623-fe1aa18a9681`::Requesting pivot to complete active layer commit (job f106d20e-208b-4749-985d-e89234e24a33)

30 seconds later an exception is raised (the 30-second timeout for the blockJobAbort command):

Thread-10368051::ERROR::2016-06-16 18:41:45,491::utils::739::root::(wrapper) Unhandled exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 736, in wrapper
    return f(*a, **kw)
  File "/usr/share/vdsm/virt/vm.py", line 5262, in run
    self.tryPivot()
  File "/usr/share/vdsm/virt/vm.py", line 5231, in tryPivot
    ret = self.vm._dom.blockJobAbort(self.drive.name, flags)
  File "/usr/share/vdsm/virt/virdomain.py", line 76, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainBlockJobAbort)

After this point there are no more signs of this job or thread; vdsm seems to forget about it.

virsh -r blockjob --info <vm> <disk>
Active Block Commit: [100 %]

It is always at 100%, apparently ready to complete the pivot upon receiving another blockJobAbort with the VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT flag, which is never sent again.

So vdsm sent a blockJobAbort with the VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT flag for the active commit job; it apparently timed out and nothing else was done to recover from this.
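
For reference, a minimal sketch of what manually retrying the stalled pivot could look like, using the libvirt Python bindings directly. This is only an illustration of the call vdsm makes (not vdsm's actual code), and the VM name and target device below are placeholders, not values from this case:

# Illustrative only: manually retry the stalled pivot via libvirt.
# The domain name and target device are placeholders.
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('example-vm')   # placeholder domain name
disk = 'vda'                            # placeholder target device

# blockJobInfo() returns an empty dict when no job is active on the disk.
info = dom.blockJobInfo(disk, 0)
if info and info.get('cur') == info.get('end'):
    # The mirror is fully synchronized (100%), so ask qemu to pivot to the
    # base image. This is the same call tryPivot issues, which timed out
    # above with "cannot acquire state change lock".
    dom.blockJobAbort(disk, libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)

conn.close()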

A day later the engine fails and is ready for a "retry". However, the qemu process is still writing to both images as the pivot never completed.
* I have just requested new logs to understand how the engine failed; probably a timeout.

The exception itself seems related to BZ 1319400, whose fix should arrive in 7.2.z soon. The question is: IF that BZ is really related, will its fix be enough, or does something also need to be improved on the vdsm side?

Expected results:
Pivot and finish merge

Comment 1 Germano Veit Michel 2016-06-24 06:48:18 UTC
I will add more information as soon as I get updated logs or manage to reproduce this.

Comment 2 Germano Veit Michel 2016-06-27 00:56:45 UTC
Interesting. On the engine side, the command failed after ~2 days. But the engine is still waiting for the related job.

2016-06-16 16:40:25,711 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-3) [7d62b9eb] START, MergeVDSCommand(HostName = svo113.mgt.rad.lan, MergeVDSCommandParameters:{runAsync='true', hostId='a482125b-a741-4684-9e85-f0e2ec0d825d', vmId='6edf9fb0-5d51-4f13-a623-fe1aa18a9681', storagePoolId='00000002-0002-0002-0002-0000000003a8', storageDomainId='3d2fdb78-a0b8-404d-ac3a-8fbb306581ab', imageGroupId='a5ebb735-5165-4b62-a254-30b56c9d721f', imageId='9de533dd-5a0e-4652-8a01-cd0fbe097d8b', baseImageId='3a16ed1d-f90e-4f2f-90c8-bf616a7165dc', topImageId='9de533dd-5a0e-4652-8a01-cd0fbe097d8b', bandwidth='0'}), log id: 7227453c

2016-06-16 16:41:15,082 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-3) [7d62b9eb] FINISH, MergeVDSCommand, return: f106d20e-208b-4749-985d-e89234e24a33, log id: 7227453c

2016-06-18 18:40:25,588 WARN  [org.ovirt.engine.core.bll.tasks.CommandExecutor] (DefaultQuartzScheduler_Worker-95) [2e3cef94] Marking expired command as Failed: command 'Merge (a5c86e46-a16f-45e2-8f94-8a1473b4039d)' that started at 'Thu Jun 16 16:40:25 UTC 2016' has been marked as Failed.

2016-06-18 18:40:35,259 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-29) [394e420a] Merging of snapshot 'ea9bdeb5-1928-4320-9aca-730cdd64bcb6' images '3a16ed1d-f90e-4f2f-90c8-bf616a7165dc'..'9de533dd-5a0e-4652-8a01-cd0fbe097d8b' failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation.

A week later...

2016-06-24 11:22:35,292 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-10) [2e3cef94] VM job 'f106d20e-208b-4749-985d-e89234e24a33': In progress (no change)

Comment 3 Allon Mureinik 2016-06-27 11:50:36 UTC
(In reply to Germano Veit Michel from comment #0)
> The exception itself seems related to BZ 1319400, whose fix should arrive
> in 7.2.z soon. The question is: IF that BZ is really related, will its fix
> be enough, or does something also need to be improved on the vdsm side?
Ala, can you confirm/refute this?

Comment 5 Germano Veit Michel 2016-06-28 23:03:23 UTC
Shutting down the VM leaves everything broken: the VM can neither start nor remove the snapshot again. So the recovery flows don't know how to handle this yet.

A manual clean-up of everything, that is, removing the leaf images in the Storage Domain and adjusting all metadata, including the DB, seems necessary. After that the VM boots from the base image and everything works as expected, with no signs of data rolling back or any corruption, which indicates the mirror phase was still active and kept both images up to date.
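
For illustration only, before touching the DB a query along these lines could be used to see which images the engine has flagged as illegal. It assumes the usual engine schema where images.imagestatus = 4 means ILLEGAL; the table/column names and connection details are from memory and should be verified against the deployed version:

# Illustrative only: list the images the engine has marked ILLEGAL.
# Assumes the engine DB schema where imagestatus = 4 means ILLEGAL;
# verify table/column names against the deployed engine version.
import psycopg2

conn = psycopg2.connect(dbname='engine', user='engine',
                        password='...', host='localhost')  # placeholder credentials
cur = conn.cursor()
cur.execute("""
    SELECT image_group_id, image_guid, vm_snapshot_id
      FROM images
     WHERE imagestatus = 4  -- 4 = ILLEGAL (assumed)
""")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()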

Comment 8 Allon Mureinik 2016-06-29 06:02:27 UTC
(In reply to Germano Veit Michel from comment #5)
> Shutting down the VM leaves everything broken: the VM can neither start
> nor remove the snapshot again. So the recovery flows don't know how to
> handle this yet.
> 
> A manual clean-up of everything, that is, removing the leaf images in the
> Storage Domain and adjusting all metadata, including the DB, seems
> necessary. After that the VM boots from the base image and everything
> works as expected, with no signs of data rolling back or any corruption,
> which indicates the mirror phase was still active and kept both images
> up to date.

Wouldn't cold merging the same snapshot do the trick?

Also - please confirm/refute - does the qemu fix remove the initial issue (making the recovery from it a moot point)?

Comment 9 Germano Veit Michel 2016-06-29 06:09:16 UTC
(In reply to Allon Mureinik from comment #8)
> 
> Wouldn't cold merging the same snapshot do the trick?
> 
> Also - please confirm/refute - does the qemu fix remove the initial issue
> (making the recovery from it a moot point)?

Hi Allon,

Unfortunately no; retrying the cold snapshot removal did not make any difference, as it also failed.

Let me try to make the sequence of events from my comment #5 a bit clearer:

Customer shutdown (fine) -> Try to start (fail) -> Try to remove snapshot (fail) -> Manual intervention -> Try to start (ok)

Comment 10 Ala Hino 2016-07-12 12:58:41 UTC
Hi Germano,

I have a couple of questions/requests:

1. Can you please elaborate on what exactly happened when the VM failed to start before the manual clean-up? What error/behavior did you encounter?

2. Can you please explain why the manual clean-up was necessary? What was wrong there? Was it the leaf data or the metadata?

3. Can you please double-check the engine version? The bug says 3.6.6; however, in 3.6.6, when retrying live merge, I would expect to see an exception in the engine log file, which I don't see here.

Comment 11 Germano Veit Michel 2016-07-13 01:18:41 UTC
Hi Ala,

Sure. 

1. The VM would neither start nor remove the snapshot; both operations failed with "illegal volume" when preparing the volume. The VM has 4 disks, all of them with their leaves set as illegal in the SD metadata. Note that only 2 disks were stuck in the mirror phase; the other 2 pivoted fine.

So the engine was failing like this:

2016-06-27 06:35:42,328 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-55) [] BaseAsyncTask::logEndTaskFailure: Task '9ec4607a-9e44-4ba7-af49-f63ae7ab4331' (Parent Command 'RemoveSnapshot', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'cleanSuccess'
-- Message: 'VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = Cannot prepare illegal volume',
-- Exception: 'VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = Cannot prepare illegal volume'

Due to VDSM failing like this:

9ec4607a-9e44-4ba7-af49-f63ae7ab4331::ERROR::2016-06-27 08:35:35,083::image::1302::Storage.Image::(merge) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/image.py", line 1288, in merge
    [vols[vName] for vName in chain])
  File "/usr/share/vdsm/storage/image.py", line 1078, in _baseRawVolumeMerge
    with srcVol.scopedPrepare(rw=True, chainrw=True, setrw=True):
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/share/vdsm/storage/volume.py", line 796, in scopedPrepare
    force=False)
  File "/usr/share/vdsm/storage/volume.py", line 815, in prepare
    raise se.prepareIllegalVolumeError(self.volUUID)
prepareIllegalVolumeError: Cannot prepare illegal volume: (u'5d4c4563-1414-453a-b0ba-0770b762acfc',)

Just for completeness, this is what happened in VDSM during a failed VM start:

Thread-984360::ERROR::2016-06-27 08:36:48,417::dispatcher::76::Storage.Dispatcher::(wrapper) {'status': {'message': "Cannot prepare illegal volume: (u'9de533dd-5a0e-4652-8a01-cd0fbe097d8b',)", 'code': 227}}
Thread-984360::ERROR::2016-06-27 08:36:48,417::vm::759::virt.vm::(_startUnderlyingVm) vmId=`6edf9fb0-5d51-4f13-a623-fe1aa18a9681`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 703, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/virt/vm.py", line 1859, in _run
    self.preparePaths(dev_spec_map[hwclass.DISK])
  File "/usr/share/vdsm/virt/vm.py", line 780, in preparePaths
    drive['path'] = self.cif.prepareVolumePath(drive, self.id)
  File "/usr/share/vdsm/clientIF.py", line 317, in prepareVolumePath
    raise vm.VolumeError(drive)
VolumeError: Bad volume specification {u'index': 0, u'iface': u'virtio', u'type': u'disk', u'format': u'cow', u'bootOrder': u'1', u'address': {u'slot': u'0x05', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, u'volumeID': u'9de533dd-5a0e-4652-8a01-cd0fbe097d8b', 'apparentsize': '2819817472', u'imageID': u'a5ebb735-5165-4b62-a254-30b56c9d721f', u'specParams': {}, u'readonly': u'false', u'domainID': u'3d2fdb78-a0b8-404d-ac3a-8fbb306581ab', 'reqsize': '0', u'deviceId': u'a5ebb735-5165-4b62-a254-30b56c9d721f', 'truesize': '2819760128', u'poolID': u'00000002-0002-0002-0002-0000000003a8', u'device': u'disk', u'shared': u'false', u'propagateErrors': u'off', u'optional': u'false'}

From what I can see, no recovery flow was being attempted here. (A rough sketch of checking volume legality from the vdsm side is included below, after answer 3.)

2. This particular VM had 4 disks: 2 were stuck in the mirror phase and 2 merged and pivoted fine. After seeing that a retry of the snapshot removal failed, that the VM also would not start, and that the mirror job was at 100%, I assumed the base disk contained all the data it should, so the 4 base images were all we needed. We removed the 4 leaves manually, removed the leaves and the snapshot from the DB, and cleaned up the SD metadata, setting the base volume as LEAF. Everything went back to normal. Yes, I could probably have triggered a known recovery flow with fewer changes, but when this happens it is usually just easier to "overkill" and set every single piece back to a known good state.

3. rhevm-3.6.6.2-0.1.el6.noarch
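
As referenced in answer 1, here is a rough, hypothetical sketch of checking a volume's legality from the vdsm side. It assumes the local XML-RPC client (vdsm.vdscli) and the getVolumeInfo verb as shipped with vdsm 4.17; the response field names are from memory and should be verified. The UUIDs are the ones appearing in the logs above:

# Hypothetical illustration: query a volume's legality through vdsm's
# local XML-RPC API (vdsm 4.17 era). Response field names should be
# double-checked against the installed vdsm version.
from vdsm import vdscli

cli = vdscli.connect()

sdUUID = '3d2fdb78-a0b8-404d-ac3a-8fbb306581ab'   # storage domain (from the logs)
spUUID = '00000002-0002-0002-0002-0000000003a8'   # storage pool (from the logs)
imgUUID = 'a5ebb735-5165-4b62-a254-30b56c9d721f'  # image group (from the logs)
volUUID = '9de533dd-5a0e-4652-8a01-cd0fbe097d8b'  # leaf volume (from the logs)

res = cli.getVolumeInfo(sdUUID, spUUID, imgUUID, volUUID)
if res['status']['code'] == 0:
    # 'legality' is expected to be LEGAL or ILLEGAL in the volume metadata.
    print(res['info'].get('legality'))
else:
    print(res['status']['message'])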

To be fair, I don't completely understand what happened here. Initially I thought this would end up as a once-in-a-lifetime incident. Then we saw a second VM doing the same thing, and then never again.

I suppose having a recovery flow to recover from this would be enough, as apparently it does not harm data integrity (both VMs are happy and the customer reports all the data is there). Also, the images were set as LEGAL in the DB; if, on timeout, we had set them to ILLEGAL, that would probably have been enough to trigger a recovery flow which could have handled this. Or maybe not; I haven't studied the recovery flows much yet =D.

Hopefully I answered all your questions. Do you have any root cause for this in mind?

Let me know if you need anything else or how I can help you further.

Comment 12 Ala Hino 2016-07-13 15:50:21 UTC
Thank you for the detailed reply, Germano.

Can you please confirm what the engine version was when the issue hit?

With the Live Merge recovery improvements made in 3.6.6, I think retrying Live Merge would have fixed the issues listed here.
I am afraid the Live Merge retry was made before the deployment was upgraded?

Comment 13 Germano Veit Michel 2016-07-14 05:48:21 UTC
Hi Ala,

Yes, upgraded to 3.6.6 more than two weeks before the incident. 

yum.log
May 31 08:05:49 Updated: rhevm-3.6.6.2-0.1.el6.noarch

ovirt-engine-setup-20160531080423-sh2ixh.log
2016-05-31 08:06:55 DEBUG otopi.context context.dumpEnvironment:510 ENV OVESETUP_CORE_MODIFIED_LINES_GROUP/versionlock=dict:'{'/etc/yum/pluginconf.d/versionlock.list': [{'added': u'rhevm-3.6.6.2-0.1.el6.noarch'}, {'added': u'rhevm-userportal-3.6.6.2-0.1.el6.noarch'}, {'added': u'rhevm-dbscripts-3.6.6.2-0.1.el6.noarch'}, {'added': u'rhevm-tools-backup-3.6.6.2-0.1.el6.noarch'}, {'added': u'redhat-support-plugin-rhev-3.6.0-15.el6.noarch'}, {'added': u'rhevm-webadmin-portal-3.6.6.2-0.1.el6.noarch'}, {'added': u'rhevm-restapi-3.6.6.2-0.1.el6.noarch'}, {'added': u'ovirt-engine-extension-aaa-jdbc-1.0.7-2.el6ev.noarch'}, {'added': u'rhevm-tools-3.6.6.2-0.1.el6.noarch'}, {'added': u'rhevm-backend-3.6.6.2-0.1.el6.noarch'}]}'

Please let me know what else I can help you with.

Comment 14 Ala Hino 2016-07-14 14:52:06 UTC
Thank you for the reply, Germano.

I would like to summarize the behavior here and close the bug:

1. The root cause of the issue is BZ 1319400, which is ON_QA now.
2. The first time the issue hit was in February (from the logs I can see that this is when a live merge retry was first done). At that time an older version of the engine was installed, i.e. *not* 3.6.6. I am emphasizing this because the recovery enhancements were introduced in 3.6.6.
3. On June 08 the deployment was upgraded to 3.6.6.
4. At this point it is not possible to recover the "old" volumes by retrying live merge.

With 3.6.6 we would be able to recover from similar issues by retrying live merge.

Comment 15 Germano Veit Michel 2016-07-15 04:10:03 UTC
Hi Ala,

Well, if this will indeed be solved by BZ 1319400 then of course it can be closed. I had this impression when I opened this BZ as well, as you can see in the bug description; I just wanted to make sure nothing else needs to be done to prevent this from happening.

I'm not sure where you got the Feb incident from or the upgrade on June 08th, perhaps I missed something or you were also looking at the vdsm side. But yes, the attached case literally hit all the possible live merge bugs and those VMs already have a history of related failures. If previous issues can prevent the live merge recovery flows from doing their job then that's acceptable as well.

Thank you for looking at this. Feel free to close it.

Comment 16 Ala Hino 2016-07-17 11:40:40 UTC
Hi Germano,

Thanks for your prompt input.

In the engine log file from the sosreport uploaded on Feb 25, I saw that a merge retry was done. This means that a merge had already failed by that date.

In the customer case, the customer reported on June 08 that they upgraded to 3.6.6.

Closing.

*** This bug has been marked as a duplicate of bug 1319400 ***