Description of problem:

Right to the point:
-> VM with 4 disks, all with 2 images in chain.
-> Remove snapshot, top images merged into parent.
-> Completes successfully for 2 disks; 2 are apparently stuck.

This is about a live merge where 2 out of 4 disks are stuck in the "mirror" phase because both their blockJobAbort commands timed out and were never sent again. I am not sure whether this is expected to be handled by the recovery flows; I am afraid not, as it will possibly fail when adding another block job for the same disk.

Version-Release number of selected component (if applicable):
vdsm-4.17.26-0.el7ev.noarch
engine is 3.6.6

How reproducible:
0% after a couple of tries. The customer hit this twice already.

Actual results:

Disk 1:
vdsm.log.43.xz:Thread-10367908::INFO::2016-06-16 18:40:37,313::vm::5228::virt.vm::(tryPivot) vmId=`6edf9fb0-5d51-4f13-a623-fe1aa18a9681`::Requesting pivot to complete active layer commit (job 55a9a807-23fe-4c9c-9cdc-d0ff9ec069f7)
vdsm.log.43.xz:Thread-10367908::INFO::2016-06-16 18:41:12,918::vm::5241::virt.vm::(tryPivot) vmId=`6edf9fb0-5d51-4f13-a623-fe1aa18a9681`::Pivot completed (job 55a9a807-23fe-4c9c-9cdc-d0ff9ec069f7)
.... all good ....

Disk 4:
vdsm.log.43.xz:Thread-10368049::INFO::2016-06-16 18:41:15,484::vm::5228::virt.vm::(tryPivot) vmId=`6edf9fb0-5d51-4f13-a623-fe1aa18a9681`::Requesting pivot to complete active layer commit (job dcdd8131-c9e2-4c04-b4d3-8cb9b2d2e2ef)
vdsm.log.43.xz:Thread-10368049::INFO::2016-06-16 18:42:09,861::vm::5241::virt.vm::(tryPivot) vmId=`6edf9fb0-5d51-4f13-a623-fe1aa18a9681`::Pivot completed (job dcdd8131-c9e2-4c04-b4d3-8cb9b2d2e2ef)
.... all good ....
Disks 2 and 3 are stuck like this:

Thread-10368051::INFO::2016-06-16 18:41:15,491::vm::5228::virt.vm::(tryPivot) vmId=`6edf9fb0-5d51-4f13-a623-fe1aa18a9681`::Requesting pivot to complete active layer commit (job f106d20e-208b-4749-985d-e89234e24a33)

30 seconds later an exception is raised (30-second timeout for the blockJobAbort command):

Thread-10368051::ERROR::2016-06-16 18:41:45,491::utils::739::root::(wrapper) Unhandled exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 736, in wrapper
    return f(*a, **kw)
  File "/usr/share/vdsm/virt/vm.py", line 5262, in run
    self.tryPivot()
  File "/usr/share/vdsm/virt/vm.py", line 5231, in tryPivot
    ret = self.vm._dom.blockJobAbort(self.drive.name, flags)
  File "/usr/share/vdsm/virt/virdomain.py", line 76, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainBlockJobAbort)

After this point there are no more signs of this job or thread; vdsm seems to forget it.

virsh -r blockjob --info <vm> <disk>
Active Block Commit: [100 %]

It's always at 100%, apparently ready to pivot upon receiving another blockJobAbort with the VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT flag, which is never sent again. So vdsm sent a blockJobAbort for the active commit job, it apparently timed out, and nothing else was done to recover from this.

A day later the engine fails the command and is ready for a "retry". However, the qemu process is still writing to both images, as the pivot never completed.

* Just requested new logs to understand how the engine failed; probably a timeout.

The exception itself seems related to BZ 1319400, which should arrive in 7.2.Z soon. The question is: IF that BZ is really related, will it be enough, or does something need to be improved on the vdsm side?

Expected results:
Pivot and finish the merge.
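The failure mode described above is a single pivot request being dropped after one libvirt timeout. A minimal sketch of the retry behavior being asked for might look like the following. This is hypothetical illustration, not actual vdsm code: the `FakeDomain` stand-in, the `retry_pivot` helper, and the retry policy are all assumptions; only `blockJobAbort()` and the VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT flag correspond to the real libvirt API.

```python
# Hypothetical sketch: retry the pivot instead of giving up after a single
# libvirt timeout. Assumes a libvirt-style domain object whose
# blockJobAbort() raises TimeoutError while the state-change lock is held.

import time

# In libvirt-python this flag is libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT;
# the numeric value is used here only so the sketch is self-contained.
VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT = 2


def retry_pivot(dom, drive_name, flags, attempts=3, delay=0.0):
    """Request a pivot, retrying if libvirt times out acquiring its lock.

    Returns True once blockJobAbort() succeeds, False if every attempt
    timed out (the block job is then left mirroring, as in this bug).
    """
    for _ in range(attempts):
        try:
            dom.blockJobAbort(drive_name, flags)
            return True
        except TimeoutError:
            time.sleep(delay)  # let the competing libvirt job release the lock
    return False


class FakeDomain(object):
    """Stand-in domain that times out a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def blockJobAbort(self, drive, flags):
        self.calls += 1
        if self.calls <= self.failures:
            raise TimeoutError("cannot acquire state change lock")


dom = FakeDomain(failures=2)
result = retry_pivot(dom, "vda", VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)
print(result)  # -> True, on the third attempt
```

Whether a blind retry is safe here depends on libvirt/qemu semantics for a job already at 100% in the mirror phase; the sketch only shows the control flow that was missing.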
I will add more information as soon as I get updated logs or manage to reproduce this.
Interesting. On the engine side, the command failed after ~2 days, but the engine is still waiting for the related job.

2016-06-16 16:40:25,711 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-3) [7d62b9eb] START, MergeVDSCommand(HostName = svo113.mgt.rad.lan, MergeVDSCommandParameters:{runAsync='true', hostId='a482125b-a741-4684-9e85-f0e2ec0d825d', vmId='6edf9fb0-5d51-4f13-a623-fe1aa18a9681', storagePoolId='00000002-0002-0002-0002-0000000003a8', storageDomainId='3d2fdb78-a0b8-404d-ac3a-8fbb306581ab', imageGroupId='a5ebb735-5165-4b62-a254-30b56c9d721f', imageId='9de533dd-5a0e-4652-8a01-cd0fbe097d8b', baseImageId='3a16ed1d-f90e-4f2f-90c8-bf616a7165dc', topImageId='9de533dd-5a0e-4652-8a01-cd0fbe097d8b', bandwidth='0'}), log id: 7227453c
2016-06-16 16:41:15,082 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-3) [7d62b9eb] FINISH, MergeVDSCommand, return: f106d20e-208b-4749-985d-e89234e24a33, log id: 7227453c
2016-06-18 18:40:25,588 WARN [org.ovirt.engine.core.bll.tasks.CommandExecutor] (DefaultQuartzScheduler_Worker-95) [2e3cef94] Marking expired command as Failed: command 'Merge (a5c86e46-a16f-45e2-8f94-8a1473b4039d)' that started at 'Thu Jun 16 16:40:25 UTC 2016' has been marked as Failed.
2016-06-18 18:40:35,259 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-29) [394e420a] Merging of snapshot 'ea9bdeb5-1928-4320-9aca-730cdd64bcb6' images '3a16ed1d-f90e-4f2f-90c8-bf616a7165dc'..'9de533dd-5a0e-4652-8a01-cd0fbe097d8b' failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation.

A week later...

2016-06-24 11:22:35,292 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-10) [2e3cef94] VM job 'f106d20e-208b-4749-985d-e89234e24a33': In progress (no change)
(In reply to Germano Veit Michel from comment #0)
> The exception itself seems related to BZ 1319400, which should arrive in
> 7.2.Z soon. Question is, IF that BZ is really related, will it be enough or
> there is a need to improve something on vdsm side?

Ala, can you confirm/refute this?
Shutting down the VMs leaves everything broken: the VM cannot start, and the snapshot cannot be removed again. So the recovery flows don't know how to handle this, yet.

A manual clean-up of everything, that is, removing the leaf images in the Storage Domain and adjusting all metadata, including the DB, seems necessary. After it the VM boots from the base image and everything works as expected, with no signs of data rolling back or any corruption, which indicates the mirror phase was still active and kept both images up to date.
(In reply to Germano Veit Michel from comment #5)
> Shutting down the VMs leaves everything broken and VM cannot start or remove
> snapshot again. So the recovery flows don't know how to handle this, yet.
>
> A manual clean-up of everything, that is, removing the leaf images in
> Storage Domain and adjusting all metadata, including DB seems necessary.
> After it VM boots from base image and everything works as expected with no
> signs of data rolling back or any corruption, which indicates the mirror
> phase was still active and kept both images up to date.

Wouldn't cold merging the same snapshot do the trick?

Also - please confirm/refute - does the qemu fix remove the initial issue (making the recovery from it a moot point)?
(In reply to Allon Mureinik from comment #8)
> Wouldn't cold merging the same snapshot do the trick?
>
> Also - please confirm/refute - does the qemu fix remove the initial issue
> (making the recovery from it a moot point)?

Hi Allon,

Unfortunately no, a cold snapshot removal retry did not make any difference, as it also failed.

Let me try to make the sequence of events from my comment #5 a bit clearer:

Customer shutdown (fine) -> Try to start (fail) -> Try to remove snapshot (fail) -> Manual intervention -> Try to start (ok)
Hi Germano,

I have a couple of questions/requests:

1. Can you please elaborate on what exactly happened when the VM failed to start before the manual clean-up? What error/behavior did you encounter?
2. Can you please explain why the manual clean-up was necessary? What was wrong there? Is it the leaf data or the metadata?
3. Can you please double-check the engine version? The bug says 3.6.6; however, in 3.6.6, when retrying live merge, I would expect to see an exception in the engine log file that I don't see here.
Hi Ala,

Sure.

1. The VM won't start and the snapshot can't be removed; both fail with "illegal volume" when preparing the volume. The VM has 4 disks, all of them with the leaves set as illegal in the SD metadata. Note that only 2 disks were stuck in the mirror phase; the other 2 pivoted fine.

So the engine was failing like this:

2016-06-27 06:35:42,328 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-55) [] BaseAsyncTask::logEndTaskFailure: Task '9ec4607a-9e44-4ba7-af49-f63ae7ab4331' (Parent Command 'RemoveSnapshot', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'cleanSuccess'
-- Message: 'VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = Cannot prepare illegal volume',
-- Exception: 'VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = Cannot prepare illegal volume'

Due to VDSM failing like this:

9ec4607a-9e44-4ba7-af49-f63ae7ab4331::ERROR::2016-06-27 08:35:35,083::image::1302::Storage.Image::(merge) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/image.py", line 1288, in merge
    [vols[vName] for vName in chain])
  File "/usr/share/vdsm/storage/image.py", line 1078, in _baseRawVolumeMerge
    with srcVol.scopedPrepare(rw=True, chainrw=True, setrw=True):
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/share/vdsm/storage/volume.py", line 796, in scopedPrepare
    force=False)
  File "/usr/share/vdsm/storage/volume.py", line 815, in prepare
    raise se.prepareIllegalVolumeError(self.volUUID)
prepareIllegalVolumeError: Cannot prepare illegal volume: (u'5d4c4563-1414-453a-b0ba-0770b762acfc',)

Just for completeness, this is what happened in VDSM during a failed VM start:

Thread-984360::ERROR::2016-06-27 08:36:48,417::dispatcher::76::Storage.Dispatcher::(wrapper) {'status': {'message': "Cannot prepare illegal volume: (u'9de533dd-5a0e-4652-8a01-cd0fbe097d8b',)", 'code': 227}}
Thread-984360::ERROR::2016-06-27 08:36:48,417::vm::759::virt.vm::(_startUnderlyingVm) vmId=`6edf9fb0-5d51-4f13-a623-fe1aa18a9681`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 703, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/virt/vm.py", line 1859, in _run
    self.preparePaths(dev_spec_map[hwclass.DISK])
  File "/usr/share/vdsm/virt/vm.py", line 780, in preparePaths
    drive['path'] = self.cif.prepareVolumePath(drive, self.id)
  File "/usr/share/vdsm/clientIF.py", line 317, in prepareVolumePath
    raise vm.VolumeError(drive)
VolumeError: Bad volume specification {u'index': 0, u'iface': u'virtio', u'type': u'disk', u'format': u'cow', u'bootOrder': u'1', u'address': {u'slot': u'0x05', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, u'volumeID': u'9de533dd-5a0e-4652-8a01-cd0fbe097d8b', 'apparentsize': '2819817472', u'imageID': u'a5ebb735-5165-4b62-a254-30b56c9d721f', u'specParams': {}, u'readonly': u'false', u'domainID': u'3d2fdb78-a0b8-404d-ac3a-8fbb306581ab', 'reqsize': '0', u'deviceId': u'a5ebb735-5165-4b62-a254-30b56c9d721f', 'truesize': '2819760128', u'poolID': u'00000002-0002-0002-0002-0000000003a8', u'device': u'disk', u'shared': u'false', u'propagateErrors': u'off', u'optional': u'false'}

From what I can see, we were not trying any recovery flow here.

2. This particular VM had 4 disks. 2 were stuck in the mirror phase and 2 merged and pivoted fine. After seeing that a retry of the snapshot removal failed, that the VM would not start, and that the mirror job was at 100%, I assumed the base disk contained all the data it should. So the 4 base images were all we needed. We removed the 4 leaves manually, removed the leaves and the snapshot from the DB, and cleaned up the SD metadata, setting the base volume as LEAF. All went back to normal.
Yes, I guess I could have triggered a known recovery flow with fewer changes, but when this happens it's usually just easier to "overkill" and set every single piece to a known good state.

3. rhevm-3.6.6.2-0.1.el6.noarch

To be fair, I don't completely understand what happened here. Initially I thought this would end up as a "once in a lifetime" incident. Then we saw a second VM doing the same thing, and then never again.

I suppose having a recovery flow for this would be enough, as apparently it does not harm data integrity (both VMs are happy and the customer reports all the data is there). Also, the images were set as LEGAL in the DB; if we had set them to ILLEGAL when the command timed out, it would probably have been enough to trigger a recovery flow which could have handled this. Or maybe not, I haven't studied the recovery flows much yet =D

Hopefully I answered all your questions. Do you have any root cause for this in mind? Let me know if you need anything else or how I can help you further.
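The suggestion above, persisting an ILLEGAL state on timeout instead of silently dropping the job, could be sketched roughly as follows. All names here (MergeJob, the volume-state dict, the flag value) are hypothetical illustrations of the idea, not actual vdsm or engine code.

```python
# Hypothetical sketch of the suggestion above: if the pivot request times
# out, record the volume as ILLEGAL so a later recovery flow knows the
# merge did not complete. Names (MergeJob, volume_states) are illustrative.

LEGAL, ILLEGAL = "LEGAL", "ILLEGAL"

# Placeholder for libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT
PIVOT_FLAG = 2


class MergeJob(object):
    def __init__(self, dom, drive, volume_states):
        self.dom = dom
        self.drive = drive
        self.volume_states = volume_states  # volUUID -> legality marker

    def try_pivot(self, vol_uuid):
        try:
            self.dom.blockJobAbort(self.drive, PIVOT_FLAG)
        except TimeoutError:
            # Instead of silently dropping the job (the behavior reported
            # in this bug), persist a state the engine can act on later.
            self.volume_states[vol_uuid] = ILLEGAL
            return False
        self.volume_states[vol_uuid] = LEGAL
        return True


class _TimeoutDomain(object):
    """Stand-in domain whose pivot request always times out."""

    def blockJobAbort(self, drive, flags):
        raise TimeoutError("cannot acquire state change lock")


states = {"vol-1": LEGAL}
job = MergeJob(_TimeoutDomain(), "vda", states)
ok = job.try_pivot("vol-1")
print(ok, states["vol-1"])  # -> False ILLEGAL
```

The point of the sketch is only the state transition: a timed-out pivot leaves a durable ILLEGAL marker behind, which is exactly the breadcrumb a retry/recovery flow would need.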
Thank you for the detailed reply, Germano.

Can you please confirm what the engine version was when the issue hit? With the improvements made in the Live Merge recovery area in 3.6.6, I think retrying Live Merge would have fixed the issues listed here. I am afraid the Live Merge retry was made before upgrading the deployment?
Hi Ala,

Yes, upgraded to 3.6.6 more than two weeks before the incident.

yum.log:
May 31 08:05:49 Updated: rhevm-3.6.6.2-0.1.el6.noarch

ovirt-engine-setup-20160531080423-sh2ixh.log:
2016-05-31 08:06:55 DEBUG otopi.context context.dumpEnvironment:510 ENV OVESETUP_CORE_MODIFIED_LINES_GROUP/versionlock=dict:'{'/etc/yum/pluginconf.d/versionlock.list': [{'added': u'rhevm-3.6.6.2-0.1.el6.noarch'}, {'added': u'rhevm-userportal-3.6.6.2-0.1.el6.noarch'}, {'added': u'rhevm-dbscripts-3.6.6.2-0.1.el6.noarch'}, {'added': u'rhevm-tools-backup-3.6.6.2-0.1.el6.noarch'}, {'added': u'redhat-support-plugin-rhev-3.6.0-15.el6.noarch'}, {'added': u'rhevm-webadmin-portal-3.6.6.2-0.1.el6.noarch'}, {'added': u'rhevm-restapi-3.6.6.2-0.1.el6.noarch'}, {'added': u'ovirt-engine-extension-aaa-jdbc-1.0.7-2.el6ev.noarch'}, {'added': u'rhevm-tools-3.6.6.2-0.1.el6.noarch'}, {'added': u'rhevm-backend-3.6.6.2-0.1.el6.noarch'}]}'

Please let me know what else I can help you with.
Thank you for the reply, Germano. I would like to summarize the behavior here and close the bug:

1. The root cause of the issue is BZ 1319400, which is ON_QA now.
2. The first time the issue hit was in February (from the logs I see that this is when the first retry of live merge was done). At that time an older version of the engine was installed, i.e. *not* 3.6.6. I am emphasizing this because the recovery enhancements were introduced in 3.6.6.
3. On June 08 the deployment was upgraded to 3.6.6.
4. At this point it is not possible to recover the "old" volumes by retrying live merge.

With 3.6.6 we would be able to recover from similar issues by retrying live merge.
Hi Ala,

Well, if this will indeed be solved by BZ 1319400, then of course it can be closed. I had this impression when I opened this BZ, as you can see in the bug description; I just wanted to make sure nothing else needs to be done to prevent this from happening.

I'm not sure where you got the February incident or the June 08 upgrade from; perhaps I missed something, or you were also looking at the vdsm side. But yes, the attached case has literally hit all the possible live merge bugs, and those VMs already have a history of related failures. If previous issues can prevent the live merge recovery flows from doing their job, then that's acceptable as well.

Thank you for looking at this. Feel free to close it.
Hi Germano,

Thanks for your prompt input. In the engine log file from the sosreport uploaded on Feb 25, I saw that a merge retry was done. This means that a merge had already failed by that date. In the customer case, the customer reported on June 08 that they had upgraded to 3.6.6.

Closing.

*** This bug has been marked as a duplicate of bug 1319400 ***