Bug 1207290 - [engine-backend] Live merge failure (VM with disks on block and file) after a successful merge
Keywords:
Status: CLOSED DUPLICATE of bug 1207808
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.1
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.5.1
Assignee: Adam Litke
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2015-03-30 15:45 UTC by Elad
Modified: 2016-02-10 17:07 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-03-31 20:18:18 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
logs from engine, vdsm, images table from db, pg dump, lvm output and /rhev/data-center tree (1.60 MB, application/x-gzip)
2015-03-30 15:45 UTC, Elad

Description Elad 2015-03-30 15:45:40 UTC
Created attachment 1008557 [details]
logs from engine, vdsm, images table from db, pg dump, lvm output and /rhev/data-center tree

Description of problem:
I tried to live delete a snapshot of a VM that had 4 disks, 2 of them located on an NFS domain and 2 on an FC domain. This was a second live merge, after the first one had succeeded.
This second attempt failed on the engine. Looking at the snapshot overview for these domains, I saw that the snapshot disks located on the FC domain were removed successfully, while the snapshot disks located on the NFS domain were in status 'Illegal'.

Version-Release number of selected component (if applicable):
rhev 3.5.1 vt14.1
rhel7.1
vdsm-4.16.12.1-3.el7ev.x86_64
libvirt-daemon-1.2.8-16.el7_1.2.x86_64
qemu-kvm-rhev-2.1.2-23.el7_1.1.x86_64

How reproducible:
Tested once

Steps to Reproduce:
1. Created a VM with a disk on an FC domain attached and installed an OS. Attached 3 more disks: 2 from an NFS domain and 1 from an FC domain.
2. Created 2 snapshots of the VM including all the disks.
3. Successfully live removed the first created snapshot.
4. Tried to live remove the second created snapshot.
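
For reference, steps 3 and 4 can also be triggered through the REST API instead of the webadmin UI; deleting a snapshot of a running VM is what starts the live merge. The following is a minimal sketch, assuming a RHEV 3.5 engine exposing the /ovirt-engine/api endpoint; the host, credentials and UUIDs are hypothetical placeholders, not values from this environment:

# Hypothetical sketch: triggering the live merge from steps 3-4 via the
# RHEV 3.5 REST API.  Host, credentials and UUIDs are placeholders.
import requests

ENGINE = 'https://engine.example.com/ovirt-engine/api'
AUTH = ('admin@internal', 'password')
VM_ID = '<vm-uuid>'
SNAP_ID = '<snapshot-uuid>'

# DELETE on a snapshot of a running VM starts a live merge on the host.
resp = requests.delete('{0}/vms/{1}/snapshots/{2}'.format(ENGINE, VM_ID, SNAP_ID),
                       auth=AUTH, verify=False)  # lab setup, self-signed cert
print(resp.status_code)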

Actual results:
The snapshot removal was reported as failed on the engine:

2015-03-30 17:56:38,129 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotCommandCallback] (DefaultQuartzScheduler_Worker-41) [10c9ee97] All Live Merge child commands have completed, status FAILED

2015-03-30 17:56:48,809 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotCommand] (DefaultQuartzScheduler_Worker-59) [10c9ee97] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotCommand
2015-03-30 17:56:48,881 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-59) [10c9ee97] Correlation ID: 10c9ee97, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot '2' for VM 'vm-2'.


After the failure, the 2 snapshot disks attached to the VM that reside on the NFS domain were reported as 'Illegal', while the 2 snapshot disks attached to the VM that reside on the FC domain no longer existed; they had been successfully removed.
  

Expected results:
Live merge should succeed 

Additional info:
attached:
logs from engine, vdsm, images table from db, pg dump, lvm output and /rhev/data-center tree

Comment 1 Elad 2015-03-31 07:47:47 UTC
Reproduced again.
It seems to happen while trying to merge the last created snapshot, after all the newer snapshots have been merged.

Comment 2 Allon Mureinik 2015-03-31 08:07:21 UTC
Adam - this looks like a dup of an issue you're already handling, no?

Can you please take a look?

Comment 3 Adam Litke 2015-03-31 20:18:18 UTC
Looking at the vdsm log, I see the following:

Thread-5396::INFO::2015-03-30 17:55:17,974::vm::6089::vm.Vm::(tryPivot) vmId=`42829cb9-9d04-4ef6-8719-c0079abee6df`::Requesting pivot to complete active layer commit (job 3a4ce665-7836-490f-9020-780c609ea9c1)
Thread-5396::INFO::2015-03-30 17:55:18,205::vm::6101::vm.Vm::(tryPivot) vmId=`42829cb9-9d04-4ef6-8719-c0079abee6df`::Pivot completed (job 3a4ce665-7836-490f-9020-780c609ea9c1)
Thread-5396::INFO::2015-03-30 17:55:18,205::vm::6108::vm.Vm::(run) vmId=`42829cb9-9d04-4ef6-8719-c0079abee6df`::Synchronizing volume chain after live merge (job 3a4ce665-7836-490f-9020-780c609ea9c1)
Thread-5396::DEBUG::2015-03-30 17:55:18,363::vm::5979::vm.Vm::(_syncVolumeChain) vmId=`42829cb9-9d04-4ef6-8719-c0079abee6df`::vdsm chain: [u'7fe5061b-d4d3-47a7-813e-2693eb3fce2e', u'ed77872b-b305-4812-9107-25c190c57354'], libvirt chain: [u'7fe5061b-d4d3-47a7-813e-2693eb3fce2e', u'ed77872b-b305-4812-9107-25c190c57354']

Merge job 3a4ce665-7836-490f-9020-780c609ea9c1 was an active layer commit, and _syncVolumeChain tells us that after the pivot the same two volUUIDs still exist in the chain reported to us by libvirt. This is definitely the race described in bug 1207808.
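
For context, the log sequence above corresponds to an active layer commit followed by a pivot and a volume chain re-sync. The following is a simplified, hypothetical sketch of that flow using libvirt-python, not VDSM's actual implementation; dom is a libvirt domain handle, and drive, base_path, top_path and merged_vol are placeholders:

# Simplified, hypothetical sketch of an active layer commit + pivot and the
# kind of chain check done afterwards.  Not VDSM code; only the libvirt-python
# calls reflect the real API.
import time
import libvirt

def commit_active_layer(dom, drive, base_path, top_path):
    # Commit the active (top) volume into its base while the VM keeps running.
    dom.blockCommit(drive, base_path, top_path,
                    flags=libvirt.VIR_DOMAIN_BLOCK_COMMIT_ACTIVE)
    # Wait until the block job is ready to pivot (cur == end).
    while True:
        info = dom.blockJobInfo(drive, 0)
        if info and info['cur'] == info['end']:
            break
        time.sleep(1)
    # Pivot: make the base volume the new active layer and end the job.
    dom.blockJobAbort(drive, libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)

def chain_synced(libvirt_chain, merged_vol):
    # After a successful pivot the merged leaf must no longer appear in the
    # chain libvirt reports; if it is still there, the chains are not in sync.
    return merged_vol not in libvirt_chain

After a successful pivot the merged leaf should drop out of the chain libvirt reports; in the log above both chains still list the same two volUUIDs, which is exactly the symptom of that race.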

*** This bug has been marked as a duplicate of bug 1207808 ***

