Created attachment 1008557
logs from engine, vdsm, images table from db, pg dump, lvm output and /rhev/data-center tree

Description of problem:
I tried to live-delete a snapshot of a VM with 4 disks: 2 on an NFS domain and 2 on an FC domain. This was the second live merge on this VM. The first one had succeeded, but this second attempt failed on the engine. In the snapshot overview for these domains, the snapshot disks on the FC domain had been removed successfully, while the snapshot disks on the NFS domain were in status 'Illegal'.

Version-Release number of selected component (if applicable):
rhev 3.5.1 vt14.1
rhel7.1
vdsm-4.16.12.1-3.el7ev.x86_64
libvirt-daemon-1.2.8-16.el7_1.2.x86_64
qemu-kvm-rhev-2.1.2-23.el7_1.1.x86_64

How reproducible:
Tested once

Steps to Reproduce:
1. Created a VM with a disk attached from the FC domain and installed an OS. Attached 3 more disks: 2 from the NFS domain and 1 from the FC domain.
2. Created 2 snapshots of the VM including all the disks.
3. Successfully live-removed the first snapshot created.
4. Tried to live-remove the second snapshot created.

Actual results:
The snapshot removal was reported as failed on the engine:
2015-03-30 17:56:38,129 INFO [org.ovirt.engine.core.bll.RemoveSnapshotCommandCallback] (DefaultQuartzScheduler_Worker-41) [10c9ee97] All Live Merge child commands have completed, status FAILED
2015-03-30 17:56:48,809 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotCommand] (DefaultQuartzScheduler_Worker-59) [10c9ee97] Ending command with failure: org.ovirt.engine.core.bll.RemoveSnapshotCommand
2015-03-30 17:56:48,881 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-59) [10c9ee97] Correlation ID: 10c9ee97, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot '2' for VM 'vm-2'.
After the failure, the 2 snapshot disks attached to the VM that reside on the NFS domain were reported as 'Illegal', while the 2 snapshot disks attached to the VM that reside on the FC domain no longer existed: they had been removed successfully.

Expected results:
Live merge should succeed.

Additional info:
Attached: logs from engine, vdsm, images table from db, pg dump, lvm output and /rhev/data-center tree
Reproduced again. It seems to happen when merging the most recently created snapshot after all the other snapshots have already been merged.
Adam - this looks like a dup of an issue you're already handling, no? Can you please take a look?
The vdsm log shows the following:
Thread-5396::INFO::2015-03-30 17:55:17,974::vm::6089::vm.Vm::(tryPivot) vmId=`42829cb9-9d04-4ef6-8719-c0079abee6df`::Requesting pivot to complete active layer commit (job 3a4ce665-7836-490f-9020-780c609ea9c1)
Thread-5396::INFO::2015-03-30 17:55:18,205::vm::6101::vm.Vm::(tryPivot) vmId=`42829cb9-9d04-4ef6-8719-c0079abee6df`::Pivot completed (job 3a4ce665-7836-490f-9020-780c609ea9c1)
Thread-5396::INFO::2015-03-30 17:55:18,205::vm::6108::vm.Vm::(run) vmId=`42829cb9-9d04-4ef6-8719-c0079abee6df`::Synchronizing volume chain after live merge (job 3a4ce665-7836-490f-9020-780c609ea9c1)
Thread-5396::DEBUG::2015-03-30 17:55:18,363::vm::5979::vm.Vm::(_syncVolumeChain) vmId=`42829cb9-9d04-4ef6-8719-c0079abee6df`::vdsm chain: [u'7fe5061b-d4d3-47a7-813e-2693eb3fce2e', u'ed77872b-b305-4812-9107-25c190c57354'], libvirt chain: [u'7fe5061b-d4d3-47a7-813e-2693eb3fce2e', u'ed77872b-b305-4812-9107-25c190c57354']

Merge job 3a4ce665-7836-490f-9020-780c609ea9c1 was an active layer commit, and _syncVolumeChain shows that even after the pivot completed, libvirt still reports the same two volUUIDs in the chain as vdsm does, i.e. libvirt's chain does not yet reflect the pivot. This is definitely the race described in bug 1207808.

*** This bug has been marked as a duplicate of bug 1207808 ***
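The chain comparison used in the analysis above can be sketched as a small check. This is a minimal, hypothetical helper (`pivot_reflected` is not a vdsm function), assuming that once an active layer commit and its pivot have really taken effect, the merged volume is gone from the chain libvirt reports. It compares volume sets rather than ordered lists, so it makes no assumption about which end of the chain is the top volume.

```python
from typing import Sequence


def pivot_reflected(vdsm_chain: Sequence[str], libvirt_chain: Sequence[str]) -> bool:
    """Hypothetical check, not vdsm's actual implementation.

    Assumes that after a completed active layer commit and pivot, at least one
    volume (the merged one) has dropped out of the chain libvirt reports, so
    libvirt's volumes must be a strict subset of vdsm's pre-merge volumes.
    """
    return set(libvirt_chain) < set(vdsm_chain)


# Chains copied from the _syncVolumeChain DEBUG line in this bug.
VDSM_CHAIN = ["7fe5061b-d4d3-47a7-813e-2693eb3fce2e",
              "ed77872b-b305-4812-9107-25c190c57354"]
LIBVIRT_CHAIN = ["7fe5061b-d4d3-47a7-813e-2693eb3fce2e",
                 "ed77872b-b305-4812-9107-25c190c57354"]

if __name__ == "__main__":
    if not pivot_reflected(VDSM_CHAIN, LIBVIRT_CHAIN):
        # Identical chains right after "Pivot completed": the race symptom.
        print("libvirt chain unchanged after pivot: possible race")
```

With the chains from this log the check returns False, matching the diagnosis that libvirt had not yet caught up with the pivot when the chain was synchronized.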