Bug 1383301

Summary: Snapshot remove Live-Merge failed, After vm shutdown, start again is not possible
Product: [oVirt] ovirt-engine
Reporter: Marco Buettner <buettner>
Component: BLL.Storage
Assignee: Ala Hino <ahino>
Status: CLOSED CURRENTRELEASE
QA Contact: Kevin Alon Goldblatt <kgoldbla>
Severity: high
Priority: high
Docs Contact:
Version: 3.6.6
CC: ableisch, ahino, amureini, bugs, mtessun, ratamir, tnisan, ylavi
Target Milestone: ovirt-4.1.7
Target Release: 4.1.7.4
Flags: rule-engine: ovirt-4.1+
       rule-engine: exception+
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-11-13 12:29:18 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1384321

Attachments:
  ovirt-engine.log (flags: none)
  vdsm-log-vm4 (flags: none)
  vdsm-log-vm1 (flags: none)

Description Marco Buettner 2016-10-10 11:48:09 UTC
Created attachment 1208808 [details]
ovirt-engine.log

Description of problem:

The snapshot of a running VM with two disks should be removed.
The process starts but never completes in the GUI.
After a restart of the engine there are no more tasks in the GUI, but the snapshot is still locked.
The time between snapshot creation and deletion does not matter; it can be a day, a week or a month.
Interestingly, the snapshot is removed from disk1, but not from disk2.

If the VM is powered off, it is not possible to start it again, and it is also not possible to remove the snapshot (live merge or offline merge) after unlocking it with the CLI tool.

Trying to start the VM causes a "bad Volume specification" error, and trying to delete the snapshot produces a "disc image could not found" error.

There are similar bug reports, but their solutions do not fix my problem.
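
(For reference, the CLI unlock mentioned above is typically done with the engine's unlock_entity.sh dbutils script. A rough sketch only; the path and flags are from memory, so run the script with -h first to confirm them for your engine version.)

  # list locked snapshots, then unlock the stuck one by its id
  /usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh -t snapshot -q
  /usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh -t snapshot <snapshot-id>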

My Environment:

Problem started with:
Five hosts with CentOS 7.2
vdsm-4.17.28-1.el7.noarch
engine 3.6.6

Furthermore, I updated the engine to version 3.6.7 and one host to vdsm-4.17.32-1.el7.noarch. Same issue.

Output of vdsm-tool:

vdsm-tool dump-volume-chains 92f45e8b-b46a-4711-8376-770b0f1d807c

......
image:    53d50939-3134-4e51-8239-7cfa306faa7e

             - 04524569-b562-4ef6-9f9b-30392b1b820f
               status: OK, voltype: INTERNAL, format: RAW, legality: LEGAL, type: PREALLOCATED

             - 535aa7d6-ef9b-4d5d-a58e-855e96298c59
               status: ILLEGAL, voltype: LEAF, format: COW, legality: ILLEGAL, type: SPARSE

......

On Database:

engine=# SELECT * from images where image_group_id='53d50939-3134-4e51-8239-7cfa306faa7e';
-[ RECORD 1 ]---------+-------------------------------------
image_guid            | 535aa7d6-ef9b-4d5d-a58e-855e96298c59
creation_date         | 2015-07-23 15:58:20+02
size                  | 64424509440
it_guid               | 00000000-0000-0000-0000-000000000000
parentid              | 04524569-b562-4ef6-9f9b-30392b1b820f
imagestatus           | 1
lastmodified          | 1970-01-01 01:00:00+01
vm_snapshot_id        | e8bc4f12-0b47-47e2-b47a-6bc3bc5bc680
volume_type           | 2
volume_format         | 4
image_group_id        | 53d50939-3134-4e51-8239-7cfa306faa7e
_create_date          | 2015-07-23 15:58:15.43802+02
_update_date          | 2016-10-09 23:15:35.144036+02
active                | t
volume_classification | 0
-[ RECORD 2 ]---------+-------------------------------------
image_guid            | 04524569-b562-4ef6-9f9b-30392b1b820f
creation_date         | 2015-06-13 17:14:45+02
size                  | 64424509440
it_guid               | 00000000-0000-0000-0000-000000000000
parentid              | 00000000-0000-0000-0000-000000000000
imagestatus           | 4
lastmodified          | 2015-07-23 15:58:15.437+02
vm_snapshot_id        | 9a40dc84-0649-4e3d-9794-a2acd6e3d7e8
volume_type           | 1
volume_format         | 5
image_group_id        | 53d50939-3134-4e51-8239-7cfa306faa7e
_create_date          | 2015-03-21 13:38:12.808172+01
_update_date          | 2015-07-23 15:58:15.43802+02
active                | f
volume_classification | 1
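
A rough cross-check on the engine side (a sketch that assumes local psql access to the "engine" database and the engine's ImageStatus convention, where 1 = OK and 4 = ILLEGAL):

  # list every image the engine has marked ILLEGAL
  sudo -u postgres psql engine -c \
    "SELECT image_guid, image_group_id, vm_snapshot_id, imagestatus
       FROM images WHERE imagestatus = 4;"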


Snippet from the engine log:

2016-10-10 09:38:18,688 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-11) [1f4d2f78] Merging of snapshot '9a40dc84-0649-4e3d-9794-a2acd6e3d7e8' images '04524569-b562-4ef6-9f9b-30392b1b820f'..'535aa7d6-ef9b-4d5d-a58e-855e96298c59' failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation.




How reproducible:
Create a snapshot on a VM with more than one disk, then try to delete it.
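
One way to script this reproduction against the engine REST API (a sketch only; ENGINE_FQDN, PASSWORD, VM_ID and SNAPSHOT_ID are placeholders, and -k skips CA verification):

  # create a snapshot of the running VM
  curl -k -u admin@internal:PASSWORD -H "Content-Type: application/xml" \
    -d '<snapshot><description>test</description></snapshot>' \
    https://ENGINE_FQDN/ovirt-engine/api/vms/VM_ID/snapshots

  # later, delete it while the VM is still running (live merge)
  curl -k -u admin@internal:PASSWORD -X DELETE \
    https://ENGINE_FQDN/ovirt-engine/api/vms/VM_ID/snapshots/SNAPSHOT_ID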



Actual results:
Deletion of the snapshot fails.


Expected results:
Deletion of the snapshot completes successfully.

Additional info:

Comment 1 Marco Buettner 2016-10-10 11:50:23 UTC
Created attachment 1208809 [details]
vdsm-log-vm4

Comment 2 Marco Buettner 2016-10-10 11:51:07 UTC
Created attachment 1208811 [details]
vdsm-log-vm1

Comment 3 Allon Mureinik 2016-10-10 13:22:37 UTC
Ala, can you take a look please?

Comment 4 Ala Hino 2016-10-10 14:27:12 UTC
Ack; looking into it

Comment 5 Marco Buettner 2016-10-17 07:57:16 UTC
Any new ideas?

A quick grep for the LV names in the lvdisplay output:

 --- Logical volume ---
  LV Path                /dev/92f45e8b-b46a-4711-8376-770b0f1d807c/535aa7d6-ef9b-4d5d-a58e-855e96298c59
  LV Name                535aa7d6-ef9b-4d5d-a58e-855e96298c59
  VG Name                92f45e8b-b46a-4711-8376-770b0f1d807c
  LV UUID                rkR063-nRO6-q8zP-chNN-4x2T-1dsH-ytx6ok
  LV Write Access        read/write
  LV Creation host, time vm3.service.domain.de, 2015-07-23 15:58:16 +0200
  LV Status              NOT available
  LV Size                3,00 GiB
  Current LE             24
  Segments               3
  Allocation             inherit
  Read ahead sectors     auto


 --- Logical volume ---
  LV Path                /dev/92f45e8b-b46a-4711-8376-770b0f1d807c/04524569-b562-4ef6-9f9b-30392b1b820f
  LV Name                04524569-b562-4ef6-9f9b-30392b1b820f
  VG Name                92f45e8b-b46a-4711-8376-770b0f1d807c
  LV UUID                FyBtIj-gOt9-ST29-2eB6-gfIG-y5iW-4p835f
  LV Write Access        read/write
  LV Creation host, time vm3.service.domain.de, 2015-06-13 17:14:42 +0200
  LV Status              NOT available
  LV Size                60,00 GiB
  Current LE             480
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
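
As a quicker alternative to grepping lvdisplay, lvs can print the relevant fields in one table (the VG name is the storage-domain UUID used above; including lv_tags assumes the oVirt image/parent tags stored on each LV are of interest):

  lvs -o lv_name,lv_size,lv_attr,lv_active,lv_tags 92f45e8b-b46a-4711-8376-770b0f1d807c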

Comment 6 Yaniv Lavi 2016-12-26 12:45:33 UTC
We are not maintaining oVirt 3.6 anymore.
4.0.x has a lot of improvements that will probably resolve this issue.
Please reopen if you can reproduce on oVirt 4.0.x.

Comment 7 Andreas Bleischwitz 2017-07-04 09:53:14 UTC
I have been able to reproduce this issue on a RHV 4.1 setup.

Engine:
ovirt-engine-4.1.2.3-0.1.el7.noarch

host:
vdsm-4.19.15-1.el7ev.x86_64

* Created a VM with two disks attached.
* Created an offline snapshot (may not be relevant).
* Put some changes on both disks, large enough that the deletion takes some time (see the sketch after this list).
* Deleted the snapshot and shut the VM down during that activity (issued "init 0" / "halt" in the VM).
* The engine processes the snapshot deletion for some time, but finally aborts the tasks without having deleted the snapshot on both disks.

* The GUI shows the snapshot of one disk still present but "invalid".
* The VM will no longer start: "Exit message: Bad volume specification"
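
A possible way to do the "changes on both disks" step inside the guest (the mount paths are examples; adjust them to wherever the two disks are mounted):

  dd if=/dev/urandom of=/mnt/disk1/fill.bin bs=1M count=4096
  dd if=/dev/urandom of=/mnt/disk2/fill.bin bs=1M count=4096
  sync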

Comment 19 Allon Mureinik 2017-10-01 11:56:18 UTC
The patches were reverted on master, as they seem to cause non-deterministic OST failures.
Moving back to ASSIGNED - they should be reverted on the stable branch too.

Comment 22 Raz Tamir 2017-10-25 07:26:15 UTC
Based on comment #19 - moving back to ASSIGNED

Comment 23 Red Hat Bugzilla Rules Engine 2017-10-25 07:26:24 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 24 Red Hat Bugzilla Rules Engine 2017-10-25 07:26:45 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 25 Allon Mureinik 2017-10-25 07:33:23 UTC
(In reply to Raz Tamir from comment #22)
> Based on comment #19 - moving back to ASSIGNED

New patches were posted after that, and merged. These updates don't get comment numbers so I can't link to them, but check out the block between comment 19 and comment 20 (items posted by automation and rhev-integ).
ON_QA is the right status for this bug.

Comment 26 Raz Tamir 2017-10-25 08:15:49 UTC
Thanks for clarifying Allon.

Comment 27 Ala Hino 2017-10-25 09:35:10 UTC
Background:
It was impossible to start the VM again after a live merge failed and the VM was shut down, because the top volume was left illegal. The top volume can be illegal only when doing an active merge, i.e. merging the parent of the leaf. In addition, the top volume is set to illegal only after the merge step (at Vdsm) completes.

Steps to reproduce:
1. Live merge the leaf's parent snapshot and watch the engine logs
2. When the merge step completes, block access from the host to the engine (one way is sketched below)
3. At this point the engine keeps retrying the merge status check and fails because the host cannot be reached
4. Shut down the VM
5. Start the VM; it should now work
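
One possible way to do step 2 on the host (ENGINE_IP is a placeholder; remove the rules again with iptables -D afterwards):

  iptables -A OUTPUT -d ENGINE_IP -j DROP
  iptables -A INPUT  -s ENGINE_IP -j DROP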

Comment 28 Kevin Alon Goldblatt 2017-10-26 13:53:44 UTC
Verified with the following code:
-----------------------------------------
ovirt-engine-4.1.7.4-0.1.el7.noarch
vdsm-4.19.35-1.el7ev.x86_64


Verified with the following scenario:
-----------------------------------------
Steps to reproduce:
1. On a VM with 2 disks, create a snapshot.
2. Add some data.
3. Live merge the leaf's parent snapshot and watch the engine logs.
4. When the merge step completes, block access from the host to the engine.
5. At this point the engine keeps retrying the merge status check and fails because the host cannot be reached.
6. Shut down the VM - the snapshot deletion completes successfully.
7. Start the VM - it works fine and no disks are left invalid (a quick post-check is sketched below).
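
A quick post-check on the host, reusing the vdsm-tool command from earlier in this bug (STORAGE_DOMAIN_UUID is a placeholder):

  # expect no output when every volume in every chain is LEGAL
  vdsm-tool dump-volume-chains STORAGE_DOMAIN_UUID | grep -i illegal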

Moving to VERIFIED!