Created attachment 847552 [details]
vdsm log from the first snapshot-deletion crash

Description of problem:
On a shut-down VM, trying to delete a snapshot fails with the error:
"Failed to delete snapshot 'blahblahbla' for VM 'myVM'."
Eventually the snapshot status becomes "BROKEN".

Version-Release number of selected component (if applicable):
- oVirt 3.3.0-4.el6
- vdsm-4.12.1-2.el6
- Manager and host are CentOS 6.4 64-bit
- Storage is an iSCSI SAN (EqualLogic)

How reproducible:
Quite unwanted, but already seen once before.

Steps to Reproduce:
1. Shut down the VM
2. Create a snapshot
3. Start the VM
4. Shut down the VM
5. From the web GUI, delete the snapshot

Actual results:
Snapshot deletion fails. Repeating the deletion fails the same way each time, until the last attempt leaves the snapshot status "BROKEN". At this point, the VM can no longer be started.

Expected results:
Snapshot deleted; VM able to run.

Additional info:
Once the snapshot becomes "BROKEN", trying to run the VM crashes with the error message:

VM uc-674 is down. Exit message: internal error process exited while connecting to monitor: qemu-kvm: -drive file=/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/11a077c7-658b-49bb-8596-a785109c24c9/images/69220da6-eeed-4435-aad0-7aa33f3a0d21/c50561d9-c3ba-4366-b2bc-49bbfaa4cd23,if=none,id=drive-virtio-disk0,format=qcow2,serial=69220da6-eeed-4435-aad0-7aa33f3a0d21,cache=none,werror=stop,rerror=stop,aio=native: could not open disk image /rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/11a077c7-658b-49bb-8596-a785109c24c9/images/69220da6-eeed-4435-aad0-7aa33f3a0d21/c50561d9-c3ba-4366-b2bc-49bbfaa4cd23: Invalid argument.

I had forced this VM to run on a specific host in order to ease debugging. This host is also the SPM, which helps when reading the logs. Running "lvs", I can see the disk's logical volume is still there, but deactivated.
Running into the same situation as described here (http://list-archives.org/2013/10/25/users-ovirt-org/vm-snapshot-delete-failed-iscsi-domain/f/6837397684), I re-enabled the logical volume:

lvchange -aey /dev/blahblahblah

and it activated fine. Running the VM manually also worked:

/usr/libexec/qemu-kvm -m 512 -name uc-674 -drive file=/dev/blahblahblah -vnc :5

I can reach it over VNC; it has no network, but it is running OK.

Trying to run it through oVirt (web GUI) leads to the crash below:

VM uc-674 is down. Exit message: internal error process exited while connecting to monitor: qemu-kvm: -drive file=/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/11a077c7-658b-49bb-8596-a785109c24c9/images/69220da6-eeed-4435-aad0-7aa33f3a0d21/c50561d9-c3ba-4366-b2bc-49bbfaa4cd23,if=none,id=drive-virtio-disk0,format=qcow2,serial=69220da6-eeed-4435-aad0-7aa33f3a0d21,cache=none,werror=stop,rerror=stop,aio=native: could not open disk image /rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/11a077c7-658b-49bb-8596-a785109c24c9/images/69220da6-eeed-4435-aad0-7aa33f3a0d21/c50561d9-c3ba-4366-b2bc-49bbfaa4cd23: Invalid argument.

I took the time to run the exact same command manually, removing options one by one. I get three cases:
- Working: the VM boots fine and the OS runs OK
- Invalid argument: the command fails immediately
- No boot device: the VM BIOS starts, but since no boot device is found, it falls back to PXE
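For reference, the manual recovery above can be sketched as a small dry-run script. The LV path here is a hypothetical placeholder (the real path was elided as /dev/blahblahblah above); the `run` helper only echoes each command, so nothing touches LVM or qemu until you replace its body with `"$@"` on a real host.

```shell
#!/bin/sh
# Dry-run sketch of the manual recovery steps described above.
# LV is a hypothetical placeholder, not the real logical volume path.
LV=/dev/vgname/lvname

# Echo instead of executing; replace the body with "$@" to run for real.
run() { echo "WOULD RUN: $*"; }

# Re-activate the logical volume that was left deactivated
run lvchange -aey "$LV"
# Boot the VM manually with a minimal qemu-kvm command line
run /usr/libexec/qemu-kvm -m 512 -name uc-674 -drive file="$LV" -vnc :5
```

This mirrors the two commands shown in the comment; it is a sketch, not a supported recovery procedure.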
These are the cases:

* Working:
-drive file=/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/11a077c7-658b-49bb-8596-a785109c24c9/images/69220da6-eeed-4435-aad0-7aa33f3a0d21/c50561d9-c3ba-4366-b2bc-49bbfaa4cd23
-drive file=/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/11a077c7-658b-49bb-8596-a785109c24c9/images/69220da6-eeed-4435-aad0-7aa33f3a0d21/c50561d9-c3ba-4366-b2bc-49bbfaa4cd23,id=drive-virtio-disk0
-drive file=/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/11a077c7-658b-49bb-8596-a785109c24c9/images/69220da6-eeed-4435-aad0-7aa33f3a0d21/c50561d9-c3ba-4366-b2bc-49bbfaa4cd23,id=drive-virtio-disk0,cache=none
-drive file=/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/11a077c7-658b-49bb-8596-a785109c24c9/images/69220da6-eeed-4435-aad0-7aa33f3a0d21/c50561d9-c3ba-4366-b2bc-49bbfaa4cd23,id=drive-virtio-disk0,cache=none,werror=stop
-drive file=/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/11a077c7-658b-49bb-8596-a785109c24c9/images/69220da6-eeed-4435-aad0-7aa33f3a0d21/c50561d9-c3ba-4366-b2bc-49bbfaa4cd23,id=drive-virtio-disk0,cache=none,werror=stop,rerror=stop
-drive file=/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/11a077c7-658b-49bb-8596-a785109c24c9/images/69220da6-eeed-4435-aad0-7aa33f3a0d21/c50561d9-c3ba-4366-b2bc-49bbfaa4cd23,id=drive-virtio-disk0,cache=none,werror=stop,rerror=stop,aio=native

* No boot device:
-drive file=/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/11a077c7-658b-49bb-8596-a785109c24c9/images/69220da6-eeed-4435-aad0-7aa33f3a0d21/c50561d9-c3ba-4366-b2bc-49bbfaa4cd23,if=none

* Invalid argument:
-drive file=/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/11a077c7-658b-49bb-8596-a785109c24c9/images/69220da6-eeed-4435-aad0-7aa33f3a0d21/c50561d9-c3ba-4366-b2bc-49bbfaa4cd23,id=drive-virtio-disk0,format=qcow2

Last thing: my main concern is to recover the VM, not the snapshot.
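The bisection above can be reproduced mechanically: rebuild the -drive argument one sub-option at a time and try each candidate. This sketch uses a hypothetical placeholder disk path rather than the real storage-domain path, and only prints each candidate option string instead of launching qemu-kvm.

```shell
#!/bin/sh
# Print each -drive candidate, adding one sub-option at a time,
# following the "Working" sequence listed above.
# DISK is a hypothetical placeholder for the image path.
DISK=/dev/vgname/lvname

candidates() {
  drive="file=$DISK"
  echo "$drive"
  for opt in id=drive-virtio-disk0 cache=none werror=stop rerror=stop aio=native; do
    drive="$drive,$opt"
    echo "$drive"
  done
}

candidates
```

Since only the variant containing format=qcow2 fails with "Invalid argument", one plausible follow-up check (an assumption on my part, not something confirmed in this thread) would be to compare the format oVirt passes against what `qemu-img info` reports for the volume.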
Setting target release to the current version for consideration and review. Please do not leave non-RFE bugs on an undefined target release; targeting ensures bugs are reviewed for relevancy, fix, closure, etc.
What follows is only loosely related to the bug: it deals with the now-faulty VM and with trying to run it, rather than with debugging why the snapshot deletion failed. Anyway, latest news:

- I found I was able to activate the logical volume with lvchange -aey, and then run it manually with qemu-kvm. This showed the data stored inside the volume was not corrupted.
- Weird or not, I was able to clone the whole VM definition AND the disk (using the web GUI), not forgetting to rename each part (VM name, disk name). And the clone ran well!

For the production part of my job, things are solved. For the oVirt project, they are not.

Good news: I still have the faulty VM available for your tests and tries.
Semi-good news: I have another VM with a remaining snapshot. You will understand I am not keen on touching the latter until we have found a safe way to delete snapshots.

Separately, I compared the XML shown in the vdsm logs between the faulty VM and a VM running fine. Apart from the many obvious differences (UUIDs, paths), I see nothing shocking.
This is an automated message. Re-targeting all non-blocker bugs still open on 3.4.0 to 3.4.1.
Hi Nicolas, can you please also attach the sanlock and messages logs (the engine log would be great too)? I suspect this could be the same scenario as https://bugzilla.redhat.com/1082655
Hi, as I said in January, the affected VMs are production servers, so it is not straightforward for me to experiment with them. Some precautions need to be taken before touching them again, so please be patient and stay tuned. Thank you.
Returning the needinfo flag to mark that we need more info to solve this issue.
Nicolas, another question: does this failure reproduce consistently on ANY VM, or just on this specific one?
Allon, hard to say, as we only had two VMs with snapshots. The first one was affected, and the second one is running in production and cannot be experimented with at present.
This is an automated message. oVirt 3.4.1 has been released. This issue has been retargeted to 3.4.2 as it has severity high, please retarget if needed. If this is a blocker please add it to the tracker Bug #1095370
On my two oVirt setups, I upgraded to 3.4.1-1.el6. I tested today and tried to reproduce the bug, and it no longer occurs. I propose we close this bug.
(In reply to Nicolas Ecarnot from comment #11) > On my two oVirt setups, I upgraded to 3.4.1-1.el6. > > I just tested today and tried to reproduce the bug, and it does not appear > anymore. > > I propose we close this bug. Thanks for the update, Nicolas! Closing.