Created attachment 847552 [details]
vdsm log from the first snapshot-deletion crash

Description of problem:
On a shut-down VM, trying to delete a snapshot fails with the error:
"Failed to delete snapshot 'blahblahbla' for VM 'myVM'."
Eventually the snapshot status becomes "BROKEN".

Version-Release number of selected component (if applicable):
- oVirt 3.3.0-4.el6
- vdsm-4.12.1-2.el6
- Manager and host are CentOS 6.4 64-bit
- Storage is an iSCSI SAN (EqualLogic)

How reproducible:
Quite unwanted, but already seen once before.

Steps to Reproduce:
1. Shut down the VM
2. Create a snapshot
3. Start the VM
4. Shut down the VM
5. From the web GUI, delete the snapshot

Actual results:
Snapshot deletion fails. Repeating the deletion fails the same way each time, until the last attempt leaves the snapshot status "BROKEN". At this point, the VM can no longer be started.

Expected results:
Snapshot deleted; VM able to run.

Additional info:
Once the snapshot becomes "BROKEN", trying to run the VM crashes with the error message:

VM uc-674 is down. Exit message: internal error process exited while connecting to monitor: qemu-kvm: -drive file=/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/11a077c7-658b-49bb-8596-a785109c24c9/images/69220da6-eeed-4435-aad0-7aa33f3a0d21/c50561d9-c3ba-4366-b2bc-49bbfaa4cd23,if=none,id=drive-virtio-disk0,format=qcow2,serial=69220da6-eeed-4435-aad0-7aa33f3a0d21,cache=none,werror=stop,rerror=stop,aio=native: could not open disk image /rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/11a077c7-658b-49bb-8596-a785109c24c9/images/69220da6-eeed-4435-aad0-7aa33f3a0d21/c50561d9-c3ba-4366-b2bc-49bbfaa4cd23: Invalid argument.

I had forced this VM to run on a specific host in order to ease debugging. This host is also the SPM, which helps when reading the logs. Running "lvs", I can see the disk's logical volume is still there, but deactivated.
Running into the same situation as described here (http://list-archives.org/2013/10/25/users-ovirt-org/vm-snapshot-delete-failed-iscsi-domain/f/6837397684), I re-enabled the logical volume:

lvchange -aey /dev/blahblahblah

and it activated fine. Running the VM manually also worked:

/usr/libexec/qemu-kvm -m 512 -name uc-674 -drive file=/dev/blahblahblah -vnc :5

I can reach it over VNC; it has no network, but it is running OK.

Trying to run it through oVirt (web GUI) leads to the crash below:

VM uc-674 is down. Exit message: internal error process exited while connecting to monitor: qemu-kvm: -drive file=/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/11a077c7-658b-49bb-8596-a785109c24c9/images/69220da6-eeed-4435-aad0-7aa33f3a0d21/c50561d9-c3ba-4366-b2bc-49bbfaa4cd23,if=none,id=drive-virtio-disk0,format=qcow2,serial=69220da6-eeed-4435-aad0-7aa33f3a0d21,cache=none,werror=stop,rerror=stop,aio=native: could not open disk image /rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/11a077c7-658b-49bb-8596-a785109c24c9/images/69220da6-eeed-4435-aad0-7aa33f3a0d21/c50561d9-c3ba-4366-b2bc-49bbfaa4cd23: Invalid argument.

I took the time to run the exact same command manually, removing options one by one. I get three cases:
- Working: the VM boots fine and the OS runs OK
- Invalid argument: the command fails immediately
- No boot device: the VM BIOS starts, but since no boot device is found, it falls back to PXE
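For reference, the manual recovery above can be sketched as a small dry-run script. The LV path here is a hypothetical placeholder (the real path was elided as /dev/blahblahblah above); the `run` helper only echoes each command, so nothing touches LVM or qemu until you replace its body with `"$@"` on a real host.

```shell
#!/bin/sh
# Dry-run sketch of the manual recovery steps described above.
# LV is a hypothetical placeholder, not the real logical volume path.
LV=/dev/vgname/lvname

# Echo instead of executing; replace the body with "$@" to run for real.
run() { echo "WOULD RUN: $*"; }

# Re-activate the logical volume that was left deactivated
run lvchange -aey "$LV"
# Boot the VM manually with a minimal qemu-kvm command line
run /usr/libexec/qemu-kvm -m 512 -name uc-674 -drive file="$LV" -vnc :5
```

This mirrors the two commands shown in the comment; it is a sketch, not a supported recovery procedure.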
These are the cases:

* Working:
-drive file=/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/11a077c7-658b-49bb-8596-a785109c24c9/images/69220da6-eeed-4435-aad0-7aa33f3a0d21/c50561d9-c3ba-4366-b2bc-49bbfaa4cd23
-drive file=/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/11a077c7-658b-49bb-8596-a785109c24c9/images/69220da6-eeed-4435-aad0-7aa33f3a0d21/c50561d9-c3ba-4366-b2bc-49bbfaa4cd23,id=drive-virtio-disk0
-drive file=/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/11a077c7-658b-49bb-8596-a785109c24c9/images/69220da6-eeed-4435-aad0-7aa33f3a0d21/c50561d9-c3ba-4366-b2bc-49bbfaa4cd23,id=drive-virtio-disk0,cache=none
-drive file=/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/11a077c7-658b-49bb-8596-a785109c24c9/images/69220da6-eeed-4435-aad0-7aa33f3a0d21/c50561d9-c3ba-4366-b2bc-49bbfaa4cd23,id=drive-virtio-disk0,cache=none,werror=stop
-drive file=/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/11a077c7-658b-49bb-8596-a785109c24c9/images/69220da6-eeed-4435-aad0-7aa33f3a0d21/c50561d9-c3ba-4366-b2bc-49bbfaa4cd23,id=drive-virtio-disk0,cache=none,werror=stop,rerror=stop
-drive file=/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/11a077c7-658b-49bb-8596-a785109c24c9/images/69220da6-eeed-4435-aad0-7aa33f3a0d21/c50561d9-c3ba-4366-b2bc-49bbfaa4cd23,id=drive-virtio-disk0,cache=none,werror=stop,rerror=stop,aio=native

* No boot device:
-drive file=/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/11a077c7-658b-49bb-8596-a785109c24c9/images/69220da6-eeed-4435-aad0-7aa33f3a0d21/c50561d9-c3ba-4366-b2bc-49bbfaa4cd23,if=none

* Invalid argument:
-drive file=/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/11a077c7-658b-49bb-8596-a785109c24c9/images/69220da6-eeed-4435-aad0-7aa33f3a0d21/c50561d9-c3ba-4366-b2bc-49bbfaa4cd23,id=drive-virtio-disk0,format=qcow2

Last thing: my main concern is to recover the VM, not the snapshot.
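The bisection above can be reproduced mechanically: rebuild the -drive argument one sub-option at a time and try each candidate. This sketch uses a hypothetical placeholder disk path rather than the real storage-domain path, and only prints each candidate option string instead of launching qemu-kvm.

```shell
#!/bin/sh
# Print each -drive candidate, adding one sub-option at a time,
# following the "Working" sequence listed above.
# DISK is a hypothetical placeholder for the image path.
DISK=/dev/vgname/lvname

candidates() {
  drive="file=$DISK"
  echo "$drive"
  for opt in id=drive-virtio-disk0 cache=none werror=stop rerror=stop aio=native; do
    drive="$drive,$opt"
    echo "$drive"
  done
}

candidates
```

Since only the variant containing format=qcow2 fails with "Invalid argument", one plausible follow-up check (an assumption on my part, not something confirmed in this thread) would be to compare the format oVirt passes against what `qemu-img info` reports for the volume.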
Setting target release to the current version for consideration and review. Please do not leave non-RFE bugs on an undefined target release; targeting ensures bugs are reviewed for relevancy, fix, closure, etc.
What follows is only loosely related to the bug: it deals with the now-faulty VM and with trying to run it, rather than with debugging why the snapshot deletion failed. Anyway, latest news:

- I found I was able to activate the logical volume with lvchange -aey, and then run it manually with qemu-kvm. This showed the data stored inside the volume was not corrupted.
- Weird or not, I was able to clone the whole VM definition AND the disk (using the web GUI), not forgetting to rename each part (VM name, disk name). And the clone ran well!

For the production part of my job, things are solved. For the oVirt project, they are not.

Good news: I still have the faulty VM available for your tests and tries.
Semi-good news: I have another VM with a remaining snapshot. You will understand I am not keen on touching the latter until we have found a safe way to delete snapshots.

Separately, I compared the XML shown in the vdsm logs between the faulty VM and a VM running fine. Apart from the many obvious differences (UUIDs, paths), I see nothing shocking.
This is an automated message. Re-targeting all non-blocker bugs still open on 3.4.0 to 3.4.1.
Hi Nicolas, can you please also attach the sanlock and messages logs (the engine log would be great too)? I suspect this could be the same scenario as https://bugzilla.redhat.com/1082655
Hi, as I said in January, the affected VMs are production servers, so it is not straightforward for me to experiment with them. Some precautions need to be taken before touching them again, so please be patient and stay tuned. Thank you.
Returning the needinfo flag to mark that we need more info to solve this issue.
Nicolas, another question: does this failure reproduce consistently on ANY VM, or just on this specific one?
Allon, hard to say, as we only had two VMs with snapshots. The first one was affected, and the second one is running in production and cannot be experimented with at present.
This is an automated message. oVirt 3.4.1 has been released. This issue has been retargeted to 3.4.2 as it has severity high, please retarget if needed. If this is a blocker please add it to the tracker Bug #1095370
On my two oVirt setups, I upgraded to 3.4.1-1.el6. I tested today and tried to reproduce the bug, and it no longer occurs. I propose we close this bug.
(In reply to Nicolas Ecarnot from comment #11) > On my two oVirt setups, I upgraded to 3.4.1-1.el6. > > I just tested today and tried to reproduce the bug, and it does not appear > anymore. > > I propose we close this bug. Thanks for the update, Nicolas! Closing.