Bug 1016700

Summary: Cloning VM from snapshot of another VM results in corruption of original VM
Product: Red Hat Enterprise Virtualization Manager
Reporter: rhev-integ
Component: ovirt-engine-webadmin-portal
Assignee: Tomas Jelinek <tjelinek>
Status: CLOSED ERRATA
QA Contact: Pavel Novotny <pnovotny>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 3.2.0
CC: acathrow, bazulay, cboyle, ecohen, grajaiya, hateya, iheim, jkt, lpeer, lyarwood, mavital, michal.skrivanek, pnovotny, rcyriac, Rhev-m-bugs, rhs-bugs, scohen, shaines, sputhenp, surs, tjelinek, yeylon
Target Milestone: ---
Keywords: Regression, ZStream
Target Release: 3.2.5
Hardware: All
OS: Linux
Whiteboard: virt
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
In a hypervisor environment, when a VM was cloned from the snapshot of another VM, the new VM booted up fine, but the original VM from which the snapshot was taken was rendered unbootable. The failure message on boot-up was: "VM <VM name> is down. Exit message: unsupported configuration: non-primary video device must be type of 'qxl'. Failed to run VM <VM name> on Host <Hypervisor name>." There was a chance that the devices of the original VM were not cloned to the new VM but rewritten over the original, which corrupted the original VM. This has been fixed by cloning the devices to the new VM, so the original VM is no longer corrupted. (A conceptual sketch of the two behaviours follows the header fields below.)
Story Points: ---
Clone Of: 982636
Environment:
Last Closed: 2013-12-18 14:09:23 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 982636    
Bug Blocks:    
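
The Doc Text above describes the fix as cloning the device records to the new VM instead of rewriting them over the original. A conceptual sketch of that difference, in Python, with purely illustrative data structures and function names (this is not the actual ovirt-engine code):

------------------------------------------------------

# Conceptual illustration only -- not the actual ovirt-engine code.
# Device records are modelled as plain dicts; all names are hypothetical.
import copy
import uuid

def clone_devices_broken(original_vm, new_vm):
    # Broken behaviour: the original VM's device records are re-parented
    # to the clone, so the original VM is left without a consistent
    # device configuration (e.g. its video device).
    for device in original_vm["devices"]:
        device["vm_id"] = new_vm["id"]
    new_vm["devices"] = original_vm["devices"]
    original_vm["devices"] = []

def clone_devices_fixed(original_vm, new_vm):
    # Fixed behaviour: each device record is copied and given a new id,
    # leaving the original VM's configuration untouched.
    new_vm["devices"] = []
    for device in original_vm["devices"]:
        cloned = copy.deepcopy(device)
        cloned["id"] = str(uuid.uuid4())
        cloned["vm_id"] = new_vm["id"]
        new_vm["devices"].append(cloned)

------------------------------------------------------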

Description rhev-integ 2013-10-08 14:37:14 UTC
+++ This bug is a RHEV-M zstream clone. The original bug is: +++
+++   https://bugzilla.redhat.com/show_bug.cgi?id=982636. +++
+++ Requested by "idith" +++
======================================================================



----------------------------------------------------------------------
Following comment by rcyriac on July 09 at 13:13:13, 2013

Description of problem:
In a RHEV-RHS environment, when a VM is cloned from the snapshot of another VM, the new VM boots up fine, but the original VM from which the snapshot was taken is rendered unbootable. The failure message on boot-up is:

------------------------------------------------------

VM <VM name> is down. Exit message: unsupported configuration: non-primary video device must be type of 'qxl'.

Failed to run VM <VM name> on Host <Hypervisor name>.

------------------------------------------------------  
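
The libvirt error above is emitted when a domain's XML ends up with more than one video device and a non-primary one is not of type 'qxl'. A minimal inspection sketch, assuming the libvirt Python bindings are available on the hypervisor and the domain is still defined there (RHEV domains are transient, so this only works while vdsm has the domain defined); the domain name is an example:

------------------------------------------------------

# Sketch: list the video devices libvirt sees for a domain.
import libvirt
import xml.etree.ElementTree as ET

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("quizzac1")        # example domain name
root = ET.fromstring(dom.XMLDesc(0))

for video in root.findall("./devices/video"):
    model = video.find("model")
    # libvirt requires every non-primary video device to be 'qxl'
    print(model.get("type"), model.attrib)

conn.close()

------------------------------------------------------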

The only way to rescue the original VM is to commit the snapshot. This will make the VM bootable. However, it results in the loss of all data created after the snapshot was taken.

* The issue is reproducible for Live Snapshot and for Snapshot taken after VM shut-down.

* The issue is reproducible for VM with 'pre-allocated' disk, and for VM with 'thin-provision' disk.

* The issue is reproducible when the image store Storage Domain is backed by an RHS volume of type distribute-replicate, and by an RHS volume of type pure replicate.

Version-Release number of selected component (if applicable):

RHEVM: 3.2 (3.2.0-11.37.el6ev)

RHS: 2.0+ (glusterfs-server-3.3.0.11rhs-1.el6rhs.x86_64)

Hypervisors: RHEL 6.4 and RHEVH 6.4 with glusterfs-3.3.0.11rhs-1.el6.x86_64 and glusterfs-fuse-3.3.0.11rhs-1.el6.x86_64


How reproducible:
Always Reproducible.

Steps to Reproduce:

1. Create VM

2. Seal VM for cloning (remove unique identifiers like hostname, MAC address, etc.)

3. Create Live Snapshot, or shut-down the VM and create Snapshot

4. Clone VM from snapshot (a REST API sketch of this step appears after step 7 below)

5. After clone process is completed, boot up original VM. This fails with message:

------------------------------------------------------

VM quizzac1 is down. Exit message: unsupported configuration: non-primary video device must be type of 'qxl'.

Failed to run VM quizzac1 on Host RHEVH6.4-rhs-gp-srv15.

------------------------------------------------------

6. Boot up the cloned VM. It will boot up okay.

7. Attempt to rescue Original VM

a) If the snapshot is deleted from the original VM, it is still not bootable and shows the same error messages. The VM is now irrecoverably lost!

b) Switch on 'Preview Mode' for the snapshot of the original VM. When run in preview mode, the VM boots up okay, but, as expected, it has no data created after the snapshot was taken.

c) Shut down the VM from 'Preview Mode' and choose 'Undo Preview' to restore from the snapshot. The VM is then not bootable, showing the same error messages.

d) Switch on 'Preview Mode' for the snapshot of the original VM again. Confirm it boots okay, then shut it down. Choose 'Commit Preview' for the snapshot. The original VM is then bootable again, but the data created after the snapshot was taken is now lost forever!
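
For reference, step 4 can also be performed through the RHEV 3.x REST API instead of the webadmin portal. A rough sketch in Python using the requests library; the engine URL, credentials, and snapshot ID are placeholders, and the exact XML body should be checked against the REST API guide for the installed version:

------------------------------------------------------

# Sketch: clone a VM from a snapshot via the RHEV 3.x REST API.
import requests

ENGINE = "https://rhevm.example.com/api"                # placeholder
SNAPSHOT_ID = "00000000-0000-0000-0000-000000000000"    # placeholder

body = """
<vm>
  <name>cloned_vm</name>
  <cluster><name>Default</name></cluster>
  <snapshots>
    <snapshot id="{snapshot_id}"/>
  </snapshots>
</vm>
""".format(snapshot_id=SNAPSHOT_ID)

resp = requests.post(
    ENGINE + "/vms",
    data=body,
    headers={"Content-Type": "application/xml"},
    auth=("admin@internal", "password"),                # placeholder credentials
    verify=False,                                       # lab setup, self-signed certificate
)
print(resp.status_code, resp.text)

------------------------------------------------------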

Actual results:
After a VM is cloned from the snapshot of another VM, the original VM is corrupted, and is recoverable only with the loss of the data created after the snapshot was taken.

Expected results:
After a VM is cloned from the snapshot of another VM, the original VM should be bootable to its state before shut-down.

Additional info:

----------------------------------------------------------------------
Following comment by pm-rhel on July 09 at 13:31:54, 2013

This bug report has Keywords: Regression or TestBlocker.

Since no regressions or test blockers are allowed between releases,
it is also being identified as a blocker for this release.

Please resolve ASAP.

----------------------------------------------------------------------
Following comment by rcyriac on July 09 at 13:37:33, 2013

RHEV-RHS Test Environment Summary
---------------------------------

Versions:
--------

RHEVM: 3.2 (3.2.0-11.37.el6ev)

RHS: 2.0+ (glusterfs-server-3.3.0.11rhs-1.el6rhs.x86_64)

Hypervisors: RHEL 6.4 and RHEVH 6.4 with glusterfs-3.3.0.11rhs-1.el6.x86_64 and glusterfs-fuse-3.3.0.11rhs-1.el6.x86_64


Systems:
-------

RHEVM:
mcqueen.lab.eng.blr.redhat.com

Hypervisors:
RHEVH6.4:
rhs-gp-srv15.lab.eng.blr.redhat.com
RHEL6.4:
rhs-gp-srv12.lab.eng.blr.redhat.com
rhs-client10.lab.eng.blr.redhat.com

RHS:
rhs-client45.lab.eng.blr.redhat.com
rhs-client37.lab.eng.blr.redhat.com
rhs-client15.lab.eng.blr.redhat.com
rhs-client4.lab.eng.blr.redhat.com


RHS Volume Info:
---------------

Volume Name: boom
Type: Distributed-Replicate
Volume ID: 9754cbfc-b1d7-4ace-b15b-9225076c321b
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: rhs-client45.lab.eng.blr.redhat.com:/rhs/brick1/boom
Brick2: rhs-client37.lab.eng.blr.redhat.com:/rhs/brick1/boom
Brick3: rhs-client15.lab.eng.blr.redhat.com:/rhs/brick2/boom
Brick4: rhs-client4.lab.eng.blr.redhat.com:/rhs/brick2/boom
Brick5: rhs-client45.lab.eng.blr.redhat.com:/rhs/brick3/boom
Brick6: rhs-client37.lab.eng.blr.redhat.com:/rhs/brick3/boom
Brick7: rhs-client15.lab.eng.blr.redhat.com:/rhs/brick4/boom
Brick8: rhs-client4.lab.eng.blr.redhat.com:/rhs/brick4/boom
Brick9: rhs-client45.lab.eng.blr.redhat.com:/rhs/brick5/boom
Brick10: rhs-client37.lab.eng.blr.redhat.com:/rhs/brick5/boom
Brick11: rhs-client15.lab.eng.blr.redhat.com:/rhs/brick6/boom
Brick12: rhs-client4.lab.eng.blr.redhat.com:/rhs/brick6/boom
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
network.remote-dio: on
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off

System sosreports:
-----------------

Available at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/982636/09July2013/sosreports/

----------------------------------------------------------------------
Following comment by rcyriac on July 11 at 06:54:28, 2013

I am going to run a test to check whether this issue is limited to POSIX compliant FS Storage Domains, i.e. RHS-volume-backed.

Will update BZ with result.

Cheers!

rejy (rmc)

----------------------------------------------------------------------
Following comment by pm-rhel on July 11 at 12:21:15, 2013

This bug report previously had all acks and release flag approved.
However since at least one of its acks has been changed, the
release flag has been reset to ? by the bugbot (pm-rhel).  The
ack needs to become approved before the release flag can become
approved again.

----------------------------------------------------------------------
Following comment by rcyriac on July 11 at 12:35:21, 2013

I have tried to reproduce the issue on a RHEV Data Center with a pure NFS Storage Domain used as the image store, to determine whether the scope of the bug is limited to RHS volumes.

The result is that the issue *is always* reproducible when using a pure NFS Storage Domain, in the same way as when an RHS-backed Storage Domain is used. So it looks like this may be caused by a regression in RHEV.

The issue description remains valid whether a pure NFS or an RHS volume is used for the Storage Domain.

Versions used for the current test:

RHEVM: 3.2 (3.2.0-11.37.el6ev)

Hypervisor: RHEL 6.4

Storage Domain: NFS share from RHEL 6.4 system

P.S. I can provide any additional information needed, and assist with any data collection and testing, if required.

- rejy (rmc)

----------------------------------------------------------------------
Following comment by pm-rhel on July 11 at 12:53:25, 2013

This request has been proposed as a blocker, but a release flag has
not been requested. Please set a release flag to ? to ensure we may
track this bug against the appropriate upcoming release, and reset
the blocker flag to ?.

----------------------------------------------------------------------
Following comment by rcyriac on July 11 at 14:30:10, 2013

I played around with this bug some more and managed to narrow down the environment factors leading to the issue, and the steps to get the original VM back.

Factors leading to issue:
------------------------

The issue is related to the Console Protocol of the original VM. It appears to occur only if the 'VNC' Console Protocol is chosen for the VM from which the snapshot is taken. Then, when the VM is shut down and a VM is cloned off the snapshot, the original VM refuses to boot up, with the message:

VM <VM name> is down. Exit message: unsupported configuration: non-primary video device must be type of 'qxl'.

A way to get the original VM back is to preview and commit, on the original VM, the same snapshot that was used to clone the other VM. But that results in the loss of all data created after the snapshot was taken.

Discovered Now! Steps to get back the original VM with data intact:
------------------------------------------------------------------

After a VM is cloned off the snapshot, and while the original VM remains shut down, change the Console Protocol to 'Spice', and click on 'OK' to save. Then you may start up the original VM straight away, or again edit the Console Protocol back to 'VNC', save, and then start up the original VM. Either way, the original VM boots up fine, with all data intact, even from the period after the snapshot was taken.

Now I am not sure if this is a regression, or if the issue was always there and we never hit it, because there is a chance we never used the combination of VNC console protocol, VM snapshot, and VM cloning before!

Hope this helps in weeding out the cause of the issue. :-)

Cheers!

- rejy (rmc)

----------------------------------------------------------------------
Following comment by pm-rhel on July 11 at 14:52:27, 2013

This bug report has Keywords: Regression or TestBlocker.

Since no regressions or test blockers are allowed between releases,
it is also being identified as a blocker for this release.

Please resolve ASAP.

----------------------------------------------------------------------
Following comment by rcyriac on July 12 at 07:00:23, 2013

Just to clarify the 'Factors leading to issue' part of comment 7:

The issue occurs only if 'VNC' is the 'Console Protocol' of the original VM *at the time* the snapshot to be used for VM cloning is created. Later on, even if the original VM's 'Console Protocol' is changed to 'Spice', the issue will still occur as soon as that specific snapshot is used to clone a VM.

----------------------------------------------------------------------
Following comment by tjelinek on July 17 at 15:12:27, 2013

merged u/s: 58720c1faf1d95f4e869b662fcbe2b3bd9f02889

----------------------------------------------------------------------
Following comment by jturner on July 24 at 11:33:42, 2013

Moving to Target Milestone of Beta1.

----------------------------------------------------------------------
Following comment by dbotzer on September 08 at 08:37:17, 2013

Fixed in 3.3/is12. Both VMs, the original and the one cloned from the snapshot, are working OK.

----------------------------------------------------------------------
Following comment by lyarwood on October 08 at 10:10:58, 2013

Attaching SFDC#00955734 and setting rhevm-3.2.z? given this looks like a clear hit.

Comment 4 Charlie 2013-11-28 00:41:58 UTC
This bug is currently attached to errata RHBA-2013:16431. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.

Comment 5 Pavel Novotny 2013-12-04 17:22:25 UTC
Verified in rhevm-3.2.5-0.48.el6ev.noarch (sf22)

Verification steps:
1. Create VM "original" and run it.
2. Create live snapshot and from it clone new VM "cloned-live".
3. Shut down VM "original", create another snapshot and from it clone second VM "cloned-offline".
4. Boot up both "cloned-*" VMs and check via console if they work correctly.
5. Remove both snapshots created in steps 2 and 3 and check the "cloned-*" VMs again (do so also after both VMs are rebooted).

All VMs (the original one and both clones) were operating normally, i.e., no disk corruption was experienced and no errors/warnings were reported by RHEVM or in the guest OS.

Comment 7 errata-xmlrpc 2013-12-18 14:09:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1831.html