Bug 1418571 - VM.Snapshot failure not handled well enough
Summary: VM.Snapshot failure not handled well enough
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.0.6
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.1.1
Target Release: ---
Assignee: Maor
QA Contact: meital avital
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-02-02 07:35 UTC by Germano Veit Michel
Modified: 2017-02-06 23:13 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-06 23:13:53 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:



Description Germano Veit Michel 2017-02-02 07:35:29 UTC
Description of problem:

1. VM is running with volume A
2. Snapshot Button is Pressed
3. Create volume B on SPM
4. VM.Snapshot sent to the Host running the VM
5. VM.Snapshot fails for whatever reason; we display this message and take no other action:
Failed to create live snapshot 'X' for VM 'Y'. VM restart is recommended. Note that using the created snapshot might cause data inconsistency.

Even if the user notices the message and shuts down the VM shortly afterwards, that is not enough. Something else needs to be done, because the following sequence of events may take place:

1. Create volume B for Snapshot
2. Active Layer change from A to B fails
3. A is extended (QCOW2)
4. VM is shut down (still running with A).

Now the user powers the VM up and it comes up with volume B.

B Virtual Size is X
A Virtual Size is X+Extension

Result:
When running with B, the VM sees a corrupted filesystem because it cannot access the extension. The VM fails to boot, production is down, and the problem is hard to diagnose.

This is basically what happens:

# qemu-img create -f qcow2 base.qcow2 1G (initial volume)
# qemu-img create -f qcow2 -o backing_file=base.qcow2 leaf.qcow2 (snapshot)
# qemu-img info leaf.qcow2 | grep "virtual size"
virtual size: 1.0G (1073741824 bytes)
# qemu-img resize base.qcow2 +1G (disk extension before shutdown)
# qemu-img info leaf.qcow2 | grep "virtual size"
virtual size: 1.0G (1073741824 bytes) <--- wrong!

So base is now 2G, but leaf still reports a 1G virtual size. As a result, when the VM runs with leaf, it only sees the first 1G and the filesystem is corrupted. And the extended volume doesn't need to be the base: since the base is sometimes raw, that particular case can be avoided, but everywhere else in the chain the volumes are qcow2, so this can be hit.
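
For reference, a minimal sketch (continuing the example above, outside of RHV) of how the mismatch shows up and could be corrected by hand, assuming the leaf was never actually used by the VM and 2G is the base's new virtual size:

# qemu-img info base.qcow2 | grep "virtual size"
virtual size: 2.0G (2147483648 bytes)
# qemu-img info leaf.qcow2 | grep "virtual size"
virtual size: 1.0G (1073741824 bytes) (the leaf did not follow the base extension)
# qemu-img resize leaf.qcow2 2G (grow the leaf to match the base's current virtual size)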

Version-Release number of selected component (if applicable):
rhevm-4.0.5.5-0.1.el7ev.noarch
vdsm-4.18.13-1.el7ev.x86_64

How reproducible:
Need to make VM.Snapshot() fail.
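
One possible way to force that (an assumption on my side, not how the customer hit it): block the host's access to the storage server right before requesting the snapshot, so the VM.Snapshot verb fails on the host after the new volume was already created on the SPM. STORAGE_SERVER_IP is a placeholder:

# iptables -I OUTPUT -d STORAGE_SERVER_IP -j DROP (on the host running the VM, break storage access)
(request a live snapshot from the Admin Portal; VM.Snapshot on the host should fail)
# iptables -D OUTPUT -d STORAGE_SERVER_IP -j DROP (restore access afterwards)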

Actual results:
VM filesystem appears corrupted and the VM fails to boot.

Expected results:
I'm not sure what can be done, but something must be done, as the results are quite ugly. The current flow is not handling this the way it should. It is not acceptable to fail this way, and just warning that the VM should be shut down is not enough, as the VM may extend its disk soon after that and there is no control over it.

I think we should roll back the failed snapshot, remove the new unused leaf (if the VM did not switch to it), and completely fail the snapshot, as if it had never been created. That way, if the VM is powered off and booted again, it picks up the base/previous image.
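
At the image level (using the same hypothetical base/leaf names as the example above), the rollback for the case where the VM never switched to the new leaf would amount to something like:

# rm leaf.qcow2 (drop the unused leaf; nothing references it yet)
(engine/VDSM side: remove the unused volume B from the image chain and from the engine database, so A stays the active layer)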

Comment 1 Germano Veit Michel 2017-02-02 08:39:54 UTC
Wait. There is something still off. The virtual size should only change if there is a disk extension, not just qcow2 thin-provisioned growth.
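
To illustrate the difference (my own sketch, output abbreviated): writing data into a thin qcow2 only grows its allocation (disk size), the virtual size stays fixed, and only an explicit resize changes it:

# qemu-img create -f qcow2 base.qcow2 1G
# qemu-io -c 'write 0 512M' base.qcow2 (thin-provisioned growth: allocation increases)
# qemu-img info base.qcow2 | grep -E "virtual size|disk size"
virtual size: 1.0G (1073741824 bytes) (unchanged)
disk size: ~512M
# qemu-img resize base.qcow2 +1G (actual disk extension)
# qemu-img info base.qcow2 | grep "virtual size"
virtual size: 2.0G (2147483648 bytes) (only now does the virtual size change)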

The logs for this have rotated; it seems to be happening, but it may not be in the way I described above.

Will try to get more data.

Comment 2 Germano Veit Michel 2017-02-06 23:13:53 UTC
Sorry, but this never happened. I don't know what actually happened because the logs have rotated. If I see it again with proper logs, I'll re-open.

