Description of problem:

1. VM is running with volume A.
2. The Snapshot button is pressed.
3. Volume B is created on the SPM.
4. VM.Snapshot is sent to the host running the VM.
5. VM.Snapshot fails for whatever reason; we display this message and take no other action:

   Failed to create live snapshot 'X' for VM 'Y'. VM restart is recommended. Note that using the created snapshot might cause data inconsistency.

Even if the user notices the message and shuts the VM down shortly after, that is not good enough. Something else needs to be done, because the following sequence of events may take place:

1. Volume B is created for the snapshot.
2. The active-layer change from A to B fails.
3. A (QCOW2) is extended.
4. The VM is shut down (still running with A).

Now the user powers the VM up and it comes up with volume B.

B virtual size is X; A virtual size is X + extension.

Result: when running with B, the VM sees a corrupted filesystem because it cannot access the extension. The VM fails to boot, production is down, and the problem is hard to diagnose.

This is basically what happens:

# qemu-img create -f qcow2 base.qcow2 1G                          (initial volume)
# qemu-img create -f qcow2 -o backing_file=base.qcow2 leaf.qcow2  (snapshot)
# qemu-img info leaf.qcow2 | grep "virtual size"
virtual size: 1.0G (1073741824 bytes)
# qemu-img resize base.qcow2 +1G                                  (disk extension before shutdown)
# qemu-img info leaf.qcow2 | grep "virtual size"
virtual size: 1.0G (1073741824 bytes)   <--- wrong!

So base is now 2G, but the leaf still reports a 1G virtual size. As a result, when the VM runs with the leaf, it only sees the first 1G, and it is all corrupted.

And it does not have to be the base: since the base is sometimes raw, that particular case can be avoided, but everywhere else in the chain the images are qcow2, so this can be hit at any level.

Version-Release number of selected component (if applicable):
rhevm-4.0.5.5-0.1.el7ev.noarch
vdsm-4.18.13-1.el7ev.x86_64

How reproducible:
Need to make VM.Snapshot() fail.
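One way to detect the condition described above is to compare virtual sizes across the backing chain, as reported by `qemu-img info --output=json --backing-chain`. The following is a minimal sketch, not VDSM code; the JSON here is a hand-written stand-in for real qemu-img output, and `find_size_mismatches` is a hypothetical helper name:

```python
import json

# Stand-in for `qemu-img info --output=json --backing-chain leaf.qcow2`
# output after the base was extended behind the leaf's back.
CHAIN_JSON = """
[
  {"filename": "leaf.qcow2", "format": "qcow2", "virtual-size": 1073741824},
  {"filename": "base.qcow2", "format": "qcow2", "virtual-size": 2147483648}
]
"""

def find_size_mismatches(chain_json: str):
    """Return backing images whose virtual size exceeds the leaf's.

    The leaf is the first entry in the chain; any larger image behind it
    means the guest cannot address the extension through this leaf.
    """
    chain = json.loads(chain_json)
    leaf_size = chain[0]["virtual-size"]
    return [img["filename"] for img in chain[1:]
            if img["virtual-size"] > leaf_size]

print(find_size_mismatches(CHAIN_JSON))  # the extended base is flagged
```

A check like this could run before boot and refuse to start the VM on a mismatched chain instead of presenting a silently truncated disk.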
Actual results:
The VM appears corrupted.

Expected results:
I'm not sure what can be done, but something must be done, because the results are quite ugly. The current flow does not handle this the way it should. Failing this way is not acceptable, and the warning that the VM should be shut down is not enough: the VM may extend its disk soon after, and there is no control over that. I think we should roll back the failed snapshot, remove the new unused leaf (if the VM did not switch to it), and fail the snapshot completely, as if it had never been created, so that if the VM is powered off and booted again it picks up the base/previous image.
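The proposed rollback could look something like the sketch below. This is only an illustration of the intended logic, not the actual engine/VDSM API: `rollback_failed_snapshot`, its parameters, and the direct file removal are all hypothetical stand-ins for the real volume-management calls.

```python
import os

def rollback_failed_snapshot(active_volume: str, new_leaf: str) -> str:
    """Hypothetical rollback after a failed VM.Snapshot.

    If the VM is still writing to the old active volume (the active-layer
    switch never happened), discard the orphan leaf so that a later boot
    resolves to the old volume. Otherwise keep the leaf, since the VM has
    already switched to it.
    """
    if active_volume != new_leaf:
        # The VM never switched; the new leaf holds no guest data.
        if os.path.exists(new_leaf):
            os.remove(new_leaf)
        return active_volume  # chain head stays the old volume
    return new_leaf  # switch happened; the leaf must be kept
```

With this behavior, the dangerous window is closed: a later extension of the old volume is harmless because no stale overlay remains to be picked up on the next boot.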
Wait, something is still off. The virtual size should only change on an actual disk extension, not from normal qcow2 thin-provisioned growth. The logs for this have rotated; it seems to be happening, but perhaps not the way I described above. I will try to get more data.
Sorry, but this never happened. I don't know what did happen, because the logs have rotated. If I see it again with proper logs, I'll re-open.