Bug 1018867 - [RFE] provide better monitoring to allow to know to which disk the VM is writing to after failed LSM.
Status: NEW
Product: vdsm
Classification: oVirt
Component: RFEs (Show other bugs)
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Assigned To: Daniel Erez
QA Contact: Raz Tamir
Keywords: FutureFeature
Duplicates: 1018888 1024811
Depends On: 1258659
Blocks: 1018876
Reported: 2013-10-14 10:43 EDT by vvyazmin@redhat.com
Modified: 2017-11-26 07:58 EST (History)
Doc Type: Enhancement
Type: Bug
oVirt Team: Storage
ylavi: ovirt‑future?
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?


Attachments
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm (iSCSI) (4.85 MB, application/x-gzip)
2013-10-14 10:43 EDT, vvyazmin@redhat.com


External Trackers
oVirt gerrit 20986 (priority: None, status: ABANDONED, last updated: Never): vm: allow live snapshot to any volume descendant
Description vvyazmin@redhat.com 2013-10-14 10:43:10 EDT
Created attachment 812043 [details]
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm (iSCSI)

Description of problem:
LSM (live storage migration) fails when run again after the Storage Domains were disconnected.

Version-Release number of selected component (if applicable):
RHEVM 3.3 - IS18 environment:

Host OS: RHEL 6.5

RHEVM:  rhevm-3.3.0-0.25.beta1.el6ev.noarch
PythonSDK:  rhevm-sdk-python-3.3.0.15-1.el6ev.noarch
VDSM:  vdsm-4.13.0-0.2.beta1.el6ev.x86_64
LIBVIRT:  libvirt-0.10.2-27.el6.x86_64
QEMU & KVM:  qemu-kvm-rhev-0.12.1.2-2.412.el6.x86_64
SANLOCK:  sanlock-2.8-1.el6.x86_64

How reproducible:
Unknown

Steps to Reproduce:
1. Create an iSCSI Data Center with two hosts connected to multiple Storage Domains (SDs).
2. Create a VM from a template with an OS installed on it, and run it on the HSM host.
3. Start LSM of the VM disk, then block connectivity (via iptables) from the HSM host to all domains:
* HSM - non-operational
* VM - paused
4. Once the VM pauses, remove the iptables block from the HSM host:
* HSM - up
* VM - up and running; the OS runs and can be connected to without problems.
5. Run LSM of the VM disk again.

Actual results:
LSM fails to run.

Expected results:
LSM runs successfully.

Impact on user:
LSM fails to run.

Workaround:
None

Additional info:

/var/log/ovirt-engine/engine.log
2013-10-14 14:46:05,410 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (pool-5-thread-47) Failed in SnapshotVDS method
2013-10-14 14:46:05,410 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (pool-5-thread-47) Error code SNAPSHOT_FAILED and error message VDSGenericException: VDSErrorException: Failed to SnapshotVDS, error = Snapshot failed
2013-10-14 14:46:05,410 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (pool-5-thread-47) Command org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand return value 
 StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=48, mMessage=Snapshot failed]]
2013-10-14 14:46:05,410 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (pool-5-thread-47) HostName = tigris01.scl.lab.tlv.redhat.com
2013-10-14 14:46:05,410 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (pool-5-thread-47) Command SnapshotVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to SnapshotVDS, error = Snapshot failed
2013-10-14 14:46:05,410 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (pool-5-thread-47) FINISH, SnapshotVDSCommand, log id: 2f9f134e
2013-10-14 14:46:05,410 WARN  [org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand] (pool-5-thread-47) Wasnt able to live snapshot due to error: VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to SnapshotVDS, error = Snapshot failed (Failed with error SNAPSHOT_FAILED and code 48). VM will still be configured to the new created snapshot
2013-10-14 14:46:05,443 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-5-thread-47) Correlation ID: null, Call Stack: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to SnapshotVDS, error = Snapshot failed (Failed with error SNAPSHOT_FAILED and code 48)
        at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:122)
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33)
        at org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:1983)
        at org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand$3.runInTransaction(CreateAllSnapshotsFromVmCommand.java:361)
        at org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand$3.runInTransaction(CreateAllSnapshotsFromVmCommand.java:356)
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174)
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116)
        at org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand.performLiveSnapshot(CreateAllSnapshotsFromVmCommand.java:356)
        at org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand$1.runInTransaction(CreateAllSnapshotsFromVmCommand.java:243)
        at org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand$1.runInTransaction(CreateAllSnapshotsFromVmCommand.java:228)
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:210)
        at org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand.endVmCommand(CreateAllSnapshotsFromVmCommand.java:228)
        at org.ovirt.engine.core.bll.VmCommand.endSuccessfully(VmCommand.java:240)
        at org.ovirt.engine.core.bll.CommandBase.internalEndSuccessfully(CommandBase.java:620)
        at org.ovirt.engine.core.bll.CommandBase.endActionInTransactionScope(CommandBase.java:566)
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1898)
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:210)
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInRequired(TransactionSupport.java:149)
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:118)
        at org.ovirt.engine.core.bll.CommandBase.endAction(CommandBase.java:498)
        at org.ovirt.engine.core.bll.Backend.endAction(Backend.java:449)
        at sun.reflect.GeneratedMethodAccessor437.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72)

/var/log/vdsm/vdsm.log
Comment 3 Sergey Gotliv 2013-10-16 10:38:15 EDT
*** Bug 1018888 has been marked as a duplicate of this bug. ***
Comment 4 Federico Simoncelli 2013-10-29 13:01:33 EDT
2013-10-14 14:28:54,663 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (pool-5-thread-44) Error code SNAPSHOT_FAILED and error message VDSGenericException: VDSErrorException: Failed to SnapshotVDS, error = Snapshot failed
...
2013-10-14 14:28:54,664 WARN  [org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand] (pool-5-thread-44) Wasnt able to live snapshot due to error: VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to SnapshotVDS, error = Snapshot failed (Failed with error SNAPSHOT_FAILED and code 48). VM will still be configured to the new created snapshot
...
2013-10-14 14:28:54,701 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-5-thread-44) Correlation ID: null, Call Stack: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to SnapshotVDS, error = Snapshot failed (Failed with error SNAPSHOT_FAILED and code 48)

Auditlog Message (Warning): Failed to create live snapshot 'Auto-generated for Live Storage Migration' for VM 'vm-tra-tra'. VM restart is recommended.

When a live snapshot fails for an unknown or unrecoverable reason (as in this case), the VM is no longer running on the leaf volume, and many flows are no longer supported: for example, a subsequent live snapshot, and therefore live storage migration as well.

In the future we can try to handle more cases gracefully, but with this architecture there will always be some unrecoverable cases.
Comment 5 Federico Simoncelli 2013-10-30 07:21:28 EDT
We'll tackle this on the vdsm side by allowing a live snapshot even when the provided base volume is not the current active layer.
Comment 9 Federico Simoncelli 2013-11-06 10:36:57 EST
It actually turns out that maybe we can't cleanly address this on the vdsm side.

It seems to me that on live snapshot qemu-kvm is not capable of following one or multiple qcow2 layers between the new image and its base.

Regular live snapshot:

image1 <- qemu-kvm
image1 < image2 <- qemu-kvm

What we are trying to do here to fix the issue is:

image1 <- qemu-kvm
image1 < image2(empty) < image3 <-qemu-kvm

It actually turns out that qemu-kvm assumes that image3 is the direct descendant of image1 leaving image2 closed:

# lvs
...
  af0633b3-5367-4106-8108-702700b90d98 864df2d0-b022-435f-a9d1-b7ac0bd766bb -wi-ao---- 128.00m
  d9fc0a47-61ca-4fae-a170-dea10a4e61ed 864df2d0-b022-435f-a9d1-b7ac0bd766bb -wi-ao---- 128.00m
  ce9e7bd8-2715-43a4-871c-8cee20e46c39 864df2d0-b022-435f-a9d1-b7ac0bd766bb -wi-a----- 128.00m
  6510f3a6-7914-46a5-99e9-ae107badf9a7 864df2d0-b022-435f-a9d1-b7ac0bd766bb -wi-ao---- 128.00m
  7685c1c8-177d-4532-9ad4-c96c5e935043 864df2d0-b022-435f-a9d1-b7ac0bd766bb -wi-a----- 128.00m
  d2ef66ea-2de7-4991-82e5-ad29c4e36861 864df2d0-b022-435f-a9d1-b7ac0bd766bb -wi-ao---- 128.00m

(reordered, with the base at the top and the leaf at the bottom)

As shown above, when reproducing this issue (testing the fix), neither ce9e7bd8 nor 7685c1c8 (the empty volumes) is kept open by the qemu process; as additional proof, they can even be deactivated:

# lvchange -an /dev/864df2d0-b022-435f-a9d1-b7ac0bd766bb/{7685c1c8-177d-4532-9ad4-c96c5e935043,ce9e7bd8-2715-43a4-871c-8cee20e46c39}
# lvs
...
  af0633b3-5367-4106-8108-702700b90d98 864df2d0-b022-435f-a9d1-b7ac0bd766bb -wi-ao---- 128.00m
  d9fc0a47-61ca-4fae-a170-dea10a4e61ed 864df2d0-b022-435f-a9d1-b7ac0bd766bb -wi-ao---- 128.00m
  ce9e7bd8-2715-43a4-871c-8cee20e46c39 864df2d0-b022-435f-a9d1-b7ac0bd766bb -wi------- 128.00m
  6510f3a6-7914-46a5-99e9-ae107badf9a7 864df2d0-b022-435f-a9d1-b7ac0bd766bb -wi-ao---- 128.00m
  7685c1c8-177d-4532-9ad4-c96c5e935043 864df2d0-b022-435f-a9d1-b7ac0bd766bb -wi------- 128.00m
  d2ef66ea-2de7-4991-82e5-ad29c4e36861 864df2d0-b022-435f-a9d1-b7ac0bd766bb -wi-ao---- 128.00m
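
The check done by eye on the lvs listings above can be sketched in Python. This is only an illustration, not vdsm code: the function name `unopened_volumes` is hypothetical, and it assumes the default lvs column order (LV, VG, Attr, LSize), where the sixth character of the Attr field is "o" when the device is open.

```python
def unopened_volumes(lvs_output):
    """Return names of LVs whose Attr field lacks the 'o' (device open) flag.

    Assumes the default `lvs` columns: LV, VG, Attr, LSize.
    """
    result = []
    for line in lvs_output.splitlines():
        fields = line.split()
        # Skip the header, "..." lines, and anything without a full Attr field.
        if len(fields) < 4 or len(fields[2]) < 6:
            continue
        name, attr = fields[0], fields[2]
        if attr[5] != "o":
            result.append(name)
    return result


# Three rows copied from the listings above.
sample = """\
  af0633b3-5367-4106-8108-702700b90d98 864df2d0-b022-435f-a9d1-b7ac0bd766bb -wi-ao---- 128.00m
  ce9e7bd8-2715-43a4-871c-8cee20e46c39 864df2d0-b022-435f-a9d1-b7ac0bd766bb -wi-a----- 128.00m
  7685c1c8-177d-4532-9ad4-c96c5e935043 864df2d0-b022-435f-a9d1-b7ac0bd766bb -wi------- 128.00m
"""
print(unopened_volumes(sample))
```

Both an active-but-unopened volume ("-wi-a-----") and a deactivated one ("-wi-------") are reported, since neither carries the open flag.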


It seems to me that if this is not addressed in qemu and we still pick this direction, we'll end up with new corner cases. For example, what if the user then wants to live merge? It might fail, or the result might be unpredictable.

Kevin, any insight?
Comment 10 Ayal Baron 2013-11-11 10:38:07 EST
This will either require changes in qemu or changing the entire approach in engine.
Either way, I don't see this happening for 3.3
Comment 11 Kevin Wolf 2013-11-11 10:51:14 EST
(In reply to Federico Simoncelli from comment #9)
> As we see above trying to reproduce this issue (testing the fix) what
> happens is that both ce9e7bd8 and 7685c1c8 (empty volumes) are not kept open
> by the qemu process and as additional proof they can even be deactivated:
> [...]
> Kevin any insight?

Can you please describe in detail which QMP commands you sent to get into
this state?

Also, as I told you on IRC, qemu is definitely supposed to open all images in
the backing file chain. It keeps a file descriptor open for each of them.

I am not exactly sure what the "open" flag of lvs indicates. To get the
information from a bit closer to the qemu process, can you check if the backing
file doesn't appear in lsof either?
Comment 13 Federico Simoncelli 2013-11-11 16:46:42 EST
*** Bug 1024811 has been marked as a duplicate of this bug. ***
Comment 14 Federico Simoncelli 2013-11-14 06:08:58 EST
(In reply to Kevin Wolf from comment #11)
> (In reply to Federico Simoncelli from comment #9)
> > As we see above trying to reproduce this issue (testing the fix) what
> > happens is that both ce9e7bd8 and 7685c1c8 (empty volumes) are not kept open
> > by the qemu process and as additional proof they can even be deactivated:
> > [...]
> > Kevin any insight?
> 
> Can you please describe in detail which QMP commands you sent to get into
> this state?

# qemu-img create -f qcow2 /tmp/image1.qcow2 1G

# qemu-kvm -hda /tmp/image1.qcow2 -qmp stdio
{"QMP": {"version": {"qemu": {"micro": 1, "minor": 12, "major": 0}, "package": "(qemu-kvm-0.12.1.2)"}, "capabilities": []}}

{ "execute": "qmp_capabilities" }
{"return": {}}

# ls -l /proc/$(pgrep qemu-kvm)/fd | grep /tmp
lrwx------. 1 root root 64 Nov 14 05:57 7 -> /tmp/image1.qcow2

# qemu-img create -f qcow2 -b /tmp/image1.qcow2 /tmp/image2.qcow2 
Formatting '/tmp/image2.qcow2', fmt=qcow2 size=1073741824 backing_file='/tmp/image1.qcow2' encryption=off cluster_size=65536 

# qemu-img create -f qcow2 -b /tmp/image2.qcow2 /tmp/image3.qcow2 
Formatting '/tmp/image3.qcow2', fmt=qcow2 size=1073741824 backing_file='/tmp/image2.qcow2' encryption=off cluster_size=65536 

{ "execute": "blockdev-snapshot-sync", "arguments": { "device": "ide0-hd0", "snapshot-file": "/tmp/image3.qcow2", "mode": "existing" } }
{"return": {}}

# ls -l /proc/$(pgrep qemu-kvm)/fd | grep /tmp
lrwx------. 1 root root 64 Nov 14 06:03 15 -> /tmp/image3.qcow2
lr-x------. 1 root root 64 Nov 14 06:03 16 -> /tmp/image1.qcow2

image2.qcow2 is not open (even if it's part of the chain).
Restarting qemu-kvm:

# qemu-kvm -hda /tmp/image3.qcow2 -qmp stdio

# ls -l /proc/$(pgrep qemu-kvm)/fd | grep /tmp
lr-x------. 1 root root 64 Nov 14 06:06 10 -> /tmp/image1.qcow2
lrwx------. 1 root root 64 Nov 14 06:06 7 -> /tmp/image3.qcow2
lr-x------. 1 root root 64 Nov 14 06:06 9 -> /tmp/image2.qcow2
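
The `ls -l /proc/<pid>/fd` comparison above could be automated along these lines. This is a minimal sketch, assuming a Linux host with /proc mounted; the helper names `open_files` and `unopened_layers` are hypothetical and not part of vdsm or qemu.

```python
import os


def open_files(pid, prefix=""):
    """Return the set of paths that process `pid` holds open under `prefix`."""
    fd_dir = "/proc/%d/fd" % pid
    paths = set()
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # the descriptor was closed while we were iterating
        if target.startswith(prefix):
            paths.add(target)
    return paths


def unopened_layers(pid, chain):
    """Return the layers of `chain` that process `pid` does not hold open."""
    held = open_files(pid)
    return [path for path in chain if path not in held]
```

Called with the qemu-kvm pid and the chain /tmp/image1.qcow2, /tmp/image2.qcow2, /tmp/image3.qcow2, `unopened_layers` would be expected to report image2.qcow2 in the first state shown above and nothing after the restart.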
Comment 15 Kevin Wolf 2013-11-14 06:59:22 EST
(In reply to Federico Simoncelli from comment #14)
> # qemu-img create -f qcow2 -b /tmp/image1.qcow2 /tmp/image2.qcow2 
> Formatting '/tmp/image2.qcow2', fmt=qcow2 size=1073741824
> backing_file='/tmp/image1.qcow2' encryption=off cluster_size=65536 
> 
> # qemu-img create -f qcow2 -b /tmp/image2.qcow2 /tmp/image3.qcow2 
> Formatting '/tmp/image3.qcow2', fmt=qcow2 size=1073741824
> backing_file='/tmp/image2.qcow2' encryption=off cluster_size=65536 
> 
> { "execute": "blockdev-snapshot-sync", "arguments": { "device": "ide0-hd0",
> "snapshot-file": "/tmp/image3.qcow2", "mode": "existing" } }
> {"return": {}}

Oh, now I understand.

When you call 'blockdev-snapshot-sync' with mode=existing, you're making a
promise that the existing image file points to the old top-level image as its
backing file. qemu only opens that file and puts it on top of what's already
opened, without even looking at the backing file path stored in that image
file.

So if you want to add two files at once, you still need to issue two
'blockdev-snapshot-sync' commands.
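
As a rough sketch of that advice, the following Python builds one 'blockdev-snapshot-sync' command per new layer, oldest first, so that each file is stacked on the image that is the top of the chain at that moment. The function name `snapshot_commands` is hypothetical; the device name and paths are the ones used in comment #14, and sending the commands over a QMP connection is left out.

```python
import json


def snapshot_commands(device, new_layers):
    """Build one blockdev-snapshot-sync command per layer, oldest layer first.

    Each layer must already exist and name the previous top of the chain as
    its backing file, which is what mode=existing promises to qemu.
    """
    return [
        {
            "execute": "blockdev-snapshot-sync",
            "arguments": {
                "device": device,
                "snapshot-file": path,
                "mode": "existing",
            },
        }
        for path in new_layers
    ]


for command in snapshot_commands(
    "ide0-hd0", ["/tmp/image2.qcow2", "/tmp/image3.qcow2"]
):
    print(json.dumps(command))
```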
Comment 16 Ayal Baron 2013-12-18 04:53:52 EST
Fede, any update on this one?
Comment 17 Federico Simoncelli 2014-02-17 09:16:34 EST
After fixing other bugs related to live snapshots and LSM (bug 1018876 and bug 957703) the severity of this one has been very much reduced.

The engine is now much more reliable at handling negative flows, and as I recall the only way I found to trigger this was by setting multiple unlikely breakpoints in the engine.

I think we should keep this bug for reference, as we know that the issue is theoretically present and it may be addressable in the future in some clean way (for example when/if the engine is able to monitor which layer is the VM's current active layer).
Comment 18 Sean Cohen 2014-06-11 04:45:08 EDT
(In reply to Federico Simoncelli from comment #17)
> After fixing other bugs related to live snapshots and LSM (bug 1018876 and
> bug 957703) the severity of this one has been very much reduced.
> 
> Engine now is much more reliable handling negative flows and I remember that
> the only way I found to trigger this was by setting multiple unlikely
> breakpoints in the engine.
> 
> I think we should keep this bug for reference as we know that the issue is
> theoretically present and it may be addressable in the future in some clean
> way (for example when/if the engine will be able to monitor what's the
> current active layer of the VM).

Acked, moving it to future tracking
Sean
Comment 20 Christopher Pereira 2015-08-03 20:57:02 EDT
A live storage migration task failed due to a network error:

    2015-08-03 21:23:16,437 WARN  [org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand] (org.ovirt.thread.pool-12-thread-45) [] Could not perform live snapshot due to error, VM will still be configured to the new created snapshot: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)
    2015-08-03 21:23:16,450 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-12-thread-45) [] Correlation ID: null, Call Stack: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)
            at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:117)
            at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33)
            at org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:2029)
            at org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand$2.runInTransaction(CreateAllSnapshotsFromVmCommand.java:400)
            ...

As a consequence, Engine is showing a failed snapshot as "Current", while libvirt is still reporting the previous correct snapshot.
I guess next time Engine will probably try to resume the failed snapshot and VM won't start anymore.
What is the correct way to solve this issue?
Comment 21 Daniel Erez 2015-08-04 02:42:22 EDT
(In reply to Christopher Pereira from comment #20)
> A live storage migration task failed due to a network error:
> 
>     2015-08-03 21:23:16,437 WARN 
> [org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand]
> (org.ovirt.thread.pool-12-thread-45) [] Could not perform live snapshot due
> to error, VM will still be configured to the new created snapshot:
> VdcBLLException:
> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
> VDSGenericException: VDSNetworkException: Message timeout which can be
> caused by communication issues (Failed with error VDS_NETWORK_ERROR and code
> 5022)
>     2015-08-03 21:23:16,450 WARN 
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (org.ovirt.thread.pool-12-thread-45) [] Correlation ID: null, Call Stack:
> org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException:
> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
> VDSGenericException: VDSNetworkException: Message timeout which can be
> caused by communication issues (Failed with error VDS_NETWORK_ERROR and code
> 5022)
>             at
> org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:117)
>             at
> org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.
> RunVdsCommand(VDSBrokerFrontendImpl.java:33)
>             at
> org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:2029)
>             at
> org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand$2.
> runInTransaction(CreateAllSnapshotsFromVmCommand.java:400)
>             ...
> 
> As a consequence, Engine is showing a failed snapshot as "Current", while
> libvirt is still reporting the previous correct snapshot.
> I guess next time Engine will probably try to resume the failed snapshot and
> VM won't start anymore.
> What is the correct way to solve this issue?

The error was in live snapshot phase, which means that a VM restart is recommended (using the created snapshot might cause data inconsistency). Next time, the engine won't resume the failed snapshot but will try to create a new one instead. The VM should still be able to start.
Comment 22 Christopher Pereira 2015-08-05 00:36:52 EDT
Thanks Daniel,

I first tried restarting Engine, but the snapshot list is still wrong (BUG).

The first snapshots in the list are bogus entries caused by the network error (in this case Engine is controlling hosts overseas :-).

The snapshots list displays:

Current : Active VM <--- Wrong! failed probably because of previous error
Date 3 : Auto-generated for Live Storage Migration <--- Wrong! failed network
Date 2 : Auto-generated for Live Storage Migration <---- Real current 'snapshot' (according to libvirt)
Date 1 : Original image

The problem with this bug is that it disables most oVirt features.
It also sounds dangerous to have oVirt confused about the current snapshot.
Do we exactly know why restarting the VM is necessary?

Can we avoid restarting the VM?
In general, restarting a VM is not an option.

Maybe we just need to move the code that creates the snapshot entries in the DB/cache to after the point where the newly created snapshot has been successfully set in libvirt/QEMU as the new current snapshot.
Comment 23 Christopher Pereira 2015-08-05 01:55:47 EDT
I confirm that after restarting the VM, it points to the last snapshot.
It would be desirable not to have to restart the VM.
Comment 24 Christopher Pereira 2015-08-31 19:58:13 EDT
Added: https://bugzilla.redhat.com/show_bug.cgi?id=1258659
Comment 25 Yaniv Lavi 2015-10-29 09:50:42 EDT
Why is this targeted to future?
Comment 26 Allon Mureinik 2015-10-29 10:02:38 EDT
(In reply to Yaniv Dary from comment #25)
> Why is this targeted to future?

It's more of an RFE than a bug. If a live snapshot fails, we have no way of knowing which volumes the VM really writes to, and a restart is required.
When the flow is rewritten (e.g., during the SDM effort), we MAY have a better solution.
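
For illustration only, one way to see which volume a running VM writes to is to read the disk sources from the libvirt domain XML (for example, the output of `virsh dumpxml`). The sketch below is an assumption about how such monitoring could look, not the engine's or vdsm's implementation; `active_disk_sources` is a hypothetical helper, and the sample XML is hand-written using the volume group and leaf volume names from comment #9.

```python
import xml.etree.ElementTree as ET


def active_disk_sources(domain_xml):
    """Map each disk target (e.g. 'vda') to the path of its active layer."""
    sources = {}
    root = ET.fromstring(domain_xml)
    for disk in root.findall("./devices/disk"):
        source = disk.find("source")
        target = disk.find("target")
        if source is None or target is None:
            continue  # e.g. an empty CD-ROM drive
        # File-backed disks use 'file'; block-backed disks use 'dev'.
        sources[target.get("dev")] = source.get("file") or source.get("dev")
    return sources


# Hypothetical sample; a real document would come from libvirt.
SAMPLE_XML = """
<domain type='kvm'>
  <name>vm-tra-tra</name>
  <devices>
    <disk type='block' device='disk'>
      <source dev='/dev/864df2d0-b022-435f-a9d1-b7ac0bd766bb/d2ef66ea-2de7-4991-82e5-ad29c4e36861'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <disk type='file' device='cdrom'>
      <target dev='hdc' bus='ide'/>
    </disk>
  </devices>
</domain>
"""
print(active_disk_sources(SAMPLE_XML))
```

Comparing the reported path with the leaf volume the engine has recorded would show whether the two views disagree after a failed live snapshot.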
Comment 27 Yaniv Lavi 2017-11-26 07:32:47 EST
Are we better positioned to fix this in 4.3?
Comment 28 Allon Mureinik 2017-11-26 07:58:31 EST
(In reply to Yaniv Lavi from comment #27)
> Are we better positioned to fix this in 4.3?

We made no improvement in that direction, unfortunately.
