Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 923366

Summary: [engine-backend] engine keeps requesting vdsm to create snapshot when the creation fails (live snapshot)
Product: Red Hat Enterprise Virtualization Manager
Reporter: Elad <ebenahar>
Component: ovirt-engine
Assignee: Maor <mlipchuk>
Status: CLOSED CURRENTRELEASE
QA Contact: Elad <ebenahar>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.2.0
CC: abaron, acathrow, amureini, dron, dyasny, hateya, iheim, lpeer, Rhev-m-bugs, scohen, sgrinber, yeylon, ykaul
Target Milestone: ---
Flags: abaron: Triaged+
Target Release: 3.2.0
Hardware: x86_64
OS: Unspecified
Whiteboard: storage
Fixed In Version: sf12
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  vdsm+rhevm logs (flags: none)
  vdsm log (flags: none)
  on SPM (flags: none)
  on engine (flags: none)

Description Elad 2013-03-19 16:40:03 UTC
Created attachment 712793 [details]
vdsm+rhevm logs

Description of problem:

When the vdsmd service is stopped while the engine is sending a create-snapshot request to vdsm and the creation fails, the engine keeps re-sending the create-snapshot request to vdsm indefinitely.

Version-Release number of selected component (if applicable):

vdsm-4.10.2-11.0.el6ev.x86_64
rhevm-backend-3.2.0-10.14.beta1.el6ev.noarch
libvirt-0.10.2-18.el6_4.2.x86_64


How reproducible:
100%

Steps to Reproduce:
On a setup with 2 hosts and 1 iSCSI domain:
1. Create a VM with 3 disks.
2. Make sure the VM runs on the HSM host.
3. Create a snapshot of the VM and immediately stop the vdsmd service on the HSM host.
  
Actual results:
The engine keeps sending the create snapshot command to vdsm even though the vdsm service is down on the host.

Expected results:
The engine shouldn't send the create snapshot command to vdsm while the vdsm service is down on the host.

Additional info: see logs attached

on engine:


2013-03-19 18:12:14,020 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (pool-3-thread-48) [78571fb0] START, SnapshotVDSCommand(HostName = nott-vds2, HostId = 4a4cb868-e9e7-4192-b790-83defa5dd1bd, vmId=307dd290-8dba-47d4-b8dd-4d777a52821c), log id: ece236a
2013-03-19 18:12:14,024 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (pool-3-thread-48) [78571fb0] FINISH, SnapshotVDSCommand, log id: ece236a
2013-03-19 18:12:14,024 INFO  [org.ovirt.engine.core.utils.transaction.TransactionSupport] (pool-3-thread-48) [78571fb0] transaction rolled back
2013-03-19 18:12:14,026 INFO  [org.ovirt.engine.core.utils.transaction.TransactionSupport] (pool-3-thread-48) [78571fb0] transaction rolled back
2013-03-19 18:12:14,037 ERROR [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-3-thread-48) EntityAsyncTask::EndCommandAction [within thread]: EndAction for action type CreateAllSnapshotsFromVm threw an exception: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.ConnectException: Connection refused
	at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:169) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:1591) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand$2.runInTransaction(CreateAllSnapshotsFromVmCommand.java:226) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand$2.runInTransaction(CreateAllSnapshotsFromVmCommand.java:221) [engine-bll.jar:]
	at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:166) [engine-utils.jar:]
	at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:108) [engine-utils.jar:]
	at org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand.performLiveSnapshot(CreateAllSnapshotsFromVmCommand.java:221) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand$1.runInTransaction(CreateAllSnapshotsFromVmCommand.java:178) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand$1.runInTransaction(CreateAllSnapshotsFromVmCommand.java:166) [engine-bll.jar:]
	at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:202) [engine-utils.jar:]
	at org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand.endVmCommand(CreateAllSnapshotsFromVmCommand.java:166) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.VmCommand.endSuccessfully(VmCommand.java:224) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.internalEndSuccessfully(CommandBase.java:536) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.endActionInTransactionScope(CommandBase.java:483) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1511) [engine-bll.jar:]
	at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:202) [engine-utils.jar:]
	at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInRequired(TransactionSupport.java:141) [engine-utils.jar:]
	at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:110) [engine-utils.jar:]
	at org.ovirt.engine.core.bll.CommandBase.endAction(CommandBase.java:429) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.Backend.endAction(Backend.java:379) [engine-bll.jar:]
	at sun.reflect.GeneratedMethodAccessor450.invoke(Unknown Source) [:1.7.0_09-icedtea]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_09-icedtea]
	at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_09-icedtea]
	at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.invocation.InterceptorContext$Invocation.proceed(InterceptorContext.java:374) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.ovirt.engine.core.utils.ThreadLocalSessionCleanerInterceptor.injectWebContextToThreadLocal(ThreadLocalSessionCleanerInterceptor.java:11) [engine-utils.jar:]
	at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source) [:1.7.0_09-icedtea]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_09-icedtea]
	at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_09-icedtea]
	at org.jboss.as.ee.component.ManagedReferenceLifecycleMethodInterceptorFactory$ManagedReferenceLifecycleMethodInterceptor.processInvocation(ManagedReferenceLifecycleMethodInterceptorFactory.java:123) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:36) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ejb3.component.invocationmetrics.ExecutionTimeInterceptor.processInvocation(ExecutionTimeInterceptor.java:43) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.invocation.InitialInterceptor.processInvocation(InitialInterceptor.java:21) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ee.component.interceptors.ComponentDispatcherInterceptor.processInvocation(ComponentDispatcherInterceptor.java:53) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ejb3.component.singleton.SingletonComponentInstanceAssociationInterceptor.processInvocation(SingletonComponentInstanceAssociationInterceptor.java:53) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInNoTx(CMTTxInterceptor.java:209) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.as.ejb3.tx.CMTTxInterceptor.supports(CMTTxInterceptor.java:361) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:192) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ejb3.component.interceptors.CurrentInvocationContextInterceptor.processInvocation(CurrentInvocationContextInterceptor.java:41) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ejb3.component.interceptors.ShutDownInterceptorFactory$1.processInvocation(ShutDownInterceptorFactory.java:42) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:59) [jboss-as-ejb3.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:165) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
	at org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:181) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
	at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-2]
	at org.jboss.as.ee.component.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:72) [jboss-as-ee.jar:7.1.3.Final-redhat-4]
	at org.ovirt.engine.core.bll.interfaces.BackendInternal$$$view9.endAction(Unknown Source) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.EntityAsyncTask.EndCommandAction(EntityAsyncTask.java:137) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.EntityAsyncTask.access$000(EntityAsyncTask.java:25) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.EntityAsyncTask$1.run(EntityAsyncTask.java:106) [engine-bll.jar:]
	at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:71) [engine-utils.jar:]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [rt.jar:1.7.0_09-icedtea]
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) [rt.jar:1.7.0_09-icedtea]
	at java.util.concurrent.FutureTask.run(FutureTask.java:166) [rt.jar:1.7.0_09-icedtea]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_09-icedtea]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_09-icedtea]
	at java.lang.Thread.run(Thread.java:722) [rt.jar:1.7.0_09-icedtea]

Comment 1 Haim 2013-03-19 18:57:29 UTC
I would like to clarify the expected results. The overall user experience is bad: the engine rapidly retries the snapshot creation and, aside from the fact that the host is down, it floods the event log. I'm not familiar with this behaviour and it looks like a regression. DEV, please check this and add a flag if needed.

Comment 2 Maor 2013-03-20 00:05:01 UTC
The VDSM log is empty. Can you please reattach it?

Comment 3 Maor 2013-03-20 00:23:56 UTC
What type of solution would be best?
There are a few options:

1) On any exception from VDSM in the end command, simply end the command with failure. This will cause the live snapshot to fail most of the time, leaving the user working on the old snapshot.

2) Use a fixed number of retries, for example 3, and if the last one fails, the engine fails the action.
I think this solution would work well if it were part of the infrastructure layer for all async operations; that way the number of retries could also be made configurable by the user.

3) Only remove the flood in the event log, but keep trying to live-snapshot the machine.

I would appreciate other suggestions as well.
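The bounded-retry idea in option 2 can be sketched roughly as follows. This is a hypothetical illustration, not code from ovirt-engine; the class and method names are invented for the example:

```java
import java.util.function.Supplier;

// Hypothetical sketch of option 2: run a VDS command up to a fixed number
// of attempts, and rethrow the last failure so the caller can fail the
// action instead of retrying forever. Names are illustrative only.
public class BoundedRetry {
    public static <T> T run(Supplier<T> action, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e; // e.g. VDSNetworkException: Connection refused
            }
        }
        // All attempts failed: surface the last error so the engine can
        // end the command with failure (assumes maxAttempts >= 1).
        throw last;
    }
}
```

The retry count could then come from engine configuration, which is what making it part of the infrastructure layer would allow.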

Comment 4 Ayal Baron 2013-03-20 07:18:42 UTC
(In reply to comment #3)
> What type of solution will be best?
> There are few options:
> 
> 1) On any exception from VDSM at the end command, we will simply end the
> command with failure - That will cause the live snapshot to be failed most
> of the time, which will cause the user to keep working on the old snapshot.
> 
> 2) Use fixed amount of retries - For example 3 retries and if it will fail
> on the last one, engine will fail the action.

What would happen to the snapshot (on SPM) when the action fails?

Comment 5 Maor 2013-03-20 08:23:52 UTC
(In reply to comment #4)
> (In reply to comment #3)
> > What type of solution will be best?
> > There are few options:
> > 
> > 1) On any exception from VDSM at the end command, we will simply end the
> > command with failure - That will cause the live snapshot to be failed most
> > of the time, which will cause the user to keep working on the old snapshot.
> > 
> > 2) Use fixed amount of retries - For example 3 retries and if it will fail
> > on the last one, engine will fail the action.
> 
> What would happen to the snapshot (on SPM) when the action fails?
When the action fails on the SPM, the create-snapshot operation will not end, but the events will not come at the same frequency as with the HSM, since the SPM will not poll the tasks as long as it remains responsive.

Comment 6 Elad 2013-03-20 09:37:55 UTC
Created attachment 713137 [details]
vdsm log

Comment 7 Ayal Baron 2013-03-20 10:03:44 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > (In reply to comment #3)
> > > What type of solution will be best?
> > > There are few options:
> > > 
> > > 1) On any exception from VDSM at the end command, we will simply end the
> > > command with failure - That will cause the live snapshot to be failed most
> > > of the time, which will cause the user to keep working on the old snapshot.
> > > 
> > > 2) Use fixed amount of retries - For example 3 retries and if it will fail
> > > on the last one, engine will fail the action.
> > 
> > What would happen to the snapshot (on SPM) when the action fails?
> When the action will fails on SPM, the operation of create snapshot will not
> end,  but the events will not be at the same frequency compared to HSM,
> since the SPM will not poll the tasks, if it will not be non-responsive.

That is not what I meant.
The engine sends 'snapshot' to the VM (on the HSM) and vdsm dies in the middle. Assuming we have X retries, if all of them fail, what would the rollback behaviour be?

Comment 8 Maor 2013-03-20 11:22:42 UTC
(In reply to comment #7)
> (In reply to comment #5)
> > (In reply to comment #4)
> > > (In reply to comment #3)
> > > > What type of solution will be best?
> > > > There are few options:
> > > > 
> > > > 1) On any exception from VDSM at the end command, we will simply end the
> > > > command with failure - That will cause the live snapshot to be failed most
> > > > of the time, which will cause the user to keep working on the old snapshot.
> > > > 
> > > > 2) Use fixed amount of retries - For example 3 retries and if it will fail
> > > > on the last one, engine will fail the action.
> > > 
> > > What would happen to the snapshot (on SPM) when the action fails?
> > When the action will fails on SPM, the operation of create snapshot will not
> > end,  but the events will not be at the same frequency compared to HSM,
> > since the SPM will not poll the tasks, if it will not be non-responsive.
> 
> That is not what I meant.
> engine sends 'snapshot' to VM (on hsm) and vdms dies in the middle. 
> Assuming we have X retries, if all of them fail, what would the rollback
> behaviour be?

There will not be a specific rollback, but the engine can show the user the following event log, USER_CREATE_LIVE_SNAPSHOT_FINISHED_FAILURE:
Failed to create live snapshot '${SnapshotName}' for VM '${VmName}'. VM restart is recommended

Comment 9 Elad 2013-03-20 11:27:56 UTC
The bug also reproduces when trying to create a snapshot of a VM that runs on the SPM.

Comment 10 Maor 2013-03-20 11:36:42 UTC
(In reply to comment #9)
> The bug reproduces also when trying to create snapshot from VM that runs on
> SPM

Elad, can you please add VDSM and engine logs for that? I suspect there will be far fewer events in this scenario.

Comment 11 Elad 2013-03-20 12:40:02 UTC
Created attachment 713218 [details]
on SPM

The relevant log lines start at 12:59.

Comment 12 Elad 2013-03-20 12:42:20 UTC
Created attachment 713219 [details]
on engine

The relevant log lines start at 12:59.

Comment 13 Maor 2013-03-20 13:05:39 UTC
(In reply to comment #10)
> (In reply to comment #9)
> > The bug reproduces also when trying to create snapshot from VM that runs on
> > SPM
> 
> Elad, can u please add a VDSM and engine log for that, I suspect it will
> have much less events in this scenario
Thanks for the logs. It seems the reason is that when you stop VDSM on the SPM, the engine re-elects another host to become the SPM instead.
The new SPM then keeps polling the tasks and tries to perform the snapshot on the HSM, which was the SPM before.
This scenario only reproduces when you have two hosts in the same DC.

Comment 14 Maor 2013-03-21 17:33:45 UTC
After discussing this with Barak and Ayal, it was decided that on any failure the engine will print an audit log as described in comment 8 and finish the operation.

Comment 15 Maor 2013-03-21 18:04:37 UTC
(In reply to comment #14)
> After discussed with Barak and Ayal, it was decided that on any failure
> engine will print an audit log as described in comment 8, and will finish
> the operation
By "any failure" I mean any VDS command failure, such as the Connection refused error described in this bug.
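The decided behaviour — catch the VDS failure once, emit the audit event, and finish the command instead of retrying — can be sketched roughly as below. This is an illustrative outline, not the actual patch; the class, interface, and method names are hypothetical:

```java
// Illustrative sketch of the decided fix: a single failure of the live
// snapshot VDS command is caught, the audit event is logged, and the
// command finishes rather than retrying indefinitely. Names are invented
// for the example and do not come from the ovirt-engine code base.
public class LiveSnapshotEnd {
    interface AuditLog { void log(String event); }

    // Returns true if the live snapshot succeeded; on any VDS failure it
    // logs USER_CREATE_LIVE_SNAPSHOT_FINISHED_FAILURE and ends anyway.
    public static boolean endLiveSnapshot(Runnable snapshotVdsCommand, AuditLog audit) {
        try {
            snapshotVdsCommand.run();
            return true;
        } catch (RuntimeException e) {
            // e.g. VDSNetworkException: Connection refused
            audit.log("USER_CREATE_LIVE_SNAPSHOT_FINISHED_FAILURE");
            return false; // command ends; no further retries
        }
    }
}
```

This matches the verified behaviour in comment 17, where a single warning is logged and the VM stays configured to the newly created snapshot.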

Comment 16 Maor 2013-03-25 01:05:34 UTC
Pushed upstream at http://gerrit.ovirt.org/#/c/13262/

Comment 17 Elad 2013-04-04 12:30:29 UTC
Tested on RHEVM-3.2 - SF12:
rhevm-3.2.0-10.17.master.el6ev.noarch
rhevm-backend-3.2.0-10.17.master.el6ev.noarch
vdsm-4.10.2-13.0.el6ev.x86_64
libvirt-0.10.2-18.el6_4.2.x86_64

The engine performs a rollback when it fails to create the snapshot during a vdsmd restart.


2013-04-04 15:15:20,703 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (pool-3-thread-47) [43fdbcc3] START, SnapshotVDSCommand(HostName = nott-vds2, HostId = 8fed1a90-79f3-4c6e-b26b-1f5a9e2cc
8e8, vmId=72f0ff63-60e1-4131-ab12-9993708502c3), log id: 477c736a
2013-04-04 15:15:20,706 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (pool-3-thread-47) [43fdbcc3] FINISH, SnapshotVDSCommand, log id: 477c736a
2013-04-04 15:15:20,706 WARN  [org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand] (pool-3-thread-47) [43fdbcc3] Wasnt able to live snpashot due to error: VdcBLLException: VdcBLLException: org.ovirt.engine.
core.vdsbroker.vdsbroker.VDSNetworkException: java.net.ConnectException: Connection refused. VM will still be configured to the new created snapshot

Comment 18 Itamar Heim 2013-06-11 08:30:06 UTC
3.2 has been released

Comment 19 Itamar Heim 2013-06-11 08:30:18 UTC
3.2 has been released

Comment 20 Itamar Heim 2013-06-11 08:32:24 UTC
3.2 has been released