Bug 1764943

Summary:	Create Snapshot does not proceed beyond CreateVolume
Product:	Red Hat Enterprise Virtualization Manager	Reporter:	Germano Veit Michel <gveitmic>
Component:	ovirt-engine	Assignee:	Liran Rotenberg <lrotenbe>
Status:	CLOSED ERRATA	QA Contact:	Evelina Shames <eshames>
Severity:	high	Docs Contact:
Priority:	high
Version:	4.3.5	CC:	aefrat, bzlotnik, lrotenbe, michal.skrivanek, mkalinin, mperina, mzamazal, pelauter, rdlugyhe, rhodain, tgolembi, tnisan
Target Milestone:	ovirt-4.4.0	Keywords:	ZStream
Target Release:	---	Flags:	lsvaty: testing_plan_complete-
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Previously, while creating virtual machine snapshots, if the VDSM's command to freeze a virtual machines' file systems exceeded the snapshot command's 3-minute timeout period, creating snapshots failed, causing virtual machines and disks to lock. The current release adds two key-value pairs to the engine configuration. You can configure these using the engine-config tool: - Setting `LiveSnapshotPerformFreezeInEngine` to `true` enables the {engine-name} to freeze VMs' file systems before it creates a snapshot of them. - Setting `LiveSnapshotAllowInconsistent` to `true` enables the {engine-name} to continue creating snapshots if it fails to freeze VMs' file systems.	Story Points:	---
Clone Of:
Clones:	1842377 (view as bug list)		Environment:
Last Closed:	2020-08-04 13:20:56 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	Storage	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1842377

Description Germano Veit Michel 2019-10-24 05:34:30 UTC

Description of problem:

All snapshot operations were working fine. Then suddenly all new Snapshot creations only execute until the CreateVolume + ClearTask part.
SnapshotVDSCommand is never sent, almost all VMs/Disks on the environment are now locked.

See:

2019-10-20 04:03:28,100+02 INFO  [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (default task-3131) [42a32c13-f2ce-4ae3-9b1d-72f28c08e52b] Lock Acquired to object 'EngineLock:{exclusiveLocks='[8322c1e2-6c6e-45ef-9638-df56666c68c1=VM]', sharedLocks=''}'
2019-10-20 04:03:28,212+02 INFO  [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (default task-3131) [42a32c13-f2ce-4ae3-9b1d-72f28c08e52b] Running command: CreateSnapshotForVmCommand internal: false. Entities affected :  ID: 8322c1e2-6c6e-45ef-9638-df56666c68c1 Type: VMAction group MANIPULATE_VM_SNAPSHOTS with role type USER
2019-10-20 04:03:28,239+02 INFO  [org.ovirt.engine.core.bll.snapshots.CreateSnapshotDiskCommand] (default task-3131) [42a32c13-f2ce-4ae3-9b1d-72f28c08e52b] Running command: CreateSnapshotDiskCommand internal: true. Entities affected :  ID: 8322c1e2-6c6e-45ef-9638-df56666c68c1 Type: VMAction group MANIPULATE_VM_SNAPSHOTS with role type USER
2019-10-20 04:03:28,330+02 INFO  [org.ovirt.engine.core.bll.snapshots.CreateSnapshotCommand] (default task-3131) [42a32c13-f2ce-4ae3-9b1d-72f28c08e52b] Running command: CreateSnapshotCommand internal: true. Entities affected :  ID: 00000000-0000-0000-0000-000000000000 Type: Storage
2019-10-20 04:03:28,362+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (default task-3131) [42a32c13-f2ce-4ae3-9b1d-72f28c08e52b] START, CreateVolumeVDSCommand( CreateVolumeVDSCommandParameters:{storagePoolId='59fc7670-0090-006e-0176-0000000000f2', ignoreFailoverLimit='false', storageDomainId='37a25cc3-b9e7-4a91-9c72-3174899bbd68', imageGroupId='4414db9f-77e2-40bc-b12c-51ad5d197c49', imageSizeInBytes='59055800320', volumeFormat='COW', newImageId='cead93f6-5823-4684-a3a9-27f385d58d4c', imageType='Sparse', newImageDescription='', imageInitialSizeInBytes='0', imageId='716204d6-4471-4e7d-938d-0dbeff82a077', sourceImageGroupId='4414db9f-77e2-40bc-b12c-51ad5d197c49'}), log id: 33e667cb
2019-10-20 04:03:28,482+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (default task-3131) [42a32c13-f2ce-4ae3-9b1d-72f28c08e52b] FINISH, CreateVolumeVDSCommand, return: cead93f6-5823-4684-a3a9-27f385d58d4c, log id: 33e667cb
2019-10-20 04:03:28,488+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (default task-3131) [42a32c13-f2ce-4ae3-9b1d-72f28c08e52b] CommandAsyncTask::Adding CommandMultiAsyncTasks object for command '58b0d00d-6ad1-4542-a297-ec630fb4a486'
2019-10-20 04:03:28,488+02 INFO  [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (default task-3131) [42a32c13-f2ce-4ae3-9b1d-72f28c08e52b] CommandMultiAsyncTasks::attachTask: Attaching task 'c1a003ad-99aa-41aa-bd23-effad9d05bab' to command '58b0d00d-6ad1-4542-a297-ec630fb4a486'.
2019-10-20 04:03:28,505+02 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (default task-3131) [42a32c13-f2ce-4ae3-9b1d-72f28c08e52b] Adding task 'c1a003ad-99aa-41aa-bd23-effad9d05bab' (Parent Command 'CreateSnapshot', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters'), polling hasn't started yet..
2019-10-20 04:03:28,535+02 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (default task-3131) [42a32c13-f2ce-4ae3-9b1d-72f28c08e52b] BaseAsyncTask::startPollingTask: Starting to poll task 'c1a003ad-99aa-41aa-bd23-effad9d05bab'.
2019-10-20 04:03:28,633+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-3131) [42a32c13-f2ce-4ae3-9b1d-72f28c08e52b] EVENT_ID: USER_CREATE_SNAPSHOT(45), Snapshot '<removed>' creation for VM '<removed>' was initiated by <removed>.
2019-10-20 04:03:36,820+02 INFO  [org.ovirt.engine.core.bll.snapshots.CreateSnapshotCommand] (EE-ManagedThreadFactory-engine-Thread-27) [42a32c13-f2ce-4ae3-9b1d-72f28c08e52b] Command [id=58b0d00d-6ad1-4542-a297-ec630fb4a486]: Updating status to 'SUCCEEDED', The command end method logic will be executed by one of its parent commands.
2019-10-20 04:03:36,821+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-27) [42a32c13-f2ce-4ae3-9b1d-72f28c08e52b] CommandAsyncTask::HandleEndActionResult [within thread]: endAction for action type 'CreateSnapshot' completed, handling the result.
2019-10-20 04:03:36,821+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-27) [42a32c13-f2ce-4ae3-9b1d-72f28c08e52b] CommandAsyncTask::HandleEndActionResult [within thread]: endAction for action type 'CreateSnapshot' succeeded, clearing tasks.
2019-10-20 04:03:36,821+02 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engine-Thread-27) [42a32c13-f2ce-4ae3-9b1d-72f28c08e52b] SPMAsyncTask::ClearAsyncTask: Attempting to clear task 'c1a003ad-99aa-41aa-bd23-effad9d05bab'
2019-10-20 04:03:36,821+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (EE-ManagedThreadFactory-engine-Thread-27) [42a32c13-f2ce-4ae3-9b1d-72f28c08e52b] START, SPMClearTaskVDSCommand( SPMTaskGuidBaseVDSCommandParameters:{storagePoolId='59fc7670-0090-006e-0176-0000000000f2', ignoreFailoverLimit='false', taskId='c1a003ad-99aa-41aa-bd23-effad9d05bab'}), log id: 3057823c
2019-10-20 04:03:36,822+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (EE-ManagedThreadFactory-engine-Thread-27) [42a32c13-f2ce-4ae3-9b1d-72f28c08e52b] START, HSMClearTaskVDSCommand(HostName = <removed>, HSMTaskGuidBaseVDSCommandParameters:{hostId='23297c91-5159-467e-af08-56d1d87da207', taskId='c1a003ad-99aa-41aa-bd23-effad9d05bab'}), log id: 42b817dc
2019-10-20 04:03:36,939+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (EE-ManagedThreadFactory-engine-Thread-27) [42a32c13-f2ce-4ae3-9b1d-72f28c08e52b] FINISH, HSMClearTaskVDSCommand, return: , log id: 42b817dc
2019-10-20 04:03:36,939+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (EE-ManagedThreadFactory-engine-Thread-27) [42a32c13-f2ce-4ae3-9b1d-72f28c08e52b] FINISH, SPMClearTaskVDSCommand, return: , log id: 3057823c
2019-10-20 04:03:36,943+02 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engine-Thread-27) [42a32c13-f2ce-4ae3-9b1d-72f28c08e52b] BaseAsyncTask::removeTaskFromDB: Removed task 'c1a003ad-99aa-41aa-bd23-effad9d05bab' from DataBase
2019-10-20 04:03:36,943+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-27) [42a32c13-f2ce-4ae3-9b1d-72f28c08e52b] CommandAsyncTask::HandleEndActionResult [within thread]: Removing CommandMultiAsyncTasks object for entity '58b0d00d-6ad1-4542-a297-ec630fb4a486'

And that's it, nothing else.

Version-Release number of selected component (if applicable):
rhvm-4.3.5.5-0.3.bz1749944.el7.noarch
 * Hotfix installed on Sep 25th 2019
 * Problem started on Oct 21th 2019
 * Tons of successful snapshot operations in between

How reproducible:
Unknown yet.

Actual results:
Engine stuck

Expected results:
Complete CreateSnapshot

Additional info:
Hotfix is for BZ1749944.

Comment 1 Germano Veit Michel 2019-10-24 05:37:21 UTC

104 stuck snapshot creations.

engine=> select count(*) from job where action_type = 'CreateSnapshotForVm' and status ='STARTED';
 count 
-------
   104

Comment 2 Germano Veit Michel 2019-10-24 05:42:13 UTC

Where are these 2 in the logs? I think they can help explaining the problem...

SerialChildCommandsExecutionCallback
ConcurrentChildCommandsExecutionCallback

For example, from the last succesfull snapshot, 3 minutes before all new snapshot got stuck:

2019-10-20 04:00:05,113+02 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-72) [700ea60d-d4df-40b7-8933-18ea2c666486] Command 'CreateSnapshotForVm' (id: '7ae35e42-9aa3-4aec-be9a-3b6500ea6475') waiting on child command id: 'd5769a42-eb6a-4555-80a8-15c2ff47496d' type:'CreateSnapshotDisk' to complete
2019-10-20 04:00:05,113+02 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-72) [700ea60d-d4df-40b7-8933-18ea2c666486] Command 'CreateSnapshotDisk' (id: 'd5769a42-eb6a-4555-80a8-15c2ff47496d') waiting on child command id: '24118e8b-e130-42f7-9008-a7383e4b5b9f' type:'CreateSnapshot' to complete

I don't see this on the stuck snapshots.

Comment 4 Germano Veit Michel 2019-10-24 05:47:48 UTC

There is also this stuck SnapshotVDSCommand from a few minutes before the problem:

2019-10-20 04:00:42,187+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-91) [dbadca6a-5e2f-4c20-add6-e401a54c673a] START, SnapshotVDSCommand(HostName = <removed>, SnapshotVDSCommandParameters:{hostId='7bba1392-c460-4e00-8029-4f79fe10da7c', vmId='42298abb-79fd-b354-04bd-dfb1df9e96c6'}), log id: ba3f335

Its still on Thread Dump 2 days later:

2019-10-22 12:09:00
"EE-ManagedThreadFactory-engineScheduled-Thread-91" #352 daemon prio=5 os_prio=0 tid=0x0000000006cb9000 nid=0x2f1c waiting on condition [0x00007fa03aa26000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000006f8576f20> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
        at org.ovirt.vdsm.jsonrpc.client.internal.BatchCall.get(BatchCall.java:60)
        at org.ovirt.vdsm.jsonrpc.client.internal.Call.get(Call.java:53)
        at org.ovirt.vdsm.jsonrpc.client.internal.Call.get(Call.java:20)
        at org.ovirt.engine.core.vdsbroker.jsonrpc.FutureMap.lazyEval(FutureMap.java:113)
        at org.ovirt.engine.core.vdsbroker.jsonrpc.FutureMap.get(FutureMap.java:249)
        at org.ovirt.engine.core.vdsbroker.vdsbroker.StatusOnlyReturn.<init>(StatusOnlyReturn.java:13)
        at org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcVdsServer.snapshot(JsonRpcVdsServer.java:1057)
        at org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcVdsServer.snapshot(JsonRpcVdsServer.java:1040)
        at org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand.executeSnapshotVerb(SnapshotVDSCommand.java:36)
        at org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand.executeVdsBrokerCommand(SnapshotVDSCommand.java:24)
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVdsCommandWithNetworkEvent(VdsBrokerCommand.java:123)
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:111)
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65)
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:31)
        at org.ovirt.engine.core.vdsbroker.vdsbroker.DefaultVdsCommandExecutor.execute(DefaultVdsCommandExecutor.java:14)
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:398)
        at org.ovirt.engine.core.vdsbroker.ResourceManager$Proxy$_$$_WeldSubclass.runVdsCommand$$super(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.jboss.weld.interceptor.proxy.TerminalAroundInvokeInvocationContext.proceedInternal(TerminalAroundInvokeInvocationContext.java:51)
        at org.jboss.weld.interceptor.proxy.AroundInvokeInvocationContext.proceed(AroundInvokeInvocationContext.java:78)
        at org.ovirt.engine.core.common.di.interceptor.LoggingInterceptor.apply(LoggingInterceptor.java:12)
        at sun.reflect.GeneratedMethodAccessor67.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.jboss.weld.interceptor.reader.SimpleInterceptorInvocation$SimpleMethodInvocation.invoke(SimpleInterceptorInvocation.java:73)
        at org.jboss.weld.interceptor.proxy.InterceptorMethodHandler.executeAroundInvoke(InterceptorMethodHandler.java:84)
        at org.jboss.weld.interceptor.proxy.InterceptorMethodHandler.executeInterception(InterceptorMethodHandler.java:72)
        at org.jboss.weld.interceptor.proxy.InterceptorMethodHandler.invoke(InterceptorMethodHandler.java:56)
        at org.jboss.weld.bean.proxy.CombinedInterceptorAndDecoratorStackMethodHandler.invoke(CombinedInterceptorAndDecoratorStackMethodHandler.java:79)
        at org.jboss.weld.bean.proxy.CombinedInterceptorAndDecoratorStackMethodHandler.invoke(CombinedInterceptorAndDecoratorStackMethodHandler.java:68)
        at org.ovirt.engine.core.vdsbroker.ResourceManager$Proxy$_$$_WeldSubclass.runVdsCommand(Unknown Source)
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.runVdsCommand(VDSBrokerFrontendImpl.java:33)
        at org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:2075)
        at org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand.lambda$performLiveSnapshot$4(CreateSnapshotForVmCommand.java:376)
        at org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand$$Lambda$1555/1376732601.runInTransaction(Unknown Source)
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:164)
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:103)
        at org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand.performLiveSnapshot(CreateSnapshotForVmCommand.java:375)
        at org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand.performNextOperation(CreateSnapshotForVmCommand.java:171)
        at org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback.childCommandsExecutionEnded(SerialChildCommandsExecutionCallback.java:32)
        at org.ovirt.engine.core.bll.ChildCommandsCallbackBase.doPolling(ChildCommandsCallbackBase.java:77)
        at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:175)
        at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:109)
        at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller$$Lambda$932/892810885.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)

Comment 8 Germano Veit Michel 2019-10-24 06:03:30 UTC

The snapshot from comment #4 hung on the host, fsfreeze. Could this have contributed to the engine hung too?

2019-10-22 12:15:43,569+0200 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/7 running <Task <JsonRpcTask {'params': {u'frozen': False, u'vmID': u'42298abb-79fd-b354-04bd-dfb1df9e96c6', u'snapDrives': [{u'baseVolumeID': u'31ed219a-7de5-41bc-a7f1-044a6eb7cd31', u'domainID': u'c18a93ed-e9d7-41fa-893c-e714283ae8e8', u'volumeID': u'1d055a41-00c3-46f6-9b46-40ea7ae01151', u'imageID': u'efca48c7-685e-43e8-8bfb-cfbb768b7dd3'}, {u'baseVolumeID': u'ec125970-6a7d-4222-bce1-9ddd94ce89d8', u'domainID': u'c18a93ed-e9d7-41fa-893c-e714283ae8e8', u'volumeID': u'34252e09-b025-4255-b92a-1a7b7ed7a95c', u'imageID': u'4888f48f-e356-4d69-a977-382e68fea8ca'}]}, 'jsonrpc': '2.0', 'method': u'VM.snapshot', 'id': u'ba1577a6-034b-49e1-8c1f-2843986b6681'} at 0x7f93f0126d10> timeout=60, duration=202501.39 at 0x7f93f04c6b50> task#=970386 at 0x7f93f07d0bd0>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 195, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 262, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 305, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 345, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 194, in _dynamicMethod
  result = fn(*methodArgs)
File: "<string>", line 2, in snapshot
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
  ret = func(*args, **kwargs)
File: "<string>", line 2, in snapshot
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 124, in method
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 733, in snapshot
  return vm.snapshot(snapDrives, memoryParams, frozen=frozen)
File: "<string>", line 2, in snapshot
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 101, in method
  return func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4497, in snapshot
  freezed = self.freeze()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4260, in freeze
  frozen = self._dom.fsFreeze()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 100, in f
  ret = attr(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
  ret = f(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 94, in wrapper
  return func(inst, *args, **kwargs)
File: "/usr/lib64/python2.7/site-packages/libvirt.py", line 2875, in fsFreeze
  ret = libvirtmod.virDomainFSFreeze(self._o, mountpoints, flags) (executor:363)

Comment 11 Germano Veit Michel 2019-10-25 00:42:56 UTC

How reproducible:
Partially

I can reproduce the behaviour that all subsequent snapshot operations (completely unrelated, on different VMs and Hosts) do not go ahead after CreateVolume() is there is a previous SnapshotVDSCommand() that is stuck.

1. Add a sleep on VDSM virt/vm.py on Host A
~~~
        self.log.info("Freezing guest filesystems")

        try:
            time.sleep(40000) 
            frozen = self._dom.fsFreeze()
~~~
2. Snapshot VM1 on Host A
3. Snapshot VM2 on Host B

Snapshot of VM2 [3] just proceeds to SnapshotVDSCommand once Snapshot of VM1 fails with vdsTimeout. 

See the timestamps:

[VM1 SnapshotVDS]
2019-10-25 10:24:38,039+10 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-19) [1788a3a9-03b0-4614-b2d1-a9358f0a0337] START, SnapshotVDSCommand(HostName = host1.kvm, SnapshotVDSCommandParameters:{hostId='11d447a9-7fa7-4c3a-a2f9-ad9def20e797', vmId='0871f8a8-cdda-40a0-ba9e-88f21cae9f3a'}), log id: 6fd1a782

2019-10-25 10:27:38,346+10 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-19) [1788a3a9-03b0-4614-b2d1-a9358f0a0337] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host1.kvm command SnapshotVDS failed: Message timeout which can be caused by communication issues

[VM2]

2019-10-25 10:25:14,034+10 INFO  [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (EE-ManagedThreadFactory-engine-Thread-1185) [367f396c-a500-4194-87d0-b2dbc76bbe4a] Running command: CreateSnapshotForVmCommand internal: false. Entities affected :  ID: 7f028f9d-840e-4190-838f-c0416467cdd0 Type: VMAction group MANIPULATE_VM_SNAPSHOTS with role type USER

2019-10-25 10:25:14,213+10 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (EE-ManagedThreadFactory-engine-Thread-1185) [367f396c-a500-4194-87d0-b2dbc76bbe4a] START, CreateVolumeVDSCommand( CreateVolumeVDSCommandParameters:{storagePoolId='9a84d2c8-e3fd-11e9-a89b-52540019c104', ignoreFailoverLimit='false', storageDomainId='12f0464a-cced-4ce1-a021-04294ef124ec', imageGroupId='c97145ac-0cdd-422a-b330-ee425e418723', imageSizeInBytes='21474836480', volumeFormat='COW', newImageId='4ecbf36b-2db4-4025-8e94-8f4e667f5334', imageType='Sparse', newImageDescription='', imageInitialSizeInBytes='0', imageId='08620f4b-e821-46b5-a29a-c3535e06102d', sourceImageGroupId='c97145ac-0cdd-422a-b330-ee425e418723'}), log id: 775be77f
2019-10-25 10:25:14,308+10 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (EE-ManagedThreadFactory-engine-Thread-1185) [367f396c-a500-4194-87d0-b2dbc76bbe4a] FINISH, CreateVolumeVDSCommand, return: 4ecbf36b-2db4-4025-8e94-8f4e667f5334, log id: 775be77

2019-10-25 10:25:17,575+10 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (EE-ManagedThreadFactory-engine-Thread-1188) [367f396c-a500-4194-87d0-b2dbc76bbe4a] FINISH, HSMClearTaskVDSCommand, return: , log id: 361a734c
2019-10-25 10:25:17,575+10 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (EE-ManagedThreadFactory-engine-Thread-1188) [367f396c-a500-4194-87d0-b2dbc76bbe4a] FINISH, SPMClearTaskVDSCommand, return: , log id: 17f5c770
2019-10-25 10:25:17,587+10 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engine-Thread-1188) [367f396c-a500-4194-87d0-b2dbc76bbe4a] BaseAsyncTask::removeTaskFromDB: Removed task 'fd1ac433-b013-4d47-aee8-36d168acef29' from DataBase
2019-10-25 10:25:17,587+10 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-1188) [367f396c-a500-4194-87d0-b2dbc76bbe4a] CommandAsyncTask::HandleEndActionResult [within thread]: Removing CommandMultiAsyncTasks object for entity 'eb76d231-7167-4bf7-9012-9275a6b7abaf'

... stuck until timeout of SnapshotVDS of VM1 ....

2019-10-25 10:27:40,231+10 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-11) [367f396c-a500-4194-87d0-b2dbc76bbe4a] Command 'CreateSnapshotDisk' id: '94eaa561-87f3-4e08-9cd2-6b6937c9906c' child commands '[eb76d231-7167-4bf7-9012-9275a6b7abaf]' executions were completed, status 'SUCCEEDED'
2019-10-25 10:27:40,231+10 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-11) [367f396c-a500-4194-87d0-b2dbc76bbe4a] Command 'CreateSnapshotDisk' id: '94eaa561-87f3-4e08-9cd2-6b6937c9906c' Updating status to 'SUCCEEDED', The command end method logic will be executed by one of its parent commands.

2019-10-25 10:27:42,304+10 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-66) [367f396c-a500-4194-87d0-b2dbc76bbe4a] START, SnapshotVDSCommand(HostName = host2.kvm, SnapshotVDSCommandParameters:{hostId='848b56c1-b024-438c-8a55-de4c74057e67', vmId='7f028f9d-840e-4190-838f-c0416467cdd0'}), log id: 28ccab6b

Now the remaining questions are:
a) why the customer's SnapshotVDS() never timed-out/returned (see comment #4)
b) why the fsfreeze hung

Comment 12 Tal Nisan 2019-10-28 16:01:45 UTC

Martin, seems to me like an Infra issue, can someone have a look please?

Comment 20 Germano Veit Michel 2019-11-18 23:00:58 UTC

Regarding the fsfreeze, the related VM is Windows 2012 with very old Guest Agent, (4.1.7). 
So the hung fsfreeze could have been https://bugzilla.redhat.com/show_bug.cgi?id=1596457.

Comment 22 Michal Skrivanek 2019-11-19 12:26:45 UTC

(In reply to Germano Veit Michel from comment #20)
> Regarding the fsfreeze, the related VM is Windows 2012 with very old Guest
> Agent, (4.1.7). 
> So the hung fsfreeze could have been
> https://bugzilla.redhat.com/show_bug.cgi?id=1596457.


That explains a lot. Can't they upgrade?

Comment 23 Germano Veit Michel 2019-11-19 22:32:30 UTC

(In reply to Michal Skrivanek from comment #22)
> (In reply to Germano Veit Michel from comment #20)
> > Regarding the fsfreeze, the related VM is Windows 2012 with very old Guest
> > Agent, (4.1.7). 
> > So the hung fsfreeze could have been
> > https://bugzilla.redhat.com/show_bug.cgi?id=1596457.
> 
> 
> That explains a lot. Can't they upgrade?

Possibly, we can ask them. But it could partially explain the fsfreeze hang, which is not what this bug is about:
* Why engine didn't timeout from that SnapshotVDS with hung fsfreeze
* Why engine got stuck after CreateVolumeVDS for all other VMs that had nothing to do with the hung fsfreeze of a single VM, breaking almost all customers VMs.

Comment 24 Marina Kalinin 2020-01-16 23:15:44 UTC

(In reply to Germano Veit Michel from comment #23)
> (In reply to Michal Skrivanek from comment #22)
> > (In reply to Germano Veit Michel from comment #20)
> > > Regarding the fsfreeze, the related VM is Windows 2012 with very old Guest
> > > Agent, (4.1.7). 
> > > So the hung fsfreeze could have been
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1596457.
> > 
> > 
> > That explains a lot. Can't they upgrade?
> 
> Possibly, we can ask them. But it could partially explain the fsfreeze hang,
> which is not what this bug is about:
> * Why engine didn't timeout from that SnapshotVDS with hung fsfreeze
> * Why engine got stuck after CreateVolumeVDS for all other VMs that had
> nothing to do with the hung fsfreeze of a single VM, breaking almost all
> customers VMs.

Martin, can you please review is this is on your plate or still virt?

Comment 25 Michal Skrivanek 2020-04-16 08:58:31 UTC

it's a stuck storage call (fsfreeze, part of snapshot operation). there's nothing wrong in infra it seems, it's just that the call shouldn't be synchronous if it can get stuck ( it shouldn't get stuck in a first place of course, but anything can happen when interacting with guests)
Bug 1749284 may significantly change the situation, but it can very well be still a problem

Tal/Benny, thoughts?

Comment 26 Benny Zlotnik 2020-04-16 09:36:24 UTC

(In reply to Michal Skrivanek from comment #25)
> it's a stuck storage call (fsfreeze, part of snapshot operation). there's
> nothing wrong in infra it seems, it's just that the call shouldn't be
> synchronous if it can get stuck ( it shouldn't get stuck in a first place of
> course, but anything can happen when interacting with guests)
> Bug 1749284 may significantly change the situation, but it can very well be
> still a problem
> 
> Tal/Benny, thoughts?

We have an API call for freeze which is used when the snapshot is created for storage not managed by vdsm (cinder, cinderlib).
For managed storage vdsm calls freeze internally, but maybe we can just let engine call freeze and if it times out - just fail the snapshot operation (or not if inconsistent snapshots are fine)?
There could be ramifications I am not aware of, but it seems like it could make things easier

Comment 27 Tal Nisan 2020-04-19 12:50:12 UTC

What do you think Michal regarding comment 26?

Comment 28 Michal Skrivanek 2020-04-19 13:19:48 UTC

timing out and not proceeding with snapshot sounds safe(r). We may not have a way how to abort, but still better to wait until it ends eventually, or even if it doesn't it's better not to attempt further snapshots

Comment 29 Benny Zlotnik 2020-04-19 15:40:19 UTC

(In reply to Michal Skrivanek from comment #28)
> timing out and not proceeding with snapshot sounds safe(r). We may not have
> a way how to abort, but still better to wait until it ends eventually, or
> even if it doesn't it's better not to attempt further snapshots

Waiting for freeze in the engine can also be a solution for https://bugzilla.redhat.com/show_bug.cgi?id=1821164 (maybe even a backport-able one).
I'm not sure about the consequences of a stuck freeze though, can we just thaw in case it times out (or fails)?

Milan, do you know of any issues we might have if we always send "False" in[1]?
We currently only do this when the snapshot volume is not managed by vdsm (cinder/cinderlib volumes)


[1] https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/jobs/snapshot.py#L445

Comment 30 Milan Zamazal 2020-04-20 09:34:40 UTC

According to https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/jobs/snapshot.py#L180, calling thaw should be OK after a failed freeze. I don't know though what happens if the stuck freeze call is still running.

Calling freeze for snapshots from Engine rather than from Vdsm may be possible. But it should be considered that:

- Engine must ensure thaw is always called after attempting a snapshot; should be no problem.
- There are operations that are currently performed after thaw before the snapshot finishes, such as memory volume padding; they probably don't need the  VM running but calling thaw after means it will be called unnecessarily a bit later.
- VM monitoring disabling and enabling is bound to freeze/thaw; this would probably have to be adjusted.

I'd suggest checking carefully what happens before freeze and after thaw in snapshot processing on the Vdsm side.

Comment 31 Michal Skrivanek 2020-04-20 09:44:18 UTC

it doesn't have to be symmetric. You can rely on engine to freeze and let vdsm thaw. But even without that it could be a good enough soltion for 4.3

For 4.4 I'd say the problem should be handled by async snapshots, but as it is right now the job(and timeout) doesn't cover the freeze part

Comment 32 Michal Skrivanek 2020-04-20 09:49:20 UTC

(In reply to Benny Zlotnik from comment #29)
> Waiting for freeze in the engine can also be a solution for
> https://bugzilla.redhat.com/show_bug.cgi?id=1821164 (maybe even a
> backport-able one).
> I'm not sure about the consequences of a stuck freeze though, can we just
> thaw in case it times out (or fails)?

Freeze itself is sync, so it will probably error out on engine side?  It will at least not corrupt anything so that is good, but corner cases are indeed ugly. That's why in 4.4 I'd rather try to cover it in the vdsm job

Comment 33 Tomáš Golembiovský 2020-04-20 11:07:44 UTC

(In reply to Michal Skrivanek from comment #32)
> (In reply to Benny Zlotnik from comment #29)
> > Waiting for freeze in the engine can also be a solution for
> > https://bugzilla.redhat.com/show_bug.cgi?id=1821164 (maybe even a
> > backport-able one).
> > I'm not sure about the consequences of a stuck freeze though, can we just
> > thaw in case it times out (or fails)?
> 
> Freeze itself is sync, so it will probably error out on engine side?  It
> will at least not corrupt anything so that is good, but corner cases are
> indeed ugly. That's why in 4.4 I'd rather try to cover it in the vdsm job

Calling thaw every time is a good idea. Although depending on the reason why the freeze failed the thaw is likely to fail too so the code should expect that. For example the OS may remain in bad state or the guest agent is stuck waiting for the freeze. If the thaw fails it may also be good idea to flag the VM with exclamation mark suggesting reboot.

Comment 34 Benny Zlotnik 2020-04-20 11:47:57 UTC

(In reply to Michal Skrivanek from comment #32)
> (In reply to Benny Zlotnik from comment #29)
> > Waiting for freeze in the engine can also be a solution for
> > https://bugzilla.redhat.com/show_bug.cgi?id=1821164 (maybe even a
> > backport-able one).
> > I'm not sure about the consequences of a stuck freeze though, can we just
> > thaw in case it times out (or fails)?
> 
> Freeze itself is sync, so it will probably error out on engine side?

yes, I meant in a scenario where freeze timeouts but actually keeps on running on the VM, will thaw help in that case... but according to comment #33, we can consider this bad state and recommend reboot

Comment 40 Evelina Shames 2020-06-01 12:40:27 UTC

After consulting with Liran, verified with the following flow:
1. Add a sleep on VDSM virt/vm.py on host
2. Restart vdsmd on that host
3. On engine:
     engine-config -s LiveSnapshotPerformFreezeInEngine=true
     engine-config -s LiveSnapshotTimeoutInMinutes=3
4. Restart ovirt-engine-service
5. Run VM and create live snapshot without memory
6 In the engine log and vdsm log look for : 
     FreezeVDSCommand
     ThawVDSCommand
--> The snapshot is created and ready to be used.

Version: rhv-4.4.1-2

Moving to 'Verified'.

Comment 46 errata-xmlrpc 2020-08-04 13:20:56 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247