Bug 1764943
Summary: Create Snapshot does not proceed beyond CreateVolume
Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Version: 4.3.5
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Milestone: ovirt-4.4.0
Target Release: ---
Keywords: ZStream
Reporter: Germano Veit Michel <gveitmic>
Assignee: Liran Rotenberg <lrotenbe>
QA Contact: Evelina Shames <eshames>
Docs Contact:
CC: aefrat, bzlotnik, lrotenbe, michal.skrivanek, mkalinin, mperina, mzamazal, pelauter, rdlugyhe, rhodain, tgolembi, tnisan
Flags: lsvaty: testing_plan_complete-
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, while creating a virtual machine snapshot, if VDSM's command to freeze the virtual machine's file systems exceeded the snapshot command's 3-minute timeout, creating the snapshot failed and left virtual machines and disks locked.
The current release adds two key-value pairs to the engine configuration, which you can set with the engine-config tool:
- Setting `LiveSnapshotPerformFreezeInEngine` to `true` makes the {engine-name} freeze the VM's file systems before it creates a snapshot of them.
- Setting `LiveSnapshotAllowInconsistent` to `true` allows the {engine-name} to continue creating a snapshot even if it fails to freeze the VM's file systems.
Story Points: ---
Clone Of:
Cloned to: 1842377 (view as bug list)
Environment:
Last Closed: 2020-08-04 13:20:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1842377
Description
Germano Veit Michel
2019-10-24 05:34:30 UTC
104 stuck snapshot creations.

~~~
engine=> select count(*) from job where action_type = 'CreateSnapshotForVm' and status ='STARTED';
 count
-------
   104
~~~

Where are these two in the logs? I think they can help explain the problem:
SerialChildCommandsExecutionCallback
ConcurrentChildCommandsExecutionCallback

For example, from the last successful snapshot, 3 minutes before all new snapshots got stuck:

~~~
2019-10-20 04:00:05,113+02 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-72) [700ea60d-d4df-40b7-8933-18ea2c666486] Command 'CreateSnapshotForVm' (id: '7ae35e42-9aa3-4aec-be9a-3b6500ea6475') waiting on child command id: 'd5769a42-eb6a-4555-80a8-15c2ff47496d' type:'CreateSnapshotDisk' to complete
2019-10-20 04:00:05,113+02 INFO [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-72) [700ea60d-d4df-40b7-8933-18ea2c666486] Command 'CreateSnapshotDisk' (id: 'd5769a42-eb6a-4555-80a8-15c2ff47496d') waiting on child command id: '24118e8b-e130-42f7-9008-a7383e4b5b9f' type:'CreateSnapshot' to complete
~~~

I don't see this on the stuck snapshots.

There is also this stuck SnapshotVDSCommand from a few minutes before the problem:

~~~
2019-10-20 04:00:42,187+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-91) [dbadca6a-5e2f-4c20-add6-e401a54c673a] START, SnapshotVDSCommand(HostName = <removed>, SnapshotVDSCommandParameters:{hostId='7bba1392-c460-4e00-8029-4f79fe10da7c', vmId='42298abb-79fd-b354-04bd-dfb1df9e96c6'}), log id: ba3f335
~~~

It is still there in the thread dump taken 2 days later (2019-10-22 12:09:00):

~~~
"EE-ManagedThreadFactory-engineScheduled-Thread-91" #352 daemon prio=5 os_prio=0 tid=0x0000000006cb9000 nid=0x2f1c waiting on condition [0x00007fa03aa26000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000006f8576f20> (a java.util.concurrent.CountDownLatch$Sync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
    at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
    at org.ovirt.vdsm.jsonrpc.client.internal.BatchCall.get(BatchCall.java:60)
    at org.ovirt.vdsm.jsonrpc.client.internal.Call.get(Call.java:53)
    at org.ovirt.vdsm.jsonrpc.client.internal.Call.get(Call.java:20)
    at org.ovirt.engine.core.vdsbroker.jsonrpc.FutureMap.lazyEval(FutureMap.java:113)
    at org.ovirt.engine.core.vdsbroker.jsonrpc.FutureMap.get(FutureMap.java:249)
    at org.ovirt.engine.core.vdsbroker.vdsbroker.StatusOnlyReturn.<init>(StatusOnlyReturn.java:13)
    at org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcVdsServer.snapshot(JsonRpcVdsServer.java:1057)
    at org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcVdsServer.snapshot(JsonRpcVdsServer.java:1040)
    at org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand.executeSnapshotVerb(SnapshotVDSCommand.java:36)
    at org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand.executeVdsBrokerCommand(SnapshotVDSCommand.java:24)
    at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVdsCommandWithNetworkEvent(VdsBrokerCommand.java:123)
    at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:111)
    at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65)
    at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:31)
    at org.ovirt.engine.core.vdsbroker.vdsbroker.DefaultVdsCommandExecutor.execute(DefaultVdsCommandExecutor.java:14)
    at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:398)
    at org.ovirt.engine.core.vdsbroker.ResourceManager$Proxy$_$$_WeldSubclass.runVdsCommand$$super(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.jboss.weld.interceptor.proxy.TerminalAroundInvokeInvocationContext.proceedInternal(TerminalAroundInvokeInvocationContext.java:51)
    at org.jboss.weld.interceptor.proxy.AroundInvokeInvocationContext.proceed(AroundInvokeInvocationContext.java:78)
    at org.ovirt.engine.core.common.di.interceptor.LoggingInterceptor.apply(LoggingInterceptor.java:12)
    at sun.reflect.GeneratedMethodAccessor67.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.jboss.weld.interceptor.reader.SimpleInterceptorInvocation$SimpleMethodInvocation.invoke(SimpleInterceptorInvocation.java:73)
    at org.jboss.weld.interceptor.proxy.InterceptorMethodHandler.executeAroundInvoke(InterceptorMethodHandler.java:84)
    at org.jboss.weld.interceptor.proxy.InterceptorMethodHandler.executeInterception(InterceptorMethodHandler.java:72)
    at org.jboss.weld.interceptor.proxy.InterceptorMethodHandler.invoke(InterceptorMethodHandler.java:56)
    at org.jboss.weld.bean.proxy.CombinedInterceptorAndDecoratorStackMethodHandler.invoke(CombinedInterceptorAndDecoratorStackMethodHandler.java:79)
    at org.jboss.weld.bean.proxy.CombinedInterceptorAndDecoratorStackMethodHandler.invoke(CombinedInterceptorAndDecoratorStackMethodHandler.java:68)
    at org.ovirt.engine.core.vdsbroker.ResourceManager$Proxy$_$$_WeldSubclass.runVdsCommand(Unknown Source)
    at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.runVdsCommand(VDSBrokerFrontendImpl.java:33)
    at org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:2075)
    at org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand.lambda$performLiveSnapshot$4(CreateSnapshotForVmCommand.java:376)
    at org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand$$Lambda$1555/1376732601.runInTransaction(Unknown Source)
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:164)
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:103)
    at org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand.performLiveSnapshot(CreateSnapshotForVmCommand.java:375)
    at org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand.performNextOperation(CreateSnapshotForVmCommand.java:171)
    at org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback.childCommandsExecutionEnded(SerialChildCommandsExecutionCallback.java:32)
    at org.ovirt.engine.core.bll.ChildCommandsCallbackBase.doPolling(ChildCommandsCallbackBase.java:77)
    at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:175)
    at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:109)
    at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller$$Lambda$932/892810885.run(Unknown Source)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
~~~
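For illustration only: the thread dump shows the SnapshotVDS call parked in CountDownLatch.await() with no timeout, so if the host never answers, the call never returns and the scheduled-executor thread that issued it stays blocked. Below is a minimal Python sketch of that failure mode; all names are invented for illustration and this is not the engine's actual (Java) code.

~~~
import threading

class PendingCall:
    """Sketch of a request/response latch, analogous to the stuck
    SnapshotVDSCommand above. Illustrative names only."""

    def __init__(self):
        self._done = threading.Event()
        self._response = None

    def deliver(self, response):
        # Called when a response arrives from the host.
        self._response = response
        self._done.set()

    def get(self):
        # Unbounded wait: if the host never answers (e.g. fsfreeze is stuck),
        # the calling thread blocks forever, like Thread-91 in the dump above.
        self._done.wait()
        return self._response

    def get_bounded(self, seconds):
        # Bounded wait: the caller regains control and can fail the command.
        if not self._done.wait(seconds):
            raise TimeoutError("no response within %s seconds" % seconds)
        return self._response
~~~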
The snapshot from comment #4 hung on the host, in fsfreeze. Could this have contributed to the engine hang too?

~~~
2019-10-22 12:15:43,569+0200 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/7 running <Task <JsonRpcTask {'params': {u'frozen': False, u'vmID': u'42298abb-79fd-b354-04bd-dfb1df9e96c6', u'snapDrives': [{u'baseVolumeID': u'31ed219a-7de5-41bc-a7f1-044a6eb7cd31', u'domainID': u'c18a93ed-e9d7-41fa-893c-e714283ae8e8', u'volumeID': u'1d055a41-00c3-46f6-9b46-40ea7ae01151', u'imageID': u'efca48c7-685e-43e8-8bfb-cfbb768b7dd3'}, {u'baseVolumeID': u'ec125970-6a7d-4222-bce1-9ddd94ce89d8', u'domainID': u'c18a93ed-e9d7-41fa-893c-e714283ae8e8', u'volumeID': u'34252e09-b025-4255-b92a-1a7b7ed7a95c', u'imageID': u'4888f48f-e356-4d69-a977-382e68fea8ca'}]}, 'jsonrpc': '2.0', 'method': u'VM.snapshot', 'id': u'ba1577a6-034b-49e1-8c1f-2843986b6681'} at 0x7f93f0126d10> timeout=60, duration=202501.39 at 0x7f93f04c6b50> task#=970386 at 0x7f93f07d0bd0>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 195, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 262, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 305, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 345, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 194, in _dynamicMethod
  result = fn(*methodArgs)
File: "<string>", line 2, in snapshot
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
  ret = func(*args, **kwargs)
File: "<string>", line 2, in snapshot
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 124, in method
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 733, in snapshot
  return vm.snapshot(snapDrives, memoryParams, frozen=frozen)
File: "<string>", line 2, in snapshot
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 101, in method
  return func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4497, in snapshot
  freezed = self.freeze()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4260, in freeze
  frozen = self._dom.fsFreeze()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 100, in f
  ret = attr(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
  ret = f(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 94, in wrapper
  return func(inst, *args, **kwargs)
File: "/usr/lib64/python2.7/site-packages/libvirt.py", line 2875, in fsFreeze
  ret = libvirtmod.virDomainFSFreeze(self._o, mountpoints, flags)
(executor:363)
~~~
How reproducible:
Partially. I can reproduce the behaviour that all subsequent snapshot operations (completely unrelated, on different VMs and hosts) do not go ahead after CreateVolume() if there is a previous SnapshotVDSCommand() that is stuck.

1. Add a sleep in VDSM's virt/vm.py on Host A:

~~~
        self.log.info("Freezing guest filesystems")

        try:
            time.sleep(40000)
            frozen = self._dom.fsFreeze()
~~~

2. Snapshot VM1 on Host A
3. Snapshot VM2 on Host B

The snapshot of VM2 [3] only proceeds to SnapshotVDSCommand once the snapshot of VM1 fails with vdsTimeout. See the timestamps:

[VM1 SnapshotVDS]
~~~
2019-10-25 10:24:38,039+10 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-19) [1788a3a9-03b0-4614-b2d1-a9358f0a0337] START, SnapshotVDSCommand(HostName = host1.kvm, SnapshotVDSCommandParameters:{hostId='11d447a9-7fa7-4c3a-a2f9-ad9def20e797', vmId='0871f8a8-cdda-40a0-ba9e-88f21cae9f3a'}), log id: 6fd1a782
2019-10-25 10:27:38,346+10 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-19) [1788a3a9-03b0-4614-b2d1-a9358f0a0337] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host1.kvm command SnapshotVDS failed: Message timeout which can be caused by communication issues
~~~

[VM2]
~~~
2019-10-25 10:25:14,034+10 INFO [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (EE-ManagedThreadFactory-engine-Thread-1185) [367f396c-a500-4194-87d0-b2dbc76bbe4a] Running command: CreateSnapshotForVmCommand internal: false. Entities affected : ID: 7f028f9d-840e-4190-838f-c0416467cdd0 Type: VMAction group MANIPULATE_VM_SNAPSHOTS with role type USER
2019-10-25 10:25:14,213+10 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (EE-ManagedThreadFactory-engine-Thread-1185) [367f396c-a500-4194-87d0-b2dbc76bbe4a] START, CreateVolumeVDSCommand( CreateVolumeVDSCommandParameters:{storagePoolId='9a84d2c8-e3fd-11e9-a89b-52540019c104', ignoreFailoverLimit='false', storageDomainId='12f0464a-cced-4ce1-a021-04294ef124ec', imageGroupId='c97145ac-0cdd-422a-b330-ee425e418723', imageSizeInBytes='21474836480', volumeFormat='COW', newImageId='4ecbf36b-2db4-4025-8e94-8f4e667f5334', imageType='Sparse', newImageDescription='', imageInitialSizeInBytes='0', imageId='08620f4b-e821-46b5-a29a-c3535e06102d', sourceImageGroupId='c97145ac-0cdd-422a-b330-ee425e418723'}), log id: 775be77f
2019-10-25 10:25:14,308+10 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (EE-ManagedThreadFactory-engine-Thread-1185) [367f396c-a500-4194-87d0-b2dbc76bbe4a] FINISH, CreateVolumeVDSCommand, return: 4ecbf36b-2db4-4025-8e94-8f4e667f5334, log id: 775be77
2019-10-25 10:25:17,575+10 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (EE-ManagedThreadFactory-engine-Thread-1188) [367f396c-a500-4194-87d0-b2dbc76bbe4a] FINISH, HSMClearTaskVDSCommand, return: , log id: 361a734c
2019-10-25 10:25:17,575+10 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (EE-ManagedThreadFactory-engine-Thread-1188) [367f396c-a500-4194-87d0-b2dbc76bbe4a] FINISH, SPMClearTaskVDSCommand, return: , log id: 17f5c770
2019-10-25 10:25:17,587+10 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engine-Thread-1188) [367f396c-a500-4194-87d0-b2dbc76bbe4a] BaseAsyncTask::removeTaskFromDB: Removed task 'fd1ac433-b013-4d47-aee8-36d168acef29' from DataBase
2019-10-25 10:25:17,587+10 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-1188) [367f396c-a500-4194-87d0-b2dbc76bbe4a] CommandAsyncTask::HandleEndActionResult [within thread]: Removing CommandMultiAsyncTasks object for entity 'eb76d231-7167-4bf7-9012-9275a6b7abaf'

... stuck until timeout of SnapshotVDS of VM1 ...

2019-10-25 10:27:40,231+10 INFO [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-11) [367f396c-a500-4194-87d0-b2dbc76bbe4a] Command 'CreateSnapshotDisk' id: '94eaa561-87f3-4e08-9cd2-6b6937c9906c' child commands '[eb76d231-7167-4bf7-9012-9275a6b7abaf]' executions were completed, status 'SUCCEEDED'
2019-10-25 10:27:40,231+10 INFO [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-11) [367f396c-a500-4194-87d0-b2dbc76bbe4a] Command 'CreateSnapshotDisk' id: '94eaa561-87f3-4e08-9cd2-6b6937c9906c' Updating status to 'SUCCEEDED', The command end method logic will be executed by one of its parent commands.
2019-10-25 10:27:42,304+10 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-66) [367f396c-a500-4194-87d0-b2dbc76bbe4a] START, SnapshotVDSCommand(HostName = host2.kvm, SnapshotVDSCommandParameters:{hostId='848b56c1-b024-438c-8a55-de4c74057e67', vmId='7f028f9d-840e-4190-838f-c0416467cdd0'}), log id: 28ccab6b
~~~

Now the remaining questions are:
a) why the customer's SnapshotVDS() never timed out/returned (see comment #4)
b) why the fsfreeze hung

Martin, seems to me like an Infra issue, can someone have a look please?

Regarding the fsfreeze, the related VM is Windows 2012 with a very old Guest Agent (4.1.7). So the hung fsfreeze could have been https://bugzilla.redhat.com/show_bug.cgi?id=1596457.

(In reply to Germano Veit Michel from comment #20)
> Regarding the fsfreeze, the related VM is Windows 2012 with very old Guest
> Agent, (4.1.7).
> So the hung fsfreeze could have been
> https://bugzilla.redhat.com/show_bug.cgi?id=1596457.

That explains a lot. Can't they upgrade?

(In reply to Michal Skrivanek from comment #22)
> (In reply to Germano Veit Michel from comment #20)
> > Regarding the fsfreeze, the related VM is Windows 2012 with very old Guest
> > Agent, (4.1.7).
> > So the hung fsfreeze could have been
> > https://bugzilla.redhat.com/show_bug.cgi?id=1596457.
>
> That explains a lot. Can't they upgrade?

Possibly, we can ask them. But it could only partially explain the fsfreeze hang, which is not what this bug is about:
* Why the engine didn't time out on that SnapshotVDS with the hung fsfreeze
* Why the engine got stuck after CreateVolumeVDS for all other VMs that had nothing to do with the hung fsfreeze of a single VM, breaking almost all of the customer's VMs

(In reply to Germano Veit Michel from comment #23)
> (In reply to Michal Skrivanek from comment #22)
> > (In reply to Germano Veit Michel from comment #20)
> > > Regarding the fsfreeze, the related VM is Windows 2012 with very old Guest
> > > Agent, (4.1.7).
> > > So the hung fsfreeze could have been
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1596457.
> >
> > That explains a lot. Can't they upgrade?
>
> Possibly, we can ask them. But it could partially explain the fsfreeze hang,
> which is not what this bug is about:
> * Why engine didn't timeout from that SnapshotVDS with hung fsfreeze
> * Why engine got stuck after CreateVolumeVDS for all other VMs that had
> nothing to do with the hung fsfreeze of a single VM, breaking almost all
> customers VMs.

Martin, can you please review if this is on your plate or still virt?

It's a stuck storage call (fsfreeze, part of the snapshot operation). There's nothing wrong in infra it seems, it's just that the call shouldn't be synchronous if it can get stuck (it shouldn't get stuck in the first place of course, but anything can happen when interacting with guests). Bug 1749284 may significantly change the situation, but it can very well still be a problem.

Tal/Benny, thoughts?
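To illustrate why a synchronous call that can get stuck holds up snapshots of unrelated VMs after CreateVolume, here is a toy model (invented names, not the engine's actual CommandCallbacksPoller) of callbacks being executed one after another on a shared thread: a single callback that blocks in an unbounded call starves every callback queued behind it until it returns or times out.

~~~
import queue
import threading
import time

class CallbackPoller:
    """Toy model of a poller that runs command callbacks sequentially
    on one shared thread. Illustrative only."""

    def __init__(self):
        self._callbacks = queue.Queue()
        threading.Thread(target=self._poll, daemon=True).start()

    def submit(self, callback):
        self._callbacks.put(callback)

    def _poll(self):
        while True:
            callback = self._callbacks.get()
            # If this callback makes a synchronous, unbounded call (like the
            # stuck SnapshotVDS above), everything queued after it waits.
            callback()

poller = CallbackPoller()
poller.submit(lambda: threading.Event().wait())        # "VM1": blocks forever
poller.submit(lambda: print("VM2 snapshot proceeds"))  # starved behind VM1
time.sleep(2)   # nothing is printed: VM2's callback never gets to run
~~~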
(In reply to Michal Skrivanek from comment #25)
> it's a stuck storage call (fsfreeze, part of snapshot operation). there's
> nothing wrong in infra it seems, it's just that the call shouldn't be
> synchronous if it can get stuck ( it shouldn't get stuck in a first place of
> course, but anything can happen when interacting with guests)
> Bug 1749284 may significantly change the situation, but it can very well be
> still a problem
>
> Tal/Benny, thoughts?

We have an API call for freeze which is used when the snapshot is created for storage not managed by vdsm (cinder, cinderlib). For managed storage vdsm calls freeze internally, but maybe we can just let the engine call freeze and, if it times out, just fail the snapshot operation (or not, if inconsistent snapshots are fine)? There could be ramifications I am not aware of, but it seems like it could make things easier.

What do you think, Michal, regarding comment 26?

Timing out and not proceeding with the snapshot sounds safe(r). We may not have a way to abort, but it's still better to wait until it ends eventually, and even if it doesn't, it's better not to attempt further snapshots.

(In reply to Michal Skrivanek from comment #28)
> timing out and not proceeding with snapshot sounds safe(r). We may not have
> a way how to abort, but still better to wait until it ends eventually, or
> even if it doesn't it's better not to attempt further snapshots

Waiting for freeze in the engine can also be a solution for https://bugzilla.redhat.com/show_bug.cgi?id=1821164 (maybe even a backport-able one). I'm not sure about the consequences of a stuck freeze though, can we just thaw in case it times out (or fails)?

Milan, do you know of any issues we might have if we always send "False" in [1]? We currently only do this when the snapshot volume is not managed by vdsm (cinder/cinderlib volumes).

[1] https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/jobs/snapshot.py#L445

According to https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/jobs/snapshot.py#L180, calling thaw should be OK after a failed freeze. I don't know though what happens if the stuck freeze call is still running.

Calling freeze for snapshots from Engine rather than from Vdsm may be possible. But it should be considered that:
- Engine must ensure thaw is always called after attempting a snapshot; should be no problem.
- There are operations that are currently performed after thaw before the snapshot finishes, such as memory volume padding; they probably don't need the VM running, but calling thaw after means it will be called unnecessarily a bit later.
- VM monitoring disabling and enabling is bound to freeze/thaw; this would probably have to be adjusted.

I'd suggest checking carefully what happens before freeze and after thaw in snapshot processing on the Vdsm side.

It doesn't have to be symmetric. You can rely on the engine to freeze and let vdsm thaw. But even without that it could be a good enough solution for 4.3. For 4.4 I'd say the problem should be handled by async snapshots, but as it is right now the job (and timeout) doesn't cover the freeze part.
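The approach discussed above (engine-side freeze with a timeout, optionally continuing without a consistent freeze, and always thawing) is what the LiveSnapshotPerformFreezeInEngine and LiveSnapshotAllowInconsistent settings in the Doc Text correspond to. Here is a rough Python sketch of that flow; it is not the engine's actual (Java) implementation, and host.freeze/snapshot/thaw are hypothetical stand-ins for the Freeze/Snapshot/Thaw VDS commands.

~~~
def create_live_snapshot(vm, host, perform_freeze_in_engine=True,
                         allow_inconsistent=False, freeze_timeout=180):
    """Sketch of an engine-side freeze/thaw wrapped around a live snapshot.
    Helper names are invented for illustration."""
    frozen = False
    if perform_freeze_in_engine:                  # LiveSnapshotPerformFreezeInEngine
        try:
            host.freeze(vm, timeout=freeze_timeout)
            frozen = True
        except TimeoutError:
            if not allow_inconsistent:            # LiveSnapshotAllowInconsistent
                raise   # fail the snapshot instead of leaving it hanging
    try:
        # Tell the host whether the guest is already frozen, mirroring the
        # 'frozen' parameter visible in the VM.snapshot call logged above.
        host.snapshot(vm, frozen=frozen)
    finally:
        if frozen:
            host.thaw(vm)   # always attempt thaw, even if the snapshot failed
~~~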
(In reply to Benny Zlotnik from comment #29)
> Waiting for freeze in the engine can also be a solution for
> https://bugzilla.redhat.com/show_bug.cgi?id=1821164 (maybe even a
> backport-able one).
> I'm not sure about the consequences of a stuck freeze though, can we just
> thaw in case it times out (or fails)?
>
> Freeze itself is sync, so it will probably error out on engine side? It
> will at least not corrupt anything so that is good, but corner cases are
> indeed ugly. That's why in 4.4 I'd rather try to cover it in the vdsm job

Calling thaw every time is a good idea. Although, depending on the reason why the freeze failed, the thaw is likely to fail too, so the code should expect that. For example, the OS may remain in a bad state or the guest agent may be stuck waiting for the freeze. If the thaw fails, it may also be a good idea to flag the VM with an exclamation mark suggesting a reboot.

(In reply to Michal Skrivanek from comment #32)
> (In reply to Benny Zlotnik from comment #29)
> > Waiting for freeze in the engine can also be a solution for
> > https://bugzilla.redhat.com/show_bug.cgi?id=1821164 (maybe even a
> > backport-able one).
> > I'm not sure about the consequences of a stuck freeze though, can we just
> > thaw in case it times out (or fails)?
>
> Freeze itself is sync, so it will probably error out on engine side?

Yes, I meant a scenario where the freeze times out but actually keeps running on the VM; will thaw help in that case... but according to comment #33, we can consider this a bad state and recommend a reboot.

After consulting with Liran, verified with the following flow:
1. Add a sleep in VDSM virt/vm.py on the host
2. Restart vdsmd on that host
3. On the engine:
   engine-config -s LiveSnapshotPerformFreezeInEngine=true
   engine-config -s LiveSnapshotTimeoutInMinutes=3
4. Restart the ovirt-engine service
5. Run a VM and create a live snapshot without memory
6. In the engine log and vdsm log, look for:
   FreezeVDSCommand
   ThawVDSCommand

--> The snapshot is created and ready to be used.

Version: rhv-4.4.1-2

Moving to 'Verified'.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247