Description of problem:
Many commands are not removed from the database and remain in ACTIVE status [1]. All the operations ended successfully, but it seems that CommandCallbacksPoller is not executed. The consequence, in this case, is that VMs in a VM pool cannot be pre-started, because they are still in the snapshot removal process.

Version-Release number of selected component (if applicable):
RHV 4.1.10

How reproducible:
Just once

Steps to Reproduce:
Unknown

Actual results:
The operations never finish.

Expected results:
The operations finish successfully.

Additional info:
We can see that the child object is removed, but the parent remains there and ConcurrentChildCommandsExecutionCallback is not called. As soon as the engine is restarted and the issue is narrowed down.

2018-04-09 22:17:17,554+02 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (org.ovirt.thread.pool-7-thread-28) [2cf16ee3] BaseAsyncTask::removeTaskFromDB: Removed task '2e48cbf5-c329-4fdf-b056-c301444eed09' from DataBase
2018-04-09 22:17:17,554+02 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-28) [2cf16ee3] CommandAsyncTask::HandleEndActionResult [within thread]: Removing CommandMultiAsyncTasks object for entity 'df074505-3e32-4e60-a420-0f05790451b7'

[1]:
created_at                    | command_id                           | command_params_class                                                       | status
------------------------------+--------------------------------------+----------------------------------------------------------------------------+--------
2018-04-09 22:16:55.144181+02 | 19c9ad42-bbb0-48a3-b676-b5047ef3aba4 | org.ovirt.engine.core.common.action.RestoreAllSnapshotsParameters          | ACTIVE
2018-04-09 22:21:52.167734+02 | 300f15ae-16eb-47c5-90bb-f943bb563d38 | org.ovirt.engine.core.common.action.CreateAllSnapshotsFromVmParameters     | ACTIVE
2018-04-10 08:32:56.480064+02 | 738468d3-01c8-4944-bbfb-bbcb5702741f | org.ovirt.engine.core.common.action.AttachUserToVmFromPoolAndRunParameters | ACTIVE
2018-04-09 22:19:52.087792+02 | 85901b1d-38bb-4341-96ab-90833b1e208e | org.ovirt.engine.core.common.action.RestoreAllSnapshotsParameters          | ACTIVE
2018-04-09 22:21:51.62691+02  | 5baf78a0-2d7f-4163-b13d-16374a670216 | org.ovirt.engine.core.common.action.RunVmParams                            | ACTIVE
2018-04-09 22:15:38.618483+02 | bc72a993-bfbd-48c2-95eb-f7c9e9d28a76 | org.ovirt.engine.core.common.action.RestoreAllSnapshotsParameters          | ACTIVE
2018-04-09 22:17:57.998023+02 | 288697b8-c31f-4e47-9da1-0ac9b6e506cb | org.ovirt.engine.core.common.action.RestoreAllSnapshotsParameters          | ACTIVE
2018-04-09 22:19:00.653187+02 | 1e606b62-dd48-4d74-be8c-2169caa634da | org.ovirt.engine.core.common.action.RestoreAllSnapshotsParameters          | ACTIVE
2018-04-09 22:20:56.313792+02 | b3ebec70-6a30-41a4-a7d7-f42fb422b25f | org.ovirt.engine.core.common.action.RestoreAllSnapshotsParameters          | ACTIVE
...
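For triage, it can help to summarize which command types are stuck. The following is only a rough sketch, not part of the engine: it parses the pipe-separated rows from listing [1] and counts the ACTIVE entries per parameter class. The helper names parse_rows and count_by_class are hypothetical, and the sample holds only the rows visible above (the original listing is truncated).

```python
from collections import Counter

# Rows copied from listing [1]; the original output is truncated ("..."),
# so this sample is incomplete by construction.
SAMPLE = """\
created_at | command_id | command_params_class | status
2018-04-09 22:16:55.144181+02 | 19c9ad42-bbb0-48a3-b676-b5047ef3aba4 | org.ovirt.engine.core.common.action.RestoreAllSnapshotsParameters | ACTIVE
2018-04-09 22:21:52.167734+02 | 300f15ae-16eb-47c5-90bb-f943bb563d38 | org.ovirt.engine.core.common.action.CreateAllSnapshotsFromVmParameters | ACTIVE
2018-04-10 08:32:56.480064+02 | 738468d3-01c8-4944-bbfb-bbcb5702741f | org.ovirt.engine.core.common.action.AttachUserToVmFromPoolAndRunParameters | ACTIVE
2018-04-09 22:19:52.087792+02 | 85901b1d-38bb-4341-96ab-90833b1e208e | org.ovirt.engine.core.common.action.RestoreAllSnapshotsParameters | ACTIVE
2018-04-09 22:21:51.62691+02 | 5baf78a0-2d7f-4163-b13d-16374a670216 | org.ovirt.engine.core.common.action.RunVmParams | ACTIVE
2018-04-09 22:15:38.618483+02 | bc72a993-bfbd-48c2-95eb-f7c9e9d28a76 | org.ovirt.engine.core.common.action.RestoreAllSnapshotsParameters | ACTIVE
2018-04-09 22:17:57.998023+02 | 288697b8-c31f-4e47-9da1-0ac9b6e506cb | org.ovirt.engine.core.common.action.RestoreAllSnapshotsParameters | ACTIVE
2018-04-09 22:19:00.653187+02 | 1e606b62-dd48-4d74-be8c-2169caa634da | org.ovirt.engine.core.common.action.RestoreAllSnapshotsParameters | ACTIVE
2018-04-09 22:20:56.313792+02 | b3ebec70-6a30-41a4-a7d7-f42fb422b25f | org.ovirt.engine.core.common.action.RestoreAllSnapshotsParameters | ACTIVE
"""


def parse_rows(text):
    """Parse psql-style 'a | b | c | d' rows into dicts (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Skip the header, separator lines, and anything without four columns.
        if len(parts) != 4 or parts[0] == "created_at":
            continue
        created_at, command_id, params_class, status = parts
        rows.append({
            "created_at": created_at,
            "command_id": command_id,
            "params_class": params_class,
            "status": status,
        })
    return rows


def count_by_class(rows, status="ACTIVE"):
    """Count rows with the given status, keyed by the short class name."""
    return Counter(
        r["params_class"].rsplit(".", 1)[-1]
        for r in rows
        if r["status"] == status
    )


if __name__ == "__main__":
    for name, n in count_by_class(parse_rows(SAMPLE)).most_common():
        print(f"{n:3d}  {name}")
```

Among the rows visible in the listing, most of the stuck commands use RestoreAllSnapshotsParameters, which matches the observation that pool VMs cannot be pre-started.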
Without thread dumps it is hard to pinpoint the issue. A single non-responsive hypervisor should not impact the stability of the system. If the issue occurs again and we get the thread dumps, I can look into it further.
Lucie, could you please try to reproduce?
I did not succeed in reproducing the issue. I tried many combinations of creating a pool with prestarted VMs. Every time the host was slowed down (I used https://gist.github.com/obscurerichard/3740206), the commands waited until the host was up again and then finished successfully, even after the engine was restarted. With a non-responsive host, the commands did not appear and the VMs were not started.
Sorry, I forgot to mention the versions I tested with: engine ovirt-engine-4.1.11.2-0.1.el7.noarch and host vdsm-4.20.9.3-1.el7ev.x86_64.
Roman, could we close the bug as WORKSFORME and reopen it if the bug is reproduced and a thread dump is provided?
Sure
Feel free to reopen when the issue is reproduced, and provide a thread dump to enable further investigation.