Description of problem:
Creating a VM from a Template and restarting the engine while the VM is being created might cause the VM to stay in locked status forever.

Version-Release number of selected component (if applicable):
engine 4.0

How reproducible:
40% (2/5 restarts)

Steps to Reproduce:
1. Add a Template with disks
2. Create a VM from this Template
3. Restart the engine while the VM is being created

Actual results:
The VM stays in locked status forever.

Expected results:
The VM should be removed from the DB.

Additional info:
It looks like compensation is not forced for the AddVmTemplate command, so when the engine starts, the VM stays in the DB in locked status. This is part of the BZ1288852 verification scenarios; although BZ1288852 indicates the VM was removed, it might also stay in locked status.
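The compensation behavior described under "Additional info" can be illustrated with a toy model (plain Python, not engine code; all names here are hypothetical): a command records a snapshot of each entity before changing it, and on restart, compensation replays the snapshots in reverse. If compensation is skipped, the VM row is left behind in locked status.

```python
# Toy model of engine command compensation (illustration only, not oVirt code).
# A command snapshots entities before mutating them; on engine restart,
# compensation restores the snapshots. Skipping compensation (the bug reported
# here for AddVmTemplate) leaves the VM stuck in ImageLocked status.

db = {}         # stands in for the engine database
snapshots = []  # stands in for the compensation snapshot table

def add_vm_from_template(vm_id):
    # Record the pre-change state, then create the VM in locked status.
    snapshots.append((vm_id, dict(db.get(vm_id, {"exists": False}))))
    db[vm_id] = {"exists": True, "status": "ImageLocked"}

def compensate():
    # Replay snapshots in reverse: restore (or remove) each touched entity.
    while snapshots:
        vm_id, old = snapshots.pop()
        if old.get("exists"):
            db[vm_id] = old
        else:
            db.pop(vm_id, None)

add_vm_from_template("vm-1")
# Engine restarted mid-execute: with compensation the VM row is removed.
compensate()
print("vm-1" in db)  # → False; without compensation it would stay ImageLocked
```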
Created attachment 1116924 [details]
Logs01

Reproduced
==========
rhevm-3.6.2.6-0.1.el6.noarch
vdsm-4.17.18-0.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.2.x86_64

Steps to Reproduce:
===================
1. Add a Template with disks
2. Create a VM from this Template
3. Restart the engine while the VM is being created
What is the use case? Which disks was the VM created from? Can you please add a screenshot of the locked VM?
The setup is available for you to check (the details were already provided offline).
The fix for this bug relates to compensation when the engine is restarted in the middle of the execute phase, before any tasks have been started. It looks like you encountered another bug, related to the async task manager in the infrastructure. From the logs it looks like the reboot happened at the point where the task had already finished and was about to be cleaned.

Engine:

Adding the task and the command:
2016-01-21 09:30:26,715 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (ajp-/127.0.0.1:8702-6) [4093f7f3] CommandAsyncTask::Adding CommandMultiAsyncTasks object for command 'bb60b530-7060-4080-b9fa-5e06d3ab23a1'
...
2016-01-21 09:30:26,922 INFO [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (ajp-/127.0.0.1:8702-6) [4093f7f3] Adding task '80599970-da31-43c7-ae2d-533b651c8b21' (Parent Command 'CreateSnapshotFromTemplate', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters'), polling hasn't started yet..

Polling the task:
SPMAsyncTask::PollTask: Polling task '80599970-da31-43c7-ae2d-533b651c8b21' (Parent Command 'CreateSnapshotFromTemplate', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') returned status 'finished', result 'success'.

The task has now finished successfully and the engine tries to remove it:
BaseAsyncTask::onTaskEndSuccess: Task '80599970-da31-43c7-ae2d-533b651c8b21' (Parent Command 'CreateSnapshotFromTemplate', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended successfully.
At this point the engine was restarted, just before the task had changed its status in the DB:
2016-01-21 09:30:31,463 ERROR [org.ovirt.engine.core.dal.dbbroker.DbFacade] (DefaultQuartzScheduler_Worker-64) [] Can't find dao for interface org.ovirt.engine.core.dao.CommandEntityDao

Once the engine was restarted, it could not find the right constructor:
2016-01-21 09:32:01,648 ERROR [org.ovirt.engine.core.bll.CommandsFactory] (org.ovirt.thread.pool-7-thread-6) [563a4d4] Can't find constructor for type org.ovirt.engine.core.bll.CreateSnapshotFromTemplateCommand with parameter types: [class org.ovirt.engine.core.common.action.CreateSnapshotFromTemplateParameters]
which is why the task doesn't get cleared.

DB:

engine=# SELECT * FROM async_tasks;
task_id         | ec7b9e9e-ef02-4988-929d-b599c8d6a9ed
action_type     | 208
status          | 2
result          | 0
step_id         | abd7f4ec-6ff0-4dc8-88af-8fe720a43263
command_id      | bb60b530-7060-4080-b9fa-5e06d3ab23a1
started_at      | 2016-01-21 09:30:25.47+02
storage_pool_id | 00000001-0001-0001-0001-0000000000a8
task_type       | 3
vdsm_task_id    | 80599970-da31-43c7-ae2d-533b651c8b21
root_command_id | bb60b530-7060-4080-b9fa-5e06d3ab23a1
user_id         | 00000019-0019-0019-0019-0000000001f4
(1 row)

engine=# SELECT command_id, command_type, root_command_id, status, parent_command_id, job_id FROM command_entities WHERE command_type = 208;
command_id        | bb60b530-7060-4080-b9fa-5e06d3ab23a1
command_type      | 208
root_command_id   | f197726c-e378-443f-acb8-af9006de3481
status            | ACTIVE
parent_command_id | f197726c-e378-443f-acb8-af9006de3481
job_id            | 88696c0c-31ac-4793-b7e0-2dbc137b0ea7
(1 row)

Host:

[root@camel-vdsc ~]# vdsClient -s 0 getAllTasksStatuses
{'status': {'message': 'OK', 'code': 0}, 'allTasksStatus': {'80599970-da31-43c7-ae2d-533b651c8b21': {'message': '1 jobs completed successfully', 'code': 0, 'taskID': '80599970-da31-43c7-ae2d-533b651c8b21', 'taskResult': 'success', 'taskState': 'finished'}}}

I will open a new bug for this scenario and make this bug depend on it.
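The restart race above can be sketched with a toy model (plain Python, not engine code; the class names below are hypothetical stand-ins): after a restart the engine re-creates each persisted command from its parameters type, and if no matching constructor exists, the command's end action can never run, so the finished task is never cleared from async_tasks.

```python
# Toy model of the post-restart constructor lookup (illustration only, not
# oVirt code). Mimics the failure behind the engine-log error
# "Can't find constructor for type ... with parameter types: [...]".

class CreateSnapshotParams:
    """Hypothetical stand-in for the persisted parameters class."""

class BrokenCommand:
    """Hypothetical command class missing a (params) constructor."""
    def __init__(self):
        pass

class FixedCommand:
    """Same command after adding the constructor the factory needs."""
    def __init__(self, params):
        self.params = params

def rebuild_command(command_cls, params):
    # After a restart, re-create the command from its persisted parameters.
    # If there is no matching constructor, the command cannot be rebuilt,
    # its end action never runs, and the task is never cleared.
    try:
        return command_cls(params)
    except TypeError:
        return None

print(rebuild_command(BrokenCommand, CreateSnapshotParams()) is None)  # → True
print(rebuild_command(FixedCommand, CreateSnapshotParams()) is None)   # → False
```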
Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED status, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
Basically, this specific bug is a failure of the failure-handling scenario when the engine gets restarted, and this issue has always been there, though it was never reproduced before. Patch 51652 already fixed this issue, though BZ1300757 currently blocks it from being verified. Moving this to high priority to be correlated with BZ1300757, which it depends on.
Maor, I see that all patches are merged and the dependent bug is MODIFIED. Should this bug be moved to MODIFIED as well?
Maor, the patch backports go down to 3.6.2 as of the 11th of January. Are those the only patches, or was something missing that prevented you from moving this bug to MODIFIED? Because if nothing else is missing, it means this bug is already fixed in 3.6.2 and can be tested in that version instead of waiting for 3.6.5.
The bug is about a scenario where a VM is locked due to a compensation issue (before any tasks were running). The patches already fixed it in 3.6.2, and basically it should not have failed at all. But since there was another bug where a VM stayed in locked status because of running tasks (a different issue), this one was re-opened by mistake, so I opened a new bug and made this one dependent on it. The verification for this bug (compensation only, without running tasks) can be done on 3.6.2, although it might be difficult.
To make the verification easier, I suggest verifying both bugs on 3.6.5.
Bug 1300757 was moved to ON_QA, moving this one as well.
Verified with the following code:
---------------------------------------------
rhevm-3.6.3.2-0.1.el6.noarch
vdsm-4.17.21-0.el7ev.noarch

Verified with the following scenario:
---------------------------------------------
Steps to Reproduce:
===================
1. Add a Template with disks
2. Create a VM from this Template
3. Restart the engine during the VM creation operation
   a) When restarting the engine immediately during disk creation, before all the disks have been created >>>>> as expected, the VM is not created
   b) When restarting the engine after the VM disk creation has completed but while the VM is still locked >>>> the VM creation completes successfully

Moving to VERIFIED!