+++ This bug is a downstream clone. The original bug is: +++ +++ bug 1477042 +++ ====================================================================== Description of problem: Engine won't start, environment is down. Apparently a successful snapshot removal, but there were entries in the Database triggering a callback for a command that doesn't exist, so the engine fails to start. The snapshot: 2017-06-27 17:05:11,157 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.MergeSnapshotsVDSCommand] (default task-48) [231f337f] START, MergeSnapshotsVDSCommand( MergeSnapshotsVDSCommandParameters:{runAsync='true', storagePoolId='7fa2b154-e990-4f7d-bf4d-de30026ccedf', ignoreFailoverLimit='false', storageDomainId='cb2f38db-5560-4762-b700-8e13c55a1e30', imageGroupId='ee7838f0-263e-477b-ac30-157472812b7b', imageId='7abcc25c-a459-4452-be5c-5369284a8d95', imageId2='2fac3f76-bf64-4073-a16a-98ce71958c51', vmId='9a254614-ce1a-40d3-9ca2-3955372653b1', postZero='false'}), log id: 3babecb0 2017-06-27 17:05:12,271 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (default task-48) [231f337f] CommandAsyncTask::Adding CommandMultiAsyncTasks object for command 'de74fa43-2210-4cb8-811f-5b76b23bf107' 2017-06-27 17:05:12,271 INFO [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (default task-48) [231f337f] CommandMultiAsyncTasks::attachTask: Attaching task 'f0837f50-63b8-4dfd-9200-83a4a20a5a68' to command 'de74fa43-2210-4cb8-811f-5b76b23bf107'. 2017-06-27 17:05:13,156 INFO [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (default task-48) [3a647aaa] CommandMultiAsyncTasks::attachTask: Attaching task 'c8498212-6567-4c66-912e-459e77fb9c56' to command 'de74fa43-2210-4cb8-811f-5b76b23bf107'. 2017-06-27 17:19:53,306 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler4) [3aea9bbc] SPMAsyncTask::PollTask: Polling task 'f0837f50-63b8-4dfd-9200-83a4a20a5a68' (Parent Command 'RemoveSnapshot', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') returned status 'finished', result 'success'. 2017-06-27 18:14:24,065 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler8) [18d5ac84] SPMAsyncTask::PollTask: Polling task 'c8498212-6567-4c66-912e-459e77fb9c56' (Parent Command 'RemoveSnapshot', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') returned status 'finished', result 'success'. 2017-06-27 18:14:24,069 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (DefaultQuartzScheduler8) [18d5ac84] CommandAsyncTask::endActionIfNecessary: All tasks of command 'de74fa43-2210-4cb8-811f-5b76b23bf107' has ended -> executing 'endAction' 2017-06-27 18:14:24,361 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-41) [537c0fab] Correlation ID: 537c0fab, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'XYZ' deletion for VM 'XYZ' has been completed. 2017-06-27 18:14:24,362 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (org.ovirt.thread.pool-6-thread-41) [537c0fab] SPMAsyncTask::ClearAsyncTask: Attempting to clear task 'f0837f50-63b8-4dfd-9200-83a4a20a5a68' 2017-06-27 18:14:25,390 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (org.ovirt.thread.pool-6-thread-41) [537c0fab] BaseAsyncTask::removeTaskFromDB: Removed task 'f0837f50-63b8-4dfd-9200-83a4a20a5a68' from DataBase 2017-06-27 18:14:25,391 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (org.ovirt.thread.pool-6-thread-41) [537c0fab] SPMAsyncTask::ClearAsyncTask: Attempting to clear task 'c8498212-6567-4c66-912e-459e77fb9c56' 2017-06-27 18:14:26,034 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (org.ovirt.thread.pool-6-thread-41) [537c0fab] BaseAsyncTask::removeTaskFromDB: Removed task 'c8498212-6567-4c66-912e-459e77fb9c56' from DataBase The above looks good to me. But 2 days later: 2017-06-29 19:05:11,479 WARN [org.ovirt.engine.core.bll.tasks.CommandsRepository] (DefaultQuartzScheduler3) [] Marking expired command as Failed: command 'RemoveSnapshotSingleDisk (dc957381-fd6d-41ed-bf77-ea761a5970a4)' that started at 'Tue Jun 27 17:05:11 JST 2017' has been marked as Failed. 2017-06-29 19:05:12,498 WARN [org.ovirt.engine.core.bll.tasks.CommandsRepository] (DefaultQuartzScheduler9) [] Marking expired command as Failed: command 'RemoveSnapshotSingleDisk (ae4fe47c-dd40-4a4d-82a9-e7248014fd68)' that started at 'Tue Jun 27 17:05:12 JST 2017' has been marked as Failed. There is no engine-restart between June 26th and June 29th. The first restart is on July 26th, when the customer wanted to change a engine-config parameter. I don't know why these entries below were still in the DB if snapshot command succeded. engine=> select command_id,parent_command_id,command_params_class,status,executed from command_entities ; -[ RECORD 1 ]--------+----------------------------------------------------------------------- command_id | dc957381-fd6d-41ed-bf77-ea761a5970a4 parent_command_id | de74fa43-2210-4cb8-811f-5b76b23bf107 command_params_class | org.ovirt.engine.core.common.action.RemoveSnapshotSingleDiskParameters status | FAILED executed | t -[ RECORD 2 ]--------+----------------------------------------------------------------------- command_id | ae4fe47c-dd40-4a4d-82a9-e7248014fd68 parent_command_id | de74fa43-2210-4cb8-811f-5b76b23bf107 command_params_class | org.ovirt.engine.core.common.action.RemoveSnapshotSingleDiskParameters status | FAILED executed | t -[ RECORD 3 ]--------+----------------------------------------------------------------------- command_id | de74fa43-2210-4cb8-811f-5b76b23bf107 parent_command_id | 00000000-0000-0000-0000-000000000000 command_params_class | org.ovirt.engine.core.common.action.RemoveSnapshotParameters status | ENDED_SUCCESSFULLY executed | t And these entries make the engine fail to start as there is no callback for RemoveSnapshotCommand: 2017-07-28 13:29:38,026 ERROR [org.ovirt.engine.core.bll.InitBackendServicesOnStartupBean] (ServerService Thread Pool -- 50) [537c0fab] Failed to initialize backend: org.jboss.weld.exceptions.WeldException: WELD-000049: Unable to invoke private void org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.init() on org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller@14522b5f [....] Caused by: java.lang.NullPointerException at org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand.getCallback(RemoveSnapshotCommand.java:493) [bll.jar:] Version-Release number of selected component (if applicable): ovirt-engine-4.0.7 (Originally by Germano Veit Michel)
Is this on track for 4.1.5? (Originally by Yaniv Kaul)
(In reply to Yaniv Kaul from comment #2) > Is this on track for 4.1.5? I am looking into this now. Will update more once I have additional data. (Originally by Ala Hino)
(In reply to Ala Hino from comment #3) > (In reply to Yaniv Kaul from comment #2) > > Is this on track for 4.1.5? > > I am looking into this now. Will update more once I have additional data. Thanks, keeping NEEDINFO for the time being. (Originally by Yaniv Kaul)
Cancelling the needinfo flag as this bug is MODIFIED now. (Originally by Ala Hino)
Ala, What are the steps to reproduce this? (Originally by Raz Tamir)
This one is a bit tricky to reproduce but maybe you can help. In order to reproduce, we need a situation where live merge step is ACTIVE and we were able to remove the VM. Afterwards, restart the engine. Without the fix, we would get a NPE, with the fix the engine should start. (Originally by Ala Hino)
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found clone flags: ['rhevm-4.1.z', 'rhevm-4.2-ga'], ] For more info please contact: rhv-devops: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found clone flags: ['rhevm-4.1.z', 'rhevm-4.2-ga'], ] For more info please contact: rhv-devops (Originally by rhev-integ)
This scenario cannot be reproduced and based on Comment 8 closing current release
Cancelling the needinfo flag as this bug is CLOSED now.