Description of problem:

If CreateSnapshotVDSCommand fails while creating a stateless snapshot, the corresponding commands in command_entities are not cleared. In a large environment with many pool VMs, the table can grow very large and can even prevent the engine service from coming up. Here it grew very large.

===
engine=> select count(*) from command_entities ;
 count
-------
 18620
(1 row)
===

The majority of them have the status "ENDED_WITH_FAILURE".

===
engine=> select count(*) from command_entities where status = 'ENDED_WITH_FAILURE' ;
 count
-------
 15615
===

All of these are related to the snapshot operation that is part of stateless VM startup.

===
engine=> select count(*) from command_entities where command_params_class = 'org.ovirt.engine.core.common.action.RunVmParams';
 count
-------
  5199
(1 row)

engine=> select count(*) from command_entities where command_params_class = 'org.ovirt.engine.core.common.action.ImagesActionsParametersBase';
 count
-------
  5199
(1 row)

engine=> select count(*) from command_entities where command_params_class = 'org.ovirt.engine.core.common.action.CreateAllSnapshotsFromVmParameters';
 count
-------
  5199
(1 row)
===

Because of this, if you restart the engine service, it gets stuck at "Start initializing CommandCallbacksPoller" and times out after 5 minutes.

2018-02-07 10:54:27,995+09 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[

If you delete all the entries that have the status "ENDED_WITH_FAILURE", the service starts up fine.

Version-Release number of selected component (if applicable):

rhevm-4.1.3

How reproducible:

100%

Steps to Reproduce:

This is reproducible in 4.1.8 as well.

1. Restart the vdsmd service on the SPM host while a pool VM's stateless snapshot is being created, so that the snapshot operation fails. Another option is to clear the image's metadata so that snapshot creation fails with the error MetaDataKeyNotFoundError.
2. Each failure creates three entries in command_entities, and they stay there.

Actual results:

A failure in the snapshot operation of a stateless VM leaves stale entries in command_entities, which prevents the engine service from starting.
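The accumulation and the manual cleanup described above can be sketched without touching an engine database. The following is only an illustration under stated assumptions: it uses an in-memory SQLite table as a stand-in for command_entities, models only the two columns queried in this report (status and command_params_class), and assumes that all three rows left by one failed stateless start end up as ENDED_WITH_FAILURE, which the counts suggest but the report does not state outright. The helper names and the SOME_OTHER_STATUS value are hypothetical.

```python
import sqlite3

PKG = "org.ovirt.engine.core.common.action."
# The three parameter classes that appear once per failed stateless start.
PARAM_CLASSES = [
    PKG + "RunVmParams",
    PKG + "ImagesActionsParametersBase",
    PKG + "CreateAllSnapshotsFromVmParameters",
]


def make_stand_in_table():
    """Create an in-memory stand-in; the real table has more columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE command_entities (status TEXT, command_params_class TEXT)"
    )
    return conn


def record_failed_stateless_start(conn):
    """Hypothetical helper: add the three rows one failure leaves behind."""
    conn.executemany(
        "INSERT INTO command_entities VALUES ('ENDED_WITH_FAILURE', ?)",
        [(cls,) for cls in PARAM_CLASSES],
    )


def breakdown(conn):
    """Count rows per status and parameter class (diagnostic query)."""
    return conn.execute(
        "SELECT status, command_params_class, count(*) "
        "FROM command_entities "
        "GROUP BY status, command_params_class "
        "ORDER BY command_params_class"
    ).fetchall()


def purge_failed(conn):
    """Delete only the ENDED_WITH_FAILURE rows and return how many went."""
    cur = conn.execute(
        "DELETE FROM command_entities WHERE status = 'ENDED_WITH_FAILURE'"
    )
    return cur.rowcount


if __name__ == "__main__":
    conn = make_stand_in_table()
    # One unrelated row with a placeholder status that must survive the purge.
    conn.execute(
        "INSERT INTO command_entities VALUES ('SOME_OTHER_STATUS', ?)",
        (PARAM_CLASSES[0],),
    )
    for _ in range(5):
        record_failed_stateless_start(conn)
    for row in breakdown(conn):
        print(row)
    print("deleted:", purge_failed(conn))
    print("left:", conn.execute("SELECT count(*) FROM command_entities").fetchone()[0])
```

On a real engine database, the equivalent DELETE should be run only with the engine service stopped and after taking a backup, ideally inside a transaction so that the row count can be checked before committing.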
Testing shows that the bug is not reproducible on the master branch. Error messages appear in the log as described in the bug description, but the records in the command_entities table are cleared correctly.
Patch 85286 in Gerrit, "core: Extend compensation use to callbacks", extended the compensation mechanism so that it compensates correctly when the 'sync' part of a command succeeds but one of its child commands fails. It also fixed this bug in the 4.2 branch by performing a correct cleanup when the CreateAllSnapshotsFromVm child command of RunVm fails.
Verified with:

Software Version: 4.2.2.2-0.1.el7

Steps:
1. Create a VM pool with 20 VMs.
2. Restart the vdsmd service on the SPM. VM pool creation failed.
3. Check the DB: select count(*) from command_entities ;

Results:

The commands are cleaned up.

engine=# select count(*) from command_entities ;
 count
-------
    14
(1 row)

engine=# select count(*) from command_entities ;
 count
-------
    14
(1 row)

engine=# select count(*) from command_entities ;
 count
-------
     1
(1 row)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:1488