Bug 1544692

Summary: Commands in command_entities table is not cleared if CreateSnapshotVDSCommand failed while starting the Pool VMs
Product: Red Hat Enterprise Virtualization Manager Reporter: nijin ashok <nashok>
Component: ovirt-engine Assignee: Shmuel Melamud <smelamud>
Status: CLOSED ERRATA QA Contact: Israel Pinto <ipinto>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.1.3 CC: apinnick, lsurette, michal.skrivanek, mkalinin, nashok, rbalakri, Rhev-m-bugs, smelamud, srevivo, ykaul, ylavi
Target Milestone: ovirt-4.2.1   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-05-15 17:48:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description nijin ashok 2018-02-13 10:06:14 UTC
Description of problem:

If CreateSnapshotVDSCommand fails while creating a stateless snapshot, the related commands in the command_entities table are not cleared. In a large environment with many pool VMs, the table can grow very large and can even prevent the engine service from coming up.

In this environment it grew very large.

===
engine=> select count(*) from command_entities ;
 count 
-------
 18620
(1 row)
===

The majority of them have status "ENDED_WITH_FAILURE".

===
engine=> select count(*) from command_entities where status = 'ENDED_WITH_FAILURE' ;
 count 
-------
 15615
===

All of these are related to the snapshot operation that is part of stateless VM startup.

===
engine=> select  count(*) from command_entities where command_params_class = 'org.ovirt.engine.core.common.action.RunVmParams';
 count 
-------
  5199
(1 row)

engine=> select count(*) from command_entities where command_params_class = 'org.ovirt.engine.core.common.action.ImagesActionsParametersBase';
 count 
-------
  5199
(1 row)

engine=> select  count(*) from command_entities where command_params_class = 'org.ovirt.engine.core.common.action.CreateAllSnapshotsFromVmParameters';
 count 
-------
  5199
(1 row)

===
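The three per-class counts above can also be collected in one pass by grouping on command_params_class and status. The following is a minimal sketch, not output from the affected system: it uses an in-memory SQLite table as a stand-in for the engine's PostgreSQL database, models only the two columns used in this report, and fills it with hypothetical sample rows ("OTHER_STATUS" is a placeholder value). The SELECT statement itself is plain SQL and should run unchanged in psql.

```python
import sqlite3

# In-memory SQLite stand-in for the engine database (assumption for illustration).
# Only the two columns used in this report are modeled; the real table has more.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE command_entities (command_params_class TEXT, status TEXT)")

# Hypothetical sample rows, not taken from the affected environment.
PKG = "org.ovirt.engine.core.common.action."
sample_rows = [
    (PKG + "RunVmParams", "ENDED_WITH_FAILURE"),
    (PKG + "ImagesActionsParametersBase", "ENDED_WITH_FAILURE"),
    (PKG + "CreateAllSnapshotsFromVmParameters", "ENDED_WITH_FAILURE"),
    (PKG + "RunVmParams", "OTHER_STATUS"),  # placeholder status value
]
conn.executemany("INSERT INTO command_entities VALUES (?, ?)", sample_rows)

# One row per (parameter class, status) pair, largest groups first.
BREAKDOWN_SQL = """
SELECT command_params_class, status, count(*) AS n
FROM command_entities
GROUP BY command_params_class, status
ORDER BY n DESC, command_params_class, status
"""

breakdown = conn.execute(BREAKDOWN_SQL).fetchall()
for params_class, status, n in breakdown:
    print(f"{n:6d}  {status:20s}  {params_class}")
```

Grouping by both columns shows at a glance which command types account for the failed rows, without running one query per class.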

Because of this, if you restart the engine service, it gets stuck at "Start initializing CommandCallbacksPoller" and times out after 5 minutes.

2018-02-07 10:54:27,995+09 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[


If you delete all the entries that have status "ENDED_WITH_FAILURE", the service starts up fine.
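A minimal sketch of that cleanup, run inside a transaction so that a failure rolls the delete back. This is an illustration only: it uses an in-memory SQLite table in place of the engine's PostgreSQL database, the helper name purge_failed_commands is hypothetical, the integer command_id is a simplification, and "OTHER_STATUS" is a placeholder. On a real system, taking a database backup before deleting anything is a sensible precaution.

```python
import sqlite3


def purge_failed_commands(conn):
    """Delete rows with status ENDED_WITH_FAILURE and return how many were removed."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "DELETE FROM command_entities WHERE status = 'ENDED_WITH_FAILURE'"
        )
        return cur.rowcount


# Stand-in table with hypothetical rows; the real table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE command_entities (command_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO command_entities VALUES (?, ?)",
    [(1, "ENDED_WITH_FAILURE"), (2, "ENDED_WITH_FAILURE"), (3, "OTHER_STATUS")],
)

removed = purge_failed_commands(conn)
remaining = conn.execute("SELECT count(*) FROM command_entities").fetchone()[0]
print(f"removed={removed} remaining={remaining}")
```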

Version-Release number of selected component (if applicable):

rhevm-4.1.3


How reproducible:

100% 

Steps to Reproduce:

This is reproducible in 4.1.8 as well.

1. Restart the vdsmd service on the SPM host while a pool VM's stateless snapshot is being created, so that the snapshot operation fails. Another option is to clear the image's metadata so that the snapshot creation fails with the error MetaDataKeyNotFoundError.

2. Each failure creates three entries in command_entities, and they stay there.
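As a rough cross-check, the three-entries-per-failure pattern can be compared with the counts in the description. The arithmetic below is a sketch under one assumption that the report implies but does not state: that every row of the three parameter classes belongs to a failed stateless startup.

```python
# Counts copied from the queries in the description.
TABLE_TOTAL = 18620            # all rows in command_entities
FAILED_TOTAL = 15615           # rows with status ENDED_WITH_FAILURE
ROWS_PER_PARAMS_CLASS = 5199   # rows for each of the three parameter classes
ENTRIES_PER_FAILURE = 3        # step 2: three entries per failed startup

# Assumption: every row of the three classes belongs to a failed startup.
rows_from_failed_startups = ENTRIES_PER_FAILURE * ROWS_PER_PARAMS_CLASS
other_failed_rows = FAILED_TOTAL - rows_from_failed_startups
rows_with_other_status = TABLE_TOTAL - FAILED_TOTAL

print(rows_from_failed_startups, other_failed_rows, rows_with_other_status)
```

Under that assumption, the three classes would account for nearly all of the ENDED_WITH_FAILURE rows.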

Actual results:

A failed snapshot operation for a stateless VM leaves entries in command_entities, and their accumulation prevents the engine service from starting.

Comment 3 Shmuel Melamud 2018-02-21 22:25:19 UTC
Testing shows that the bug is not reproducible on the master branch. Error messages appear in the log as described in the bug description, but records in the command_entities table are cleared correctly.

Comment 4 Shmuel Melamud 2018-02-26 11:54:02 UTC
Patch 85286 in Gerrit, "core: Extend compensation use to callbacks", has extended the compensation mechanism to compensate correctly when the 'sync' part of a command succeeds but one of the child commands fails. It also fixed this bug in the 4.2 branch by performing a correct cleanup when the CreateAllSnapshotsFromVm child command of RunVm fails.
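The difference in behavior can be illustrated with a toy model. This is not the engine's compensation code (which is Java) and does not reproduce the patch. It only mimics the observable effect described in this bug: three persisted entries per failed startup that either stay behind or are removed. All names here (CommandTable, run_vm, cleanup_on_child_failure, the "ImageSnapshot" entry, and the "IN_PROGRESS" status) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CommandTable:
    """Toy stand-in for command_entities: command id -> (root id, status)."""
    rows: dict = field(default_factory=dict)

    def persist(self, command_id, root_id, status):
        self.rows[command_id] = (root_id, status)

    def remove_tree(self, root_id):
        # Drop the root command and every child that points at it.
        self.rows = {cid: r for cid, r in self.rows.items() if r[0] != root_id}


def run_vm(table, vm, snapshot_ok, cleanup_on_child_failure):
    """Simulate a stateless VM start that persists three command entries."""
    root = f"RunVm:{vm}"
    children = [f"CreateAllSnapshotsFromVm:{vm}", f"ImageSnapshot:{vm}"]
    for command_id in [root] + children:
        table.persist(command_id, root, "IN_PROGRESS")  # placeholder status

    if snapshot_ok:
        table.remove_tree(root)
        return True

    # The parent's synchronous part succeeded, but a child command failed.
    for command_id in [root] + children:
        table.persist(command_id, root, "ENDED_WITH_FAILURE")
    if cleanup_on_child_failure:
        table.remove_tree(root)  # behavior after the fix, as described above
    return False


before_fix = CommandTable()
after_fix = CommandTable()
for vm in ("pool-vm-1", "pool-vm-2", "pool-vm-3"):
    run_vm(before_fix, vm, snapshot_ok=False, cleanup_on_child_failure=False)
    run_vm(after_fix, vm, snapshot_ok=False, cleanup_on_child_failure=True)

print(len(before_fix.rows), len(after_fix.rows))
```

In the model, the table without cleanup grows by three rows for every failed startup, while the table with cleanup stays empty.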

Comment 5 Israel Pinto 2018-03-05 14:20:44 UTC
Verified with:
Software Version: 4.2.2.2-0.1.el7

Steps:
1. Create a VM pool with 20 VMs
2. Restart the vdsmd service on the SPM host
   VM pool creation failed
3. Check DB: 
   select count(*) from command_entities ;

Results:
Commands are cleaned up

engine=#  select count(*) from command_entities ;
 count 
-------
    14
(1 row)

engine=#  select count(*) from command_entities ;
 count 
-------
    14
(1 row)

engine=#  select count(*) from command_entities ;
 count 
-------
     1
(1 row)
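The repeated count check above could be scripted as a small polling loop. This is only a sketch: the helper name wait_for_cleanup and its parameters are hypothetical, and an in-memory SQLite table stands in for the engine's PostgreSQL database.

```python
import sqlite3
import time


def wait_for_cleanup(conn, max_rows, attempts=5, delay_s=0.0):
    """Poll the row count until it is at most max_rows; return the counts seen."""
    counts = []
    for _ in range(attempts):
        n = conn.execute("SELECT count(*) FROM command_entities").fetchone()[0]
        counts.append(n)
        if n <= max_rows:
            break
        time.sleep(delay_s)
    return counts


# Stand-in table with hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE command_entities (command_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO command_entities VALUES (?, ?)",
    [(i, "ENDED_WITH_FAILURE") for i in range(14)],
)

stuck_counts = wait_for_cleanup(conn, max_rows=1, attempts=3)

# Simulate the engine clearing all but one entry, then poll again.
with conn:
    conn.execute("DELETE FROM command_entities WHERE command_id > 0")
cleared_counts = wait_for_cleanup(conn, max_rows=1, attempts=3)

print(stuck_counts, cleared_counts)
```

On a real system, a nonzero delay_s would be needed so that the engine has time to finish its cleanup between checks.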

Comment 11 errata-xmlrpc 2018-05-15 17:48:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

Comment 12 Franta Kust 2019-05-16 13:06:36 UTC
BZ<2>Jira Resync