Bug 1544692 - Commands in command_entities table is not cleared if CreateSnapshotVDSCommand failed while starting the Pool VMs
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.1.3
Hardware: All Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.1
Target Release: ---
Assigned To: Shmuel Melamud
QA Contact: Israel Pinto
Reported: 2018-02-13 05:06 EST by nijin ashok
Modified: 2018-05-15 13:49 EDT
CC List: 12 users

Doc Type: No Doc Update
Last Closed: 2018-05-15 13:48:28 EDT
Type: Bug
oVirt Team: Virt

External Trackers
Tracker                 ID              Priority  Status  Summary  Last Updated
oVirt gerrit            85286           None      None    None     2018-02-26 06:54 EST
Red Hat Product Errata  RHEA-2018:1488  None      None    None     2018-05-15 13:49 EDT

Description nijin ashok 2018-02-13 05:06:14 EST
Description of problem:

If CreateSnapshotVDSCommand fails while creating a stateless snapshot, the commands in the command_entities table are not cleared. In a large environment with many pool VMs, the table can grow very large and can even prevent the engine service from coming up.

In this environment it grew very large:

===
engine=> select count(*) from command_entities ;
 count 
-------
 18620
(1 row)
===

The majority of them have the status "ENDED_WITH_FAILURE".

===
engine=> select count(*) from command_entities where status = 'ENDED_WITH_FAILURE' ;
 count 
-------
 15615
===

All of these are related to the snapshot operation that is part of stateless VM startup.

===
engine=> select  count(*) from command_entities where command_params_class = 'org.ovirt.engine.core.common.action.RunVmParams';
 count 
-------
  5199
(1 row)

engine=> select count(*) from command_entities where command_params_class = 'org.ovirt.engine.core.common.action.ImagesActionsParametersBase';
 count 
-------
  5199
(1 row)

engine=> select  count(*) from command_entities where command_params_class = 'org.ovirt.engine.core.common.action.CreateAllSnapshotsFromVmParameters';
 count 
-------
  5199
(1 row)

===
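
The three per-class counts above can also be collected in one pass by grouping on status and command_params_class. The sketch below runs such a query from Python against an in-memory SQLite table that stands in for the engine's PostgreSQL database. The two-column schema, the sample rows, and the SAMPLE_OTHER_STATUS value are assumptions made for illustration; only the table name, the column names, and the class names come from this report.

```python
import sqlite3

# Assumption: in-memory SQLite stand-in for the engine's PostgreSQL database.
# Only the two columns used in this report are modelled.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE command_entities (status TEXT, command_params_class TEXT)"
)

PKG = "org.ovirt.engine.core.common.action."
FAILED_CLASSES = [
    PKG + "RunVmParams",
    PKG + "ImagesActionsParametersBase",
    PKG + "CreateAllSnapshotsFromVmParameters",
]

# Invented sample data: two failed pool VM starts (three rows each)
# plus one unrelated row with a made-up status value.
rows = [("ENDED_WITH_FAILURE", cls) for cls in FAILED_CLASSES] * 2
rows.append(("SAMPLE_OTHER_STATUS", PKG + "RunVmParams"))
conn.executemany("INSERT INTO command_entities VALUES (?, ?)", rows)

# One grouped query instead of one count(*) per class.
breakdown = conn.execute(
    """
    SELECT status, command_params_class, count(*) AS n
    FROM command_entities
    GROUP BY status, command_params_class
    ORDER BY n DESC, command_params_class
    """
).fetchall()

for status, params_class, n in breakdown:
    print(f"{n:6d}  {status:20s}  {params_class}")
```

The SELECT statement itself uses only standard SQL, so it should also be usable as-is at the engine=> prompt.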

Because the table is this large, the engine service gets stuck at "Start initializing CommandCallbacksPoller" when it is restarted and times out after 5 minutes.

2018-02-07 10:54:27,995+09 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[


If you delete all the entries that have the status "ENDED_WITH_FAILURE", the service starts up fine.
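
A minimal sketch of that cleanup, again on an in-memory SQLite stand-in rather than the real engine database. The single-column schema, the sample rows, and the SAMPLE_OTHER_STATUS value are invented for illustration. The sketch counts the matching rows first and deletes inside a transaction, so the number of removed rows can be checked against the expected count. On a real system one would presumably stop the engine and take a database backup beforehand.

```python
import sqlite3

# Assumption: in-memory SQLite stand-in for the engine's PostgreSQL database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE command_entities (status TEXT)")
# Invented sample data: six stale rows and two rows that must survive.
conn.executemany(
    "INSERT INTO command_entities VALUES (?)",
    [("ENDED_WITH_FAILURE",)] * 6 + [("SAMPLE_OTHER_STATUS",)] * 2,
)

FAILED = "ENDED_WITH_FAILURE"

# Count first, so the delete can be compared against an expected number.
to_delete = conn.execute(
    "SELECT count(*) FROM command_entities WHERE status = ?", (FAILED,)
).fetchone()[0]

# "with conn" commits on success and rolls back if an exception is raised.
with conn:
    deleted = conn.execute(
        "DELETE FROM command_entities WHERE status = ?", (FAILED,)
    ).rowcount

remaining = conn.execute("SELECT count(*) FROM command_entities").fetchone()[0]
print(f"matched={to_delete} deleted={deleted} remaining={remaining}")
# prints: matched=6 deleted=6 remaining=2
```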

Version-Release number of selected component (if applicable):

rhevm-4.1.3


How reproducible:

100% 

Steps to Reproduce:

This is reproducible in 4.1.8 as well.

1. Restart the vdsmd service on the SPM host while a pool VM's stateless snapshot is being created, so that the snapshot operation fails. Alternatively, clear the image's metadata so that snapshot creation fails with the error MetaDataKeyNotFoundError.

2. Each failure creates three entries in the command_entities table, and they stay there.
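
As a rough consistency check on the "three entries per failure" observation, the counts from the description can be compared directly. The sketch assumes that every row counted under the three parameter classes has the status ENDED_WITH_FAILURE, which the queries in the description do not establish; the variable names are illustrative.

```python
# Counts copied from the queries in the description.
total_rows = 18620
failed_rows = 15615
per_class = {
    "RunVmParams": 5199,
    "ImagesActionsParametersBase": 5199,
    "CreateAllSnapshotsFromVmParameters": 5199,
}

# Three rows per failed start: one per parameter class.
rows_from_failed_starts = sum(per_class.values())

# Assumption: all of those rows are in ENDED_WITH_FAILURE.
other_failed_rows = failed_rows - rows_from_failed_starts
not_failed_rows = total_rows - failed_rows

print(rows_from_failed_starts)  # 15597
print(other_failed_rows)        # 18
print(not_failed_rows)          # 3005
```

Under that assumption, 5199 failed starts account for all but 18 of the rows in ENDED_WITH_FAILURE.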

Actual results:

Failures in the snapshot operation of stateless VMs prevent the engine service from starting.
Comment 3 Shmuel Melamud 2018-02-21 17:25:19 EST
Testing shows that the bug is not reproducible on the master branch. Error messages appear in the log as described in the bug description, but the records in the command_entities table are cleared correctly.
Comment 4 Shmuel Melamud 2018-02-26 06:54:02 EST
Patch 85286 in Gerrit, "core: Extend compensation use to callbacks", extended the compensation mechanism so that it compensates correctly when the 'sync' part of a command succeeds but one of its child commands fails. It also fixed this bug in the 4.2 branch by performing a correct cleanup when the CreateAllSnapshotsFromVm child command of RunVm fails.
Comment 5 Israel Pinto 2018-03-05 09:20:44 EST
Verified with:
Software Version: 4.2.2.2-0.1.el7

Steps:
1. Create a VM pool with 20 VMs.
2. Restart the vdsmd service on the SPM host.
   VM pool creation failed.
3. Check the DB:
   select count(*) from command_entities ;

Results:
The commands are cleaned up.

engine=#  select count(*) from command_entities ;
 count 
-------
    14
(1 row)

engine=#  select count(*) from command_entities ;
 count 
-------
    14
(1 row)

engine=#  select count(*) from command_entities ;
 count 
-------
     1
(1 row)
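
The repeated count query above could be automated with a small polling helper. This is a sketch only: wait_until_cleared and its parameters are hypothetical names, the threshold of 1 remaining row is taken from the last sample above rather than from any documented baseline, and the database call is replaced by a canned sequence of the three counts shown.

```python
import time
from typing import Callable, List


def wait_until_cleared(
    get_count: Callable[[], int],
    max_remaining: int,
    attempts: int = 10,
    delay_s: float = 5.0,
) -> List[int]:
    """Poll get_count until it returns at most max_remaining.

    Returns the samples seen; raises TimeoutError if the count never drops.
    """
    samples: List[int] = []
    for _ in range(attempts):
        samples.append(get_count())
        if samples[-1] <= max_remaining:
            return samples
        time.sleep(delay_s)
    raise TimeoutError(f"still {samples[-1]} rows after {attempts} polls")


# Stand-in for "select count(*) from command_entities": replays the three
# counts from the verification output instead of querying a database.
readings = iter([14, 14, 1])
samples = wait_until_cleared(lambda: next(readings), max_remaining=1, delay_s=0)
print(samples)  # [14, 14, 1]
```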
Comment 11 errata-xmlrpc 2018-05-15 13:48:28 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488
