Bug 1544692 - Commands in command_entities table are not cleared if CreateSnapshotVDSCommand failed while starting the Pool VMs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.1.3
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.1
Target Release: ---
Assignee: Shmuel Melamud
QA Contact: Israel Pinto
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-02-13 10:06 UTC by nijin ashok
Modified: 2021-09-09 13:11 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-15 17:48:28 UTC
oVirt Team: Virt
Target Upstream Version:


Links
System                  ID              Private  Priority  Status  Summary                                     Last Updated
Red Hat Issue Tracker   RHV-43479       0        None      None    None                                        2021-09-09 13:11:54 UTC
Red Hat Product Errata  RHEA-2018:1488  0        None      None    None                                        2018-05-15 17:49:44 UTC
oVirt gerrit            85286           0        None      MERGED  core: Extend compensation use to callbacks  2020-03-04 12:34:57 UTC

Description nijin ashok 2018-02-13 10:06:14 UTC
Description of problem:

If CreateSnapshotVDSCommand fails while creating a stateless snapshot, the corresponding commands in the command_entities table are not cleared. In a large environment with many pool VMs, the table can grow very large and can even prevent the engine service from coming up.

In the affected environment it grew to more than 18,000 rows.

===
engine=> select count(*) from command_entities ;
 count 
-------
 18620
(1 row)
===

The majority of them have the status "ENDED_WITH_FAILURE".

===
engine=> select count(*) from command_entities where status = 'ENDED_WITH_FAILURE' ;
 count 
-------
 15615
===

All of these relate to the snapshot operation that is part of stateless VM startup.

===
engine=> select  count(*) from command_entities where command_params_class = 'org.ovirt.engine.core.common.action.RunVmParams';
 count 
-------
  5199
(1 row)

engine=> select count(*) from command_entities where command_params_class = 'org.ovirt.engine.core.common.action.ImagesActionsParametersBase';
 count 
-------
  5199
(1 row)

engine=> select  count(*) from command_entities where command_params_class = 'org.ovirt.engine.core.common.action.CreateAllSnapshotsFromVmParameters';
 count 
-------
  5199
(1 row)

===
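
The three per-class counts above can be gathered in one pass with a GROUP BY. The sketch below is illustrative only: it uses Python's sqlite3 with an in-memory stand-in table, whereas the engine database queried above is PostgreSQL. Only the table name and the status and command_params_class columns come from this report; the sample rows, the "ACTIVE" status value, and the helper name count_by_class_and_status are hypothetical.

```python
import sqlite3

# In-memory stand-in for the engine database (assumption: the real one is
# PostgreSQL; only the two columns named in this report are modeled).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE command_entities (status TEXT, command_params_class TEXT)"
)

PKG = "org.ovirt.engine.core.common.action."
sample_rows = [  # hypothetical data, not taken from the affected system
    ("ENDED_WITH_FAILURE", PKG + "RunVmParams"),
    ("ENDED_WITH_FAILURE", PKG + "RunVmParams"),
    ("ENDED_WITH_FAILURE", PKG + "ImagesActionsParametersBase"),
    ("ENDED_WITH_FAILURE", PKG + "CreateAllSnapshotsFromVmParameters"),
    ("ACTIVE", PKG + "RunVmParams"),
]
conn.executemany("INSERT INTO command_entities VALUES (?, ?)", sample_rows)


def count_by_class_and_status(connection):
    """Return (command_params_class, status, count) rows, largest first."""
    query = """
        SELECT command_params_class, status, COUNT(*) AS n
        FROM command_entities
        GROUP BY command_params_class, status
        ORDER BY n DESC, command_params_class, status
    """
    return connection.execute(query).fetchall()


for params_class, status, n in count_by_class_and_status(conn):
    print(f"{n:6d}  {status:20s}  {params_class}")
```

The SELECT statement itself is plain SQL with no SQLite-specific syntax, so it should be usable at the engine=> prompt as well.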

Because of this, if you restart the engine service, it gets stuck at "Start initializing CommandCallbacksPoller" and times out after 5 minutes (300 seconds).

2018-02-07 10:54:27,995+09 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[


If you delete all entries with the status "ENDED_WITH_FAILURE", the service starts up fine.
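
A cautious form of that cleanup runs the deletion inside a transaction and rolls it back if more rows were affected than expected. This is a minimal sketch, using Python's sqlite3 as a stand-in for the PostgreSQL engine database. The helper name purge_failed_commands, the expected_max parameter, and the sample rows are hypothetical. This report does not say whether other tables reference these rows, so take a database backup before trying anything similar on a real engine.

```python
import sqlite3


def purge_failed_commands(connection, expected_max):
    """Delete ENDED_WITH_FAILURE rows; roll back if more than expected_max.

    Returns the number of rows deleted (0 if the deletion was rolled back).
    """
    cur = connection.cursor()
    cur.execute(
        "DELETE FROM command_entities WHERE status = ?",
        ("ENDED_WITH_FAILURE",),
    )
    deleted = cur.rowcount
    if deleted > expected_max:
        connection.rollback()  # more rows than anticipated: undo and inspect
        return 0
    connection.commit()
    return deleted


# Stand-in table with hypothetical rows (the real database is PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE command_entities (status TEXT)")
conn.executemany(
    "INSERT INTO command_entities VALUES (?)",
    [("ENDED_WITH_FAILURE",)] * 3 + [("ACTIVE",)],
)
conn.commit()

print(purge_failed_commands(conn, expected_max=10))
print(conn.execute("SELECT COUNT(*) FROM command_entities").fetchone()[0])
```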

Version-Release number of selected component (if applicable):

rhevm-4.1.3


How reproducible:

100% 

Steps to Reproduce:

This is reproducible in 4.1.8 as well.

1. Restart the vdsmd service on the SPM host while a pool VM's stateless snapshot is being created, so that the snapshot operation fails. Alternatively, clear the image's metadata so that snapshot creation fails with the error MetaDataKeyNotFoundError.

2. Each failure creates three entries in command_entities, and they stay there.

Actual results:

Failed snapshot operations for stateless VMs leave entries in command_entities, which prevents the engine service from starting.

Comment 3 Shmuel Melamud 2018-02-21 22:25:19 UTC
Testing shows that the bug is not reproducible on the master branch. Error messages appear in the log as described in the bug description, but records in the command_entities table are cleared correctly.

Comment 4 Shmuel Melamud 2018-02-26 11:54:02 UTC
Patch 85286 in Gerrit, "core: Extend compensation use to callbacks", has extended the compensation mechanism to compensate correctly when the 'sync' part of a command succeeds but one of the child commands fails. It also fixed this bug in the 4.2 branch, performing a correct cleanup when the CreateAllSnapshotsFromVm child command of RunVm fails.
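
The behaviour described in this comment can be illustrated with a toy model. A parent command persists entries for itself and its children, its synchronous part succeeds, and a child later fails in the callback phase, at which point compensation removes what the command tree persisted. This is not the ovirt-engine code: the classes CommandStore and ToyCommand, all method names, the "ACTIVE" status, and the leaf command name "CreateSnapshot" are hypothetical. Only RunVm, CreateAllSnapshotsFromVm and ENDED_WITH_FAILURE are names taken from this bug.

```python
class CommandStore:
    """Hypothetical stand-in for the command_entities table."""

    def __init__(self):
        self.rows = {}  # command name -> status

    def persist(self, name, status):
        self.rows[name] = status

    def remove(self, name):
        self.rows.pop(name, None)


class ToyCommand:
    """Hypothetical command with a sync part and a later callback phase."""

    def __init__(self, name, store, children=(), fails_in_callback=False):
        self.name = name
        self.store = store
        self.children = list(children)
        self.fails_in_callback = fails_in_callback

    def execute_sync(self):
        # The synchronous part only records the command tree as active.
        self.store.persist(self.name, "ACTIVE")
        for child in self.children:
            child.execute_sync()

    def run_callback(self, compensate_on_failure):
        """Return True on success; optionally clean up on failure."""
        children_ok = all(
            [c.run_callback(compensate_on_failure) for c in self.children]
        )
        if self.fails_in_callback or not children_ok:
            self.store.persist(self.name, "ENDED_WITH_FAILURE")
            if compensate_on_failure:
                self.compensate()
            return False
        self.store.remove(self.name)  # normal completion clears the entry
        return True

    def compensate(self):
        # Undo what this command and its children persisted.
        for child in self.children:
            child.compensate()
        self.store.remove(self.name)


def run_stateless_vm(compensate):
    """Run the toy command tree and return the rows left in the store."""
    store = CommandStore()
    snapshot = ToyCommand("CreateSnapshot", store, fails_in_callback=True)
    create_all = ToyCommand("CreateAllSnapshotsFromVm", store, [snapshot])
    run_vm = ToyCommand("RunVm", store, [create_all])
    run_vm.execute_sync()
    run_vm.run_callback(compensate)
    return store.rows


print(run_stateless_vm(compensate=False))
print(run_stateless_vm(compensate=True))
```

In this model the first call leaves three ENDED_WITH_FAILURE entries behind, and the second call leaves the store empty.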

Comment 5 Israel Pinto 2018-03-05 14:20:44 UTC
Verified with:
Software Version: 4.2.2.2-0.1.el7

Steps:
1. Create a VM pool with 20 VMs.
2. Restart the vdsmd service on the SPM host.
   VM pool creation fails.
3. Check the DB:
   select count(*) from command_entities ;

Results:
Commands are cleared.

engine=#  select count(*) from command_entities ;
 count 
-------
    14
(1 row)

engine=#  select count(*) from command_entities ;
 count 
-------
    14
(1 row)

engine=#  select count(*) from command_entities ;
 count 
-------
     1
(1 row)
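
The repeated counts above amount to polling the table until the leftover commands drain. A small helper can automate that check. In this sketch count_fn is any callable that returns the current row count (for example, one wrapping the select count(*) query). The helper name wait_until_drained, its parameters, and the threshold of 1 are hypothetical, and the simulated counts simply replay the three results shown in this comment.

```python
import time


def wait_until_drained(count_fn, threshold, attempts=10, delay_seconds=0.0):
    """Poll count_fn until it returns a value <= threshold.

    Returns the list of counts observed; the last one is <= threshold only
    if the table drained within the allowed number of attempts.
    """
    observed = []
    for _ in range(attempts):
        count = count_fn()
        observed.append(count)
        if count <= threshold:
            break
        time.sleep(delay_seconds)
    return observed


# Simulated counts replaying the three query results in this comment.
simulated = iter([14, 14, 1])
history = wait_until_drained(lambda: next(simulated), threshold=1)
print(history)
```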

Comment 11 errata-xmlrpc 2018-05-15 17:48:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

Comment 12 Franta Kust 2019-05-16 13:06:36 UTC
BZ<2>Jira Resync

