Bug 1544692

Summary:	Commands in command_entities table are not cleared if CreateSnapshotVDSCommand fails while starting the pool VMs
Product:	Red Hat Enterprise Virtualization Manager
Component:	ovirt-engine
Version:	4.1.3
Status:	CLOSED ERRATA
Severity:	high
Priority:	unspecified
Reporter:	nijin ashok <nashok>
Assignee:	Shmuel Melamud <smelamud>
QA Contact:	Israel Pinto <ipinto>
CC:	apinnick, lsurette, michal.skrivanek, mkalinin, nashok, rbalakri, Rhev-m-bugs, smelamud, srevivo, ykaul, ylavi
Target Milestone:	ovirt-4.2.1
Target Release:	---
Hardware:	All
OS:	Linux
Doc Type:	No Doc Update
Story Points:	---
Last Closed:	2018-05-15 17:48:28 UTC
Type:	Bug
Regression:	---
Mount Type:	---
Documentation:	---
Category:	---
oVirt Team:	Virt
Cloudforms Team:	---

Description nijin ashok 2018-02-13 10:06:14 UTC

Description of problem:

If CreateSnapshotVDSCommand fails while creating a stateless snapshot, the corresponding commands in the command_entities table are not cleared. In a large environment with many pool VMs, the table can grow very large and can even prevent the engine service from coming up.

Here it grew very large.

===
engine=> select count(*) from command_entities ;
 count 
-------
 18620
(1 row)
===

The majority of them have the status "ENDED_WITH_FAILURE".

===
engine=> select count(*) from command_entities where status = 'ENDED_WITH_FAILURE' ;
 count 
-------
 15615
===

All of these are related to the snapshot operation that is part of stateless VM startup.

===
engine=> select  count(*) from command_entities where command_params_class = 'org.ovirt.engine.core.common.action.RunVmParams';
 count 
-------
  5199
(1 row)

engine=> select count(*) from command_entities where command_params_class = 'org.ovirt.engine.core.common.action.ImagesActionsParametersBase';
 count 
-------
  5199
(1 row)

engine=> select  count(*) from command_entities where command_params_class = 'org.ovirt.engine.core.common.action.CreateAllSnapshotsFromVmParameters';
 count 
-------
  5199
(1 row)

===
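
The three per-class counts above can also be collected with one grouped query. The following is a minimal sketch, not something that was run against the affected engine: an in-memory SQLite table stands in for the engine's PostgreSQL database, only the two columns queried in this report are modeled, and the rows (including the "OTHER" status) are made-up toy data.

```python
import sqlite3

PREFIX = "org.ovirt.engine.core.common.action."


def count_by_class_and_status(conn):
    """Return (command_params_class, status, count) rows, largest groups first."""
    query = """
        SELECT command_params_class, status, count(*) AS n
        FROM command_entities
        GROUP BY command_params_class, status
        ORDER BY n DESC, command_params_class, status
    """
    return conn.execute(query).fetchall()


# Stand-in database (assumption): the real table has more columns than these two.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE command_entities (command_params_class TEXT, status TEXT)"
)

# Toy data: two failed stateless starts, three leftover rows each.
toy_rows = []
for _ in range(2):
    for cls in (
        "RunVmParams",
        "ImagesActionsParametersBase",
        "CreateAllSnapshotsFromVmParameters",
    ):
        toy_rows.append((PREFIX + cls, "ENDED_WITH_FAILURE"))
# One unrelated row with a placeholder status, to show a second group.
toy_rows.append((PREFIX + "RunVmParams", "OTHER"))
conn.executemany("INSERT INTO command_entities VALUES (?, ?)", toy_rows)

rows = count_by_class_and_status(conn)
for params_class, status, n in rows:
    print(n, status, params_class)
```

The SELECT statement uses only standard grouping, so it should also be usable as-is at the engine=> prompt.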

Because of this, if you restart the engine service, it gets stuck at "Start initializing CommandCallbacksPoller" and times out after 5 minutes.

2018-02-07 10:54:27,995+09 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[


If you delete all the entries that have the status "ENDED_WITH_FAILURE", the service starts up fine.
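
The report does not show how the rows were deleted. The sketch below assumes a plain DELETE filtered on status and demonstrates it on an in-memory SQLite stand-in table with made-up rows. The function name purge_failed_commands and the "OTHER" status are hypothetical, and the statement has not been checked against the real engine schema, which may have related tables.

```python
import sqlite3


def purge_failed_commands(conn):
    """Delete rows with status ENDED_WITH_FAILURE; return how many were removed."""
    with conn:  # commits on success, rolls back if the statement raises
        cur = conn.execute(
            "DELETE FROM command_entities WHERE status = 'ENDED_WITH_FAILURE'"
        )
        return cur.rowcount


# Stand-in table (assumption): only the status column is modeled here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE command_entities (status TEXT)")
conn.executemany(
    "INSERT INTO command_entities VALUES (?)",
    [("ENDED_WITH_FAILURE",)] * 3 + [("OTHER",)],  # toy data
)

removed = purge_failed_commands(conn)
remaining = conn.execute("SELECT count(*) FROM command_entities").fetchone()[0]
print("removed:", removed, "remaining:", remaining)
```

Running the count query from the description before and after the delete gives a quick check that only the failed rows were removed.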

Version-Release number of selected component (if applicable):

rhevm-4.1.3


How reproducible:

100% 

Steps to Reproduce:

This is reproducible in 4.1.8 as well.

1. Restart the vdsmd service on the SPM host while a pool VM's stateless snapshot is being created, so that the snapshot operation fails. Another option is to clear the image's metadata so that snapshot creation fails with the error MetaDataKeyNotFoundError.

2. Each failure creates three entries in command_entities, and they stay there.
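
As a rough cross-check of the figures in the description: each of the three parameter classes has 5199 rows and each failed start leaves three rows, so these failures would account for 3 × 5199 = 15597 rows. That is close to the 15615 rows with status ENDED_WITH_FAILURE, assuming (the report does not confirm this) that all of those per-class rows are in the failed state. The helper name below is hypothetical.

```python
# Taken from step 2 above: one failed stateless start leaves three rows behind.
ROWS_PER_FAILED_START = 3


def leftover_rows(failed_starts, rows_per_failure=ROWS_PER_FAILED_START):
    """Rows left in command_entities after the given number of failed starts."""
    return failed_starts * rows_per_failure


failed_starts = 5199          # per-class count from the description
failed_rows_reported = 15615  # rows with status ENDED_WITH_FAILURE

expected = leftover_rows(failed_starts)
print("expected leftover rows:", expected)
print("failed rows not explained:", failed_rows_reported - expected)
```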

Actual results:

Failures in the snapshot operation of stateless VMs prevent the engine service from starting.

Comment 3 Shmuel Melamud 2018-02-21 22:25:19 UTC

Testing shows that the bug is not reproducible on the master branch. Error messages appear in the log as described in the bug description, but records in the command_entities table are cleared correctly.

Comment 4 Shmuel Melamud 2018-02-26 11:54:02 UTC

Patch 85286 in Gerrit, "core: Extend compensation use to callbacks", has extended the compensation mechanism to compensate correctly when the 'sync' part of a command succeeds but one of the child commands fails. It also fixed this bug in the 4.2 branch by performing a correct cleanup when the CreateAllSnapshotsFromVm child command of RunVm fails.

Comment 5 Israel Pinto 2018-03-05 14:20:44 UTC

Verified with:
Software Version: 4.2.2.2-0.1.el7

Steps:
1. Create a VM pool with 20 VMs
2. Restart the vdsmd service on the SPM host
   VM pool creation failed
3. Check the DB:
   select count(*) from command_entities ;

Results:
Commands are cleaned up

engine=#  select count(*) from command_entities ;
 count 
-------
    14
(1 row)

engine=#  select count(*) from command_entities ;
 count 
-------
    14
(1 row)

engine=#  select count(*) from command_entities ;
 count 
-------
     1
(1 row)

Comment 11 errata-xmlrpc 2018-05-15 17:48:28 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

Comment 12 Franta Kust 2019-05-16 13:06:36 UTC

BZ<2>Jira Resync