Bug 1310426 - VmPool related jobs are stuck in job and steps tables in DB when several consecutive actions are called
Summary: VmPool related jobs are stuck in job and steps tables in DB when several consecutive actions are called
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.0.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ovirt-4.0.0-beta
Target Release: 4.0.0
Assignee: Liron Aravot
QA Contact: sefi litmanovich
Duplicates: 1312012 1367024 (view as bug list)
Depends On:
Blocks: 1367024
Reported: 2016-02-21 13:11 UTC by sefi litmanovich
Modified: 2016-08-18 12:42 UTC (History)
9 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2016-07-05 07:43:20 UTC
oVirt Team: Virt
rule-engine: ovirt-4.0.0+
rule-engine: planning_ack+
tjelinek: devel_ack+
rule-engine: testing_ack+

Attachments (Terms of Use)
engine log (527.82 KB, application/x-gzip)
2016-02-21 13:11 UTC, sefi litmanovich

System ID Private Priority Status Summary Last Updated
oVirt gerrit 54897 0 master ABANDONED core: Close FINALIZING step when created 2016-04-07 16:48:26 UTC
oVirt gerrit 55613 0 master MERGED core: split ExecutionHandler.endTaskJob() 2016-04-05 12:51:20 UTC
oVirt gerrit 55614 0 master MERGED core: Async tasks - move steps/jobs handling to CommandBase 2016-04-05 12:51:53 UTC
oVirt gerrit 55615 0 master MERGED core: add vm pool flow callbacks usage 2016-04-06 06:39:30 UTC

Description sefi litmanovich 2016-02-21 13:11:39 UTC
Created attachment 1129009 [details]
engine log

Description of problem:

The bug appears in VmPool automation testing.
After a test finishes, even successfully, most 'Create vm pool' jobs are stuck in the DB job table (with their sub-steps in the steps table), and sometimes a 'Configuring VM Pool {pool_name} properties' job as well.
The problem is hard to reproduce manually, as it doesn't seem to happen when creating a VM pool and waiting some time before the next one.
All automation tests wait until the engine reports an action as successful/failed, and statuses are checked. In practice the VM pool creation succeeds and the VMs in the pool are used and are in the correct state, so I guess this is an internal bug in the DB job handling, and it looks like some sort of timing issue.
I would expect that either a job waits for all its sub-steps to finish before another job is allowed on the same entity, or, conversely, that if a new job (update, delete) starts and finishes on a certain entity, previous jobs on that entity are also cleaned from the DB even if some step is still stuck.
I attached engine.log from an automation run with only 2 test cases; in both cases 'Creating VM Pool {pool_name} from template {template_name}' jobs were stuck in the DB. The relevant part of the log starts at the following point:

2016-02-21 14:36:34,513 INFO  [org.ovirt.engine.core.bll.AddVmPoolWithVmsCommand] (ajp-/ [vmpools_create_e36d477f-e14e-4725] Lock Acquired to object 'EngineLock:{exclusiveLocks='[Virt_pool_size_more_than_prestarted_user_and_admin=<VM_POOL_NAME, ACTION_TYPE_FAILED_VM_POOL_IS_BEING_CREATED$VmPoolName Virt_pool_size_more_than_prestarted_user_and_admin>]', sharedLocks='null'}'

Version-Release number of selected component (if applicable):

How reproducible:
when running VM Pool test automation.

Steps to Reproduce:

Run a script that creates a VM pool, updates some attribute, and deletes the pool.

Actual results:
'Creating VM Pool {pool_name} from template {template_name}' is stuck in job table.

Expected results:
The job is reported as FINISHED and is removed from the DB according to the clean-finished-jobs interval.

Additional info:

Comment 1 sefi litmanovich 2016-02-21 13:13:38 UTC
This bug might be related to https://bugzilla.redhat.com/show_bug.cgi?id=1211512

Comment 2 Yaniv Kaul 2016-02-24 16:08:24 UTC
Sefi - can you share the script, or the essence of it? The actions it performs?

Comment 3 sefi litmanovich 2016-02-25 14:45:20 UTC
Yaniv - for example, in our last tier 1 job, 3 create VM pool calls resulted in stuck jobs in the DB although they were reported as successful.
You can see here a snippet of the code with two of these tests - Virt_vmpool_update_prestarted and Virt_vmpool_full_cycle:


Here you can have a look at the REST API calls these tests generate; look for the corresponding test names to find your way around:


You can find all the information regarding this test run here (although it contains many other tests):

If there's any further information you need regarding these tests let me know, and if needed, Shmuel and I can sit down together to look at our test scripts and your code and see where we hit this problem.

Comment 4 Tomas Jelinek 2016-02-26 08:17:28 UTC
*** Bug 1312012 has been marked as a duplicate of this bug. ***

Comment 5 Shmuel Melamud 2016-03-13 16:59:15 UTC
Currently I'm able to reproduce the issue in my environment even with a single pool creation.

Investigation shows that the issue is caused by a race between the creation of the FINALIZING step and its completion.

The FINALIZING step is closed by ExecutionHandler.endTaskJob(), called from endAction(). In the case of AddVmCommand, endAction() is invoked from the CoCo callback - ConcurrentChildCommandsExecutionCallback. CoCo periodically polls the command status and calls the callback methods when it detects a change in the command state (see CommandCallbacksPoller.invokeCallbackMethods()). endTaskJob() closes the FINALIZING step, its parent step, and the whole job if no other steps are still running in the job.

The EXECUTING step is closed and the FINALIZING step is created by CommandAsyncTask.endCommandAction(). When periodic polling detects that all of the command's async tasks are finished (disk creation in the case of AddVmCommand), it calls CommandAsyncTask.endCommandAction(), which invokes ExecutionHandler.createFinalizingContext().

Since these two polling processes are completely independent, it is possible that endTaskJob() is called first and cannot close the FINALIZING step, because it has not been created yet. The whole job is left open, because the EXECUTING step is still running.

After that, createFinalizingContext() closes the EXECUTING step and creates a FINALIZING step that is left open.
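The bad interleaving can be sketched with a small, self-contained simulation. Note that the class, the Job model, and the step names below are illustrative stand-ins, not the actual oVirt engine code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FinalizingRace {
    enum Status { RUNNING, CLOSED }

    static class Job {
        // step name -> status; PARENT stands for the parent step of
        // EXECUTING/FINALIZING mentioned in the analysis above
        Map<String, Status> steps = new LinkedHashMap<>();
        boolean open = true;
    }

    // CoCo-side callback: closes FINALIZING (if it exists), the parent step,
    // and the whole job when no step is left running.
    static void endTaskJob(Job job) {
        if (job.steps.get("FINALIZING") == Status.RUNNING) {
            job.steps.put("FINALIZING", Status.CLOSED);
        }
        job.steps.put("PARENT", Status.CLOSED);
        if (!job.steps.containsValue(Status.RUNNING)) {
            job.open = false;
        }
    }

    // Async-task-side poller: closes EXECUTING and creates FINALIZING.
    static void createFinalizingContext(Job job) {
        job.steps.put("EXECUTING", Status.CLOSED);
        job.steps.put("FINALIZING", Status.RUNNING);
    }

    public static void main(String[] args) {
        Job job = new Job();
        job.steps.put("PARENT", Status.RUNNING);
        job.steps.put("EXECUTING", Status.RUNNING);

        // Bad interleaving: endTaskJob() fires first, finds no FINALIZING step
        // to close, and cannot close the job because EXECUTING is still RUNNING.
        endTaskJob(job);
        // createFinalizingContext() then creates a FINALIZING step that nothing
        // will ever close, so the job stays open forever.
        createFinalizingContext(job);

        System.out.println("job open = " + job.open
                + ", FINALIZING = " + job.steps.get("FINALIZING"));
        // prints: job open = true, FINALIZING = RUNNING
    }
}
```

Running main() shows the stuck state from the bug report: the job is left open with an orphaned RUNNING FINALIZING step.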

Comment 6 Shmuel Melamud 2016-03-13 18:52:53 UTC
There are several possible ways to fix this:

1. I'd personally prefer to create the FINALIZING step together with the EXECUTING step (if needed). This would solve any synchronization problems. On the other hand, both EXECUTING and FINALIZING would be displayed as running at the same time.

2. Allow endTaskJob() to create the FINALIZING step if it's absent. In this case it is possible that both createFinalizingContext() and endTaskJob() create the step, so we'd end up with two FINALIZING steps. To avoid this, we'd need to introduce some synchronization; the simplest way is to do it at the DB level, with a stored procedure like addStepIfNotExists().

3. Leave endTaskJob() untouched. In createFinalizingContext(), if it finds that the parent step is already closed, treat this as a signal that endTaskJob() has already run. In this case, createFinalizingContext() creates the FINALIZING step as closed and, if no other steps are still running in the job, closes the job. I've implemented this approach and it works.
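Option 3 can be sketched as a small simulation (a hypothetical simplified model — the Job class, step names, and method bodies are illustrative, not the actual oVirt implementation): a closed parent step tells createFinalizingContext() that endTaskJob() already ran, so it records FINALIZING as closed and closes the job itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FinalizingFix {
    enum Status { RUNNING, CLOSED }

    static class Job {
        Map<String, Status> steps = new LinkedHashMap<>();
        boolean open = true;
    }

    // Unchanged (option 3 leaves endTaskJob() as-is): closes FINALIZING if
    // present, closes the parent step, closes the job when nothing runs.
    static void endTaskJob(Job job) {
        if (job.steps.get("FINALIZING") == Status.RUNNING) {
            job.steps.put("FINALIZING", Status.CLOSED);
        }
        job.steps.put("PARENT", Status.CLOSED);
        if (!job.steps.containsValue(Status.RUNNING)) {
            job.open = false;
        }
    }

    // A closed parent step signals that endTaskJob() already ran, so
    // FINALIZING is created as CLOSED and the job is closed here.
    static void createFinalizingContext(Job job) {
        boolean endTaskJobRanAlready = job.steps.get("PARENT") == Status.CLOSED;
        job.steps.put("EXECUTING", Status.CLOSED);
        if (endTaskJobRanAlready) {
            job.steps.put("FINALIZING", Status.CLOSED);
            if (!job.steps.containsValue(Status.RUNNING)) {
                job.open = false;
            }
        } else {
            job.steps.put("FINALIZING", Status.RUNNING);
        }
    }

    // Replays either ordering of the two independent pollers.
    static Job runOrdering(boolean endTaskJobFirst) {
        Job job = new Job();
        job.steps.put("PARENT", Status.RUNNING);
        job.steps.put("EXECUTING", Status.RUNNING);
        if (endTaskJobFirst) {
            endTaskJob(job);
            createFinalizingContext(job);
        } else {
            createFinalizingContext(job);
            endTaskJob(job);
        }
        return job;
    }

    public static void main(String[] args) {
        // With the fix, the job is closed regardless of which poller wins.
        System.out.println("bad ordering  -> job open = " + runOrdering(true).open);
        System.out.println("good ordering -> job open = " + runOrdering(false).open);
    }
}
```

Both orderings now end with the job closed, so no stuck rows are left in the job and steps tables.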

Comment 7 Mike McCune 2016-03-28 23:36:13 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions

Comment 8 sefi litmanovich 2016-06-07 11:09:02 UTC
Verified with rhevm-, after automation runs no pool-related jobs are stuck in the DB as was the case earlier.
Couldn't reproduce the bug manually either. It seems the issue is solved.

Comment 9 Sandro Bonazzola 2016-07-05 07:43:20 UTC
oVirt 4.0.0 has been released, closing current release.

Comment 10 Allon Mureinik 2016-08-18 12:42:24 UTC
*** Bug 1367024 has been marked as a duplicate of this bug. ***
