Bug 1310426 - VmPool related jobs are stuck in job and steps tables in DB when several consecutive actions are called
VmPool related jobs are stuck in job and steps tables in DB when several cons...
Status: CLOSED CURRENTRELEASE
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt (Show other bugs)
4.0.0
Unspecified Unspecified
high Severity medium (vote)
: ovirt-4.0.0-beta
: 4.0.0
Assigned To: Liron Aravot
sefi litmanovich
: Automation
: 1312012 1367024 (view as bug list)
Depends On:
Blocks: 1367024
  Show dependency treegraph
 
Reported: 2016-02-21 08:11 EST by sefi litmanovich
Modified: 2016-08-18 08:42 EDT (History)
9 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-07-05 03:43:20 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt‑4.0.0+
rule-engine: planning_ack+
tjelinek: devel_ack+
rule-engine: testing_ack+


Attachments (Terms of Use)
engine log (527.82 KB, application/x-gzip)
2016-02-21 08:11 EST, sefi litmanovich
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 54897 master ABANDONED core: Close FINALIZING step when created 2016-04-07 12:48 EDT
oVirt gerrit 55613 master MERGED core: split ExecutionHandler.endTaskJob() 2016-04-05 08:51 EDT
oVirt gerrit 55614 master MERGED core: Async tasks - move steps/jobs handling to CommandBase 2016-04-05 08:51 EDT
oVirt gerrit 55615 master MERGED core: add vm pool flow callbacks usage 2016-04-06 02:39 EDT

  None (edit)
Description sefi litmanovich 2016-02-21 08:11:39 EST
Created attachment 1129009 [details]
engine log

Description of problem:

The bug appears in VmPool automation testing.
After the test is finished, even if successfully, Most Create vm pool tasks are stuck in DB job table (with sub steps in steps table) and also a Configuring VM Pool {pool_name} properties sometimes.
The problem is hard to re produce manually as it doesn't seem to happen when creating a vm pool and waiting some time until the next one.
All tests in automation waits until engine reports an action as successful/fail, and statuses are checked, also in practice the Create vm pool is done successful and VMs in the pool are used and are in the correct state so this is some internal bug with the task handling in DB I guess and seems like some sort of timing issue.
I would think that either a task waits for all sub steps to be done before allowing another task on the same entity, or on the other hand that if a new task (update, delete) is started and finished on a certain entity then also previous task on the same entity will be cleaned from DB even if some step is still stuck.
I added engine.log with an automation run with only 2 test cases, in both cases Creating VM Pool {pool_name} from template {template_name} were stuck in DB. You can see that part of the log starting from the following point:

2016-02-21 14:36:34,513 INFO  [org.ovirt.engine.core.bll.AddVmPoolWithVmsCommand] (ajp-/127.0.0.1:8702-6) [vmpools_create_e36d477f-e14e-4725] Lock Acquired to object 'EngineLock:{exclusiveLocks='[Virt_pool_size_more_than_prestarted_user_and_admin=<VM_POOL_NAME, ACTION_TYPE_FAILED_VM_POOL_IS_BEING_CREATED$VmPoolName Virt_pool_size_more_than_prestarted_user_and_admin>]', sharedLocks='null'}'


Version-Release number of selected component (if applicable):
rhevm-3.6.3.2-0.1.el6.noarch

How reproducible:
when running VM Pool test automation.

Steps to Reproduce:

run a script that creates a vm pool- update some attribute and delete the pool.


Actual results:
'Creating VM Pool {pool_name} from template {template_name}' is stuck in job table.

Expected results:
The job is reported as FINISHED and is remove from DB according to the clean finished jobs interval.

Additional info:
Comment 1 sefi litmanovich 2016-02-21 08:13:38 EST
This bug might be related to https://bugzilla.redhat.com/show_bug.cgi?id=1211512
Comment 2 Yaniv Kaul 2016-02-24 11:08:24 EST
Sefi - can you share the script, or the essence of it? The actions it performs?
Comment 3 sefi litmanovich 2016-02-25 09:45:20 EST
Yaniv - for an example in our last tier 1 job, 3 create vm pool calls resulted in stuck jobs in the DB although they were reported as successful. 
you can see here a snippet of the code with two of these tests - Virt_vmpool_update_prestarted and Virt_vmpool_full_cycle:

http://pastebin.test.redhat.com/351758

Here you can have a look at the REST API calls this tests generate, look for the correlating test names to find yourself in this mess:

https://rhev-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/3.6/view/GE/job/3.6-GE-Tier1-x86/222/artifact/logs/art_tests_20160224_105300.log/*view*/

you can find all the information regarding this test run here (although it contains many other tests):
https://rhev-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/3.6/view/GE/job/3.6-GE-Tier1-x86/222/

If there's any further information regarding these tests let me know, and if needed me and Shmuel can maybe have a sit down to look at our test scripts and your code and see where we hit this problem.
Comment 4 Tomas Jelinek 2016-02-26 03:17:28 EST
*** Bug 1312012 has been marked as a duplicate of this bug. ***
Comment 5 Shmuel Melamud 2016-03-13 12:59:15 EDT
Currently I'm able to reproduce the issue in my environment even with single pool creation.

Investigation shows that the issue is caused by race between creation of FINALIZING step and its completion.

FINALIZING step is closed by ExecutionHandler.endTaskJob() called from endAction(). In the case of AddVmCommand, endAction() is invoked from CoCo callback - ConcurrentChildCommandsExecutionCallback. CoCo periodically polls command status and calls callback methods when detects change in the command state (see CommandCallbacksPoller.invokeCallbackMethods()). endTaskJob() closes FINALIZING step, its parent step and the whole job, if there are no other steps running in the job.

EXECUTING step is closed and FINALIZING step is created by CommandAsyncTask.endCommandAction(). When periodic polling detects that all commmand's async tasks are finished (disks creation in the case of AddVmCommand), it calls CommandAsyncTask.endCommandAction() that invokes ExecutionHandler.createFinalizingContext().

Since these two polling processes are completely independent, it is possible that endTaskJob() will be called first and will not be able to close FINALIZING step that is not created yet. The whole job will be left open, because EXECUTING step is still running.

After that, createFinalizingContext() will close EXECUTING and will create FINALIZING step that will be left open.
Comment 6 Shmuel Melamud 2016-03-13 14:52:53 EDT
There are several possible ways to fix this:

1. I'd personally prefer to create FINALIZING step together with EXECUTING step (if needed). This will solve any synchronization problems. From the other side, both EXECUTING and FINALIZING will be displayed as running at the same time.

2. Allow endTaskJob() to create FINALIZING step, if it's absent. In this case it is possible that both createFinalizingContext() and endTaskJob() will create the step, so we will finish having two FINALIZING steps. To avoid this, we'll need to introduce some synchronization. The simplest way is to do this on DB level, with stored procedure like addStepIfNotExists().

3. Leave endTaskJob() untouched. In createFinalizingContext(), if it founds that the parent step is closed, treat this as a signal that endTaskJob() was run already. In this case, createFinalizingContext() will create FINALIZING step as closed and, if there are no other steps running in the job, close the job. I've implemented this approach and it works.
Comment 7 Mike McCune 2016-03-28 19:36:13 EDT
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune@redhat.com with any questions
Comment 8 sefi litmanovich 2016-06-07 07:09:02 EDT
Verified with rhevm-4.0.0.2-0.1.el7ev.noarch, after automation runs no pool related jobs are stuck in DB as was the case earlier.
Couldn't re produce the bug manually as well. It seems the issue is solved.
Comment 9 Sandro Bonazzola 2016-07-05 03:43:20 EDT
oVirt 4.0.0 has been released, closing current release.
Comment 10 Allon Mureinik 2016-08-18 08:42:24 EDT
*** Bug 1367024 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.