Bug 1310426 - VmPool related jobs are stuck in job and steps tables in DB when several consecutive actions are called
Summary: VmPool related jobs are stuck in job and steps tables in DB when several consecutive actions are called
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.0.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ovirt-4.0.0-beta
Target Release: 4.0.0
Assignee: Liron Aravot
QA Contact: sefi litmanovich
Duplicates: 1312012 1367024 (view as bug list)
Depends On:
Blocks: 1367024
Reported: 2016-02-21 13:11 UTC by sefi litmanovich
Modified: 2016-08-18 12:42 UTC (History)
9 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2016-07-05 07:43:20 UTC
oVirt Team: Virt
rule-engine: ovirt-4.0.0+
rule-engine: planning_ack+
tjelinek: devel_ack+
rule-engine: testing_ack+

Attachments (Terms of Use)
engine log (527.82 KB, application/x-gzip)
2016-02-21 13:11 UTC, sefi litmanovich

System ID Private Priority Status Summary Last Updated
oVirt gerrit 54897 0 master ABANDONED core: Close FINALIZING step when created 2016-04-07 16:48:26 UTC
oVirt gerrit 55613 0 master MERGED core: split ExecutionHandler.endTaskJob() 2016-04-05 12:51:20 UTC
oVirt gerrit 55614 0 master MERGED core: Async tasks - move steps/jobs handling to CommandBase 2016-04-05 12:51:53 UTC
oVirt gerrit 55615 0 master MERGED core: add vm pool flow callbacks usage 2016-04-06 06:39:30 UTC

Description sefi litmanovich 2016-02-21 13:11:39 UTC
Created attachment 1129009 [details]
engine log

Description of problem:

The bug appears in VmPool automation testing.
After a test finishes, even successfully, most 'Create vm pool' jobs are stuck in the DB job table (with their sub-steps in the steps table), and sometimes a 'Configuring VM Pool {pool_name} properties' job as well.
The problem is hard to reproduce manually, as it doesn't seem to happen when creating a VM pool and waiting some time before the next one.
All automation tests wait until the engine reports an action as successful/failed, and statuses are checked. In practice the VM pool creation succeeds and the VMs in the pool are used and are in the correct state, so I guess this is an internal bug in the DB job handling, and it looks like some sort of timing issue.
I would expect that either a job waits for all its sub-steps to finish before another job is allowed on the same entity, or, conversely, that if a new job (update, delete) starts and finishes on a certain entity, previous jobs on that entity are also cleaned from the DB even if some step is still stuck.
I attached engine.log from an automation run with only 2 test cases; in both cases 'Creating VM Pool {pool_name} from template {template_name}' jobs were stuck in the DB. The relevant part of the log starts at the following point:

2016-02-21 14:36:34,513 INFO  [org.ovirt.engine.core.bll.AddVmPoolWithVmsCommand] (ajp-/ [vmpools_create_e36d477f-e14e-4725] Lock Acquired to object 'EngineLock:{exclusiveLocks='[Virt_pool_size_more_than_prestarted_user_and_admin=<VM_POOL_NAME, ACTION_TYPE_FAILED_VM_POOL_IS_BEING_CREATED$VmPoolName Virt_pool_size_more_than_prestarted_user_and_admin>]', sharedLocks='null'}'

Version-Release number of selected component (if applicable):

How reproducible:
when running VM Pool test automation.

Steps to Reproduce:

Run a script that creates a VM pool, updates some attribute, and deletes the pool.

Actual results:
'Creating VM Pool {pool_name} from template {template_name}' is stuck in job table.

Expected results:
The job is reported as FINISHED and is removed from the DB according to the clean-finished-jobs interval.

Additional info:

Comment 1 sefi litmanovich 2016-02-21 13:13:38 UTC
This bug might be related to https://bugzilla.redhat.com/show_bug.cgi?id=1211512

Comment 2 Yaniv Kaul 2016-02-24 16:08:24 UTC
Sefi - can you share the script, or the essence of it? The actions it performs?

Comment 3 sefi litmanovich 2016-02-25 14:45:20 UTC
Yaniv - for example, in our last tier 1 job, 3 create VM pool calls resulted in stuck jobs in the DB although they were reported as successful.
You can see here a snippet of the code with two of these tests - Virt_vmpool_update_prestarted and Virt_vmpool_full_cycle:


Here you can have a look at the REST API calls these tests generate; look for the corresponding test names to find your way around:


You can find all the information regarding this test run here (although it contains many other tests):

If there's any further information you need regarding these tests let me know, and if needed, Shmuel and I can sit down together to look at our test scripts and your code and see where we hit this problem.

Comment 4 Tomas Jelinek 2016-02-26 08:17:28 UTC
*** Bug 1312012 has been marked as a duplicate of this bug. ***

Comment 5 Shmuel Melamud 2016-03-13 16:59:15 UTC
Currently I'm able to reproduce the issue in my environment even with a single pool creation.

Investigation shows that the issue is caused by a race between the creation of the FINALIZING step and its completion.

The FINALIZING step is closed by ExecutionHandler.endTaskJob(), called from endAction(). In the case of AddVmCommand, endAction() is invoked from the CoCo callback - ConcurrentChildCommandsExecutionCallback. CoCo periodically polls the command status and calls the callback methods when it detects a change in the command state (see CommandCallbacksPoller.invokeCallbackMethods()). endTaskJob() closes the FINALIZING step, its parent step, and the whole job if no other steps are still running in the job.

The EXECUTING step is closed and the FINALIZING step is created by CommandAsyncTask.endCommandAction(). When periodic polling detects that all of the command's async tasks are finished (disk creation in the case of AddVmCommand), it calls CommandAsyncTask.endCommandAction(), which invokes ExecutionHandler.createFinalizingContext().

Since these two polling processes are completely independent, it is possible that endTaskJob() is called first and cannot close the FINALIZING step, because it has not been created yet. The whole job is left open, because the EXECUTING step is still running.

After that, createFinalizingContext() closes the EXECUTING step and creates a FINALIZING step that is left open.
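The bad interleaving can be sketched with a small, self-contained simulation. Note that the class, the Job model, and the step names below are illustrative stand-ins, not the actual oVirt engine code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FinalizingRace {
    enum Status { RUNNING, CLOSED }

    static class Job {
        // step name -> status; PARENT stands for the parent step of
        // EXECUTING/FINALIZING mentioned in the analysis above
        Map<String, Status> steps = new LinkedHashMap<>();
        boolean open = true;
    }

    // CoCo-side callback: closes FINALIZING (if it exists), the parent step,
    // and the whole job when no step is left running.
    static void endTaskJob(Job job) {
        if (job.steps.get("FINALIZING") == Status.RUNNING) {
            job.steps.put("FINALIZING", Status.CLOSED);
        }
        job.steps.put("PARENT", Status.CLOSED);
        if (!job.steps.containsValue(Status.RUNNING)) {
            job.open = false;
        }
    }

    // Async-task-side poller: closes EXECUTING and creates FINALIZING.
    static void createFinalizingContext(Job job) {
        job.steps.put("EXECUTING", Status.CLOSED);
        job.steps.put("FINALIZING", Status.RUNNING);
    }

    public static void main(String[] args) {
        Job job = new Job();
        job.steps.put("PARENT", Status.RUNNING);
        job.steps.put("EXECUTING", Status.RUNNING);

        // Bad interleaving: endTaskJob() fires first, finds no FINALIZING step
        // to close, and cannot close the job because EXECUTING is still RUNNING.
        endTaskJob(job);
        // createFinalizingContext() then creates a FINALIZING step that nothing
        // will ever close, so the job stays open forever.
        createFinalizingContext(job);

        System.out.println("job open = " + job.open
                + ", FINALIZING = " + job.steps.get("FINALIZING"));
        // prints: job open = true, FINALIZING = RUNNING
    }
}
```

Running main() shows the stuck state from the bug report: the job is left open with an orphaned RUNNING FINALIZING step.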

Comment 6 Shmuel Melamud 2016-03-13 18:52:53 UTC
There are several possible ways to fix this:

1. I'd personally prefer to create the FINALIZING step together with the EXECUTING step (if needed). This would solve any synchronization problems. On the other hand, both EXECUTING and FINALIZING would be displayed as running at the same time.

2. Allow endTaskJob() to create the FINALIZING step if it's absent. In this case it is possible that both createFinalizingContext() and endTaskJob() create the step, so we'd end up with two FINALIZING steps. To avoid this, we'd need to introduce some synchronization; the simplest way is to do it at the DB level, with a stored procedure like addStepIfNotExists().

3. Leave endTaskJob() untouched. In createFinalizingContext(), if it finds that the parent step is already closed, treat this as a signal that endTaskJob() has already run. In this case, createFinalizingContext() creates the FINALIZING step as closed and, if no other steps are still running in the job, closes the job. I've implemented this approach and it works.
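Option 3 can be sketched as a small simulation (a hypothetical simplified model — the Job class, step names, and method bodies are illustrative, not the actual oVirt implementation): a closed parent step tells createFinalizingContext() that endTaskJob() already ran, so it records FINALIZING as closed and closes the job itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FinalizingFix {
    enum Status { RUNNING, CLOSED }

    static class Job {
        Map<String, Status> steps = new LinkedHashMap<>();
        boolean open = true;
    }

    // Unchanged (option 3 leaves endTaskJob() as-is): closes FINALIZING if
    // present, closes the parent step, closes the job when nothing runs.
    static void endTaskJob(Job job) {
        if (job.steps.get("FINALIZING") == Status.RUNNING) {
            job.steps.put("FINALIZING", Status.CLOSED);
        }
        job.steps.put("PARENT", Status.CLOSED);
        if (!job.steps.containsValue(Status.RUNNING)) {
            job.open = false;
        }
    }

    // A closed parent step signals that endTaskJob() already ran, so
    // FINALIZING is created as CLOSED and the job is closed here.
    static void createFinalizingContext(Job job) {
        boolean endTaskJobRanAlready = job.steps.get("PARENT") == Status.CLOSED;
        job.steps.put("EXECUTING", Status.CLOSED);
        if (endTaskJobRanAlready) {
            job.steps.put("FINALIZING", Status.CLOSED);
            if (!job.steps.containsValue(Status.RUNNING)) {
                job.open = false;
            }
        } else {
            job.steps.put("FINALIZING", Status.RUNNING);
        }
    }

    // Replays either ordering of the two independent pollers.
    static Job runOrdering(boolean endTaskJobFirst) {
        Job job = new Job();
        job.steps.put("PARENT", Status.RUNNING);
        job.steps.put("EXECUTING", Status.RUNNING);
        if (endTaskJobFirst) {
            endTaskJob(job);
            createFinalizingContext(job);
        } else {
            createFinalizingContext(job);
            endTaskJob(job);
        }
        return job;
    }

    public static void main(String[] args) {
        // With the fix, the job is closed regardless of which poller wins.
        System.out.println("bad ordering  -> job open = " + runOrdering(true).open);
        System.out.println("good ordering -> job open = " + runOrdering(false).open);
    }
}
```

Both orderings now end with the job closed, so no stuck rows are left in the job and steps tables.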

Comment 7 Mike McCune 2016-03-28 23:36:13 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions

Comment 8 sefi litmanovich 2016-06-07 11:09:02 UTC
Verified with rhevm-, after automation runs no pool-related jobs are stuck in the DB as was the case earlier.
Couldn't reproduce the bug manually either. It seems the issue is solved.

Comment 9 Sandro Bonazzola 2016-07-05 07:43:20 UTC
oVirt 4.0.0 has been released, closing current release.

Comment 10 Allon Mureinik 2016-08-18 12:42:24 UTC
*** Bug 1367024 has been marked as a duplicate of this bug. ***
