Bug 1040952

Summary: Job and step tables not cleaned after the failure or completion of some tasks.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Lee Yarwood <lyarwood>
Component: ovirt-engine
Assignee: Arik <ahadas>
Status: CLOSED ERRATA
QA Contact: sefi litmanovich <slitmano>
Severity: medium
Priority: medium
Version: 3.2.0
CC: aberezin, ahadas, bazulay, emesika, iheim, jbelka, jentrena, lpeer, lyarwood, mavital, michal.skrivanek, ofrenkel, oourfali, pdwyer, pep, pstehlik, rbalakri, Rhev-m-bugs, sbonazzo, sherold, s.kieske, slitmano, ydossow, yeylon
Keywords: ZStream
Target Release: 3.5.0
Hardware: x86_64
OS: Linux
Whiteboard: virt
Fixed In Version: vt2.2
Doc Type: Bug Fix
Clones: 1055162, 1099505 (view as bug list)
Last Closed: 2015-02-11 17:56:34 UTC
Type: Bug
Bug Blocks: 1055162, 1078909, 1099505, 1142923, 1142926, 1156165

Description Lee Yarwood 2013-12-12 11:58:41 UTC
Description of problem:

Job and step tables are not cleaned after the failure or completion of some tasks, leaving many tasks marked as running (hourglass) in the lower Tasks tab.

Version-Release number of selected component (if applicable):
(3.2.1 with the patch associated with BZ#1008634)

How reproducible:

Steps to Reproduce:
Unclear at present; the customer has provided examples of migration and guest-removal failures that have led to this.

Actual results:
Tasks appear to be running in the tasks tab but have actually stopped or failed.

Expected results:
Tasks are cleared from the Tasks tab when they fail or end.

Additional info:

Comment 5 Liran Zelkha 2013-12-17 13:14:49 UTC
Hi Lee,

Any chance you found those engine.log files?

Comment 7 Liran Zelkha 2014-01-06 12:22:16 UTC
These tasks are added again and again; the bug was in our cleanup job. A fix is merged for master (see patch http://gerrit.ovirt.org/#/c/22474/), but for now you need to run the workaround daily.

Comment 9 Liran Zelkha 2014-01-07 13:49:07 UTC
Hi Julio,

It seems we have cases where jobs don't have async tasks; our cleanup job does not clean those jobs, so they are kept forever. I'm working on a patch for 3.3.z for this, which hopefully will be merged in a matter of days.

Comment 13 sefi litmanovich 2014-02-16 16:38:30 UTC
The problem still occurs when reproduced with the following verification steps on z-stream:

1. Started a migrate VM task.
2. Stopped the engine immediately.
3. The job status is STARTED.
4. Started the engine.
5. The job status is UNKNOWN.
6. The job table is cleaned after the number of minutes defined in FailedJobCleanupTimeInMinutes in the vdc_options table.

I changed the FailedJobCleanupTimeInMinutes value in the vdc_options table to 2 and reproduced according to the above steps. After the job status changed to UNKNOWN, I waited more than 2 minutes, then ran:

select job_id, correlation_id, action_type, status, start_time, last_update_time from job;

 3fb914c4-6006-428e-84f2-8059db76de27 | 6bd0d51c       | MigrateVm   | UNKNOWN  | 2014-02-16 18:27:51.833+02 | 2014-02-16 18:28:59.908+02

The entry wasn't deleted after more than 2 minutes; it was eventually deleted after 10 minutes, according to the SucceededJobCleanupTimeInMinutes value of 10.
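For reference, the configuration change described above can be made directly against the engine database. This is a sketch based on the option names and values used in this bug; the engine typically needs a restart to pick up the new value.

```sql
-- Sketch: lower the failed-job cleanup threshold to 2 minutes.
-- Run against the 'engine' database; restart the engine afterwards.
UPDATE vdc_options
   SET option_value = '2'
 WHERE option_name = 'FailedJobCleanupTimeInMinutes';
```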

Comment 14 Liran Zelkha 2014-02-18 11:29:56 UTC
Hi Sefi,

This happens because JobCleanupRateInMinutes in vdc_options is configured to 10 minutes. FailedJobCleanupTimeInMinutes determines how old a job must be before it is removed, not when the cleanup job runs.
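The interplay between these options can be checked directly in the database; a query along these lines (option names as used in this bug) shows the current values:

```sql
-- Inspect the cleanup-related settings discussed in this bug.
SELECT option_name, option_value
  FROM vdc_options
 WHERE option_name IN ('JobCleanupRateInMinutes',
                       'FailedJobCleanupTimeInMinutes',
                       'SucceededJobCleanupTimeInMinutes');
```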

Comment 16 sefi litmanovich 2014-02-23 08:32:34 UTC
Hi Liran.

I initially reproduced according to the verification steps of the z-stream bug.
I will verify today with your instructions.

Comment 17 sefi litmanovich 2014-02-23 09:08:12 UTC
Verified with ovirt-engine-3.4.0-0.11.beta3.el6.noarch.

1. Updated FailedJobCleanupTimeInMinutes and JobCleanupRateInMinutes in vdc_options to 2 minutes.
2. Started a VM migration.
3. Stopped the engine:

engine=# select job_id, correlation_id, action_type, status, start_time, last_update_time from job;
                job_id                | correlation_id | action_type | status  |         start_time         |      last_update_time
 98cdde72-3a05-4cfa-b73e-c2fb0220e05e | 76de3762       | MigrateVm   | STARTED | 2014-02-23 10:59:32.619+02 | 2014-02-23 10:59:32.646+02
(1 row)

4. Started the engine:

engine=# select job_id, correlation_id, action_type, status, start_time, last_update_time from job;
                job_id                | correlation_id | action_type | status  |         start_time         |      last_update_time
 98cdde72-3a05-4cfa-b73e-c2fb0220e05e | 76de3762       | MigrateVm   | UNKNOWN | 2014-02-23 10:59:32.619+02 | 2014-02-23 11:00:21.019+02
(1 row)

5. Waited 2 minutes and checked the job table again: the UNKNOWN MigrateVm job was cleaned.

Comment 18 Liran Zelkha 2014-04-01 11:43:46 UTC
It seems that removing the jobs automatically creates problems in many other parts of the engine (see bugs 1079287, 1064227, 1076246). As a result, the solution for this bug was reverted. We need to solve the specific problem in MigrateVm that keeps the job in STARTED state.
Oved - please re-assign.

Comment 19 Oved Ourfali 2014-04-01 12:00:24 UTC
Marking the bug as a virt bug.

This bug describes VM-related flows that are put in UNKNOWN status.
Fixing those in the infrastructure by deleting them is wrong, as the flow itself should cope with this issue.

Comment 20 Michal Skrivanek 2014-04-04 08:48:56 UTC
(linked revert patches as per comment #18)

Comment 21 Michal Skrivanek 2014-04-04 08:55:40 UTC
Cleaning the ZStream flags, as this needs to be re-evaluated once we investigate our options and propose a new/different fix.

Comment 24 Arik 2014-05-15 13:00:35 UTC
Remove VM - it seems that all the tasks finished but the end-action part of the command wasn't called. The logs don't cover the relevant period of time that could explain what happened; we need additional information in order to investigate it properly.

Migrate VM - we fixed a lot of flows where the handling of the migration's job had been wrong since 3.2, and it shouldn't happen anymore. One exception is when the engine restarts, which is a case that wasn't fixed. The solution for that case is to remove the jobs of running migrate/run operations when the engine starts: we don't have the original commands after the restart, so the jobs won't be updated anyway, and this is a simple solution for a problem that is unlikely to happen often.
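The startup cleanup described here would target entries like the following. This is a hypothetical query for illustration, using the table, status, and flow names that appear elsewhere in this bug:

```sql
-- Jobs left behind by migrate/run operations that were in flight when
-- the engine stopped; their commands no longer exist after a restart,
-- so these jobs would never be updated again.
SELECT job_id, action_type, status
  FROM job
 WHERE action_type IN ('MigrateVm', 'MigrateVmToServer',
                       'InternalMigrateVm', 'RunVm', 'RunVmOnce')
   AND status = 'STARTED';
```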

Comment 26 Michal Skrivanek 2014-08-13 08:43:36 UTC
*** Bug 1127642 has been marked as a duplicate of this bug. ***

Comment 28 sefi litmanovich 2014-09-10 13:37:39 UTC
Tested on rhevm-3.5.0-0.10.master.el6ev.noarch.

1. Updated FailedJobCleanupTimeInMinutes and JobCleanupRateInMinutes in vdc_options.

2. Checked the following flows: 'MigrateVm', 'MigrateVmToServer', 'InternalMigrateVm', 'RunVm', 'RunVmOnce'.

In each case I invoked the action and restarted the engine before the action could finish.

In all cases the job (or step, in the case of InternalMigrateVm) had status STARTED at the beginning (as seen in the DB).
Upon engine restart the job/step was deleted from the DB.

Please let me know if there should be further scenarios to test.

Comment 30 errata-xmlrpc 2015-02-11 17:56:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.