Bug 1040952 - Job and step tables not cleaned after the failure or completion of some tasks.
Summary: Job and step tables not cleaned after the failure or completion of some tasks.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.5.0
Assignee: Arik
QA Contact: sefi litmanovich
URL:
Whiteboard: virt
Duplicates: 1127642 (view as bug list)
Depends On:
Blocks: 1055162 rhev3.4beta 1099505 rhev3.5beta 1142926 1156165
 
Reported: 2013-12-12 11:58 UTC by Lee Yarwood
Modified: 2018-12-04 16:40 UTC (History)
24 users (show)

Fixed In Version: vt2.2
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1055162 1099505 (view as bug list)
Environment:
Last Closed: 2015-02-11 17:56:34 UTC
oVirt Team: ---
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2015:0158 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Virtualization Manager 3.5.0 2015-02-11 22:38:50 UTC
oVirt gerrit 22474 0 None MERGED engine: Delete jobs that their steps have no async-tasks 2020-10-21 14:42:20 UTC
oVirt gerrit 23036 0 None MERGED engine: Add dao_unit_test to DeleteCompletedJobsOlderThanDate 2020-10-21 14:42:32 UTC
oVirt gerrit 26291 0 None MERGED core: Revert Delete jobs that their steps have no async-tasks 2020-10-21 14:42:19 UTC
oVirt gerrit 26292 0 None MERGED core: Revert Delete jobs that their steps have no async-tasks 2020-10-21 14:42:20 UTC
oVirt gerrit 26293 0 None None None Never
oVirt gerrit 27372 0 master MERGED core: remove jobs of IVdsAsyncCommands on engine startup 2020-10-21 14:42:20 UTC
oVirt gerrit 31153 0 master MERGED core: remove job with migration step on engine startup 2020-10-21 14:42:33 UTC
oVirt gerrit 31430 0 master MERGED core: do not remove execution jobs with ongoing tasks 2020-10-21 14:42:33 UTC
oVirt gerrit 31485 0 ovirt-engine-3.5 MERGED core: remove job with migration step on engine startup 2020-10-21 14:42:20 UTC
oVirt gerrit 31486 0 ovirt-engine-3.5 MERGED core: do not remove execution jobs with ongoing tasks 2020-10-21 14:42:20 UTC

Description Lee Yarwood 2013-12-12 11:58:41 UTC
Description of problem:

The job and step tables are not cleaned after the failure or completion of some tasks, leaving many tasks marked as running (hourglass) in the lower Tasks tab.

Version-Release number of selected component (if applicable):
rhevm-3.2.1-0.40.bz988339.el6ev.noarch 
(3.2.1 with the patch associated with BZ#1008634)

How reproducible:
Unclear.

Steps to Reproduce:
Unclear at present; the customer has provided examples of migration and guest-removal failures that led to this.

Actual results:
Tasks appear to be running in the tasks tab but have actually stopped or failed.

Expected results:
Tasks are cleared from the Tasks tab when they fail or end.

Additional info:

Comment 5 Liran Zelkha 2013-12-17 13:14:49 UTC
Hi Lee,

Any chance you found those engine.log files?

Comment 7 Liran Zelkha 2014-01-06 12:22:16 UTC
These tasks are added again and again, and the bug is in our cleanup job. A fix is merged for master (see patch http://gerrit.ovirt.org/#/c/22474/), but for now you need to run the workaround daily...
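The workaround script itself is not attached to this bug; as an illustrative sketch only (statuses and the job/step table shapes are assumed from the engine DB schema), a daily manual cleanup of finished jobs and their steps might look like:

```sql
-- Illustrative sketch; the actual workaround script is not in this bug.
-- Assumes job(job_id, status) and step(job_id) as in the engine database.
DELETE FROM step
 WHERE job_id IN (SELECT job_id FROM job
                   WHERE status IN ('FINISHED', 'FAILED', 'ABORTED', 'UNKNOWN'));
DELETE FROM job
 WHERE status IN ('FINISHED', 'FAILED', 'ABORTED', 'UNKNOWN');
```

Deleting the steps first keeps the two tables consistent, since steps reference their parent job.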

Comment 9 Liran Zelkha 2014-01-07 13:49:07 UTC
Hi Julio,

It seems that some jobs have no async tasks, and our cleanup job does not remove those jobs, so they are kept forever. I'm working on a patch for 3.3.z; hopefully it will be merged in a matter of days.

Comment 13 sefi litmanovich 2014-02-16 16:38:30 UTC
Problem still occurs after reproducing according to the following zstream verification steps:

1. started a migrate-VM task.
2. stopped the engine immediately.
3. job status is STARTED.
4. started the engine.
5. job status is UNKNOWN.
6. the jobs table is cleaned out after the number of minutes defined by FailedJobCleanupTimeInMinutes in the vdc_options table.


I changed the FailedJobCleanupTimeInMinutes value in the vdc_options table to 2 and reproduced according to the above steps. After the job status changed to UNKNOWN I waited more than 2 minutes, then ran:

select job_id, correlation_id, action_type, status, start_time, last_update_time from job;

 3fb914c4-6006-428e-84f2-8059db76de27 | 6bd0d51c       | MigrateVm   | UNKNOWN  | 2014-02-16 18:27:51.833+02 | 2014-02-16 18:28:59.908+02

The entry wasn't deleted after more than 2 minutes; it was eventually deleted after 10 minutes, matching the SucceededJobCleanupTimeInMinutes value of 10.
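For reference, the configuration change described above can be made directly in the database; a sketch, assuming the standard vdc_options(option_name, option_value) columns:

```sql
UPDATE vdc_options
   SET option_value = '2'
 WHERE option_name = 'FailedJobCleanupTimeInMinutes';
-- The engine reads most vdc_options values at startup,
-- so a restart may be needed for the change to take effect.
```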

Comment 14 Liran Zelkha 2014-02-18 11:29:56 UTC
Hi Sefi,

This is because JobCleanupRateInMinutes in vdc_options is set to 10 minutes. FailedJobCleanupTimeInMinutes determines how old a job must be before it is removed, not when the cleanup job runs.
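In other words, the cleanup thread only wakes up every JobCleanupRateInMinutes, so a job older than FailedJobCleanupTimeInMinutes still survives until the next cleanup pass. The relevant values can be inspected with a query like the following (column names assumed as above):

```sql
SELECT option_name, option_value
  FROM vdc_options
 WHERE option_name IN ('JobCleanupRateInMinutes',
                       'FailedJobCleanupTimeInMinutes',
                       'SucceededJobCleanupTimeInMinutes');
```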

Comment 16 sefi litmanovich 2014-02-23 08:32:34 UTC
Hi Liran.

I initially reproduced according to the zstream bug's verification steps.
I will verify today with your instructions.

Comment 17 sefi litmanovich 2014-02-23 09:08:12 UTC
Verified with ovirt-engine-3.4.0-0.11.beta3.el6.noarch.

1. updated FailedJobCleanupTimeInMinutes and JobCleanupRateInMinutes in vdc_options to 2 minutes
2. started a VM migration
3. stopped the engine:

engine=# select job_id, correlation_id, action_type, status, start_time, last_update_time from job;
                job_id                | correlation_id | action_type | status  |         start_time         |      last_update_time      
--------------------------------------+----------------+-------------+---------+----------------------------+----------------------------
 98cdde72-3a05-4cfa-b73e-c2fb0220e05e | 76de3762       | MigrateVm   | STARTED | 2014-02-23 10:59:32.619+02 | 2014-02-23 10:59:32.646+02
(1 row)

4. started engine:

engine=# select job_id, correlation_id, action_type, status, start_time, last_update_time from job;
                job_id                | correlation_id | action_type | status  |         start_time         |      last_update_time      
--------------------------------------+----------------+-------------+---------+----------------------------+----------------------------
 98cdde72-3a05-4cfa-b73e-c2fb0220e05e | 76de3762       | MigrateVm   | UNKNOWN | 2014-02-23 10:59:32.619+02 | 2014-02-23 11:00:21.019+02
(1 row)

5. waited 2 minutes and checked the job table again - the UNKNOWN MigrateVm job had been cleaned

Comment 18 Liran Zelkha 2014-04-01 11:43:46 UTC
It seems that removing the jobs automatically creates problems in many other parts of the engine (see bugs 1079287, 1064227, 1076246). As a result, the solution for this bug was reverted. We need to solve the specific problem in MigrateVm that keeps the job in the STARTED state.
Oved - please re-assign.

Comment 19 Oved Ourfali 2014-04-01 12:00:24 UTC
Marking the bug as a virt bug.

This bug describes VM-related flows whose jobs are left in UNKNOWN status.
Fixing those in the infrastructure by deleting the jobs is wrong; the flow itself should cope with this issue.

Comment 20 Michal Skrivanek 2014-04-04 08:48:56 UTC
(linked revert patches as per comment #18)

Comment 21 Michal Skrivanek 2014-04-04 08:55:40 UTC
Clearing the zstream and flags, as this needs to be re-evaluated once we investigate our options and propose a new/different fix.

Comment 24 Arik 2014-05-15 13:00:35 UTC
Remove VM - it seems all the tasks finished, but the command's end-action was never called. The logs don't cover the period of time that could explain what happened; we need additional information to investigate properly.

Migrate VM - since 3.2 we have fixed many flows where the migration's job was handled incorrectly, so this shouldn't happen anymore. One exception, which wasn't fixed, is when the engine restarts. The solution for that case is to remove the jobs of running migrate/run operations when the engine starts - we don't have the original commands after the restart, so the jobs wouldn't be updated anyway, and this is a simple solution for a problem that is unlikely to occur often.
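Conceptually, that startup cleanup amounts to something like the following. This is illustrative only - the actual fix lives in the engine's Java code (see gerrit 27372 and 31153), and the step_type value used here is an assumption:

```sql
-- Hypothetical sketch of what runs once at engine startup: drop jobs left
-- over from migration commands, since the original commands no longer exist
-- after the restart and would never update these jobs.
DELETE FROM job
 WHERE job_id IN (SELECT job_id FROM step
                   WHERE step_type = 'MIGRATE_VM');
```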

Comment 26 Michal Skrivanek 2014-08-13 08:43:36 UTC
*** Bug 1127642 has been marked as a duplicate of this bug. ***

Comment 28 sefi litmanovich 2014-09-10 13:37:39 UTC
Tested on rhevm-3.5.0-0.10.master.el6ev.noarch.


1. updated FailedJobCleanupTimeInMinutes and JobCleanupRateInMinutes in vdc_options

2. checked the following flows: 'MigrateVm', 'MigrateVmToServer', 'InternalMigrateVm', 'RunVm', 'RunVmOnce'.

In each case I invoked the action and restarted the engine before the action could finish properly.

In all cases the job (or step, in the case of InternalMigrateVm) initially had status STARTED (as seen in the DB).
Upon engine restart the job/step was deleted from the DB.

Please let me know if there are further scenarios to test.

Comment 30 errata-xmlrpc 2015-02-11 17:56:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0158.html

