Description of problem:
Job and step tables are not cleaned after the failure or completion of some tasks, leaving many tasks marked as running (hourglass) in the lower tasks tab.

Version-Release number of selected component (if applicable):
rhevm-3.2.1-0.40.bz988339.el6ev.noarch (3.2.1 with the patch associated with BZ#1008634)

How reproducible:
Unclear.

Steps to Reproduce:
Unclear at present; the customer has provided examples of migration and guest-removal failures that have led to this.

Actual results:
Tasks appear to be running in the tasks tab but have actually stopped or failed.

Expected results:
Tasks are cleared from the tasks tab when they fail or end.

Additional info:
Hi Lee, Any chance you found those engine.log files?
These tasks are added again and again, and the bug was in our cleaning job. The fix is merged for master (see patch http://gerrit.ovirt.org/#/c/22474/), but for now you need to run the workaround daily...
Hi Julio, It seems that sometimes jobs don't have async tasks, and our cleanup job does not clean those jobs, so they are kept forever. I'm working on a patch for 3.3.z for this; hopefully it will be merged in a matter of days.
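For reference, a hedged sketch of how such stale jobs can be spotted in the engine database, using only the job table columns quoted later in this thread (the 60-minute threshold is an arbitrary example, not a product value):

select job_id, action_type, status, start_time, last_update_time
  from job
 where status = 'STARTED'
   and last_update_time < now() - interval '60 minutes';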
Problem still occurs after reproducing according to the following steps of the z-stream verification:
1. Started a task (migrate VM).
2. Stopped the engine immediately.
3. Status of the job says STARTED.
4. Started the engine.
5. Status of the job says UNKNOWN.
6. The jobs table is cleaned out after the number of minutes defined in FailedJobCleanupTimeInMinutes in the vdc_options table.

I changed the FailedJobCleanupTimeInMinutes value in the vdc_options table to 2 and reproduced according to the steps above. After the status of the job changed to UNKNOWN, I waited more than 2 minutes and ran:

select job_id, correlation_id, action_type, status, start_time, last_update_time from job;

3fb914c4-6006-428e-84f2-8059db76de27 | 6bd0d51c | MigrateVm | UNKNOWN | 2014-02-16 18:27:51.833+02 | 2014-02-16 18:28:59.908+02

The entry wasn't deleted after more than 2 minutes; eventually it was deleted after 10 minutes, according to the SucceededJobCleanupTimeInMinutes value of 10.
Hi Sefi, This happens because JobCleanupRateInMinutes in vdc_options is configured to 10 minutes. FailedJobCleanupTimeInMinutes determines how old jobs must be before they are removed, not how often the cleanup job runs.
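For completeness, a minimal sketch of how these values can be inspected and changed directly in the database, assuming the standard option_name/option_value columns of vdc_options (an engine restart is typically required for the new values to take effect):

select option_name, option_value
  from vdc_options
 where option_name in ('JobCleanupRateInMinutes',
                       'FailedJobCleanupTimeInMinutes',
                       'SucceededJobCleanupTimeInMinutes');

update vdc_options
   set option_value = '2'
 where option_name = 'FailedJobCleanupTimeInMinutes';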
Hi Liran, I initially reproduced according to the verification steps of the z-stream bug. I will verify today with your instructions.
Verified with ovirt-engine-3.4.0-0.11.beta3.el6.noarch.
1. Updated FailedJobCleanupTimeInMinutes and JobCleanupRateInMinutes in vdc_options to 2 minutes.
2. Started a VM migration.
3. Stopped the engine:

engine=# select job_id, correlation_id, action_type, status, start_time, last_update_time from job;
                job_id                | correlation_id | action_type | status  |         start_time         |      last_update_time
--------------------------------------+----------------+-------------+---------+----------------------------+----------------------------
 98cdde72-3a05-4cfa-b73e-c2fb0220e05e | 76de3762       | MigrateVm   | STARTED | 2014-02-23 10:59:32.619+02 | 2014-02-23 10:59:32.646+02
(1 row)

4. Started the engine:

engine=# select job_id, correlation_id, action_type, status, start_time, last_update_time from job;
                job_id                | correlation_id | action_type | status  |         start_time         |      last_update_time
--------------------------------------+----------------+-------------+---------+----------------------------+----------------------------
 98cdde72-3a05-4cfa-b73e-c2fb0220e05e | 76de3762       | MigrateVm   | UNKNOWN | 2014-02-23 10:59:32.619+02 | 2014-02-23 11:00:21.019+02
(1 row)

5. Waited 2 minutes and checked the job table again: the UNKNOWN MigrateVm job was cleaned.
It seems that removing the jobs automatically creates problems in many other parts of the engine (see bugs 1079287, 1064227, 1076246). As a result, the solution for this bug was reverted. We need to solve the specific problem in MigrateVm that keeps the job in STARTED state. Oved - please re-assign.
Marking the bug as a virt bug. This bug describes VM-related flows that are left in UNKNOWN status. Fixing those in the infrastructure by deleting the jobs is wrong, as the flow itself should cope with this issue.
(linked revert patches as per comment #18)
Clearing the z-stream and flags, as this needs to be re-evaluated once we investigate our options and propose a new/different fix.
Remove VM - it seems that all the tasks finished, but the end-action part of the command wasn't called. The logs don't cover the relevant period of time that could explain what happened; we need additional information in order to investigate it properly.

Migrate VM - since 3.2 we have fixed a lot of flows where the handling of the migration's job was wrong, so it shouldn't happen anymore. One exception is when the engine restarts, a case that wasn't fixed. The solution for that case is to remove the jobs of running migrate/run operations when the engine starts: we don't have the original commands after the restart, so the jobs won't be updated anyway, and this is a simple solution for a problem that is not likely to happen often.
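To make the described behavior concrete, here is a hedged sketch of the kind of startup cleanup proposed above, expressed against the job table columns quoted earlier in this thread; the actual fix runs inside the engine at startup, and the list of action types here is an assumption for illustration only:

delete from job
 where status = 'STARTED'
   and action_type in ('MigrateVm', 'MigrateVmToServer',
                       'InternalMigrateVm', 'RunVm', 'RunVmOnce');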
*** Bug 1127642 has been marked as a duplicate of this bug. ***
Tested on rhevm-3.5.0-0.10.master.el6ev.noarch.
1. Updated FailedJobCleanupTimeInMinutes and JobCleanupRateInMinutes in vdc_options.
2. Checked the following flows: MigrateVm, MigrateVmToServer, InternalMigrateVm, RunVm, RunVmOnce. In each case I invoked the action and restarted the engine before the action could finish properly. In all cases the job (or step, in the case of InternalMigrateVm) had status STARTED at the beginning, which can be seen in the DB; upon engine restart the job/step was deleted from the DB. Please let me know if there are further scenarios to test.
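For reference, a hedged sketch of queries that can be used to observe the job/step state in the database during this verification (the step table column names are an assumption, modeled on the job table shown earlier):

select job_id, action_type, status, last_update_time from job;
select step_id, job_id, step_type, status from step;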
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-0158.html