Description of problem:
Job and step tables are not cleaned after the failure or completion of some tasks, leaving many tasks marked as running (hourglass) in the lower tasks tab.

Version-Release number of selected component (if applicable):
rhevm-3.2.1-0.40.bz988339.el6ev.noarch (3.2.1 with the patch associated with BZ#1008634)

How reproducible:
Unclear.

Steps to Reproduce:
Unclear at present; the customer has provided examples of migration and guest-removal failures that have led to this.

Actual results:
Tasks appear to be running in the tasks tab but have actually stopped or failed.

Expected results:
Tasks are cleared from the tasks tab when they fail or end.

Additional info:
Hi Lee, Any chance you found those engine.log files?
These tasks are added again and again, and the bug was in our cleaning job. The fix is merged for master (see patch http://gerrit.ovirt.org/#/c/22474/), but for now you need to run the workaround daily...
Hi Julio, It seems that sometimes jobs don't have async tasks, and our cleanup job does not clean those jobs, so they are kept forever. I'm working on a patch for 3.3.z for this; hopefully it will be merged in a matter of days.
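For reference, a hedged sketch of how such stale jobs can be spotted in the engine database, using only the job table columns quoted later in this thread (the 60-minute threshold is an arbitrary example, not a product value):

select job_id, action_type, status, start_time, last_update_time
  from job
 where status = 'STARTED'
   and last_update_time < now() - interval '60 minutes';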
Problem still occurs after reproducing according to the following steps of the z-stream verification:
1. Started a task (migrate VM).
2. Stopped the engine immediately.
3. Status of the job says STARTED.
4. Started the engine.
5. Status of the job says UNKNOWN.
6. The jobs table is cleaned out after the number of minutes defined in FailedJobCleanupTimeInMinutes in the vdc_options table.

I changed the FailedJobCleanupTimeInMinutes value in the vdc_options table to 2 and reproduced according to the steps above. After the status of the job changed to UNKNOWN, I waited more than 2 minutes and ran:

select job_id, correlation_id, action_type, status, start_time, last_update_time from job;

3fb914c4-6006-428e-84f2-8059db76de27 | 6bd0d51c | MigrateVm | UNKNOWN | 2014-02-16 18:27:51.833+02 | 2014-02-16 18:28:59.908+02

The entry wasn't deleted after more than 2 minutes; eventually it was deleted after 10 minutes, according to the SucceededJobCleanupTimeInMinutes value of 10.
Hi Sefi, This happens because JobCleanupRateInMinutes in vdc_options is configured to 10 minutes. FailedJobCleanupTimeInMinutes determines how old jobs must be before they are removed, not how often the cleanup job runs.
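For completeness, a minimal sketch of how these values can be inspected and changed directly in the database, assuming the standard option_name/option_value columns of vdc_options (an engine restart is typically required for the new values to take effect):

select option_name, option_value
  from vdc_options
 where option_name in ('JobCleanupRateInMinutes',
                       'FailedJobCleanupTimeInMinutes',
                       'SucceededJobCleanupTimeInMinutes');

update vdc_options
   set option_value = '2'
 where option_name = 'FailedJobCleanupTimeInMinutes';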
Hi Liran, I initially reproduced according to the verification steps of the z-stream bug. I will verify today with your instructions.
Verified with ovirt-engine-3.4.0-0.11.beta3.el6.noarch.
1. Updated FailedJobCleanupTimeInMinutes and JobCleanupRateInMinutes in vdc_options to 2 minutes.
2. Started a VM migration.
3. Stopped the engine:

engine=# select job_id, correlation_id, action_type, status, start_time, last_update_time from job;
                job_id                | correlation_id | action_type | status  |         start_time         |      last_update_time
--------------------------------------+----------------+-------------+---------+----------------------------+----------------------------
 98cdde72-3a05-4cfa-b73e-c2fb0220e05e | 76de3762       | MigrateVm   | STARTED | 2014-02-23 10:59:32.619+02 | 2014-02-23 10:59:32.646+02
(1 row)

4. Started the engine:

engine=# select job_id, correlation_id, action_type, status, start_time, last_update_time from job;
                job_id                | correlation_id | action_type | status  |         start_time         |      last_update_time
--------------------------------------+----------------+-------------+---------+----------------------------+----------------------------
 98cdde72-3a05-4cfa-b73e-c2fb0220e05e | 76de3762       | MigrateVm   | UNKNOWN | 2014-02-23 10:59:32.619+02 | 2014-02-23 11:00:21.019+02
(1 row)

5. Waited 2 minutes and checked the job table again: the UNKNOWN MigrateVm job was cleaned.
It seems that removing the jobs automatically creates problems in many other parts of the engine (see bugs 1079287, 1064227, 1076246). As a result, the solution for this bug was reverted. We need to solve the specific problem in MigrateVm that keeps the job in STARTED state. Oved - please re-assign.
Marking the bug as a virt bug. This bug describes VM-related flows that are left in UNKNOWN status. Fixing those in the infrastructure by deleting the jobs is wrong, as the flow itself should cope with this issue.
(linked revert patches as per comment #18)
Clearing the z-stream and flags, as this needs to be re-evaluated once we investigate our options and propose a new/different fix.
Remove VM - it seems that all the tasks finished, but the end-action part of the command wasn't called. The logs don't cover the relevant period of time that could explain what happened; we need additional information in order to investigate it properly.

Migrate VM - since 3.2 we have fixed a lot of flows where the handling of the migration's job was wrong, so it shouldn't happen anymore. One exception is when the engine restarts, a case that wasn't fixed. The solution for that case is to remove the jobs of running migrate/run operations when the engine starts: we don't have the original commands after the restart, so the jobs won't be updated anyway, and this is a simple solution for a problem that is not likely to happen often.
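To make the described behavior concrete, here is a hedged sketch of the kind of startup cleanup proposed above, expressed against the job table columns quoted earlier in this thread; the actual fix runs inside the engine at startup, and the list of action types here is an assumption for illustration only:

delete from job
 where status = 'STARTED'
   and action_type in ('MigrateVm', 'MigrateVmToServer',
                       'InternalMigrateVm', 'RunVm', 'RunVmOnce');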
*** Bug 1127642 has been marked as a duplicate of this bug. ***
Tested on rhevm-3.5.0-0.10.master.el6ev.noarch.
1. Updated FailedJobCleanupTimeInMinutes and JobCleanupRateInMinutes in vdc_options.
2. Checked the following flows: MigrateVm, MigrateVmToServer, InternalMigrateVm, RunVm, RunVmOnce. In each case I invoked the action and restarted the engine before the action could finish properly. In all cases the job (or step, in the case of InternalMigrateVm) had status STARTED at the beginning, which can be seen in the DB; upon engine restart the job/step was deleted from the DB. Please let me know if there are further scenarios to test.
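For reference, a hedged sketch of queries that can be used to observe the job/step state in the database during this verification (the step table column names are an assumption, modeled on the job table shown earlier):

select job_id, action_type, status, last_update_time from job;
select step_id, job_id, step_type, status from step;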
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-0158.html