Bug 981420

Summary: A running zombie task is not removed, leading to it being repolled after engine restart and spamming engine.log.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Lee Yarwood <lyarwood>
Component: ovirt-engine
Assignee: Ravi Nori <rnori>
Status: CLOSED ERRATA
QA Contact: Pavel Stehlik <pstehlik>
Severity: high
Priority: high
Docs Contact:
Version: 3.2.0
CC: aberezin, acathrow, bazulay, emesika, iheim, jkt, lpeer, lyarwood, pstehlik, Rhev-m-bugs, rnori, tpoitras, yeylon, yzaslavs
Target Milestone: ---
Target Release: 3.4.0
Hardware: x86_64
OS: Linux
Whiteboard: infra
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-06-09 14:59:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Lee Yarwood 2013-07-04 16:06:59 UTC
Description of problem:

A MoveOrCopyDisk task is not cleared after being classified as a zombie, leading to it being repolled after an engine restart.

engine.log.1-20130703_031502_3979.xz

- The MoveOrCopyDisk task starts:

46474 2013-06-21 08:39:48,733 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MoveImageGroupVDSCommand] (pool-4-thread-49) [5c1a816b] START, MoveImageGroupVDSCommand( storagePoolId = f7d1808d-a7d2-43b4-904d-879b908b79dd, ignoreFailoverLi mit = false, compatabilityVersion = 3.2, storageDomainId = 6809866e-07cc-49b1-a342-6658b1cf7ab6, imageGroupId = fc27802f-594e-4ded-99ff-8bef43734c64, dstDomainId = 6dd9719a-5268-42ab-b43b-d493c9c7907c, vmId = 00000000-0000-0000-0000-000000000000, op = Move, postZero = false, force = false), log id: 110f3519
46475 2013-06-21 08:39:49,389 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MoveImageGroupVDSCommand] (pool-4-thread-49) [5c1a816b] FINISH, MoveImageGroupVDSCommand, log id: 110f3519
46476 2013-06-21 08:39:49,392 INFO  [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-4-thread-49) [5c1a816b] EntityAsyncTask::Adding EntityMultiAsyncTasks object for entity 9395daf3-073d-4521-9120-ca57c8abc0ba
46477 2013-06-21 08:39:49,392 INFO  [org.ovirt.engine.core.bll.EntityMultiAsyncTasks] (pool-4-thread-49) [5c1a816b] EntityMultiAsyncTasks::AttachTask: Attaching task 44fcdad3-695a-462a-b11d-a09e004c8a90 to entity 9395daf3-073d-4521-9120-ca57c8abc0ba.
46478 2013-06-21 08:39:49,427 INFO  [org.ovirt.engine.core.bll.AsyncTaskManager] (pool-4-thread-49) [5c1a816b] Adding task 44fcdad3-695a-462a-b11d-a09e004c8a90 (Parent Command MoveOrCopyDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters), polling hasn't started yet..
[..]
46487 2013-06-21 08:39:56,490 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-83) SPMAsyncTask::PollTask: Polling task 44fcdad3-695a-462a-b11d-a09e004c8a90 (Parent Command MoveOrCopyDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) returned status running.

- The task hits the zombie timeout and we attempt to clean it, but as it's still 'running' we can't:

47926 2013-06-23 10:39:51,535 INFO  [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-18) Cleaning zombie tasks: Stopping async task MoveOrCopyDisk that started at 6/21/13 8:39 AM
[..]
47932 2013-06-23 10:40:01,641 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-60) BaseAsyncTask::OnTaskEndSuccess: Task 44fcdad3-695a-462a-b11d-a09e004c8a90 (Parent Command MoveOrCopyDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended successfully.
[..]
47958 2013-06-23 10:40:02,291 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (pool-4-thread-48) SPMAsyncTask::ClearAsyncTask: Attempting to clear task 44fcdad3-695a-462a-b11d-a09e004c8a90
47959 2013-06-23 10:40:02,291 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (pool-4-thread-48) START, SPMClearTaskVDSCommand( storagePoolId = f7d1808d-a7d2-43b4-904d-879b908b79dd, ignoreFailoverLimit = false, compatabilityVersion = null, taskId = 44fcdad3-695a-462a-b11d-a09e004c8a90), log id: dbf6a03
47960 2013-06-23 10:40:02,316 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (pool-4-thread-48) START, HSMClearTaskVDSCommand(HostName = jtest01, HostId = c5ea366a-43a0-11e2-b207-ff9e163144da, taskId=44fcdad3-695a-462a-b11d-a09e004c8a90), log id: 6b03ae51
47961 2013-06-23 10:40:02,375 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (pool-4-thread-48) FINISH, HSMClearTaskVDSCommand, log id: 6b03ae51
47962 2013-06-23 10:40:02,375 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (pool-4-thread-48) FINISH, SPMClearTaskVDSCommand, log id: dbf6a03
47963 2013-06-23 10:40:02,375 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (pool-4-thread-48) SPMAsyncTask::ClearAsyncTask: At time of attemp to clear task 44fcdad3-695a-462a-b11d-a09e004c8a90 the response code was Operation is not allowed in this task state: ("can't clean in state running",) and message was {3}. Task will not be cleaned

- The engine is restarted some time later and the task is picked back up:

54942 2013-07-02 16:45:55,717 INFO  [org.ovirt.engine.core.bll.Backend] (ServerService Thread Pool -- 34) Start time: 7/2/13 4:45 PM
[..]
55109 2013-07-02 16:46:07,594 INFO  [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-31) Adding task 44fcdad3-695a-462a-b11d-a09e004c8a90 (Parent Command Unknown, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters), polling hasn't started yet..
[..]
55113 2013-07-02 16:46:07,645 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-31) BaseAsyncTask::StartPollingTask: Starting to poll task 44fcdad3-695a-462a-b11d-a09e004c8a90.

- We then start to see the following in the logs; after a few days this continues to spam them:

55142 2013-07-02 16:46:17,038 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-59) task id 44fcdad3-695a-462a-b11d-a09e004c8a90 is in pre-polling  period and should not be polled. Pre-polling period is 60,000 millis.      
55143 2013-07-02 16:46:27,040 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-90) task id 44fcdad3-695a-462a-b11d-a09e004c8a90 is in pre-polling  period and should not be polled. Pre-polling period is 60,000 millis.      
55144 2013-07-02 16:46:37,041 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-27) task id 44fcdad3-695a-462a-b11d-a09e004c8a90 is in pre-polling  period and should not be polled. Pre-polling period is 60,000 millis. 
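
The repeated messages above come from the engine's pre-polling guard: a freshly re-added task is not polled until its pre-polling window expires, and every poll cycle that lands inside the window logs a line instead. A minimal sketch of that guard, written in Python purely for illustration (the engine itself is Java, and the class and method names here are stand-ins, not the real SPMAsyncTask API):

```python
import time

PRE_POLLING_PERIOD_MS = 60_000  # matches the "60,000 millis" in the log

class PolledTask:
    """Simplified stand-in for the engine's pre-polling check."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.started_at_ms = time.monotonic() * 1000

    def should_poll(self):
        # A task inside the pre-polling window is skipped; each skipped
        # poll cycle emits a log line, which is what fills engine.log
        # after the restart above.
        elapsed_ms = time.monotonic() * 1000 - self.started_at_ms
        if elapsed_ms < PRE_POLLING_PERIOD_MS:
            print("task id %s is in pre-polling period and should not be "
                  "polled. Pre-polling period is %d millis."
                  % (self.task_id, PRE_POLLING_PERIOD_MS))
            return False
        return True
```

In the bug, the restored task keeps returning 'running' forever, so the poll loop never terminates and these lines repeat indefinitely.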

Version-Release number of selected component (if applicable):
rhevm-3.2.0-11.30.el6ev.noarch                              Wed Jun 12 14:01:19 2013
rhevm-backend-3.2.0-11.30.el6ev.noarch                      Wed Jun 12 14:01:01 2013
rhevm-cli-3.2.0.9-1.el6ev.noarch                            Wed Jun 12 14:00:47 2013
rhevm-config-3.2.0-11.30.el6ev.noarch                       Wed Jun 12 14:01:01 2013
rhevm-dbscripts-3.2.0-11.30.el6ev.noarch                    Wed Jun 12 14:01:02 2013
rhevm-doc-3.2.0-4.el6eng.noarch                             Wed Jun 12 14:00:52 2013
rhevm-genericapi-3.2.0-11.30.el6ev.noarch                   Wed Jun 12 14:01:00 2013
rhevm-image-uploader-3.2.2-1.el6ev.noarch                   Wed Jun 12 14:00:47 2013
rhevm-iso-uploader-3.2.2-1.el6ev.noarch                     Wed Jun 12 14:00:47 2013
rhevm-log-collector-3.2.2-3.el6ev.noarch                    Wed Jun 12 14:00:47 2013
rhevm-notification-service-3.2.0-11.30.el6ev.noarch         Wed Jun 12 14:01:01 2013
rhevm-restapi-3.2.0-11.30.el6ev.noarch                      Wed Jun 12 14:01:01 2013
rhevm-sdk-3.2.0.11-1.el6ev.noarch                           Wed Jun 12 14:00:46 2013
rhevm-setup-3.2.0-11.30.el6ev.noarch                        Wed Jun 12 13:59:24 2013
rhevm-spice-client-x64-cab-3.2-10.el6ev.noarch              Wed Jun 12 14:01:02 2013
rhevm-spice-client-x86-cab-3.2-10.el6ev.noarch              Wed Jun 12 14:01:09 2013
rhevm-tools-common-3.2.0-11.30.el6ev.noarch                 Wed Jun 12 14:01:01 2013
rhevm-userportal-3.2.0-11.30.el6ev.noarch                   Wed Jun 12 14:01:09 2013
rhevm-webadmin-portal-3.2.0-11.30.el6ev.noarch              Wed Jun 12 14:01:18 2013

How reproducible:
No attempt has been made to reproduce this yet.

Steps to Reproduce:
1. Reduce the zombie timeout.
2. Start a task that will hit the reduced timeout while still 'running'.
3. Confirm the task is not stopped.
4. Restart the engine service.
5. Review engine.log.

Actual results:
The task is not cleared.

Expected results:
The task is cleared and a service restart does not result in the logs being spammed.

Additional info:

Comment 3 Ravi Nori 2013-07-17 16:34:00 UTC
This issue is not reproducible with 3.3. 

MoveImageGroupVDSCommand now performs a Copy and a Remove (BZ 753549); the copy can be stopped, so when the zombie timeout is reduced the task is stopped and cleared from the DB.

Comment 4 Lee Yarwood 2013-07-23 15:18:46 UTC
(In reply to Ravi Nori from comment #3)
> This issue is not reproducible with 3.3. 
> 
> MoveImageGroupVDSCommand now performs a Copy and Remove (BZ 753549), copy
> can be stopped so when the zombie timeout is reduced the task is stopped and
> cleared from DB.

So the ClearAsyncTask call completes correctly in 3.3? Again, the error in 3.2 was (I'm going to send a commit upstream to improve the string):

(pool-4-thread-48) SPMAsyncTask::ClearAsyncTask: At time of attemp to clear task 44fcdad3-695a-462a-b11d-a09e004c8a90 the response code was Operation is not allowed in this task state: ("can't clean in state running",) and message was {3}. Task will not be cleaned

Can you think of any other commands where we would be unable to clear a running vdsm task?

Comment 5 Lee Yarwood 2013-07-31 12:57:40 UTC
Replying to a question Ravi emailed to me here.

> My question is are there any other cases where the vdsm would throw the
> error "can't clean in state running"

Quickly checking the vdsm source, it appears that any task that is not 'done' will raise this error when we attempt to clean it:

vdsm/vdsm/storage/task.py

    def clean(self):
        if not self.store:
            return
        if not self.isDone():
            raise se.TaskStateError("can't clean in state %s" % self.state)
        self._clean(self.store)
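
A toy model of that state check also illustrates why the 3.3 behaviour described in comment 3 avoids the bug: once a task can be stopped (moved out of 'running' into a done state), clean() no longer raises. The state names and TaskStateError below are simplified stand-ins for the vdsm originals, not the actual implementation:

```python
class TaskStateError(Exception):
    """Stand-in for vdsm's se.TaskStateError."""

class Task:
    """Toy model of the vdsm task states relevant to clean()."""

    def __init__(self):
        self.state = "running"
        self.cleaned = False

    def is_done(self):
        # Mirrors isDone(): only a task that has finished or been
        # aborted may be cleaned.
        return self.state in ("finished", "aborted")

    def stop(self):
        # Stopping moves the task out of 'running'; the 3.3
        # copy-and-remove flow can do this where the 3.2 move could not.
        self.state = "aborted"

    def clean(self):
        if not self.is_done():
            raise TaskStateError("can't clean in state %s" % self.state)
        self.cleaned = True
```

Attempting clean() on a running task raises exactly as seen in the 3.2 log; calling stop() first lets clean() succeed.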

Comment 6 Ravi Nori 2013-07-31 13:09:15 UTC
Thanks Lee. Yes, the call completes in 3.3 and there is no "can't clean in state running" message.

Comment 7 Ravi Nori 2013-09-11 18:48:38 UTC
The issue is not reproducible on 3.2 either. Is there more information about the client's setup that could help us reproduce it?

Comment 10 Andrew Cathrow 2013-09-16 14:06:56 UTC
Moving to 3.4.
When we have a reproducer we can look at this for 3.3.z or even 3.4.

Comment 14 Eli Mesika 2014-03-03 14:09:05 UTC
Moving to QA per comment 12.

Comment 17 Timothy 2014-05-14 00:47:07 UTC
As this issue cannot be reproduced, no doc text is required for errata.

Comment 18 errata-xmlrpc 2014-06-09 14:59:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0506.html