Bug 981420 - A running zombie task is not removed leading to it being repolled after engine restart and spamming engine.log.
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: x86_64 Linux
Priority: high  Severity: high
Target Milestone: ---
Target Release: 3.4.0
Assigned To: Ravi Nori
QA Contact: Pavel Stehlik
Whiteboard: infra
Depends On:
Blocks:
Reported: 2013-07-04 12:06 EDT by Lee Yarwood
Modified: 2016-02-10 14:26 EST (History)
CC List: 14 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-06-09 10:59:45 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: none
Description Lee Yarwood 2013-07-04 12:06:59 EDT
Description of problem:

A MoveOrCopyDisk task is not cleared after being classified as a zombie, leading to it being repolled after an engine restart.

engine.log.1-20130703_031502_3979.xz

- The MoveOrCopyDisk task starts:

46474 2013-06-21 08:39:48,733 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MoveImageGroupVDSCommand] (pool-4-thread-49) [5c1a816b] START, MoveImageGroupVDSCommand( storagePoolId = f7d1808d-a7d2-43b4-904d-879b908b79dd, ignoreFailoverLi mit = false, compatabilityVersion = 3.2, storageDomainId = 6809866e-07cc-49b1-a342-6658b1cf7ab6, imageGroupId = fc27802f-594e-4ded-99ff-8bef43734c64, dstDomainId = 6dd9719a-5268-42ab-b43b-d493c9c7907c, vmId = 00000000-0000-0000-0000-000000000000, op = Move, postZero = false, force = false), log id: 110f3519
46475 2013-06-21 08:39:49,389 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MoveImageGroupVDSCommand] (pool-4-thread-49) [5c1a816b] FINISH, MoveImageGroupVDSCommand, log id: 110f3519
46476 2013-06-21 08:39:49,392 INFO  [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-4-thread-49) [5c1a816b] EntityAsyncTask::Adding EntityMultiAsyncTasks object for entity 9395daf3-073d-4521-9120-ca57c8abc0ba
46477 2013-06-21 08:39:49,392 INFO  [org.ovirt.engine.core.bll.EntityMultiAsyncTasks] (pool-4-thread-49) [5c1a816b] EntityMultiAsyncTasks::AttachTask: Attaching task 44fcdad3-695a-462a-b11d-a09e004c8a90 to entity 9395daf3-073d-4521-9120-ca57c8abc0ba.
46478 2013-06-21 08:39:49,427 INFO  [org.ovirt.engine.core.bll.AsyncTaskManager] (pool-4-thread-49) [5c1a816b] Adding task 44fcdad3-695a-462a-b11d-a09e004c8a90 (Parent Command MoveOrCopyDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters), polling hasn't started yet..
[..]
46487 2013-06-21 08:39:56,490 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-83) SPMAsyncTask::PollTask: Polling task 44fcdad3-695a-462a-b11d-a09e004c8a90 (Parent Command MoveOrCopyDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) returned status running.

- The task hits the zombie timeout and the engine attempts to clean it, but as it is still 'running' the cleanup fails:

47926 2013-06-23 10:39:51,535 INFO  [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-18) Cleaning zombie tasks: Stopping async task MoveOrCopyDisk that started at 6/21/13 8:39 AM
[..]
47932 2013-06-23 10:40:01,641 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-60) BaseAsyncTask::OnTaskEndSuccess: Task 44fcdad3-695a-462a-b11d-a09e004c8a90 (Parent Command MoveOrCopyDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended successfully.
[..]
47958 2013-06-23 10:40:02,291 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (pool-4-thread-48) SPMAsyncTask::ClearAsyncTask: Attempting to clear task 44fcdad3-695a-462a-b11d-a09e004c8a90
47959 2013-06-23 10:40:02,291 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (pool-4-thread-48) START, SPMClearTaskVDSCommand( storagePoolId = f7d1808d-a7d2-43b4-904d-879b908b79dd, ignoreFailoverLimit = false, compatabilityVersion = null, taskId = 44fcdad3-695a-462a-b11d-a09e004c8a90), log id: dbf6a03
47960 2013-06-23 10:40:02,316 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (pool-4-thread-48) START, HSMClearTaskVDSCommand(HostName = jtest01, HostId = c5ea366a-43a0-11e2-b207-ff9e163144da, taskId=44fcdad3-695a-462a-b11d-a09e004c8a90), log id: 6b03ae51
47961 2013-06-23 10:40:02,375 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (pool-4-thread-48) FINISH, HSMClearTaskVDSCommand, log id: 6b03ae51
47962 2013-06-23 10:40:02,375 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (pool-4-thread-48) FINISH, SPMClearTaskVDSCommand, log id: dbf6a03
47963 2013-06-23 10:40:02,375 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (pool-4-thread-48) SPMAsyncTask::ClearAsyncTask: At time of attemp to clear task 44fcdad3-695a-462a-b11d-a09e004c8a90 the response code was Operation is not allowed in this task state: ("can't clean in state running",) and message was {3}. Task will not be cleaned
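
The failed cleanup above can be modeled minimally in Python. The class and method names below are illustrative only, not the actual ovirt-engine or vdsm API; the point is that a task vdsm still reports as 'running' cannot be cleared, so it survives in the engine database and is repolled after a restart.

```python
class TaskStateError(Exception):
    """Stands in for vdsm's 'Operation is not allowed in this task state'."""
    pass


class SpmTask:
    """Toy model of a vdsm task as seen by the engine (illustrative names)."""

    def __init__(self):
        self.state = "running"
        self.cleaned = False

    def clean(self):
        # vdsm refuses to clean any task that has not finished
        if self.state != "finished":
            raise TaskStateError("can't clean in state %s" % self.state)
        self.cleaned = True


def clear_zombie(task):
    """Engine-side zombie cleanup: try to clear; keep the task on failure."""
    try:
        task.clean()
        return True    # task removed; it will not be repolled
    except TaskStateError:
        return False   # task stays in the DB and is repolled after restart
```

In this model, `clear_zombie()` on a 'running' task returns False (the 3.2 behavior in the log), and only succeeds once the task has actually finished.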

- The engine is restarted some time later and the task is picked back up:

54942 2013-07-02 16:45:55,717 INFO  [org.ovirt.engine.core.bll.Backend] (ServerService Thread Pool -- 34) Start time: 7/2/13 4:45 PM
[..]
55109 2013-07-02 16:46:07,594 INFO  [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-31) Adding task 44fcdad3-695a-462a-b11d-a09e004c8a90 (Parent Command Unknown, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters), polling hasn't started yet..
[..]
55113 2013-07-02 16:46:07,645 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-31) BaseAsyncTask::StartPollingTask: Starting to poll task 44fcdad3-695a-462a-b11d-a09e004c8a90.

- We then start to see the following in the logs every ten seconds; after a few days this amounts to significant log spam:

55142 2013-07-02 16:46:17,038 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-59) task id 44fcdad3-695a-462a-b11d-a09e004c8a90 is in pre-polling  period and should not be polled. Pre-polling period is 60,000 millis.      
55143 2013-07-02 16:46:27,040 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-90) task id 44fcdad3-695a-462a-b11d-a09e004c8a90 is in pre-polling  period and should not be polled. Pre-polling period is 60,000 millis.      
55144 2013-07-02 16:46:37,041 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-27) task id 44fcdad3-695a-462a-b11d-a09e004c8a90 is in pre-polling  period and should not be polled. Pre-polling period is 60,000 millis. 
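
The repeated messages come from a guard that declines to poll a task until a fixed pre-polling window has elapsed, logging instead. A minimal sketch of such a check follows; the names are illustrative and this is not the actual SPMAsyncTask code:

```python
# "60,000 millis" matches the pre-polling period quoted in the log above.
PRE_POLLING_PERIOD_MS = 60_000


def should_poll(task_added_ms, now_ms, period_ms=PRE_POLLING_PERIOD_MS):
    """Return True once the task has been known for at least the
    pre-polling period; until then the poller skips it and logs
    the "is in pre-polling period" message instead."""
    return now_ms - task_added_ms >= period_ms
```

The log shows this guard firing on every ten-second poll cycle for the repolled zombie task, which is what fills engine.log after the restart.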

Version-Release number of selected component (if applicable):
rhevm-3.2.0-11.30.el6ev.noarch                              Wed Jun 12 14:01:19 2013
rhevm-backend-3.2.0-11.30.el6ev.noarch                      Wed Jun 12 14:01:01 2013
rhevm-cli-3.2.0.9-1.el6ev.noarch                            Wed Jun 12 14:00:47 2013
rhevm-config-3.2.0-11.30.el6ev.noarch                       Wed Jun 12 14:01:01 2013
rhevm-dbscripts-3.2.0-11.30.el6ev.noarch                    Wed Jun 12 14:01:02 2013
rhevm-doc-3.2.0-4.el6eng.noarch                             Wed Jun 12 14:00:52 2013
rhevm-genericapi-3.2.0-11.30.el6ev.noarch                   Wed Jun 12 14:01:00 2013
rhevm-image-uploader-3.2.2-1.el6ev.noarch                   Wed Jun 12 14:00:47 2013
rhevm-iso-uploader-3.2.2-1.el6ev.noarch                     Wed Jun 12 14:00:47 2013
rhevm-log-collector-3.2.2-3.el6ev.noarch                    Wed Jun 12 14:00:47 2013
rhevm-notification-service-3.2.0-11.30.el6ev.noarch         Wed Jun 12 14:01:01 2013
rhevm-restapi-3.2.0-11.30.el6ev.noarch                      Wed Jun 12 14:01:01 2013
rhevm-sdk-3.2.0.11-1.el6ev.noarch                           Wed Jun 12 14:00:46 2013
rhevm-setup-3.2.0-11.30.el6ev.noarch                        Wed Jun 12 13:59:24 2013
rhevm-spice-client-x64-cab-3.2-10.el6ev.noarch              Wed Jun 12 14:01:02 2013
rhevm-spice-client-x86-cab-3.2-10.el6ev.noarch              Wed Jun 12 14:01:09 2013
rhevm-tools-common-3.2.0-11.30.el6ev.noarch                 Wed Jun 12 14:01:01 2013
rhevm-userportal-3.2.0-11.30.el6ev.noarch                   Wed Jun 12 14:01:09 2013
rhevm-webadmin-portal-3.2.0-11.30.el6ev.noarch              Wed Jun 12 14:01:18 2013

How reproducible:
No attempt made to reproduce as yet.

Steps to Reproduce:
1. Reduce the zombie timeout.
2. Start a task that will hit the timeout while still 'running'.
3. Confirm the task is not stopped.
4. Restart the engine service.
5. Review the engine.log.

Actual results:
The task is not cleared.

Expected results:
The task is cleared and a service restart does not result in the logs being spammed.

Additional info:
Comment 3 Ravi Nori 2013-07-17 12:34:00 EDT
This issue is not reproducible with 3.3. 

MoveImageGroupVDSCommand now performs a Copy and Remove (BZ 753549), copy can be stopped so when the zombie timeout is reduced the task is stopped and cleared from DB.
Comment 4 Lee Yarwood 2013-07-23 11:18:46 EDT
(In reply to Ravi Nori from comment #3)
> This issue is not reproducible with 3.3. 
> 
> MoveImageGroupVDSCommand now performs a Copy and Remove (BZ 753549), copy
> can be stopped so when the zombie timeout is reduced the task is stopped and
> cleared from DB.

So the ClearAsyncTask call completes correctly in 3.3? Again, the error in 3.2 was (I'm going to fire a commit upstream to improve the string):

(pool-4-thread-48) SPMAsyncTask::ClearAsyncTask: At time of attemp to clear task 44fcdad3-695a-462a-b11d-a09e004c8a90 the response code was Operation is not allowed in this task state: ("can't clean in state running",) and message was {3}. Task will not be cleaned

Can you think of any other commands where we would be unable to clear a running vdsm task?
Comment 5 Lee Yarwood 2013-07-31 08:57:40 EDT
Replying to a question Ravi emailed to me here.

> My question is are there any other cases where the vdsm would throw the
> error "can't clean in state running"

Quickly checking the vdsm source, it would appear that any task that is not 'done' will raise this error when we attempt to clean it:

vdsm/vdsm/storage/task.py

1020     def clean(self):
1021         if not self.store:
1022             return
1023         if not self.isDone():
1024             raise se.TaskStateError("can't clean in state %s" % self.state)
1025         self._clean(self.store)
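
Comment 3's explanation (the 3.3 copy phase can be stopped, so the task becomes cleanable) can be illustrated against this guard. The following is a hypothetical toy model, not vdsm code; method names are invented for illustration:

```python
class TaskStateError(Exception):
    """Stands in for se.TaskStateError raised by task.py's clean()."""
    pass


class Task:
    """Toy model of the state check in vdsm's Task.clean()."""

    def __init__(self):
        self.state = "running"

    def is_done(self):
        # mirrors the isDone() guard quoted above
        return self.state in ("finished", "aborted")

    def stop(self):
        # a stoppable operation (e.g. the copy phase in 3.3) can be aborted
        self.state = "aborted"

    def clean(self):
        if not self.is_done():
            raise TaskStateError("can't clean in state %s" % self.state)
        self.state = "cleaned"
```

While the task is 'running', `clean()` raises, matching the 3.2 logs; once `stop()` has moved it to a done state, `clean()` succeeds and the task can be removed, matching the 3.3 behavior.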
Comment 6 Ravi Nori 2013-07-31 09:09:15 EDT
Thanks Lee. Yes, the call completes in 3.3 and there is no "can't clean in state running" message.
Comment 7 Ravi Nori 2013-09-11 14:48:38 EDT
The issue is not reproducible on 3.2 either. Is there more information about the client's setup that could help us reproduce it?
Comment 10 Andrew Cathrow 2013-09-16 10:06:56 EDT
Moving to 3.4
When we have a reproducer we can look at this for 3.3.z or even 3.4
Comment 14 Eli Mesika 2014-03-03 09:09:05 EST
Move to QA per comment 12.
Comment 17 Timothy 2014-05-13 20:47:07 EDT
As this issue cannot be reproduced, no doc text is required for errata.
Comment 18 errata-xmlrpc 2014-06-09 10:59:45 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0506.html
