Description of problem:

A MoveOrCopyDisk task is not cleared after being classified as a zombie, leading to it being re-polled after an engine restart.

From engine.log.1-20130703_031502_3979.xz:

- The MoveOrCopyDisk task starts:

46474 2013-06-21 08:39:48,733 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.MoveImageGroupVDSCommand] (pool-4-thread-49) [5c1a816b] START, MoveImageGroupVDSCommand( storagePoolId = f7d1808d-a7d2-43b4-904d-879b908b79dd, ignoreFailoverLimit = false, compatabilityVersion = 3.2, storageDomainId = 6809866e-07cc-49b1-a342-6658b1cf7ab6, imageGroupId = fc27802f-594e-4ded-99ff-8bef43734c64, dstDomainId = 6dd9719a-5268-42ab-b43b-d493c9c7907c, vmId = 00000000-0000-0000-0000-000000000000, op = Move, postZero = false, force = false), log id: 110f3519
46475 2013-06-21 08:39:49,389 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.MoveImageGroupVDSCommand] (pool-4-thread-49) [5c1a816b] FINISH, MoveImageGroupVDSCommand, log id: 110f3519
46476 2013-06-21 08:39:49,392 INFO [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-4-thread-49) [5c1a816b] EntityAsyncTask::Adding EntityMultiAsyncTasks object for entity 9395daf3-073d-4521-9120-ca57c8abc0ba
46477 2013-06-21 08:39:49,392 INFO [org.ovirt.engine.core.bll.EntityMultiAsyncTasks] (pool-4-thread-49) [5c1a816b] EntityMultiAsyncTasks::AttachTask: Attaching task 44fcdad3-695a-462a-b11d-a09e004c8a90 to entity 9395daf3-073d-4521-9120-ca57c8abc0ba.
46478 2013-06-21 08:39:49,427 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (pool-4-thread-49) [5c1a816b] Adding task 44fcdad3-695a-462a-b11d-a09e004c8a90 (Parent Command MoveOrCopyDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters), polling hasn't started yet..
[..]
46487 2013-06-21 08:39:56,490 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-83) SPMAsyncTask::PollTask: Polling task 44fcdad3-695a-462a-b11d-a09e004c8a90 (Parent Command MoveOrCopyDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) returned status running.

- The task hits the zombie timeout and we attempt to clean it, but as it is 'running' we can't:

47926 2013-06-23 10:39:51,535 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-18) Cleaning zombie tasks: Stopping async task MoveOrCopyDisk that started at 6/21/13 8:39 AM
[..]
47932 2013-06-23 10:40:01,641 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-60) BaseAsyncTask::OnTaskEndSuccess: Task 44fcdad3-695a-462a-b11d-a09e004c8a90 (Parent Command MoveOrCopyDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended successfully.
[..]
47958 2013-06-23 10:40:02,291 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (pool-4-thread-48) SPMAsyncTask::ClearAsyncTask: Attempting to clear task 44fcdad3-695a-462a-b11d-a09e004c8a90
47959 2013-06-23 10:40:02,291 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (pool-4-thread-48) START, SPMClearTaskVDSCommand( storagePoolId = f7d1808d-a7d2-43b4-904d-879b908b79dd, ignoreFailoverLimit = false, compatabilityVersion = null, taskId = 44fcdad3-695a-462a-b11d-a09e004c8a90), log id: dbf6a03
47960 2013-06-23 10:40:02,316 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (pool-4-thread-48) START, HSMClearTaskVDSCommand(HostName = jtest01, HostId = c5ea366a-43a0-11e2-b207-ff9e163144da, taskId=44fcdad3-695a-462a-b11d-a09e004c8a90), log id: 6b03ae51
47961 2013-06-23 10:40:02,375 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (pool-4-thread-48) FINISH, HSMClearTaskVDSCommand, log id: 6b03ae51
47962 2013-06-23 10:40:02,375 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (pool-4-thread-48) FINISH, SPMClearTaskVDSCommand, log id: dbf6a03
47963 2013-06-23 10:40:02,375 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (pool-4-thread-48) SPMAsyncTask::ClearAsyncTask: At time of attemp to clear task 44fcdad3-695a-462a-b11d-a09e004c8a90 the response code was Operation is not allowed in this task state: ("can't clean in state running",) and message was {3}. Task will not be cleaned

- The engine is restarted some time later and the task is picked back up:

54942 2013-07-02 16:45:55,717 INFO [org.ovirt.engine.core.bll.Backend] (ServerService Thread Pool -- 34) Start time: 7/2/13 4:45 PM
[..]
55109 2013-07-02 16:46:07,594 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-31) Adding task 44fcdad3-695a-462a-b11d-a09e004c8a90 (Parent Command Unknown, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters), polling hasn't started yet..
[..]
55113 2013-07-02 16:46:07,645 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-31) BaseAsyncTask::StartPollingTask: Starting to poll task 44fcdad3-695a-462a-b11d-a09e004c8a90.

- We then start to see the following; days later it is still spamming the logs:

55142 2013-07-02 16:46:17,038 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-59) task id 44fcdad3-695a-462a-b11d-a09e004c8a90 is in pre-polling period and should not be polled. Pre-polling period is 60,000 millis.
55143 2013-07-02 16:46:27,040 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-90) task id 44fcdad3-695a-462a-b11d-a09e004c8a90 is in pre-polling period and should not be polled. Pre-polling period is 60,000 millis.
55144 2013-07-02 16:46:37,041 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-27) task id 44fcdad3-695a-462a-b11d-a09e004c8a90 is in pre-polling period and should not be polled. Pre-polling period is 60,000 millis.
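To make the sequence easier to follow, here is a minimal runnable Python model of it. The class and method names are illustrative, not engine code; only the vdsm error string and the re-add-on-restart behaviour are taken from the logs above.

# Minimal model of the lifecycle seen in the logs. Names are
# illustrative; this is not the actual engine implementation.

class AsyncTaskManager:
    def __init__(self, persisted_task_ids):
        # stand-in for the task rows the engine persists in its database
        self.persisted_task_ids = persisted_task_ids

    def clean_zombie(self, task_id, vdsm):
        try:
            vdsm.clear_task(task_id)                  # SPMClearTaskVDSCommand
            self.persisted_task_ids.discard(task_id)  # removed only on success
        except RuntimeError as err:
            # 3.2 behaviour: the failure is logged, but the row survives
            print("Task %s will not be cleaned: %s" % (task_id, err))

    def on_engine_restart(self):
        # every surviving row is re-added (Parent Command Unknown)
        # and polled again, producing the spam shown above
        for task_id in self.persisted_task_ids:
            print("Adding task %s, polling hasn't started yet.." % task_id)

class StubVdsm:
    def clear_task(self, task_id):
        # a task that is still running cannot be cleaned
        raise RuntimeError("can't clean in state running")

tasks = {"44fcdad3-695a-462a-b11d-a09e004c8a90"}
manager = AsyncTaskManager(tasks)
manager.clean_zombie("44fcdad3-695a-462a-b11d-a09e004c8a90", StubVdsm())
manager.on_engine_restart()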
Version-Release number of selected component (if applicable):

rhevm-3.2.0-11.30.el6ev.noarch Wed Jun 12 14:01:19 2013
rhevm-backend-3.2.0-11.30.el6ev.noarch Wed Jun 12 14:01:01 2013
rhevm-cli-3.2.0.9-1.el6ev.noarch Wed Jun 12 14:00:47 2013
rhevm-config-3.2.0-11.30.el6ev.noarch Wed Jun 12 14:01:01 2013
rhevm-dbscripts-3.2.0-11.30.el6ev.noarch Wed Jun 12 14:01:02 2013
rhevm-doc-3.2.0-4.el6eng.noarch Wed Jun 12 14:00:52 2013
rhevm-genericapi-3.2.0-11.30.el6ev.noarch Wed Jun 12 14:01:00 2013
rhevm-image-uploader-3.2.2-1.el6ev.noarch Wed Jun 12 14:00:47 2013
rhevm-iso-uploader-3.2.2-1.el6ev.noarch Wed Jun 12 14:00:47 2013
rhevm-log-collector-3.2.2-3.el6ev.noarch Wed Jun 12 14:00:47 2013
rhevm-notification-service-3.2.0-11.30.el6ev.noarch Wed Jun 12 14:01:01 2013
rhevm-restapi-3.2.0-11.30.el6ev.noarch Wed Jun 12 14:01:01 2013
rhevm-sdk-3.2.0.11-1.el6ev.noarch Wed Jun 12 14:00:46 2013
rhevm-setup-3.2.0-11.30.el6ev.noarch Wed Jun 12 13:59:24 2013
rhevm-spice-client-x64-cab-3.2-10.el6ev.noarch Wed Jun 12 14:01:02 2013
rhevm-spice-client-x86-cab-3.2-10.el6ev.noarch Wed Jun 12 14:01:09 2013
rhevm-tools-common-3.2.0-11.30.el6ev.noarch Wed Jun 12 14:01:01 2013
rhevm-userportal-3.2.0-11.30.el6ev.noarch Wed Jun 12 14:01:09 2013
rhevm-webadmin-portal-3.2.0-11.30.el6ev.noarch Wed Jun 12 14:01:18 2013

How reproducible:
No attempt made to reproduce as yet.

Steps to Reproduce:
1. Reduce the zombie timeout.
2. Start a task that will hit the zombie timeout while still 'running'.
3. Confirm the task is not stopped.
4. Restart the engine service.
5. Review the engine.log.

Actual results:
The task is not cleared.

Expected results:
The task is cleared, and a service restart does not result in the logs being spammed.

Additional info:
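For step 5 above, a small helper script along these lines (illustrative, not shipped with the product) makes the symptom easy to spot: the same task id keeps logging "pre-polling period" messages long after the restart.

import re
from collections import Counter

# Matches the "pre-polling period" lines quoted in the description.
PATTERN = re.compile(r"task id ([0-9a-f-]{36}) is in pre-polling period")

def find_spamming_tasks(log_path, threshold=10):
    """Return task ids whose pre-polling message repeats at least
    `threshold` times -- a sign the task was never cleared."""
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            match = PATTERN.search(line)
            if match:
                counts[match.group(1)] += 1
    return {task_id: n for task_id, n in counts.items() if n >= threshold}

if __name__ == "__main__":
    print(find_spamming_tasks("engine.log"))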
This issue is not reproducible with 3.3. MoveImageGroupVDSCommand now performs a Copy and Remove (BZ 753549); the copy can be stopped, so when the zombie timeout is reduced the task is stopped and cleared from the DB.
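As a rough sketch of why the split helps (illustrative Python, not the actual engine or vdsm code): the copy phase can be aborted, which moves the task into a done, cleanable state before the zombie cleanup tries to clear it.

# Rough model of the 3.3 behaviour. Names and states are illustrative;
# the point is that the long-running phase is now stoppable.

class CopyTask:
    def __init__(self):
        self.state = "running"

    def stop(self):
        # aborting a copy is safe: the source image is left untouched
        self.state = "aborted"

    def is_done(self):
        return self.state in ("finished", "aborted")

def clean_zombie(task):
    if not task.is_done():
        task.stop()           # 3.3 can stop the copy phase first...
    assert task.is_done()     # ...so the subsequent clean succeeds
    print("task cleaned from state %s" % task.state)

clean_zombie(CopyTask())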
(In reply to Ravi Nori from comment #3)
> This issue is not reproducible with 3.3.
>
> MoveImageGroupVDSCommand now performs a Copy and Remove (BZ 753549), copy
> can be stopped so when the zombie timeout is reduced the task is stopped and
> cleared from DB.

So the ClearAsyncTask call completes correctly in 3.3?

Again, the error in 3.2 was (I'm going to fire a commit upstream to improve the string):

(pool-4-thread-48) SPMAsyncTask::ClearAsyncTask: At time of attemp to clear task 44fcdad3-695a-462a-b11d-a09e004c8a90 the response code was Operation is not allowed in this task state: ("can't clean in state running",) and message was {3}. Task will not be cleaned

Can you think of any other commands where we would be unable to clear a running vdsm task?
Replying to a question Ravi emailed to me here.

> My question is: are there any other cases where vdsm would throw the
> error "can't clean in state running"?

Quickly checking the vdsm source, it appears that any task that is not 'done' will raise this when we attempt to clean it:

vdsm/vdsm/storage/task.py

    def clean(self):
        if not self.store:
            return
        if not self.isDone():
            raise se.TaskStateError("can't clean in state %s" % self.state)
        self._clean(self.store)
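A runnable stand-in for that guard shows both outcomes. se.TaskStateError and the task store are stubbed here, and which states count as 'done' is an assumption; only the isDone() check mirrors the real code.

# Stand-in for the vdsm snippet above, runnable in isolation.

class TaskStateError(Exception):
    pass

class Task:
    def __init__(self, state):
        self.state = state
        self.store = "stub-task-dir"      # stubbed persistence location

    def isDone(self):
        return self.state == "finished"   # assumption: simplified states

    def clean(self):
        if not self.store:
            return
        if not self.isDone():
            raise TaskStateError("can't clean in state %s" % self.state)
        # the real code would call self._clean(self.store) here

for state in ("running", "finished"):
    try:
        Task(state).clean()
        print("%s: cleaned" % state)
    except TaskStateError as err:
        print("%s: %s" % (state, err))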
Thanks Lee. Yes, the call completes in 3.3 and there is no "can't clean in state running" message.
The issue is not reproducible on 3.2 either. Is there more information about the client's setup that could help us reproduce it?
Moving to 3.4. When we have a reproducer, we can look at this for 3.3.z or even 3.4.
Move to QA per comment 12.
As this issue cannot be reproduced, no doc text is required for errata.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0506.html