Bug 880547 - VDSM returns empty list of tasks at reboot - causes the VM and images to stay in locked status
Summary: VDSM returns empty list of tasks at reboot - causes the VM and images to stay in locked status
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: unspecified
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.3.0
Assignee: Yaniv Bronhaim
QA Contact: Katarzyna Jachim
URL:
Whiteboard: infra
Depends On:
Blocks: 813279 926990 1019461
 
Reported: 2012-11-27 09:58 UTC by Maor
Modified: 2016-02-10 19:29 UTC
CC List: 20 users

Fixed In Version: vdsm-4.11.0-89 is4
Doc Type: Bug Fix
Doc Text:
Before a host took the SPM lock it did not connect to a storage pool, and VDSM returned an empty task list, which caused the engine and hosts to become out of sync. Now, VDSM only retrieves a task list when an SPM exists and has been recovered.
Clone Of: 813279
Clones: 926990
Environment:
Last Closed: 2014-01-21 16:03:33 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments
logs (349.69 KB, application/x-gzip)
2013-03-04 15:26 UTC, Dafna Ron
new logs (1.16 MB, application/x-gzip)
2013-03-04 16:01 UTC, Dafna Ron


Links
Red Hat Product Errata RHBA-2014:0040 (normal, SHIPPED_LIVE): vdsm bug fix and enhancement update - last updated 2014-01-21 20:26:21 UTC
oVirt gerrit 15345 - last updated: never

Comment 1 Maor 2012-11-27 10:04:42 UTC
The relevant attachment is 652056.
I copied the relevant section from my last comment:

VDSM log -
Task 3b92fca5-7e7f-42d1-abbc-57cc6bcc4918 is running:
Thread-947::INFO::2012-11-26 16:23:52,588::logUtils::39::dispatcher::(wrapper) Run and protect: getAllTasksStatuses, Return response: {'allTasksStatus': {'3b92fca5-7e7f-42d1-abbc-57cc6bcc4918': {'code': 0, 'message': 'Task is initializing', 'taskState': 'running', 'taskResult': '', 'taskID': '3b92fca5-7e7f-42d1-abbc-57cc6bcc4918'}}}

VDSM started:
MainThread::INFO::2012-11-26 16:24:00,479::vdsm::70::vds::(run) I am the actual vdsm 4.9-44.0

Only after three getAllTasksStatuses operations is the task returned:
Thread-14::INFO::2012-11-26 16:24:02,601::logUtils::39::dispatcher::(wrapper) Run and protect: getAllTasksStatuses, Return response: {'allTasksStatus': {}}

Thread-17::INFO::2012-11-26 16:24:02,887::logUtils::39::dispatcher::(wrapper) Run and protect: getAllTasksStatuses, Return response: {'allTasksStatus': {}}

Thread-38::INFO::2012-11-26 16:24:21,622::logUtils::39::dispatcher::(wrapper) Run and protect: getAllTasksStatuses, Return response: {'allTasksStatus': {}}

Thread-814::DEBUG::2012-11-26 16:21:12,547::taskManager::112::TaskManager::(getAllTasksStatuses) Return: {'c9360be8-d7bb-497f-bf61-bac3eb0dcbee': {'code': 0, 'message': 'Task is initializing', 'taskState': 'running', 'taskResult': '', 'taskID': 'c9360be8-d7bb-497f-bf61-bac3eb0dcbee'}}
Thread-814::INFO::2012-11-26 16:21:12,548::logUtils::39::dispatcher::(wrapper) Run and protect: getAllTasksStatuses, Return response: {'allTasksStatus': {'c9360be8-d7bb-497f-bf61-bac3eb0dcbee': {'code': 0, 'message': 'Task is initializing', 'taskState': 'running', 'taskResult': '', 'taskID': 'c9360be8-d7bb-497f-bf61-bac3eb0dcbee'}}}

The race might have occurred because VDSM starts a new thread to fetch the tasks while VDSM itself is still starting up.
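
In sketch form, the suspected race looks like this (a minimal illustration with made-up names, not actual VDSM code):

import threading
import time

class TaskManager:
    """Illustrative stand-in for vdsm's task manager."""

    def __init__(self):
        self._tasks = {}  # taskID -> status dict
        # Recovery of persisted tasks runs on a background thread while
        # vdsm is still starting up, as happens during hsm.__init__.
        threading.Thread(target=self._recover_saved_tasks).start()

    def _recover_saved_tasks(self):
        time.sleep(5)  # stands in for re-reading task metadata from storage
        self._tasks['3b92fca5-7e7f-42d1-abbc-57cc6bcc4918'] = {
            'taskState': 'running'}

    def getAllTasksStatuses(self):
        # An early poll from the engine lands here before recovery has
        # repopulated the map and gets {} back, even though a task exists.
        return dict(self._tasks)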

Comment 3 Maor 2012-11-27 12:29:05 UTC
After discussing it with Ayal, this could be an engine problem.
It seems that VDSM did not report this task because it was not SPM at the time the engine called the getAllTasks command; that is why VDSM returned an empty list.
That caused the task in the engine to become unknown.

Comment 4 Ayal Baron 2012-11-27 13:06:08 UTC
I'm not sure why the engine is calling getAllTasksStatuses at all while the host is not SPM, but it could be that it missed the restart of vdsm.
In any event, looking at the code, we could make getAllTasksStatuses at the hsm level an SPM command (so in this case it would throw a "not SPM" exception).

Comment 5 RHEL Program Management 2012-12-14 07:50:29 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 6 Yaniv Bronhaim 2012-12-31 14:41:50 UTC
Not sure if vdsm should distinguish in its API whether it runs as SPM or not.

I think those getTask* API calls should return a value only after all internal task variables are initialized, meaning that before vdsm's "recovery" process ends, these calls should return "vdsm is still working on it..." instead of sending an empty list, because that's a lie.

We run the recovery part in an external thread during hsm.__init__; before that is done, the engine can send a getAllTasksStatuses request and get an empty list in response until the recovery finishes - as described above.

I suggest this fix:
http://gerrit.ovirt.org/#/c/10527
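
The idea in sketch form (a minimal illustration of the approach, not the actual patch from the gerrit link above):

import threading

class HSM:
    """Illustrative stand-in for vdsm's hsm module."""

    def __init__(self):
        self._recovered = threading.Event()
        self.tasks = {}
        # Recovery still runs in a background thread, as today.
        threading.Thread(target=self._recover).start()

    def _recover(self):
        # ... rebuild the internal task state from persisted metadata ...
        self._recovered.set()

    def getAllTasksStatuses(self):
        if not self._recovered.is_set():
            # Report "still recovering" instead of lying with an empty list.
            raise RuntimeError("vdsm is still recovering tasks")
        return dict(self.tasks)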

Comment 8 Yaniv Bronhaim 2013-01-09 16:31:53 UTC
If this was caused by a race - the engine thought it sent the request to the SPM, but vdsm restarted and got the request before turning SPM and responded with an empty list - then the engine thought the host doesn't contain tasks to recover. Returning an empty list and then turning SPM causes conflicts.

This should be handled at the engine's level, or the recovering thread should be synchronized with hsm startup before responding to the getTasksList request.

Throwing an exception or making it an SPM-only command blocks you from knowing about tasks that are not related to SPM.

When does the engine send this request? Can it be this kind of race? Should it be handled on the engine's side or vdsm's? What does the engine do if it gets an exception when requesting the tasks list?

Comment 9 Maor 2013-01-09 17:35:37 UTC
(In reply to comment #8)
> If this was caused by a race - the engine thought it sent the request to
> the SPM, but vdsm restarted and got the request before turning SPM and
> responded with an empty list - then the engine thought the host doesn't
> contain tasks to recover. Returning an empty list and then turning SPM
> causes conflicts.
> 
> This should be handled at the engine's level, or the recovering thread
> should be synchronized with hsm startup before responding to the
> getTasksList request.
> 
> Throwing an exception or making it an SPM-only command blocks you from
> knowing about tasks that are not related to SPM.
> 
> When does the engine send this request?
The engine sends getTasks every 10 seconds.
> Can it be this kind of race?
Yes. Theoretically the engine does not request tasks if there is no SPM; I suspect the race occurs after the engine has already reached the AsyncTaskManager (ATM) functionality, but a moment before it fetched the tasks VDSM restarted (see comment 3).
> Should it be handled on the engine's side or vdsm's?
I think VDSM should also not return a tasks list if there is no SPM - why can't it throw an exception?
From the engine side it's a problem, since checking whether there is an SPM and fetching the tasks will never be atomic (they are two different VDSM commands).
> What does the engine do if it gets an exception when requesting the tasks list?
The engine will simply try again after the SPM gets re-elected.

I think that if VDSM returns an exception when it is not SPM, that could solve this issue.

Comment 10 Yaniv Bronhaim 2013-01-10 08:10:34 UTC
Vdsm should also return tasks that are not related to SPM, as I understand it. If you block the command, the engine can start tasks without being able to query them.

Throwing an exception if recovery is not done, or if the request was sent to a non-SPM host, is a possibility, but that way the engine can't query non-SPM tasks, and I don't know how the engine reacts if it gets an exception during this request - please let me know if it doesn't harm other flows.

Comment 11 Maor 2013-01-14 11:25:53 UTC
(In reply to comment #10)
> Vdsm should also return tasks that are not related to SPM, as I understand it.
VDSM might do so, but the engine does not support fetching async tasks from HSM at this moment.
> If you block the command, the engine can start tasks without being able to
> query them.
> 
> Throwing an exception if recovery is not done, or if the request was sent
> to a non-SPM host, is a possibility, but that way the engine can't query
> non-SPM tasks, and I don't know how the engine reacts if it gets an
> exception during this request - please let me know if it doesn't harm
> other flows.
IIRC, the engine will simply query those tasks the next time the SPM is UP.

Comment 12 Ayal Baron 2013-01-14 22:16:21 UTC
(In reply to comment #11)
> (In reply to comment #10)
> > Vdsm should also return tasks that are not related to SPM, as I understand it.
> VDSM might do so, but the engine does not support fetching async tasks from
> HSM at this moment.

this is an engine problem. The tasks API is, and always has been, HSM.

> > If you block the command, the engine can start tasks without being able to
> > query them.
> > 
> > Throwing an exception if recovery is not done, or if the request was sent
> > to a non-SPM host, is a possibility, but that way the engine can't query
> > non-SPM tasks, and I don't know how the engine reacts if it gets an
> > exception during this request - please let me know if it doesn't harm
> > other flows.
> IIRC, the engine will simply query those tasks the next time the SPM is UP.

The engine should not send getTasks* to a host before it sees that it's the SPM.
If the engine gets an empty list when it expects tasks to be there, it can issue a getSpmStatus call and, if the host is the SPM, reissue the getTasks* command.
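
In sketch form, this suggested engine-side handling would look like the following (illustrative Python; the real engine is Java and the names here are made up):

def poll_tasks(vdsm):
    statuses = vdsm.getAllTasksStatuses()
    if not statuses:
        # Empty list where tasks were expected: double-check the SPM role
        # before trusting the answer.
        if vdsm.getSpmStatus() == 'SPM':
            statuses = vdsm.getAllTasksStatuses()
        else:
            return None  # host is not SPM (yet); retry on the next cycle
    return statuses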

Comment 13 Maor 2013-01-16 09:01:57 UTC
(In reply to comment #12)
> (In reply to comment #11)
> > (In reply to comment #10)
> > > Vdsm should also return tasks that are not related to SPM, as I understand it.
> > VDSM might do so, but the engine does not support fetching async tasks from
> > HSM at this moment.
> 
> this is an engine problem. The tasks API is, and always has been, HSM.
> 
> > > If you block the command, the engine can start tasks without being able to
> > > query them.
> > > 
> > > Throwing an exception if recovery is not done, or if the request was sent
> > > to a non-SPM host, is a possibility, but that way the engine can't query
> > > non-SPM tasks, and I don't know how the engine reacts if it gets an
> > > exception during this request - please let me know if it doesn't harm
> > > other flows.
> > IIRC, the engine will simply query those tasks the next time the SPM is UP.
> 
> The engine should not send getTasks* to a host before it sees that it's the SPM.
> If the engine gets an empty list when it expects tasks to be there, it can
> issue a getSpmStatus call and, if the host is the SPM, reissue the getTasks*
> command.
This solution will reduce the race but will not solve it;
besides, it might increase VDSM traffic and might affect system latency (all getTasks* verbs for each SPM are queued).
Why not change the API by adding a parameter indicating whether getAllTasks is for SPM or HSM?

Comment 14 Yair Zaslavsky 2013-01-27 09:21:31 UTC
Ayal,
Question for 3.3 -
Can we change the getTasks API to also return a field indicating whether the host
is indeed still the SPM?

(Suggesting this per Maor's comment.)

Comment 15 Ayal Baron 2013-01-27 09:52:36 UTC
(In reply to comment #14)
> Ayal,
> Question for 3.3 -
> Can we change the getTasks API to also return a field indicating whether
> the host is indeed still the SPM?
> 
> (Suggesting this per Maor's comment.)

I'm skeptical that vdsm maintainers would accept such a patch (I wouldn't), as getAllTasksList/Info has nothing to do with SPM.

Comment 16 Barak 2013-02-26 19:03:41 UTC
After discussing the issue with both Ayal and Saggi,
I think we can go forward with this approach:
* getAllTasksList/Info will have an additional optional parameter (in the options list in vdsm): SPM=True/False
* when VDSM gets this flag it means it should return the SPM tasks as well, and in case this VDSM is not in SPM status it should throw the relevant exception.
* on 3.2 the engine should always send the flag SPM=True when it comes to the task API.
* need to make sure of backwards compatibility

Comment 17 Yaniv Bronhaim 2013-02-28 11:20:11 UTC
When adding spm=true to the getAllTasksList and getAllTasksStatuses commands, the command returns only if a connected storage pool exists and the host is the SPM; otherwise a "not SPM" exception is raised.
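
In sketch form (an illustration of the behavior described above, not the actual vdsm patch):

class NotSPMError(Exception):
    pass

class HSM:
    """Illustrative stand-in for vdsm's hsm module."""

    def __init__(self, connected_pools, is_spm):
        self.connected_pools = connected_pools  # connected storage pools
        self._is_spm = is_spm                   # callable reporting SPM role
        self.tasks = {}

    def getAllTasksStatuses(self, spm=False):
        if spm and (not self.connected_pools or not self._is_spm()):
            # The engine treats this as "ask again after SPM election"
            # instead of mistaking an empty list for "no tasks".
            raise NotSPMError("no connected storage pool or host is not SPM")
        return dict(self.tasks)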

Comment 18 Dafna Ron 2013-03-04 15:25:16 UTC
Here is a clear reproduction of this issue.

1. In a one-host cluster with iSCSI storage (I had 3 domains but you can work with 1 - it should not matter for this issue), create and run a VM with 3 disks.
2. Once the VM is up, create a live snapshot.
3. Once the disks become locked, restart vdsm.

The engine sends ConnectStoragePool, which succeeds, and then polls the tasks, which come back null since the host is not actually the SPM yet.

The engine sends ConnectStoragePool and, before getting a reply from vdsm, sends SPMAsyncTask::PollTask:

2013-03-04 04:36:27,127 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (pool-3-thread-48) START, ConnectStoragePoolVDSCommand(HostName = gold-vdsd, HostId = 83834e1f-9e60-41b5-a9cc-16460a8a2fe2, storagePoolId = a6835dc2-d50e-4862-b64d-9291426c319d, vds_spm_id = 1, masterDomainId = bb11a7a5-b5f0-4a51-99ed-24507f087cd4, masterVersion = 2), log id: 61f20d63
2013-03-04 04:36:28,283 ERROR [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-23) SPMAsyncTask::PollTask: Task 6fa00a72-733b-4ecf-85f3-195c371a6334 (Parent Command CreateAllSnapshotsFromVm, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) was not found in VDSM, will change its status to unknown.
2013-03-04 04:36:28,283 WARN  [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-23) SPMAsyncTask::PollTask: Polling task 6fa00a72-733b-4ecf-85f3-195c371a6334 (Parent Command CreateAllSnapshotsFromVm, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) returned status unknown.
2013-03-04 04:36:28,288 ERROR [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-23) BaseAsyncTask::LogEndTaskFailure: Task 6fa00a72-733b-4ecf-85f3-195c371a6334 (Parent Command CreateAllSnapshotsFromVm, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
-- Result: success
-- Message: Asynchronous Task unknown error,
-- Exception: [null]


The reply for ConnectStoragePool comes back after we have already moved the tasks to unknown:

2013-03-04 04:36:36,695 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (pool-3-thread-48) FINISH, ConnectStoragePoolVDSCommand, log id: 61f20d63


The vdsm log shows an exception when the tasks are cleared:

Thread-25::ERROR::2013-03-04 04:36:28,842::task::833::TaskManager.Task::(_setError) Task=`dfe45af9-2ac0-440c-9e1d-c954f1d9e6dc`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 840, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2121, in clearTask
    return self.taskMng.clearTask(taskID=taskID)
  File "/usr/share/vdsm/storage/taskManager.py", line 163, in clearTask
    t = self.__getTask(taskID)
  File "/usr/share/vdsm/storage/taskManager.py", line 74, in __getTask
    raise se.UnknownTask(taskID)
UnknownTask: Task id unknown: ('6fa00a72-733b-4ecf-85f3-195c371a6334',)
Thread-25::DEBUG::2013-03-04 04:36:28,845::task::852::TaskManager.Task::(_run) Task=`dfe45af9-2ac0-440c-9e1d-c954f1d9e6dc`::Task._run: dfe45af9-2ac0-440c-9e1d-c954f1d9e6dc ('6fa00a72-733b-4ecf-85f3-195c371a6334',) {} failed - stopping task


As a result, we have tasks that were cleared in the engine but continue to run in vdsm:

[root@gold-vdsd ~]# vdsClient -s 0 getAllTasksInfo
807816d1-b99b-4ce5-b574-a1c0571b3623 :
	verb = createVolume
	id = 807816d1-b99b-4ce5-b574-a1c0571b3623
bb353bba-93b9-4818-8233-932fd4102bb2 :
	verb = createVolume
	id = bb353bba-93b9-4818-8233-932fd4102bb2
6fa00a72-733b-4ecf-85f3-195c371a6334 :
	verb = createVolume
	id = 6fa00a72-733b-4ecf-85f3-195c371a6334
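
(For the record: once leftover tasks like these are confirmed stale, they can usually be cleared from the host manually with the clearTask verb - assuming it is available in this vdsm build - e.g.:

vdsClient -s 0 clearTask 6fa00a72-733b-4ecf-85f3-195c371a6334)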

I will put the full logs in.

Comment 19 Dafna Ron 2013-03-04 15:26:18 UTC
Created attachment 705032 [details]
logs

Comment 20 Dafna Ron 2013-03-04 16:01:06 UTC
Created attachment 705057 [details]
new logs

The vdsm log rotated - these are the correct logs.

Comment 24 Leonid Natapov 2013-04-23 13:45:42 UTC
Facing the same issue on sf 13.1, with a slightly different scenario.

Restarting vdsm while trying to delete a disk from the Disks tab causes the same issue:
tasks are cleared in the engine but not cleared in vdsm.

Comment 30 Katarzyna Jachim 2013-08-12 15:07:44 UTC
Verified on is9.1, with both Kuba's original scenario and Dafna's from comment 18.

Comment 31 Charlie 2013-11-28 00:31:56 UTC
This bug is currently attached to errata RHBA-2013:15291. If this change is not to be documented in the text for this errata, please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.

Comment 32 errata-xmlrpc 2014-01-21 16:03:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0040.html

