The relevant attachment is 652056. I copied the relevant section from my last comment:

VDSM log - task 3b92fca5-7e7f-42d1-abbc-57cc6bcc4918 is running:

Thread-947::INFO::2012-11-26 16:23:52,588::logUtils::39::dispatcher::(wrapper) Run and protect: getAllTasksStatuses, Return response: {'allTasksStatus': {'3b92fca5-7e7f-42d1-abbc-57cc6bcc4918': {'code': 0, 'message': 'Task is initializing', 'taskState': 'running', 'taskResult': '', 'taskID': '3b92fca5-7e7f-42d1-abbc-57cc6bcc4918'}}}

VDSM started:

MainThread::INFO::2012-11-26 16:24:00,479::vdsm::70::vds::(run) I am the actual vdsm 4.9-44.0

Only after three getAllTasksStatuses operations is the task returned again:

Thread-14::INFO::2012-11-26 16:24:02,601::logUtils::39::dispatcher::(wrapper) Run and protect: getAllTasksStatuses, Return response: {'allTasksStatus': {}}
Thread-17::INFO::2012-11-26 16:24:02,887::logUtils::39::dispatcher::(wrapper) Run and protect: getAllTasksStatuses, Return response: {'allTasksStatus': {}}
Thread-38::INFO::2012-11-26 16:24:21,622::logUtils::39::dispatcher::(wrapper) Run and protect: getAllTasksStatuses, Return response: {'allTasksStatus': {}}

Thread-814::DEBUG::2012-11-26 16:21:12,547::taskManager::112::TaskManager::(getAllTasksStatuses) Return: {'c9360be8-d7bb-497f-bf61-bac3eb0dcbee': {'code': 0, 'message': 'Task is initializing', 'taskState': 'running', 'taskResult': '', 'taskID': 'c9360be8-d7bb-497f-bf61-bac3eb0dcbee'}}
Thread-814::INFO::2012-11-26 16:21:12,548::logUtils::39::dispatcher::(wrapper) Run and protect: getAllTasksStatuses, Return response: {'allTasksStatus': {'c9360be8-d7bb-497f-bf61-bac3eb0dcbee': {'code': 0, 'message': 'Task is initializing', 'taskState': 'running', 'taskResult': '', 'taskID': 'c9360be8-d7bb-497f-bf61-bac3eb0dcbee'}}}

The race may have occurred because VDSM starts a new thread to fetch the tasks while VDSM itself is still starting up.
After discussing it with Ayal, it could be an engine problem. It seems that VDSM did not report this task because the host was not SPM at the time the engine called the getAllTasks command; that is why VDSM returned an empty list, which caused the task to become unknown in the engine.
I'm not sure why the engine is calling getAllTasksStatuses at all while the host is not SPM, but it could be that it missed the restart of vdsm. In any event, looking at the code, we could make getAllTasksStatuses at the HSM level an SPM command (so in this case it would throw a "not SPM" exception).
This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.
I'm not sure vdsm should distinguish in its API whether it runs as SPM or not. I think those getTask* API calls should return a value only after all internal task variables are initialized, meaning that before vdsm finishes its "recovery" process this call should return "vdsm is still working on it" instead of sending an empty list, because the empty list is a lie. We run the recovery part in an external thread during hsm.__init__, so before that is done the engine can send a getAllTasksStatuses request and get an empty list in response until the recovery completes - as described above. I suggest this fix: http://gerrit.ovirt.org/#/c/10527
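For illustration only, here is a minimal sketch of the idea (this is not the patch linked above; TaskQueryGuard, TasksNotReadyError and the task_manager argument with its recover/query methods are hypothetical names):

import threading


class TasksNotReadyError(Exception):
    """Raised while vdsm is still recovering persisted tasks."""


class TaskQueryGuard:
    def __init__(self, task_manager):
        self._task_manager = task_manager
        self._recovery_done = threading.Event()
        # Recovery runs in a background thread during startup, as described
        # above, so task queries can arrive before it has finished.
        recovery_thread = threading.Thread(target=self._recover_tasks)
        recovery_thread.daemon = True
        recovery_thread.start()

    def _recover_tasks(self):
        # Placeholder for the real recovery of persisted tasks.
        self._task_manager.recover_persisted_tasks()
        self._recovery_done.set()

    def getAllTasksStatuses(self):
        if not self._recovery_done.is_set():
            # Answer "still recovering" instead of a misleading empty list.
            raise TasksNotReadyError("vdsm is still recovering tasks")
        return self._task_manager.getAllTasksStatuses()

With a guard like this the engine would get an explicit "try again later" answer instead of silently concluding that no tasks exist.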
If this was caused by a race - the engine thought it sent the request to the SPM, but vdsm restarted and got the request before becoming SPM and responded with an empty list, so the engine concluded the host has no tasks to recover - then returning an empty list and only afterwards becoming SPM causes conflicts.

This should be handled at the engine's level, or the recovering thread should be synchronized with hsm startup before responding to a getTasksList request.

Throwing an exception, or making it an SPM-only command, blocks you from knowing about tasks that are not related to SPM.

When does the engine send this request? Can it be this kind of race? Should it be handled on the engine's side or vdsm's? What does the engine do if it gets an exception when requesting the tasks list?
(In reply to comment #8)
> when does engine send this request?

The engine sends getTasks every 10 seconds.

> can it be this kind of race?

Yes, theoretically. The engine does not request tasks if there is no SPM; I suspect the race happens after the engine has already reached the AsyncTaskManager functionality but, a moment before it fetched the tasks, VDSM restarted (see comment 3).

> should it be handled in engine's side or vdsm's?

I think VDSM should also not return a tasks list if it is not SPM - why can't it throw an exception? From the engine side it is a problem, since checking whether there is an SPM and fetching the tasks will never be atomic (they are two different VDSM commands).

> what does engine do if it gets exception when requesting the tasks list?

The engine will simply try again after the SPM is re-elected. I think that if VDSM returns an exception when it is not SPM, that could solve this issue.
Vdsm should also return tasks that are not related to SPM, as I understand it. If you block the command, the engine could start tasks it is then unable to query. Throwing an exception if recovery is not done, or if the request was sent to a non-SPM host, is a possibility, but that way the engine cannot query non-SPM tasks, and I don't know how the engine reacts if it gets an exception during this request - please let me know whether it harms other flows.
(In reply to comment #10)
> Vdsm should also return tasks that are not related to SPM, as I understand it.

VDSM might do so, but the engine does not support fetching async tasks from HSM at the moment.

> Throwing an exception if recovery is not done, or if the request was sent to
> a non-SPM host, is a possibility, but that way the engine cannot query
> non-SPM tasks, and I don't know how the engine reacts if it gets an exception
> during this request.

IIRC, the engine will simply query those tasks the next time the SPM is up.
(In reply to comment #11)
> > Vdsm should also return tasks that are not related to SPM, as I understand it.
> VDSM might do so, but the engine does not support fetching async tasks from
> HSM at the moment.

This is an engine problem. The tasks API is, and always has been, HSM.

> IIRC, the engine will simply query those tasks the next time the SPM is up.

The engine should not send getTasks* to a host before it sees that it is the SPM. If the engine gets an empty list when it expects tasks to be there, it can issue a getSpmStatus call and, if the host is SPM, reissue the getTasks* command.
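For illustration, a rough sketch of that polling flow (the real engine code is Java; poll_spm_tasks, vds_proxy and its methods here are hypothetical stand-ins for the actual VDS broker calls):

def poll_spm_tasks(vds_proxy, expected_task_ids):
    statuses = vds_proxy.getAllTasksStatuses()
    if not statuses and expected_task_ids:
        # Empty list where tasks are expected: re-check SPM status before
        # concluding that the tasks are gone.
        if vds_proxy.getSpmStatus() == "SPM":
            # The host really is SPM, so reissue the query.
            statuses = vds_proxy.getAllTasksStatuses()
        else:
            # The host is not (yet) SPM; do not mark the tasks as unknown.
            return None
    return statuses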
(In reply to comment #12)
> The engine should not send getTasks* to a host before it sees that it is the
> SPM. If the engine gets an empty list when it expects tasks to be there, it
> can issue a getSpmStatus call and, if the host is SPM, reissue the getTasks*
> command.

This solution will reduce the race but will not solve it. Besides, it might increase VDSM traffic and could affect system latency (all the getTasks* verbs for each SPM are queued). Why not change the API by adding a parameter indicating whether the getAllTasks call is for SPM or HSM?
Ayal, a question for 3.3: can we change the getTasks API to also return a field indicating whether the host is indeed still the SPM? (Suggesting this per Maor's comment.)
(In reply to comment #14)
> Can we change the getTasks API to also return a field indicating whether the
> host is indeed still the SPM?

I'm skeptical that the vdsm maintainers would accept such a patch (I wouldn't), as getAllTasksList/Info has nothing to do with SPM.
After discussing the issue with both Ayal and Saggi, I think we can go forward with this approach:
* getAllTasksList/Info will have an additional optional parameter (in the options list in vdsm): SPM=True/False.
* When VDSM gets this flag it means it should return the SPM tasks as well, and if this VDSM is not in SPM status it should throw the relevant exception.
* On 3.2 the engine should always send the flag SPM=True when it calls the task API.
* We need to make sure backwards compatibility is kept.
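For illustration, a minimal sketch of what this could look like on the vdsm side (not the actual patch; SpmStatusError, pool.is_spm() and the task_manager argument are hypothetical stand-ins for the real storage exception and SPM-role check):

class SpmStatusError(Exception):
    """Raised when an SPM-only request reaches a host that is not SPM."""


def getAllTasksStatuses(task_manager, pool, options=None):
    options = options or {}
    if options.get("spm", False):
        # The caller explicitly asked for SPM tasks: answer only if a
        # connected storage pool exists and this host holds the SPM role.
        if pool is None or not pool.is_spm():
            raise SpmStatusError("Not SPM")
    return {"allTasksStatus": task_manager.getAllTasksStatuses()}

Old engines that do not pass the flag would keep the current behaviour, which keeps the change backwards compatible.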
When adding spm=true to the getAllTasksList and getAllTasksStatuses commands, the command returns only if a connected storage pool exists and the host is SPM; otherwise a "not SPM" exception is raised.
Here is a clear reproduction for this issue:

1. In a one-host cluster with iSCSI storage (I had 3 domains but you can work with 1 - it should not matter for this issue), create and run a VM with 3 disks.
2. Once the VM is up, create a live snapshot.
3. Once the disks become locked, restart vdsm.

The engine sends ConnectStoragePool, which succeeds, and then polls the tasks, which come back null since the host is not actually SPM yet.

The engine sends ConnectStoragePool and, before getting a reply from vdsm, sends SPMAsyncTask::PollTask:

2013-03-04 04:36:27,127 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (pool-3-thread-48) START, ConnectStoragePoolVDSCommand(HostName = gold-vdsd, HostId = 83834e1f-9e60-41b5-a9cc-16460a8a2fe2, storagePoolId = a6835dc2-d50e-4862-b64d-9291426c319d, vds_spm_id = 1, masterDomainId = bb11a7a5-b5f0-4a51-99ed-24507f087cd4, masterVersion = 2), log id: 61f20d63
2013-03-04 04:36:28,283 ERROR [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-23) SPMAsyncTask::PollTask: Task 6fa00a72-733b-4ecf-85f3-195c371a6334 (Parent Command CreateAllSnapshotsFromVm, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) was not found in VDSM, will change its status to unknown.
2013-03-04 04:36:28,283 WARN [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-23) SPMAsyncTask::PollTask: Polling task 6fa00a72-733b-4ecf-85f3-195c371a6334 (Parent Command CreateAllSnapshotsFromVm, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) returned status unknown.
2013-03-04 04:36:28,288 ERROR [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-23) BaseAsyncTask::LogEndTaskFailure: Task 6fa00a72-733b-4ecf-85f3-195c371a6334 (Parent Command CreateAllSnapshotsFromVm, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
-- Result: success
-- Message: Asynchronous Task unknown error,
-- Exception: [null]

The reply for ConnectStoragePool comes back after we have already put the tasks in unknown:

2013-03-04 04:36:36,695 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (pool-3-thread-48) FINISH, ConnectStoragePoolVDSCommand, log id: 61f20d63

The vdsm log shows an exception when the engine then tries to clear the task:

Thread-25::ERROR::2013-03-04 04:36:28,842::task::833::TaskManager.Task::(_setError) Task=`dfe45af9-2ac0-440c-9e1d-c954f1d9e6dc`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 840, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2121, in clearTask
    return self.taskMng.clearTask(taskID=taskID)
  File "/usr/share/vdsm/storage/taskManager.py", line 163, in clearTask
    t = self.__getTask(taskID)
  File "/usr/share/vdsm/storage/taskManager.py", line 74, in __getTask
    raise se.UnknownTask(taskID)
UnknownTask: Task id unknown: ('6fa00a72-733b-4ecf-85f3-195c371a6334',)
Thread-25::DEBUG::2013-03-04 04:36:28,845::task::852::TaskManager.Task::(_run) Task=`dfe45af9-2ac0-440c-9e1d-c954f1d9e6dc`::Task._run: dfe45af9-2ac0-440c-9e1d-c954f1d9e6dc ('6fa00a72-733b-4ecf-85f3-195c371a6334',) {} failed - stopping task

As a result we have tasks that were cleared in the engine but will continue to run in vdsm:

[root@gold-vdsd ~]# vdsClient -s 0 getAllTasksInfo
807816d1-b99b-4ce5-b574-a1c0571b3623 :
         verb = createVolume
         id = 807816d1-b99b-4ce5-b574-a1c0571b3623
bb353bba-93b9-4818-8233-932fd4102bb2 :
         verb = createVolume
         id = bb353bba-93b9-4818-8233-932fd4102bb2
6fa00a72-733b-4ecf-85f3-195c371a6334 :
         verb = createVolume
         id = 6fa00a72-733b-4ecf-85f3-195c371a6334

I will attach the full logs.
Created attachment 705032 [details] logs
Created attachment 705057 [details]
new logs

The vdsm log rotated - these are the correct logs.
I am facing the same issue on sf 13.1, in a slightly different scenario: restarting vdsm while trying to delete a disk from the Disks tab causes the same issue - tasks are cleared in the engine but not cleared on vdsm.
Verified on is9.1, with both Kuba's original scenario and Dafna's scenario from comment 18.
This bug is currently attached to errata RHBA-2013:15291. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise, to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:
https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes

Thanks in advance.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-0040.html