Bug 1621468
Summary: | Internal file read failure task stuck on SPM | |
---|---|---|---
Product: | [oVirt] vdsm | Reporter: | Evgheni Dereveanchin <ederevea> |
Component: | General | Assignee: | Dan Kenigsberg <danken> |
Status: | CLOSED CANTFIX | QA Contact: | Elad <ebenahar> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | |
Version: | 4.20.23 | CC: | bugs, bzlotnik, ederevea, mperina, rnori, stirabos, tnisan |
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2019-03-11 10:10:35 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | |
Description
Evgheni Dereveanchin
2018-08-23 16:05:36 UTC
ovirt-srv01 is the SPM and an HE host. Here is the getAllTasksStatuses result from vdsm.log:

    {'allTasksStatus': {
        '34d3a80d-6615-4994-87a0-b1fba7d37196': {
            'code': 0,
            'message': '1 jobs completed successfully',
            'taskState': 'finished',
            'taskResult': 'success',
            'taskID': '34d3a80d-6615-4994-87a0-b1fba7d37196'},
        '6247d2d5-7195-4ae6-bab6-8c5518b8bac3': {
            'code': 2001,
            'message': "Internal file read failure: ('partial data 10240 from 20480',)",
            'taskState': 'finished',
            'taskResult': 'cleanSuccess',
            'taskID': '6247d2d5-7195-4ae6-bab6-8c5518b8bac3'}}}

A closer look at this job:

    # vdsm-client Task getInfo taskID=6247d2d5-7195-4ae6-bab6-8c5518b8bac3
    {
        "verb": "uploadImageToStream",
        "id": "6247d2d5-7195-4ae6-bab6-8c5518b8bac3"
    }
    # vdsm-client Task getStatus taskID=6247d2d5-7195-4ae6-bab6-8c5518b8bac3
    {
        "message": "Internal file read failure: ('partial data 10240 from 20480',)",
        "code": 2001,
        "taskID": "6247d2d5-7195-4ae6-bab6-8c5518b8bac3",
        "taskResult": "cleanSuccess",
        "taskState": "finished"
    }

Note: it is only my assumption that this is related to BZ#1613278, as that one was caused by our environment having an HE config volume of the wrong size. The error is similar, but this particular error may refer to another volume that also has the same issue.

I can't see what initiated job 6247d2d5-7195-4ae6-bab6-8c5518b8bac3; it just started appearing in "finished" state in vdsm.log after OVF_STORE volumes were created on the new block storage domains.

After looking through the logs, the task appears to have been stuck on this host since I logged BZ#1613278 several weeks ago. The vdsm logs have since been rotated, and it re-appeared in the logs only because this server took over the SPM role again. I assume it is safe to evacuate the SPM at this point and clear the task using vdsm-client, but if there are other precautions to take, I would like to hear them.

Evacuating the SPM fails due to the stuck task:

    2018-08-23 13:05:31,859-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (default task-99) [b24c6d05-18e0-403d-8f54-1af19f92a40b] SpmStopVDSCommand::Not stopping SPM on vds 'ovirt-srv01', pool id '00000002-0002-0002-0002-000000000150' as there are uncleared tasks 'Task '6247d2d5-7195-4ae6-bab6-8c5518b8bac3', status 'finished''

How should I proceed? Is clearing the task on the host enough, or are engine-side modifications also needed?

Checked the engine DB and there are no jobs or async tasks present:

    engine=# SELECT task_id from async_tasks;
     task_id
    ---------
    (0 rows)

    engine=# select job_id from job;
     job_id
    --------
    (0 rows)

I was also able to confirm when this task happened: it was indeed during an HE host rebuild several weeks ago. It is still not clear why the task remained on the host and wasn't cleared, but I think at this point it is harmless to clear it using vdsm-client.

Most likely you should clear the old task from the running host. Simone, as the root cause is bug 1613278, please keep me honest here.

(In reply to Tal Nisan from comment #7)
> Most likely you should clear the old task from the running host. Simone, as
> the root cause is bug 1613278 please keep me honest here

The root cause is 1613278 for sure, but maybe this is a bug in itself: taskID 6247d2d5-7195-4ae6-bab6-8c5518b8bac3 has:

    "taskResult": "cleanSuccess",
    "taskState": "finished"

so I'm not sure why the user has to manually clear this on the host side.
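(For reference, a leftover finished task of this kind can normally be cleared directly on the SPM host with vdsm-client. The commands below are only a sketch of the general procedure, not the exact steps taken in this case; verb names may differ between vdsm versions, so check vdsm-client --help first.)

    # list the task statuses known to the host
    vdsm-client Host getAllTasksStatuses
    # clear the finished task by its UUID
    vdsm-client Task clear taskID=6247d2d5-7195-4ae6-bab6-8c5518b8bac3
    # confirm it is gone from the host's task list
    vdsm-client Host getAllTasksStatuses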
Thanks for the confirmation. I cleared the task successfully and was able to migrate the SPM role to a different host.

I agree with Simone on this one: even though this is a rare edge case, the task should be cleared after completion. Otherwise it produces errors in the event log, and if the host acquires the SPM role it becomes impossible to move it elsewhere without manual intervention.

Tasks should be cleared automatically when completed successfully; if they aren't, there is a bug or some other failure. There is a scheduled job that clears leftover tasks after "AsyncTaskZombieTaskLifeInMinutes", which defaults to 50 hours.

Since the original issue was fixed in bug 1613278, we are now left only with the problem of the task not being cleared; moving to infra for further inspection. Ravi, could you please take a look?

The engine always clears the task from the database after getting the task status from vdsm, for both successful and failed execution of a task. If the task's vdsm id is known, an SPMClearTask command is issued to clean the task on the vdsm side. As long as the task's vdsm id is passed back to the engine's AsyncTaskManager, everything should work fine and the task should be cleared from both the database and vdsm. From the logs it does not look like an infra issue. Was the task started from webadmin? Everything started from webadmin or the REST API should be handled properly by the engine's AsyncTaskManager.

The action on my side that resulted in this task was re-deploying a hypervisor from the Admin Portal. Hopefully the logs can reveal more about which part of the process triggered the task. The task was gone from the DB by the time I logged this bug, so I am not sure what caused it to happen. The steps to confirm whether this is reproducible would be:
1) deploy an HE environment on NFS
2) alter the size of the metadata volume to trigger the "Internal file read failure" error (see the size-check sketch at the end of this report)
3) attempt to deploy a new HE host

The patches for BZ 1633777 should fix the infra-related issues described in that BZ.

We can't seem to think of a way to reproduce it; please reopen if the issue reproduces after the fix for 1633777 has been applied.
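(For completeness, the zombie-task cleanup interval mentioned above can be inspected and, if needed, tuned on the engine host with engine-config. The value below is only an illustration, not a recommendation for this environment.)

    # show the current value (the 50-hour default corresponds to 3000 minutes)
    engine-config -g AsyncTaskZombieTaskLifeInMinutes
    # example: lower it to 10 hours, then restart the engine for the change to apply
    engine-config -s AsyncTaskZombieTaskLifeInMinutes=600
    systemctl restart ovirt-engine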
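(As background for step 2 of the reproduction steps: the "partial data 10240 from 20480" message means vdsm read fewer bytes from the volume than its metadata declares. On a file-based domain, such a mismatch can be spotted by comparing the file size with the size recorded in the volume's .meta file. The path and UUIDs below are placeholders, and the metadata field name depends on the vdsm version; this is a sketch, not the procedure used in this bug.)

    # placeholders: fill in the mount point, storage domain, image and volume UUIDs
    VOL=/rhev/data-center/mnt/<server:_export>/<sd_uuid>/images/<img_uuid>/<vol_uuid>
    stat --format='actual size: %s bytes' "$VOL"
    grep -E '^(SIZE|CAP)' "$VOL.meta"    # declared size (sectors or bytes, depending on version)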