Bug 1621468

Summary: Internal file read failure task stuck on SPM
Product: [oVirt] vdsm
Reporter: Evgheni Dereveanchin <ederevea>
Component: General
Assignee: Dan Kenigsberg <danken>
Status: CLOSED CANTFIX
QA Contact: Elad <ebenahar>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 4.20.23
CC: bugs, bzlotnik, ederevea, mperina, rnori, stirabos, tnisan
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-03-11 10:10:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Evgheni Dereveanchin 2018-08-23 16:05:36 UTC
Description of problem:
After adding storage domains to the system, the event log on the Engine is filled with the following error message:

VDSM ovirt-srv01 command HSMGetAllTasksStatusesVDS failed: Internal file read failure: ('partial data 10240 from 20480',)

Version-Release number of selected component (if applicable):
vdsm-4.20.27.1-1.el7.centos.x86_64

Additional info:
Started happening in one environment which was affected by BZ#1613278, so the two may be related.

vdsm.log lists the task as finished, but it is not going away. The HE config volume size issue has also been fixed since, but only after this problem appeared. Not sure if it's safe to just clear this task.
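
For completeness, the same task list can be queried directly on the SPM host; a sketch only, assuming this vdsm-client version exposes the Host getAllTasksStatuses verb:

# vdsm-client Host getAllTasksStatuses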

Comment 1 Evgheni Dereveanchin 2018-08-23 16:06:45 UTC
ovirt-srv01 is the SPM and an HE host; here's the getAllTasksStatuses result from vdsm.log:

{  
   'allTasksStatus':{  
      '34d3a80d-6615-4994-87a0-b1fba7d37196':{  
         'code':0,
         'message':'1 jobs completed successfully',
         'taskState':'finished',
         'taskResult':'success',
         'taskID':'34d3a80d-6615-4994-87a0-b1fba7d37196'
      },
      '6247d2d5-7195-4ae6-bab6-8c5518b8bac3':{  
         'code':2001,
         'message':"Internal file read failure: ('partial data 10240 from 20480',)",
         'taskState':'finished',
         'taskResult':'cleanSuccess',
         'taskID':'6247d2d5-7195-4ae6-bab6-8c5518b8bac3'
      }
   }
}

A closer look at this job:

# vdsm-client Task getInfo taskID=6247d2d5-7195-4ae6-bab6-8c5518b8bac3
{
    "verb": "uploadImageToStream", 
    "id": "6247d2d5-7195-4ae6-bab6-8c5518b8bac3"
}

# vdsm-client Task getStatus taskID=6247d2d5-7195-4ae6-bab6-8c5518b8bac3
{
    "message": "Internal file read failure: ('partial data 10240 from 20480',)", 
    "code": 2001, 
    "taskID": "6247d2d5-7195-4ae6-bab6-8c5518b8bac3", 
    "taskResult": "cleanSuccess", 
    "taskState": "finished"
}

Comment 2 Evgheni Dereveanchin 2018-08-23 16:16:50 UTC
Note: it is only my assumption that this is related to BZ#1613278, as that one was caused by the fact that our environment had an HE config volume of the wrong size. The error is similar, but maybe this particular error refers to another volume that also has this issue.

I can't see what initiated job 6247d2d5-7195-4ae6-bab6-8c5518b8bac3; it just started appearing in the "finished" state in vdsm.log after OVF_STORE volumes were created on the new block storage domains.

Comment 3 Evgheni Dereveanchin 2018-08-23 17:05:15 UTC
After looking through the logs, the task seems to have been stuck on this host since I logged BZ#1613278 several weeks ago. VDSM logs have since been rotated, and this re-appeared in the logs only because this server took over the SPM role again. I assume it's safe to evacuate the SPM at this point and clear the task using vdsm-client, but if there are other precautions to take, I would like to hear them.
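
For reference, a sketch of the commands I would expect to use for this, assuming the Task clear verb is available in this vdsm-client version:

# vdsm-client Task getStatus taskID=6247d2d5-7195-4ae6-bab6-8c5518b8bac3
# vdsm-client Task clear taskID=6247d2d5-7195-4ae6-bab6-8c5518b8bac3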

Comment 4 Evgheni Dereveanchin 2018-08-23 17:09:43 UTC
Evacuating the SPM fails due to the stuck task:

2018-08-23 13:05:31,859-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (default task-99) [b24c6d05-18e0-403d-8f54-1af19f92a40b] SpmStopVDSCommand::Not stopping SPM on vds 'ovirt-srv01', pool id '00000002-0002-0002-0002-000000000150' as there are uncleared tasks 'Task '6247d2d5-7195-4ae6-bab6-8c5518b8bac3', status 'finished''

How should I proceed? Is it enough to clear the task on the host, or would engine-side modifications also be needed?

Comment 5 Evgheni Dereveanchin 2018-08-24 09:25:35 UTC
Checked the engine DB and there are no jobs or async tasks present:

engine=# SELECT task_id from async_tasks;
 task_id 
---------
(0 rows)

engine=# select job_id from job;
 job_id 
--------
(0 rows)

I was also able to confirm when this task was created: it was indeed during an HE host rebuild several weeks ago. It is still not clear why the task remained on the host and was never cleared, but at this point I think it's harmless to clear it using vdsm-client.

Comment 7 Tal Nisan 2018-08-27 15:08:17 UTC
Most likely you should clear the old task from the running host. Simone, as the root cause is bug 1613278, please keep me honest here.

Comment 8 Simone Tiraboschi 2018-08-27 16:09:12 UTC
(In reply to Tal Nisan from comment #7)
> Most likely you should clear the old task from the running host. Simone, as
> the root cause is bug 1613278, please keep me honest here.

The root cause is definitely 1613278, but maybe this is a bug in itself: taskID 6247d2d5-7195-4ae6-bab6-8c5518b8bac3 has:

  "taskResult": "cleanSuccess",
  "taskState": "finished"

So I'm not sure why the user has to manually clear this on the host side.

Comment 9 Evgheni Dereveanchin 2018-08-28 12:18:44 UTC
Thanks for the confirmation. I cleared the task successfully and was able to migrate the SPM role to a different host.

I agree with Simone on this one: even though this is a rare edge case, the task should be cleared after completion. Otherwise it produces errors in the event log, and if the host acquires the SPM role, it becomes impossible to move that role elsewhere without manual intervention.

Comment 10 Benny Zlotnik 2018-08-29 08:08:06 UTC
Tasks should be cleared automatically when they complete successfully; if they aren't, there is a bug or some other failure. There is also a scheduled job that clears leftover tasks after "AsyncTaskZombieTaskLifeInMinutes", which defaults to 50 hours (3000 minutes).
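
The current value can be inspected on the engine machine; a sketch only, assuming the standard engine-config tool:

# engine-config -g AsyncTaskZombieTaskLifeInMinutes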

Comment 11 Tal Nisan 2018-09-02 14:47:24 UTC
Since the original issue was fixed in bug 1613278, we are now left only with the problem of the task not being cleared. Moving to infra for further inspection.

Comment 12 Martin Perina 2018-09-03 13:25:54 UTC
Ravi, could you please take a look?

Comment 13 Ravi Nori 2018-09-13 17:01:59 UTC
The engine always clears the task from the database after getting the task status from vdsm, for both successful and failed executions of a task. If the task's vdsm id is known, an SPMClearTask command is issued to clear the task on the vdsm side. As long as the task's vdsm id is passed back to the engine's Async Task Manager, everything should work fine and the task should be cleared from both the database and vdsm.

From the logs it does not look like an infra issue.

Comment 14 Ravi Nori 2018-09-14 15:11:31 UTC
Was the task started from webadmin? Everything started from Webadmin/REST API should be handled properly by the engine's AsyncTaskManager.

Comment 15 Evgheni Dereveanchin 2018-09-24 13:38:01 UTC
The action on my side that resulted in this task was re-deploying a hypervisor from the Admin Portal. Hopefully the logs can reveal more about which part of the process triggered the task.

The task was already gone from the DB by the time I logged this bug, so I'm not sure what caused this to happen.

The steps to confirm if this is reproducible would be:
1) deploy an HE environment on NFS
2) alter the size of the metadata volume to trigger the "Internal file read failure" error (a rough sketch of this step is shown below)
3) attempt to deploy a new HE host
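
A rough sketch of step 2, assuming an NFS-backed domain so the volume is a plain file; the mount point and UUIDs below are placeholders, not values from this environment. Shrinking the file to half of the size recorded in its metadata (e.g. 10240 of 20480 bytes) should make reads return partial data, matching the error above:

# truncate -s 10240 /rhev/data-center/mnt/<nfs_server>:_<export_path>/<sd_uuid>/images/<img_uuid>/<vol_uuid>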

Comment 16 Ravi Nori 2019-03-04 20:42:07 UTC
The patches for BZ 1633777 should fix the infra-related issues described in the BZ.

Comment 17 Tal Nisan 2019-03-11 10:10:35 UTC
We can't seem to think of a way to reproduce this. Please reopen if the issue reproduces after the fix for bug 1633777 has been applied.