Bug 815359

Summary: 3.1 - vdsm: delete snapshot fails and when trying to delete it again task hangs on preparing to finish and vm is stuck in image locked
Product: Red Hat Enterprise Linux 6
Reporter: Dafna Ron <dron>
Component: vdsm
Assignee: Saggi Mizrahi <smizrahi>
Status: CLOSED ERRATA
QA Contact: Dafna Ron <dron>
Severity: urgent
Priority: high
Version: 6.3
CC: abaron, aburden, acathrow, bazulay, dyasny, ewarszaw, hateya, iheim, ilvovsky, lpeer, lyarwood, mlipchuk, Rhev-m-bugs, sgrinber, yeylon, ykaul, zdover
Target Milestone: beta
Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard: infra
Fixed In Version: vdsm-4.9.6-41.0
Doc Type: Bug Fix
Doc Text:
Previously, deleting a snapshot failed, and a second attempt to delete the snapshot hung at "preparing to finish". This caused the associated virtual machine to become stuck in the "image locked" state for fifty hours, until the associated task was declared a zombie and killed. A fix was implemented whereby teardown() is called when such errors occur. Attempts to delete snapshots no longer generate zombie tasks.
Story Points: ---
Last Closed: 2012-12-04 18:57:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Attachments: logs

Description Dafna Ron 2012-04-23 13:11:54 UTC
Description of problem:

While testing bug 773210, in which we kill dd during delete snapshot, I tried removing the snapshot again after the task had failed.
The second task is stuck on "preparing to finish", so the VM remains in the "image locked" state for 50 hours (until the task is declared a zombie and killed).

Following discussions with Eduardo, he thinks that the backend should not send the second delete command.
Doron says the backend cannot block the second delete command, since it does not know at what state the failure happened, and the second delete might succeed.

After speaking to Simon, I am opening the bug against vdsm and attaching both the vdsm and backend logs.

Version-Release number of selected component (if applicable):

vdsm-4.9.6-8.el6.x86_64

How reproducible:

100%

Steps to Reproduce:
1. Create a VM with a preallocated disk + wipe and delete.
2. Create a snapshot.
3. Delete the snapshot and kill qemu during the delete (a sketch of one way to kill the process follows the steps).
4. Once the task has failed, try to delete the snapshot again.
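
For step 3, a rough repro helper (not part of the original report): find the data-copy process spawned for the snapshot delete and SIGKILL it mid-flight. The process name pattern is an assumption; this run killed dd/qemu, so adjust the pattern to whatever ps actually shows during the delete.

# Hypothetical repro helper: kill the copy process backing the snapshot delete.
import os
import signal
import subprocess

def kill_copy_process(pattern="dd"):
    # pgrep -f matches against the full command line of running processes;
    # it raises CalledProcessError if nothing matches.
    pids = subprocess.check_output(["pgrep", "-f", pattern]).decode().split()
    for pid in pids:
        os.kill(int(pid), signal.SIGKILL)

# Example: call kill_copy_process("dd") while the delete-snapshot task is copying data.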
  
Actual results:

The task is stuck and the VM remains in the "image locked" state.

Expected results:

1) The task should not hang - even if the image is gone, the task should be moved to finished (looks like a resource manager issue). 

2) vdsm should either return a specific error for each failure or offer a different way of dealing with this issue. 

Additional info: logs attached. 

the second delete's Task id is: 29f4cfaa-27a2-42e7-bfdf-c28df879ce48

Comment 1 Dafna Ron 2012-04-23 13:18:12 UTC
Created attachment 579535 [details]
logs

Comment 2 RHEL Program Management 2012-05-05 04:16:06 UTC
Since RHEL 6.3 External Beta has begun and this bug remains
unresolved, it has been rejected, as it is not proposed as an
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 4 Eduardo Warszawski 2012-08-12 16:54:29 UTC
After the first delete image request is issued, new operations on this image should be prevented.

If the operation (task) is lost, the engine should check the status of the image, not the lost task.

The image should be unusable until the data and metadata integrity have been positively verified.

Comment 5 Maor 2012-08-22 13:27:14 UTC
Dafna, I'm getting an exception when trying to extract the logs.
Eduardo, I'm not sure I follow your solution:
if we prevent the user from sending the delete command again, how should they get out of this condition?
The user might also have problems with any operation regarding the VM (remove, for example).

Comment 8 Maor 2012-08-23 12:33:49 UTC
Regarding the logs, I now managed to open them (I guess there was some temporary network issue).

Now that I can see the log, I see that VDSM reported the task as finished:

Thread-284::DEBUG::2012-04-22 16:34:18,319::taskManager::93::TaskManager::(getTaskStatus) Entry. taskID: 4fd7432b-ec34-4ebe-b140-df61ae789586
Thread-284::DEBUG::2012-04-22 16:34:18,319::taskManager::96::TaskManager::(getTaskStatus) Return. Response: {'code': 252, 'message': 'Error merging snapshots', 'taskState': 'finished', 'taskResult': 'cleanSuccess', 'taskID': '4fd7432b-ec34-4ebe-b140-df61ae789586'}
Thread-284::DEBUG::2012-04-22 16:34:18,320::taskManager::112::TaskManager::(getAllTasksStatuses) Return: {'4fd7432b-ec34-4ebe-b140-df61ae789586': {'code': 252, 'message': 'Error merging snapshots', 'taskState': 'finished', 'taskResult': 'cleanSuccess', 'taskID': '4fd7432b-ec34-4ebe-b140-df61ae789586'}}
Thread-284::INFO::2012-04-22 16:34:18,320::logUtils::39::dispatcher::(wrapper) Run and protect: getAllTasksStatuses, Return response: {'allTasksStatus': {'4fd7432b-ec34-4ebe-b140-df61ae789586': {'code': 252, 'message': 'Error merging snapshots', 'taskState': 'finished', 'taskResult': 'cleanSuccess', 'taskID': '4fd7432b-ec34-4ebe-b140-df61ae789586'}}}
Thread-284::DEBUG::2012-04-22 16:34:18,321::task::1172::TaskManager.Task::(prepare) Task=`4ae9d40d-ea76-4886-88df-b2849ea22a97`::finished: {'allTasksStatus': {'4fd7432b-ec34-4ebe-b140-df61ae789586': {'code': 252, 'message': 'Error merging snapshots', 'taskState': 'finished', 'taskResult': 'cleanSuccess', 'taskID': '4fd7432b-ec34-4ebe-b140-df61ae789586'}}}

So, I'm moving the needinfo back to Eduardo, regarding comment 5.

Comment 16 Eduardo Warszawski 2012-09-13 11:04:04 UTC
In addition, a thread is waiting on resource 29f4cfaa-27a2-42e7-bfdf-c28df879ce48, which seems to be stale (until the end of the log) after thread 4ae9d40d-ea76-4886-88df-b2849ea22a97 exits.
We need to find out why.

Comment 17 Eduardo Warszawski 2012-09-20 07:41:37 UTC
Moving to infra, as agreed.

Comment 18 Saggi Mizrahi 2012-10-18 19:47:21 UTC
Ended up not being infra after all.
http://gerrit.ovirt.org/#/c/8667/
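
For readers without access to the review, a minimal sketch of the behaviour the Doc Text above describes: calling teardown() on the error path so the prepared image does not stay locked and block the retry. The prepare()/merge()/teardown() names and signatures here are illustrative placeholders, not the exact vdsm API or the actual patch.

# Illustrative only: always tear down what was prepared, even when the
# merge fails, so nothing stays locked for the next delete attempt.
def merge_snapshot(image, vol):
    image.prepare(vol)              # hypothetical: activate the volume chain
    try:
        image.merge(vol)            # hypothetical merge step; may fail if dd/qemu is killed
    finally:
        try:
            image.teardown(vol)     # the fix per the Doc Text: tear down on errors too
        except Exception:
            pass                    # best effort; do not mask the original error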

Comment 19 Eduardo Warszawski 2012-10-21 07:54:53 UTC
The issue, as we agreed, is a task being cleared without releasing its resources.

The task failing and leaving "prepared" volumes is only the trigger, not the problem.

Clearing the task should release the resources.

In addition, the resource logs, despite being very verbose, were of little help in understanding the issue.

These are infra problems.

Comment 20 Ayal Baron 2012-10-21 09:07:52 UTC
*** Bug 864902 has been marked as a duplicate of this bug. ***

Comment 21 Saggi Mizrahi 2012-10-22 15:04:52 UTC
No, because the resource is not taken in the task context, so the task is not responsible for freeing it.
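
To make the disagreement between comments 19 and 21 concrete, here is a rough sketch, with hypothetical names rather than the real vdsm API, of what taking the resource in the task context would look like: the task records a release callback when it acquires the resource, so clearing the task (comment 19) frees it, instead of the resource being held outside the task and left for someone else to free (comment 21).

# Hypothetical sketch only; class and method names are illustrative.
class Task(object):
    def __init__(self):
        self._cleanups = []

    def acquire_resource(self, manager, res_id):
        res = manager.acquire(res_id)                 # take the resource in the task context
        self._cleanups.append(lambda: manager.release(res_id))
        return res

    def clear(self):
        # Clearing the task runs the registered cleanups, so a leftover
        # resource cannot keep the next delete attempt waiting forever.
        while self._cleanups:
            cleanup = self._cleanups.pop()
            try:
                cleanup()
            except Exception:
                pass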

Comment 24 Dafna Ron 2012-11-05 17:35:25 UTC
Verified on si24.

Comment 26 errata-xmlrpc 2012-12-04 18:57:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html