Bug 815359 - 3.1 - vdsm: delete snapshot fails and when trying to delete it again task hangs on preparing to finish and vm is stuck in image locked
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: vdsm
Version: 6.3
Hardware: x86_64 Linux
Priority: high
Severity: urgent
Target Milestone: beta
Target Release: ---
Assigned To: Saggi Mizrahi
QA Contact: Dafna Ron
Whiteboard: infra
Keywords: ZStream
Duplicates: 864902
Depends On:
Blocks:
Reported: 2012-04-23 09:11 EDT by Dafna Ron
Modified: 2012-12-27 03:34 EST

See Also:
Fixed In Version: vdsm-4.9.6-41.0
Doc Type: Bug Fix
Doc Text:
Previously, deleting snapshots failed and a second attempt to delete the snapshot hung when the deletion was "preparing to finish". This caused the associated virtual machine to become stuck in the "image locked" state for fifty hours until the associated task was declared a zombie and killed. A fix was implemented whereby teardown() was called when errors such as these presented. Attempts to delete snapshots no longer generate zombie tasks.
Last Closed: 2012-12-04 13:57:52 EST
Type: Bug

Attachments
logs (819.57 KB, application/x-gzip)
2012-04-23 09:18 EDT, Dafna Ron

Description Dafna Ron 2012-04-23 09:11:54 EDT
Description of problem:

While testing bug 773210, in which we kill dd during snapshot deletion, I tried removing the snapshot again after the task had failed.
The task is stuck on "preparing to finish", so the VM remains in the locked state for 50 hours (until the task is declared a zombie and killed).

Following discussions with Eduardo, he thinks that the backend should not send the second delete command.
Doron says that the backend cannot block the second delete command, since it does not know the state at which the failure happened, and the second delete might succeed.

After speaking to Simon, I am opening the bug in vdsm and attaching both vdsm and backend logs.

Version-Release number of selected component (if applicable):

vdsm-4.9.6-8.el6.x86_64

How reproducible:

100%

Steps to Reproduce:
1. Create a VM with a preallocated disk + wipe and delete
2. Create a snapshot
3. Delete the snapshot and kill qemu during the delete
4. Once the task has failed, try to delete the snapshot again
  
Actual results:

The task is stuck and the VM remains in the "image locked" state.

Expected results:

1) Task should not hang - even if the image is gone we should move task to finished (looks like resource manager issue). 

2) vdsm should either have a specific error for each failure or offer a different solution to dealing with this issue. 

Additional info: logs attached. 

the second delete's Task id is: 29f4cfaa-27a2-42e7-bfdf-c28df879ce48
Comment 1 Dafna Ron 2012-04-23 09:18:12 EDT
Created attachment 579535
logs
Comment 2 RHEL Product and Program Management 2012-05-05 00:16:06 EDT
Since RHEL 6.3 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.
Comment 4 Eduardo Warszawski 2012-08-12 12:54:29 EDT
After the first delete image command is issued, new operations on this image should be prevented.

If the operation (task) is lost, the engine should check the status of the image and not the lost task.

The image should be unusable until its data and metadata integrity has been positively verified.
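
The policy proposed in this comment could be sketched roughly as follows. This is only an illustration in Python: the `ImageState` enum, the `ImageGuard` class, and all of its method names are hypothetical and are not part of vdsm or the engine.

```python
import enum


class ImageState(enum.Enum):
    OK = "ok"
    DELETING = "deleting"
    UNVERIFIED = "unverified"  # operation failed or was lost; integrity unknown


class ImageGuard:
    """Hypothetical per-image gate; not a vdsm or engine API."""

    def __init__(self):
        self._states = {}

    def state(self, image_id):
        # Images never seen before are assumed to be in a good state.
        return self._states.get(image_id, ImageState.OK)

    def begin_delete(self, image_id):
        # Refuse any new operation unless the image is known to be good.
        current = self.state(image_id)
        if current is not ImageState.OK:
            raise RuntimeError("image %s is %s" % (image_id, current.value))
        self._states[image_id] = ImageState.DELETING

    def delete_failed(self, image_id):
        # After a failed or lost delete, block the image until it is verified.
        self._states[image_id] = ImageState.UNVERIFIED

    def mark_verified(self, image_id):
        # Called only after data and metadata integrity were checked.
        self._states[image_id] = ImageState.OK
```

Under this sketch, a second delete issued right after a failed one is rejected immediately rather than left to hang, and the decision is based on the image state instead of the lost task.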
Comment 5 Maor 2012-08-22 09:27:14 EDT
Dafna, I'm getting an exception when trying to extract the logs.
Eduardo, I'm not sure I follow your solution.
If we prevent the user from sending the delete command again, how should they get out of this condition?
The user might also have problems with any other operation on the VM (remove, for example).
Comment 8 Maor 2012-08-23 08:33:49 EDT
Regarding the logs, I have now managed to open them (I guess there was a temporary network issue).

Now that I can see the log, I see that VDSM reported the task as finished:

Thread-284::DEBUG::2012-04-22 16:34:18,319::taskManager::93::TaskManager::(getTaskStatus) Entry. taskID: 4fd7432b-ec34-4ebe-b140-df61ae789586
Thread-284::DEBUG::2012-04-22 16:34:18,319::taskManager::96::TaskManager::(getTaskStatus) Return. Response: {'code': 252, 'message': 'Error merging snapshots', 'taskState': 'finished', 'taskResult': 'cleanSuccess', 'taskID': '4fd7432b-ec34-4ebe-b140-df61ae789586'}
Thread-284::DEBUG::2012-04-22 16:34:18,320::taskManager::112::TaskManager::(getAllTasksStatuses) Return: {'4fd7432b-ec34-4ebe-b140-df61ae789586': {'code': 252, 'message': 'Error merging snapshots', 'taskState': 'finished', 'taskResult': 'cleanSuccess', 'taskID': '4fd7432b-ec34-4ebe-b140-df61ae789586'}}
Thread-284::INFO::2012-04-22 16:34:18,320::logUtils::39::dispatcher::(wrapper) Run and protect: getAllTasksStatuses, Return response: {'allTasksStatus': {'4fd7432b-ec34-4ebe-b140-df61ae789586': {'code': 252, 'message': 'Error merging snapshots', 'taskState': 'finished', 'taskResult': 'cleanSuccess', 'taskID': '4fd7432b-ec34-4ebe-b140-df61ae789586'}}}
Thread-284::DEBUG::2012-04-22 16:34:18,321::task::1172::TaskManager.Task::(prepare) Task=`4ae9d40d-ea76-4886-88df-b2849ea22a97`::finished: {'allTasksStatus': {'4fd7432b-ec34-4ebe-b140-df61ae789586': {'code': 252, 'message': 'Error merging snapshots', 'taskState': 'finished', 'taskResult': 'cleanSuccess', 'taskID': '4fd7432b-ec34-4ebe-b140-df61ae789586'}}}

So I am moving the needinfo back to Eduardo, regarding comment 5.
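
For reference, the status dictionary in the log lines above can be read as "finished, but with an error". A small Python sketch of that check follows. The helper name `failed_finished_tasks` is hypothetical, and treating a `code` of 0 as success is an assumption; the dictionary itself is copied from the getAllTasksStatuses log line.

```python
def failed_finished_tasks(all_tasks_status):
    """Return IDs of tasks that are finished but carry a non-zero code.

    Assumption: a 'code' of 0 means success. Helper name is hypothetical.
    """
    return [
        task_id
        for task_id, status in all_tasks_status.items()
        if status.get("taskState") == "finished" and status.get("code", 0) != 0
    ]


# Response copied from the getAllTasksStatuses log line in this comment.
response = {
    'allTasksStatus': {
        '4fd7432b-ec34-4ebe-b140-df61ae789586': {
            'code': 252,
            'message': 'Error merging snapshots',
            'taskState': 'finished',
            'taskResult': 'cleanSuccess',
            'taskID': '4fd7432b-ec34-4ebe-b140-df61ae789586',
        }
    }
}

print(failed_finished_tasks(response['allTasksStatus']))
# ['4fd7432b-ec34-4ebe-b140-df61ae789586']
```

Note that `taskResult` is 'cleanSuccess' here even though `code` is 252, so in this response the result field describes the cleanup and not the merge itself.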
Comment 16 Eduardo Warszawski 2012-09-13 07:04:04 EDT
In addition, a thread is waiting on a resource, 29f4cfaa-27a2-42e7-bfdf-c28df879ce48, that appears to remain held (until the end of the log) after thread 4ae9d40d-ea76-4886-88df-b2849ea22a97 exits.
We need to find out why.
Comment 17 Eduardo Warszawski 2012-09-20 03:41:37 EDT
Moving to infra, as agreed.
Comment 18 Saggi Mizrahi 2012-10-18 15:47:21 EDT
Ended up not being infra after all.
http://gerrit.ovirt.org/#/c/8667/
Comment 19 Eduardo Warszawski 2012-10-21 03:54:53 EDT
The issue, as we agreed, is a task being cleared without releasing its resources.

The task failing and leaving "prepared" volumes is only the trigger, not the problem.

Clearing the task should release the resources.

In addition, the resource logs, despite being very verbose, were of little help in understanding the issue.

These are infra problems.
Comment 20 Ayal Baron 2012-10-21 05:07:52 EDT
*** Bug 864902 has been marked as a duplicate of this bug. ***
Comment 21 Saggi Mizrahi 2012-10-22 11:04:52 EDT
No, because the resource is not taken in the task context, so the task is not responsible for freeing it.
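
The distinction drawn here, together with the teardown() fix described in the Doc Text, can be illustrated with a toy model. This is a sketch under loud assumptions: `ToyTask`, `failing_merge`, and `delete_snapshot` are hypothetical names, plain `threading.Lock` objects stand in for vdsm resources, and releasing the lock stands in for teardown(). None of this is vdsm code.

```python
import threading


class ToyTask:
    """Hypothetical task that frees only the locks it acquired itself."""

    def __init__(self):
        self._owned = []

    def acquire(self, lock):
        lock.acquire()
        self._owned.append(lock)

    def clear(self):
        # Clearing the task releases only what was taken in the task context.
        while self._owned:
            self._owned.pop().release()


def failing_merge():
    raise RuntimeError("Error merging snapshots")


def delete_snapshot(volume_lock, call_teardown):
    task = ToyTask()
    task.acquire(threading.Lock())  # resource owned by the task
    volume_lock.acquire()           # taken outside the task context
    try:
        failing_merge()
    except RuntimeError:
        if call_teardown:
            volume_lock.release()   # stands in for teardown() on the error path
    finally:
        task.clear()


leaked = threading.Lock()
delete_snapshot(leaked, call_teardown=False)
print(leaked.locked())   # True: a second delete would block on this lock

cleaned = threading.Lock()
delete_snapshot(cleaned, call_teardown=True)
print(cleaned.locked())  # False: the error path released the resource
```

In the first call the task is cleared correctly, yet the lock it never owned stays held, which matches the hang seen on the second delete; in the second call the explicit cleanup on the error path avoids it.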
Comment 24 Dafna Ron 2012-11-05 12:35:25 EST
verified on si24
Comment 26 errata-xmlrpc 2012-12-04 13:57:52 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html
