815359 – 3.1 - vdsm: delete snapshot fails and when trying to delete it again task hangs on preparing to finish and vm is stuck in image locked




Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	vdsm
Sub Component:
Version:	6.3
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	urgent
Target Milestone:	beta
Target Release:	---
Assignee:	Saggi Mizrahi
QA Contact:	Dafna Ron
Docs Contact:
URL:
Whiteboard:	infra
Duplicates (1):	864902 (view as bug list)
Depends On:
Blocks:

Reported:	2012-04-23 13:11 UTC by Dafna Ron
Modified:	2022-07-09 05:35 UTC (History)
CC List:	17 users (show)
Fixed In Version:	vdsm-4.9.6-41.0
Doc Type:	Bug Fix
Doc Text:	Previously, deleting snapshots failed and a second attempt to delete the snapshot hung when the deletion was "preparing to finish". This caused the associated virtual machine to become stuck in the "image locked" state for fifty hours until the associated task was declared a zombie and killed. A fix was implemented whereby teardown() was called when errors such as these presented. Attempts to delete snapshots no longer generate zombie tasks.
Clone Of:
Environment:
Last Closed:	2012-12-04 18:57:52 UTC
Target Upstream Version:
Embargoed:

Attachments
logs (819.57 KB, application/x-gzip) 2012-04-23 13:18 UTC, Dafna Ron

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2012:1508	0	normal	SHIPPED_LIVE	Important: rhev-3.1.0 vdsm security, bug fix, and enhancement update	2012-12-04 23:48:05 UTC

Description Dafna Ron 2012-04-23 13:11:54 UTC

Description of problem:

While testing bug 773210, in which we kill dd during delete snapshot, I tried removing the snapshot again after the task had failed.
The task is stuck on "preparing to finish", so the VM remains in the locked state for 50 hours (until the task is declared a zombie and killed).

Following discussions with Eduardo, he thinks that the backend should not send the second delete command.
Doron says that the backend cannot block the second delete command, since it does not know the state at which the failure happened, and the second delete might succeed.

After speaking to Simon, I am opening the bug in vdsm and attaching both vdsm and backend logs.

Version-Release number of selected component (if applicable):

vdsm-4.9.6-8.el6.x86_64

How reproducible:

100%

Steps to Reproduce:
1. create a vm with preallocated disk + wipe and delete
2. create snapshot 
3. delete the snapshot and kill qemu during delete
4. once task is failed try to delete the snapshot again 
  
Actual results:

task is stuck and vm remains in image locked 

Expected results:

1) Task should not hang - even if the image is gone we should move task to finished (looks like resource manager issue). 

2) vdsm should either have a specific error for each failure or offer a different solution to dealing with this issue. 

Additional info: logs attached. 

the second delete's Task id is: 29f4cfaa-27a2-42e7-bfdf-c28df879ce48

Comment 1 Dafna Ron 2012-04-23 13:18:12 UTC

Created attachment 579535
logs

Comment 2 RHEL Program Management 2012-05-05 04:16:06 UTC

Since RHEL 6.3 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 4 Eduardo Warszawski 2012-08-12 16:54:29 UTC

After the first delete image command is issued, new operations on this image should be prevented.

If the operation (task) is lost, the engine should check the status of the image, not the lost task.

The image should be unusable until data and metadata integrity have been positively verified.

Comment 5 Maor 2012-08-22 13:27:14 UTC

Dafna, I'm getting an exception when trying to extract the logs.
Eduardo, I'm not sure I follow your solution:
if we prevent the user from sending the delete command again, how should they get out of this condition?
The user might also have problems with any other operation on the VM (remove, for example).

Comment 8 Maor 2012-08-23 12:33:49 UTC

Regarding the logs, I have now managed to open them (I guess there was some temporary network issue).

Now that I see the log, I see that VDSM reported the task as finished:

Thread-284::DEBUG::2012-04-22 16:34:18,319::taskManager::93::TaskManager::(getTaskStatus) Entry. taskID: 4fd7432b-ec34-4ebe-b140-df61ae789586
Thread-284::DEBUG::2012-04-22 16:34:18,319::taskManager::96::TaskManager::(getTaskStatus) Return. Response: {'code': 252, 'message': 'Error merging snapshots', 'taskState': 'finished', 'taskResult': 'cleanSuccess', 'taskID': '4fd7432b-ec34-4ebe-b140-df61ae789586'}
Thread-284::DEBUG::2012-04-22 16:34:18,320::taskManager::112::TaskManager::(getAllTasksStatuses) Return: {'4fd7432b-ec34-4ebe-b140-df61ae789586': {'code': 252, 'message': 'Error merging snapshots', 'taskState': 'finished', 'taskResult': 'cleanSuccess', 'taskID': '4fd7432b-ec34-4ebe-b140-df61ae789586'}}
Thread-284::INFO::2012-04-22 16:34:18,320::logUtils::39::dispatcher::(wrapper) Run and protect: getAllTasksStatuses, Return response: {'allTasksStatus': {'4fd7432b-ec34-4ebe-b140-df61ae789586': {'code': 252, 'message': 'Error merging snapshots', 'taskState': 'finished', 'taskResult': 'cleanSuccess', 'taskID': '4fd7432b-ec34-4ebe-b140-df61ae789586'}}}
Thread-284::DEBUG::2012-04-22 16:34:18,321::task::1172::TaskManager.Task::(prepare) Task=`4ae9d40d-ea76-4886-88df-b2849ea22a97`::finished: {'allTasksStatus': {'4fd7432b-ec34-4ebe-b140-df61ae789586': {'code': 252, 'message': 'Error merging snapshots', 'taskState': 'finished', 'taskResult': 'cleanSuccess', 'taskID': '4fd7432b-ec34-4ebe-b140-df61ae789586'}}}

So I am moving the needinfo back to Eduardo, regarding comment 5.
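To make the point about the log concrete, here is a minimal sketch that parses the status payload from the getTaskStatus line above and classifies it. The helper name summarize_task_status is hypothetical, and treating a non-zero 'code' as a failure is an assumption for illustration; this is not vdsm's or the engine's actual logic.

```python
import ast

# Status payload copied from the getTaskStatus log line in this comment.
LOGGED_RESPONSE = (
    "{'code': 252, 'message': 'Error merging snapshots', "
    "'taskState': 'finished', 'taskResult': 'cleanSuccess', "
    "'taskID': '4fd7432b-ec34-4ebe-b140-df61ae789586'}"
)


def summarize_task_status(status):
    """Hypothetical helper: classify one task status dict.

    Assumption: a non-zero 'code' means the operation failed.
    """
    finished = status["taskState"] == "finished"
    failed = status["code"] != 0
    if finished and failed:
        return "finished-with-error"
    if finished:
        return "finished-ok"
    return "running"


# The log prints a Python dict literal, so literal_eval can read it safely.
status = ast.literal_eval(LOGGED_RESPONSE)
print(summarize_task_status(status), "-", status["message"])
```

Under these assumptions the first delete is reported as finished but carrying an error, which is the state in which the second delete was sent.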

Comment 16 Eduardo Warszawski 2012-09-13 11:04:04 UTC

In addition, the thread is waiting on a resource, 29f4cfaa-27a2-42e7-bfdf-c28df879ce48, that seems to still be held (until the end of the log) after thread 4ae9d40d-ea76-4886-88df-b2849ea22a97 exited.
We need to find out why.

Comment 17 Eduardo Warszawski 2012-09-20 07:41:37 UTC

Moving to infra, as agreed.

Comment 18 Saggi Mizrahi 2012-10-18 19:47:21 UTC

Ended up not being infra after all.
http://gerrit.ovirt.org/#/c/8667/

Comment 19 Eduardo Warszawski 2012-10-21 07:54:53 UTC

The issue, as we agreed, is a task being cleared without releasing its resources.

The task failing and leaving "prepared" volumes is only the trigger, not the problem.

Clearing the task should release the resources.

In addition, the resource logs, despite being very verbose, were of little help in understanding the issue.

These are infra problems.

Comment 20 Ayal Baron 2012-10-21 09:07:52 UTC

*** Bug 864902 has been marked as a duplicate of this bug. ***

Comment 21 Saggi Mizrahi 2012-10-22 15:04:52 UTC

No, because the resource is not taken in the task context, so the task is not responsible for freeing it.
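The distinction can be sketched as follows: if a resource is acquired outside the task context, the code that acquired it has to release it on every exit path, which is what the Doc Text describes as calling teardown() on error. ToyResourceManager, merge_snapshots and failing_merge are hypothetical names for illustration only; this is neither vdsm's resource manager API nor the actual patch in the gerrit change linked in comment 18.

```python
import threading


class ToyResourceManager:
    """Hypothetical stand-in for a resource manager; not vdsm's API."""

    def __init__(self):
        self._held = set()
        self._lock = threading.Lock()

    def acquire(self, name):
        with self._lock:
            if name in self._held:
                raise RuntimeError("resource busy: %s" % name)
            self._held.add(name)

    def release(self, name):
        with self._lock:
            self._held.discard(name)

    def is_held(self, name):
        with self._lock:
            return name in self._held


def merge_snapshots(manager, image, do_merge):
    """Prepare the image, run the merge, and always tear down."""
    manager.acquire(image)      # "prepare": taken outside any task context
    try:
        do_merge(image)
    finally:
        manager.release(image)  # "teardown": runs on success and on error


def failing_merge(image):
    # Simulates the first delete failing part-way through.
    raise IOError("Error merging snapshots")


manager = ToyResourceManager()
try:
    merge_snapshots(manager, "image-1", failing_merge)
except IOError:
    pass  # the first attempt fails, but the resource was released

# Because teardown ran, a second attempt is not blocked on the resource.
second_attempt_ran = []
merge_snapshots(manager, "image-1", second_attempt_ran.append)
```

Without the finally clause, the failed first attempt would leave "image-1" held, and in this toy model the second attempt would fail at acquire() rather than proceed.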

Comment 24 Dafna Ron 2012-11-05 17:35:25 UTC

verified on si24

Comment 26 errata-xmlrpc 2012-12-04 18:57:52 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html
