Bug 1263010

Summary: Heat stack delete failure due to vnx cinder volumes

Product: Red Hat OpenStack
Component: openstack-cinder
Version: 6.0 (Juno)
Target Release: 6.0 (Juno)
Hardware: Unspecified
OS: Linux
Severity: high
Priority: high
Status: CLOSED CURRENTRELEASE
Target Milestone: async
Keywords: ZStream
Reporter: nalmond
Assignee: Eric Harney <eharney>
QA Contact: nlevinki <nlevinki>
CC: bschmaus, dhill, dmaley, egafford, eharney, hrivero, jmelvin, jobernar, jruzicka, jwaterwo, mburns, nalmond, nobody, pbandark, rhel-osp-director-maint, rreddy, sbaker, scohen, sgotliv, shardy, srevivo
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-10-14 15:13:45 UTC
Bug Depends On: 1301487

Description nalmond 2015-09-14 21:52:55 UTC
What problem/issue/behavior are you having trouble with?  What do you expect to see?

Users are unable to delete failed stack deployments.

Cinder volumes are stuck in the error_deleting state. The delete attempts report:
"/var/log/cinder/volume.log:2015-09-14 12:14:14.949 9642 ERROR oslo.messaging.rpc.dispatcher [req-f02fc547-435c-48ec-8904-aef8b8feec7b 4b3bbdf7df334481a6fbfe196c7416cf ec8cbc2a9ad942bba26fbcf599f83d72 - - -] Exception during message handling: Bad or unexpected response from the storage volume backend API: EMCVnxCLICmdError : ['lun', '-destroy', '-name', u'volume-db31e942-d89e-458d-a138-1f312770b058', '-forceDetach', '-o'] (Return Code: 156) (Output: [u"Cannot unbind LUN because it's contained in a Storage Group\\n"])"

Where are you experiencing the behavior?  What environment?

Pre-production environment with the EMC VNX Cinder driver.

What information can you provide around timeframes and urgency?

Somewhat urgent. This is currently delaying vendors deploying their stacks: they hit stack failures and then can't delete the failed stacks to try again.

The three sosreports can be found here: http://collab-shell.usersys.redhat.com/01507844/

Comment 3 Steve Baker 2015-09-14 22:06:44 UTC
Can you delete the volume using the cinder CLI? If not, then this bug needs to be reassigned to the openstack-cinder component.
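
For example, something like this (volume ID taken from the description; a volume already stuck in error_deleting may first need its state reset before Cinder will accept the delete):

  cinder list | grep error_deleting
  cinder delete db31e942-d89e-458d-a138-1f312770b058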

Comment 4 nalmond 2015-09-14 22:27:11 UTC
Yes, the volume can be deleted using the cinder cli.

It would seem the driver is not correctly unmapping the LUN via the disconnect_host_from_storage_group() function before attempting to delete the volume. Manually creating and deleting LUNs via the cinder CLI works fine as long as they are not mapped.
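
If array access is available, the storage-group membership that blocks the unbind can also be checked directly with naviseccli. This is a hedged sketch; the SP address, credentials and scope are placeholders, and the exact flags may vary by Block OE version:

  # a LUN that still appears in a storage group is what makes
  # 'lun -destroy ... -forceDetach' return code 156
  naviseccli -h <SP-A-address> -user <user> -password <pass> -scope 0 storagegroup -list
  naviseccli -h <SP-A-address> -user <user> -password <pass> -scope 0 lun -list -name volume-db31e942-d89e-458d-a138-1f312770b058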

Comment 5 Steve Baker 2015-09-14 22:48:19 UTC
If heat stack-delete is attempted a second time on the DELETE_FAILED stack, does the delete complete successfully?

Can you please attach the full output for heat event-list for the first and second delete attempts?
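
For example (stack name/ID is a placeholder):

  heat stack-list | grep -i failed                        # identify the DELETE_FAILED stack(s)
  heat event-list <stack-name-or-id> > delete-attempt-events.txt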

Comment 6 Zane Bitter 2015-09-14 23:14:32 UTC
I'm not seeing the error you report in the log; instead I see a lot of lines of the form:

INFO heat.engine.stack [-] Stack DELETE FAILED (me132-ALU-LCP-card03pair-hvhvfsjwa6kl): Resource DELETE failed: BadRequest: Invalid volume: Volume status must be available or error, but current status is: error_deleting (HTTP 400)

So Cinder is returning a 400 Error, complaining that we can't delete a volume that is in the error_deleting state.

The only potentially relevant Heat patch I can find is https://git.openstack.org/cgit/openstack/heat/commit/?id=6aca29e05bb0cfe469ed17f4ab9bb9351b2f108a which prevents Heat from attempting to delete a volume that is in the 'deleting' state already. Heat does not issue multiple deletes to Cinder within a single stack delete attempt, but if you attempt to delete a stack multiple times without waiting for it to fail then it will stop the previous attempt and try again. There is a known race condition in Cinder that causes volumes to lock up in unrecoverable states when you issue multiple detach calls (see https://bugs.launchpad.net/bugs/1298350), and it may well be that something similar exists for deletes. If that is the case then it's possible that issuing multiple Heat stack-deletes on a single stack without waiting for the in-progress ones to finish could trigger this bug, in which case a backport of the patch above would prevent it.

Comment 7 Zane Bitter 2015-09-15 17:20:59 UTC
Here's what I've been able to extract from the heat-engine log on controller 3. I'm looking at a single nested stack (card03pair) for simplicity. First, the stack is deleted at 10:47:

2015-09-14 10:47:48.107 58929 INFO heat.engine.stack [-] Stack DELETE IN_PROGRESS (me132-ALU-LCP-card03pair-hvhvfsjwa6kl): Stack DELETE started

Note that the sibling nested stacks card01pair and card02pair delete successfully. The stack we're looking at starts deleting, and issues delete calls to Cinder for the two volumes it contains:

2015-09-14 10:47:54.428 58929 INFO heat.engine.resource [-] deleting CinderVolume "ALU-LCP-Block-StorageB" [ca9cff01-6a11-4cac-920c-4e70b80b4325] Stack "me132-ALU-LCP-card03pair-hvhvfsjwa6kl" [bf61dfe7-c4f9-4a82-a295-63621ab3f525]
2015-09-14 10:47:56.508 58929 INFO heat.engine.resource [-] deleting CinderVolume "ALU-LCP-Block-StorageA" [a602718b-c8d7-4b3c-b7ef-379478461828] Stack "me132-ALU-LCP-card03pair-hvhvfsjwa6kl" [bf61dfe7-c4f9-4a82-a295-63621ab3f525]

Both of these fail in Cinder (on controller 1):

2015-09-14 10:47:56.342 9642 ERROR oslo.messaging.rpc.dispatcher [req-5b6397f0-61e9-47e6-b46f-05ee861043b8 94a49664326149f181ce3c78895151ef ec8cbc2a9ad942bba26fbcf599f83d72 - - -] Exception during message handling: Bad or unexpected response from the storage volume backend API: EMCVnxCLICmdError : ['lun', '-destroy', '-name', u'volume-ca9cff01-6a11-4cac-920c-4e70b80b4325', '-forceDetach', '-o'] (Return Code: 156) (Output: [u"Cannot unbind LUN because it's contained in a Storage Group\\n"])
2015-09-14 10:47:58.741 9642 ERROR oslo.messaging.rpc.dispatcher [req-260fbc74-fff9-48a2-b25e-88e1158fb44d 94a49664326149f181ce3c78895151ef ec8cbc2a9ad942bba26fbcf599f83d72 - - -] Exception during message handling: Bad or unexpected response from the storage volume backend API: EMCVnxCLICmdError : ['lun', '-destroy', '-name', u'volume-a602718b-c8d7-4b3c-b7ef-379478461828', '-forceDetach', '-o'] (Return Code: 156) (Output: [u"Cannot unbind LUN because it's contained in a Storage Group\\n"])

The volumes go into the error_deleting state. Heat does not recognise this as an error; it just keeps polling the volume, waiting for it to disappear, until the stack times out. It's likely that someone notices this, because 6 minutes later at 10:53 they issue a second stack delete:

2015-09-14 10:53:18.149 58929 INFO heat.engine.stack [-] Stack DELETE IN_PROGRESS (me132-ALU-LCP-card03pair-hvhvfsjwa6kl): Stack DELETE started

This second delete command causes Heat to stop work on the previous stack delete and start another. When it gets to the first volume, it attempts to delete it from Cinder a second time. (This is what the patch in comment 6 would avoid if the volume were in the 'deleting' state; however, it is already in the 'error_deleting' state.) Cinder responds by complaining about the volume status:

2015-09-14 10:53:20.189 58929 INFO heat.engine.resource [-] deleting CinderVolume "ALU-LCP-Block-StorageA" [a602718b-c8d7-4b3c-b7ef-379478461828] Stack "me132-ALU-LCP-card03pair-hvhvfsjwa6kl" [bf61dfe7-c4f9-4a82-a295-63621ab3f525]
2015-09-14 10:53:20.431 58929 INFO heat.engine.resource [-] DELETE: CinderVolume "ALU-LCP-Block-StorageA" [a602718b-c8d7-4b3c-b7ef-379478461828] Stack "me132-ALU-LCP-card03pair-hvhvfsjwa6kl" [bf61dfe7-c4f9-4a82-a295-63621ab3f525]
2015-09-14 10:53:20.456 58929 INFO heat.engine.stack [-] Stack DELETE FAILED (me132-ALU-LCP-card03pair-hvhvfsjwa6kl): Resource DELETE failed: BadRequest: Invalid volume: Volume status must be available or error, but current status is: error_deleting (HTTP 400) (Request-ID: req-0f5bbb3b-5443-4195-9729-68e15418fd1e)

This causes the delete to fail. At 10:59 there is another attempt:

2015-09-14 10:59:25.985 58929 INFO heat.engine.stack [-] Stack DELETE IN_PROGRESS (me132-ALU-LCP-card03pair-hvhvfsjwa6kl): Stack DELETE started

Which results in the same thing happening to the second volume:

2015-09-14 10:59:28.024 58929 INFO heat.engine.resource [-] deleting CinderVolume "ALU-LCP-Block-StorageB" [ca9cff01-6a11-4cac-920c-4e70b80b4325] Stack "me132-ALU-LCP-card03pair-hvhvfsjwa6kl" [bf61dfe7-c4f9-4a82-a295-63621ab3f525]
2015-09-14 10:59:28.312 58929 INFO heat.engine.resource [-] DELETE: CinderVolume "ALU-LCP-Block-StorageB" [ca9cff01-6a11-4cac-920c-4e70b80b4325] Stack "me132-ALU-LCP-card03pair-hvhvfsjwa6kl" [bf61dfe7-c4f9-4a82-a295-63621ab3f525]
2015-09-14 10:59:28.337 58929 INFO heat.engine.stack [-] Stack DELETE FAILED (me132-ALU-LCP-card03pair-hvhvfsjwa6kl): Resource DELETE failed: BadRequest: Invalid volume: Volume status must be available or error, but current status is: error_deleting (HTTP 400) (Request-ID: req-2b210c35-0824-4adf-86fc-03da21a77ca6)

Again the stack fails, and there are two further attempts to delete it, with similar results (elided).

2015-09-14 11:00:19.414 58929 INFO heat.engine.stack [-] Stack DELETE IN_PROGRESS (me132-ALU-LCP-card03pair-hvhvfsjwa6kl): Stack DELETE started
2015-09-14 11:03:40.385 58929 INFO heat.engine.stack [-] Stack DELETE IN_PROGRESS (me132-ALU-LCP-card03pair-hvhvfsjwa6kl): Stack DELETE started


From the Heat perspective, the workaround is to remove the resources from Cinder manually if possible. After that, the stack delete should work.
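
A hedged sketch of that manual cleanup, using one of the volume IDs from the logs above (resetting the state only clears Cinder's bookkeeping; the delete can fail again until the LUN is actually unmapped on the array):

  cinder reset-state --state error a602718b-c8d7-4b3c-b7ef-379478461828
  cinder delete a602718b-c8d7-4b3c-b7ef-379478461828
  # repeat for the second volume, then retry deleting the parent stack
  heat stack-delete <top-level-stack-name>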

Two changes could potentially improve Heat:
* recognise the error_deleting state as an immediate error
* backport the patch from comment 6 to prevent issuing a second delete to Cinder when in the 'deleting' state

However, these are comparatively minor improvements. The real issues are:
* That deleting the Cinder volume fails in the first place; and
* That there appears to be no way to delete a volume once it is in the "error_deleting" state.

Comment 4 above indicates that it *is* possible to delete a volume manually. I'm not sure how this is possible, since Heat calls exactly the same API to delete a volume as the command line cinder client does. It's a very basic API - an HTTP DELETE on the URL of the volume, no data is passed other than through headers - with no apparent room for variation between different clients. There is no evidence in the logs of a volume that has failed to be deleted ever being successfully deleted in the future, though of course it may have occurred after these logs were captured.
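
For illustration, the call both clients end up making is roughly the following (endpoint, tenant ID and token are deployment-specific placeholders):

  curl -i -X DELETE -H "X-Auth-Token: $TOKEN" \
       http://<cinder-api-host>:8776/v2/<tenant-id>/volumes/a602718b-c8d7-4b3c-b7ef-379478461828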

There is also no reason to suspect that the initial deletion error is caused by Heat doing something in the wrong order. The volume status immediately prior to the delete is 'available' and the attach_status 'detached'. Therefore, I'm reassigning this bug to the Cinder component for further investigation.

Comment 8 Sergey Gotliv 2015-09-16 11:52:06 UTC
Currently, in order to delete a volume in the "error_deleting" state you have to reset the volume's state [1]; however, after doing this you may face the same problem that led to the "error_deleting" state in the first place [2].



[1] usage: cinder reset-state [--state <state>] <volume> [<volume> ...]

Explicitly updates the volume state.

Positional arguments:
  <volume>         Name or ID of volume to modify.

Optional arguments:
  --state <state>  The state to assign to the volume. Valid values are
                   "available," "error," "creating," "deleting," and
                   "error_deleting." Default=available.

[2]
2015-09-14 10:47:58.741 9642 ERROR oslo.messaging.rpc.dispatcher [req-260fbc74-fff9-48a2-b25e-88e1158fb44d 94a49664326149f181ce3c78895151ef ec8cbc2a9ad942bba26fbcf599f83d72 - - -] Exception during message handling: Bad or unexpected response from the storage volume backend API: EMCVnxCLICmdError : ['lun', '-destroy', '-name', u'volume-a602718b-c8d7-4b3c-b7ef-379478461828', '-forceDetach', '-o'] (Return Code: 156) (Output: [u"Cannot unbind LUN because it's contained in a Storage Group\\n"])

Comment 25 Sergey Gotliv 2015-11-10 07:49:07 UTC
This backport (comment #24) requires a newer version of taskflow and additional tests to verify that the upgrade doesn't break anything.

Comment 30 Mike McCune 2016-03-28 22:34:54 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation; please see mmccune with any questions.

Comment 31 David Hill 2016-04-24 20:24:22 UTC
Hello sir,

   Do we have any updates on this? It seems to have been stale for about a month, and we have another customer hitting this issue.

Thank you very much,

David Hill

Comment 34 David Hill 2016-07-31 15:28:28 UTC
Hello guys,

   Any updates on this? It's been in this state for about 2 months.

Thank you very much,

David Hill

Comment 37 Eric Harney 2016-10-14 15:13:45 UTC
OSP6 Hotfix build information is in comment #35.  Customers running OSP7+ should not encounter this issue.

Comment 38 David Hill 2016-11-14 15:36:50 UTC
Will reopen if we hit this again.