Bug 1622072 - OpenStack didn't remove volume on instance deletion
Summary: OpenStack didn't remove volume on instance deletion
Keywords:
Status: CLOSED EOL
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Francois Palin
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On:
Blocks: 1827413 1827416 1827419 1827420
 
Reported: 2018-08-24 11:34 UTC by Pablo Iranzo Gómez
Modified: 2023-03-21 18:58 UTC
CC: 17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1827413
Environment:
Last Closed: 2021-07-07 10:38:34 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1834659 0 None None None 2019-06-28 15:39:10 UTC
OpenStack gerrit 669674 0 None MERGED Add retry to cinder API calls related to volume detach 2020-10-17 21:11:59 UTC
Red Hat Issue Tracker OSP-3138 0 None None None 2022-08-23 18:48:40 UTC
Red Hat Knowledge Base (Solution) 3076931 0 None None None 2018-08-24 11:34:24 UTC

Description Pablo Iranzo Gómez 2018-08-24 11:34:25 UTC
Description of problem:

Hi
On instance deletion, the backing volume is not removed.

This sounds like https://bugzilla.redhat.com/show_bug.cgi?id=1198169, which was closed for OSP6, and matches what is described in the kbase article: https://access.redhat.com/solutions/3076931.


Version-Release number of selected component (if applicable):

openstack-cinder-9.1.4-12.el7ost.noarch

How reproducible:

When we deploy a non-ephemeral instance (i.e. creating a new volume) and select "YES" for "Delete Volume on Instance delete", it does not work properly: if we delete the instance, the volume is not removed. The status remains "In-use" and "Attached to None on /dev/vda" (a reproduction sketch follows the example below).
An example: 
abcfa1db-1748-4f04-9a29-128cf22efcc5	- 	130GiB 	In-use 	- 	Attached to None on /dev/vda
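For reference, a minimal reproduction sketch, assuming a Newton-era python-novaclient and keystoneauth1; the auth settings, flavor name, image and network IDs below are placeholders, not values from the affected cloud:

# Reproduction sketch (assumption: python-novaclient + keystoneauth1 are
# available; auth values, flavor, image and network IDs are placeholders).
from keystoneauth1 import loading, session
from novaclient import client as nova_client

loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(
    auth_url='http://controller:5000/v3',
    username='admin', password='secret', project_name='admin',
    user_domain_name='Default', project_domain_name='Default')
nova = nova_client.Client('2.1', session=session.Session(auth=auth))

# Boot from a new 130 GiB volume and ask Nova to delete it with the instance,
# i.e. "Delete Volume on Instance delete: Yes" in Horizon.
server = nova.servers.create(
    name='bfv-test',
    image=None,
    flavor=nova.flavors.find(name='m1.small'),
    nics=[{'net-id': 'NETWORK_UUID'}],
    block_device_mapping_v2=[{
        'boot_index': 0,
        'uuid': 'IMAGE_UUID',
        'source_type': 'image',
        'destination_type': 'volume',
        'volume_size': 130,
        'delete_on_termination': True,
    }])

# ... wait for the server to become ACTIVE, then delete it ...
nova.servers.delete(server)

# Expected: the boot volume is removed together with the instance.
# Observed: the volume stays "In-use" and shows "Attached to None on /dev/vda".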

Comment 4 Gorka Eguileor 2018-08-24 14:15:26 UTC
I see errors in the volume logs caused by a missing default volume type named HBSDVSPG200. We should check the configuration to see whether this is expected, which is possible since these errors are happening on the Cinder API.

With the logs at INFO level it is hard to tell precisely what is going on, but everything points to Nova ignoring an error on the terminate connection call (so the volume is still attached) and then trying to delete the volume, which cannot be deleted since it is still attached.

The error that Nova is ignoring is Cinder timing out at the API service on what I assume is the terminate connection call, but we cannot know why, since there are no log entries on the Volume service during the minute the API waits before timing out.

We would need DEBUG log levels on the Cinder services to tell what is going on with the terminate connection.
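To make the suspected sequence concrete, here is a simplified, purely illustrative sketch of that ordering problem; this is not the actual Nova code, and cinder.terminate_connection / cinder.detach / cinder.delete merely stand in for the Cinder API calls Nova makes during instance cleanup:

import logging

LOG = logging.getLogger(__name__)

def cleanup_volume(cinder, volume_id, connector):
    try:
        # cinder-api proxies this to cinder-volume over RPC; if cinder-volume
        # never answers, the call times out after about a minute and raises.
        cinder.terminate_connection(volume_id, connector)
        cinder.detach(volume_id)
    except Exception:
        # If the failure is only logged and swallowed, Cinder still considers
        # the volume attached ("in-use")...
        LOG.warning("Ignoring error while detaching volume %s", volume_id)

    # ...so this delete is rejected, because attached volumes cannot be
    # deleted, and the volume is leaked as "Attached to None on /dev/vda".
    cinder.delete(volume_id)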

Comment 6 Lee Yarwood 2018-08-30 11:41:36 UTC
(In reply to Gorka Eguileor from comment #4)
> I see errors in the volume logs caused by a missing default volume type
> named HBSDVSPG200. We should check the configuration to see whether this
> is expected, which is possible since these errors are happening on the
> Cinder API.
> 
> With the logs at INFO level it is hard to tell precisely what is going on,
> but everything points to Nova ignoring an error on the terminate connection
> call (so the volume is still attached) and then trying to delete the
> volume, which cannot be deleted since it is still attached.

I can see the os-terminate_connection failures due to RPC timeouts to c-vol in the c-api logs, but I can't match them up to anything on the n-cpu side. Most of these appear to be successful anyway, AFAICT.

Looking at the n-cpu code in Newton, I can see how failures in os-terminate_connection could result in this behaviour, since we wouldn't then call Cinder to actually detach the volume from the server, but there's zero evidence of this happening in the logs.
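(As an aside: the upstream change later linked to this bug, "Add retry to cinder API calls related to volume detach", addresses exactly this class of failure by retrying the detach-side Cinder calls instead of giving up on the first timeout. A minimal, hedged sketch of that pattern, not the merged patch itself:)

import time

def call_with_retries(func, *args, retries=3, delay=5, **kwargs):
    """Retry a flaky Cinder API call a few times before giving up.

    Illustrative only; the retry count, delay and the exceptions caught
    would need to match the real client behaviour.
    """
    for attempt in range(1, retries + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == retries:
                raise  # surface the failure instead of silently leaking the volume
            time.sleep(delay)

# e.g. call_with_retries(cinder.terminate_connection, volume_id, connector)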

> The error that Nova is ignoring is Cinder timing out at the API service on
> what I assume is the terminate connection call, but we cannot know why,
> since there are no log entries on the Volume service during the minute the
> API waits before timing out.
> 
> We would need DEBUG log levels on the Cinder services to tell what is
> going on with the terminate connection.

Pablo, can we get DEBUG logs from Nova and Cinder, along with an example instance UUID so I can trace this? The example UUID in c#0 isn't present anywhere in the sosreports.

Comment 10 Alex Stupnikov 2018-12-10 08:44:30 UTC
Hello. May I ask you to update this bug and let me know whether there is anything support can provide for you? BR, Alex.

