Bug 1622072

Summary: Openstack didn't remove volume on instance deletion
Product: Red Hat OpenStack
Component: openstack-nova
Version: 10.0 (Newton)
Status: CLOSED EOL
Severity: medium
Priority: medium
Reporter: Pablo Iranzo Gómez <pablo.iranzo>
Assignee: Francois Palin <fpalin>
QA Contact: OSP DFG:Compute <osp-dfg-compute>
CC: astupnik, dasmith, eglynn, fpalin, geguileo, igarciam, jhakimra, kchamart, lyarwood, nlevinki, pablo.iranzo, sbauza, scohen, sgordon, srevivo, tvvcox, vromanso
Keywords: Triaged, ZStream
Hardware: Unspecified
OS: Linux
Type: Bug
Clones: 1827413
Bug Blocks: 1827413, 1827416, 1827419, 1827420
Last Closed: 2021-07-07 10:38:34 UTC

Description Pablo Iranzo Gómez 2018-08-24 11:34:25 UTC
Description of problem:

Hi
On instance deletion, the backing volume is not removed.

This sounded like https://bugzilla.redhat.com/show_bug.cgi?id=1198169, which was closed for OSP6, and like what is described in the kbase article https://access.redhat.com/solutions/3076931.


Version-Release number of selected component (if applicable):

openstack-cinder-9.1.4-12.el7ost.noarch

How reproducible:

When we deploy a non-ephemeral instance (i.e., creating a new volume) and select "Yes" for "Delete Volume on Instance Delete", it does not work properly: if we delete the instance, the volume is not removed. Its status remains "In-use" and "Attached to None on /dev/vda".
An example: 
abcfa1db-1748-4f04-9a29-128cf22efcc5	- 	130GiB 	In-use 	- 	Attached to None on /dev/vda
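
A minimal reproduction sketch of the same scenario from the API side (this is only an illustration: the auth parameters, IMAGE_ID, FLAVOR_ID and NET_ID below are placeholders, and the sleeps stand in for proper status polling):

    import time

    from keystoneauth1.identity import v3
    from keystoneauth1 import session
    from novaclient import client as nova_client
    from cinderclient import client as cinder_client

    auth = v3.Password(auth_url='http://controller:5000/v3',
                       username='admin', password='secret',
                       project_name='admin',
                       user_domain_id='default', project_domain_id='default')
    sess = session.Session(auth=auth)
    nova = nova_client.Client('2.1', session=sess)
    cinder = cinder_client.Client('2', session=sess)

    # Boot from a new volume and mark it for deletion with the instance,
    # i.e. the "Delete Volume on Instance Delete" option in Horizon.
    server = nova.servers.create(
        name='bfv-repro', image=None, flavor='FLAVOR_ID',
        nics=[{'net-id': 'NET_ID'}],
        block_device_mapping_v2=[{
            'uuid': 'IMAGE_ID',
            'source_type': 'image',
            'destination_type': 'volume',
            'volume_size': 130,
            'boot_index': 0,
            'delete_on_termination': True,
        }])

    time.sleep(120)  # crude wait for the instance to become ACTIVE
    vol_id = nova.volumes.get_server_volumes(server.id)[0].id

    nova.servers.delete(server)
    time.sleep(120)  # crude wait for the deletion to finish

    # On an affected system the volume has not been deleted: it is still
    # "in-use" and its attachment points at an instance that no longer exists.
    vol = cinder.volumes.get(vol_id)
    print(vol.status, vol.attachments)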

Comment 4 Gorka Eguileor 2018-08-24 14:15:26 UTC
I see errors in the volume logs caused by a missing default volume type named HBSDVSPG200. We should check the configuration to see whether this is expected, which is possible since these errors are happening in the Cinder API service.
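
A quick way to confirm whether that type is actually defined (a hypothetical check, reusing an authenticated cinderclient object named cinder as in the sketch above):

    # Does the volume type Cinder complains about exist at all?
    type_names = [vt.name for vt in cinder.volume_types.list()]
    print('HBSDVSPG200' in type_names)
    # If it does not, check the default_volume_type option in cinder.conf
    # on the API nodes.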

With the logs at INFO level it is hard to tell precisely what's going on, but it all points to Nova ignoring an error on the terminate-connection call (so the volume is still attached) and then trying to delete the volume, which fails because it is still attached.

The error that Nova is ignoring is Cinder timing out at the API service on what I assume is the terminate-connection call, but we cannot know why, since there are no log entries on the Volume service during the minute the API waits before timing out.

We would need DEBUG log levels on the Cinder services to tell what's going on with the terminate connection.
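
To make the suspected sequence concrete, here is a rough sketch of the pattern described above (this is not Nova's actual code; cinder, volume_id and connector are placeholders):

    from cinderclient import exceptions as cinder_exc

    def delete_boot_volume(cinder, volume_id, connector):
        try:
            # If this call times out while cinder-api waits on cinder-volume,
            # the attachment is never torn down and the volume stays "in-use".
            cinder.volumes.terminate_connection(volume_id, connector)
        except Exception:
            # Swallowing the failure is the suspected problem: cleanup
            # continues even though the volume is still marked as attached.
            pass
        try:
            # Cinder refuses to delete an attached volume, so the volume is
            # left behind, "In-use" and "Attached to None".
            cinder.volumes.delete(volume_id)
        except cinder_exc.BadRequest:
            pass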

Comment 6 Lee Yarwood 2018-08-30 11:41:36 UTC
(In reply to Gorka Eguileor from comment #4)
> I see errors in the volume logs caused by a missing default volume type
> named HBSDVSPG200. We should check the configuration to see whether this is
> expected, which is possible since these errors are happening in the Cinder
> API service.
> 
> With the logs at INFO level it is hard to tell precisely what's going on,
> but it all points to Nova ignoring an error on the terminate-connection
> call (so the volume is still attached) and then trying to delete the
> volume, which fails because it is still attached.

I can see the os-terminate_connection failures due to RPC timeouts to c-vol in the c-api logs, but I can't match them up with anything on the n-cpu side. Most of these appear to be successful anyway, AFAICT.

Looking at the n-cpu code in Newton, I can see how failures in os-terminate_connection could result in this behaviour, as we wouldn't call Cinder to actually detach the volume from the server, but there's zero evidence of this happening in the logs.

> The error that Nova is ignoring is Cinder timing out at the API service on
> what I assume is the terminate-connection call, but we cannot know why,
> since there are no log entries on the Volume service during the minute the
> API waits before timing out.
> 
> We would need DEBUG log levels on the Cinder services to tell what's going
> on with the terminate connection.

Pablo, can we get DEBUG logs from Nova and Cinder, along with an example instance UUID so I can trace this? The example UUID in c#0 isn't present anywhere in the sosreports.
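
If it helps with finding an instance UUID to trace, this small sketch (assuming the same authenticated admin cinder client as in the earlier sketch) prints the server recorded in the attachment of every volume that is stuck "in-use"; for the DEBUG logs, setting debug = True in the [DEFAULT] section of nova.conf and cinder.conf and restarting the services should be enough:

    # Print each stuck volume together with the instance UUID and device
    # recorded in its attachment, so the instance can be traced in the logs.
    for vol in cinder.volumes.list(search_opts={'all_tenants': 1}):
        if vol.status == 'in-use':
            for att in vol.attachments:
                print(vol.id, att.get('server_id'), att.get('device'))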

Comment 10 Alex Stupnikov 2018-12-10 08:44:30 UTC
Hello. May I ask you to update this bug and let me know if support could provide something for you? BR, Alex.