Description of problem:

Currently nova executes the following steps while detaching a volume from a VM:

1. Detach the block device from the VM and from the compute node where the VM is running
2. Call cinder to terminate the volume connection
3. Remove the block device mapping (BDM) record from nova

However, when the API call to terminate the volume connection fails for some reason, an invalid BDM record can be left in the nova database. In one customer case, the API call to terminate the volume attachment is slow because of the behavior of the backend volume, and API calls frequently fail with a 504 error caused by a timeout in the httpd instance hosting cinder-api. In such a case cinder completes terminating the volume attachment and the volume reaches "available" status with no attachment left, while 504 is returned to nova and the BDM record is never deleted from nova's database.

Version-Release number of selected component (if applicable):

How reproducible:
Frequently

Steps to Reproduce:
1. Create an instance and attach a volume to the instance
2. Detach the volume from the instance

Actual results:
The BDM record is left in nova. The volume is detached from the VM and its host, and the volume is in "available" status in cinder.

Expected results:
The BDM record is also deleted from nova.

Additional info:
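The failure mode described above can be sketched with a few stand-in classes (hypothetical helpers, not nova's actual code): when step 2 raises on a 504, step 3 never runs, so the BDM survives even though both the hypervisor and cinder have finished the detach.

```python
class GatewayTimeout(Exception):
    """Stand-in for cinderclient's HTTP 504 ClientException."""

class FakeDriver:
    def __init__(self):
        self.attached = True
    def detach(self, volume_id):
        # hypervisor-side detach succeeds (step 1)
        self.attached = False

class FakeCinder:
    def __init__(self, fail=False):
        self.fail = fail
        self.connected = True
    def terminate_connection(self, volume_id):
        # cinder finishes the detach server-side even when httpd
        # returns 504 to the caller after a timeout
        self.connected = False
        if self.fail:
            raise GatewayTimeout("Gateway Timeout (HTTP 504)")

class FakeDB:
    def __init__(self):
        self.bdms = {"vol-1"}
    def delete_bdm(self, volume_id):
        self.bdms.discard(volume_id)

def detach_volume(driver, cinder, db, volume_id):
    driver.detach(volume_id)                # 1. detach from the VM / compute host
    cinder.terminate_connection(volume_id)  # 2. may raise on a 504 timeout
    db.delete_bdm(volume_id)                # 3. never runs if step 2 raised
```

Running the sketch with a failing cinder call leaves the fake BDM set non-empty while both the driver and cinder report the volume gone, which is exactly the inconsistent state in this report.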
I can simulate a similar situation in my lab. Note that in my case cinder was not functional at all, so the volume attachment was not deleted from cinder.

1. Create an instance (test2) and attach a volume (testvolume) to the instance.
~~~
(overcloud) [stack@undercloud-0 ~]$ openstack server show test2
+-------------------------------------+---------------------------------------+
| Field                               | Value                                 |
+-------------------------------------+---------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | compute-1.redhat.local |
| OS-EXT-SRV-ATTR:hostname | test2 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute-1.redhat.local |
| OS-EXT-SRV-ATTR:instance_name | instance-0000000a |
| OS-EXT-SRV-ATTR:kernel_id | |
| OS-EXT-SRV-ATTR:launch_index | 0 |
| OS-EXT-SRV-ATTR:ramdisk_id | |
| OS-EXT-SRV-ATTR:reservation_id | r-c5zj577r |
| OS-EXT-SRV-ATTR:root_device_name | /dev/vda |
| OS-EXT-SRV-ATTR:user_data | None |
| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2021-08-18T01:27:11.000000 |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | private=192.168.10.207 |
| config_drive | |
| created | 2021-08-18T01:27:01Z |
| description | None |
| flavor | disk='1', ephemeral='0', extra_specs.hw_rng:allowed='True', original_name='m1.nano', ram='128', swap='0', vcpus='1' |
| hostId | 9bec297ebfece4b6798695f950595eea458e93548257dec7f346266b |
| host_status | UP |
| id | 3a9b54b5-35b7-4427-bc64-f88b0d8949b3 |
| image | cirros-0.4.0-x86_64-disk.img (89a76ef6-2c14-4ff0-934f-fbd369fb0b1e) |
| key_name | None |
| locked | False |
| locked_reason | None |
| name | test2 |
| progress | 0 |
| project_id | e722c8ab654a440e8d3cdc5ef592f21b |
| properties | |
| security_groups | name='ssh-and-icmp' |
| server_groups | [] |
| status | ACTIVE |
| tags | [] |
| trusted_image_certificates | None |
| updated | 2021-08-18T01:27:11Z |
| user_id | 99f5628a6a55449fae14d97881a054ce |
| volumes_attached | delete_on_termination='False', id='35ff73cd-ebe2-40b3-8be6-f91aca202aa8' |
+-------------------------------------+---------------------------------------+

(overcloud) [stack@undercloud-0 ~]$ openstack volume show testvolume
+--------------------------------+--------------------------------------------+
| Field                          | Value                                      |
+--------------------------------+--------------------------------------------+
| attachments | [{'id': '35ff73cd-ebe2-40b3-8be6-f91aca202aa8', 'attachment_id': '34a2c438-aaaa-4000-8fb6-6b4c41d0b978', 'volume_id': '35ff73cd-ebe2-40b3-8be6-f91aca202aa8', 'server_id': '3a9b54b5-35b7-4427-bc64-f88b0d8949b3', 'host_name': 'compute-1.redhat.local', 'device': '/dev/vdb', 'attached_at': '2021-09-09T12:08:32.000000'}] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2021-09-09T12:07:51.000000 |
| description | None |
| encrypted | False |
| id | 35ff73cd-ebe2-40b3-8be6-f91aca202aa8 |
| migration_status | None |
| multiattach | False |
| name | testvolume |
| os-vol-host-attr:host | hostgroup@tripleo_iscsi#tripleo_iscsi |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | e722c8ab654a440e8d3cdc5ef592f21b |
| properties | |
| replication_status | None |
| size | 1 |
| snapshot_id | None |
| source_volid | None |
| status | in-use |
| type | tripleo |
| updated_at | 2021-09-09T12:08:35.000000 |
| user_id | 99f5628a6a55449fae14d97881a054ce |
+--------------------------------+--------------------------------------------+
~~~

2. Disable the cinder-volume service. From this point, any volume operation which requires interaction with the backend storage no longer succeeds.
~~~
[heat-admin@controller-0 ~]$ sudo pcs resource disable openstack-cinder-volume
[heat-admin@controller-0 ~]$ sudo pcs status
...
Full List of Resources:
...
  * Container bundle: openstack-cinder-volume [cluster.common.tag/rhosp16-openstack-cinder-volume:pcmklatest]:
    * openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Stopped (disabled)
~~~

3. Detach the volume from the instance.
~~~
(overcloud) [stack@undercloud-0 ~]$ openstack server remove volume test2 testvolume
(overcloud) [stack@undercloud-0 ~]$
~~~

4. nova-compute fails to detach the volume because the API call to delete the volume attachment fails.
~~~
2021-09-09 12:11:14.000 7 INFO nova.compute.manager [req-d4ea1834-53a1-4786-8c9a-f495307ec9d3 99f5628a6a55449fae14d97881a054ce e722c8ab654a440e8d3cdc5ef592f21b - default default] [instance: 3a9b54b5-35b7-4427-bc64-f88b0d8949b3] Detaching volume 35ff73cd-ebe2-40b3-8be6-f91aca202aa8
...
2021-09-09 12:12:14.512 7 ERROR oslo_messaging.rpc.server [req-d4ea1834-53a1-4786-8c9a-f495307ec9d3 99f5628a6a55449fae14d97881a054ce e722c8ab654a440e8d3cdc5ef592f21b - default default] Exception during message handling: cinderclient.exceptions.ClientException: Gateway Timeout (HTTP 504)
2021-09-09 12:12:14.512 7 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2021-09-09 12:12:14.512 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
2021-09-09 12:12:14.512 7 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2021-09-09 12:12:14.512 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 274, in dispatch
2021-09-09 12:12:14.512 7 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
...
2021-09-09 12:12:14.512 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 6237, in _detach_volume
2021-09-09 12:12:14.512 7 ERROR oslo_messaging.rpc.server     attachment_id=attachment_id, destroy_bdm=destroy_bdm)
2021-09-09 12:12:14.512 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/nova/virt/block_device.py", line 486, in detach
2021-09-09 12:12:14.512 7 ERROR oslo_messaging.rpc.server     attachment_id, destroy_bdm)
...
2021-09-09 12:12:14.512 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/cinderclient/client.py", line 223, in delete
2021-09-09 12:12:14.512 7 ERROR oslo_messaging.rpc.server     return self._cs_request(url, 'DELETE', **kwargs)
2021-09-09 12:12:14.512 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/cinderclient/client.py", line 205, in _cs_request
2021-09-09 12:12:14.512 7 ERROR oslo_messaging.rpc.server     return self.request(url, method, **kwargs)
2021-09-09 12:12:14.512 7 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/site-packages/cinderclient/client.py", line 191, in request
2021-09-09 12:12:14.512 7 ERROR oslo_messaging.rpc.server     raise exceptions.from_response(resp, body)
2021-09-09 12:12:14.512 7 ERROR oslo_messaging.rpc.server cinderclient.exceptions.ClientException: Gateway Timeout (HTTP 504)
2021-09-09 12:12:14.512 7 ERROR oslo_messaging.rpc.server
~~~

5. At this point the volume is actually detached from the instance.
~~~
[heat-admin@compute-1 ~]$ sudo podman exec nova_libvirt virsh domblklist 3a9b54b5-35b7-4427-bc64-f88b0d8949b3
 Target   Source
-----------------------------------------------------------------------------
 vda      /var/lib/nova/instances/3a9b54b5-35b7-4427-bc64-f88b0d8949b3/disk
~~~

6. However, the BDM record is still left, and the volume is still shown in volumes_attached.
~~~
(overcloud) [stack@undercloud-0 ~]$ openstack server show test2
+-------------------------------------+---------------------------------------+
| Field                               | Value                                 |
+-------------------------------------+---------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | compute-1.redhat.local |
| OS-EXT-SRV-ATTR:hostname | test2 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute-1.redhat.local |
| OS-EXT-SRV-ATTR:instance_name | instance-0000000a |
| OS-EXT-SRV-ATTR:kernel_id | |
| OS-EXT-SRV-ATTR:launch_index | 0 |
| OS-EXT-SRV-ATTR:ramdisk_id | |
| OS-EXT-SRV-ATTR:reservation_id | r-c5zj577r |
| OS-EXT-SRV-ATTR:root_device_name | /dev/vda |
| OS-EXT-SRV-ATTR:user_data | None |
| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2021-08-18T01:27:11.000000 |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | private=192.168.10.207 |
| config_drive | |
| created | 2021-08-18T01:27:01Z |
| description | None |
| flavor | disk='1', ephemeral='0', extra_specs.hw_rng:allowed='True', original_name='m1.nano', ram='128', swap='0', vcpus='1' |
| hostId | 9bec297ebfece4b6798695f950595eea458e93548257dec7f346266b |
| host_status | UP |
| id | 3a9b54b5-35b7-4427-bc64-f88b0d8949b3 |
| image | cirros-0.4.0-x86_64-disk.img (89a76ef6-2c14-4ff0-934f-fbd369fb0b1e) |
| key_name | None |
| locked | False |
| locked_reason | None |
| name | test2 |
| progress | 0 |
| project_id | e722c8ab654a440e8d3cdc5ef592f21b |
| properties | |
| security_groups | name='ssh-and-icmp' |
| server_groups | [] |
| status | ACTIVE |
| tags | [] |
| trusted_image_certificates | None |
| updated | 2021-08-18T01:27:11Z |
| user_id | 99f5628a6a55449fae14d97881a054ce |
| volumes_attached | delete_on_termination='False', id='35ff73cd-ebe2-40b3-8be6-f91aca202aa8' |
+-------------------------------------+---------------------------------------+
~~~
Thanks for providing detailed information on this bug. Looking at the code for volume detach, I see that besides HTTP connection errors, we only retry cinder calls for 500 InternalServerError [1]:

~~~
@translate_attachment_exception
@retrying.retry(stop_max_attempt_number=5,
                retry_on_exception=lambda e:
                    type(e) == cinder_apiclient.exceptions.InternalServerError)  <=== retry
def attachment_delete(self, context, attachment_id):
    try:
        cinderclient(
            context, '3.44', skip_version_check=True).attachments.delete(
                attachment_id)
    [...]
~~~

This retry was added to resolve a different bug in the past [2]. I think for this 504 GatewayTimeout case we need something similar. A 504 GatewayTimeout here doesn't necessarily mean that cinder has not, or will not, detach the volume. If we add cinder_apiclient.exceptions.GatewayTimeout to the retryable exceptions in the decorator, we could call cinder again, find the volume has already been detached, and then continue on to delete the block_device_mapping record. To do this, we will also need to handle and ignore 404 NotFound in case cinder has already deleted the attachment by the time we call again.

We already handle some neutron calls in a similar way: if, for example, a port binding activation is slow in neutron, we try the neutron call again, and if we find the port binding already activated (409 Conflict) we consider that to be a successful activation and continue. I think a similar approach could work well for this bug.

[1] https://github.com/openstack/nova/blob/stable/train/nova/volume/cinder.py#L885-L893
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1622072
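To make the proposal concrete, here is a hand-rolled sketch of that behavior using stand-in exception classes and a scripted fake client rather than the real cinderclient and retrying modules: retry 500 and 504, and treat a 404 on a retry attempt as evidence that the earlier timed-out call actually succeeded.

```python
class ClientException(Exception):
    """Stand-in for cinderclient's base HTTP error class."""
    def __init__(self, code, msg=""):
        super().__init__(msg)
        self.code = code

class NotFound(ClientException):
    def __init__(self, msg=""):
        super().__init__(404, msg)

class GatewayTimeout(ClientException):
    def __init__(self, msg=""):
        super().__init__(504, msg)

# 500 (InternalServerError) is retried today; the proposal adds 504.
RETRYABLE_CODES = (500, 504)

def attachment_delete(client, attachment_id, max_attempts=5):
    """Delete a cinder attachment, retrying retryable server errors; a 404
    on a retry attempt means an earlier timed-out call actually succeeded."""
    for attempt in range(1, max_attempts + 1):
        try:
            client.delete(attachment_id)
            return
        except NotFound:
            if attempt > 1:
                return  # already deleted by the call that "timed out"
            raise
        except ClientException as exc:
            if exc.code in RETRYABLE_CODES and attempt < max_attempts:
                continue
            raise

class FakeAttachments:
    """Scripted client: each entry is an exception to raise, or None."""
    def __init__(self, script):
        self.script = list(script)
        self.calls = 0
    def delete(self, attachment_id):
        self.calls += 1
        exc = self.script.pop(0)
        if exc is not None:
            raise exc
```

With a script of [504, 404] the delete is treated as successful after two calls, while a 404 on the very first call still propagates, preserving today's behavior for genuinely missing attachments.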
Hi Melanie,

Thank you for the pointer. I think we can let nova retry the delete not only for 500 but for any 5xx response, because a 5xx error indicates something is wrong on the server side.

Regarding the 404 error returned when the attachment is already deleted, I'm wondering whether we can ignore it even on the first attempt. To preserve the current behavior we should not ignore a 404 returned for the first delete request, but because of this behavior we have needed manual operations whenever an attachment was somehow deleted from cinder only. Because from nova's perspective there is almost nothing left to do when the attachment is not found in cinder, just ignoring 404 in any case, while logging a warning message, might be a better option. Of course there is a trade-off, and ignoring 404 might leave something stale.
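This variant can be expressed as a generic 5xx retry predicate plus an unconditional 404-swallow with a warning (stand-in names, not nova's code):

```python
import logging

LOG = logging.getLogger("detach-sketch")

def is_retryable(exc):
    """Retry any 5xx: a server-side fault may be transient, and a delete
    that timed out may in fact have completed."""
    code = getattr(exc, "code", None)
    return code is not None and 500 <= code <= 599

def delete_ignoring_not_found(delete_fn, attachment_id):
    """Treat 404 as success on every attempt, logging a warning, since
    there is nothing left for nova to do when cinder has no such
    attachment (at the risk of leaving something stale on the cinder
    side, per the trade-off discussed above)."""
    try:
        delete_fn(attachment_id)
    except Exception as exc:
        if getattr(exc, "code", None) == 404:
            LOG.warning("Attachment %s already gone in cinder: %s",
                        attachment_id, exc)
            return
        raise
```

The predicate keys off a `code` attribute because that is how the HTTP status is surfaced on the client exception in the log-message code quoted below (`getattr(ex, 'code', None)`); exceptions without a status are never retried.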
Agree with everything Mel and Takashi have discussed here. Just wanted to add that handling 404s (AttachmentNotFound and VolumeNotFound) would also handle the corner-case race documented in #1977322.
I was looking at the logic Melanie pointed out, but I'm now a bit confused by the current implementation. The following logic expects two different exception types from the attachments.delete() call.

Looking at the current implementation in cinderclient, I could not find any logic which is likely to make that API call raise exceptions from the cinderclient.apiclient.exceptions module, so the current implementation looks invalid and ineffective to me...
~~~
@translate_attachment_exception
@retrying.retry(stop_max_attempt_number=5,
                retry_on_exception=lambda e:
                    type(e) == cinder_apiclient.exceptions.InternalServerError)  <===(*)
def attachment_delete(self, context, attachment_id):
    try:
        cinderclient(
            context, '3.44', skip_version_check=True).attachments.delete(
                attachment_id)
    except cinder_exception.ClientException as ex:  <===(*)
        with excutils.save_and_reraise_exception():
            LOG.error(('Delete attachment failed for attachment '
                       '%(id)s. Error: %(msg)s Code: %(code)s'),
                      {'id': attachment_id,
                       'msg': six.text_type(ex),
                       'code': getattr(ex, 'code', None)})
~~~
(In reply to Takashi Kajinami from comment #7)
> I was looking at the logic Malanie pointed but I'm now a bit confused with
> the current implementation.
> Looking at the current implementation the following logic expect two
> different exceptions from attachments.delete() call.
>
> Looking at the current implementation in cinderclient, I could not find any
> logic which is likely to make that API call raise exceptions from the
> cinderclient.apiclient.exceptions module, and the current implementation
> looks invalid and ineffective for me...
> ~~~
> @translate_attachment_exception
> @retrying.retry(stop_max_attempt_number=5,
>                 retry_on_exception=lambda e:
>                     type(e) == cinder_apiclient.exceptions.InternalServerError)  <===(*)
> def attachment_delete(self, context, attachment_id):
>     try:
>         cinderclient(
>             context, '3.44', skip_version_check=True).attachments.delete(
>                 attachment_id)
>     except cinder_exception.ClientException as ex:  <===(*)
>         with excutils.save_and_reraise_exception():
>             LOG.error(('Delete attachment failed for attachment '
>                        '%(id)s. Error: %(msg)s Code: %(code)s'),
>                       {'id': attachment_id,
>                        'msg': six.text_type(ex),
>                        'code': getattr(ex, 'code', None)})
> ~~~

Specifically, InternalServerError is missing from this list, so it isn't translated by cinderclient.exceptions.from_response(), right?

https://github.com/openstack/python-cinderclient/blob/6ff9e30e175f2df80fb27701bcdc7e34f0edfb9e/cinderclient/exceptions.py#L236-L245
~~~
236 # In Python 2.4 Exception is old-style and thus doesn't have a __subclasses__()
237 # so we can do this:
238 #     _code_map = dict((c.http_status, c)
239 #                      for c in ClientException.__subclasses__())
240 #
241 # Instead, we have to hardcode it:
242 _code_map = dict((c.http_status, c) for c in [BadRequest, Unauthorized,
243                                               Forbidden, NotFound,
244                                               NotAcceptable,
245                                               OverLimit, HTTPNotImplemented])
~~~
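The mismatch can be illustrated with simplified stand-ins for the two exception namespaces involved (not the real cinderclient code): since the hardcoded map has no entry for 500, from_response() falls back to the base ClientException, so a `type(e) ==` check against the apiclient-module InternalServerError can never match.

```python
# Simplified stand-ins: "cinderclient.exceptions" is what the client
# raises; "cinder_apiclient.exceptions" is what nova's retry predicate
# checks. Both names here are illustrative, not the real modules.

class ClientException(Exception):      # cinderclient.exceptions base class
    http_status = None

class NotFound(ClientException):
    http_status = 404

# Mirrors the hardcoded _code_map: only listed classes get mapped, and
# there is no class for 500, so a 500 falls back to the base class.
_code_map = {cls.http_status: cls for cls in [NotFound]}

def from_response(status, body=""):
    """Map an HTTP status to an exception class, as cinderclient does."""
    return _code_map.get(status, ClientException)(body)

class InternalServerError(Exception):  # cinder_apiclient.exceptions
    """The class nova's retry predicate compares against."""

def retry_predicate(exc):
    # The lambda from the @retrying.retry decorator in nova.
    return type(exc) == InternalServerError
```

A 500 response thus produces a bare ClientException from a completely separate hierarchy, and the retry predicate silently never fires.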
Sigh, that isn't even the same set of exceptions. I'll tidy this up now.
Hi Lee,

I have reported the issue with the current logic here: https://bugs.launchpad.net/nova/+bug/1944043 and submitted a fix for it: https://review.opendev.org/c/openstack/nova/+/809934 . Once that is merged, we can extend the fixed logic.
Given that we've stopped doing 16.2 maintenance releases, the chances of a 16.1 MR are basically none at this point. Closing EOL.