Bug 1977322 - Unable to start instance due to stale volumes
Summary: Unable to start instance due to stale volumes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-cinder
Version: 16.1 (Train)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: z9
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Gorka Eguileor
QA Contact: Evelina Shames
Docs Contact: Andy Stillman
URL:
Whiteboard:
Depends On: 2057002
Blocks:
 
Reported: 2021-06-29 13:14 UTC by Paul Jany
Modified: 2022-12-09 08:24 UTC
CC List: 11 users

Fixed In Version: openstack-cinder-15.4.0-1.20220707103428.58f0e73.el8ost
Doc Type: Bug Fix
Doc Text:
Before this update, a race condition occurred when the Compute service (nova) requested the Block Storage service (cinder) to detach a volume and there was an external request to delete the volume. The race condition resulted in the volume failing to detach, the volume being deleted, and the Compute service being unable to remove the non-existent volume. With this update, the race condition is resolved.
Clone Of:
: 2057002 (view as bug list)
Environment:
Last Closed: 2022-12-07 20:24:45 UTC
Target Upstream Version:
Embargoed:




Links
+------------------------+----------------+---------+----------+--------+--------------------------------------------------+-------------------------+
| System                 | ID             | Private | Priority | Status | Summary                                          | Last Updated            |
+------------------------+----------------+---------+----------+--------+--------------------------------------------------+-------------------------+
| Launchpad              | 1937084        | 0       | None     | None   | None                                             | 2021-07-21 13:19:55 UTC |
| OpenStack gerrit       | 801913         | 0       | None     | MERGED | Fix: Race between attachment and volume deletion | 2022-04-11 16:34:00 UTC |
| Red Hat Issue Tracker  | OSP-5606       | 0       | None     | None   | None                                             | 2021-12-23 08:59:43 UTC |
| Red Hat Product Errata | RHBA-2022:8795 | 0       | None     | None   | None                                             | 2022-12-07 20:25:09 UTC |
+------------------------+----------------+---------+----------+--------+--------------------------------------------------+-------------------------+

Description Paul Jany 2021-06-29 13:14:04 UTC
Description of problem:
Instance fails to start due to stale volumes that are still recorded as attached.

Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 16.1.2 GA (Train)

How reproducible:
The customer uses the Kubernetes "cinder-csi" driver to create/attach/delete volumes for OpenShift overcloud instances, and at times a volume gets deleted before it is detached. This causes problems when starting the instance. The Cinder backend is standalone Ceph.

The customer has enabled debug logging for nova and cinder; I will share the debug logs once they are available. The complete workflow logs for the stale volume, captured before debug was enabled, are below.


1.) Instance and attached volumes
#openstack server show  89c9154b-7cbb-4af2-bb74-4a1047b999ac --fit-width
+-------------------------------------+----------------------------------------------------------+
| Field                               | Value                                                    |
+-------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                   |
| OS-EXT-AZ:availability_zone         | nova                                                     |
| OS-EXT-SRV-ATTR:host                | compute-xx.yy                                            |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute-xx.yy                                            |
| OS-EXT-SRV-ATTR:instance_name       | instance-00000000                                        |
| OS-EXT-STS:power_state              | Running                                                  |
| OS-EXT-STS:task_state               | None                                                     |
| OS-EXT-STS:vm_state                 | active                                                   |
| OS-SRV-USG:launched_at              | 2020-11-23T15:12:25.000000                               |
| OS-SRV-USG:terminated_at            | None                                                     |
| accessIPv4                          |                                                          |
| accessIPv6                          |                                                          |
| addresses                           | bas1-net-worker=a.b.c.d                                  |
| config_drive                        | True                                                     |
| created                             | 2020-11-23T15:12:13Z                                     |
| flavor                              | m32.large (17127e6d-1ce2-4f3e-b146-fefc7413d7e0)         |
| hostId                              | 2edcb7ea11443ea35c018a12a38c479184ed8a9bad6dbb3a3a803357 |
| id                                  | 89c9154b-7cbb-4af2-bb74-4a1047b999ac                     |
| image                               | N/A (booted from volume)                                 |
| key_name                            | abc-ops-def                                              |
| name                                | 123-worker-456                                           |
| progress                            | 0                                                        |
| project_id                          | 30b9e06a62cb4374bfa0bf27ee55ec8d                         |
| properties                          |                                                          |
| security_groups                     | name='secgroup_workers_bas1'                             |
| status                              | ACTIVE                                                   |
| updated                             | 2020-11-25T15:28:27Z                                     |
| user_id                             | 7ef6a196394a43488a28d51261676929                         |
| volumes_attached                    | id='019b0d9f-3a9c-4e7c-8454-a71de0516687'                |
|                                     | id='21953828-ffd9-4803-85d1-bdd7cb48964b'                |
|                                     | id='25e88ac0-8412-451c-8390-97af19fbfd3f'                |
|                                     | id='269b5f53-0596-4b52-994a-bfd9e1d5d648'                |
|                                     | id='2754d281-72a8-40df-9d50-68ce2e1866e9'                |
|                                     | id='2a7ea684-ac04-400a-bcf9-cb7fb2e06a6c'                |
|                                     | id='36dc4404-dbe9-4365-adb3-6bbe3bbe5276'                |
|                                     | id='39cfe83d-5dee-4401-b3cd-6d322f21c594'                |
|                                     | id='3c91e033-cd1d-43f8-b5c9-388048270f52'                |
|                                     | id='471e8d0a-d36b-44c0-8dd9-9844306f5bb5'                |
|                                     | id='535f84c2-99aa-4cec-afb7-ec899cbb235e'                |
|                                     | id='54b611d4-eab4-489b-aa7d-8548969a53c5'                |
|                                     | id='71c487ce-fd0f-4f0a-b4bf-733cd3ac97e9'                |
|                                     | id='8350009c-963a-47c6-9a02-b438cdde4db6'                |
|                                     | id='8b8d22b8-c6b3-4e11-9d3e-d8dd1fc0aafd'                |
|                                     | id='aac78f41-09f9-4eb5-ba7f-e43e29aeb936'                |
|                                     | id='ad74261f-0077-4f5e-befb-cecfd49df2df'                |
|                                     | id='b40f9c45-4d0e-4e52-954d-f7b3e02ecd69'                |
|                                     | id='b4a15652-d8a8-4cb2-9484-6280f2c541d0'                |
|                                     | id='c188c8b6-74ba-41bf-b59b-39fe9a2cac27'                |
|                                     | id='ce7ccca6-a592-4676-9d76-424339cdb316'                |
|                                     | id='e6e86911-f426-4a34-9170-8a8ecd200022'                |
|                                     | id='ecb95982-f1c5-49a9-80c9-bb6b724dc75b'                |
+-------------------------------------+----------------------------------------------------------+


2.) Status of the attached volumes, listed as volume ID followed by its status. (Captured by the customer running #openstack volume show <vol_id> -c status for each ID; see the scripted sketch after this list.)
019b0d9f-3a9c-4e7c-8454-a71de0516687
in-use
21953828-ffd9-4803-85d1-bdd7cb48964b
in-use
25e88ac0-8412-451c-8390-97af19fbfd3f
in-use
269b5f53-0596-4b52-994a-bfd9e1d5d648
in-use
2754d281-72a8-40df-9d50-68ce2e1866e9
in-use
2a7ea684-ac04-400a-bcf9-cb7fb2e06a6c
detaching
36dc4404-dbe9-4365-adb3-6bbe3bbe5276
in-use
39cfe83d-5dee-4401-b3cd-6d322f21c594
in-use
3c91e033-cd1d-43f8-b5c9-388048270f52
in-use
471e8d0a-d36b-44c0-8dd9-9844306f5bb5
in-use
535f84c2-99aa-4cec-afb7-ec899cbb235e
in-use
54b611d4-eab4-489b-aa7d-8548969a53c5
in-use
71c487ce-fd0f-4f0a-b4bf-733cd3ac97e9
in-use
8350009c-963a-47c6-9a02-b438cdde4db6
in-use
8b8d22b8-c6b3-4e11-9d3e-d8dd1fc0aafd
in-use
aac78f41-09f9-4eb5-ba7f-e43e29aeb936
in-use
ad74261f-0077-4f5e-befb-cecfd49df2df
No volume with a name or ID of 'ad74261f-0077-4f5e-befb-cecfd49df2df' exists.
b40f9c45-4d0e-4e52-954d-f7b3e02ecd69
in-use
b4a15652-d8a8-4cb2-9484-6280f2c541d0
in-use
c188c8b6-74ba-41bf-b59b-39fe9a2cac27
in-use
ce7ccca6-a592-4676-9d76-424339cdb316
in-use
e6e86911-f426-4a34-9170-8a8ecd200022
in-use
ecb95982-f1c5-49a9-80c9-bb6b724dc75b
in-use
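
The same per-volume check can also be scripted. A minimal sketch follows, wrapping the same openstack CLI commands used above; it assumes sourced credentials, and since the JSON shape of volumes_attached varies between client releases it just extracts the UUIDs with a regex. Illustrative only.

#!/usr/bin/env python3
# Minimal sketch: print the status of every volume attached to the server,
# wrapping the same "openstack server show" / "openstack volume show" CLI
# commands used above. Assumes the overcloud credentials are sourced.
import json
import re
import subprocess

SERVER = "89c9154b-7cbb-4af2-bb74-4a1047b999ac"

def cli(*args):
    # Run an openstack CLI command with JSON output; return None on failure.
    proc = subprocess.run(["openstack", *args, "-f", "json"],
                          capture_output=True, text=True)
    return json.loads(proc.stdout) if proc.returncode == 0 else None

server = cli("server", "show", SERVER)
# The JSON shape of volumes_attached differs between client releases,
# so just pull every UUID out of the field's string form.
vol_ids = re.findall(r"[0-9a-f-]{36}", str(server["volumes_attached"]))

for vol_id in vol_ids:
    vol = cli("volume", "show", vol_id)
    if vol is None:
        # Mirrors the "No volume with a name or ID ... exists." case above.
        print(vol_id, "MISSING IN CINDER")
    else:
        print(vol_id, vol["status"])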

#openstack server remove volume 89c9154b-7cbb-4af2-bb74-4a1047b999ac ad74261f-0077-4f5e-befb-cecfd49df2df
No volume with a name or ID of 'ad74261f-0077-4f5e-befb-cecfd49df2df' exists.

3.) Stale volume information from database.
MariaDB [cinder]> SELECT a.id,a.instance_uuid,a.volume_id,b.id,b.status,b.deleted FROM nova.block_device_mapping a LEFT JOIN cinder.volumes b ON a.volume_id = b.id  WHERE a.source_type = 'volume' AND a.destination_type = 'volume' AND NOT a.deleted AND (b.deleted OR b.deleted is NULL);
+---------+--------------------------------------+--------------------------------------+--------------------------------------+---------+---------+
| id      | instance_uuid                        | volume_id                            | id                                   | status  | deleted |
+---------+--------------------------------------+--------------------------------------+--------------------------------------+---------+---------+
| 6333993 | 89c9154b-7cbb-4af2-bb74-4a1047b999ac | ad74261f-0077-4f5e-befb-cecfd49df2df | ad74261f-0077-4f5e-befb-cecfd49df2df | deleted |       1 |
+---------+--------------------------------------+--------------------------------------+--------------------------------------+---------+---------+

4.) Chronology of events from the logs.

1.) Volume creation
/var/log/containers/cinder/cinder-volume.log:2021-06-25 15:41:22.719 121 INFO cinder.volume.flows.manager.create_volume [req-9acd9c92-2c5f-4289-b232-98bd5fef60cb 7ef6a196394a43488a28d51261676929 30b9e06a62cb4374bfa0bf27ee55ec8d - default 52e00f437b5b4254a81a201277542e6d] Volume volume-ad74261f-0077-4f5e-befb-cecfd49df2df (ad74261f-0077-4f5e-befb-cecfd49df2df): created successfully

2.) Volume attach,detach
/var/log/containers/nova/nova-compute.log:

2021-06-25 15:42:14.483 7 INFO nova.compute.manager [req-c63dc044-7c9d-4a99-8ec3-536a9ae0f650 7ef6a196394a43488a28d51261676929 30b9e06a62cb4374bfa0bf27ee55ec8d - default 52e00f437b5b4254a81a201277542e6d] [instance: 89c9154b-7cbb-4af2-bb74-4a1047b999ac] Attaching volume ad74261f-0077-4f5e-befb-cecfd49df2df to /dev/sdu
2021-06-25 15:44:28.152 7 INFO nova.compute.manager [req-e9bbf652-3016-425e-bcd2-a683fd86ece1 7ef6a196394a43488a28d51261676929 30b9e06a62cb4374bfa0bf27ee55ec8d - default 52e00f437b5b4254a81a201277542e6d] [instance: 89c9154b-7cbb-4af2-bb74-4a1047b999ac] Detaching volume ad74261f-0077-4f5e-befb-cecfd49df2df
2021-06-25 15:44:52.442 7 ERROR nova.volume.cinder [req-e9bbf652-3016-425e-bcd2-a683fd86ece1 7ef6a196394a43488a28d51261676929 30b9e06a62cb4374bfa0bf27ee55ec8d - default 52e00f437b5b4254a81a201277542e6d] Delete attachment failed for attachment 1cedf0e9-93e2-4b60-9954-739e63f50519. Error: Volume ad74261f-0077-4f5e-befb-cecfd49df2df could not be found. (HTTP 404) (Request-ID: req-6319dabd-67c4-4dd4-9637-06350776df28) Code: 404: cinderclient.exceptions.NotFound: Volume ad74261f-0077-4f5e-befb-cecfd49df2df could not be found. (HTTP 404) (Request-ID: req-6319dabd-67c4-4dd4-9637-06350776df28)

3.) Volume deletion
/var/log/containers/cinder/cinder-api.log.1
2021-06-25 15:44:46.201 22 INFO cinder.api.openstack.wsgi [req-0089762c-f983-4319-aeba-0fc9f2b7d826 7ef6a196394a43488a28d51261676929 30b9e06a62cb4374bfa0bf27ee55ec8d - default 52e00f437b5b4254a81a201277542e6d] DELETE https://iaas.aa.bb:13776/v3/30b9e06a62cb4374bfa0bf27ee55ec8d/volumes/ad74261f-0077-4f5e-befb-cecfd49df2df
2021-06-25 15:44:46.201 22 INFO cinder.api.v3.volumes [req-0089762c-f983-4319-aeba-0fc9f2b7d826 7ef6a196394a43488a28d51261676929 30b9e06a62cb4374bfa0bf27ee55ec8d - default 52e00f437b5b4254a81a201277542e6d] Delete volume with id: ad74261f-0077-4f5e-befb-cecfd49df2df
2021-06-25 15:44:46.201 22 INFO cinder.api.v3.volumes [req-0089762c-f983-4319-aeba-0fc9f2b7d826 7ef6a196394a43488a28d51261676929 30b9e06a62cb4374bfa0bf27ee55ec8d - default 52e00f437b5b4254a81a201277542e6d] Delete volume with id: ad74261f-0077-4f5e-befb-cecfd49df2df
2021-06-25 15:44:46.225 22 INFO cinder.volume.api [req-0089762c-f983-4319-aeba-0fc9f2b7d826 7ef6a196394a43488a28d51261676929 30b9e06a62cb4374bfa0bf27ee55ec8d - default 52e00f437b5b4254a81a201277542e6d] Volume info retrieved successfully.
2021-06-25 15:44:46.241 22 INFO cinder.volume.api [req-0089762c-f983-4319-aeba-0fc9f2b7d826 7ef6a196394a43488a28d51261676929 30b9e06a62cb4374bfa0bf27ee55ec8d - default 52e00f437b5b4254a81a201277542e6d] Delete volume request issued successfully.
2021-06-25 15:44:46.241 22 INFO cinder.api.openstack.wsgi [req-0089762c-f983-4319-aeba-0fc9f2b7d826 7ef6a196394a43488a28d51261676929 30b9e06a62cb4374bfa0bf27ee55ec8d - default 52e00f437b5b4254a81a201277542e6d] https://iaas.aa.bb:13776/v3/30b9e06a62cb4374bfa0bf27ee55ec8d/volumes/ad74261f-0077-4f5e-befb-cecfd49df2df returned with HTTP 202

5.) sos reports of the controllers and of the compute node running the instance. The logs above were captured from these sos reports. There are sos reports available from after debug was enabled, but the needed logs are missing from them. I have asked the customer to set log retention to a week and to share the debug logs once the issue re-occurs.

drwxrwxrwx+ 3 yank yank         73 Jun 25 15:07 0180-sosreport-ctrl-xx-02968407-2021-06-25-ftvupwr.tar.xz
drwxrwxrwx+ 3 yank yank         73 Jun 25 15:07 0170-sosreport-ctrl-yy-02968407-2021-06-25-uxyaryr.tar.xz
drwxrwxrwx+ 3 yank yank         73 Jun 25 15:07 0160-sosreport-ctrl-zz-02968407-2021-06-25-wpgvykt.tar.xz
drwxrwxrwx+ 3 yank yank         76 Jun 25 15:24 0210-sosreport-compute-aa-02968407-2021-06-25-xbqaicy.tar.xz


Actual results:
The volume gets deleted before it is detached from the instance.

Expected results:
The volume should be deleted only after it has been detached from the instance.

Comment 2 Gorka Eguileor 2021-07-02 11:19:16 UTC
After looking at the logs and the Cinder code, this seems to be a race condition with Cinder-CSI requests caused by an internal Cinder issue.

Some context on the issue:

- Cinder API uses the volume "status" field as a locking mechanism to prevent concurrent request processing on the same volume.
- Most cinder operations are asynchronous, so the API returns before the operation has been completed by the cinder-volume service, but attachment operations (creating/updating/deleting an attachment) are synchronous, so the API only returns to the caller after the cinder-volume service has completed the operation.
- Our current code **incorrectly** modifies the status of the volume both on the cinder-volume and the cinder-api services on the attachment delete operation.

The actual sequence of events that leads to the issue reported in this BZ is:

[OpenShift]
- Cinder-CSI requests Nova to detach the volume (Request R1)

[Nova]
- R1: Asks cinder-api to delete the attachment and **waits**

[Cinder-API]
- R1: Checks the status of the volume
- R1: Sends terminate connection request (R1) to cinder-volume and **waits**

[Cinder-Volume]
- R1: Asks the driver to terminate the connection
- R1: The driver asks the backend to unmap and unexport the volume
- R1: The status of the volume is changed in the DB to "available"

[OpenShift]
- Cinder-CSI asks Cinder to delete the volume (Request R2)

[Cinder-API]
- R2: Checks that the volume's status is valid. It is "available", so it can be deleted.
- R2: Tells cinder-volume to delete the volume and returns immediately.

[Cinder-Volume]
- R2: The volume is deleted and its DB entry is removed
- R1: Finishes the termination of the connection

[Cinder-API]
- R1: Now that cinder-volume has finished the termination, the code continues
- R1: Tries to modify the volume in the DB
- R1: The DB layer raises VolumeNotFound, since the volume has been deleted from the DB
- R1: VolumeNotFound is converted to an HTTP 404 status code, which is returned to Nova

[Nova]
- R1: Cinder responds with 404 on the attachment delete request
- R1: Nova leaves the volume as attached, since the attachment delete failed

At this point the Cinder and Nova DBs are out of sync: Nova thinks that the attachment is still connected, while Cinder has detached the volume and even deleted it.
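
To make the window concrete, here is a toy sketch (plain Python threads and an in-memory dict, NOT Cinder code): the buggy path publishes "available" before the synchronous attachment delete has unwound, which is exactly what lets the concurrent delete through; deferring that status change until the operation completes closes the window. Whether this matches the exact merged fix is not claimed here.

# Toy reproduction of the race described above -- plain Python, not Cinder code.
import threading
import time

volumes = {"vol-1": {"status": "detaching"}}     # stand-in for the cinder DB row

def buggy_attachment_delete(vol_id):
    time.sleep(0.01)                              # R1: driver unmap/unexport
    volumes[vol_id]["status"] = "available"       # published before R1 has finished
    time.sleep(0.05)                              # R1: remainder of the synchronous call
    if vol_id not in volumes:
        print("R1: VolumeNotFound -> HTTP 404 returned to Nova")

def volume_delete(vol_id):
    time.sleep(0.02)                              # R2: external delete request
    if volumes.get(vol_id, {}).get("status") == "available":
        del volumes[vol_id]
        print("R2: volume deleted while R1 was still in flight")
    else:
        print("R2: delete rejected, volume is not available")

def fixed_attachment_delete(vol_id):
    time.sleep(0.01)                              # driver unmap/unexport
    time.sleep(0.05)                              # remainder of the synchronous call
    # Only publish "available" once the whole attachment delete has finished,
    # so a concurrent delete can never observe a deletable status mid-operation.
    volumes[vol_id]["status"] = "available"

# Buggy ordering: R2 wins the race and R1 gets the 404 that Nova logged above.
r1 = threading.Thread(target=buggy_attachment_delete, args=("vol-1",))
r2 = threading.Thread(target=volume_delete, args=("vol-1",))
r1.start(); r2.start(); r1.join(); r2.join()

# Fixed ordering: the delete is rejected while the detach is still in flight.
volumes["vol-2"] = {"status": "detaching"}
r1 = threading.Thread(target=fixed_attachment_delete, args=("vol-2",))
r2 = threading.Thread(target=volume_delete, args=("vol-2",))
r1.start(); r2.start(); r1.join(); r2.join()
print("vol-2 after fixed run:", volumes["vol-2"])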

**This is caused by a Cinder bug**, but there is some robustification work that could be done in Nova: since the os-brick call succeeded, the volume could be left in a "detached from instance" state, and a second detach request could skip the os-brick call entirely; when it sees that the volume or the attachment no longer exists in Cinder, it could proceed to remove the device from the instance's XML.
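
A rough sketch of that Nova-side idea only. The helpers below are hypothetical placeholders (defined as no-ops), not real Nova internals; the only real API assumed is python-cinderclient's attachments.delete and its NotFound exception.

# Sketch of the Nova-side robustification idea above; placeholders, not Nova code.
from cinderclient import exceptions as cinder_exc

def detach_with_os_brick(instance, bdm):           # placeholder: host-side detach
    pass

def remove_device_from_domain_xml(instance, bdm):  # placeholder: libvirt XML cleanup
    pass

def delete_bdm_record(bdm):                        # placeholder: drop the Nova BDM row
    pass

def robust_detach(cinder, instance, bdm):
    detach_with_os_brick(instance, bdm)
    try:
        cinder.attachments.delete(bdm.attachment_id)
    except cinder_exc.NotFound:
        # The volume or attachment is already gone in Cinder: the detach has
        # effectively happened, so finish local cleanup instead of leaving a
        # stale volume attached to the instance.
        pass
    remove_device_from_domain_xml(instance, bdm)
    delete_bdm_record(bdm)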

This BZ will only focus on fixing the Cinder issue.

Comment 5 Lee Yarwood 2021-09-13 12:23:21 UTC
I've raised https://github.com/kubernetes/cloud-provider-openstack/issues/1645 upstream as I honestly believe this is something that needs to be addressed in the cinder-csi plugin in k8s.

Comment 20 Evelina Shames 2022-11-23 08:51:15 UTC
Verification flow:
- version: openstack-cinder-15.4.0-1.20221003203219.58f0e73.el8ost.noarch
- Ceph-backed deployment
- Booted an instance with an attached volume

(overcloud) [stack@undercloud-0 ~]$ cinder list
+--------------------------------------+--------+--------------+------+-------------+----------+--------------------------------------+
| ID                                   | Status | Name         | Size | Volume Type | Bootable | Attached to                          |
+--------------------------------------+--------+--------------+------+-------------+----------+--------------------------------------+
| 5644fc51-1ae3-4ecc-b8a7-42ac077e6725 | in-use | Pansible_vol | 1    | tripleo     | true     | 64655a8f-72a3-4380-8960-c0c93c14dc48 |
+--------------------------------------+--------+--------------+------+-------------+----------+--------------------------------------+

As this was already verified in a newer version, I followed the flow from bug 2057002#c6:

delete loop in 3 parallel sessions:
(overcloud) [stack@undercloud-0 ~]$ while true; do cinder delete 5621aeae-99c2-49c9-a740-c6e3f50bec79; done

"""
Delete for volume 5644fc51-1ae3-4ecc-b8a7-42ac077e6725 failed: Invalid volume: Volume status must be available or error or error_restoring or error_extending or error_managing and must not be migrating, attached, belong to a group, have snapshots or be disassociated from snapshots after volume transfer. (HTTP 400) (Request-ID: req-50f74944-e015-4fa5-b4b4-dc19f5dd52d8)
ERROR: Unable to delete any of the specified volumes.
"""

detach command:
(overcloud) [stack@undercloud-0 ~]$ nova volume-detach inst1 5621aeae-99c2-49c9-a740-c6e3f50bec79

"""
Request to delete volume 5644fc51-1ae3-4ecc-b8a7-42ac077e6725 has been accepted.

Delete for volume 5644fc51-1ae3-4ecc-b8a7-42ac077e6725 failed: No volume with a name or ID of '5644fc51-1ae3-4ecc-b8a7-42ac077e6725' exists.
ERROR: Unable to delete any of the specified volumes.
"""

Cinder list post deletion:
(overcloud) [stack@undercloud-0 ~]$ cinder list
+----+--------+------+------+-------------+----------+-------------+
| ID | Status | Name | Size | Volume Type | Bootable | Attached to |
+----+--------+------+------+-------------+----------+-------------+
+----+--------+------+------+-------------+----------+-------------+ 

I ran this same cycle several times; in all cycles the volume was successfully detached and then deleted, as expected.
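
For completeness, the same manual cycle could be scripted along these lines (a sketch only; it assumes sourced overcloud credentials and reuses the instance/volume names from the output above):

# Scripted version of the manual cycle above: hammer "cinder delete" from three
# threads while the detach is issued.
import subprocess
import threading
import time

VOLUME = "5644fc51-1ae3-4ecc-b8a7-42ac077e6725"
INSTANCE = "inst1"
stop = threading.Event()

def delete_loop():
    # Equivalent of one "while true; do cinder delete <vol>; done" session.
    while not stop.is_set():
        subprocess.run(["cinder", "delete", VOLUME], capture_output=True, text=True)

threads = [threading.Thread(target=delete_loop) for _ in range(3)]
for t in threads:
    t.start()

time.sleep(2)
subprocess.run(["nova", "volume-detach", INSTANCE, VOLUME])

time.sleep(10)                  # let the detach and the following delete settle
stop.set()
for t in threads:
    t.join()

# Expected with the fix: the detach completes first, exactly one delete then
# succeeds, and the listing comes back empty.
print(subprocess.run(["cinder", "list"], capture_output=True, text=True).stdout)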

Moving to 'Verified'.

Comment 26 errata-xmlrpc 2022-12-07 20:24:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.9 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8795

