Description of problem:
An instance fails to start because stale volumes are still recorded as attached to it.

Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 16.1.2 GA (Train)

How reproducible:
The customer uses the Kubernetes cinder-csi plugin to create/attach/delete volumes for OpenShift overcloud instances, and at times a volume gets deleted before it is detached from the instance. This then causes problems when the instance is started. The Cinder backend is standalone Ceph. The customer has enabled debug logging for Nova and Cinder; I will share the debug logs once they are available. The complete workflow logs for one stale volume, captured before debug was enabled, are below (a rough scripted approximation of this workflow is also sketched after the expected results).

1.) Instance and attached volumes

#openstack server show 89c9154b-7cbb-4af2-bb74-4a1047b999ac --fit-width
+-------------------------------------+----------------------------------------------------------+
| Field                               | Value                                                    |
+-------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                   |
| OS-EXT-AZ:availability_zone         | nova                                                     |
| OS-EXT-SRV-ATTR:host                | compute-xx.yy                                            |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute-xx.yy                                            |
| OS-EXT-SRV-ATTR:instance_name       | instance-00000000                                        |
| OS-EXT-STS:power_state              | Running                                                  |
| OS-EXT-STS:task_state               | None                                                     |
| OS-EXT-STS:vm_state                 | active                                                   |
| OS-SRV-USG:launched_at              | 2020-11-23T15:12:25.000000                               |
| OS-SRV-USG:terminated_at            | None                                                     |
| accessIPv4                          |                                                          |
| accessIPv6                          |                                                          |
| addresses                           | bas1-net-worker=a.b.c.d                                  |
| config_drive                        | True                                                     |
| created                             | 2020-11-23T15:12:13Z                                     |
| flavor                              | m32.large (17127e6d-1ce2-4f3e-b146-fefc7413d7e0)         |
| hostId                              | 2edcb7ea11443ea35c018a12a38c479184ed8a9bad6dbb3a3a803357 |
| id                                  | 89c9154b-7cbb-4af2-bb74-4a1047b999ac                     |
| image                               | N/A (booted from volume)                                 |
| key_name                            | abc-ops-def                                              |
| name                                | 123-worker-456                                           |
| progress                            | 0                                                        |
| project_id                          | 30b9e06a62cb4374bfa0bf27ee55ec8d                         |
| properties                          |                                                          |
| security_groups                     | name='secgroup_workers_bas1'                             |
| status                              | ACTIVE                                                   |
| updated                             | 2020-11-25T15:28:27Z                                     |
| user_id                             | 7ef6a196394a43488a28d51261676929                         |
| volumes_attached                    | id='019b0d9f-3a9c-4e7c-8454-a71de0516687'                |
|                                     | id='21953828-ffd9-4803-85d1-bdd7cb48964b'                |
|                                     | id='25e88ac0-8412-451c-8390-97af19fbfd3f'                |
|                                     | id='269b5f53-0596-4b52-994a-bfd9e1d5d648'                |
|                                     | id='2754d281-72a8-40df-9d50-68ce2e1866e9'                |
|                                     | id='2a7ea684-ac04-400a-bcf9-cb7fb2e06a6c'                |
|                                     | id='36dc4404-dbe9-4365-adb3-6bbe3bbe5276'                |
|                                     | id='39cfe83d-5dee-4401-b3cd-6d322f21c594'                |
|                                     | id='3c91e033-cd1d-43f8-b5c9-388048270f52'                |
|                                     | id='471e8d0a-d36b-44c0-8dd9-9844306f5bb5'                |
|                                     | id='535f84c2-99aa-4cec-afb7-ec899cbb235e'                |
|                                     | id='54b611d4-eab4-489b-aa7d-8548969a53c5'                |
|                                     | id='71c487ce-fd0f-4f0a-b4bf-733cd3ac97e9'                |
|                                     | id='8350009c-963a-47c6-9a02-b438cdde4db6'                |
|                                     | id='8b8d22b8-c6b3-4e11-9d3e-d8dd1fc0aafd'                |
|                                     | id='aac78f41-09f9-4eb5-ba7f-e43e29aeb936'                |
|                                     | id='ad74261f-0077-4f5e-befb-cecfd49df2df'                |
|                                     | id='b40f9c45-4d0e-4e52-954d-f7b3e02ecd69'                |
|                                     | id='b4a15652-d8a8-4cb2-9484-6280f2c541d0'                |
|                                     | id='c188c8b6-74ba-41bf-b59b-39fe9a2cac27'                |
|                                     | id='ce7ccca6-a592-4676-9d76-424339cdb316'                |
|                                     | id='e6e86911-f426-4a34-9170-8a8ecd200022'                |
|                                     | id='ecb95982-f1c5-49a9-80c9-bb6b724dc75b'                |
+-------------------------------------+----------------------------------------------------------+

2.) Status of the attached volumes, listed as volume ID followed by its status.
(Captured by the customer running: #openstack volume show <vol_id> -c status)

019b0d9f-3a9c-4e7c-8454-a71de0516687   in-use
21953828-ffd9-4803-85d1-bdd7cb48964b   in-use
25e88ac0-8412-451c-8390-97af19fbfd3f   in-use
269b5f53-0596-4b52-994a-bfd9e1d5d648   in-use
2754d281-72a8-40df-9d50-68ce2e1866e9   in-use
2a7ea684-ac04-400a-bcf9-cb7fb2e06a6c   detaching
36dc4404-dbe9-4365-adb3-6bbe3bbe5276   in-use
39cfe83d-5dee-4401-b3cd-6d322f21c594   in-use
3c91e033-cd1d-43f8-b5c9-388048270f52   in-use
471e8d0a-d36b-44c0-8dd9-9844306f5bb5   in-use
535f84c2-99aa-4cec-afb7-ec899cbb235e   in-use
54b611d4-eab4-489b-aa7d-8548969a53c5   in-use
71c487ce-fd0f-4f0a-b4bf-733cd3ac97e9   in-use
8350009c-963a-47c6-9a02-b438cdde4db6   in-use
8b8d22b8-c6b3-4e11-9d3e-d8dd1fc0aafd   in-use
aac78f41-09f9-4eb5-ba7f-e43e29aeb936   in-use
ad74261f-0077-4f5e-befb-cecfd49df2df   No volume with a name or ID of 'ad74261f-0077-4f5e-befb-cecfd49df2df' exists.
b40f9c45-4d0e-4e52-954d-f7b3e02ecd69   in-use
b4a15652-d8a8-4cb2-9484-6280f2c541d0   in-use
c188c8b6-74ba-41bf-b59b-39fe9a2cac27   in-use
ce7ccca6-a592-4676-9d76-424339cdb316   in-use
e6e86911-f426-4a34-9170-8a8ecd200022   in-use
ecb95982-f1c5-49a9-80c9-bb6b724dc75b   in-use

#openstack server remove volume 89c9154b-7cbb-4af2-bb74-4a1047b999ac ad74261f-0077-4f5e-befb-cecfd49df2df
No volume with a name or ID of 'ad74261f-0077-4f5e-befb-cecfd49df2df' exists.

3.) Stale volume information from the database

MariaDB [cinder]> SELECT a.id,a.instance_uuid,a.volume_id,b.id,b.status,b.deleted
    FROM nova.block_device_mapping a
    LEFT JOIN cinder.volumes b ON a.volume_id = b.id
    WHERE a.source_type = 'volume' AND a.destination_type = 'volume'
    AND NOT a.deleted AND (b.deleted OR b.deleted is NULL);
+---------+--------------------------------------+--------------------------------------+--------------------------------------+---------+---------+
| id      | instance_uuid                        | volume_id                            | id                                   | status  | deleted |
+---------+--------------------------------------+--------------------------------------+--------------------------------------+---------+---------+
| 6333993 | 89c9154b-7cbb-4af2-bb74-4a1047b999ac | ad74261f-0077-4f5e-befb-cecfd49df2df | ad74261f-0077-4f5e-befb-cecfd49df2df | deleted |       1 |
+---------+--------------------------------------+--------------------------------------+--------------------------------------+---------+---------+

4.) Chronology of events from the logs

1.) Volume creation

/var/log/containers/cinder/cinder-volume.log:
2021-06-25 15:41:22.719 121 INFO cinder.volume.flows.manager.create_volume [req-9acd9c92-2c5f-4289-b232-98bd5fef60cb 7ef6a196394a43488a28d51261676929 30b9e06a62cb4374bfa0bf27ee55ec8d - default 52e00f437b5b4254a81a201277542e6d] Volume volume-ad74261f-0077-4f5e-befb-cecfd49df2df (ad74261f-0077-4f5e-befb-cecfd49df2df): created successfully
2.) Volume attach, detach

/var/log/containers/nova/nova-compute.log:
2021-06-25 15:42:14.483 7 INFO nova.compute.manager [req-c63dc044-7c9d-4a99-8ec3-536a9ae0f650 7ef6a196394a43488a28d51261676929 30b9e06a62cb4374bfa0bf27ee55ec8d - default 52e00f437b5b4254a81a201277542e6d] [instance: 89c9154b-7cbb-4af2-bb74-4a1047b999ac] Attaching volume ad74261f-0077-4f5e-befb-cecfd49df2df to /dev/sdu
2021-06-25 15:44:28.152 7 INFO nova.compute.manager [req-e9bbf652-3016-425e-bcd2-a683fd86ece1 7ef6a196394a43488a28d51261676929 30b9e06a62cb4374bfa0bf27ee55ec8d - default 52e00f437b5b4254a81a201277542e6d] [instance: 89c9154b-7cbb-4af2-bb74-4a1047b999ac] Detaching volume ad74261f-0077-4f5e-befb-cecfd49df2df
2021-06-25 15:44:52.442 7 ERROR nova.volume.cinder [req-e9bbf652-3016-425e-bcd2-a683fd86ece1 7ef6a196394a43488a28d51261676929 30b9e06a62cb4374bfa0bf27ee55ec8d - default 52e00f437b5b4254a81a201277542e6d] Delete attachment failed for attachment 1cedf0e9-93e2-4b60-9954-739e63f50519. Error: Volume ad74261f-0077-4f5e-befb-cecfd49df2df could not be found. (HTTP 404) (Request-ID: req-6319dabd-67c4-4dd4-9637-06350776df28) Code: 404: cinderclient.exceptions.NotFound: Volume ad74261f-0077-4f5e-befb-cecfd49df2df could not be found. (HTTP 404) (Request-ID: req-6319dabd-67c4-4dd4-9637-06350776df28)

3.) Volume deletion

/var/log/containers/cinder/cinder-api.log.1:
2021-06-25 15:44:46.201 22 INFO cinder.api.openstack.wsgi [req-0089762c-f983-4319-aeba-0fc9f2b7d826 7ef6a196394a43488a28d51261676929 30b9e06a62cb4374bfa0bf27ee55ec8d - default 52e00f437b5b4254a81a201277542e6d] DELETE https://iaas.aa.bb:13776/v3/30b9e06a62cb4374bfa0bf27ee55ec8d/volumes/ad74261f-0077-4f5e-befb-cecfd49df2df
2021-06-25 15:44:46.201 22 INFO cinder.api.v3.volumes [req-0089762c-f983-4319-aeba-0fc9f2b7d826 7ef6a196394a43488a28d51261676929 30b9e06a62cb4374bfa0bf27ee55ec8d - default 52e00f437b5b4254a81a201277542e6d] Delete volume with id: ad74261f-0077-4f5e-befb-cecfd49df2df
2021-06-25 15:44:46.201 22 INFO cinder.api.v3.volumes [req-0089762c-f983-4319-aeba-0fc9f2b7d826 7ef6a196394a43488a28d51261676929 30b9e06a62cb4374bfa0bf27ee55ec8d - default 52e00f437b5b4254a81a201277542e6d] Delete volume with id: ad74261f-0077-4f5e-befb-cecfd49df2df
2021-06-25 15:44:46.225 22 INFO cinder.volume.api [req-0089762c-f983-4319-aeba-0fc9f2b7d826 7ef6a196394a43488a28d51261676929 30b9e06a62cb4374bfa0bf27ee55ec8d - default 52e00f437b5b4254a81a201277542e6d] Volume info retrieved successfully.
2021-06-25 15:44:46.241 22 INFO cinder.volume.api [req-0089762c-f983-4319-aeba-0fc9f2b7d826 7ef6a196394a43488a28d51261676929 30b9e06a62cb4374bfa0bf27ee55ec8d - default 52e00f437b5b4254a81a201277542e6d] Delete volume request issued successfully.
2021-06-25 15:44:46.241 22 INFO cinder.api.openstack.wsgi [req-0089762c-f983-4319-aeba-0fc9f2b7d826 7ef6a196394a43488a28d51261676929 30b9e06a62cb4374bfa0bf27ee55ec8d - default 52e00f437b5b4254a81a201277542e6d] https://iaas.aa.bb:13776/v3/30b9e06a62cb4374bfa0bf27ee55ec8d/volumes/ad74261f-0077-4f5e-befb-cecfd49df2df returned with HTTP 202

5.) sos reports of the controllers and of the compute node running the instance. The logs above were taken from these sos reports. sos reports collected after debug was enabled are also available, but the needed logs are missing from them. I have asked the customer to set log retention to one week and to share the debug logs once the issue reoccurs.
drwxrwxrwx+ 3 yank yank 73 Jun 25 15:07 0180-sosreport-ctrl-xx-02968407-2021-06-25-ftvupwr.tar.xz
drwxrwxrwx+ 3 yank yank 73 Jun 25 15:07 0170-sosreport-ctrl-yy-02968407-2021-06-25-uxyaryr.tar.xz
drwxrwxrwx+ 3 yank yank 73 Jun 25 15:07 0160-sosreport-ctrl-zz-02968407-2021-06-25-wpgvykt.tar.xz
drwxrwxrwx+ 3 yank yank 76 Jun 25 15:24 0210-sosreport-compute-aa-02968407-2021-06-25-xbqaicy.tar.xz

Actual results:
The volume is deleted before it has been detached from the instance.

Expected results:
The volume should only be deleted after it has been detached from the instance.
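A rough scripted approximation of the create/attach/detach/delete workflow, for anyone trying to reproduce the window outside of cinder-csi, might look like the sketch below. It uses the openstacksdk cloud layer; the cloud name, instance name and volume name are placeholders, and this is not the customer's actual cinder-csi code, it only mimics the ordering of calls (a detach immediately followed by a delete).

import threading

import openstack

conn = openstack.connect(cloud="overcloud")        # assumes a clouds.yaml entry
server = conn.get_server("repro-instance")         # pre-existing test instance (placeholder)
volume = conn.create_volume(size=1, name="repro-vol", wait=True)
conn.attach_volume(server, volume, wait=True)

def detach():
    # Equivalent of the Nova-side detach that cinder-csi triggers (request R1).
    conn.detach_volume(server, volume, wait=True)

def delete():
    # Equivalent of the follow-up volume delete (request R2), issued while the
    # detach is still in flight.
    conn.delete_volume(volume.id, wait=False)

t1 = threading.Thread(target=detach)
t2 = threading.Thread(target=delete)
t1.start(); t2.start()
t1.join(); t2.join()
# On an affected deployment this can leave a nova.block_device_mapping row
# pointing at a volume that no longer exists in Cinder (the stale volume).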
After looking at the logs and the Cinder code, this seems to be a race condition with Cinder-CSI requests caused by an internal Cinder issue.

Some context on the issue:

- Cinder API uses the volume "status" field as a locking mechanism to prevent concurrent request processing on the same volume.
- Most Cinder operations are asynchronous, so the API returns before the operation has been completed by the cinder-volume service, but the attachment operations (creating/updating/deleting an attachment) are synchronous, so the API only returns to the caller after the cinder-volume service has completed the operation.
- Our current code **incorrectly** modifies the status of the volume both on the cinder-volume and on the cinder-api service during the attachment delete operation.

The actual sequence of events that leads to the issue reported in this BZ is:

[OpenShift]
- Cinder-CSI requests Nova to detach the volume (Request R1)

[Nova]
- R1: Asks cinder-api to delete the attachment and **waits**

[Cinder-API]
- R1: Checks the status of the volume
- R1: Sends the terminate connection request (R1) to cinder-volume and **waits**

[Cinder-Volume]
- R1: Asks the driver to terminate the connection
- R1: The driver asks the backend to unmap and unexport the volume
- R1: The status of the volume is changed in the DB to "available"

[OpenShift]
- Cinder-CSI asks Cinder to delete the volume (Request R2)

[Cinder-API]
- R2: Checks that the volume's status is valid. It is "available", so it can be deleted.
- R2: Tells cinder-volume to delete the volume and returns immediately.

[Cinder-Volume]
- R2: The volume is deleted and its DB entry is removed
- R1: Finishes the termination of the connection

[Cinder-API]
- R1: Now that cinder-volume has finished the termination, the code continues
- R1: Tries to modify the volume in the DB
- R1: The DB layer raises VolumeNotFound since the volume has been deleted from the DB
- R1: VolumeNotFound is converted to an HTTP 404 status code, which is returned to Nova

[Nova]
- R1: Cinder responds with 404 to the attachment delete request
- R1: Nova leaves the volume as attached, since the attachment delete failed

At this point the Cinder and Nova DBs are out of sync, because Nova thinks that the attachment is connected while Cinder has detached the volume and even deleted it.

**This is caused by a Cinder bug**, but there is some robustification work that could be done in Nova: the volume could be left in a "detached from instance" state (since the os-brick call succeeded), and a second detach request could then skip the os-brick call and, when it sees that the volume or the attachment no longer exists in Cinder, proceed to remove it from the instance's XML.

This BZ will only focus on fixing the Cinder issue.
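To make the ordering concrete, here is a small self-contained simulation of the race. It is plain Python with a toy in-memory "volumes table", not Cinder code; it only illustrates how releasing the status "lock" before the synchronous attachment delete finishes lets a concurrent delete slip in.

import threading
import time

db = {"vol-1": {"status": "in-use"}}     # toy stand-in for the cinder.volumes table
db_lock = threading.Lock()

def attachment_delete(vol_id):
    # R1: Nova -> cinder-api -> cinder-volume (synchronous)
    with db_lock:
        db[vol_id]["status"] = "available"   # BUG: status released before R1 is done
    time.sleep(0.5)                          # driver still terminating the connection
    with db_lock:
        if vol_id not in db:                 # row already removed by R2
            print(f"R1: VolumeNotFound for {vol_id} -> HTTP 404 back to Nova")
            return
        db[vol_id]["attach_status"] = "detached"

def volume_delete(vol_id):
    # R2: Cinder-CSI's follow-up delete, arriving while R1 is still running
    time.sleep(0.1)
    with db_lock:
        if db.get(vol_id, {}).get("status") == "available":
            del db[vol_id]                   # passes the status check, row removed
            print(f"R2: volume {vol_id} deleted")

r1 = threading.Thread(target=attachment_delete, args=("vol-1",))
r2 = threading.Thread(target=volume_delete, args=("vol-1",))
r1.start(); r2.start()
r1.join(); r2.join()
# Output: R2 deletes the volume, then R1 reports VolumeNotFound, matching the
# 404 that nova-compute logged in the description above.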
I've raised https://github.com/kubernetes/cloud-provider-openstack/issues/1645 upstream as I honestly believe this is something that needs to be addressed in the cinder-csi plugin in k8s.
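Purely as an illustration of that idea, and not necessarily what the upstream issue will end up implementing (the real plugin is Go code using gophercloud), a client-side "only delete once the detach has actually completed" helper could look roughly like this Python sketch; the function name and polling parameters are made up:

import time

import openstack

def delete_after_detach(conn, server, volume_id, timeout=120, interval=2):
    """Delete the volume only once Nova no longer reports it attached."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        attached = {a.volume_id for a in conn.compute.volume_attachments(server)}
        if volume_id not in attached:
            conn.delete_volume(volume_id)
            return
        time.sleep(interval)
    raise TimeoutError(f"{volume_id} still attached to {server.id} after {timeout}s")

# Usage (placeholder names):
#   conn = openstack.connect(cloud="overcloud")
#   delete_after_detach(conn, conn.get_server("inst1"), "<volume id>")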
Verification flow:

- version: openstack-cinder-15.4.0-1.20221003203219.58f0e73.el8ost.noarch
- Ceph-backed deployment
- Booted an instance with an attached volume

(overcloud) [stack@undercloud-0 ~]$ cinder list
+--------------------------------------+--------+--------------+------+-------------+----------+--------------------------------------+
| ID                                   | Status | Name         | Size | Volume Type | Bootable | Attached to                          |
+--------------------------------------+--------+--------------+------+-------------+----------+--------------------------------------+
| 5644fc51-1ae3-4ecc-b8a7-42ac077e6725 | in-use | Pansible_vol | 1    | tripleo     | true     | 64655a8f-72a3-4380-8960-c0c93c14dc48 |
+--------------------------------------+--------+--------------+------+-------------+----------+--------------------------------------+

As this was already verified in a newer version, I followed the flow from bug 2057002#c6.

Delete loop in 3 parallel sessions:

(overcloud) [stack@undercloud-0 ~]$ while true; do cinder delete 5621aeae-99c2-49c9-a740-c6e3f50bec79; done
"""
Delete for volume 5644fc51-1ae3-4ecc-b8a7-42ac077e6725 failed: Invalid volume: Volume status must be available or error or error_restoring or error_extending or error_managing and must not be migrating, attached, belong to a group, have snapshots or be disassociated from snapshots after volume transfer. (HTTP 400) (Request-ID: req-50f74944-e015-4fa5-b4b4-dc19f5dd52d8)
ERROR: Unable to delete any of the specified volumes.
"""

Detach command:

(overcloud) [stack@undercloud-0 ~]$ nova volume-detach inst1 5621aeae-99c2-49c9-a740-c6e3f50bec79
"""
Request to delete volume 5644fc51-1ae3-4ecc-b8a7-42ac077e6725 has been accepted.
Delete for volume 5644fc51-1ae3-4ecc-b8a7-42ac077e6725 failed: No volume with a name or ID of '5644fc51-1ae3-4ecc-b8a7-42ac077e6725' exists.
ERROR: Unable to delete any of the specified volumes.
"""

Cinder list after the deletion:

(overcloud) [stack@undercloud-0 ~]$ cinder list
+----+--------+------+------+-------------+----------+-------------+
| ID | Status | Name | Size | Volume Type | Bootable | Attached to |
+----+--------+------+------+-------------+----------+-------------+
+----+--------+------+------+-------------+----------+-------------+

I ran this same cycle several times; in every cycle the volume was successfully detached and only then deleted, as expected.

Moving to 'Verified'.
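For anyone re-running this verification, the same cycle can be scripted. The sketch below just drives the CLI commands shown above; the volume ID and instance name are placeholders taken from this run, and the overcloudrc credentials are assumed to be sourced in the environment.

import subprocess
import threading

VOLUME = "5644fc51-1ae3-4ecc-b8a7-42ac077e6725"   # placeholder volume ID
SERVER = "inst1"                                  # placeholder instance name
stop = threading.Event()

def delete_loop():
    # Mirrors: while true; do cinder delete <volume>; done
    while not stop.is_set():
        subprocess.run(["cinder", "delete", VOLUME], capture_output=True)

worker = threading.Thread(target=delete_loop)
worker.start()
subprocess.run(["nova", "volume-detach", SERVER, VOLUME])   # concurrent detach
stop.set()
worker.join()
# With the fix, the deletes only succeed after the detach completes, so the
# volume never disappears while it is still attached.
subprocess.run(["cinder", "list"])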
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.9 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:8795