Bug 1859370

Summary: Retype of RBD snapshot volume is failing
Product: Red Hat OpenStack
Component: openstack-cinder
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Reporter: James Parker <jparker>
Assignee: Cinder Bugs List <cinder-bugs>
QA Contact: Tzach Shefi <tshefi>
Docs Contact: Chuck Copello <ccopello>
CC: abishop, gcharot, ltoscano, lyarwood, senrique
Keywords: Triaged
Target Milestone: ---
Target Release: ---
Doc Type: If docs needed, set a value
Last Closed: 2021-04-01 09:48:23 UTC
Type: Bug

Description James Parker 2020-07-21 20:16:17 UTC
Description of problem:
RBD retype is failing when running tempest.api.volume.admin.test_volume_retype.VolumeRetypeWithMigrationTest.test_volume_from_snapshot_retype_with_migration from [1]. This is a multi-backend deployment consisting of RBD, NFS, NetApp, and iSCSI backends. Whenever the test attempts to retype an RBD volume to another backend, the testcase fails; retyping from another backend to RBD shows no issues. The results below are from RBD to NFS.

(overcloud) [stack@undercloud-0 tempest_workspace]$ tempest run --serial --regex tempest.api.volume.admin.test_volume_retype
{0} tempest.api.volume.admin.test_volume_retype.VolumeRetypeWithMigrationTest.test_available_volume_retype_with_migration [27.072103s] ... ok
{0} tempest.api.volume.admin.test_volume_retype.VolumeRetypeWithMigrationTest.test_volume_from_snapshot_retype_with_migration [304.284310s] ... FAILED

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    b'Traceback (most recent call last):'
    b'  File "/usr/lib/python3.6/site-packages/tempest/api/volume/admin/test_volume_retype.py", line 142, in test_volume_from_snapshot_retype_with_migration'
    b'    src_vol = self._create_volume_from_snapshot()'
    b'  File "/usr/lib/python3.6/site-packages/tempest/api/volume/admin/test_volume_retype.py", line 67, in _create_volume_from_snapshot'
    b"    self.snapshots_client.wait_for_resource_deletion(snapshot['id'])"
    b'  File "/usr/lib/python3.6/site-packages/tempest/lib/common/rest_client.py", line 899, in wait_for_resource_deletion'
    b'    raise exceptions.TimeoutException(message)'
    b'tempest.lib.exceptions.TimeoutException: Request timed out'
    b'Details: (VolumeRetypeWithMigrationTest:test_volume_from_snapshot_retype_with_migration) Failed to delete volume-snapshot 81982518-e552-4f6e-a805-0647e9ea2cbf within the required time (300 s).'
    b''


Version-Release number of selected component (if applicable):
16.1

How reproducible:
100% reproducible

Steps to Reproduce:
1. Create a multi-backend deployment consisting of RBD backend and another backend
2. Execute tempest test tempest.api.volume.admin.test_volume_retype.VolumeRetypeWithMigrationTest.test_volume_from_snapshot_retype_with_migration

Actual results:
The testcase times out when attempting to clean up the snapshot.

Expected results:
The test should successfully retype the volume created from the snapshot from RBD to the destination backend.


Additional info:

[1] https://github.com/openstack/tempest/blob/master/tempest/api/volume/admin/test_volume_retype.py#L141

Comment 1 Alan Bishop 2020-07-21 20:40:25 UTC
We need to see the cinder logs (with DEBUG).

Comment 2 Alan Bishop 2020-07-22 14:12:18 UTC
I reviewed the logs (thanks for saving them on the hypervisor, James!), and see this in the cinder-volume log:

2020-07-21 14:07:15.071 79 INFO cinder.volume.drivers.rbd [req-a624464a-f0f3-41e7-b52e-54308ee5fccc 38438ec8d3d44978b417b1153327b587 8591c28be4224f9cbb6fb59556b50db8 - default default] Image volumes/volume-c5b26f69-87bc-494f-8d4f-9ea03c3a304d is dependent on the snapshot snapshot-81982518-e552-4f6e-a805-0647e9ea2cbf.
2020-07-21 14:07:15.079 79 ERROR cinder.volume.manager [req-a624464a-f0f3-41e7-b52e-54308ee5fccc 38438ec8d3d44978b417b1153327b587 8591c28be4224f9cbb6fb59556b50db8 - default default] Delete snapshot failed, due to snapshot busy.: cinder.exception.SnapshotIsBusy: deleting snapshot snapshot-81982518-e552-4f6e-a805-0647e9ea2cbf that has dependent volumes

This occurs because tempest is using this [1] sequence to create the volume it plans to retype.

[1] https://github.com/openstack/tempest/blob/6cb37d68b2cb40cec9dcbb9e26c0649c6e6c877a/tempest/api/volume/admin/test_volume_retype.py#L61-L67

The tempest test fails because the snapshot cannot be deleted, and this happens before the actual migration/retype is even attempted. The snapshot cannot be deleted because the RBD driver creates a fast COW clone of it, and that clone creates a dependency on the snapshot that prevents its deletion.
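The failure mode can be illustrated with a minimal sketch (hypothetical class and method names, not the actual cinder RBD driver code): a snapshot that still has dependent COW clones refuses deletion, which is the SnapshotIsBusy error seen in the cinder-volume log.

```python
# Minimal illustration (hypothetical, not the real cinder RBD driver):
# an RBD snapshot with dependent COW clones cannot be deleted.

class SnapshotIsBusy(Exception):
    """Raised when a snapshot still has dependent clone volumes."""

class FakeRbdBackend:
    def __init__(self):
        # snapshot name -> list of dependent clone volume names
        self.clones = {}

    def clone_from_snapshot(self, snap, volume):
        # A fast COW clone keeps a reference to its parent snapshot.
        self.clones.setdefault(snap, []).append(volume)

    def delete_snapshot(self, snap):
        # Mirrors the driver's busy check: deletion is refused while
        # any clone still depends on the snapshot.
        if self.clones.get(snap):
            raise SnapshotIsBusy(
                f"deleting snapshot {snap} that has dependent volumes")
        self.clones.pop(snap, None)

backend = FakeRbdBackend()
backend.clone_from_snapshot("snap-a", "vol-b")
try:
    backend.delete_snapshot("snap-a")
except SnapshotIsBusy as exc:
    print(exc)  # the deletion tempest then waits on until it times out
```

Tempest keeps polling for the snapshot to disappear, so the busy error surfaces as the 300-second wait_for_resource_deletion timeout rather than an immediate failure.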

One solution is to configure the RBD driver with rbd_flatten_volume_from_snapshot=True, but a better solution is to rework the tempest test to defer deleting the snapshot until after the retype operation completes.
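For reference, that workaround would be set in the RBD backend section of cinder.conf; a sketch, with an illustrative backend section name:

```ini
# cinder.conf -- the section name "tripleo_ceph" is illustrative
[tripleo_ceph]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
# Flatten volumes created from snapshots so they no longer depend on
# the parent snapshot (allowing the snapshot to be deleted), at the
# cost of losing the fast-COW-clone behavior.
rbd_flatten_volume_from_snapshot = true
```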

Unless others object, I think this should be handled as a tempest bug.

Comment 3 Alan Bishop 2020-07-22 15:21:03 UTC
Ignore my previous comment about this being a tempest bug. Apparently the RBD driver is *not* supposed to behave this way, and the rbd_flatten_volume_from_snapshot parameter is not intended to address the behavior.

There are other open BZs covering this problem (e.g. bug #1437392), and the cinder squad needs to do some bz cleanup and determine a course of action.

Comment 4 Luigi Toscano 2021-04-01 09:48:23 UTC
This is going to be addressed in OSP 16.2 thanks to the use of the RBD Clone v2 API. Please see bug 1764324.

*** This bug has been marked as a duplicate of bug 1764324 ***