Bug 1383040 - [Ceph] Failed to delete a volume when running tempest tests
Summary: [Ceph] Failed to delete a volume when running tempest tests
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-cinder
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: ---
Target Release: 11.0 (Ocata)
Assignee: Jon Bernard
QA Contact: Tzach Shefi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-10-09 09:25 UTC by lkuchlan
Modified: 2017-04-06 17:25 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-04-06 17:25:06 UTC
Target Upstream Version:




Links
System ID: Launchpad 1627510
Last Updated: 2016-10-09 09:25:09 UTC

Description lkuchlan 2016-10-09 09:25:09 UTC
Description of problem:
Deleting a volume that was created from a snapshot times out when running
tempest tests with a Ceph backend.

How reproducible:
100%

Steps to Reproduce:

Run tempest tests:
1. testr init
2. testr run tempest.api.volume.test_volumes_snapshots.VolumesV1SnapshotTestJSON.test_volume_from_snapshot
3. testr run tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern


Actual results:
Deleting the volume snapshot times out when using a Ceph backend.

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "tempest/lib/common/utils/test_utils.py", line 84, in call_and_ignore_notfound_exc
        return func(*args, **kwargs)
      File "tempest/lib/common/rest_client.py", line 864, in wait_for_resource_deletion
        raise exceptions.TimeoutException(message)
    tempest.lib.exceptions.TimeoutException: Request timed out
    Details: (VolumesV1SnapshotTestJSON:_run_cleanups) Failed to delete volume-snapshot 09b80e0f-8598-4bb1-a823-c30cecb4fd03 within the required time (196 s).
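
For context, the TimeoutException above comes from a fixed-interval polling loop: the client keeps asking the API whether the resource still exists until it disappears or the deadline passes. The sketch below is a simplified illustration of that behaviour, not tempest's actual implementation (which lives in tempest/lib/common/rest_client.py); the function name, timeout, and interval here are assumptions:

    import time

    class TimeoutException(Exception):
        pass

    def wait_for_resource_deletion(client, resource_id, timeout=196, interval=1):
        """Poll until the resource is gone or the timeout expires (illustrative only)."""
        start = time.time()
        # Keep asking the API whether the resource still exists.
        while not client.is_resource_deleted(resource_id):
            if time.time() - start >= timeout:
                raise TimeoutException(
                    "Failed to delete %s within the required time (%s s)."
                    % (resource_id, timeout))
            time.sleep(interval)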

Expected results:
The volume snapshot should be deleted successfully.

Additional info:

{1} tempest.api.volume.test_volumes_snapshots.VolumesV1SnapshotTestJSON.test_volume_from_snapshot [200.510540s] ... FAILED

This is the only information I have related to this test:
http://logs.openstack.org/62/372062/10/check/gate-tempest-dsvm-full-devstack-plugin-ceph-ubuntu-xenial/6468ae2/console.html

https://projects.engineering.redhat.com/browse/RHOSINFRA-313

Comment 1 yfried 2016-10-09 11:48:22 UTC
Both versions of scenario/test_volume_boot_pattern/test_volume_boot_pattern also fail on cleanup:

Traceback (most recent call last):
  File "/root/tempest-dir/tempest/lib/common/rest_client.py", line 864, in wait_for_resource_deletion
    raise exceptions.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: (TestVolumeBootPatternV2:_run_cleanups) Failed to delete volume be27b1ce-fec1-4831-a0fc-91c6fe66d9e1 within the required time (300 s).

https://rhos-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/Nightly/job/qe-nightly-8_director-rhel-7.2-virthost-1cont_1comp-ipv4-vxlan-ceph-external/105/

The test is cleaning up a volume that had a server and a snapshot attached to it. Even though the server and snapshot have been deleted in the OpenStack (Nova and Cinder) databases, Ceph takes longer to clear its "watchers" on the attached volume, so when the Cinder delete request reaches the backend, the backend refuses to delete the volume. However, the Cinder delete API is asynchronous, so it returns 202 to the REST delete request anyway.
The result is that the user (Tempest) believes the resource (volume) is being deleted and waits for it to disappear (looping on GET requests) until the timeout is reached and the test fails.
In my opinion, Cinder should set the volume status to ERROR when the DELETE is refused by the backend (as Nova does). A delete retry loop would also be nice to have, if there isn't one already.
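
To make the suggestion concrete, here is a rough sketch of the kind of retry-then-error handling described above, on the backend-delete side. All names here (ImageBusyError, delete_rbd_image, set_volume_status) are hypothetical placeholders for illustration only, not Cinder's actual RBD driver API:

    import time

    class ImageBusyError(Exception):
        """Placeholder: the backend refuses to delete an image that still has watchers."""

    def delete_volume_with_retry(volume, delete_rbd_image, set_volume_status,
                                 retries=3, interval=5):
        """Retry the backend delete while Ceph still reports watchers (illustrative)."""
        for attempt in range(retries):
            try:
                delete_rbd_image(volume)
                set_volume_status(volume, 'deleted')
                return
            except ImageBusyError:
                # Ceph has not cleared its watchers yet; wait and try again.
                time.sleep(interval)
        # The backend still refuses the delete after all retries: surface the
        # failure instead of letting clients poll a 'deleting' volume forever.
        set_volume_status(volume, 'error_deleting')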

Comment 2 Paul Grist 2016-10-10 15:32:55 UTC
Jon, can you take a look and see whether the Launchpad bug here should be linked to the patch that looks like it may be the issue: https://review.openstack.org/#/c/281550/

Comment 3 Jon Bernard 2016-10-10 15:46:14 UTC
The theory described here is different from what the posted patch addresses, so there might be another issue. I'm not sure I could change the volume status in that circumstance, so the solution may lie in Tempest; I will look closer.

Comment 4 Elise Gafford 2016-11-01 16:55:34 UTC
No recent progress on this issue. Moving to RHOS 11 for triage.

Comment 7 Paul Grist 2017-04-06 17:25:06 UTC
This is no longer reproducing, so closing it out.  Thanks for the update.

