Bug 1383040

Summary: [Ceph] Failed to delete a volume when running tempest tests
Product: Red Hat OpenStack
Reporter: lkuchlan <lkuchlan>
Component: openstack-cinder
Assignee: Jon Bernard <jobernar>
Status: CLOSED CURRENTRELEASE
QA Contact: Tzach Shefi <tshefi>
Severity: unspecified
Docs Contact:
Priority: high
Version: 10.0 (Newton)
CC: dsariel, egafford, eharney, jobernar, lkuchlan, pgrist, srevivo, yfried
Target Milestone: ---
Keywords: Automation, AutomationBlocker, Triaged, ZStream
Target Release: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-04-06 17:25:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description lkuchlan 2016-10-09 09:25:09 UTC
Description of problem:
Deleting a volume that was created from a snapshot times out when running
tempest tests against a Ceph backend

How reproducible:
100%

Steps to Reproduce:

Run tempest tests:
1. testr init
2. testr run tempest.api.volume.test_volumes_snapshots.VolumesV1SnapshotTestJSON.test_volume_from_snapshot
3. testr run tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern


Actual results:
Deleting the volume snapshot times out when using a Ceph backend

Captured traceback:
~~~~~~~~~~~~~~~~~~~
Traceback (most recent call last):
  File "tempest/lib/common/utils/test_utils.py", line 84, in call_and_ignore_notfound_exc
    return func(*args, **kwargs)
  File "tempest/lib/common/rest_client.py", line 864, in wait_for_resource_deletion
    raise exceptions.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: (VolumesV1SnapshotTestJSON:_run_cleanups) Failed to delete volume-snapshot 09b80e0f-8598-4bb1-a823-c30cecb4fd03 within the required time (196 s).

Expected results:
Volume snapshot should be deleted successfully 

Additional info:

{1} tempest.api.volume.test_volumes_snapshots.VolumesV1SnapshotTestJSON.test_volume_from_snapshot [200.510540s] ... FAILED

This is the only information I have related to this test:
http://logs.openstack.org/62/372062/10/check/gate-tempest-dsvm-full-devstack-plugin-ceph-ubuntu-xenial/6468ae2/console.html

https://projects.engineering.redhat.com/browse/RHOSINFRA-313

Comment 1 yfried 2016-10-09 11:48:22 UTC
This also fails both versions of scenario/test_volume_boot_pattern/test_volume_boot_pattern on cleanup:

Traceback (most recent call last):
  File "/root/tempest-dir/tempest/lib/common/rest_client.py", line 864, in wait_for_resource_deletion
    raise exceptions.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: (TestVolumeBootPatternV2:_run_cleanups) Failed to delete volume be27b1ce-fec1-4831-a0fc-91c6fe66d9e1 within the required time (300 s).

https://rhos-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/Nightly/job/qe-nightly-8_director-rhel-7.2-virthost-1cont_1comp-ipv4-vxlan-ceph-external/105/

The test cleans up a volume that has a server and a snapshot attached to it. Even though the server and snapshot are already deleted in the OpenStack (Nova and Cinder) databases, Ceph takes longer to clear its "watchers" on the attached volume, so when the Cinder delete request reaches the backend, Ceph refuses to delete the volume. Meanwhile, Cinder's delete API is asynchronous, so it returns 202 for the REST delete request.
The result is that the user (Tempest) believes the resource (volume) is being deleted and waits for it to disappear (looping on GET requests) until the timeout is reached and the test fails.
In my opinion, Cinder should set the volume status to ERROR when the backend refuses the DELETE (the same as Nova does). A delete retry loop would also be a nice-to-have, if one doesn't exist already.
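The retry loop suggested above could look roughly like the sketch below. This is not real Cinder or Tempest code; `client` is a hypothetical object whose `delete_volume()` returns True once the backend (here, Ceph releasing its watchers) accepts the delete, and False while it still refuses:

```python
import time


def delete_with_retry(client, volume_id, attempts=5, interval=1.0):
    """Retry a volume delete until the backend accepts it.

    Returns True if the delete was accepted within `attempts` tries,
    False otherwise (at which point the caller could mark the volume
    ERROR instead of leaving it apparently 'deleting' forever).
    """
    for _ in range(attempts):
        # Hypothetical call: True = backend accepted, False = refused
        # (e.g. Ceph watchers not yet cleared on the attached image).
        if client.delete_volume(volume_id):
            return True
        time.sleep(interval)
    return False
```

Under the theory in this comment, a few retries spaced a second or two apart would usually be enough for Ceph to drop the stale watchers, instead of the current behavior where a single refused delete leaves Tempest polling until its timeout.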

Comment 2 Paul Grist 2016-10-10 15:32:55 UTC
Jon, can you take a look and see whether the launchpad bug here should be linked to the patch that looks like it may be the issue: https://review.openstack.org/#/c/281550/

Comment 3 Jon Bernard 2016-10-10 15:46:14 UTC
The theory described here is different from what the posted patch addresses, so there may be another issue. I'm not sure the volume status could be changed in that circumstance, so the solution may lie in tempest; I will look closer.

Comment 4 Elise Gafford 2016-11-01 16:55:34 UTC
No recent progress on this issue. Moving to RHOS 11 for triage.

Comment 7 Paul Grist 2017-04-06 17:25:06 UTC
This is no longer reproducing, so closing it out.  Thanks for the update.