Bug 1481819 - Tempest cinder volume backup tests fail when using rados gateway as backup target.
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-ceph
Version: 11.0 (Ocata)
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: zstream
Target Release: 11.0 (Ocata)
Assignee: Keith Schincke
QA Contact: Tzach Shefi
URL:
Whiteboard:
Duplicates: 1518875 (view as bug list)
Depends On:
Blocks: 1356451 1401639 1574659
 
Reported: 2017-08-15 19:32 UTC by David Paterson
Modified: 2020-01-21 19:42 UTC
CC List: 51 users

Fixed In Version: puppet-ceph-2.4.1-4.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1518875 1574659 (view as bug list)
Environment:
Last Closed: 2018-05-04 17:41:50 UTC
Target Upstream Version:
Embargoed:


Attachments
Cinder Backup Test Failures (164.03 KB, text/plain)
2017-08-30 13:35 UTC, David Paterson
tempest.conf from certification environment (5.56 KB, text/plain)
2017-09-01 14:38 UTC, David Paterson
Full test of gerrit patch 534390 (9.80 KB, text/plain)
2018-01-19 13:08 UTC, Keith Schincke


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1743602 0 None None None 2018-01-16 16:51:52 UTC
OpenStack gerrit 534390 0 None MERGED Added rgw_keystone_implicit_tenants to ceph::rgw::keystone 2020-10-13 17:09:58 UTC
OpenStack gerrit 536956 0 None MERGED Added rgw_keystone_implicit_tenants to ceph::rgw::keystone 2020-10-13 17:09:47 UTC
Red Hat Bugzilla 1262104 0 unspecified CLOSED [RFE] Configure Ceph RGW as Cinder Backup target in RHEL-OSP director 2021-02-22 00:41:40 UTC

Internal Links: 1262104

Description David Paterson 2017-08-15 19:32:30 UTC
Description of problem: Tempest volume backup tests fail when using RGW as a backup target.


Version-Release number of selected component (if applicable):


How reproducible:
Every time; this is causing our cinder driver certification tests to fail.

Steps to Reproduce:
1. Set up the cinder backup service to use Swift as the backup target, with RGW acting as a drop-in replacement for Swift.

Note: cinder.conf has backup_swift_container = testbackup1

2. Run tempest volume backup tests.

Actual results: All but the first test that uses the cinder backup service fail; the exception logged is 409 BucketExists. RGW returns this error when you try to create a bucket that already exists.

Expected results: The first tempest test that uses the cinder backup service should create the container specified in cinder.conf, and the test should pass.

The remaining backup tests should then reuse the existing container and pass, instead of failing while trying to create a duplicate container.


Additional info:
cinder.conf section that applies to Swift as the backup target:

# The URL of the Swift endpoint (uri value)
#backup_swift_url = <None>

# The URL of the Keystone endpoint (uri value)
#backup_swift_auth_url = <None>

# Info to match when looking for swift in the service catalog. Format is:
# separated values of the form: <service_type>:<service_name>:<endpoint_type> -
# Only used if backup_swift_url is unset (string value)
#swift_catalog_info = object-store:swift:publicURL

# Info to match when looking for keystone in the service catalog. Format is:
# separated values of the form: <service_type>:<service_name>:<endpoint_type> -
# Only used if backup_swift_auth_url is unset (string value)
#keystone_catalog_info = identity:Identity Service:publicURL

# Swift authentication mechanism (string value)
#backup_swift_auth = per_user

# Swift authentication version. Specify "1" for auth 1.0, or "2" for auth 2.0
# or "3" for auth 3.0 (string value)
#backup_swift_auth_version = 1

# Swift tenant/account name. Required when connecting to an auth 2.0 system
# (string value)
#backup_swift_tenant = <None>

# Swift user domain name. Required when connecting to an auth 3.0 system
# (string value)
#backup_swift_user_domain = <None>

# Swift project domain name. Required when connecting to an auth 3.0 system
# (string value)
#backup_swift_project_domain = <None>

# Swift project/account name. Required when connecting to an auth 3.0 system
# (string value)
#backup_swift_project = <None>

# Swift user name (string value)
#backup_swift_user = <None>

# Swift key for authentication (string value)
#backup_swift_key = <None>

# The default Swift container to use (string value)
backup_swift_container = testbackup1

# The size in bytes of Swift backup objects (integer value)
#backup_swift_object_size = 52428800

# The size in bytes that changes are tracked for incremental backups.
# backup_swift_object_size has to be multiple of backup_swift_block_size.
# (integer value)
#backup_swift_block_size = 32768

# The number of retries to make for Swift operations (integer value)
#backup_swift_retry_attempts = 3

# The backoff time in seconds between Swift retries (integer value)
#backup_swift_retry_backoff = 2

# Enable or Disable the timer to send the periodic progress notifications to
# Ceilometer when backing up the volume to the Swift backend storage. The
# default value is True to enable the timer. (boolean value)
#backup_swift_enable_progress_timer = true

# Location of the CA certificate file to use for swift client requests. (string
# value)
#backup_swift_ca_cert_file = <None>

# Bypass verification of server certificate when making SSL connection to
# Swift. (boolean value)
#backup_swift_auth_insecure = false

# Driver to use for backups. (string value)
#backup_driver = cinder.backup.drivers.swift
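For reference, a minimal active form of the configuration above might look like the following. This is a sketch: only backup_swift_container is confirmed by this report; the other line is the documented default shown above, written out explicitly.

```ini
[DEFAULT]
# Swift backup driver (the documented default)
backup_driver = cinder.backup.drivers.swift
# Container (bucket) shared by all backup tests
backup_swift_container = testbackup1
```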

Comment 1 David Paterson 2017-08-15 19:36:19 UTC
This functionality was addressed in: https://bugzilla.redhat.com/show_bug.cgi?id=1262104.  But I do not believe it was tested past creating a single backup in a single container.  It should pass a full tempest run.

Comment 2 Mike Orazi 2017-08-16 13:17:29 UTC
Unable to make a public read/write container in rgw.

Comment 4 Paul Grist 2017-08-16 19:36:18 UTC
Can you see if this tempest test is running in our RH downstream testing on OSP11?  Trying to understand if this is a general problem on OSP11 or something specific to certification.

Comment 7 arkady kanevsky 2017-08-21 19:48:45 UTC
Federico,
when was the last time Ceph team checked that current or previous Ceph version passes swift Tempest tests? Was it done for OSP11 tempest?
Thanks,
Arkady

Comment 10 Sean Merrow 2017-08-30 12:54:52 UTC
Hi David,

Engineering is looking for the following information. Can you please attach to the BZ?

[1.] Please provide the traceback and the names of the passed and failed tempest cinder backup tests.

[2.] Attach the tempest.conf to the bug.

[3.] Also please make sure the cinder backup service is up and running.

Thanks,
Sean

Comment 11 David Paterson 2017-08-30 13:32:41 UTC
The backup service was definitely running. Unfortunately I no longer have the tempest.conf; attached is the JSON with the failures.

Comment 12 David Paterson 2017-08-30 13:35:38 UTC
Created attachment 1320077 [details]
Cinder Backup Test Failures

Comment 15 David Paterson 2017-09-01 14:37:47 UTC
Attaching tempest.conf from a new deployment; the test results are identical.

Comment 16 David Paterson 2017-09-01 14:38:48 UTC
Created attachment 1320987 [details]
tempest.conf from certification environment

Comment 17 Yogev Rabl 2017-09-06 13:19:03 UTC
(In reply to Paul Grist from comment #13)
> I checked with dsariel and our default cinder back up tests are not
> configured for rgw.
> 
> yogev, do you know if this is a ceph test case or who to ping from ceph QA
> to answer arkady's question in #c7?

We can ask Benny Kopilov or Liron Kuchlani about it. Both of them are working in the storage DFG and on Tempest.

Comment 21 arkady kanevsky 2017-10-07 21:48:28 UTC
What is the progress on this BZ?

Comment 23 Keith Schincke 2017-10-10 22:46:25 UTC
I have deployed an OSP11/ocata overcloud on physical hardware and have been able to create a cinder volume and make a backup of the volume.

1) What backend is cinder configured to use?
2) What size of volume is used for the backup test?
3) from a controller, what is the output of 
  - rados lspools
  - rados ls -p images
  - rados ls -p default.rgw.buckets.data

4) What actions does tempest attempt when it is doing the cinder backup tests?
5) what user and role does tempest use for its tests?

Comment 24 Thiago da Silva 2017-10-11 14:02:00 UTC
re-assigning to ceph-rgw team

Comment 27 David Paterson 2017-10-11 14:39:32 UTC
1) What backend is cinder configured to use?
        By backend, are you referring to the block storage driver or the backup target?
        This behavior happens when using the Dell PS, Dell SC, or Ceph block storage drivers for cinder. 
        Also cinder is configured to use the swift backup driver which is the default:
            # Driver to use for backups. (string value)
            #backup_driver = cinder.backup.drivers.swift
 
2) What size of volume is used for the backup test?
        volume_size = 1, the default
3) from a controller, what is the output of 
  - rados lspools
        rbd
        .rgw.root
        default.rgw.control
        default.rgw.data.root
        default.rgw.gc
        default.rgw.log
        .rgw.buckets
        backups
        images
        manila_data
        manila_metadata
        metrics
        vms
        volumes
        default.rgw.users.uid
        default.rgw.buckets.index
        default.rgw.buckets.data

  - rados ls -p images
        rbd_object_map.1236ce73f9df.0000000000000005
        rbd_header.1235742cc55c8
        rbd_object_map.1236ce73f9df
        rbd_data.1236ce73f9df.0000000000000001
        rbd_data.1235742cc55c8.0000000000000001
        rbd_directory
        rbd_header.1236ce73f9df
        rbd_data.1236ce73f9df.0000000000000000
        rbd_id.e8e914db-272b-4261-8933-d70891acabac
        rbd_id.e6678fea-58b5-4137-9e9b-70df63524e6e
        rbd_object_map.1235742cc55c8.0000000000000004
        rbd_data.1235742cc55c8.0000000000000000
        rbd_object_map.1235742cc55c8
  - rados ls -p default.rgw.buckets.data
        returns nothing 
4) What actions does tempest attempt when it is doing the cinder backup tests?
From description field #2
         2. Run tempest volume backup tests.

            Actual results: All but the first test that uses the cinder backup service fails,  exception logged is 409 BucketExists.  This error is returned by RGW when you try to create a bucket that already exists.

            Expected results:  The first tempest test that uses the cinder backup service should create the container specified in cinder.conf, and all the rest of the tests should pass, reusing the existing container.

            Then the rest of the backup tests should utilize the existing container and pass instead of failing while trying to create a duplicate container.
5) what user and role does tempest use for its tests?
        [auth]
        tempest_roles = swiftoperator

        [object-storage]
        operator_role = swiftoperator

Comment 28 David Paterson 2017-10-11 14:45:02 UTC
Keith, have you tried to create more than one backup?  In my experience the first backup is created fine because the container (bucket) does not exist yet, so it is created and used.

Any future backups trying to use the same container from a different project (tenant) fail. It's important to note that tempest dynamically creates projects and users, and all backup tests try to use the same container.

Could you please try running the tempest backup tests in your environment with tempest set to use dynamic credentials in tempest.conf:

[auth]
# Allows test cases to create/destroy projects and users. This option
# requires that OpenStack Identity API admin credentials are known. If
# false, isolated test cases and parallel execution, can still be
# achieved configuring a list of test accounts (boolean value)
# Deprecated group/name - [auth]/allow_tenant_isolation
# Deprecated group/name - [compute]/allow_tenant_isolation
# Deprecated group/name - [orchestration]/allow_tenant_isolation
#use_dynamic_credentials = true
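In other words, the relevant tempest.conf lines would look something like this (a sketch combining the option above with the roles quoted in comment 27; exact values depend on the deployment):

```ini
[auth]
# Let tempest create/destroy its own projects and users per test
use_dynamic_credentials = true
tempest_roles = swiftoperator

[object-storage]
operator_role = swiftoperator
```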

Comment 29 Paul Grist 2017-10-13 14:15:08 UTC
Just a brief update for now until Keith can add more, but we are running some tests on OSP-11 to try to reproduce that behavior and help isolate the problems.

Comment 31 Keith Schincke 2017-11-07 13:16:56 UTC
I have been able to replicate this issue.
I did the following:
1) deployed OSP11 with cinder backup configured to use swift provided by radosgw.
2) created a volume as the overcloud admin  user
3) created 2 backups as the overcloud admin user
4) created an overcloud demo user and project and added them to the _member_ role
5) created a volume as the demo user
6) attempted a backup of the volume as the demo user.

The backup fails with the error "Container PUT failed: http://10.19.139.214:8080/swift/v1/volumebackups 409 Conflict   BucketAlreadyExists"

This looks like it may be a software issue to me.

Comment 32 Federico Lucifredi 2017-11-17 08:23:49 UTC
Arkady, the team is validating Swift with Tempest in the 2.x series (and now the 3.x series), but did not do so in the 1.3.x series.

Paul, I am planning 1.3.4 and if there is an impact there I will need to know shortly. For 3.1, 2.5 we have a few weeks before the planning window closes. 

Please let us know.

Comment 33 arkady kanevsky 2017-11-18 07:56:24 UTC
These are not tempest swift tests but tempest cinder tests. This is a NEW issue that is specific to running cinder backup against the swift API on Ceph.
This issue is not exposed by tempest swift on RGW. It is not specific to the Ceph version, as we see it for both ceph 1.x and 2.x.

Comment 34 Paul Grist 2017-11-27 20:34:44 UTC
This is an RGW bug that will need a fix for Dell OSP-11 certification (and future ones).  As arkady said, this is seen in the use case of Cinder tempest back up tests when using RGW instead of native swift.

Given the need is for OSP-11 and future, I do not think the fix is needed in the 1.x streams, but will look for Arkady to confirm.

2.x and 3.x fixes for this will be needed when they become available (next maint release after there is a fix).  Right now I believe the plan is to allow the exception for certification and release the fix when it's available.

thanks,
Paul

Comment 36 arkady kanevsky 2017-11-27 20:37:57 UTC
Concur that it should be fixed from Ceph 2.x and going forward.
No need to fix for 1.x

Comment 40 Matt Benjamin (redhat) 2017-12-20 21:58:44 UTC
*** Bug 1518875 has been marked as a duplicate of this bug. ***

Comment 66 Alan Bishop 2018-01-12 18:16:48 UTC
Keith's comment #31 gives me an idea for some tests I'll run that I hope will isolate which component is misbehaving.

Dave (Paterson), do you hang out on freenode? I looked for "dpaterson" but maybe you're just not online now.

Comment 67 Alan Bishop 2018-01-15 15:28:52 UTC
I gather the urgency is due to this BZ blocking OSP-11 certification of Dell
EMC's SC Series Cinder driver. This is unfortunate because Cinder backups to
Swift (whether actual Swift or RGW) have nothing to do with the SC Series
Cinder driver. In order to unblock the driver certification, I recommend
modifying the overcloud deployment to either not use Swift for Cinder backups,
or to deploy Swift and not the RGW (you can still deploy Ceph for RBD).

Comment 68 Alan Bishop 2018-01-15 15:30:35 UTC
I am able to reproduce the problem, and confirmed the issue appears to be
with the RGW's compatibility with Swift. In my tests, I was able to
demonstrate differences between the RGW and Swift using the OpenStack 'swift'
CLI tool, effectively removing both Cinder and Tempest from the equation.

There is a discrepancy between the way Swift and the RGW handle containers
with the same name, but in different OpenStack projects. I repeated the
following test using two OSP-11 deployments, one with Swift and one with RGW.

- Create container as user X in project X
- Verify the container is visible to user X in project X, but not to user Y in
  project Y (or even to user X in project Y)
- Create the same container (name) as user X in project Y

When Swift is deployed, user Y is able to create the container, and both users
"see" their own containers, which just happened to be named the same.
When RGW is deployed, user Y is unable to create the container.

Here is an example of the failure (RGW in OSP-11):

# Authenticate as admin/admin, create another project (demo) and add the
# admin user to the project. User X (admin) is a member of projects X and Y
# (admin and demo).
$ . overcloudrc.v3
$ openstack project create --domain default demo
$ openstack role add --project demo --user admin admin

# Create "test" container as admin/admin, and observe it isn't visible in demo
# project
$ swift post test
$ swift list
test
$ OS_PROJECT_NAME=demo swift list
<nothing>

# Attempt to create "test" container for demo project fails
# NOTE: This failure does not occur when Swift is deployed
$ OS_PROJECT_NAME=demo swift post test
Container POST failed: http://192.168.24.15:8080/swift/v1/test 401 Unauthorized   AccessDenied
Failed Transaction ID: tx000000000000000000007-005a5cbf40-5e48-default

# Verify same user is able to create other containers in project Y
$ OS_PROJECT_NAME=demo swift post test-2
$ OS_PROJECT_NAME=demo swift list                                                                                                
test-2
$ swift list
test

As noted, Swift does not exhibit this behavior. User X is able to create
separate but identically named containers in projects X and Y. I'd like an RGW
expert to comment.

OSP-11 is deploying ceph-radosgw-10.2.7-48.el7cp.x86_64, which I believe is
version 2.4.
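The per-project vs. global namespace difference above can be sketched with a small toy model. This is NOT real Swift or RGW code; the class and method names are invented purely for illustration of the behavior observed in the CLI test.

```python
# Toy model of the container-namespace difference described above:
# Swift scopes containers per project/account, while RGW (without
# "rgw keystone implicit tenants") keeps buckets in one flat,
# cluster-wide namespace.

class SwiftContainers:
    """Containers are scoped per project: (project, name) is the key."""
    def __init__(self):
        self._containers = set()

    def create(self, project, name):
        # Same name in a different project is a different container.
        self._containers.add((project, name))

    def list(self, project):
        return sorted(n for p, n in self._containers if p == project)


class RgwBuckets:
    """One flat bucket namespace; a name is owned by whoever made it."""
    def __init__(self):
        self._owners = {}  # bucket name -> owning project

    def create(self, project, name):
        owner = self._owners.get(name)
        if owner is not None and owner != project:
            # Another project already owns this bucket name.
            raise RuntimeError("409 Conflict BucketAlreadyExists")
        self._owners[name] = project


swift = SwiftContainers()
swift.create("admin", "test")
swift.create("demo", "test")      # succeeds: per-project namespace
print(swift.list("demo"))         # ['test']

rgw = RgwBuckets()
rgw.create("admin", "test")
try:
    rgw.create("demo", "test")    # collides: global bucket namespace
except RuntimeError as exc:
    print(exc)                    # 409 Conflict BucketAlreadyExists
```

This is exactly the pattern the tempest backup tests trip over: every dynamically created project tries to create the one container named in cinder.conf.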

Comment 71 Keith Schincke 2018-01-16 16:51:53 UTC
Alan and I were able to successfully complete our tests by adding "rgw keystone implicit tenants = true"

I have opened a launchpad bug for the puppet-ceph change. The new parameter will default to true, so no new tht changes will be needed. I will be submitting a GitHub issue for the ceph-ansible change later.
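For anyone applying the workaround by hand before a packaged fix, the setting goes in the RGW client section of ceph.conf on the gateway nodes. A sketch follows; the section name "client.rgw.controller-0" is a hypothetical example and varies per deployment:

```ini
[client.rgw.controller-0]
# Map each Keystone project to its own RGW tenant, giving
# Swift-like per-project container namespaces
rgw keystone implicit tenants = true
```

The RGW service must be restarted for the setting to take effect.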

Comment 72 Keith Schincke 2018-01-16 17:39:54 UTC
Submitted upstream openstack patch 534390.

Comment 74 Alan Bishop 2018-01-16 20:16:26 UTC
Updating the BZ now that we know it can be addressed in puppet-ceph, and reassigning to Keith because he posted the patch.

Comment 75 Keith Schincke 2018-01-19 13:08:18 UTC
Created attachment 1383396 [details]
Full test of gerrit patch 534390

This is the output of a two-user test making backups of two cinder volumes.
The patch has merged upstream.
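Based on the patch title ("Added rgw_keystone_implicit_tenants to ceph::rgw::keystone"), the puppet-ceph usage would look roughly like the following. This is a sketch: the resource title is an assumption for illustration, not taken from the patch.

```puppet
# Hypothetical manifest fragment enabling the new parameter
ceph::rgw::keystone { 'radosgw.gateway':
  rgw_keystone_implicit_tenants => true,
}
```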

Comment 77 Keith Schincke 2018-01-25 23:35:21 UTC
The following patches have been merged upstream:

(1) into the current master branch of puppet-ceph
(2) into the stable/jewel branch of puppet-ceph
(3) into the current master of tripleo-heat-templates. This should find its way into OSP13/queens.

I just cherry picked (4) into OSP12/pike. It needs to go through CI and workflow.

puppet-ceph only has the master and stable/jewel branches. I don't know if OSP11/ocata or OSP10/newton shipped with a branched version of puppet-ceph.

Keith


1: https://review.openstack.org/#/c/534390/ 
2: https://review.openstack.org/#/c/536956/
3: https://review.openstack.org/#/c/536901/
4: https://review.openstack.org/#/c/537330/

Comment 78 Keith Schincke 2018-01-31 12:20:31 UTC
Upstream patch 53609 has been merged.

The last patch, 537330, is in the final stages of the CI workflow.

Comment 80 Sean Merrow 2018-02-02 16:38:54 UTC
Dave P from Dell EMC has said that he'll reach out to Keith to get the packages and test.

Comment 82 Keith Schincke 2018-02-15 21:16:44 UTC
I don't know if the change is packaged yet, but I can provide clear instructions on how to obtain the relevant puppet module and version and add it to the overcloud as it is being deployed.

Keith

Comment 84 Mike Burns 2018-03-23 13:59:41 UTC
(In reply to Keith Schincke from comment #82)
> I don't know if the change is packaged yet, but I can provide clear
> instructions on how to obtain the relevant puppet module and version and add
> it to the overcloud as it is being deployed.
> 
> Keith

This is not in any osp packages yet that I can see. Can you work with slinaber/eggs to get this into the next osp11 update?

