Description of problem:

Tempest volume backup tests fail when using RGW as the backup target.

Version-Release number of selected component (if applicable):

How reproducible:
Every time; this is causing our cinder driver certification tests to fail.

Steps to Reproduce:
1. Set up the cinder backup service to use swift as the target for backups, where RGW is a drop-in replacement for swift.
   Note: cinder.conf has backup_swift_container = testbackup1
2. Run the tempest volume backup tests.

Actual results:
All but the first test that uses the cinder backup service fail; the exception logged is 409 BucketExists. This error is returned by RGW when you try to create a bucket that already exists.

Expected results:
The first tempest test that uses the cinder backup service should create the container specified in cinder.conf, and the test should pass. The rest of the backup tests should then reuse the existing container and pass, instead of failing on an attempt to create a duplicate container.

Additional info:
cinder.conf section that applies to swift as the backup target:

# The URL of the Swift endpoint (uri value)
#backup_swift_url = <None>

# The URL of the Keystone endpoint (uri value)
#backup_swift_auth_url = <None>

# Info to match when looking for swift in the service catalog. Format is:
# separated values of the form: <service_type>:<service_name>:<endpoint_type> -
# Only used if backup_swift_url is unset (string value)
#swift_catalog_info = object-store:swift:publicURL

# Info to match when looking for keystone in the service catalog. Format is:
# separated values of the form: <service_type>:<service_name>:<endpoint_type> -
# Only used if backup_swift_auth_url is unset (string value)
#keystone_catalog_info = identity:Identity Service:publicURL

# Swift authentication mechanism (string value)
#backup_swift_auth = per_user

# Swift authentication version. Specify "1" for auth 1.0, or "2" for auth 2.0
# or "3" for auth 3.0 (string value)
#backup_swift_auth_version = 1

# Swift tenant/account name. Required when connecting to an auth 2.0 system
# (string value)
#backup_swift_tenant = <None>

# Swift user domain name. Required when connecting to an auth 3.0 system
# (string value)
#backup_swift_user_domain = <None>

# Swift project domain name. Required when connecting to an auth 3.0 system
# (string value)
#backup_swift_project_domain = <None>

# Swift project/account name. Required when connecting to an auth 3.0 system
# (string value)
#backup_swift_project = <None>

# Swift user name (string value)
#backup_swift_user = <None>

# Swift key for authentication (string value)
#backup_swift_key = <None>

# The default Swift container to use (string value)
backup_swift_container = testbackup1

# The size in bytes of Swift backup objects (integer value)
#backup_swift_object_size = 52428800

# The size in bytes that changes are tracked for incremental backups.
# backup_swift_object_size has to be multiple of backup_swift_block_size.
# (integer value)
#backup_swift_block_size = 32768

# The number of retries to make for Swift operations (integer value)
#backup_swift_retry_attempts = 3

# The backoff time in seconds between Swift retries (integer value)
#backup_swift_retry_backoff = 2

# Enable or Disable the timer to send the periodic progress notifications to
# Ceilometer when backing up the volume to the Swift backend storage. The
# default value is True to enable the timer. (boolean value)
#backup_swift_enable_progress_timer = true

# Location of the CA certificate file to use for swift client requests.
# (string value)
#backup_swift_ca_cert_file = <None>

# Bypass verification of server certificate when making SSL connection to
# Swift. (boolean value)
#backup_swift_auth_insecure = false

# Driver to use for backups. (string value)
#backup_driver = cinder.backup.drivers.swift
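For reference, one way to run just the volume backup tests is sketched below. This assumes a stock tempest installation; the exact test regex and runner used by the certification suite may differ:

# Run only the cinder volume backup tests (regex is illustrative)
$ tempest run --regex 'VolumesBackups'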
This functionality was addressed in: https://bugzilla.redhat.com/show_bug.cgi?id=1262104. But I do not believe it was tested past creating a single backup in a single container. It should pass a full tempest run.
Unable to make a public read/write container in rgw.
Can you see if this tempest test is running in our RH downstream testing on OSP11? Trying to understand if this is a general problem on OSP11 or something specific to certification.
Federico, when was the last time the Ceph team checked that the current or a previous Ceph version passes the swift Tempest tests? Was it done for OSP11 tempest? Thanks, Arkady
Hi David, Engineering is looking for the following information. Can you please attach it to the BZ?
[1.] Please provide the traceback and the names of the passed and failed tempest cinder backup tests.
[2.] Attach the tempest.conf to the bug.
[3.] Also please make sure that the cinder backup service is up and running.
Thanks, Sean
The backup service was definitely running. I no longer have the tempest.conf, unfortunately; attached is the JSON with the failures.
Created attachment 1320077 [details] Cinder Backup Test Failures
Attaching tempest.conf from new deployment, tests results are identical.
Created attachment 1320987 [details] tempest.conf from certification environment
(In reply to Paul Grist from comment #13)
> I checked with dsariel and our default cinder back up tests are not
> configured for rgw.
>
> yogev, do you know if this is a ceph test case or who to ping from ceph QA
> to answer arkady's question in #c7?

We can ask Benny Kopilov or Liron Kuchlani about it. Both of them are working in the storage DFG and on Tempest.
What is the progress on this BZ?
I have deployed an OSP11/ocata overcloud on physical hardware and been able to create a cinder volume and make a backup of the volume.

1) What backend is cinder configured to use?
2) What size of volume is used for the backup test?
3) From a controller, what is the output of:
   - rados lspools
   - rados ls -p images
   - rados ls -p default.rgw.buckets.data
4) What actions does tempest attempt when it is doing the cinder backup tests?
5) What user and role does tempest use for its tests?
re-assigning to ceph-rgw team
1) What backend is cinder configured to use?

When you say backend, are you referring to the block storage driver or the backup target? This behavior happens when using the Dell PS, Dell SC, or Ceph block storage drivers for cinder. Cinder is configured to use the swift backup driver, which is the default:

# Driver to use for backups. (string value)
#backup_driver = cinder.backup.drivers.swift

2) What size of volume is used for the backup test?

volume_size = 1, the default

3) From a controller, what is the output of:

- rados lspools

rbd
.rgw.root
default.rgw.control
default.rgw.data.root
default.rgw.gc
default.rgw.log
.rgw.buckets
backups
images
manila_data
manila_metadata
metrics
vms
volumes
default.rgw.users.uid
default.rgw.buckets.index
default.rgw.buckets.data

- rados ls -p images

rbd_object_map.1236ce73f9df.0000000000000005
rbd_header.1235742cc55c8
rbd_object_map.1236ce73f9df
rbd_data.1236ce73f9df.0000000000000001
rbd_data.1235742cc55c8.0000000000000001
rbd_directory
rbd_header.1236ce73f9df
rbd_data.1236ce73f9df.0000000000000000
rbd_id.e8e914db-272b-4261-8933-d70891acabac
rbd_id.e6678fea-58b5-4137-9e9b-70df63524e6e
rbd_object_map.1235742cc55c8.0000000000000004
rbd_data.1235742cc55c8.0000000000000000
rbd_object_map.1235742cc55c8

- rados ls -p default.rgw.buckets.data

returns nothing

4) What actions does tempest attempt when it is doing the cinder backup tests?

From the description, step 2: run the tempest volume backup tests.

Actual results: All but the first test that uses the cinder backup service fail; the exception logged is 409 BucketExists. This error is returned by RGW when you try to create a bucket that already exists.

Expected results: The first tempest test that uses the cinder backup service should create the container specified in cinder.conf, and the remaining backup tests should reuse the existing container and pass, instead of failing on an attempt to create a duplicate container.

5) What user and role does tempest use for its tests?

[auth]
tempest_roles = swiftoperator

[object-storage]
operator_role = swiftoperator
Keith, have you tried to create more than one backup? In my experience the first backup is created fine because the container (bucket) does not exist, so it is created and used. Any future backup trying to use the same container from a different project (tenant) fails.

It's important to note that tempest dynamically creates projects and users, and all backup tests try to use the same container. Could you please try running the tempest backup tests in your environment with tempest set to use dynamic credentials in tempest.conf:

[auth]
# Allows test cases to create/destroy projects and users. This option
# requires that OpenStack Identity API admin credentials are known. If
# false, isolated test cases and parallel execution, can still be
# achieved configuring a list of test accounts (boolean value)
# Deprecated group/name - [auth]/allow_tenant_isolation
# Deprecated group/name - [compute]/allow_tenant_isolation
# Deprecated group/name - [orchestration]/allow_tenant_isolation
#use_dynamic_credentials = true
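For what it's worth, dynamic credentials are the default; making the setting explicit is just a matter of uncommenting that option in tempest.conf:

[auth]
use_dynamic_credentials = true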
Just a brief update for now until Keith can add more, but we are running some tests on OSP-11 to try to reproduce that behavior and help isolate the problems.
I have been able to replicate this issue. I did the following:

1) Deployed OSP11 with cinder backup configured to use swift provided by radosgw.
2) Created a volume as the overcloud admin user.
3) Created 2 backups as the overcloud admin user.
4) Created an overcloud demo user and project and added the user to the _member_ role.
5) Created a volume as the demo user.
6) Attempted a backup of the volume as the demo user. This fails with the error "Container PUT failed: http://10.19.139.214:8080/swift/v1/volumebackups 409 Conflict BucketAlreadyExists"

This looks like it may be a software issue to me.
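For reference, a rough sketch of the equivalent CLI steps (the rc file names, volume names, and sizes here are illustrative; the target container is whatever backup_swift_container is set to in cinder.conf):

# As the overcloud admin user
$ . overcloudrc
$ openstack volume create --size 1 admin-vol
$ openstack volume backup create admin-vol   # first backup creates the container and succeeds
$ openstack volume backup create admin-vol   # second backup in the same project also succeeds

# As the demo user in the demo project
$ . demorc
$ openstack volume create --size 1 demo-vol
$ openstack volume backup create demo-vol    # fails: RGW returns 409 BucketAlreadyExists for the shared container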
Arkady, the team is validating Swift with Tempest in the 2.x series (and now the 3.x series), but did not do so in the 1.3.x series. Paul, I am planning 1.3.4 and if there is an impact there I will need to know shortly. For 3.1, 2.5 we have a few weeks before the planning window closes. Please let us know.
These are not the tempest swift tests but the tempest cinder tests. This is a NEW issue that is specific to running cinder backup against the swift API on Ceph. This issue is not exposed by the tempest swift tests on RGW. It is not specific to a ceph version, as we see it with both ceph 1.x and 2.x.
This is an RGW bug that will need a fix for the Dell OSP-11 certification (and future ones). As Arkady said, this is seen in the use case of the cinder tempest backup tests when using RGW instead of native swift. Given the need is for OSP-11 and future releases, I do not think the fix is needed in the 1.x streams, but will look for Arkady to confirm. 2.x and 3.x fixes for this will be needed when they become available (next maint release after there is a fix). Right now I believe the plan is to allow the exception for certification and release the fix when it's available. Thanks, Paul
Concur that it should be fixed from Ceph 2.x and going forward. No need to fix for 1.x
*** Bug 1518875 has been marked as a duplicate of this bug. ***
Keith's comment #31 gives me an idea for some tests I'll run that I hope will isolate which component is misbehaving. Dave (Paterson), do you hang out on freenode? I looked for "dpaterson" but maybe you're just not online now.
I gather the urgency is due to this BZ blocking OSP-11 certification of Dell EMC's SC Series Cinder driver. This is unfortunate because Cinder backups to a Swift endpoint (whether actual Swift or RGW) have nothing to do with the SC Series Cinder driver. In order to unblock the driver certification, I recommend modifying the overcloud deployment to either not use Swift for Cinder backups, or to deploy Swift and not RGW (you can still deploy Ceph for RBD).
I am able to reproduce the problem, and confirmed the issue appears to be with the RGW's compatibility with Swift. In my tests, I was able to demonstrate differences between the RGW and Swift using the OpenStack 'swift' CLI tool, effectively removing both Cinder and Tempest from the equation.

There is a discrepancy between the way Swift and the RGW handle containers with the same name but in different OpenStack projects. I repeated the following test using two OSP-11 deployments, one with Swift and one with RGW.

- Create a container as user X in project X
- Verify the container is visible to user X but not to user Y in project Y (or even to user X in project Y)
- Create the same container (name) as the user in project Y

When Swift is deployed, user Y is able to create the container, and both users "see" their own containers, which just happen to be named the same. When RGW is deployed, user Y is unable to create the container.

Here is an example of the failure (RGW in OSP-11):

# Authenticate as admin/admin, create another project (demo) and add the
# admin user to the project. User X (admin) is a member of projects X and Y
# (admin and demo).
$ . overcloudrc.v3
$ openstack project create --domain default demo
$ openstack role add --project demo --user admin admin

# Create "test" container as admin/admin, and observe it isn't visible in demo
# project
$ swift post test
$ swift list
test
$ OS_PROJECT_NAME=demo swift list
<nothing>

# Attempt to create "test" container for demo project fails
# NOTE: This failure does not occur when Swift is deployed
$ OS_PROJECT_NAME=demo swift post test
Container POST failed: http://192.168.24.15:8080/swift/v1/test 401 Unauthorized AccessDenied
Failed Transaction ID: tx000000000000000000007-005a5cbf40-5e48-default

# Verify the same user is able to create other containers in project Y
$ OS_PROJECT_NAME=demo swift post test-2
$ OS_PROJECT_NAME=demo swift list
test-2
$ swift list
test

As noted, Swift does not exhibit this behavior. User X is able to create separate but identically named containers in projects X and Y. I'd like an RGW expert to comment. OSP-11 is deploying ceph-radosgw-10.2.7-48.el7cp.x86_64, which I believe is version 2.4.
Alan and I were able to successfully complete our tests by adding "rgw keystone implicit tenants = true". I have opened a launchpad bug for the puppet-ceph change. The new parameter will default to true, so no new tht changes will be needed. I will be submitting a GitHub issue for the ceph-ansible change later.
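For anyone needing a manual workaround before the puppet-ceph change is packaged, the option goes in the RGW client section of ceph.conf on the controller(s). The section name below is illustrative and depends on how the RGW instance is named in your deployment; the radosgw service needs to be restarted afterwards:

[client.radosgw.gateway]
rgw keystone implicit tenants = true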
Submitted upstream openstack patch 534390.
Updating the BZ now that we know it can be addressed in puppet-ceph, and reassigning to Keith because he posted the patch.
Created attachment 1383396 [details] Full test of gerrit patch 534390

This is the output of a two-user test making backups of two cinder volumes. The patch has merged upstream.
The following patches have been merged upstream:
(1) the current master branch of puppet-ceph
(2) the stable/jewel branch of puppet-ceph
(3) the current master of tripleo-heat-templates; this should find its way into OSP13/queens

I just cherry picked (4) into OSP12/pike. It needs to go through CI and workflow. puppet-ceph only has the master and stable/jewel branches. I don't know if OSP11/ocata or OSP10/newton shipped with a branched version of puppet-ceph.

Keith

1: https://review.openstack.org/#/c/534390/
2: https://review.openstack.org/#/c/536956/
3: https://review.openstack.org/#/c/536901/
4: https://review.openstack.org/#/c/537330/
Upstream patch 53609 has been merged. The last patch, 536330, is in the final stages of the CI workflow.
Dave P from Dell EMC has said that he'll reach out to Keith to get the packages and test.
I don't know if the change is packaged yet, but I can provide clear instructions on how to obtain the relevant puppet module and version and add it to the overcloud as it is being deployed.

Keith
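In case it is useful in the meantime, a rough sketch of one way to pull in the patched module on the undercloud before it is packaged (this assumes the upload-puppet-modules helper from tripleo-common is available; the clone path and branch are illustrative):

# Fetch the patched puppet-ceph (the fix is on master and stable/jewel)
$ git clone -b stable/jewel https://git.openstack.org/openstack/puppet-ceph /tmp/puppet-modules/ceph
# Make the module available to the overcloud via the deploy artifacts mechanism
$ upload-puppet-modules -d /tmp/puppet-modules
# Then re-run the usual overcloud deploy command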
(In reply to Keith Schincke from comment #82)
> I don't know if the change is packaged yet, but I can provide clear
> instructions on how to obtain the relevant puppet module and version and
> add it to the overcloud as it is being deployed.
>
> Keith

This is not in any OSP packages yet that I can see. Can you work with slinaber/eggs to get this into the next OSP11 update?