Bug 1611763

Summary: RGW Dynamic bucket index resharding keeps resharding same buckets
Product: Red Hat Ceph Storage Reporter: Scoots Hamilton <schamilt>
Component: RGW    Assignee: J. Eric Ivancich <ivancich>
Status: CLOSED ERRATA QA Contact: vidushi <vimishra>
Severity: medium Docs Contact:
Priority: high    
Version: 3.0    CC: assingh, cbodley, ceph-eng-bugs, ceph-qe-bugs, dzafman, hnallurv, ivancich, kbader, kchai, mbenjamin, mhackett, mmanjuna, nojha, pasik, sweil, tchandra, tserlin, vimishra, vumrao
Target Milestone: rc   
Target Release: 3.2   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: RHEL: ceph-12.2.8-26.el7cp Ubuntu: ceph_12.2.8-25redhat1 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-01-03 19:01:46 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Description Flags
Bucket List Pre/Post shard none

Description Scoots Hamilton 2018-08-02 16:52:52 UTC
Created attachment 1472783 [details]
Bucket List Pre/Post  shard

Description of problem:

Our customer opened a case noting an increase in IOPS after they enabled the dynamic bucket index resharding feature in their environment.

The issue appears to be the same as the upstream luminous tracker "Resharding hangs with versioning-enabled buckets" and others in upstream trackers:


Version-Release number of selected component (if applicable):

How reproducible:


We had the customer capture the bucket list before and after resharding, and the evidence shows that the same buckets are being flagged for resharding over and over again.

Steps to Reproduce:
1. The customer allows the dynamic resharding process to run.
2. The bucket list is checked pre-run.
3. The bucket list is checked post-run.
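The pre/post check above can be performed with radosgw-admin. A minimal sketch (the exact output fields vary by release, and these snapshots must be taken against a live cluster):

```shell
# Snapshot the reshard queue before letting dynamic resharding run
radosgw-admin reshard list > reshard-pre.json

# ... allow the dynamic resharding process to run ...

# Snapshot again and compare; identical entries for the same buckets
# indicate those buckets are being queued repeatedly
radosgw-admin reshard list > reshard-post.json
diff reshard-pre.json reshard-post.json

# Per-bucket object and shard counts, to confirm whether the shard
# count actually changed after resharding
radosgw-admin bucket limit check
```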

Actual results:
The pre- and post-run bucket lists are identical in every respect.

Expected results:
The buckets that were queued to be resharded should not appear in the list again with info identical to when they were first flagged.

Additional info:

Comment 16 J. Eric Ivancich 2018-10-03 20:51:16 UTC
The upstream PR is here, currently marked DNM (do not merge) while being cleaned up:


Comment 18 J. Eric Ivancich 2018-10-31 19:34:28 UTC
Pushed to ceph-3.2-rhel-patches.

Comment 21 J. Eric Ivancich 2018-10-31 20:31:10 UTC
Here are some of the tests I used to verify the cases. All tests are based on inserting code at the top of the innermost loop in RGWBucketReshard::do_reshard. In each case we're also testing resharding on a bucket with more than 30 objects and with rgw_reshard_bucket_lock_duration set to 30.
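For reference, the lock duration used in these tests is a ceph.conf option. A sketch of the setting (the client section name depends on your RGW instance name, so it is a placeholder):

```ini
[client.rgw.gateway-node1]
# Shorten the reshard bucket lock so the renewal and expiry paths
# are exercised within a short test run (default is much longer)
rgw reshard bucket lock duration = 30
```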

Test 1: Insert sleep(1); -- this tests whether renewing the lock works when the resharding is taking somewhat longer than expected.

Test 2: Insert sleep(32); -- this tests proper recovery when we're unable to renew the lock before it expires.

Test 3: Insert static int i = 0; if (++i > 10) exit(1); -- this simulates a crash of the radosgw process, leaving things in a non-cleaned-up state. I then restart everything and make sure I can read and write the bucket index (e.g., list the bucket, remove an object from the bucket). Furthermore, I checked the radosgw log file to confirm it includes "apparently successfully cleared resharding flags for bucket...".

Comment 25 Manjunatha 2018-12-26 17:46:05 UTC
*** Bug 1644212 has been marked as a duplicate of this bug. ***

Comment 27 errata-xmlrpc 2019-01-03 19:01:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.