Bug 1688378

Summary:	ops waiting for resharding to complete may not be able to complete when resharding does complete
Product:	[Red Hat Storage] Red Hat Ceph Storage	Reporter:	J. Eric Ivancich <ivancich>
Component:	RGW	Assignee:	J. Eric Ivancich <ivancich>
Status:	CLOSED ERRATA	QA Contact:	ceph-qe-bugs <ceph-qe-bugs>
Severity:	medium	Docs Contact:	John Brier <jbrier>
Priority:	low
Version:	3.2	CC:	agunn, anharris, cbodley, ceph-eng-bugs, ceph-qe-bugs, jbrier, kbader, mbenjamin, sweil, tserlin, vumrao
Target Milestone:	z2
Target Release:	3.2
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	RHEL: ceph-12.2.8-106.el7cp Ubuntu: ceph_12.2.8-91redhat1	Doc Type:	Bug Fix
Doc Text:	.Operations waiting for resharding to complete are able to complete after resharding Previously, when using dynamic resharding, some operations that were waiting to complete after resharding failed to complete. This was due to code changes to the Ceph Object Gateway when automatically cleaning up no longer used bucket index shards. While this reduced storage demands and eliminated the need for manual clean up, the process removed one source of an identifier needed for operations to complete after resharding. The code has been updated so that identifier is retrieved from a different source after resharding and operations requiring it can now complete.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-04-30 15:57:07 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1629656

Description J. Eric Ivancich 2019-03-13 16:11:55 UTC

Comment 1 J. Eric Ivancich 2019-03-13 16:18:54 UTC

Description of problem: ops waiting for reshard to complete will fail when resharding successfully completes


Version-Release number of selected component (if applicable):


How reproducible:

Reproduced twice by Thomas Serlin (tserlin). With dynamic resharding turned off, it did not reproduce.


Steps to Reproduce:
1. Set up a cluster with dynamic resharding turned on
2. Use the Veeam backup utility to write a backup to the Ceph cluster
3. After about 31G of data is sent, a reshard will initiate and one of the ops will fail.

Actual results:

The op fails

Expected results:

The op succeeds

Additional info:

This is likely a result of a previous improvement in which old bucket index data is removed once resharding completes.

Comment 5 J. Eric Ivancich 2019-04-01 20:04:38 UTC

I tested the bug fix in the following manner:

1. Create test bucket

2. Create 7 jobs that do the following in parallel:
    a. upload a file of around 256KB to the test bucket
    b. go back to a. Use a counter and a unique tag per job so object names do not collide.

3. Do reshards repeatedly
    a. reshard bucket to a higher shard number
    b. wait for 5 seconds
    c. go back to a.

4. When examining the rgw log there should be no requests with a return status of either 500 or 404.

Without the bug fix, when I ran the above for 5 minutes, with each reshard increasing the number of shards by 50%, I could very easily induce the error condition.
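
The bookkeeping parts of the test procedure above can be sketched in Python. This is a rough illustration under stated assumptions, not the script that was actually used: the helper names (object_names, shard_schedule, reshard_command, failed_requests) are hypothetical, the 50% growth is rounded up, and the regular expression for the HTTP status in the rgw log is an assumed pattern that may need adjusting to the real log format. The reshard command is only built as an argument list and is not executed here.

```python
import itertools
import math
import re


def object_names(job_tag):
    """Yield object names unique to one job: a per-job tag plus a counter."""
    for counter in itertools.count():
        yield f"{job_tag}-{counter:08d}"


def shard_schedule(start, rounds):
    """Return shard counts for successive reshards, each about 50% larger."""
    counts = []
    current = start
    for _ in range(rounds):
        current = math.ceil(current * 1.5)  # rounding up is an assumption
        counts.append(current)
    return counts


def reshard_command(bucket, num_shards):
    """Build the argv for a manual reshard (built only, never run here)."""
    return ["radosgw-admin", "bucket", "reshard",
            f"--bucket={bucket}", f"--num-shards={num_shards}"]


# Assumed log pattern; adjust to whatever the rgw log actually emits.
STATUS_RE = re.compile(r"http[_ ]status=(\d{3})")


def failed_requests(log_lines):
    """Return the log lines whose HTTP status is 500 or 404."""
    bad = []
    for line in log_lines:
        match = STATUS_RE.search(line)
        if match and match.group(1) in ("500", "404"):
            bad.append(line)
    return bad
```

In a real run, each of the 7 upload jobs would draw names from its own object_names generator, a separate loop would issue the reshard command for each value from shard_schedule with a 5 second pause in between, and failed_requests would be applied to the rgw log afterwards; an empty result corresponds to step 4 passing.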

Comment 14 errata-xmlrpc 2019-04-30 15:57:07 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:0911