Bug 2208269 - [RHCS Tracker] After add capacity the rebalance does not complete, and we see 2 PGs in active+clean+scrubbing and 1 active+clean+scrubbing+deep
Summary: [RHCS Tracker] After add capacity the rebalance does not complete, and we see 2 PGs in active+clean+scrubbing and 1 active+clean+scrubbing+deep
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.13.0
Assignee: Aishwarya Mathuria
QA Contact: Petr Balogh
URL:
Whiteboard:
Depends On: 2209375
Blocks:
 
Reported: 2023-05-18 12:31 UTC by Petr Balogh
Modified: 2023-08-14 05:45 UTC
CC List: 9 users

Fixed In Version: 4.13.0-214
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2209375
Environment:
Last Closed: 2023-06-21 15:25:37 UTC
Embargoed:




Links
System                      ID                Last Updated
Ceph Project Bug Tracker    61313             2023-05-23 06:08:56 UTC
Red Hat Product Errata      RHBA-2023:3742    2023-06-21 15:25:48 UTC

Description Petr Balogh 2023-05-18 12:31:19 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

After the add capacity test we see that the Ceph cluster gets stuck in this state:

    pgs:     166 active+clean
             2   active+clean+scrubbing
             1   active+clean+scrubbing+deep
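
For reference, one way to see which PGs are stuck in a scrubbing state (a sketch using standard Ceph commands; exact output columns may differ between releases):

    # overall cluster and PG summary
    ceph status
    # list only the PGs whose state includes "scrubbing"
    ceph pg dump pgs_brief | grep scrubbing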


Version of all relevant components (if applicable):
I've noticed that it started happening with build
quay.io/rhceph-dev/ocs-registry:4.13.0-198.
With quay.io/rhceph-dev/ocs-registry:4.13.0-197 it worked well.

But from build -198 onward it's reproducible in every run.

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, it's blocking our testing framework, as the cluster is not in a good state.

Is there any workaround available to the best of your knowledge?
NO

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes - from build -198

Can this issue be reproduced from the UI?
not relevant

If this is a regression, please provide more details to justify this:
Yes

Steps to Reproduce:
1. Perform add capacity
2. Observe that the rebalance does not finish within the timeout we used to have


Actual results:
PGs:
             2   active+clean+scrubbing
             1   active+clean+scrubbing+deep

Expected results:
All PGs reach active+clean.
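
For completeness, the kind of wait loop a test could use to confirm that state (a sketch, not the actual ocs-ci check; the ~30 minute timeout and the state patterns are illustrative):

    # poll until `ceph pg stat` no longer reports scrub/backfill/recovery activity
    for i in $(seq 1 180); do
        ceph pg stat | grep -Eq 'scrubbing|backfill|recover' || break
        sleep 10
    done
    # final state; expect something like "169 pgs: 169 active+clean"
    ceph pg stat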

Additional info:

Comment 9 Aishwarya Mathuria 2023-05-22 08:24:55 UTC
Hello Petr,
We have been working on a fix, currently being tested and reviewed, that reduces osd_scrub_cost in order to speed up scrubs with mClock.
The new scrub cost would be 102400 (osd_scrub_chunk_max (25) * 4 KiB).

You can change osd_scrub_cost using the following command:
ceph config set osd 102400

To check osd_scrub_cost after modifying it: 
ceph config show osd.1 osd_scrub_cost

Regards,
Aishwarya

Comment 12 Aishwarya Mathuria 2023-05-22 16:13:04 UTC
Sorry about that! 

ceph config set osd osd_scrub_cost 102400

This should work fine.
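
Putting comments 9 and 12 together, the full sequence would look roughly like this (a sketch; 102400 is the reduced cost mentioned above, i.e. osd_scrub_chunk_max (25) * 4 KiB = 25 * 4096 = 102400):

    # lower the mClock cost charged per scrub operation for all OSDs
    ceph config set osd osd_scrub_cost 102400
    # confirm the value an individual OSD is running with
    ceph config show osd.1 osd_scrub_cost
    # or read the value stored in the central config database
    ceph config get osd osd_scrub_cost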

Comment 14 Mudit Agarwal 2023-05-23 05:51:39 UTC
Aishwarya, do we have a Ceph BZ or tracker where the fix you are working on can be tracked by ODF?

Comment 17 Aishwarya Mathuria 2023-05-23 06:00:01 UTC
Hi Mudit, 
The fix is being tracked here: https://tracker.ceph.com/issues/61313; it is currently under review.

Comment 25 errata-xmlrpc 2023-06-21 15:25:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742

