Bug 1491007

Summary: [RFE] Implement a checking mechanism to validate the PG-to-OSD count before allowing PG increases
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: RADOS
Version: 2.3
Hardware: All
OS: Linux
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Keywords: FutureFeature
Target Milestone: rc
Target Release: 3.0
Reporter: Mike Hackett <mhackett>
Assignee: Josh Durgin <jdurgin>
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
CC: ceph-eng-bugs, dzafman, hnallurv, kchai, kdreyer, linuxkidd, skinjo, sweil, vumrao
Last Closed: 2017-10-16 15:30:00 UTC
Type: Bug

Description Mike Hackett 2017-09-12 18:10:45 UTC
Description of problem:

We have had customers encounter issues caused by a large number of PGs per OSD.
In the latest incident an OSD had a PG count of 19k, which eventually left the cluster unable to recover.

We are looking for a mechanism that checks the PG-to-OSD ratio before allowing a user to create further PGs in a cluster, so that the PG count cannot exceed a safe limit for cluster recovery.
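
A minimal sketch of such a check, written as an external admin-side script rather than an existing Ceph feature; the 200 PGs-per-OSD limit, the helper names, and the JSON field names are illustrative assumptions:

    #!/usr/bin/env python
    # Hedged sketch: estimate how many PG replicas each OSD would carry after a
    # pg_num increase and refuse the change if the projected ratio is unsafe.
    import json
    import subprocess

    MAX_PG_PER_OSD = 200  # assumed safety limit, tune per cluster

    def ceph_json(*args):
        """Run a ceph CLI command and parse its JSON output."""
        out = subprocess.check_output(("ceph",) + args + ("-f", "json"))
        return json.loads(out)

    def projected_pgs_per_osd(pool, new_pg_num):
        """Projected average PG replicas per OSD if `pool` grows to new_pg_num."""
        num_osds = len(ceph_json("osd", "ls"))            # JSON list of OSD ids
        pools = ceph_json("osd", "pool", "ls", "detail")  # field names assumed
        total = 0
        for p in pools:
            pg_num = new_pg_num if p["pool_name"] == pool else p["pg_num"]
            total += pg_num * p["size"]  # count every replica against an OSD
        return float(total) / num_osds

    def set_pg_num_checked(pool, new_pg_num):
        """Refuse the pg_num increase when it would exceed MAX_PG_PER_OSD."""
        ratio = projected_pgs_per_osd(pool, new_pg_num)
        if ratio > MAX_PG_PER_OSD:
            raise SystemExit("refusing pg_num=%d for pool %s: projected %.0f PGs "
                             "per OSD exceeds %d"
                             % (new_pg_num, pool, ratio, MAX_PG_PER_OSD))
        subprocess.check_call(["ceph", "osd", "pool", "set", pool,
                               "pg_num", str(new_pg_num)])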


Version-Release number of selected component (if applicable):
2.4


Steps to Reproduce:
1. Ceph cluster with 4 OSD nodes.
2. Create a pool with a large number of PGs, 5k-7k (a sketch scripting steps 2-3 follows this list).
3. Put data on the cluster.
4. Remove an OSD node from the cluster so its PGs must rebalance onto the remaining nodes.
5. Add the node back into the cluster and force a rebalance.
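
A scripted sketch of steps 2-3, assuming the pre-Nautilus `ceph osd pool create <pool> <pg_num> <pgp_num>` syntax; the pool name, pg_num, and bench duration are illustrative:

    #!/usr/bin/env python
    # Hypothetical reproduction helper for steps 2-3 only.
    import subprocess

    POOL = "pgtest"   # hypothetical pool name
    PG_NUM = 6144     # within the 5k-7k range from step 2

    # Step 2: create a replicated pool with a deliberately high pg_num/pgp_num.
    subprocess.check_call(["ceph", "osd", "pool", "create", POOL,
                           str(PG_NUM), str(PG_NUM)])

    # Step 3: put data on the cluster (60 seconds of default 4 MB object writes).
    subprocess.check_call(["rados", "bench", "-p", POOL, "60", "write"])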

Actual results:
Creating the pool with a large number of PGs succeeds without warning.

Expected results:
We should block creating a large number of PGs in a pool when the cluster has only a limited number of OSDs.

Additional info:
BU MOC encountered this issue, which rendered the cluster completely down.
Gamestream also encountered this issue in the past.

Comment 8 Josh Durgin 2017-10-16 15:20:31 UTC
*** Bug 1489064 has been marked as a duplicate of this bug. ***

Comment 9 Josh Durgin 2017-10-16 15:30:00 UTC

*** This bug has been marked as a duplicate of bug 1489064 ***