Bug 1491007 - [RFE] Implement a checking mechanism to validate PG count to OSD's before allowing PG increases
Summary: [RFE] Implement a checking mechanism to validate PG count to OSD's before allowing PG increases
Keywords:
Status: CLOSED DUPLICATE of bug 1489064
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 2.3
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 3.0
Assignee: Josh Durgin
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-09-12 18:10 UTC by Mike Hackett
Modified: 2020-12-14 10:00 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-10-16 15:30:00 UTC
Embargoed:




Links
Github ceph/ceph pull 17814 (last updated 2017-10-09 23:01:08 UTC)

Description Mike Hackett 2017-09-12 18:10:45 UTC
Description of problem:

We have had customers encounter issues caused by a large number of PGs per OSD.
In the most recent incident an OSD carried roughly 19,000 PGs, which eventually left the cluster unable to recover.

We are looking for a mechanism that checks the PG-to-OSD ratio before allowing a user to create further PGs in a cluster, so that the PG count cannot be pushed past a safe limit for cluster recovery. A rough sketch of such a check follows.
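The sketch below is not the implementation from the linked pull request; it is a minimal illustration of the kind of guard being requested, assuming a configurable ceiling on PGs per OSD (Luminous-based releases ship a similar mon_max_pg_per_osd option). The function name, exception name, default limit of 200, and example numbers are illustrative assumptions only.

class PgLimitError(Exception):
    """Raised when a requested PG change would exceed the safe PG-per-OSD ratio."""


def check_pg_increase(current_pg_replicas, requested_pgs, pool_size,
                      num_in_osds, max_pg_per_osd=200):
    """Refuse a pool create / pg_num increase that would overload the OSDs.

    current_pg_replicas -- sum of pg_num * size over all existing pools
    requested_pgs       -- additional pg_num being requested
    pool_size           -- replica count (or k+m for EC) of the target pool
    num_in_osds         -- number of OSDs currently 'in' the cluster
    max_pg_per_osd      -- illustrative ceiling, not a Ceph default
    """
    if num_in_osds <= 0:
        raise PgLimitError("no OSDs are in the cluster")

    projected = current_pg_replicas + requested_pgs * pool_size
    per_osd = projected / num_in_osds
    if per_osd > max_pg_per_osd:
        raise PgLimitError(
            "projected %.0f PGs per OSD exceeds the limit of %d; "
            "add OSDs or request fewer PGs" % (per_osd, max_pg_per_osd))
    return per_osd


if __name__ == "__main__":
    # Example: a 12-OSD cluster already holding 1024 PG replicas is asked
    # for a new size-3 pool with 4096 PGs -- the request is refused.
    try:
        check_pg_increase(1024, 4096, pool_size=3, num_in_osds=12)
    except PgLimitError as err:
        print("refused:", err)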


Version-Release number of selected component (if applicable):
2.4


Steps to Reproduce:
1. Ceph cluster with 4 OSD nodes.
2. Create a pool with a large number of PGs on it (5k-7k).
3. Put data on the cluster.
4. Remove an OSD node from the cluster so the PGs must rebalance to another node in the cluster.
5. Add the node back into the cluster and force a rebalance.

Actual results:
Creating the pool with a large number of PGs succeeds without any warning.

Expected results:
We should block creating a large number of PGs on a pool when the cluster has only a limited number of OSDs (a back-of-the-envelope example follows).
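For a sense of scale, here is a rough calculation for the reproduction scenario above; the 3 OSDs per node and the replica count of 3 are assumptions for illustration, not figures from the report.

pg_num = 7000      # upper end of the 5k-7k range from step 2
pool_size = 3      # assumed replica count
num_osds = 12      # assumed: 4 nodes x 3 OSDs each

print(pg_num * pool_size / num_osds)   # 1750.0 PGs per OSD, far above the
                                       # commonly cited target of ~100 per OSD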

Additional info:
BU MOC encountered this issue, which rendered the cluster completely down.
Gamestream also encountered this issue in the past.

Comment 8 Josh Durgin 2017-10-16 15:20:31 UTC
*** Bug 1489064 has been marked as a duplicate of this bug. ***

Comment 9 Josh Durgin 2017-10-16 15:30:00 UTC

*** This bug has been marked as a duplicate of bug 1489064 ***

