Description of problem:
We have had customers encounter issues caused by a large number of PGs per OSD.
In the latest incident an OSD had a PG count of 19k, which eventually led to the cluster being unable to recover.
We would like a mechanism that checks the PG-to-OSD ratio before a user is allowed to create further PGs in a cluster, so that the PG count cannot exceed a safe limit for cluster recovery.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Ceph cluster with 4 OSD nodes.
2. Create a pool with a large number of PGs on it (5k-7k).
3. Put data on the cluster.
4. Remove an OSD node from the cluster so that its PGs have to rebalance to another node in the cluster.
5. Add the node back into the cluster and force a rebalance.
Creating the pool with a large number of PGs succeeds without warning.
We should block the creation of a large number of PGs on a pool when the cluster has only a limited number of OSDs.
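A minimal sketch of the kind of pre-creation check being requested, to illustrate the arithmetic only. It is not Ceph code: the function names, the parameters, and the limit of 200 PGs per OSD are all hypothetical placeholders, and the example assumes a replica size of 3 and one OSD per node, neither of which is stated in this report.

```python
def projected_pgs_per_osd(existing_pg_replicas, new_pg_num, replica_size, num_osds):
    """Average number of PG replicas each OSD would hold after the pool is created.

    existing_pg_replicas: PG replicas already placed in the cluster
                          (sum of pg_num * size over existing pools).
    new_pg_num:           pg_num requested for the new pool.
    replica_size:         replica count of the new pool.
    num_osds:             number of OSDs available to hold the PGs.
    """
    if num_osds <= 0:
        raise ValueError("cluster has no OSDs")
    return (existing_pg_replicas + new_pg_num * replica_size) / num_osds


def may_create_pool(existing_pg_replicas, new_pg_num, replica_size, num_osds,
                    max_pgs_per_osd=200):
    """Return (allowed, projected_ratio).

    max_pgs_per_osd is a hypothetical limit chosen for illustration,
    not a value taken from this report.
    """
    ratio = projected_pgs_per_osd(existing_pg_replicas, new_pg_num,
                                  replica_size, num_osds)
    return ratio <= max_pgs_per_osd, ratio


# Reproduction scenario: empty cluster, 4 OSDs (assumed one per node),
# new pool with 5000 PGs and an assumed replica size of 3.
allowed, ratio = may_create_pool(0, 5000, 3, 4)
print(allowed, ratio)
```

With these assumed inputs each OSD would carry 3750 PG replicas on average, so the check would refuse the request instead of letting it succeed silently.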
BU MOC encountered this issue, which rendered the cluster completely down.
Gamestream also encountered this issue in the past.
*** Bug 1489064 has been marked as a duplicate of this bug. ***
*** This bug has been marked as a duplicate of bug 1489064 ***