Description of problem
PG calculation as currently implemented by RHSC 2.0 is wrong (details below).
The worst case scenario is that a ceph cluster ends up in a non-recoverable
state. In other words, there is a *risk of data loss* because of this issue.
The only scenario in which the current implementation works correctly is one
pool per cluster, which is not a very likely use case.
See the pgcalc tool, which provides proper guidance on how to configure the
PG count for a ceph pool, at http://ceph.com/pgcalc/ or
https://access.redhat.com/labs/cephpgc/
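For reference, the pgcalc guidance roughly follows a simple heuristic; here is a minimal sketch in Python (the function name, the 100-PGs-per-OSD target, and the even-data-split assumption are mine for illustration; the real tool also weighs each pool's expected share of the data):

```python
def suggested_pg_count(osd_count, pool_count=1, target_pgs_per_osd=100,
                       replica_size=3):
    """Rough per-pool PG estimate in the spirit of the pgcalc heuristic.

    Assumes data is spread evenly across pools; the actual pgcalc tool
    additionally weighs each pool by its expected percentage of data.
    """
    raw = (osd_count * target_pgs_per_osd) / float(replica_size * pool_count)
    # Round up to the next power of two, per pool.
    pg_num = 1
    while pg_num < raw:
        pg_num *= 2
    return pg_num
```

For example, a single replicated pool (size 3) on 10 OSDs would land at 512 PGs, while splitting the same cluster across four pools would suggest 128 PGs per pool; this is exactly why a fixed per-pool default breaks down as soon as more than one pool exists.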
On RHSC 2.0 server machine:
On Ceph 2.0 storage machines:
Steps to Reproduce
1. Install RHSC 2.0 following the documentation.
2. Accept few nodes for the ceph cluster.
3. Create new ceph cluster named 'alpha'.
4. Go to Storage -> Pools -> Add Storage to start the wizard for ceph pool setup.
5. Going through the wizard, select cluster 'alpha' and object storage.
6. Stop on the "Add Object Storage" page of the wizard and check the number of
placement groups (PGs).
On the "Add Object Storage" page, the PG number is:
* pre-calculated by the console itself (and the calculation is wrong)
* not possible for the user to change
Either (the quick fix):
* there is no default predefined value for PG count to start with
* the user can edit the value
* a link to the pgcalc tool provides proper guidance (the downstream version is
available at https://access.redhat.com/labs/cephpgc/)
Or (the proper fix):
The console implements the same logic as the pgcalc tool, providing the same
guidance and functionality to the user.
Adding comments from Michael Kidd here:
This issue allows the customer to create many pools with the same PG count
(which doesn't take into account how many pools there are, or how much data
will exist in them), and get into a state of too many PGs per OSD.
This is especially critical since you cannot reduce the PG count after the pool
is created... Instead, a new pool with the proper values must be created,
and all data migrated (a usually painful process).
The per-pool calculations should be rounded to a power of 2, not the overall
cluster value. It's unclear which is intended in the slide deck, but the
per-pool value is what's important.
Per pool PG count ( pg_num * size ) should not be allowed to be less than the
OSD count in the cluster as this would limit performance of that pool.
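The two per-pool constraints above can be checked mechanically; a minimal sketch (the function name and the warning wording are mine, not from the console code):

```python
def check_pool_pg_num(pg_num, size, osd_count):
    """Check the two per-pool constraints described above (sketch).

    - pg_num should be a power of two (per pool, not cluster-wide)
    - pg_num * size should not be below the cluster's OSD count
    Returns a list of warning strings (empty if both constraints hold).
    """
    warnings = []
    # A positive integer n is a power of two iff n & (n - 1) == 0.
    if pg_num <= 0 or (pg_num & (pg_num - 1)) != 0:
        warnings.append("pg_num %d is not a power of two" % pg_num)
    if pg_num * size < osd_count:
        warnings.append(
            "pg_num * size = %d is below the OSD count (%d); "
            "this pool could not use every OSD" % (pg_num * size, osd_count))
    return warnings
```

So, for instance, pg_num=128 with size 3 on a 40-OSD cluster passes both checks, while pg_num=100 with size 3 on a 400-OSD cluster trips both (not a power of two, and 300 < 400).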
Rewriting pgcalc in the async timeframe is not tenable, so we should expose the
default of 0 PGs in an editable form for the user to adjust.
To summarize, below are the changes which would be done:
1. Provide a text box in the UI to enter the PG number while creating a pool (with the default value set to zero).
2. Add a check to validate against negative values provided for the PG number.
3. Add a link to the pgcalc tool next to the PG number field, with a help icon saying "Be aware that pg count per pool is critical. please visit pg calc tool to better understand what value should be used".
4. During the expand cluster flow using new OSD nodes, show a warning to mention that "With expansion of cluster with OSD, cluster coming to non usable state would be very much possible as it involves movement of data across placement groups".
5. Add a checkbox for the admin to accept the expansion, and only allow the expansion to be submitted from the UI screen if it is selected.
6. In the backend, don't calculate the PG number automatically; always expect the value to be passed via the API.
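Items 1, 2, and 6 amount to a small input-validation routine; a hedged sketch of the idea (the function name and messages are mine, the actual console implements this in its own UI/API layer):

```python
def validate_pg_num_input(raw):
    """Validate the PG-number text field: no computed default, so the
    value must be an explicit positive integer supplied by the user.

    Returns (pg_num, None) on success or (None, error_message) on failure.
    """
    try:
        pg_num = int(raw)
    except (TypeError, ValueError):
        return None, "Placement group count must be a number"
    if pg_num <= 0:
        # Rejects negative values (item 2) and, per the follow-up in
        # comment 4, zero as well.
        return None, "Placement group count must be greater than zero"
    return pg_num, None
```

The key design point is that there is no fallback: when validation fails, the request is rejected rather than silently substituting a console-computed PG count.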
@Michael/Ju, I need your help to frame the warning messages in steps 3 and 4. Kindly provide your inputs.
For item 2, also validate that the value is non-zero.
My suggestions on warning texts below:
3. "Be aware that the PG count per pool value is critical for cluster performance and stability. Please visit the Ceph PGs per Pool Calc tool to better understand what value should be used."
4. "Ceph cluster expansion requires data movement between OSDs and can cause significant client IO performance impact if proper adjustments are not made. Please contact Red Hat support for help with the recommended changes."
@Ju, can you ack this please?
Checking with packages (on a RHEL 7.3 based, RHSCon 2.0 server machine):
Following the reproducer from the description of this BZ, I see the following:
1) On the "Add Object Storage" page, the explanation of the importance of the
PG number calculation is present (as proposed in comment 3), but a direct HTML
link to the pgcalc tool is missing.
Based on the description of the bug and the proposal in comment 2, I would
expect the link to the PG calc tool to be there.
2) The form on the "Add Object Storage" page doesn't check for a zero value in
the PG field. It's possible to submit a request with a zero PG number, which
would fail in the end, but the console doesn't directly show any error.
The form should both display a warning for a zero value, in the same way as for
a negative number, and not allow clicking the next button to submit such an
invalid request.
Looking at your original description, especially these properties of PG number:
> The per-pool calculations should be rounded to a power of 2, not the overall
> cluster value. It's unclear which is intended in the slide deck, but the
> per-pool value is what's important.
> Per pool PG count ( pg_num * size ) should not be allowed to be less than the
> OSD count in the cluster as this would limit performance of that pool.
I'm wondering if it would make sense for the form on the "Add Object Storage"
page to reject a PG value which doesn't meet these requirements, in a similar
way to how it rejects negative values and how it should reject a zero value.
While it would be great to have rules around the PG value, that would entail adding more logic and confirming it's implemented properly before the async update which doesn't seem realistic. So for this async update, simply removing the default enforcement, allowing a manual specification of PG count and linking to the PG calc tool is as good as I believe we can get.
Ultimately, we would have enforcement of the PG calc tool values and provide a
means for the end user to override, acknowledging that if they change the
value, non-optimal behavior may be experienced (wording TBD).
We can stop suggesting a PG value by keeping the field empty, and validate user
input to reject negative and zero values. Also, as you suggest, we can add a
small warning message.
Can you reply back with the exact warning message?
(In reply to Shubhendu Tripathi from comment #2)
> 4. While expand cluster flow using new OSD nodes, show a warning to mention
> that "With expansion of cluster with OSD, cluster coming to non usable state
> would be very much possible as it involves movement of data across placement
> groups"
> 5. Add a checkbox to accept the expansion from admin, and if selected then
> only allow expansion submit from UI screen
Just for the sake of keeping things organized, those items are covered in
BZ 1375972, not this one.
(In reply to Michael J. Kidd from comment #7)
> While it would be great to have rules around the PG value, that would
> entail adding more logic and confirming it's implemented properly before the
> async update which doesn't seem realistic. So for this async update, simply
> removing the default enforcement, allowing a manual specification of PG
> count and linking to the PG calc tool is as good as I believe we can get.
> Ultimately, we would have enforcement of the pg calc tool values and provide
> a means for the end user to override by acknowledging if they change the
> value, non-optimal behavior may be experienced (wording tbd).
So it's not reasonable to add any additional checks. Thanks for the clarification.
Karnan: See Comment #3.
The message is already in the test build I was given access to, but was missing the link to the PG Calc tool. I provided that feedback via email on the request to check the current message state.
Added link to pgcalc tool in the warning message. Also added validation to the pg number input.
Checking with packages (on a RHEL 7.3 based, RHSCon 2.0 server machine):
and I see that:
* the note now includes a link to https://access.redhat.com/labs/cephpgc/
* there is no default value for the "Placement Groups" field
* for a zero PG value, an error message is displayed and the "next" button is disabled
Doc-text looks good.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.