Bug 1603615

Summary: Ceph PG calculator conflict with mon_max_pg_per_osd
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Ben England <bengland>
Component: RADOS
Assignee: Neha Ojha <nojha>
Status: CLOSED CURRENTRELEASE
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: medium
Priority: high
Version: 3.1
CC: akrzos, ceph-eng-bugs, dwilson, dzafman, gfarnum, jdurgin, johfulto, kchai, linuxkidd, mnelson, nojha, shiyuan, xmeng
Target Milestone: rc
Target Release: 3.2
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2018-08-22 21:25:18 UTC
Attachments:
- screenshot of Ceph PG calculator output
- screenshot try 2 of Ceph PG calculator

Description Ben England 2018-07-19 18:05:26 UTC
Description of problem:

The Ceph PG calculator can generate recommendations for pool PG counts that conflict with the mon_max_pg_per_osd parameter. This causes significant aggravation for the installer, particularly when OpenStack is deploying a Ceph cluster.

Version-Release number of selected component (if applicable):

RHCS 3.1 - ceph-common-12.2.4-10.el7cp.x86_64
RHOSP 13 rc6
RHEL 7.5 - 3.10.0-862.3.3.el7.x86_64

How reproducible:

every time.

Steps to Reproduce:

1. Plug this scenario into the PG calculator at

https://access.redhat.com/labs/cephpgc/

-- 1000 OSDs
-- 95% space used for the "vms" pool
-- 5% space used for the glance "images" pool
-- none for any other pool

See the attachment for the Ceph PG calculator output.

2. Add up the PG counts for each pool and multiply by 3 (the replication count); the total is:

(512+32768+32768+4096)*3 = 70144*3 = 210432

Compare to mon_max_pg_per_osd * 1000 OSDs = 200 * 1000 = 200000

3. Attempt to create the pools with the recommended PG counts.
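
A quick numeric check of the conflict (a minimal Python sketch of the arithmetic in step 2, assuming the default mon_max_pg_per_osd of 200):

pg_counts = [512, 32768, 32768, 4096]   # per-pool recommendations from the calculator
replicas = 3
num_osds = 1000
mon_max_pg_per_osd = 200                # default in Luminous / RHCS 3

total_pg_replicas = sum(pg_counts) * replicas      # 210432
budget = mon_max_pg_per_osd * num_osds             # 200000
print(total_pg_replicas, budget, total_pg_replicas / num_osds)
# 210432 200000 210.432  -> roughly 210 PGs per OSD, over the 200 limit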

Actual results:

Pool creation will fail.

Expected results:

PG Calc should not conflict with mon_max_pg_per_osd, ever!

Additional info:

I spoke with Ceph developers at the upstream perf weekly; their conclusion was that we need to start using the ceph-mgr balancer module (which is in Luminous = RHCS 3), and then we wouldn't need so many PGs. But then the PG calculator needs an update at a minimum. I was able to enable the balancer module in the RHOSP 13 ceph-mgr container, but I don't know yet whether it works. The RHOSP 13 installer and ceph-ansible certainly do not enable it by default.

http://docs.ceph.com/docs/luminous/mgr/balancer/

My suggestion would be to lower the PG calculator's recommendations, since it was developed prior to having the ceph-mgr balancer module. But by how much? I would need more experience with the effectiveness of the balancer module in different-sized configurations before I could give a clear answer on this.

background:

The change in RHCS 3 that leads to this:
https://ceph.com/community/new-luminous-pg-overdose-protection/

Code that implements the mon_max_pg_per_osd check:
https://github.com/ceph/ceph/blob/e59258943bcfe3e52d40a59ff30df55e1e6a3865/src/mon/OSDMonitor.cc#L5670-L5698
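
For reference, the check behind that link boils down to roughly the following rule (a minimal Python sketch approximating the behavior, not the actual C++ in OSDMonitor.cc):

# Approximation of the mon_max_pg_per_osd overdose-protection rule.
def pg_creation_allowed(existing_pg_replicas, new_pgs, replica_size,
                        num_osds, mon_max_pg_per_osd=200):
    # Reject the new pool if the projected PG replicas per OSD would
    # exceed the configured limit.
    projected = existing_pg_replicas + new_pgs * replica_size
    return projected <= mon_max_pg_per_osd * num_osds

# In the scenario above, whichever pool is created last trips the check:
print(pg_creation_allowed(existing_pg_replicas=(512 + 32768 + 32768) * 3,
                          new_pgs=4096, replica_size=3,
                          num_osds=1000))   # False -> pool creation is refused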

Comment 3 Ben England 2018-07-19 18:09:42 UTC
Created attachment 1460926 [details]
screenshot try 2 of Ceph PG calculator

Shows the output of the Ceph PG calculator for the specified inputs.

Comment 4 Josh Durgin 2018-07-25 22:52:57 UTC
Since this is already in the wild in the docs/PG calc, let's increase mon_max_pg_per_osd to 300 to avoid this.
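
A quick sanity check on why 300 gives enough headroom for the 1000-OSD scenario in the description (simple arithmetic, same numbers as above):

total_pg_replicas = (512 + 32768 + 32768 + 4096) * 3   # 210432
print(total_pg_replicas <= 200 * 1000)   # False -> rejected at the current limit
print(total_pg_replicas <= 300 * 1000)   # True  -> fits with the limit raised to 300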

Comment 5 Greg Farnum 2018-08-08 21:33:38 UTC
Upstream PR: https://github.com/ceph/ceph/pull/23251

We still really need to fix the PG calculator, though. In that screenshot it appears to recommend by default that the user target 200 PGs/OSD, which is how it arrives at 70144 PGs with 3x replication.
Or else somebody or something else put in the target of 200, in which case we should fix that.
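
For context, the calculator's recommendation is commonly described as roughly (target PGs per OSD x OSD count x %data) / replica count, rounded up to a power of two. A minimal Python sketch of that rule (an approximation, not the calculator's actual source) reproduces, for example, the 4096 recommendation for a 5% pool in this scenario, and shows that halving the target halves the recommendation:

import math

def pg_calc(target_pgs_per_osd, num_osds, percent_data, replica_size):
    # Approximate the calculator: raw value rounded up to the next power of two.
    raw = target_pgs_per_osd * num_osds * percent_data / replica_size
    return 2 ** math.ceil(math.log2(raw))

print(pg_calc(200, 1000, 0.05, 3))   # 4096 at the current 200 PGs/OSD target
print(pg_calc(100, 1000, 0.05, 3))   # 2048 if the default target were 100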

Michael, any thoughts?

Comment 6 Michael J. Kidd 2018-08-08 21:58:38 UTC
I think, then, that the best course of action based on all the info (and the new balancer module) is to reduce the default target PGs per OSD to 100 in the PG Calc tool and remove all mentions of 300 as a target.
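
For the 1000-OSD example from the description, halving the target roughly halves every pool's recommendation, which lands well inside the default budget (a back-of-the-envelope check, assuming each recommendation simply halves):

old = [512, 32768, 32768, 4096]        # recommendations at the 200 PGs/OSD target
new = [pg // 2 for pg in old]          # same pools if every recommendation halves

replicas, num_osds, limit = 3, 1000, 200
print(sum(old) * replicas, sum(new) * replicas, limit * num_osds)
# 210432 105216 200000 -> the halved plan fits comfortably under the budget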

If someone is intentionally setting it to 200 and they get that warning, I think the expectation is that the cluster will be expanded soon and they can figure out how to adjust the warning threshold.


Sound reasonable?

Comment 9 Shiyi Yuan 2018-08-10 06:12:04 UTC
Hi Michael,

I have made the changes according to your suggestions; please review them
at https://labsci.usersys.redhat.com/labs/cephpgc/

Thanks!

Comment 10 Michael J. Kidd 2018-08-10 19:43:04 UTC
Hello Shiyi,
  The update looks good to me.  Please push it to the production instance.

Thanks!

Comment 11 Shiyi Yuan 2018-08-13 02:15:16 UTC
Hi Michael

The new update is now on the production instance.

Thanks!