Description of problem:

The Ceph PG calculator can generate recommendations for pool PG counts that conflict with the mon_max_pg_per_osd parameter. This causes significant aggravation for the installer, particularly when OpenStack is deploying a Ceph cluster.

Version-Release number of selected component (if applicable):

RHCS 3.1 - ceph-common-12.2.4-10.el7cp.x86_64
RHOSP 13 rc6
RHEL 7.5 - 3.10.0-862.3.3.el7.x86_64

How reproducible:

Every time.

Steps to Reproduce:

1. Plug this scenario into the PG calculator at https://access.redhat.com/labs/cephpgc/
   - 1000 OSDs
   - 95% space used for the "vms" pool
   - 5% space used for the glance "images" pool
   - none for any other pool
   See the attachment for the Ceph PG calculator output.

2. Add up the PG counts for each pool and multiply by 3 (replication count); the total is:
   (512 + 32768 + 32768 + 4096) * 3 = 70144 * 3 = 210432
   Compare that to mon_max_pg_per_osd * 1000 OSDs = 200 * 1000 = 200000.
   (A quick recap of this arithmetic appears at the end of this comment.)

Actual results:

Pool creation will fail.

Expected results:

PG Calc should not conflict with mon_max_pg_per_osd, ever!

Additional info:

I spoke with the Ceph developers at the upstream performance weekly; their conclusion was that we need to start using the ceph-mgr balancer module (which is in Luminous = RHCS 3), and then we wouldn't need so many PGs. But even then the PG calculator needs an update at a minimum. I was able to enable the balancer module in the RHOSP 13 ceph-mgr container, but I don't know yet whether it works. The RHOSP 13 installer and ceph-ansible certainly do not enable it by default.
http://docs.ceph.com/docs/luminous/mgr/balancer/

My suggestion would be to lower the PG calculator's recommendations, since it was developed before the ceph-mgr balancer module existed. But by how much? I would need more experience with the effectiveness of the balancer module in different-sized configurations before I could give a clear answer on this.

Background - the change in RHCS 3 that leads to this:
https://ceph.com/community/new-luminous-pg-overdose-protection/

Code that implements the mon_max_pg_per_osd check:
https://github.com/ceph/ceph/blob/e59258943bcfe3e52d40a59ff30df55e1e6a3865/src/mon/OSDMonitor.cc#L5670-L5698
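For convenience, the comparison from step 2 can be reproduced with a couple of one-liners; the pool PG counts below are simply the values from the attached pgcalc output, nothing cluster-specific:

    # Sum of the four recommended pool pg_num values, times 3x replication:
    echo $(( (512 + 32768 + 32768 + 4096) * 3 ))   # 210432 PG instances
    # Ceiling enforced by PG overdose protection at the default limit:
    echo $(( 200 * 1000 ))                         # 200000 = mon_max_pg_per_osd * 1000 OSDs

Since 210432 > 200000, creating the last of the recommended pools is rejected.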
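For reference, enabling the balancer module per the Luminous docs linked above looks roughly like the following (a sketch only; I have not verified that balancing actually works in the RHOSP 13 ceph-mgr container, and ceph-ansible does not manage any of this yet):

    ceph mgr module enable balancer     # load the module into ceph-mgr
    ceph balancer mode crush-compat     # upmap mode would require all-luminous clients
    ceph balancer on                    # start automatic balancing
    ceph balancer status                # check what the balancer is doing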
Created attachment 1460926 [details]
screenshot try 2 of Ceph PG calculator

Shows the output of the Ceph PG calculator for the specified inputs.
Since this is already out in the wild in the docs and the PG calc, let's increase mon_max_pg_per_osd to 300 to avoid this.
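Until a new default ships, the limit can also be raised per cluster. A minimal sketch of the override in ceph.conf (restart the mon/mgr daemons afterwards for it to take effect):

    [global]
    mon_max_pg_per_osd = 300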
Upstream PR: https://github.com/ceph/ceph/pull/23251

We still really need to fix the PG calculator, though. In that screenshot it appears to default to recommending that the user target 200 PGs/OSD, and so it tells them to create 70144 PGs with 3x replication. Or else somebody or something else put in the target of 200, in which case we should fix those defaults. Michael, any thoughts?
I think, then, that the best course of action based on all the info (and the new balancer module) is to reduce the default target PGs per OSD to 100 in the PG Calc tool, and to remove all mentions of 300 as a target. If someone intentionally sets it to 200 and gets that warning, I think the expectation is that the cluster will be expanded soon and they can figure out how to adjust the warning threshold. Sound reasonable?
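For a rough sense of the headroom that gives, assuming the calculator scales its recommendations roughly linearly with the target, a 100 PGs/OSD target on the same 1000-OSD, 3x-replicated example works out to about:

    # Hypothetical budget with a 100 PGs/OSD target (same 1000 OSDs, 3x replication):
    echo $(( 100 * 1000 / 3 ))    # ~33333 total PGs, roughly half the pgcalc totals above
    echo $(( 200 * 1000 / 3 ))    # ~66666 total PGs allowed by the default mon_max_pg_per_osd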
Hi Michael,

I have made the changes according to your suggestions; please review them at https://labsci.usersys.redhat.com/labs/cephpgc/

Thanks!
Hello Shiyi,

The update looks good to me. Please push it to the production instance.

Thanks!
Hi Michael,

The new update is now on the production instance.

Thanks!