Description of problem:
When I create a small Ceph cluster with 4 OSD nodes and 4 OSDs per node (16 OSDs in total) and then create one pool with the default/recommended number of PGs[1,2], which is 4096 for a cluster with 16 OSDs (between 10 and 50 OSDs), the cluster ends up in HEALTH_WARN state because of "too many PGs per OSD (768 > max 300)".

The reason given here is not valid, because: 4096 PGs / 16 OSDs = 256, not 768.

Please also be aware that OSD should mean "A physical or logical storage unit (e.g., LUN)."[3] So it shouldn't refer to anything else, for example to the number of OSD nodes.

Just a note: my cluster was created via USM/Red Hat Storage Console 2, but that shouldn't be a problem, as it should use a reasonable default configuration.

OSDs list/tree:
~~~~~~~~~~~~~~~~~~~~~~~
# ceph --cluster TestClusterA osd tree
ID WEIGHT   TYPE NAME              UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 12.04297 root default
-2  3.01074     host dhcp-126-125
 0  1.00000         osd.0               up  1.00000          1.00000
 1  0.01074         osd.1               up  1.00000          1.00000
 2  1.00000         osd.2               up  1.00000          1.00000
 3  1.00000         osd.3               up  1.00000          1.00000
-3  3.01074     host dhcp-126-126
 4  0.01074         osd.4               up  1.00000          1.00000
 5  1.00000         osd.5               up  1.00000          1.00000
 6  1.00000         osd.6               up  1.00000          1.00000
 7  1.00000         osd.7               up  1.00000          1.00000
-4  3.01074     host dhcp-126-127
 8  0.01074         osd.8               up  1.00000          1.00000
 9  1.00000         osd.9               up  1.00000          1.00000
10  1.00000         osd.10              up  1.00000          1.00000
11  1.00000         osd.11              up  1.00000          1.00000
-5  3.01074     host dhcp-126-128
12  0.01074         osd.12              up  1.00000          1.00000
13  1.00000         osd.13              up  1.00000          1.00000
14  1.00000         osd.14              up  1.00000          1.00000
15  1.00000         osd.15              up  1.00000          1.00000
~~~~~~~~~~~~~~~~~~~~~~~

Pool list:
~~~~~~~~~~~~~~~~~~~~~~~
# ceph --cluster TestClusterA osd pool ls
poolA
~~~~~~~~~~~~~~~~~~~~~~~

Version-Release number of selected component (if applicable):
RHEL 7.2
ceph-base-10.2.2-32.el7cp.x86_64
ceph-common-10.2.2-32.el7cp.x86_64
ceph-mon-10.2.2-32.el7cp.x86_64
ceph-osd-10.2.2-32.el7cp.x86_64
ceph-selinux-10.2.2-32.el7cp.x86_64
libcephfs1-10.2.2-32.el7cp.x86_64
python-cephfs-10.2.2-32.el7cp.x86_64

How reproducible:
probably 100%

Steps to Reproduce:
1. Create a small Ceph cluster with 4 OSD nodes and 4 OSDs per node.
2. Create one pool with the default number of PGs (4096).
3. When the pool creation finishes, check the health of the cluster.

Actual results:
~~~~~~~~~~~~~~~~~~~~~~~
# ceph --cluster TestClusterA -s
    cluster cd696122-9ade-437f-baac-b12a02123bbc
     health HEALTH_WARN
            too many PGs per OSD (768 > max 300)
     monmap e3: 3 mons at {dhcp-126-87=10.34.126.87:6789/0,dhcp-126-88=10.34.126.88:6789/0,dhcp-126-89=10.34.126.89:6789/0}
            election epoch 10, quorum 0,1,2 dhcp-126-87,dhcp-126-88,dhcp-126-89
     osdmap e59: 16 osds: 16 up, 16 in
            flags sortbitwise
      pgmap v371: 4096 pgs, 1 pools, 0 bytes data, 0 objects
            659 MB used, 12325 GB / 12325 GB avail
                4096 active+clean
~~~~~~~~~~~~~~~~~~~~~~~

Expected results:
In my case, there are 256 PGs per OSD (4096/16), so the cluster health status should be OK.

Additional info:
[1] http://docs.ceph.com/docs/master/rados/operations/placement-groups/
[2] https://access.redhat.com/documentation/en/red-hat-ceph-storage/1.3/paged/storage-strategies/chapter-14-pg-count (14.3. PG Count for Small Clusters)
[3] http://docs.ceph.com/docs/master/glossary/
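For reference, the arithmetic behind the expectation above, as a minimal Python sketch (the numbers come straight from this report):

~~~~~~~~~~~~~~~~~~~~~~~
# Reporter's expectation: PGs divided evenly across OSDs,
# without accounting for replication.
pg_num = 4096   # default PG count for a pool on a 10-50 OSD cluster
num_osds = 16   # 4 OSD nodes x 4 OSDs per node

print(pg_num / num_osds)  # 256.0 -- well under the warning max of 300
~~~~~~~~~~~~~~~~~~~~~~~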
As per recent discussions, one more thing done recently in USM covers the case when the calculated number of PGs is less than the number of OSDs. In this case, pgnum is calculated using the formula below:

pgnum = number of OSDs / replica count

with the next power of 2 taken as the final value of pgnum. See the sketch below.
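A minimal sketch of that fallback, assuming "next power of 2" means rounding the quotient up to the nearest power of two (the function name `usm_fallback_pgnum` is hypothetical, not USM's actual code):

~~~~~~~~~~~~~~~~~~~~~~~
def usm_fallback_pgnum(num_osds, replica_count):
    # Fallback described above: used when the calculated PG count
    # would be less than the number of OSDs.
    raw = num_osds / replica_count
    # Round up to the next power of two.
    pgnum = 1
    while pgnum < raw:
        pgnum *= 2
    return pgnum

# e.g. 16 OSDs, 3 replicas: 16 / 3 = 5.33 -> next power of two is 8
print(usm_fallback_pgnum(16, 3))  # 8
~~~~~~~~~~~~~~~~~~~~~~~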
The documentation seems a bit screwy: 4096 PGs with replication 3 would be too many for 16 OSDs. The guidance is to aim for around 100-200 PGs per OSD.
Doc team, would you please work with Sam and the RADOS team to ensure that proper guides are written?
Ken and Sam,

Per the first comment, there is some sort of math error somewhere in Ceph. This guidance was something Loic and I worked on some time ago, and the Ceph PG Calculator was introduced to deal with larger clusters and the gateway. Comment 1 is saying:

> When I create a small Ceph cluster with 4 OSD nodes and 4 OSDs per node (16 OSDs in total) and then create one pool with the default/recommended number of PGs[1,2], which is 4096 for a cluster with 16 OSDs (between 10 and 50 OSDs), the cluster ends up in HEALTH_WARN state because of "too many PGs per OSD (768 > max 300)". The reason given here is not valid, because: 4096 PGs / 16 OSDs = 256, not 768.

He's right. He's using the default value of 4096, but Ceph is saying that it is placing 768 PGs per OSD when that number should be 256, which is less than the 'mon pg warn max per osd' default value of 300. So something is wrong with these calculations.
Loic and Sam, would you please assist John with the guidance? See Comment 5
4096 * 3 / 16 = 768. You need to multiply the nominal number of PGs by the replication count (for EC pools, by K+M).
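In other words, the health check counts PG replicas per OSD, not PGs per OSD. A minimal Python sketch of that check, assuming the default 'mon pg warn max per osd' of 300 (illustrative only, not Ceph's actual implementation):

~~~~~~~~~~~~~~~~~~~~~~~
def pgs_per_osd(pg_num, pool_size, num_osds):
    # Each PG has pool_size replicas (or K+M shards for an
    # erasure-coded pool), and every replica lands on some OSD.
    return pg_num * pool_size // num_osds

MON_PG_WARN_MAX_PER_OSD = 300  # Ceph default at the time of this report

per_osd = pgs_per_osd(4096, 3, 16)
print(per_osd)                            # 768
print(per_osd > MON_PG_WARN_MAX_PER_OSD)  # True -> HEALTH_WARN
~~~~~~~~~~~~~~~~~~~~~~~

This also explains why lowering the default for small clusters clears the warning: 1024 * 3 / 16 = 192, which is under the threshold.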
Changed the recommendation for the 10-50 OSD range from 4096 to 1024, based on Josh Durgin's input.

https://access.qa.redhat.com/documentation/en/red-hat-ceph-storage/2/single/storage-strategies-guide#pg_count_for_small_clusters

Changed in upstream too.
(In reply to John Wilkins from comment #8)
> Changed in upstream too.

https://github.com/ceph/ceph/commit/5621a9ca4c5a49855835113f972b114530cf341c for the record
The documentation was changed as described in Comment 8. >> VERIFIED