Bug 1362403 - Wrong calculation of PGs per OSD leads to cluster in HEALTH_WARN state with explanation "too many PGs per OSD (768 > max 300)"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Documentation
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 2.0
Assignee: ceph-docs@redhat.com
QA Contact: Daniel Horák
URL:
Whiteboard:
Depends On:
Blocks: 1366577
 
Reported: 2016-08-02 06:58 UTC by Daniel Horák
Modified: 2016-09-30 17:19 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1366577
Environment:
Last Closed: 2016-09-30 17:19:18 UTC
Embargoed:



Description Daniel Horák 2016-08-02 06:58:40 UTC
Description of problem:
  When I create a small Ceph cluster with 4 OSD nodes and 4 OSDs per node (16 OSDs in total) and then create one pool with the default/recommended number of PGs[1,2], which is 4096 for a cluster with between 10 and 50 OSDs (my 16 OSDs fall in that range), the cluster ends up in HEALTH_WARN state because of "too many PGs per OSD (768 > max 300)".

  The stated reason is not valid, because:
    4096 PGs / 16 OSDs = 256, not 768

  Please also be aware that OSD should mean "A physical or logical storage unit (e.g., LUN)"[3], so it shouldn't refer to anything else - for example, to the number of OSD nodes.

  Just a note: my cluster was created via USM/Red Hat Storage Console 2, but that shouldn't be a problem, as it should use a reasonable default configuration.

  OSDs list/tree:
  ~~~~~~~~~~~~~~~~~~~~~~~
  # ceph --cluster TestClusterA osd tree
  ID WEIGHT   TYPE NAME             UP/DOWN REWEIGHT PRIMARY-AFFINITY 
  -1 12.04297 root default                                            
  -2  3.01074     host dhcp-126-125                                   
   0  1.00000         osd.0              up  1.00000          1.00000 
   1  0.01074         osd.1              up  1.00000          1.00000 
   2  1.00000         osd.2              up  1.00000          1.00000 
   3  1.00000         osd.3              up  1.00000          1.00000 
  -3  3.01074     host dhcp-126-126                                   
   4  0.01074         osd.4              up  1.00000          1.00000 
   5  1.00000         osd.5              up  1.00000          1.00000 
   6  1.00000         osd.6              up  1.00000          1.00000 
   7  1.00000         osd.7              up  1.00000          1.00000 
  -4  3.01074     host dhcp-126-127                                   
   8  0.01074         osd.8              up  1.00000          1.00000 
   9  1.00000         osd.9              up  1.00000          1.00000 
  10  1.00000         osd.10             up  1.00000          1.00000 
  11  1.00000         osd.11             up  1.00000          1.00000 
  -5  3.01074     host dhcp-126-128                                   
  12  0.01074         osd.12             up  1.00000          1.00000 
  13  1.00000         osd.13             up  1.00000          1.00000 
  14  1.00000         osd.14             up  1.00000          1.00000 
  15  1.00000         osd.15             up  1.00000          1.00000 
  ~~~~~~~~~~~~~~~~~~~~~~~

  Pool list:
  ~~~~~~~~~~~~~~~~~~~~~~~
  # ceph --cluster TestClusterA osd pool ls
  poolA
  ~~~~~~~~~~~~~~~~~~~~~~~


Version-Release number of selected component (if applicable):
  RHEL 7.2
  ceph-base-10.2.2-32.el7cp.x86_64
  ceph-common-10.2.2-32.el7cp.x86_64
  ceph-mon-10.2.2-32.el7cp.x86_64
  ceph-osd-10.2.2-32.el7cp.x86_64
  ceph-selinux-10.2.2-32.el7cp.x86_64
  libcephfs1-10.2.2-32.el7cp.x86_64
  python-cephfs-10.2.2-32.el7cp.x86_64

How reproducible:
  probably 100%

Steps to Reproduce:
1. Create a small Ceph cluster with 4 OSD nodes and 4 OSDs per node.
2. Create one pool with default number of PGs (4096).
3. When the pool creation finishes, check the health of the cluster.

Actual results:
  ~~~~~~~~~~~~~~~~~~~~~~~
  # ceph --cluster TestClusterA -s
    cluster cd696122-9ade-437f-baac-b12a02123bbc
     health HEALTH_WARN
            too many PGs per OSD (768 > max 300)
     monmap e3: 3 mons at {dhcp-126-87=10.34.126.87:6789/0,dhcp-126-88=10.34.126.88:6789/0,dhcp-126-89=10.34.126.89:6789/0}
            election epoch 10, quorum 0,1,2 dhcp-126-87,dhcp-126-88,dhcp-126-89
     osdmap e59: 16 osds: 16 up, 16 in
            flags sortbitwise
      pgmap v371: 4096 pgs, 1 pools, 0 bytes data, 0 objects
            659 MB used, 12325 GB / 12325 GB avail
                4096 active+clean
  ~~~~~~~~~~~~~~~~~~~~~~~

Expected results:
  In my case, there are 256 PGs per OSD (4096/16), so the cluster health status should be OK.

Additional info:
[1] http://docs.ceph.com/docs/master/rados/operations/placement-groups/
[2] https://access.redhat.com/documentation/en/red-hat-ceph-storage/1.3/paged/storage-strategies/chapter-14-pg-count (14.3. PG Count for Small Clusters)
[3] http://docs.ceph.com/docs/master/glossary/

Comment 2 Shubhendu Tripathi 2016-08-02 14:08:49 UTC
As per recent discussions, one more thing done recently in USM handles the case when the calculated number of PGs is less than the number of OSDs. In this case, the pgnum would be calculated using the formula below:

pgnum = number of OSDs / replica count
then take the next power of 2 as the final value of pgnum.
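
A minimal sketch of that fallback in Python (the function name is hypothetical; this only illustrates the formula described above, not the actual USM code):
~~~~~~~~~~~~~~~~~~~~~~~
# Hypothetical illustration of the fallback described in this comment; not the actual USM code.
def fallback_pg_num(num_osds, replica_count):
    # Base value used when the calculated PG count would be lower than the OSD count.
    base = num_osds / replica_count
    # Round up to the next power of two; that becomes the final pgnum.
    pg_num = 1
    while pg_num < base:
        pg_num *= 2
    return pg_num

# Example: 16 OSDs with 3-way replication -> 16 / 3 = 5.33 -> next power of two is 8.
print(fallback_pg_num(16, 3))  # 8
~~~~~~~~~~~~~~~~~~~~~~~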

Comment 3 Samuel Just 2016-08-02 15:23:40 UTC
The documentation seems a bit screwy; 4096 PGs with replication 3 would be too many for 16 OSDs. The guidance is to try for around 100-200 PGs per OSD.

Comment 4 Ken Dreyer (Red Hat) 2016-08-02 19:19:06 UTC
Doc team, would you please work with Sam and the RADOS team to ensure that proper guides are written?

Comment 5 John Wilkins 2016-08-08 19:08:15 UTC
Ken and Sam,

Per the first comment, there is some sort of math error somewhere in Ceph. This guidance was something Loic and I worked on some time ago, and the Ceph PG Calculator was introduced to deal with larger clusters and the gateway. Comment one is saying: 

"When I create small Ceph cluster with 4 OSD nodes and 4 OSDs peer node (together 16 OSDs) and then create one pool with default/recommended number of PGs[1,2] which is 4096 for the cluster with 16 OSDs (which is between 10 and 50), the cluster ends in HEALTH_WARN state because of "too many PGs per OSD (768 > max 300)".

  The reason here is not valid, because:
    4096 PGs / 16 OSDs = 256 not 768"

He's right. He's using the default value of 4096, but Ceph is saying that it is placing 768 PGs per OSD when that number should be 256, which is less than the 'mon pg warn max per osd' default value of 300. So something is wrong with these calcs.

Comment 6 Ken Dreyer (Red Hat) 2016-08-10 13:28:22 UTC
Loic and Sam, would you please assist John with the guidance? See Comment 5

Comment 7 Samuel Just 2016-08-10 14:55:05 UTC
4096 * 3 / 16 = 768. You need to multiply the nominal number of PGs by the replication count (for EC pools, by K+M).
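
A standalone check of that arithmetic (just the formula from this comment, not code from Ceph):
~~~~~~~~~~~~~~~~~~~~~~~
# Standalone check of the arithmetic above; not Ceph code.
pg_num = 4096        # PGs in the pool
replica_count = 3    # replicated pool size (for EC pools use K+M)
num_osds = 16

pgs_per_osd = pg_num * replica_count / num_osds
print(pgs_per_osd)   # 768.0, which is what triggers "too many PGs per OSD (768 > max 300)"
~~~~~~~~~~~~~~~~~~~~~~~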

Comment 8 John Wilkins 2016-08-11 21:44:34 UTC
Changed the recommendation for the upper 10-50 OSDs range from 4096 to 1024, based on Josh Durgin's input.

https://access.qa.redhat.com/documentation/en/red-hat-ceph-storage/2/single/storage-strategies-guide#pg_count_for_small_clusters

Changed in upstream too.
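
The same check with the revised recommendation (again only a sketch of the arithmetic, assuming 3-way replication):
~~~~~~~~~~~~~~~~~~~~~~~
# Same arithmetic with the revised 1024 PG recommendation; not Ceph code.
pg_num = 1024
replica_count = 3    # assumed replicated pool size
num_osds = 16

print(pg_num * replica_count / num_osds)   # 192.0, within the ~100-200 PGs per OSD guidance
~~~~~~~~~~~~~~~~~~~~~~~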

Comment 9 Ken Dreyer (Red Hat) 2016-08-11 22:51:11 UTC
(In reply to John Wilkins from comment #8)
> Changed in upstream too.

https://github.com/ceph/ceph/commit/5621a9ca4c5a49855835113f972b114530cf341c

for the record

Comment 10 Daniel Horák 2016-08-12 11:36:42 UTC
The documentation was changed as described in Comment 8.

>> VERIFIED

