Bug 1578086 - osp13 ceph deployment fails with CephPoolDefaultPgNum >32 with 90 osds
Summary: osp13 ceph deployment fails with CephPoolDefaultPgNum >32 with 90 osds
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 3.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: z2
Target Release: 3.1
Assignee: Guillaume Abrioux
QA Contact: Vasishta
URL:
Whiteboard:
Duplicates: 1647467
Depends On:
Blocks: 1548353 1590938 1592848
 
Reported: 2018-05-14 18:57 UTC by Dave Wilson
Modified: 2019-08-26 15:11 UTC
CC List: 21 users

Fixed In Version: RHEL: ceph-ansible-3.1.5-1.el7cp Ubuntu: ceph-ansible_3.1.5-2redhat1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1590938
Environment:
Last Closed: 2019-08-26 15:11:48 UTC
Embargoed:


Attachments
ceph ansible log (465.32 KB, text/plain)
2018-05-14 18:57 UTC, Dave Wilson
env file (2.47 KB, text/plain)
2018-05-14 18:58 UTC, Dave Wilson
output of ceph-ansible playbook showing pool creation failing (489.98 KB, application/x-gzip)
2018-05-31 20:41 UTC, John Fulton
/var/log/mistral/ceph-install-workflow.log (8.06 MB, text/plain)
2018-07-20 11:44 UTC, Ben England


Links
Github ceph/ceph-ansible pull 2628 (closed): move openstack and cephfs pools creation in ceph-osd (last updated 2021-02-09 13:15:43 UTC)
Github ceph/ceph-ansible pull 2675 (closed): osds: wait for osds to be up before creating pools (last updated 2021-02-09 13:15:43 UTC)

Description Dave Wilson 2018-05-14 18:57:14 UTC
Created attachment 1436506 [details]
ceph ansible log

Description of problem: 
OSP13 deployment fails with CephPoolDefaultPgNum set greater than 32 (e.g. 64).
The relevant error generated is:
"pg_num 64 size 3 would mean 768 total pgs, which exceeds max 600 (mon_max_pg_per_osd 200 * num_in_osds 3)", "stderr_lines": ["Error ERANGE:  pg_num 64 size 3 would mean 768 total pgs, which exceeds max 600 (mon_max_pg_per_osd 200 * num_in_osds 3)"], "stdout": "", "stdout_lines": []}".
Note that I'm deploying with 90 OSDs, replication count 3, and 5 OpenStack pools. It appears the installer is computing the maximum allowed PGs (200 per OSD * 3 in OSDs = 600) from only the few OSDs registered at that point rather than from all 90.
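For reference, a minimal sketch of the arithmetic behind that error message (the real check lives in the monitor code; the helper below is hypothetical and only reproduces the numbers reported above):

# The limit quoted in the error: the projected cluster-wide PG total
# (pg_num * size, summed over all pools including the one being created)
# must not exceed mon_max_pg_per_osd * num_in_osds.
def pool_creation_rejected(projected_total_pgs, mon_max_pg_per_osd, num_in_osds):
    return projected_total_pgs > mon_max_pg_per_osd * num_in_osds

print(pool_creation_rejected(768, 200, 3))    # True  -> Error ERANGE, as reported
print(pool_creation_rejected(768, 200, 90))   # False -> with all 90 OSDs counted,
                                              #          the budget is 200 * 90 = 18000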


Version-Release number of selected component (if applicable):
ceph-ansible-3.1.0-0.1.beta4.el7cp.noarch
osp 2018-03-29.1 puddle  
dockerimage: rhceph:3-6




How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP13 with ceph-storage and pg_num greater than 32.
2. Configure enough OSDs that the deployment should succeed given the number of pools and a maximum of 200 PGs per OSD.

Actual results:
failed with "pg_num 64 size 3 would mean 768 total pgs, which exceeds max 600 (mon_max_pg_per_osd 200 * num_in_osds 3)", "stderr_lines": ["Error ERANGE:  pg_num 64 size 3 would mean 768 total pgs, which exceeds max 600 


Expected results:
Successful OSP13 overcloud deployment with ceph-storage.

Additional info:

Comment 3 Dave Wilson 2018-05-14 18:58:09 UTC
Created attachment 1436507 [details]
env file

Comment 4 Sébastien Han 2018-05-15 07:28:47 UTC
Isn't this one a dup John?

Comment 5 Sébastien Han 2018-05-18 12:35:25 UTC
The reported error is valid. When deploying a cluster, it's encouraged (and will soon be mandatory) to set a correct PG count for each pool. Using https://ceph.com/pgcalc/ as a reference is helpful; there is no good default calculation for this.

Given that the person deploying the cluster knows how the machines/disks will be configured, the total number of OSDs is known, which will help avoid this error in the future.

I'm closing this, feel free to re-open if you have any concerns.

Comment 6 John Fulton 2018-05-21 21:33:22 UTC
The bug is that ceph-ansible tries to create the pools _before_ all the OSDs are active.

This results in the overdose protection check [1] failing: in this scenario, where the requested OSDs haven't yet been activated, num_osds returns 1, so a PG count like 256, which is reasonable on a system with 50 to 100 OSDs, fails the check. If you lower the PG count to 32, as has been done in CI jobs for virtual environments, you don't hit the error because 5 pools * 3 size * 32 pg_num < 600 total allowable PGs based on 1 OSD.
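A quick worked version of those numbers (a sketch; the 600-PG budget is the one reported at pool-creation time):

# 5 pools * 3 replicas * pg_num, compared against the 600-PG budget available
# when only a handful of OSDs have registered with the mons:
print(5 * 3 * 32)    # 480  -> fits under 600, so the CI-style pg_num of 32 passes
print(5 * 3 * 256)   # 3840 -> far over 600, so a pg_num of 256 trips the check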

A fix is to create the pools _after_ all of the OSDs are active. To prove that this will work I did the following:

A. comment out pool creation in ceph-ansible [2]
B. run a fresh ceph-ansible and see it pass successfully without any pools getting created
C. uncomment pool creation in ceph-ansible [2]
D. re-run ceph-ansible and see it pass successfully with pools getting created [3]

On the second run of ceph-ansible (step D) the pools were created because the OSDs were active and num_osds returned a value large enough for the check [1] to pass. Note that I did my test without MDS, as it would have the same problem [4].

Do you think you could modify the pool creation task to run after the OSDs are active? 

[1] https://github.com/ceph/ceph/blob/e59258943bcfe3e52d40a59ff30df55e1e6a3865/src/mon/OSDMonitor.cc#L5673 
[2] https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-mon/tasks/openstack_config.yml#L10-L35 

[3]
[root@overcloud-controller-0 ~]# ceph osd dump | grep pool
pool 1 'manila_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 2 flags hashpspool stripe_width 0
pool 2 'images' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 144 flags hashpspool stripe_width 0
pool 3 'metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 145 flags hashpspool stripe_width 0
pool 4 'backups' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 146 flags hashpspool stripe_width 0
pool 5 'vms' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 147 flags hashpspool stripe_width 0
pool 6 'volumes' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 148 flags hashpspool stripe_width 0
[root@overcloud-controller-0 ~]# 

[4] https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-mon/tasks/create_mds_filesystems.yml#L2-L6

Comment 7 Guillaume Abrioux 2018-05-29 13:21:55 UTC
fixed in v3.1.0rc4

Comment 14 John Fulton 2018-05-31 20:41:47 UTC
Created attachment 1446432 [details]
output of ceph-ansible playbook showing pool creation failing

Comment 21 John Fulton 2018-06-02 18:09:19 UTC
- Installed ceph-ansible rc6 on the undercloud running the OSP13 puddle [1]
- ceph-ansible deployment succeeded [2]
- pools were created with the desired PG number of 256 [3]
- I didn't write the code, guits++ did
- I am marking this bug as verified

[1]
[root@microbrow-07 mistral]# rpm -q ceph-ansible
ceph-ansible-3.1.0-0.1.rc6.el7cp.noarch         
[root@microbrow-07 mistral]#    

[2]
2018-06-01 17:43:12,515 p=9922 u=mistral |  PLAY RECAP *********************************************************************
2018-06-01 17:43:12,515 p=9922 u=mistral |  192.168.24.54              : ok=80   changed=13   unreachable=0    failed=0   
2018-06-01 17:43:12,515 p=9922 u=mistral |  192.168.24.57              : ok=122  changed=20   unreachable=0    failed=0   
2018-06-01 17:43:12,515 p=9922 u=mistral |  192.168.24.59              : ok=58   changed=6    unreachable=0    failed=0   
2018-06-01 17:43:12,515 p=9922 u=mistral |  192.168.24.61              : ok=85   changed=17   unreachable=0    failed=0   
2018-06-01 17:43:12,515 p=9922 u=mistral |  192.168.24.64              : ok=84   changed=13   unreachable=0    failed=0   
2018-06-01 17:43:12,515 p=9922 u=mistral |  INSTALLER STATUS ***************************************************************
2018-06-01 17:43:12,517 p=9922 u=mistral |  Install Ceph Monitor        : Complete (0:01:14)
2018-06-01 17:43:12,518 p=9922 u=mistral |  Install Ceph Manager        : Complete (0:00:18)
2018-06-01 17:43:12,518 p=9922 u=mistral |  Install Ceph OSD            : Complete (0:11:38)
2018-06-01 17:43:12,518 p=9922 u=mistral |  Install Ceph Client         : Complete (0:00:31)
2018-06-01 17:43:12,518 p=9922 u=mistral |  Friday 01 June 2018  17:43:12 -0400 (0:00:00.043)       0:13:55.702 *********** 
2018-06-01 17:43:12,518 p=9922 u=mistral |  =============================================================================== 
2018-06-01 17:43:12,521 p=9922 u=mistral |  ceph-osd : prepare ceph "filestore" containerized osd disk(s) non-collocated - 449.53s
2018-06-01 17:43:12,521 p=9922 u=mistral |  ceph-osd : wait for all osd to be up ----------------------------------- 75.31s
2018-06-01 17:43:12,521 p=9922 u=mistral |  ceph-osd : systemd start osd container --------------------------------- 27.63s
2018-06-01 17:43:12,521 p=9922 u=mistral |  ceph-osd : resolve dedicated device link(s) ---------------------------- 24.62s
2018-06-01 17:43:12,521 p=9922 u=mistral |  ceph-defaults : resolve device link(s) --------------------------------- 23.28s
2018-06-01 17:43:12,521 p=9922 u=mistral |  ceph-docker-common : pulling registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest image -- 21.02s
2018-06-01 17:43:12,521 p=9922 u=mistral |  ceph-docker-common : pulling registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest image -- 20.01s
2018-06-01 17:43:12,521 p=9922 u=mistral |  ceph-mon : wait for monitor socket to exist ---------------------------- 15.41s
2018-06-01 17:43:12,521 p=9922 u=mistral |  ceph-mon : create ceph mgr keyring(s) when mon is containerized -------- 13.52s
2018-06-01 17:43:12,521 p=9922 u=mistral |  gather and delegate facts ---------------------------------------------- 13.28s
2018-06-01 17:43:12,521 p=9922 u=mistral |  ceph-docker-common : pulling registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest image -- 12.97s
2018-06-01 17:43:12,522 p=9922 u=mistral |  ceph-osd : read information about the devices --------------------------- 8.25s
2018-06-01 17:43:12,522 p=9922 u=mistral |  ceph-osd : create gpt disk label ---------------------------------------- 7.56s
2018-06-01 17:43:12,522 p=9922 u=mistral |  ceph-osd : check the partition status of the osd disks ------------------ 6.74s
2018-06-01 17:43:12,522 p=9922 u=mistral |  ceph-docker-common : get ceph version ----------------------------------- 6.72s
2018-06-01 17:43:12,522 p=9922 u=mistral |  ceph-osd : create openstack pool(s) ------------------------------------- 5.86s
2018-06-01 17:43:12,522 p=9922 u=mistral |  ceph-docker-common : pulling registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest image --- 4.07s
2018-06-01 17:43:12,522 p=9922 u=mistral |  ceph-docker-common : get ceph version ----------------------------------- 3.58s
2018-06-01 17:43:12,522 p=9922 u=mistral |  ceph-docker-common : get ceph version ----------------------------------- 3.55s
2018-06-01 17:43:12,522 p=9922 u=mistral |  ceph-mon : ipv4 - force peer addition as potential bootstrap peer for cluster bringup - monitor_address_block --- 3.14s
                                                                                                       

[3]
[root@overcloud-controller-0 ~]# ceph osd dump | grep pool  
pool 1 'images' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 124 flags hashpspool stripe_width 0
pool 2 'metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 125 flags hashpspool stripe_width 0 
pool 3 'backups' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 126 flags hashpspool stripe_width 0 
pool 4 'vms' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 127 flags hashpspool stripe_width 0     
pool 5 'volumes' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 128 flags hashpspool stripe_width 0 
[root@overcloud-controller-0 ~]#

Comment 22 Ben England 2018-07-19 17:04:03 UTC
Resetting status to ASSIGNED, because the fix failed to work in a real cluster. 
 
The problem is that the task here:

https://github.com/ceph/ceph-ansible/blob/9d5265fe11fb5c1d0058525e8508aba80a396a6b/roles/ceph-osd/tasks/openstack_config.yml#L2

can still run before all of the OSDs are actually up.

For example, Alex Krzos just tried this in the Alderaan cluster, with 20 OSDs (not a large cluster), and it failed with this error:

https://gist.githubusercontent.com/akrzos/809f744fbc95110b0b89d7fae30082c0/raw/c5a67eb6686ad466f75f2d2045e637a73186c080/gistfile1.txt

which shows that only 10 of the 20 OSDs were up when the pool create was tried.

This is compounded by the fact that mon_max_pg_per_osd is set to 200, which is so low that almost any large storage pool will push the limit. Suppose most of my data is in the Ceph "vms" storage pool used for OpenStack Nova guests, so I want to set the PG count high enough to spread this pool evenly across the available OSDs. If I have 20 OSDs, and I assume that the Glance "images" pool uses 5% of the space and the Nova "vms" pool uses 95% of the space, the Ceph PG calculator specifies that I should use 1024 PGs for the "vms" pool and 64 PGs for the "images" pool. The soft limit on total PG count is mon_max_pg_per_osd * 20 OSDs = 200 * 20 = 4000. The PGs used by PG calc's recommendation are:

 PGs  pool-name
---------------
  16  backup
  16  volumes
  64  images
1024  vms
---------------
1120  total

1120 total x replication count 3 = 3360, so this fits. But unless almost all my OSDs are up, it won't fit within mon_max_pg_per_osd * osds_up and the deployment will fail; specifically, with only 10 OSDs up it definitely will fail.
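As a cross-check of the numbers above, a small sketch (this assumes the pgcalc-style heuristic of roughly 100 PGs per OSD, weighted by each pool's share of the data and rounded up to a power of two; it is an approximation, not the calculator itself):

def rough_pg_num(num_osds, data_fraction, size, target_pgs_per_osd=100):
    # Approximation of the pgcalc heuristic described above.
    raw = num_osds * target_pgs_per_osd * data_fraction / size
    pg = 1
    while pg < raw:
        pg *= 2
    return pg

print(rough_pg_num(20, 0.95, 3))   # 1024 for the "vms" pool
print(rough_pg_num(20, 0.05, 3))   # 64 for the "images" pool

# Total PG budget check from the table above:
total_pg_replicas = (16 + 16 + 64 + 1024) * 3      # 3360
print(total_pg_replicas <= 200 * 20)               # True  with all 20 OSDs up
print(total_pg_replicas <= 200 * 10)               # False with only 10 OSDs up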

We could increase the default for mon_max_pg_per_osd, but when I discussed this with the Ceph developers in the performance weekly, they were reluctant to do so; they claim that you soon won't need as many PGs once the ceph-mgr rebalancing module is deployed. I think that's true, but when will ceph-mgr be deployed by OpenStack and have the rebalancing feature in it?

Comment 23 Guillaume Abrioux 2018-07-20 09:08:54 UTC
Hi Ben,

could you provide the full playbook log related to the error you've linked?

Comment 24 Ben England 2018-07-20 11:44:52 UTC
Created attachment 1464928 [details]
/var/log/mistral/ceph-install-workflow.log

Watch what happens in the "create openstack pool(s)" task for the "vms" pool.

Comment 25 Ben England 2018-07-20 12:08:52 UTC
Also see related bz 1603615. PGcalc doesn't always give bad PG counts; I just found one counter-example that shows it's not always consistent with the new mon_max_pg_per_osd parameter. This just means that we really need almost all the OSDs to be online before we try to create the pool.

Since all that we're waiting for is the OSD daemon startup, we would think it shouldn't take all that long, but it appears that the "wait for all osds" task completed immediately, which is too fast. Your code looks OK; I don't understand why it failed.

(undercloud) [stack@b04-h01-1029p ~]$ ssh heat-admin.10.15 sudo ceph -s -f json | python -c 'import sys,json; print(json.load(sys.stdin)["osdmap"]["osdmap"]["num_up_osds"])'
20
(undercloud) [stack@b04-h01-1029p ~]$ ssh heat-admin.10.15 sudo ceph -s -f json | python -c 'import sys,json; print(json.load(sys.stdin)["osdmap"]["osdmap"]["num_osds"])'
20

But if anything goes wrong with the two ceph -s commands that you issue, your equality test may give a false positive (i.e. the same error on both sides will result in a True equality test). Just my opinion, but having so many layers of quoting in an Ansible command makes me nervous; I like to break things down into separate steps a bit more so that you can see what is happening while the command is being evaluated.
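For what it's worth, a rough sketch of that single-call approach (this is not the actual ceph-ansible task; it assumes the ceph CLI and an admin keyring are available locally and reuses the JSON path from the one-liners above):

import json
import subprocess
import time

def wait_for_all_osds_up(timeout=300, interval=10):
    # Poll one `ceph -s -f json` call per iteration and compare num_osds with
    # num_up_osds from the same snapshot, so a failure of the command cannot
    # make the two sides "equal by accident".
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = json.loads(subprocess.check_output(["ceph", "-s", "-f", "json"]))
        osdmap = status["osdmap"]["osdmap"]
        if osdmap["num_osds"] > 0 and osdmap["num_up_osds"] == osdmap["num_osds"]:
            return
        time.sleep(interval)
    raise RuntimeError("timed out waiting for all OSDs to be up")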

Comment 28 Sébastien Han 2018-08-06 14:05:06 UTC
I think Guillaume has done all the testing that was required and demonstrated this wasn't reproducible. We cannot go further with this BZ unless we get an environment and a procedure that let us reproduce the bug every single time.
Thanks.

Comment 29 Ben England 2018-08-09 21:17:39 UTC
John Fulton just deployed RHOSP 13 with Ceph 3 in the scale lab with > 1000 OSDs; he had some problems, but this wasn't one of them. cc'ing John. I don't have a different OpenStack environment to try it on, so I'm cc'ing openstack-perf-scale to see if anyone else is seeing this. If people are, we'll re-open it, I guess, but for now I can't prove it's still happening. Thanks for your effort here.

Comment 30 Ben England 2018-08-09 21:47:20 UTC
On second thought, I just re-examined the attached log, and the chronology was:

2018-07-18 22:48:26,260 p=89400 u=mistral |  TASK [ceph-osd : systemd start osd container] **********************************

2018-07-18 22:48:28,161 p=89400 u=mistral |  TASK [ceph-osd : wait for all osd to be up] ***************************

2018-07-18 22:48:31,403 p=89400 u=mistral |  TASK [ceph-osd : create openstack pool(s)] *************************************

Given that only 5 seconds elapsed, could it be the case that all the OSDs had started but they weren't all fully registered with all of the mons? I'm not claiming that this is a bug in your code, just that there may be a potential race condition here. Can we just add an additional sleep of 3 seconds to give the mons time to register the presence of the OSDs? There are 3 mons; maybe one of them briefly had a different OSD count than the others and that was enough to throw this off?

Comment 34 John Fulton 2018-11-19 15:45:23 UTC
*** Bug 1647467 has been marked as a duplicate of this bug. ***

Comment 37 Ben England 2019-01-15 22:50:57 UTC
Has anyone reproduced it with RHCS 3.2, which was just released?

RHCS 4.0 is in development now; it is based on Ceph Nautilus, which will have a key new feature for addressing this problem: the ability to reduce the PG count of an active Ceph storage pool. Ceph already had the ability to increase the PG count of a storage pool, so with this new feature Ceph should be able to change PG counts on any active cluster, and even automate the process. The user then should not be stuck with needing to get the PG count right on the first try in the ceph-ansible YAML. If an automated ceph-mgr module for auto-adjusting PG counts is successfully developed, the user might not even have to specify it at all.

Comment 43 Ben England 2019-05-24 12:44:34 UTC
Sounds like we can close it then as fixed in RHCS 3.2.

