Created attachment 1436506 [details] ceph-ansible log

Description of problem:
OSP13 deployment fails when CephPoolDefaultPgNum is set greater than 32 (e.g. 64). The relevant error is:

"pg_num 64 size 3 would mean 768 total pgs, which exceeds max 600 (mon_max_pg_per_osd 200 * num_in_osds 3)", "stderr_lines": ["Error ERANGE: pg_num 64 size 3 would mean 768 total pgs, which exceeds max 600 (mon_max_pg_per_osd 200 * num_in_osds 3)"], "stdout": "", "stdout_lines": []}

Note that I'm deploying with 90 OSDs, replication count 3, and 5 OpenStack pools. It appears the installer is computing the max PG limit before all 90 OSDs are counted (only 3 in OSDs, giving a limit of 200 * 3 = 600).

Version-Release number of selected component (if applicable):
ceph-ansible-3.1.0-0.1.beta4.el7cp.noarch
OSP 2018-03-29.1 puddle
docker image: rhceph:3-6

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP13 with ceph-storage and pg_num greater than 32.
2. Configure enough OSDs that the deployment should succeed given the number of pools and a maximum of 200 PGs/OSD.

Actual results:
Deployment failed with "pg_num 64 size 3 would mean 768 total pgs, which exceeds max 600 (mon_max_pg_per_osd 200 * num_in_osds 3)".

Expected results:
Successful OSP13 overcloud deployment with ceph-storage.

Additional info:
Created attachment 1436507 [details] env file
Isn't this one a dup, John?
The reported error is valid. When deploying a cluster, it is encouraged (and soon will be mandatory) to set a correct PG count for each pool; using https://ceph.com/pgcalc/ as a reference is helpful. There is no good default calculation for this. Since the person deploying the cluster knows how the machines/disks will be configured, the total number of OSDs is known, which helps avoid this error in the future (a rough sketch of the rule of thumb behind pgcalc is below). I'm closing this; feel free to re-open if you have any concerns.
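For context, the rule of thumb that pgcalc implements is roughly "target about 100 PGs per OSD, weighted by each pool's expected share of the data, rounded up to a power of two". The sketch below is only an approximation of that rule (the real calculator applies extra rounding rules and minimums), and the function name is illustrative, not part of any Ceph tool:

# Rough approximation of the pgcalc rule of thumb; illustration only.
import math

def suggested_pg_num(num_osds, data_percent, size=3, target_pgs_per_osd=100):
    raw = (target_pgs_per_osd * num_osds * data_percent) / (100.0 * size)
    # Plain round-up to the next power of two (pgcalc's rounding is subtler).
    return 2 ** max(0, math.ceil(math.log2(raw))) if raw >= 1 else 1

# Example: 20 OSDs, size 3, one pool holding 95% of the data and one holding 5%.
print(suggested_pg_num(20, 95))   # 1024 -- matches the "vms" figure quoted later in this bug
print(suggested_pg_num(20, 5))    # 64   -- matches the "images" figure

This is only the sizing side, though; it knows nothing about how many OSDs are actually up at pool-creation time, which turns out to be the crux of this bug.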
The bug is that ceph-ansible tries to create the pools _before_ all the OSDs are active. This makes the overdose protection check [1] fail: in this scenario, where the requested OSDs haven't yet been activated, num_osds returns 1, so a PG count like 256, which is reasonable on a system with 50 to 100 OSDs, is rejected (a rough sketch of that check follows this comment). If you lower the PG count to 32, as has been done in CI jobs for virtual envs, you don't hit the error because 5 pools * 3 size * 32 pg_num = 480 stays under the 600 total allowable PGs reported by the check.

A fix is to create the pools _after_ all of the OSDs are active. To prove that this will work I did the following:

A. Comment out pool creation in ceph-ansible [2].
B. Run a fresh ceph-ansible deployment and see it pass successfully without any pools getting created.
C. Uncomment pool creation in ceph-ansible [2].
D. Re-run ceph-ansible and see it pass successfully with the pools getting created [3].

On the second run of ceph-ansible (step D) the pools were created because the OSDs were active and num_osds returned a value large enough for the check [1] to pass. Also, I did my test without MDS, as it would have the same problem [4].

Do you think you could modify the pool creation task to run after the OSDs are active?

[1] https://github.com/ceph/ceph/blob/e59258943bcfe3e52d40a59ff30df55e1e6a3865/src/mon/OSDMonitor.cc#L5673
[2] https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-mon/tasks/openstack_config.yml#L10-L35
[3]
[root@overcloud-controller-0 ~]# ceph osd dump | grep pool
pool 1 'manila_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 2 flags hashpspool stripe_width 0
pool 2 'images' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 144 flags hashpspool stripe_width 0
pool 3 'metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 145 flags hashpspool stripe_width 0
pool 4 'backups' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 146 flags hashpspool stripe_width 0
pool 5 'vms' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 147 flags hashpspool stripe_width 0
pool 6 'volumes' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 148 flags hashpspool stripe_width 0
[root@overcloud-controller-0 ~]#
[4] https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-mon/tasks/create_mds_filesystems.yml#L2-L6
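To make the arithmetic concrete, here is a rough Python rendering of what the overdose protection check in [1] amounts to. The function name, the <= comparison, and the existing_pgs bookkeeping are illustrative assumptions, not the actual C++ in OSDMonitor.cc:

# Approximate Python rendering of the mon "overdose protection" check [1].
# Illustrative only: names and exact accounting are assumptions.
def pool_create_allowed(pg_num, size, existing_pgs, num_in_osds,
                        mon_max_pg_per_osd=200):
    projected_total = existing_pgs + pg_num * size
    limit = mon_max_pg_per_osd * max(num_in_osds, 1)
    return projected_total <= limit

# A pg_num of 256 at size 3 is rejected while only 1 OSD is registered,
# but passes easily once 50+ OSDs are up:
print(pool_create_allowed(256, 3, existing_pgs=0, num_in_osds=1))    # False
print(pool_create_allowed(256, 3, existing_pgs=0, num_in_osds=50))   # True

# The numbers in the reported error (768 total vs. max 600) correspond to
# pg_num 64, size 3, with 576 PGs already allocated and only 3 OSDs in:
print(pool_create_allowed(64, 3, existing_pgs=576, num_in_osds=3))   # False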
fixed in v3.1.0rc4
Created attachment 1446432 [details] output of ceph-ansible playbook showing pool creation failing
- Installed ceph-ansible rc6 on the undercloud running the OSP13 puddle [1]
- ceph-ansible deployment succeeded [2]
- Pools were created with the desired PG number of 256 [3]
- I didn't write the code, guits++ did
- I am marking this bug as verified

[1]
[root@microbrow-07 mistral]# rpm -q ceph-ansible
ceph-ansible-3.1.0-0.1.rc6.el7cp.noarch
[root@microbrow-07 mistral]#

[2]
2018-06-01 17:43:12,515 p=9922 u=mistral | PLAY RECAP *********************************************************************
2018-06-01 17:43:12,515 p=9922 u=mistral | 192.168.24.54 : ok=80 changed=13 unreachable=0 failed=0
2018-06-01 17:43:12,515 p=9922 u=mistral | 192.168.24.57 : ok=122 changed=20 unreachable=0 failed=0
2018-06-01 17:43:12,515 p=9922 u=mistral | 192.168.24.59 : ok=58 changed=6 unreachable=0 failed=0
2018-06-01 17:43:12,515 p=9922 u=mistral | 192.168.24.61 : ok=85 changed=17 unreachable=0 failed=0
2018-06-01 17:43:12,515 p=9922 u=mistral | 192.168.24.64 : ok=84 changed=13 unreachable=0 failed=0
2018-06-01 17:43:12,515 p=9922 u=mistral | INSTALLER STATUS ***************************************************************
2018-06-01 17:43:12,517 p=9922 u=mistral | Install Ceph Monitor : Complete (0:01:14)
2018-06-01 17:43:12,518 p=9922 u=mistral | Install Ceph Manager : Complete (0:00:18)
2018-06-01 17:43:12,518 p=9922 u=mistral | Install Ceph OSD : Complete (0:11:38)
2018-06-01 17:43:12,518 p=9922 u=mistral | Install Ceph Client : Complete (0:00:31)
2018-06-01 17:43:12,518 p=9922 u=mistral | Friday 01 June 2018 17:43:12 -0400 (0:00:00.043) 0:13:55.702 ***********
2018-06-01 17:43:12,518 p=9922 u=mistral | ===============================================================================
2018-06-01 17:43:12,521 p=9922 u=mistral | ceph-osd : prepare ceph "filestore" containerized osd disk(s) non-collocated - 449.53s
2018-06-01 17:43:12,521 p=9922 u=mistral | ceph-osd : wait for all osd to be up ----------------------------------- 75.31s
2018-06-01 17:43:12,521 p=9922 u=mistral | ceph-osd : systemd start osd container --------------------------------- 27.63s
2018-06-01 17:43:12,521 p=9922 u=mistral | ceph-osd : resolve dedicated device link(s) ---------------------------- 24.62s
2018-06-01 17:43:12,521 p=9922 u=mistral | ceph-defaults : resolve device link(s) --------------------------------- 23.28s
2018-06-01 17:43:12,521 p=9922 u=mistral | ceph-docker-common : pulling registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest image -- 21.02s
2018-06-01 17:43:12,521 p=9922 u=mistral | ceph-docker-common : pulling registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest image -- 20.01s
2018-06-01 17:43:12,521 p=9922 u=mistral | ceph-mon : wait for monitor socket to exist ---------------------------- 15.41s
2018-06-01 17:43:12,521 p=9922 u=mistral | ceph-mon : create ceph mgr keyring(s) when mon is containerized -------- 13.52s
2018-06-01 17:43:12,521 p=9922 u=mistral | gather and delegate facts ---------------------------------------------- 13.28s
2018-06-01 17:43:12,521 p=9922 u=mistral | ceph-docker-common : pulling registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest image -- 12.97s
2018-06-01 17:43:12,522 p=9922 u=mistral | ceph-osd : read information about the devices --------------------------- 8.25s
2018-06-01 17:43:12,522 p=9922 u=mistral | ceph-osd : create gpt disk label ---------------------------------------- 7.56s
2018-06-01 17:43:12,522 p=9922 u=mistral | ceph-osd : check the partition status of the osd disks ------------------ 6.74s
2018-06-01 17:43:12,522 p=9922 u=mistral | ceph-docker-common : get ceph version ----------------------------------- 6.72s
2018-06-01 17:43:12,522 p=9922 u=mistral | ceph-osd : create openstack pool(s) ------------------------------------- 5.86s
2018-06-01 17:43:12,522 p=9922 u=mistral | ceph-docker-common : pulling registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest image --- 4.07s
2018-06-01 17:43:12,522 p=9922 u=mistral | ceph-docker-common : get ceph version ----------------------------------- 3.58s
2018-06-01 17:43:12,522 p=9922 u=mistral | ceph-docker-common : get ceph version ----------------------------------- 3.55s
2018-06-01 17:43:12,522 p=9922 u=mistral | ceph-mon : ipv4 - force peer addition as potential bootstrap peer for cluster bringup - monitor_address_block --- 3.14s

[3]
[root@overcloud-controller-0 ~]# ceph osd dump | grep pool
pool 1 'images' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 124 flags hashpspool stripe_width 0
pool 2 'metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 125 flags hashpspool stripe_width 0
pool 3 'backups' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 126 flags hashpspool stripe_width 0
pool 4 'vms' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 127 flags hashpspool stripe_width 0
pool 5 'volumes' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 128 flags hashpspool stripe_width 0
[root@overcloud-controller-0 ~]#
Resetting status to ASSIGNED, because the fix failed to work in a real cluster. The problem is that the task here:

https://github.com/ceph/ceph-ansible/blob/9d5265fe11fb5c1d0058525e8508aba80a396a6b/roles/ceph-osd/tasks/openstack_config.yml#L2

can still run before all of the OSDs are actually up. For example, Alex Krzos just tried this in the Alderaan cluster, with 20 OSDs (not a large cluster), and it failed with this error:

https://gist.githubusercontent.com/akrzos/809f744fbc95110b0b89d7fae30082c0/raw/c5a67eb6686ad466f75f2d2045e637a73186c080/gistfile1.txt

which shows that only 10 of the 20 OSDs were up when the pool create was tried. This is compounded by the fact that mon_max_pg_per_osd is set to 200, so low that almost any large storage pool will push this limit.

Suppose most of my data is in the Ceph "vms" storage pool used for OpenStack Nova guests, so I want to set the PG count high enough to spread this pool evenly across the available OSDs. If I have 20 OSDs, and I assume that the Glance "images" pool uses 5% of the space and the Nova "vms" pool uses 95%, the Ceph PG calculator says I should use 1024 PGs for the "vms" pool and 64 PGs for the "images" pool. The soft limit on total PG count is mon_max_pg_per_osd * 20 OSDs = 200 * 20 = 4000. The PGs used by PG calc's recommendation are:

 PGs   pool-name
 ----  ---------
   16  backup
   16  volumes
   64  images
 1024  vms
 ----  ---------
 1120  total

x replication count 3 = 3360

So this fits. But unless almost all my OSDs are up, it won't fit within mon_max_pg_per_osd * osds_up, and the deployment will fail; specifically, with only 10 OSDs up it definitely will fail (a short worked check follows below).

We could increase the default for mon_max_pg_per_osd, but when I discussed this with Ceph developers in the performance weekly they were reluctant to do so; they claim that soon you won't need as many PGs once the ceph-mgr rebalancing module is deployed. I think that's true, but when will ceph-mgr be deployed by OpenStack and have the rebalancing feature in it?
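Just to spell out the arithmetic (the pool shares and pg_num values are the PG-calc figures quoted above; nothing here comes from ceph-ansible itself):

# PG budget for the example above: 20 OSDs, mon_max_pg_per_osd = 200, size 3,
# and the pg_num values suggested by pgcalc for each pool.
mon_max_pg_per_osd = 200
size = 3
pools = {"backup": 16, "volumes": 16, "images": 64, "vms": 1024}

total_pgs = sum(pools.values()) * size            # 1120 * 3 = 3360

for osds_up in (20, 10):
    limit = mon_max_pg_per_osd * osds_up          # 4000 with 20 up, 2000 with 10 up
    status = "fits" if total_pgs <= limit else "exceeds the limit"
    print(f"{osds_up} OSDs up: {total_pgs} PGs vs limit {limit} -> {status}")
# -> fits with all 20 OSDs up, but exceeds the limit when only 10 are up,
#    which is exactly when the "create openstack pool(s)" task ran here.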
Hi Ben, could you provide the full playbook log related to the error you've linked?
Created attachment 1464928 [details] /var/log/mistral/ceph-install-workflow.log

Watch what happens in the "create openstack pool(s)" task for the "vms" pool.
Also see related bz 1603615. PGcalc doesn't always give bad PG counts; I just found one counter-example that shows it isn't always consistent with the new mon_max_pg_per_osd parameter. This just means that we really need almost all the OSDs to be online before we try to create the pool. Since all that we're waiting for is OSD daemon startup, you would think it shouldn't take all that long, but it appears that the "wait for all osds" task completed immediately, which is too fast. Your code looks OK; I don't understand why it failed.

(undercloud) [stack@b04-h01-1029p ~]$ ssh heat-admin.10.15 sudo ceph -s -f json | python -c 'import sys,json; print(json.load(sys.stdin)["osdmap"]["osdmap"]["num_up_osds"])'
20
(undercloud) [stack@b04-h01-1029p ~]$ ssh heat-admin.10.15 sudo ceph -s -f json | python -c 'import sys,json; print(json.load(sys.stdin)["osdmap"]["osdmap"]["num_osds"])'
20

But if anything goes wrong with the two ceph -s commands that you issue, your equality test may give a false positive (i.e. the same error on both will result in a True equality test). Just my opinion, but having so many layers of quotation in an Ansible command makes me nervous; I like to break things down into separate steps so you can see what is happening while the command is being evaluated (a rough sketch of what I mean is below).
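Something along these lines, as a non-authoritative sketch rather than a drop-in replacement for the ceph-ansible task; it assumes a node with a working ceph CLI and admin keyring, and the JSON fields are the same ones used in the commands above:

# Do the "all OSDs up?" check in separate, explicit steps so that a failure of
# the ceph command can never make the equality test pass by accident.
import json
import subprocess

def osd_counts():
    # check=True raises if `ceph -s` fails, instead of silently comparing
    # one error string against another.
    result = subprocess.run(["ceph", "-s", "-f", "json"],
                            check=True, capture_output=True, text=True)
    osdmap = json.loads(result.stdout)["osdmap"]["osdmap"]
    return osdmap["num_up_osds"], osdmap["num_osds"]

up, total = osd_counts()
if total == 0 or up != total:
    raise SystemExit(f"only {up}/{total} OSDs are up; not safe to create pools yet")
print(f"all {total} OSDs are up")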
I think Guillaume has done all the testing that was required and demonstrated this wasn't reproducible. We cannot go further with this BZ unless we get an environment and a procedure that let us reproduce the bug every single time. Thanks.
John Fulton just deployed RHOSP 13 with Ceph 3 in the scale lab with > 1000 OSDs; he had some problems, but this wasn't one of them. cc'ing John. I don't have a different OpenStack environment to try it on. cc'ing openstack-perf-scale to see if anyone else is seeing this. If people are, we'll re-open it I guess, but for now I can't prove it's still happening. Thanks for your effort here.
On second thoughts, I just re-examined the attached log, and the chronology was:

2018-07-18 22:48:26,260 p=89400 u=mistral | TASK [ceph-osd : systemd start osd container] **********************************
2018-07-18 22:48:28,161 p=89400 u=mistral | TASK [ceph-osd : wait for all osd to be up] ***************************
2018-07-18 22:48:31,403 p=89400 u=mistral | TASK [ceph-osd : create openstack pool(s)] *************************************

Given that only 5 seconds elapsed, could it be the case that all the OSDs had started but they weren't all fully registered with all the mons? I'm not claiming that this is a bug in your code, just that there may be a potential race condition here. Can we just add an additional sleep of 3 seconds to give the MONs time to register the presence of the OSDs? There are 3 mons; maybe one of them briefly had a different OSD count than the others, and that was enough to throw this off? (A sketch of what I have in mind follows below.)
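Concretely, rather than a fixed sleep, the wait could require the "all up" answer to hold for a few consecutive polls before pools are created. A rough sketch under the same assumptions as the snippet above (working ceph CLI, same JSON fields); this only illustrates the idea and is not proposed ceph-ansible code:

# Poll `ceph -s` until every OSD is up and that answer has been stable for a
# few consecutive polls, giving the mons time to agree before pools are created.
import json
import subprocess
import time

def up_and_total():
    out = subprocess.check_output(["ceph", "-s", "-f", "json"], text=True)
    osdmap = json.loads(out)["osdmap"]["osdmap"]
    return osdmap["num_up_osds"], osdmap["num_osds"]

stable_polls = 0
for _ in range(60):                      # up to ~3 minutes at 3 s per poll
    up, total = up_and_total()
    if total > 0 and up == total:
        stable_polls += 1
    else:
        stable_polls = 0
    if stable_polls >= 3:                # "all up" held for three polls in a row
        print(f"all {total} OSDs up and stable; OK to create pools")
        break
    time.sleep(3)
else:
    raise SystemExit("OSDs did not all come up in time; aborting pool creation")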
*** Bug 1647467 has been marked as a duplicate of this bug. ***
Has anyone reproduced it with RHCS 3.2, which was just released? RHCS 4.0 is in development now; it is based on Ceph Nautilus, which will have a key new feature for addressing this problem: the ability to reduce the PG count of an active Ceph storage pool. Ceph already had the ability to increase the PG count of a storage pool, so with this new feature Ceph should be able to change PG counts on any active cluster, and even automate the process. The user then would not be stuck needing to get the PG count right on the first try in the ceph-ansible YAML. If an automated ceph-mgr module for auto-adjusting PG counts is successfully developed, the user might not even have to specify a PG count at all.
Sounds like we can close it then as fixed in RHCS 3.2.