Description of problem:
The hieradata ceph.yaml defaults for pg_num and pgp_num are 128.
Recommended placement groups:

                     (OSDs * 100)
    Total PGs = -----------------------------   rounded up to the nearest power of 2
                (replicas * number of pools)
With 40 OSDs, 3 pools and a replica count of 3, that is (40 * 100) / (3 * 3) = ~444, which rounds up to the nearest power of 2, giving a pg_num of 512.
Changing the default in ceph.yaml to 512 and deploying from templates creates the stack successfully, but no OSDs are created and Ceph is unusable.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Deploy the undercloud.
2. Modify the ceph.yaml defaults to the recommended number of PGs (see the example hieradata below).
3. Deploy the overcloud with Ceph storage servers via templates.
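For step 2 above, the change is roughly the following in the ceph.yaml hieradata (an illustration only; the key names are the puppet-ceph profile parameters and the exact file layout may differ between Director versions):

    ceph::profile::params::osd_pool_default_pg_num: 512
    ceph::profile::params::osd_pool_default_pgp_num: 512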
Actual results:
The heat stack creates successfully but no OSDs are created. The Ceph service starts and runs but is unusable.

Expected results:
Ceph OSDs are created with the correct pg_num, OR the heat stack create reports a failure because no OSDs are created.
This is related to https://bugzilla.redhat.com/show_bug.cgi?id=1252158 but a separate case.
What I am seeing is a little different: the two hiera config parameters seem to work as expected; the deployment succeeded for me and I got the correct values set in ceph.conf.
Yet the pools created by OSPd do not use those defaults and are forced to a pg_num of 64. Maybe we could update the BZ subject to reflect this status (and attach a fix for it)?
I just ran into this today; I'm seeing the same behaviour as Giulio.
My OSDs are created properly and /etc/ceph/ceph.conf has the right values for pg_num and pgp_num (as provided in the ceph.yaml template), however the 4 pools that are created in the overcloud (rbd, images, volumes, vms) all have 64 PGs. It's almost as if the pools were created *before* ceph.conf was updated, or as if their values were hardcoded to 64 upon creation.
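The per-pool values on the running cluster can be checked with, for example (the pool name here is just one of the defaults mentioned above):

    ceph osd pool get volumes pg_num
    ceph osd dump | grep pg_num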
Looks like there is a default value of 64 for pg_num in the puppet module for newly created pools, which makes it ignore the default we put in the config file; this needs fixing.
Apparently we can't omit passing a pg_num when creating a new pool because the 'ceph osd pool create' command requires it as well, instead of using the default value from ceph.conf
Maybe we can make the default value in pool.pp come from ceph::profile::params though?
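(For reference, the CLI wants the PG count as a positional argument, e.g. something like the following, with the pool name and counts chosen arbitrarily:

    ceph osd pool create volumes 512 512
)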
Is there any workaround for the issue? Can we try editing the "$pg_num" value in /usr/share/openstack-puppet/modules/ceph/manifests/pool.pp?
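Or, for a cluster that is already deployed, could we simply raise the values on the existing pools directly, something like the following (keeping in mind that pg_num can only be increased, never decreased)?

    ceph osd pool set volumes pg_num 512
    ceph osd pool set volumes pgp_num 512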
TBH I don't think changing the default value is a proper solution (while it may fix the problem for now). Next time you need to change pg_num you will open the same bug again, really :)? Let's fix it properly once and for all. I believe the reuse of values from ceph::profile::params should be moved into the ceph::pool definition itself. Also, I wonder where the ceph_pools resource comes from? There's none in puppet-ceph and none in THT. So I believe this is a bug in the THT manifests.
Hi Martin, thanks for helping.
I can move the hiera calls for the ceph::profile::params into the pool declaration, but is that the best approach? Shouldn't the puppet-ceph module be using the default instead of hardcoding pg_num to 64 on new pools?
Regarding $ceph_pools, it comes from Heat as hieradata.
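To be clear, I mean roughly something like the following (only a sketch using the pg_num/pgp_num/size parameters of ceph::pool, not the actual patch):

    # sketch: declare the pools from hieradata, passing the profile defaults explicitly
    $ceph_pools = hiera('ceph_pools')
    ceph::pool { $ceph_pools:
      pg_num  => hiera('ceph::profile::params::osd_pool_default_pg_num'),
      pgp_num => hiera('ceph::profile::params::osd_pool_default_pgp_num'),
      size    => hiera('ceph::profile::params::osd_pool_default_size'),
    }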
I have proposed https://bugzilla.redhat.com/show_bug.cgi?id=1283721, which should fix this problem. Hopefully it's very easy.
*** Bug 1283721 has been marked as a duplicate of this bug. ***
Hi Felipe, thanks for the update! We posted a change to fix this bug, https://review.openstack.org/#/c/242456/, and it will be backported to future versions of OSP Director.
I see you are working on a change which will make this even more flexible by allowing a different pg_num, pgp_num and size for each Ceph pool, which I think is great to have. I will update bz #1283721 to track your change there.
There is a simple algorithm from Inktank in Mojo for calculating sensible PG settings. Should we add some intelligence to the director to set those numbers automatically? It's a function of pool count, OSD count, etc. I can file a separate RFE once this bug is resolved.
That's very reasonable, Jacob. Can you please file a separate RFE for it?
Filed RFE for automating sensible pg settings.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.