Description of problem:

Ceph placement groups aggregate data within pools. Configuring Ceph with a sensible number of placement groups impacts data durability and performance. OSP director allows for Ceph customization: adding pools, setting pg_num and pgp_num, setting the replication factor, and adding OSDs. OSP director should automatically make sensible suggestions for pg_num/pgp_num based on the number of pools, OSDs, and the replication factor. Heuristics can be captured from the pgcalc tool and documentation. The parameters should remain overridable by customers.

Additional info:

1) Existing bug: https://bugzilla.redhat.com/show_bug.cgi?id=1252546 - Ceph pg_num and pgp_num are correctly set in ceph.yaml but the pools always use 64
2) PG documentation and recommendations: http://docs.ceph.com/docs/master/rados/operations/placement-groups/
3) PG calc tool: http://ceph.com/pgcalc/
*** Bug 1290130 has been marked as a duplicate of this bug. ***
In the short term, the ceph.yaml should contain documentation/comments to help an operator determine the appropriate pg_num and pgp_num values, and this should also be documented in the Director Installation and Configuration guide. cc'ing dmacpher.
Here's some text to include in the ceph.yaml file (lifted from the ceph docs: http://docs.ceph.com/docs/hammer/rados/configuration/pool-pg-config-ref/):

# Ensure you have a realistic number of placement groups. We recommend
# approximately 100 per OSD. E.g., total number of OSDs multiplied by 100
# divided by the number of replicas (i.e., osd pool default size). So for
# 10 OSDs and osd pool default size = 4, we'd recommend approximately
# (100 * 10) / 4 = 250.
# pgp_num is the default number of placement groups for placement for a pool.
# The default value is the same as pg_num with mkpool. PG and PGP should be
# equal (for now).
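For clarity, the rule of thumb quoted above can be sketched as a small calculation. This is illustrative only, not Director code; the input numbers are the example values from the ceph docs (10 OSDs, osd pool default size = 4, ~100 PGs per OSD), and the round-up-to-a-power-of-two step is the convention pgcalc applies on top of the raw figure:

```shell
# Assumed example inputs; substitute your own cluster's values.
num_osds=10
pool_size=4            # osd pool default size (replica count)
target_pgs_per_osd=100

# Raw recommendation from the ceph docs: (100 * 10) / 4 = 250
raw=$(( num_osds * target_pgs_per_osd / pool_size ))

# pgcalc-style rounding: take the next power of two (250 -> 256)
pg_num=1
while [ "$pg_num" -lt "$raw" ]; do
    pg_num=$(( pg_num * 2 ))
done
echo "suggested pg_num: $pg_num"
```

So the docs' "approximately 250" becomes pg_num = pgp_num = 256 in practice.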
Dan,

The problem is that despite the new values for the defaults specified in ceph.yaml, they don't appear to be picked up and the default remains at 64.

- Is there a workaround for this?

- And if not, how should these values be updated after deployment?

Thank you
(In reply to Ruchika K from comment #7) > Dan, > > The problem is despite the ned values of the default specified in ceph.yaml > they don't appear to be picked up and the default remains at 64. > > - Is there a workaround for this ? > > - And if not, after deployment how should these values be updated ? > > Thank you Which Dan are you talking to? ;) I don't think there's a work-around for this; I think it's still broken. My guess is that the heat environment parameter isn't getting properly passed to the puppet manifest - I'm about to re-deploy OS1 P-Prime again, today, and I'll let you know if this has been fixed... Follow the ceph docs to increase the pg_num after creation: http://docs.ceph.com/docs/master/rados/operations/placement-groups/#set-the-number-of-placement-groups
Dan Yocum,

You are the right Dan ;-)

It isn't fixed on 7.3. For now we're updating the pg/pgp_num values manually from a ceph monitor node (also the overcloud controller node). It is a manual step for someone to remember. Besides, for updates issued by the undercloud node it is not clear whether these values would revert. Is it possible to get some insight on whether any update would change these values back?

Thank you
As I recall, an update will not revert the values, since they are initially set at pool creation.
Ruchika,

Here's a possible workaround to consider to at least have Director automate what the cloud admin would otherwise have to remember to do manually.

1. Have Heat trigger a post-deployment shell script [1]
2. Have the shell script update the pools that were created [2]
3. Have the shell script, or some custom puppet [3], modify ceph.conf to set the desired PG values for future pool creation

Then once this bug is fixed, take out the post configs described above.

John

[1] https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux_OpenStack_Platform/7/html/Director_Installation_and_Usage/sect-Configuring_after_Overcloud_Creation.html

[2]
# add a conditional to only execute this block if the change is necessary
pg_num=256   # or pull it from hiera
pgp_num=256
for i in rbd images volumes vms; do
    ceph osd pool set $i pg_num $pg_num
    sleep 10
    ceph osd pool set $i pgp_num $pgp_num
    sleep 10
done

[3] https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux_OpenStack_Platform/7/html/Director_Installation_and_Usage/sect-Modifying_Puppet_Configuration_Data.html
(In reply to John Fulton from comment #11)
> Ruchika,
>
> Here's a possible workaround to consider to at least have Director automate
> what the cloud admin would have to remember to do manually.
>
> 1. Have Heat trigger a post deployment shell script [1]
> 2. Have the shell script update the pools that were created [2]
> 3. Have the shell script, or some custom puppet [3], modify ceph.conf to set
> the desired PG values for future pool creation
>
> Then once this bug is fixed, take out the post configs described above.
>
> John
>
> [1] https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux_OpenStack_Platform/7/html/Director_Installation_and_Usage/sect-Configuring_after_Overcloud_Creation.html
>
> [2]
> # add a conditional to only execute this block if the change is necessary
> $pg_num=256; # or pull it from hiera
> $pgp_num=256;
> for i in rbd images volumes vms; do

It may be better to update only the pools specified by users, rather than all pools, since there are users who have customized CRUSH maps. This process is also doable by running a shell, python, or similar script triggered by hiera / puppet. What do you think?

> ceph osd pool set $i pg_num $pg_num; sleep 10
> ceph osd pool set $i pgp_num $pgp_num; sleep 10
> done
>
> [3] https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux_OpenStack_Platform/7/html/Director_Installation_and_Usage/sect-Modifying_Puppet_Configuration_Data.html
This bug did not make the OSP 8.0 release. It is being deferred to OSP 10.
I wanted to confirm with Federico that this feature comes with the OSPd Ceph integration, which will introduce a business logic layer in OSP11 (tech preview) and be productised in OSP12.
I'm not a big fan of the software taking decisions; advising is fine, but if we have to do this it means something went wrong during the design process. Prior to deploying any production Ceph cluster, the Ceph engineering team must be contacted for recommendations. Thus Ceph engineering will provide the cluster design, with the proper configuration items. I agree this doesn't particularly scale, which is why we have to come up with guidelines for "standard" clusters; using pgcalc is a good start. Then for non-standard cases (not addressed in our documentation and guidelines) this will require the help of a Ceph engineer.

To conclude, I'm expecting to see more guidance and docs in the product instead of building complex logic inside ospd to determine the right values for PGs.
Seb,

I disagree. I think OSP could use some heuristics that would avoid the worst-case behaviors and advise people when there may be a problem. For example, I'd suggest increasing PG count on well-known OpenStack pools to achieve a ratio of PGs/OSD like this:

images   1  (used by Glance)
vms      8  (used by Nova)
volumes  8  (used by Cinder)
backups  2  (?)

For example, on a cluster with 780 OSDs, this doesn't give you perfect performance but avoids the worst-case situations where a storage pool has only 32 PGs, which means it effectively can never use more than 3 x 32 = 96 OSDs, less than 1/10 of the block devices in the system.

One feature that I haven't found is the ability to monitor Ceph I/O by pool - this really is what's needed to determine how to assign PG counts to even out I/O across block devices. But a partial replacement for that is "ceph df", which shows you which pools have the most data in them - those are the ones that have the most need of PGs to even out space utilization across OSDs.

Another feature that's missing is to tell people which OSDs are overflowing and what to do about it. I use this, but it should be a feature of Ceph:

ansible -m shell -a 'df | grep /var/lib/ceph/osd' osds \
  | tr '%' ' ' | sort -nk5 > /tmp/du

In practice I've run into several situations where an OSD filled up while there were many other OSDs that were only 50-60% utilized. This is a problem with Ceph, and increasing PGs is the primary way to address it - a secondary way is to reweight OSDs using "ceph osd reweight-by-utilization", which worked fairly well. To see what this would do without doing it, run "ceph osd test-reweight-by-utilization". But I would suggest adjusting PG counts first. It's important to get this close to right before all of your data lands on the block devices.
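To make the proposed ratios concrete, here is a sketch (not Director code) of the arithmetic: pg_num for a pool is roughly (PGs-per-OSD ratio * number of OSDs) / replica count, rounded to the nearest power of two. The 780 OSDs and 3 replicas are the example figures from this comment; the ratios are the ones proposed above:

```shell
# Example cluster figures from the comment above.
num_osds=780
replicas=3

# pg_num ~= ratio * OSDs / replicas, rounded to the nearest power of two.
suggest_pg_num() {
    raw=$(( $1 * num_osds / replicas ))
    pg=1
    while [ "$pg" -lt "$raw" ]; do pg=$(( pg * 2 )); done
    # prefer the lower power of two when it is closer to the raw value
    if [ $(( pg - raw )) -gt $(( raw - pg / 2 )) ]; then pg=$(( pg / 2 )); fi
    echo "$pg"
}

for entry in images:1 vms:8 volumes:8 backups:2; do
    echo "${entry%%:*} pg_num=$(suggest_pg_num "${entry##*:}")"
done
```

With these inputs the vms and volumes pools land at 2048 PGs each (raw value 2080), comfortably above the worst-case 32-PG pools described above.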
If the consensus here is that this is a documentation issue and not a software issue, then we ought to be telling the user how (and why) to calculate PG counts in the documentation. I don't see that here:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html-single/red_hat_ceph_storage_for_the_overcloud/#Assigning-Custom-Attributes-to-Different-Ceph-Pools

Why don't we describe use of the PG calculator at https://access.redhat.com/labs/cephpgc/ ? We should explain what the different standard storage pools do and which ones are likely to need increasing.

Considerations that make PG counts hard to get right and important to get right:

1) PG counts cannot be easily reduced - this can only be done by stopping services that use the pool, copying to a new storage pool (if you have space!), deleting the old storage pool and renaming the new pool to the old pool's name, then starting the services that use it - not usually practical in a running site.

2) PG counts cannot easily be increased in a running system once data has been added to the pools. As Peter Portante says, "data is like concrete" - once it has been laid out, it's expensive to rearrange it. Changes to PG counts on a running system can cause latency increases.

3) PG count must be divided among active storage pools based on anticipated space consumption and I/O load - you can't set them all high because this consumes resources in Ceph daemons, such as threads and memory.

The two OpenStack pools that most commonly need increasing are:

vms - stores per-guest "system disk" files that are not in the Glance image backing the guest
volumes - stores Cinder volume data (i.e. /dev/vd[b-z] attached to a guest)

For methods of calculating placement group counts (i.e. if the PG calculator above doesn't work for you), see: http://docs.ceph.com/docs/master/rados/operations/placement-groups/
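For consideration (1), the copy-and-rename procedure can be sketched as below. This is a dry-run illustration only, never tested against a live cluster: DRYRUN=1 just prints the commands, and the pool name, target pg_num, and cinder service unit are example values, not prescriptions.

```shell
# Dry-run sketch of "reduce pg_num by copying to a new pool".
# Leave DRYRUN=1 to only print each command instead of executing it.
DRYRUN=${DRYRUN:-1}
run() { if [ "$DRYRUN" = "1" ]; then echo "would run: $*"; else "$@"; fi; }

old=volumes        # example pool whose pg_num is too high
tmp=volumes.new
pg_num=512         # example target, lower than the old pool's count

run systemctl stop openstack-cinder-volume          # stop services using the pool
run ceph osd pool create "$tmp" "$pg_num" "$pg_num" # new pool with fewer PGs
run rados cppool "$old" "$tmp"                      # copy all objects (needs space!)
run ceph osd pool delete "$old" "$old" --yes-i-really-really-mean-it
run ceph osd pool rename "$tmp" "$old"
run systemctl start openstack-cinder-volume
```

Which is exactly why (1) calls this "not usually practical in a running site": the services are down for the whole copy.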
My initial suggestion, when I filed this RFE almost 2 years ago, was that OSP director should make sensible suggestions based on heuristics/logic from the pgcalc tool. The documentation (including the Ceph documentation AND the OSP director documentation) for doing these calculations is not straightforward. At the time I filed this RFE we were seeing customers accepting bad defaults or making wrong calculations. Since the logic already exists in another tool, my thought was that getting OSP director to make some sensible/better initial recommendations would not be a heavy lift.