Bug 1286841 - [RFE] Automate Ceph PG configuration decisions when deploying via OSP director
Summary: [RFE] Automate Ceph PG configuration decisions when deploying via OSP director
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Sébastien Han
QA Contact: Yogev Rabl
URL:
Whiteboard:
Duplicates: 1290130
Depends On:
Blocks: 1387430 1394872 1413723
 
Reported: 2015-11-30 21:07 UTC by jliberma@redhat.com
Modified: 2017-03-29 18:19 UTC (History)
CC List: 22 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-29 18:19:17 UTC
Target Upstream Version:
Embargoed:



Description jliberma@redhat.com 2015-11-30 21:07:03 UTC
Description of problem:

Ceph placement groups aggregate data within pools. Configuring Ceph with a sensible number of placement groups affects data durability and performance. OSP director allows for Ceph customization: adding pools, setting pg_num and pgp_num, setting the replication factor, and adding OSDs. OSP director should automatically make sensible suggestions for pg_num/pgp_num based on the number of pools, OSDs, and the replication factor. Heuristics can be captured from the pgcalc tool and documentation. The parameters should remain overridable by customers.
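
For illustration, here is a rough sketch of the kind of heuristic meant here (this is not existing director logic; the OSD count, replica count, target PGs per OSD, and pool weights are made-up example values):

  #!/bin/bash
  # pgcalc-style estimate: a total PG budget split across pools by expected
  # data share, rounded up to a power of two (requires bash 4+ for the
  # associative array).
  OSDS=36                 # number of OSDs in the cluster
  SIZE=3                  # osd pool default size (replica count)
  TARGET_PGS_PER_OSD=100  # rule of thumb from the Ceph documentation

  # Expected share of the cluster's data per pool, in percent (sums to 100).
  declare -A WEIGHTS=( [volumes]=55 [vms]=30 [images]=10 [backups]=5 )

  # Total PG budget for the whole cluster: (OSDs * target) / replicas.
  TOTAL=$(( OSDS * TARGET_PGS_PER_OSD / SIZE ))

  # Round a value up to the next power of two.
  next_pow2() { local n=$1 p=1; while (( p < n )); do p=$(( p * 2 )); done; echo "$p"; }

  for pool in "${!WEIGHTS[@]}"; do
    raw=$(( TOTAL * ${WEIGHTS[$pool]} / 100 ))
    echo "$pool: suggested pg_num = pgp_num = $(next_pow2 "$raw")"
  done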

Additional info:

1) Existing bug: https://bugzilla.redhat.com/show_bug.cgi?id=1252546
Ceph pg_num and pgp_num are correctly set in ceph.yaml but the pools always use 64

2) PG documentation and recommendations: http://docs.ceph.com/docs/master/rados/operations/placement-groups/

3) PG calc tool: http://ceph.com/pgcalc/

Comment 3 Mike Burns 2015-12-09 17:34:38 UTC
*** Bug 1290130 has been marked as a duplicate of this bug. ***

Comment 4 Dan Yocum 2015-12-16 22:46:20 UTC
In the short term, the ceph.yaml should contain documentation/comments to help an operator determine the appropriate pg_num and pgp_num values and this should be documented in the Director Installation and Configuration guide.  cc'ing dmacpher.

Comment 5 Dan Yocum 2015-12-16 22:55:32 UTC
Here's some text to include in the ceph.yaml file (lifted from the ceph docs: http://docs.ceph.com/docs/hammer/rados/configuration/pool-pg-config-ref/)

# Ensure you have a realistic number of placement groups. We recommend
# approximately 100 per OSD. E.g., total number of OSDs multiplied by 100 
# divided by the number of replicas (i.e., osd pool default size). So for
# 10 OSDs and osd pool default size = 4, we'd recommend approximately
# (100 * 10) / 4 = 250.
# pgp_num is the default number of placement groups for placement for a pool.
# The default value is the same as pg_num with mkpool. PG and PGP should be
# equal (for now).
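
For context, a hypothetical sketch of where such defaults would sit in the director's ceph.yaml hieradata, assuming the puppet-ceph hiera keys in use at the time (verify the exact key names against your tripleo-heat-templates version); the values follow the 10-OSD example above, rounded up to a power of two:

  ceph::profile::params::osd_pool_default_size: 4
  ceph::profile::params::osd_pool_default_pg_num: 256    # (100 * 10) / 4 = 250, rounded up
  ceph::profile::params::osd_pool_default_pgp_num: 256   # keep equal to pg_num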

Comment 7 Ruchika K 2016-03-30 15:05:40 UTC
Dan,

The problem is that despite the new values for the defaults specified in ceph.yaml, they don't appear to be picked up and the default remains at 64.

- Is there a workaround for this ?

- And if not, after deployment how should these values be updated ?

Thank you

Comment 8 Dan Yocum 2016-03-30 16:12:36 UTC
(In reply to Ruchika K from comment #7)
> Dan,
> 
> The problem is that despite the new values for the defaults specified in
> ceph.yaml, they don't appear to be picked up and the default remains at 64.
> 
> - Is there a workaround for this ?
> 
> - And if not, after deployment how should these values be updated ?
> 
> Thank you

Which Dan are you talking to?  ;)

I don't think there's a work-around for this; I think it's still broken.  My guess is that the heat environment parameter isn't getting properly passed to the puppet manifest - I'm about to redeploy OS1 P-Prime today, and I'll let you know if this has been fixed...

Follow the ceph docs to increase the pg_num after creation:

http://docs.ceph.com/docs/master/rados/operations/placement-groups/#set-the-number-of-placement-groups
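
For a single pool that boils down to something like the following (pool name and target value are only examples); pg_num is raised first, then pgp_num, so the new placement groups actually get rebalanced:

  ceph osd pool set volumes pg_num 256
  ceph osd pool set volumes pgp_num 256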

Comment 9 Ruchika K 2016-03-30 16:29:00 UTC
Dan Yocum,
You are the right Dan ;-)

It isn't fixed on 7.3.

For now we're updating the pg/pgp_num values manually from a Ceph monitor node (which is also an overcloud controller node).

It is a manual step for someone to remember. Also, for updates issued from the undercloud node, it is not clear whether these values would revert. Is it possible to get insight into whether an update would change these values?


Thank you

Comment 10 jliberma@redhat.com 2016-03-31 03:03:57 UTC
As I recall, an update will not revert the values since they are initially set at pool creation.

Comment 11 John Fulton 2016-03-31 12:34:37 UTC
Ruchika,

Here's a possible workaround to consider to at least have Director automate what the cloud admin would have to remember to do manually. 

1. Have Heat trigger a post deployment shell script [1] 
2. Have the shell script update the pools that were created [2]
3. Have the shell script, or some custom puppet [3], modify ceph.conf to set the desired PG values for future pool creation 

Then once this bug is fixed, take out the post configs described above. 

  John

[1] https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux_OpenStack_Platform/7/html/Director_Installation_and_Usage/sect-Configuring_after_Overcloud_Creation.html

[2] 
# add a conditional to only execute this block if the change is necessary
pg_num=256    # or pull it from hiera
pgp_num=256
for i in rbd images volumes vms; do
  ceph osd pool set $i pg_num $pg_num
  sleep 10
  ceph osd pool set $i pgp_num $pgp_num
  sleep 10
done

[3] https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux_OpenStack_Platform/7/html/Director_Installation_and_Usage/sect-Modifying_Puppet_Configuration_Data.html

Comment 12 Shinobu KINJO 2016-03-31 22:52:19 UTC
(In reply to John Fulton from comment #11)
> Ruchika,
> 
> Here's a possible workaround to consider to at least have Director automate
> what the cloud admin would have to remember to do manually. 
> 
> 1. Have Heat trigger a post deployment shell script [1] 
> 2. Have the shell script update the pools that were created [2]
> 3. Have the shell script, or some custom puppet [3], modify ceph.conf to set
> the desired PG values for future pool creation 
> 
> Then once this bug is fixed, take out the post configs described above. 
> 
>   John
> 
> [1]
> https://access.redhat.com/documentation/en-US/
> Red_Hat_Enterprise_Linux_OpenStack_Platform/7/html/
> Director_Installation_and_Usage/sect-Configuring_after_Overcloud_Creation.
> html
> 
> [2] 
> # add a conditional to only execute this block if the change is necessary
> pg_num=256    # or pull it from hiera
> pgp_num=256
> for i in rbd images volumes vms; do

It may be better to update not every pool but only the pools specified by the user, since some users have customized CRUSH maps.

This could also be done with a single shell or Python script triggered via hiera/puppet (see the sketch at the end of this comment).

What do you think?

>  ceph osd pool set $i pg_num $pg_num;
>  sleep 10
>  ceph osd pool set $i pgp_num $pgp_num;
>  sleep 10
> done
> 
> [3]
> https://access.redhat.com/documentation/en-US/
> Red_Hat_Enterprise_Linux_OpenStack_Platform/7/html/
> Director_Installation_and_Usage/sect-Modifying_Puppet_Configuration_Data.html
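
A minimal sketch of that per-pool idea (the pool list and target values are hardcoded examples; in practice they could come from hiera or a Heat parameter):

  #!/bin/bash
  # Resize only the pools the operator opted in, leaving any pools tied to a
  # customized CRUSH map alone.
  PG_NUM=256
  PGP_NUM=256
  POOLS="volumes vms"    # only the pools the user asked to resize

  for pool in $POOLS; do
    ceph osd pool set "$pool" pg_num "$PG_NUM"
    sleep 10             # give the cluster time to settle between changes
    ceph osd pool set "$pool" pgp_num "$PGP_NUM"
    sleep 10
  done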

Comment 13 Mike Burns 2016-04-07 21:00:12 UTC
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.

Comment 15 Jeff Brown 2016-09-13 17:35:26 UTC
I wanted to confirm with Federico that this feature comes with the OSPd Ceph integration that will introduce a business logic layer in OSP 11 (tech preview) and be productized in OSP 12.

Comment 17 seb 2016-09-20 11:43:56 UTC
I'm not a big fan of the software making decisions; advising is fine, but if we have to do this it means something went wrong during the design process.
Prior to deploying any production Ceph cluster, the Ceph engineering team must be contacted for recommendations. Ceph engineering will then provide the cluster design, with the proper configuration items.

I agree this doesn't particularly scale, which is why we have to come up with guidelines for "standard" clusters; using pgcalc is a good start.

For non-standard cases (not addressed in our documentation and guidelines), the help of a Ceph engineer will be required.

To conclude, I'm expecting to see more guidance and documentation in the product instead of building complex logic inside OSPd to determine the right values for PGs.

Comment 18 Ben England 2016-12-13 18:01:02 UTC
Seb, I disagree.  I think OSP could use some heuristics that would avoid the worst case behaviors and advise people when there may be a problem.  For example, I'd suggest increasing PG count on well-known OpenStack pools to achieve a ratio of PGs/OSD like this:

images 1  (used by Glance)
vms 8     (used by Nova)
volumes 8 (used by Cinder)
backups 2 (?)

For example, on a cluster with 780 OSDs, this doesn't give you perfect performance, but it avoids the worst-case situation where a storage pool has only 32 PGs, which means it can effectively never use more than 3 x 32 = 96 OSDs, less than 1/10 of the block devices in the system.

One feature that I haven't found is the ability to monitor Ceph I/O by pool - this is really what's needed to determine how to assign PG counts to even out I/O across block devices.  But a partial replacement for that is "ceph df", which shows you which pools have the most data in them - those are the ones that most need PGs to even out space utilization across OSDs.

Another missing feature is telling people which OSDs are overflowing and what to do about it.  I use this, but it should be a feature of Ceph:

 ansible -m shell -a 'df | grep /var/lib/ceph/osd' osds  \
  | tr '%' ' ' | sort -nk5 > /tmp/du

In practice I've run into several situations where an OSD filled up while there were many other OSDs that were only 50-60% utilized.  This is a problem with Ceph and increasing PGs is the primary way to address it - a secondary way is to reweight OSDs using "ceph osd reweight-by-utilization", which worked fairly well. To see what this would do without doing it, do "ceph osd test-reweight-by-utilization".  But I would suggest adjusting PG counts first.  It's important to get this close to right before all of your data lands on the block devices.
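
For reference, the dry-run/apply pair mentioned above looks like this:

  ceph osd test-reweight-by-utilization   # report what would change, without changing anything
  ceph osd reweight-by-utilization        # apply the reweighting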

Comment 22 Ben England 2017-03-21 19:10:23 UTC
If the consensus here is that this is a documentation issue and not a software issue, then we ought to be telling the user how (and why) to calculate PG counts in the documentation.  I don't see that here:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html-single/red_hat_ceph_storage_for_the_overcloud/#Assigning-Custom-Attributes-to-Different-Ceph-Pools

Why don't we describe use of the PG calculator at

https://access.redhat.com/labs/cephpgc/

We should explain what different standard storage pools do and which ones are likely to need increasing.  Considerations that make PG counts hard to get right and important to get right:

1) PG counts cannot be easily reduced - this can only be done by stopping services that use the pool, copying to a new storage pool (if you have space!), deleting old storage pool and renaming new pool to old pool's name, then starting services that use it - not usually practical in a running site.  

2) PG counts cannot easily be increased in a running system once data has been added to the pools.  As Peter Portante says, "data is like concrete" - once it has been laid out, it's expensive to rearrange it.  Changes to PG counts on a running system can cause latency increases.

3) PG count must be divided among active storage pools based on anticipated space consumption and I/O load - you can't set them all high because this consumes resources in Ceph daemons, such as threads and memory.

The two OpenStack pools that most commonly need increasing are:

vms - stores per-guest "system disk" files that are not in the Glance image backing the guest.
volumes - stores Cinder volume data (i.e. /dev/vd[b-z] attached to guest)

For methods of calculating placement group counts (i.e. if the PG calculator above doesn't work for you), see:

http://docs.ceph.com/docs/master/rados/operations/placement-groups/
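
Before deciding which pools to grow, standard Ceph commands give the needed picture, for example:

  ceph df                            # per-pool data usage: data-heavy pools need PGs the most
  ceph osd pool get volumes pg_num   # current pg_num for one pool (volumes as an example)
  ceph osd df                        # per-OSD utilization, to spot OSDs filling up early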

Comment 23 jliberma@redhat.com 2017-03-21 21:29:40 UTC
My initial suggestion, when I filed this RFE almost 2 years ago, was that OSP director should make sensible suggestions based on heuristics/logic from the pgcalc tool.

The documentation (including the Ceph documentation AND the OSP director documentation) for doing these calculations is not straightforward.

At the time I filed this RFE we were seeing customers accepting bad defaults or making wrong calculations.

Since the logic already exists in another tool, my thought was that getting OSP director to make some sensible/better initial recommendations would not be a heavy lift.

