Bug 1383268

Summary: Deployed Swift rings use way too high partition power
Product: Red Hat OpenStack
Reporter: Christian Schwede (cschwede) <cschwede>
Component: openstack-tripleo-heat-templates
Assignee: Christian Schwede (cschwede) <cschwede>
Status: CLOSED ERRATA
QA Contact: Arik Chernetsky <achernet>
Severity: urgent
Priority: urgent
Version: 10.0 (Newton)
CC: cschwede, egafford, jschluet, mabrams, mburns, mcornea, pgrist, rhel-osp-director-maint, thiago, zaitcev
Target Milestone: rc
Keywords: Triaged
Target Release: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-heat-templates-5.0.0-0.8.0rc3.el7ost, instack-undercloud-5.0.0-2.el7ost
Last Closed: 2016-12-14 16:15:15 UTC
Type: Bug

Description Christian Schwede (cschwede) 2016-10-10 10:22:15 UTC
Description of problem:

The partition power of the deployed Swift rings, on both the undercloud and the overcloud, is set to 18, which is far too high. This will create serious problems later on small deployments, especially with replication.

Using a partition power of 18 creates 2^18 = 262,144 partitions in the cluster. If these are spread across only 3 disks (for example, only the controller nodes), each partition will be replicated individually, and this takes a lot of time. Additionally, a lot of extra inodes will be created, and depending on the usage one might suffer from inode cache misses, slowing down the whole node.
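
To put numbers on the difference (using the replica and disk counts from this report):

  part power 18: 2^18 = 262,144 partitions x 3 replicas = 786,432 partition copies;
                 across 3 disks that is ~262,144 partition directories per disk
  part power 10: 2^10 = 1,024 partitions x 3 replicas = 3,072 partition copies;
                 across 3 disks that is ~1,024 partition directories per disk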

Version-Release number of selected component (if applicable):

All OSP releases that use TripleO.

How reproducible:

Always

Steps to Reproduce:
1. Deploy using OOO/director.
2. Check the partition power on the undercloud and overcloud using "swift-ring-builder /etc/swift/object.builder" (see the example below).
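
For example (just a sketch; the exact summary format can vary between Swift versions, but the summary line reports the partition and replica counts):

  $ swift-ring-builder /etc/swift/object.builder | head -n 2

On an affected deployment the summary reports 262144 partitions, i.e. a partition power of 18.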

Actual results:
Partition power of 18. This is the default in puppet-swift if nothing else is defined.

Expected results:
A much lower partition power, for example 10 for small deployments with only a few disks.

Additional info:
Please note that it is not possible to lower the partition power once the cluster has been deployed. Increasing the partition power later on is proposed as a patch upstream, but not yet merged; therefore starting with a lower value is much safer than using a high value that will create trouble later.

There is already a partition power of 10 defined in tripleo-heat-templates, but this is not used. Looking at /etc/puppet/hieradata/puppet-stack-config.yaml on the undercloud I see this:

tripleo::ringbuilder::part_power: 10
tripleo::ringbuilder::replicas: 3
tripleo::ringbuilder::min_part_hours: 1
swift_mount_check: false
swift::ringbuilder::replicas: 1

I think the first line should be "swift::ringbuilder::part_power: 10"? Also, replicas is defined twice, and the one with "3" is unused (which makes sense, because there is only a single disk on the undercloud).
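
If that is right, the undercloud hieradata would look roughly like this (only a sketch of the intended keys; whether min_part_hours also belongs under the swift::ringbuilder namespace is an assumption on my part):

  # hypothetical corrected puppet-stack-config.yaml excerpt
  swift::ringbuilder::part_power: 10
  swift::ringbuilder::min_part_hours: 1   # assumed namespace
  swift::ringbuilder::replicas: 1
  swift_mount_check: false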

On the overcloud I see this in /etc/puppet/hieradata/service_configs.yaml:

swift::ringbuilder::part_power: 10
tripleo::profile::base::swift::ringbuilder::replicas: 3

I think this should be "tripleo::profile::base::swift::ringbuilder::part_power: 10"?
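
That is, roughly (again only a sketch of the intended keys):

  # hypothetical corrected service_configs.yaml excerpt
  tripleo::profile::base::swift::ringbuilder::part_power: 10
  tripleo::profile::base::swift::ringbuilder::replicas: 3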

Using a partition power of 10 creates 2^10 = 1024 partitions. Each partition is replicated 3 times, so a total of 3072 partition copies will be spread across all disks. It is recommended to have at least ~100 partitions per disk; 3072 partitions are therefore enough for up to ~30 disks (more disks still work, but data is distributed less evenly). Bigger deployments should use a higher value, depending on the number of disks at initial deployment as well as the expected growth.
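
As a rule of thumb (a sketch generalizing the numbers above; not an official sizing formula):

  part_power = ceil(log2(total_disks * 100 / replicas))

  e.g. 30 disks, 3 replicas: ceil(log2(30 * 100 / 3)) = ceil(log2(1000)) = ceil(9.97) = 10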

Comment 1 Paul Grist 2016-10-10 15:50:06 UTC
Setting the target release; this is a must-fix for OSP10.

Comment 2 Pete Zaitcev 2016-10-11 19:45:41 UTC
I'm not sure it actually would have helped Alex K.'s case to bump the
partition power down to 10. It was more about concurrency and
unnecessary writes, I thought.

Comment 3 Christian Schwede (cschwede) 2016-10-12 06:07:35 UTC
Pete: actually it helped quite a bit, because replication was way faster with a lower partition power. Each partition starts its own replication pass, so with more partitions a full replication run takes much longer, adding quite a bit of I/O load to the already busy disks.

Comment 4 Christian Schwede (cschwede) 2016-10-14 13:43:32 UTC
Moving this to POST; proposed patches landed on master and stable/newton.

Comment 12 errata-xmlrpc 2016-12-14 16:15:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html