Bug 1288153

Summary: [Docs] [Director] Add Tuning Instructions in Troubleshooting Section
Product: Red Hat OpenStack
Reporter: Dan Yocum <dyocum>
Component: documentation
Assignee: Dan Macpherson <dmacpher>
Status: CLOSED CURRENTRELEASE
QA Contact: Radek Bíba <rbiba>
Severity: medium
Priority: high
Version: 7.0 (Kilo)
CC: adahms, dyocum, rbiba, yeylon
Target Milestone: ga
Keywords: Documentation
Target Release: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2016-04-13 04:42:31 UTC
Type: Bug

Description Dan Yocum 2015-12-03 17:04:39 UTC
Description of problem:

The director node specified in the guide is much too small.  Even with 2 vCPUs and 12GB of RAM, I am unable to deploy a 20-node overcloud.

Version-Release number of selected component (if applicable):

7

Solution:

slagle suggests a minimum of 8 cores and 16GB of RAM on a dedicated baremetal system for a 65 node overcloud.

For a production overcloud of more than 100 nodes, 16 cores and 32GB of RAM are recommended.

Comment 2 Andrew Dahms 2015-12-07 22:57:33 UTC
Assigning to Dan for review.

Comment 3 Dan Yocum 2015-12-08 02:20:08 UTC
I have a feeling that the undercloud can be smaller *if* the following BZs are addressed before y2:

https://bugzilla.redhat.com/show_bug.cgi?id=1289287
https://bugzilla.redhat.com/show_bug.cgi?id=1212126

Comment 5 Dan Yocum 2015-12-08 14:57:34 UTC
Slightly OT, but still documentation-related: I think a tuning section should be added to the Troubleshooting portion of the guide.  Off the top of my head, these things should be included:

1) A crontab to purge the keystone.token table so that it doesn't grow without bound.  This may need to be done more than once per day - possibly every hour:

3 1 * * * /bin/keystone-manage token_flush

2) A crontab to purge the heat.raw_templates table so that it doesn't grow without bound.  This may need to be done more than once per week - possibly every day - and the table may need to be cleared more often than every 30 days:

3 2 * * 6 /bin/heat-manage purge_deleted -g days 30

3) If heat-engine and heat-api consume too many resources (i.e., they peg the CPU repeatedly and for long periods of time), set max_resources_per_stack=-1 in /etc/heat/heat.conf.
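
To make the schedules in items 1 and 2 easier to compare, here is a minimal Python sketch that formats the crontab lines. The cron_entry helper is hypothetical, and the hourly and daily variants are only assumed examples of "more than once per day" and "every day"; they are not tested values.

```python
def cron_entry(minute, hour, day_of_week, command):
    """Format one crontab line: minute hour day-of-month month day-of-week command."""
    return f"{minute} {hour} * * {day_of_week} {command}"

TOKEN_FLUSH = "/bin/keystone-manage token_flush"
HEAT_PURGE = "/bin/heat-manage purge_deleted -g days 30"

# Schedules exactly as posted in items 1 and 2.
daily_token_flush = cron_entry(3, 1, "*", TOKEN_FLUSH)   # 01:03 every day
weekly_heat_purge = cron_entry(3, 2, 6, HEAT_PURGE)      # 02:03 every Saturday

# Assumed, more frequent variants (illustrative only).
hourly_token_flush = cron_entry(3, "*", "*", TOKEN_FLUSH)  # minute 3 of every hour
daily_heat_purge = cron_entry(3, 2, "*", HEAT_PURGE)       # 02:03 every day

for line in (daily_token_flush, weekly_heat_purge,
             hourly_token_flush, daily_heat_purge):
    print(line)
```

Only the first and fifth fields change between the posted and the more frequent variants; the commands stay the same.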

Comment 6 Dan Macpherson 2015-12-08 16:32:57 UTC
OT, yes, but I don't think another BZ is necessary for the moment.

I think it's a great idea so I'll add these things in another section in the guide tomorrow.

Comment 7 Dan Yocum 2015-12-08 19:26:17 UTC
4) If deployment fails and the system load is very high, reduce the number of concurrent instance builds to something less than the default of 10.  Edit /etc/nova/nova.conf and set:

max_concurrent_builds=3

Then restart the nova-api and nova-scheduler services.
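
For anyone scripting the nova.conf change, a rough Python sketch follows. It works on the file contents as a string; set_option is a hypothetical helper, the sample text is made up, and placing the option in [DEFAULT] is an assumption about a stock nova.conf. Restarting the services is not covered.

```python
def set_option(conf_text, section, key, value):
    """Return conf_text with key=value set inside [section].

    An existing or commented-out assignment of key in that section is
    replaced; otherwise the line is appended to the section. A missing
    section is appended at the end of the file.
    """
    header = f"[{section}]"
    new_line = f"{key}={value}"
    out, in_section, done = [], False, False
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            if in_section and not done:
                out.append(new_line)  # key was absent from the target section
                done = True
            in_section = stripped == header
        elif in_section and not done and "=" in stripped:
            name = stripped.lstrip("#").split("=", 1)[0].strip()
            if name == key:
                out.append(new_line)  # replace the old or commented-out line
                done = True
                continue
        out.append(line)
    if not done:
        if not in_section:
            out.append(header)
        out.append(new_line)
    return "\n".join(out) + "\n"

# Made-up sample contents, not a real nova.conf.
sample = "[DEFAULT]\n#max_concurrent_builds=10\ndebug=False\n\n[database]\nconnection=x\n"
updated = set_option(sample, "DEFAULT", "max_concurrent_builds", 3)
print(updated)
```

Reading /etc/nova/nova.conf, passing its text through set_option, and writing the result back (after keeping a backup) would apply the change.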

5) Tune MariaDB: increase max_connections (if not already 4096) and set several innodb parameters.  Edit /etc/my.cnf.d/server.cnf:

innodb_additional_mem_pool_size = 20M
innodb_buffer_pool_size = 1000M
innodb_flush_log_at_trx_commit = 1
innodb_lock_wait_timeout = 50
innodb_max_purge_lag = 10000
innodb_thread_concurrency = <2*(NumCPUs+NumDisks)>

NB: Ensure that the director has enough RAM, typically 512MB to 1GB more than the size of the innodb_buffer_pool_size.
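
As a worked example of the innodb_thread_concurrency formula and the RAM note, here is a small Python sketch. The 8 CPUs come from the 8-core suggestion in the description and the 2 disks are an assumed value; the RAM helper is one reading of the NB above (buffer pool size plus 512MB to 1GB).

```python
def innodb_thread_concurrency(num_cpus, num_disks):
    """2 * (NumCPUs + NumDisks), as given in the parameter list above."""
    return 2 * (num_cpus + num_disks)

def ram_range_mb(buffer_pool_mb):
    """Buffer pool size plus 512MB to 1GB of headroom, in MB."""
    return buffer_pool_mb + 512, buffer_pool_mb + 1024

# Assumed hardware: 8 CPUs and 2 disks.
print(innodb_thread_concurrency(8, 2))  # 20
# innodb_buffer_pool_size = 1000M from the list above.
print(ram_range_mb(1000))               # (1512, 2024)
```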

Comment 8 Dan Yocum 2015-12-28 16:13:11 UTC
From this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1290949, comment 15, do this:

1. edit the file /etc/heat/heat.conf on the undercloud and uncomment the line:
#num_engine_workers = 4

2. restart openstack-heat-engine
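
Step 1 can be scripted along these lines. This is only a sketch: uncomment_option is a hypothetical helper that works on the file contents as a string, and the sample text is made up rather than taken from a real heat.conf.

```python
import re

def uncomment_option(conf_text, key):
    """Remove the leading '#' from the first commented-out 'key = value' line."""
    pattern = re.compile(rf"^#\s*({re.escape(key)}\s*=.*)$", re.MULTILINE)
    return pattern.sub(r"\1", conf_text, count=1)

# Made-up sample contents, not a real heat.conf.
sample = "[DEFAULT]\n# Number of heat-engine processes to fork and run.\n#num_engine_workers = 4\n"
print(uncomment_option(sample, "num_engine_workers"))
```

The descriptive comment line is left alone because the pattern only matches a '#' followed directly by the key and an equals sign.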

Comment 9 Dan Yocum 2015-12-28 16:14:17 UTC
Also, the cronjob to clean out the keystone.token table has had by far the biggest positive influence on tuning the undercloud.

Comment 10 Dan Macpherson 2016-01-22 03:46:53 UTC
Resetting to ASSIGNED due to other docs changes.

Comment 18 Dan Macpherson 2016-03-17 02:51:26 UTC
Thanks, Radek.

Have implemented a commit with your suggested changes:

https://gitlab.cee.redhat.com/rhci-documentation/docs-Red_Hat_Enterprise_Linux_OpenStack_Platform/commit/8525a8c46347cf773f6f29992a2052bb88967b9a

Comment 19 Radek Bíba 2016-03-17 06:38:04 UTC
Great, moving to VERIFIED.

Comment 20 Andrew Dahms 2016-04-13 04:42:31 UTC
This content is now live on the Customer Portal.

Closing.