Bug 1427326

Summary: Support for batched deployment of the overcloud nodes
Product: Red Hat OpenStack
Reporter: John Fulton <johfulto>
Component: openstack-tripleo-heat-templates
Assignee: Steven Hardy <shardy>
Status: CLOSED ERRATA
QA Contact: Gurenko Alex <agurenko>
Severity: medium
Priority: medium
Version: 10.0 (Newton)
CC: aschultz, gfidente, jefbrown, jtaleric, mburns, rhel-osp-director-maint, slinaber, tvignaud, yrabl
Target Milestone: ga
Keywords: Triaged
Target Release: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-heat-templates-7.0.0-0.20170616123155.el7ost
Clones: 1452087 (view as bug list)
Last Closed: 2017-12-13 21:11:31 UTC
Type: Bug
Bug Blocks: 1452087, 1469584

Description John Fulton 2017-02-27 22:31:16 UTC
Deploying an overcloud with a large number of nodes at once can result in a failed deployment (similar to the symptoms described in BZ 1410977 on OSP10). Rules of thumb have evolved to work around this, consisting of:

1. Using scheduler hints to pin node placement instead of just letting TripleO's Nova scheduler place the Ironic nodes
2. Ensuring the undercloud.conf IP ranges are not too small
3. Deploying in batches

Following the above makes it possible to deploy a larger overcloud, but having to come back repeatedly to scale up a subset of nodes at a time is not the best deployer experience.

As an alternative, could TripleO be configured to batch the deployment itself, so that a deployer could start a large deploy and have the batching handled for them?

Could there be a new variable in TripleO, called something like $batch_size, which, given a deployment of N total nodes, stands up a smaller overcloud and then scales it gradually in increments of $batch_size, so that the deployer does not need to worry about batching the deployment themselves (but can change the batch size if they need to)?

Since the deployment is now managed by a Mistral workflow, this kind of batching of a long-running task seems like a good fit for Mistral, so I wanted to track the idea in a BZ RFE and see whether others vote it up.

Comment 2 John Fulton 2017-02-28 14:21:31 UTC
gfidente pointed out to me that Nova itself does some batching, but
only the nova server create is batched. Because this happens so deep
inside Nova, it is still possible that Ironic does not receive the
nodes in batches: we don't know whether it gets requests for nodes in
groups (of 10) without waiting for the first group to come up. This
batching was perhaps intended to solve the problem of PXE-booting a
large number of nodes at once, but if deployers still have to batch
deployments themselves then it's probably still a bug worth looking at.

Comment 3 Giulio Fidente 2017-02-28 14:23:09 UTC
It seems that nova is already doing some batching internally [1].

Heat can batch resource groups, so we could use a batching parameter there.

1. https://github.com/openstack/nova/blob/master/nova/scheduler/host_manager.py#L425
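The ResourceGroup batching mentioned above can be sketched as follows. This is a minimal illustration of Heat's documented `OS::Heat::ResourceGroup` `update_policy` with `batch_create`, not the actual tripleo-heat-templates change; the group name, count, and server properties are placeholders:

```yaml
# Minimal sketch of ResourceGroup batching in Heat (HOT); not the
# actual tripleo-heat-templates patch. batch_create tells Heat to
# create group members in chunks of max_batch_size instead of all
# at once.
heat_template_version: 2016-10-14

parameters:
  NodeCount:
    type: number
    default: 30

resources:
  ComputeGroup:
    type: OS::Heat::ResourceGroup
    update_policy:
      batch_create:
        max_batch_size: 10   # create at most 10 members per batch
        pause_time: 0        # seconds to wait between batches
    properties:
      count: {get_param: NodeCount}
      resource_def:
        type: OS::Nova::Server
        properties:
          flavor: baremetal
          image: overcloud-full
```

With this policy Heat would create the 30 group members in three batches of 10 rather than issuing all 30 server creates simultaneously.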

Comment 4 Giulio Fidente 2017-03-01 12:03:02 UTC
Note: the change proposed in review 439039 is meant to set batching of the Heat resource groups and *does not* regulate other steps like introspection. Updating the BZ title accordingly.
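For illustration, such a knob might be exposed to deployers through an environment file along these lines. The parameter name `NodeCreateBatchSize` is an assumption about the eventual implementation, not quoted from change 439039:

```yaml
# Hypothetical deployer-facing environment file; the parameter name
# NodeCreateBatchSize is an assumption, not taken from the review.
parameter_defaults:
  NodeCreateBatchSize: 2   # nodes created per Heat batch
```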

Comment 10 Gurenko Alex 2017-12-11 14:15:32 UTC
Verified with a batch size of 2 on build 2017-12-01.4.

Comment 13 errata-xmlrpc 2017-12-13 21:11:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462