Bug 1427326 - Support for batched deployment of the overcloud nodes
Summary: Support for batched deployment of the overcloud nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ga
Target Release: 12.0 (Pike)
Assignee: Steven Hardy
QA Contact: Gurenko Alex
URL:
Whiteboard:
Depends On:
Blocks: 1452087 1469584
 
Reported: 2017-02-27 22:31 UTC by John Fulton
Modified: 2018-02-05 19:04 UTC (History)
CC List: 9 users

Fixed In Version: openstack-tripleo-heat-templates-7.0.0-0.20170616123155.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1452087 (view as bug list)
Environment:
Last Closed: 2017-12-13 21:11:31 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1688550 0 None None None 2017-05-05 13:02:34 UTC
OpenStack gerrit 446927 0 None None None 2017-05-18 10:28:49 UTC
Red Hat Product Errata RHEA-2017:3462 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 12.0 Enhancement Advisory 2018-02-16 01:43:25 UTC

Description John Fulton 2017-02-27 22:31:16 UTC
Deploying an overcloud with a large number of nodes at once can result in a failed deployment (similar to the symptoms described in BZ 1410977 on OSP10). Rules of thumb have evolved to work around this, consisting of:

1. Using scheduler hints to pin node placement instead of letting TripleO's Nova scheduler place the Ironic nodes
2. Ensuring the undercloud.conf IP ranges are not too small
3. Deploying in batches (a minimal sketch of this follows the list)
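
To illustrate item 3: batching today is done by hand, by deploying with small role counts and then re-running the deploy with larger counts. A minimal sketch, assuming the standard *Count parameters from tripleo-heat-templates (the file name node-counts.yaml is just an example):

  # node-counts.yaml -- illustrative environment file for manual batching.
  # ControllerCount/ComputeCount are the standard role-count parameters;
  # everything else about the deploy command is unchanged.
  parameter_defaults:
    ControllerCount: 3
    ComputeCount: 10      # first batch of computes
  # After "openstack overcloud deploy ... -e node-counts.yaml" succeeds,
  # raise ComputeCount (20, 30, ...) and re-run the same command; each
  # re-run scales the overcloud out by one batch.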

One may deploy a larger overcloud by following the above, but having to come back and scale out a subset of nodes at a time is not the best deployer experience.

As an alternative, could TripleO be configured to do the deployment in batches, so that a deployer could start a large deploy and have the batching handled for them?

Could there be a new variable in TripleO, called something like $batch_size, which, for a deployment consisting of N total nodes, stands up a smaller overcloud and then scales it out slowly in batches of $batch_size, so that the deployer does not need to worry about batching their deployment (but can change the batch size if they need to)?

Since the deployment is now managed by a Mistral workflow, this kind of batching of a long-running task seems like a good fit for Mistral, so I wanted to track the idea in a BZ RFE and see if others vote it up.
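
For illustration, such a knob would most naturally be a single batch-size parameter set from a deployer environment file. A minimal sketch, assuming the parameter name NodeCreateBatchSize (the name later used in tripleo-heat-templates; confirm it exists in your release before relying on it):

  # batch-size.yaml -- hypothetical deployer-facing batching knob.
  # NodeCreateBatchSize is assumed here; check your template version.
  parameter_defaults:
    NodeCreateBatchSize: 10   # create/scale at most 10 nodes at a time
  # Passed like any other environment file:
  #   openstack overcloud deploy --templates -e batch-size.yaml ...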

Comment 2 John Fulton 2017-02-28 14:21:31 UTC
gfidente pointed out to me that Nova itself does some batching, but only the nova server create is batched. Because this happens so deep inside Nova, it is still possible that Ironic does not receive the requests batched; we don't know whether it gets requests for nodes in groups (of 10) without waiting for the first group to come up. Perhaps this batching was intended to solve the problem of trying to PXE boot a large number of nodes at once, but if deployers are still deploying in batches themselves then it is probably a bug worth looking at.

Comment 3 Giulio Fidente 2017-02-28 14:23:09 UTC
It seems that Nova is already doing some batching internally [1].

Heat can batch ResourceGroups, so we could use a batching parameter there.

1. https://github.com/openstack/nova/blob/master/nova/scheduler/host_manager.py#L425
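
To make the Heat side concrete, here is a minimal sketch (not the actual overcloud template) of an OS::Heat::ResourceGroup whose member creation is batched via update_policy. The batch_create, max_batch_size and pause_time keys come from the Heat resource reference; BatchSize, ComputeGroup and the server definition are illustrative placeholders:

  heat_template_version: 2016-10-14

  parameters:
    BatchSize:
      type: number
      default: 10

  resources:
    ComputeGroup:
      type: OS::Heat::ResourceGroup
      # Create group members at most BatchSize at a time; a similar
      # rolling_update policy can govern batching during stack updates.
      update_policy:
        batch_create:
          max_batch_size: {get_param: BatchSize}
          pause_time: 0
      properties:
        count: 50
        resource_def:
          type: OS::Nova::Server
          properties:
            image: overcloud-full   # placeholder image/flavor
            flavor: compute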

Comment 4 Giulio Fidente 2017-03-01 12:03:02 UTC
Note: the change proposed at 439039 is meant to control batching of the Heat resource groups and *does not* regulate other steps such as introspection. Updating the BZ title accordingly.

Comment 10 Gurenko Alex 2017-12-11 14:15:32 UTC
Verified with a batch size of 2 on build 2017-12-01.4

Comment 13 errata-xmlrpc 2017-12-13 21:11:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462

