Deploying an overcloud with a large number of nodes at once can result in a failed deployment (similar to the symptoms described in BZ 1410977 on OSP10). Rules of thumb have evolved to work around this:
1. Using scheduler hints to pin node placement instead of just letting tripleo's nova scheduler schedule the ironic nodes
2. Ensuring the undercloud.conf IP ranges are not too small
3. Deploying in batches
One may deploy a larger overcloud when following the above, but waiting to come back and scale a subset of nodes at a time may not be the best deployer experience.
As an alternative, could TripleO be configured to deploy in batches, so that a deployer could start a large deployment and have TripleO batch it for them?
Could there be a new variable in TripleO, called something like $batch_size, which, given a deployment of N total nodes, stands up a smaller overcloud and then scales it up slowly in batches of $batch_size? The deployer would then not need to worry about batching the deployment (but could change the batch size if needed).
Since the deployment is now managed by a Mistral workflow, this type of batching for a long-running task seems like something that would fit Mistral well, so I wanted to track the idea in a BZ RFE and see if others vote it up.
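To make the proposal concrete, here is a minimal Python sketch of the scaling plan such a $batch_size setting implies. It is not TripleO code: the function name plan_scale_steps and its arguments are hypothetical, and it only computes the node count to target at each step, assuming the overcloud grows by at most batch_size nodes per step.

```python
def plan_scale_steps(total_nodes, batch_size):
    """Return the cumulative node count to target at each deployment step.

    Hypothetical helper for illustration only; not part of TripleO.
    """
    if total_nodes < 0:
        raise ValueError("total_nodes must be non-negative")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    steps = []
    deployed = 0
    while deployed < total_nodes:
        # Grow by one batch, but never past the requested total.
        deployed = min(deployed + batch_size, total_nodes)
        steps.append(deployed)
    return steps


if __name__ == "__main__":
    for count in plan_scale_steps(total_nodes=10, batch_size=3):
        print("scale overcloud to", count, "nodes")
```

With 10 nodes and a batch size of 3, the plan is to deploy 3 nodes first and then scale to 6, 9 and finally 10.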
gfidente pointed out to me that Nova itself is batching, but that only
the nova server create is batched. Because this happens so deep inside
Nova, it is still possible that Ironic does not receive the requests
in batches. We don't know whether it gets requests for nodes in groups
(of 10) without waiting for the first group to be up. Perhaps this
batching was intended to solve the problem of trying to PXE a large
number of nodes, but if deployers are deploying in batches themselves
then it's probably still a bug worth looking at.
It seems that Nova is already doing some batching internally.
Heat can batch resource groups, so we could use a batching parameter there.
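As a rough illustration of what batching a Heat resource group could look like, the Python sketch below builds the resource definition as a plain dict and prints it as JSON. It assumes the batch_create update policy of OS::Heat::ResourceGroup (with max_batch_size and pause_time). The helper name batched_resource_group and the nested type Example::Node are hypothetical, and this is not necessarily how TripleO's templates wire it up.

```python
import json


def batched_resource_group(count, batch_size, pause_seconds=0):
    """Build a resource group definition that is created in batches.

    Hypothetical helper for illustration; "Example::Node" is a
    placeholder type, not a real TripleO resource.
    """
    return {
        "type": "OS::Heat::ResourceGroup",
        "update_policy": {
            # Assumed Heat policy: create at most batch_size members at a
            # time, pausing pause_seconds between batches.
            "batch_create": {
                "max_batch_size": batch_size,
                "pause_time": pause_seconds,
            }
        },
        "properties": {
            "count": count,
            "resource_def": {"type": "Example::Node"},
        },
    }


if __name__ == "__main__":
    group = batched_resource_group(count=100, batch_size=10, pause_seconds=30)
    print(json.dumps(group, indent=2))
```

As with the proposed change, a policy like this only limits how many group members Heat creates at once; it does not regulate other steps.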
Note, the change proposed at 439039 is meant to set batching of the Heat resource groups and *does not* regulate other steps like introspection. Updating the BZ title accordingly.
Verified with a batch size of 2 on build 2017-12-01.4
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.