Description of problem:
In the process of scaling out an OverCloud using OSP-d, e.g. adding a compute node, there is a period when the OverCloud becomes unavailable. This is far from ideal as it affects the uptime of the OpenStack deployment. Handling the services in such a way that one node would always be available to serve user requests would be much preferred.
Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.6-121.el7ost.noarch
How reproducible:
Every time
Steps to Reproduce:
1. Add a compute node
2. Listen to your users asking why OpenStack is down
Additional info:
In some cases I've seen nova-compute timing out the connection to RabbitMQ, causing the compute node to go off-line.
Actually, on closer look, I'm un-duplicating this bug. BZ 1339559 is about not restarting services when scaling out.
This bug is related, but covers a broader topic. What I would like to see is that when services are restarted, the restart is orchestrated in a rolling way, so that there is always a running control node. In other words, even when a service restart is required, end users will not experience a total outage.
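
To make the request concrete, here is a minimal sketch of what I mean by a rolling restart: restart one node at a time and only move on once that node is serving again. This is an illustration only, not how OSP-d or tripleo-heat-templates work today. The `restart` and `is_healthy` callables, the node names, and the timeout values are hypothetical placeholders for whatever the deployment tooling would actually use.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    nodes: Iterable[str],
    restart: Callable[[str], None],      # hypothetical: restarts services on one node
    is_healthy: Callable[[str], bool],   # hypothetical: True once the node serves requests
    timeout: float = 300.0,              # assumed per-node recovery budget, in seconds
    poll_interval: float = 5.0,          # assumed delay between health checks
) -> List[str]:
    """Restart nodes strictly one at a time.

    The next node is only touched after the previous one reports healthy,
    so at most one node is down at any moment. If a node does not recover
    within the timeout, the run aborts and the remaining nodes are left
    untouched (and still running).
    """
    restarted: List[str] = []
    for node in nodes:
        restart(node)
        deadline = time.monotonic() + timeout
        while not is_healthy(node):
            if time.monotonic() >= deadline:
                raise RuntimeError(
                    f"{node} did not recover; aborting rolling restart"
                )
            time.sleep(poll_interval)
        restarted.append(node)
    return restarted
```

With three controllers, this ordering would keep two of them up throughout the restart, and a failure on the first one would stop the run before the other two are affected.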