Description of problem:
A 26-node overcloud deployment failed with a timeout when Heat did not receive a 'done' signal from one of the Ceph storage nodes while the node's network configuration was being applied.
The node's network configuration is applied as expected, but Heat does not acknowledge it.
The deployment details are: 3 controller nodes, 3 compute nodes, and 20 Ceph storage nodes.
This deployment has been tested twice with identical results.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Deploy an overcloud with the same amount of nodes as specified above
Actual results:
The deployment fails with a timeout 4 hours after starting.
Expected results:
Either the deployment fails sooner with a specific error identifying the problematic node, or the deployment discards the node, chooses another one (if available), and continues.
Attached are the logs from the undercloud.
Created attachment 1238161 [details]
undercloud logs and /var/log/messages from the problematic node
changed priority and severity.
There are multiple problems:
- Why did it take 4 hours to give up on the defective node?
- Is partial success an option here? It should be, especially since it is possible to fix the problem with the missing node and add it into the cluster later, right?
In any sufficiently large cluster there will *almost always* be a missing block device, network interface, etc., and the problem should be reported at the end but should not prevent the deployment from succeeding in a reasonable amount of time. So this is in fact an OpenStack scalability problem, which makes it a high priority.
(In reply to Ben England from comment #5)
> changed priority and severity.
> There are multiple problems
> - why did it take 4 hours to give up on the defective node?
This is not specific to Ceph; 4 hours is the default timeout, Ben. If you feel that this is not acceptable, suggest a suitable timeout. Larger deployments get pretty close to the 4-hour timeout, and if the timeout is increased, keystone needs to be updated as well so that tokens do not expire mid-deployment.
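The two knobs mentioned above can be sketched as follows. `--timeout` is the TripleO deploy option (in minutes); the specific values shown here are illustrative, not recommendations:

```shell
# The overall deployment timeout is a deploy-time option, in minutes
# (default 240, i.e. the 4 hours seen in this bug); 360 is illustrative.
openstack overcloud deploy --templates --timeout 360

# If the timeout is raised past the undercloud keystone token lifetime,
# raise that lifetime too, in /etc/keystone/keystone.conf (seconds):
#   [token]
#   expiration = 21600
```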
> - is partial success an option here? It should be. Especially since it is
> possible to fix the problem with the missing node and add it into the
> cluster later, right?
This is not an option today. We have had multiple discussions with the DF DFG team on how to have acceptable tolerance for failures during deployment.
> In any sufficiently large cluster there will *almost always* be a missing
> block device, network interface, etc. and the problem should be reported at
> the end but should not prevent the deployment from succeeding in a
> reasonable amount of time. So this problem is in fact a OpenStack
> scalability problem, which makes it a high priority.
The timeout should not apply to the entire deployment; it should apply to the next batch of nodes. If you are doing a 1000-node deployment, 4 hours is not enough, nor should it be. We really don't care if it takes 12 hours, or 24 hours, as long as the deployment is steadily making progress at a reasonable rate.
We don't care whether every single host gets deployed successfully. For really large deployments it is likely that *some* host(s) will malfunction; that's just expected behavior in a very large cluster. Perhaps there could be a percentage of hosts that we expect to deploy successfully, defaulting to something like 95% (rounded up, so a cluster of only 4 hosts would be expected to deploy all of them).
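The rounding rule suggested above can be sketched with a small helper. `required_successes` is hypothetical, not an existing TripleO parameter or command:

```shell
# Hypothetical helper: minimum number of nodes that must deploy
# successfully under a 95% threshold, rounding up so that a 4-node
# cluster still requires all 4 nodes.
required_successes() {
    total=$1
    # integer ceiling of total * 95 / 100
    echo $(( (total * 95 + 99) / 100 ))
}

required_successes 4     # -> 4
required_successes 26    # -> 25
required_successes 1000  # -> 950
```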
Dropping this bz on the advice of the Perf & Scale team dealing with OpenStack.
Joe Talerico's comment: "The timeout is configurable, unless something else has changed? Regardless, we would never recommend deploying all 1000 nodes at once. We have seen that building a large number of hosts is error-prone."
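A hedged sketch of the batched approach Joe describes, using the standard `CephStorageCount` parameter from tripleo-heat-templates; the file name `node-counts.yaml` and the batch sizes are illustrative, and each re-run of the deploy command performs a stack update that scales the cluster out:

```shell
# Deploy a small initial footprint rather than all Ceph nodes at once.
# node-counts.yaml is a hypothetical local environment file.
cat > node-counts.yaml <<'EOF'
parameter_defaults:
  ControllerCount: 3
  ComputeCount: 3
  CephStorageCount: 5      # first batch of Ceph nodes
EOF

openstack overcloud deploy --templates -e node-counts.yaml

# Then raise CephStorageCount (e.g. 10, 15, 20) and re-run the same
# deploy command; a failure in one batch does not waste the whole run.
```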