Description of problem: When deploying the overcloud, if for any reason the nodes are not pingable, the deployment hangs until it times out after 4 hours. Nodes may be unreachable because their nic-configs are wrong, the NIC order changed, asymmetric routing was not enabled, or for many other reasons. As soon as the deployment reaches a state where the nodes *should* be pingable, and before it proceeds any further and tries to connect to them or receive any call-backs from them, the director should verify that the nodes can be pinged. If the ping fails, the deployment should stop immediately and print a descriptive error message so the user knows exactly what to troubleshoot.

Version-Release number of selected component (if applicable): 7.x and 8.0 beta

How reproducible: 100%
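The fail-fast behaviour requested here could look roughly like the sketch below. This is illustrative only, not actual director/TripleO code; the `check_nodes` function, the `PING` override variable, and the error wording are all assumptions:

```shell
#!/bin/bash
# Illustrative sketch only -- not an actual TripleO validation script.
# PING is overridable so the logic can be exercised without real ICMP.
PING="${PING:-ping -c 1 -W 2}"

# Ping every address given as an argument. On the first failure, print a
# descriptive error naming the unreachable node so the operator knows what
# to troubleshoot, and return non-zero so the deployment can abort early.
check_nodes() {
  for ip in "$@"; do
    if $PING "$ip" > /dev/null 2>&1; then
      echo "Ping to $ip succeeded."
    else
      echo "ERROR: node $ip is not reachable." >&2
      echo "Check its nic-configs, NIC ordering and routing before retrying." >&2
      return 1
    fi
  done
}
```

Run right at the point where the nodes are first expected to be reachable, e.g. `check_nodes 172.16.0.26 172.16.0.27 || exit 1`, this would fail in seconds instead of after a 4-hour timeout.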
This bug did not make the OSP 8.0 release. It is being deferred to OSP 10.
This seems to be implemented already. The code is here: https://github.com/openstack/tripleo-heat-templates/blob/master/validation-scripts/all-nodes.sh It is called from the templates here: https://github.com/openstack/tripleo-heat-templates/blob/b8f154be31c5847dc376a72cf9c0835aa0001afd/overcloud.yaml#L923-L961 Ok to close the bug or is there anything else that needs to be implemented?
To test this fix, I ssh'ed to several of the nodes after the deployment finished and ran 'sudo journalctl -u os-collect-config'. You can find lines like these:

Trying to ping 172.16.0.26 for local network 172.16.0.0/24.
Ping to 172.16.0.26 succeeded.
Trying to ping default gateway 10.35.163.254...Ping to 10.35.163.254 succeeded.
Trying to ping default gateway 10.35.190.254...Ping to 10.35.190.254 succeeded.
Trying to ping default gateway 172.16.0.1...Ping to 172.16.0.1 succeeded.

However, I couldn't find evidence that the nodes are pinging each other, or that the undercloud is pinging the nodes (how do I even check what the undercloud pinged?). It seems like the only pings are from the node to itself, and from the node to the undercloud. This is not the validation we wanted.
Tomas, how should Udi test this more thoroughly to make sure it's not FailedQA?
It's not about more thorough testing. Rather, it seems that the checks that are in the Heat templates don't actually implement the RFE, even though we initially thought they did. Can we expect the nodes in general to be able to ping each other? I think we should only check that the nodes can reach the controller and vice versa. I'm not aware of any need for two compute nodes to talk to each other directly, and with isolated networks, nodes from different roles wouldn't be able to reach one another by design. So what should this check entail? Controller pinging every node? Anything else?
The most important checks are:
1) That the undercloud can ping all nodes
2) That the nodes can ping the controller and vice versa
We should implement this as Udi noted in comment #13:
1) That the undercloud can ping all nodes
2) That the nodes can ping the controller and vice versa
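From the undercloud side, those two checks could be sketched as below. This is a sketch under stated assumptions, not an actual validation: the controller VIP and node IPs are passed in by hand, and the `PING`/`SSH` command variables exist only so the logic can be tested (TripleO nodes do accept ssh as `heat-admin`, but this function is hypothetical):

```shell
#!/bin/bash
# Illustrative sketch only. PING and SSH are overridable for testing.
PING="${PING:-ping -c 1 -W 2}"
SSH="${SSH:-ssh -o BatchMode=yes -o ConnectTimeout=5}"

# Check 1: the undercloud can ping every node.
# Check 2: every node can ping the controller VIP (checked over ssh).
# Reports every failure rather than stopping at the first one.
check_connectivity() {
  local controller_vip="$1"; shift
  local rc=0
  for ip in "$@"; do
    $PING "$ip" > /dev/null 2>&1 \
      || { echo "ERROR: undercloud cannot ping node $ip" >&2; rc=1; }
    $SSH "heat-admin@$ip" "$PING $controller_vip" > /dev/null 2>&1 \
      || { echo "ERROR: node $ip cannot ping controller $controller_vip" >&2; rc=1; }
  done
  return $rc
}
```

The reverse direction (controller to nodes) could reuse the same ping loop executed on the controller itself.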
Maros - is this still being worked on - do you need anything from DFG:UI?
Changing to the DF DFG; this might already be working?
We already validate that controllers and gateways are reachable, do we really need to ping all computes, etc? See tripleo-heat-templates/validation-scripts/all-nodes.sh script.
If we're only testing the controllers, and not the computes and the other roles, then yes - we should add that test too. Otherwise we'll wait 4 hours for a timeout just to see what was wrong. Thanks.
This was moved over to DFG:DF without explanation. It seems this RFE was worked on and is meant to be part of Validations. Moving it over to the validations team for consideration.
We do not have resources to allocate to this issue. If this is still an issue please open a new bug for the release it is affecting.