Bug 1301360 - [RFE][UX] validate that the nodes are pingable
Summary: [RFE][UX] validate that the nodes are pingable
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Emilien Macchi
QA Contact: Udi Kalifon
URL:
Whiteboard: NeedsAllocation
Depends On:
Blocks: 1442136
Reported: 2016-01-24 13:15 UTC by Udi Kalifon
Modified: 2018-09-28 16:02 UTC

Fixed In Version: openstack-heat-templates-0-0.5.1e6015dgit.el7ost
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-28 16:02:46 UTC
Target Upstream Version:
Embargoed:



Description Udi Kalifon 2016-01-24 13:15:43 UTC
Description of problem:
When the overcloud is being deployed and the nodes are not pingable for some reason, the deployment hangs until it times out after 4 hours. Nodes may not be pingable if their nic-configs are wrong, if the NIC order changed, if asymmetric routing was not enabled, or for a million other reasons...

As soon as the deployment is at a state where the nodes *should* be pingable, and before the deployment proceeds any further and tries to connect to them or receive any call-backs from them, the director should test that the nodes can be pinged. If the ping fails the deployment should stop immediately, and print a descriptive error message so the user will know exactly what to troubleshoot.
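To make the request concrete, here is a minimal sketch of the fail-fast behavior described above. It is not the director's code: the function names, the error class, and the example addresses are all hypothetical, and `ping_once` assumes the Linux iputils `ping` flags `-c` (count) and `-W` (reply timeout in seconds).

```python
import subprocess
from typing import Callable, Dict


def ping_once(address: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo request with the system ping (assumes Linux iputils flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


class NodesUnreachableError(RuntimeError):
    """Raised when one or more nodes do not answer a ping (hypothetical name)."""


def validate_nodes_pingable(
    nodes: Dict[str, str],
    ping: Callable[[str], bool] = ping_once,
) -> None:
    """Ping every node once and raise a descriptive error listing all failures."""
    failed = [(name, addr) for name, addr in nodes.items() if not ping(addr)]
    if failed:
        details = ", ".join(f"{name} ({addr})" for name, addr in failed)
        raise NodesUnreachableError(
            f"Cannot ping {len(failed)} of {len(nodes)} nodes: {details}. "
            "Check nic-configs, NIC ordering and routing before redeploying."
        )
```

The `ping` parameter is injectable so the check can be exercised without a network. A caller would run `validate_nodes_pingable({"compute-0": "192.0.2.11"})` at the point where the nodes should be reachable and abort the deployment if it raises.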


Version-Release number of selected component (if applicable):
7.x and 8.0 beta


How reproducible:
100%

Comment 3 Mike Burns 2016-04-07 21:03:37 UTC
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.

Comment 5 Udi Kalifon 2016-09-05 12:16:09 UTC
This seems to be implemented already. The code is here:
https://github.com/openstack/tripleo-heat-templates/blob/master/validation-scripts/all-nodes.sh
It is called from the templates here:
https://github.com/openstack/tripleo-heat-templates/blob/b8f154be31c5847dc376a72cf9c0835aa0001afd/overcloud.yaml#L923-L961

Ok to close the bug or is there anything else that needs to be implemented?

Comment 9 Udi Kalifon 2016-11-08 08:51:21 UTC
To test this fix, I ssh'ed to several of the nodes after the deployment finished, and ran the command 'sudo journalctl -u os-collect-config'. You can find lines like this:

Trying to ping 172.16.0.26 for local network 172.16.0.0/24.
Ping to 172.16.0.26 succeeded.
Trying to ping default gateway 10.35.163.254...Ping to 10.35.163.254 succeeded.
Trying to ping default gateway 10.35.190.254...Ping to 10.35.190.254 succeeded.
Trying to ping default gateway 172.16.0.1...Ping to 172.16.0.1 succeeded.
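For anyone repeating this check on many nodes, the log lines above can be summarized with a small script. This is only a sketch based on the two message shapes quoted in this comment ("Trying to ping ..." and "Ping to ... succeeded"). The function name is hypothetical, and the wording of a failure message is not assumed: a target simply counts as unanswered if no success line follows it.

```python
import re
from typing import Set, Tuple

# Patterns derived from the journal lines quoted in this comment.
ATTEMPT_RE = re.compile(r"Trying to ping (?:default gateway )?(\d+\.\d+\.\d+\.\d+)")
SUCCESS_RE = re.compile(r"Ping to (\d+\.\d+\.\d+\.\d+) succeeded")


def summarize_ping_log(log_text: str) -> Tuple[Set[str], Set[str]]:
    """Return (attempted, unanswered) IPv4 addresses found in the log text."""
    attempted = set(ATTEMPT_RE.findall(log_text))
    succeeded = set(SUCCESS_RE.findall(log_text))
    return attempted, attempted - succeeded
```

The input would be the saved output of 'sudo journalctl -u os-collect-config'. The patterns are searched anywhere in the text, so the timestamp and host prefix that journalctl adds to each line does not matter.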

However, I couldn't find evidence that the nodes are pinging each other, or that the undercloud is pinging the nodes (how do I even check what the undercloud pinged?). It seems like the only pings are from the node to itself, and from the node to the undercloud. This is not the validation we wanted.

Comment 11 Jason E. Rist 2016-11-29 20:43:59 UTC
Tomas, how does Udi test this more thoroughly to make sure it's not FailedQA?

Comment 12 Tomas Sedovic 2016-12-01 13:44:34 UTC
It's not about more thorough testing. Rather, it seems that the checks that are in the Heat templates don't actually implement the RFE even though we initially thought they did.

Can we expect the nodes in general to be able to ping each other? I think we should only check that the nodes can reach the controller and vice versa. I'm not aware of any need for two compute nodes to talk to each other directly, and with isolated networks, nodes from different roles wouldn't be able to reach one another by design.

So what should this check entail? Controller pinging every node? Anything else?

Comment 13 Udi Kalifon 2016-12-01 13:50:32 UTC
The most important checks are:
1) That the undercloud can ping all nodes
2) That the nodes can ping the controller and vice versa
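The two checks can be written down as a list of (source, target) ping pairs. The sketch below is illustrative only: the function name, the "Controller" role key, and the node names are assumptions, not anything taken from tripleo-heat-templates. It deliberately leaves out direct pings between non-controller nodes, since isolated networks may make those fail by design.

```python
from typing import Dict, List, Tuple


def build_ping_plan(
    undercloud: str,
    nodes_by_role: Dict[str, List[str]],
    controller_role: str = "Controller",  # hypothetical role key
) -> List[Tuple[str, str]]:
    """Return the (source, target) pairs needed for checks 1 and 2."""
    controllers = nodes_by_role.get(controller_role, [])
    all_nodes = [node for nodes in nodes_by_role.values() for node in nodes]

    # Check 1: the undercloud pings every overcloud node.
    plan = [(undercloud, node) for node in all_nodes]

    # Check 2: every node pings each controller, and each controller pings it back.
    for node in all_nodes:
        for ctrl in controllers:
            if node == ctrl:
                continue
            plan.append((node, ctrl))
            # The reverse pair for another controller is added on that
            # controller's own iteration, so only add it for other roles here.
            if node not in controllers:
                plan.append((ctrl, node))
    return plan
```

With one controller and two computes this gives seven pairs: three from the undercloud, plus two in each direction between the controller and the computes.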

Comment 16 Anandeep Pannu 2016-12-07 16:30:28 UTC
We should implement this as Udi noted in Comment #13:
1) That the undercloud can ping all nodes
2) That the nodes can ping the controller and vice versa

Comment 20 Jason E. Rist 2017-05-03 14:26:42 UTC
Maros - is this still being worked on - do you need anything from DFG:UI?

Comment 25 Jason E. Rist 2018-03-13 15:58:57 UTC
Changing to the DF DFG. This might be working?

Comment 26 Emilien Macchi 2018-03-14 14:07:47 UTC
We already validate that controllers and gateways are reachable. Do we really need to ping all computes, etc.?

See tripleo-heat-templates/validation-scripts/all-nodes.sh script.

Comment 27 Udi Kalifon 2018-03-15 07:28:21 UTC
If we're only testing the controllers, and not the computes and the other roles, then yes - we should add that test too. Otherwise we'll wait 4 hours for a timeout just to see what was wrong. Thanks.

Comment 35 Jaromir Coufal 2018-09-19 07:05:21 UTC
This was moved over to DFG:DF without reasoning. Seems like this RFE was worked on and is meant to be in Validations. Moving over to the validations team for consideration.

Comment 36 Michael Barnett 2018-09-28 16:02:46 UTC
We do not have resources to allocate to this issue. If this is still an issue, please open a new bug for the release it is affecting.

