Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1301360

Summary: [RFE][UX] validate that the nodes are pingable
Product: Red Hat OpenStack
Reporter: Udi Kalifon <ukalifon>
Component: openstack-tripleo-heat-templates
Assignee: Emilien Macchi <emacchi>
Status: CLOSED WONTFIX
QA Contact: Udi Kalifon <ukalifon>
Severity: medium
Docs Contact:
Priority: medium
Version: 8.0 (Liberty)
CC: apannu, beth.white, emacchi, hbrock, jcoufal, jrist, jschluet, mbarnett, mburns, morazi, rhel-osp-director-maint, sclewis, ukalifon
Target Milestone: ---
Keywords: FutureFeature, Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: NeedsAllocation
Fixed In Version: openstack-heat-templates-0-0.5.1e6015dgit.el7ost
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-28 16:02:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1442136

Description Udi Kalifon 2016-01-24 13:15:43 UTC
Description of problem:
When deploying the overcloud, if the nodes are not pingable for any reason, the deployment hangs until it times out after 4 hours. Nodes may not be pingable if their nic-configs are wrong, the NIC order changed, asymmetric routing was not enabled, or for many other reasons.

As soon as the deployment reaches a state where the nodes *should* be pingable, and before it proceeds any further and tries to connect to them or receive any call-backs from them, the director should test that the nodes can be pinged. If the ping fails, the deployment should stop immediately and print a descriptive error message so the user knows exactly what to troubleshoot.
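
The requested behaviour could be sketched roughly as follows. This is only an illustration, not the director's code: the function names (ping_once, unreachable_nodes, validate_nodes_pingable) and the error text are hypothetical, and it assumes the Linux iputils "ping" command, where -c is the packet count and -W the reply timeout in seconds.

```python
import subprocess
from typing import Callable, Iterable, List


def ping_once(address: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo request; True if the ping command exits with 0."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def unreachable_nodes(
    addresses: Iterable[str], ping: Callable[[str], bool] = ping_once
) -> List[str]:
    """Return the addresses that did not answer, in the order given."""
    return [address for address in addresses if not ping(address)]


def validate_nodes_pingable(
    addresses: Iterable[str], ping: Callable[[str], bool] = ping_once
) -> None:
    """Fail fast with a descriptive error instead of waiting for a timeout."""
    failed = unreachable_nodes(addresses, ping)
    if failed:
        raise RuntimeError(
            "Nodes not pingable: " + ", ".join(failed)
            + ". Check nic-configs, NIC ordering and routing."
        )
```

The ping function is passed in as a parameter so the check can be exercised without real network access.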


Version-Release number of selected component (if applicable):
7.x and 8.0 beta


How reproducible:
100%

Comment 3 Mike Burns 2016-04-07 21:03:37 UTC
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.

Comment 5 Udi Kalifon 2016-09-05 12:16:09 UTC
This seems to be implemented already. The code is here:
https://github.com/openstack/tripleo-heat-templates/blob/master/validation-scripts/all-nodes.sh
It is called from the templates here:
https://github.com/openstack/tripleo-heat-templates/blob/b8f154be31c5847dc376a72cf9c0835aa0001afd/overcloud.yaml#L923-L961

Is it OK to close the bug, or does anything else need to be implemented?

Comment 9 Udi Kalifon 2016-11-08 08:51:21 UTC
To test this fix, I ssh'ed to several of the nodes after the deployment finished, and ran the command 'sudo journalctl -u os-collect-config'. You can find lines like this:

Trying to ping 172.16.0.26 for local network 172.16.0.0/24.
Ping to 172.16.0.26 succeeded.
Trying to ping default gateway 10.35.163.254...Ping to 10.35.163.254 succeeded.
Trying to ping default gateway 10.35.190.254...Ping to 10.35.190.254 succeeded.
Trying to ping default gateway 172.16.0.1...Ping to 172.16.0.1 succeeded.

However, I couldn't find evidence that the nodes are pinging each other, or that the undercloud is pinging the nodes (how do I even check what the undercloud pinged?). It seems like the only pings are from the node to itself, and from the node to the undercloud. This is not the validation we wanted.
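
One way to check which addresses a node actually tried to ping is to parse the journal output. The sketch below is an assumption-laden helper, not part of tripleo-heat-templates: it only matches the two message formats quoted above ("Trying to ping ..." and "Ping to ... succeeded"), the function names are hypothetical, and the wording of failure messages is not shown in this bug, so failures are not parsed.

```python
import re
from typing import Dict, Iterable, List, Set

# IPv4 address as it appears in the quoted log lines.
IP = r"\d{1,3}(?:\.\d{1,3}){3}"
ATTEMPT_RE = re.compile(rf"Trying to ping (?:default gateway )?({IP})")
SUCCESS_RE = re.compile(rf"Ping to ({IP}) succeeded")


def summarize_pings(log_text: str) -> Dict[str, Set[str]]:
    """Collect the addresses that were attempted and that succeeded."""
    return {
        "attempted": set(ATTEMPT_RE.findall(log_text)),
        "succeeded": set(SUCCESS_RE.findall(log_text)),
    }


def never_pinged(log_text: str, expected_peers: Iterable[str]) -> List[str]:
    """Return the expected peers with no ping attempt in the log."""
    attempted = summarize_pings(log_text)["attempted"]
    return sorted(set(expected_peers) - attempted)
```

Feeding it the output of 'sudo journalctl -u os-collect-config' together with the list of addresses the node was expected to reach shows which peers were never tested.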

Comment 11 Jason E. Rist 2016-11-29 20:43:59 UTC
Tomas, how does Udi test this more thoroughly to make sure it's not FailedQA?

Comment 12 Tomas Sedovic 2016-12-01 13:44:34 UTC
It's not about more thorough testing. Rather, it seems that the checks that are in the Heat templates don't actually implement the RFE even though we initially thought they did.

Can we expect nodes in general to be able to ping each other? I think we should only check that the nodes can reach the controller and vice versa. I'm not aware of any need for two compute nodes to talk to each other directly, and with isolated networks, nodes from different roles wouldn't be able to reach one another by design.

So what should this check entail? Controller pinging every node? Anything else?

Comment 13 Udi Kalifon 2016-12-01 13:50:32 UTC
The most important checks are:
1) That the undercloud can ping all nodes
2) That the nodes can ping the controller and vice versa
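
The two checks above can be written down as a list of (source, target) ping pairs. This is a minimal sketch with hypothetical names (required_ping_pairs and the node labels in the usage note); it only enumerates the pairs and does not run any ping.

```python
from typing import List, Sequence, Tuple


def required_ping_pairs(
    undercloud: str, controllers: Sequence[str], other_nodes: Sequence[str]
) -> List[Tuple[str, str]]:
    """List the (source, target) pairs covered by the two checks."""
    pairs: List[Tuple[str, str]] = []
    # Check 1: the undercloud pings every overcloud node.
    for node in list(controllers) + list(other_nodes):
        pairs.append((undercloud, node))
    # Check 2: every non-controller node pings each controller, and vice versa.
    for node in other_nodes:
        for controller in controllers:
            pairs.append((node, controller))
            pairs.append((controller, node))
    return pairs
```

For example, with one controller "ctrl0" and two computes "comp0" and "comp1", the list contains three undercloud pairs and four node/controller pairs, and no compute-to-compute pairs.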

Comment 16 Anandeep Pannu 2016-12-07 16:30:28 UTC
We should implement this as Udi noted in Comment #13:
1) That the undercloud can ping all nodes
2) That the nodes can ping the controller and vice versa

Comment 20 Jason E. Rist 2017-05-03 14:26:42 UTC
Maros, is this still being worked on? Do you need anything from DFG:UI?

Comment 25 Jason E. Rist 2018-03-13 15:58:57 UTC
Changing to DFG:DF. This might be working?

Comment 26 Emilien Macchi 2018-03-14 14:07:47 UTC
We already validate that controllers and gateways are reachable. Do we really need to ping all computes, etc.?

See tripleo-heat-templates/validation-scripts/all-nodes.sh script.

Comment 27 Udi Kalifon 2018-03-15 07:28:21 UTC
If we're only testing the controllers, and not the computes and the other roles, then yes - we should add that test too. Otherwise we'll wait 4 hours for a timeout just to see what was wrong. Thanks.

Comment 35 Jaromir Coufal 2018-09-19 07:05:21 UTC
This was moved over to DFG:DF without reasoning. Seems like this RFE was worked on and is meant to be in Validations. Moving over to the validations team for consideration.

Comment 36 Michael Barnett 2018-09-28 16:02:46 UTC
We do not have resources to allocate to this issue. If this is still an issue, please open a new bug for the release it is affecting.