Bug 1635423 - Ping sometimes fails to controller nodes.
Summary: Ping sometimes fails to controller nodes.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: beta
Target Release: 14.0 (Rocky)
Assignee: Bob Fournier
QA Contact: Alexander Chuzhoy
URL:
Whiteboard:
Depends On: 1622660
Blocks:
Reported: 2018-10-02 21:28 UTC by Bob Fournier
Modified: 2019-01-11 11:53 UTC
CC List: 7 users

Fixed In Version: openstack-tripleo-heat-templates-9.0.0-0.20180919080945.0rc1.0rc1.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1622660
Environment:
Last Closed: 2019-01-11 11:53:35 UTC
Target Upstream Version:
Embargoed:

Links
Launchpad 1793598 (last updated 2018-10-02 21:28:20 UTC)
OpenStack gerrit 607173 (last updated 2018-10-02 21:29:03 UTC)
Red Hat Product Errata RHEA-2019:0045 (last updated 2019-01-11 11:53:44 UTC)

Description Bob Fournier 2018-10-02 21:28:20 UTC
+++ This bug was initially created as a clone of Bug #1622660 +++

Description of problem:

While deploying an Overcloud that uses IPv6 addressing, we sometimes see the ping to the controllers fail as part of the deployment validation, which then results in a failed deployment. Through troubleshooting, we've determined that pings to the gateways do not fail, and that pinging the gateway appears to "fix" the ability to ping the Controller nodes.

Within our team we've tried to figure out why this would be the case. One of our team members pointed out that IPv6 does not have ARP; it has Neighbor Solicitation and Router Solicitation instead. Sometimes Neighbor Solicitation isn't enabled or doesn't work, in which case you -have- to establish connectivity to the router before any other host communication works. He said this only applies to non-EUI-64 setups (EUI-64 embeds the MAC address in the IPv6 address).
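
To make the EUI-64 case concrete, the bash sketch below derives the modified EUI-64 interface identifier that SLAAC embeds in the low 64 bits of an address. It flips the universal/local bit of the first MAC octet and inserts ff:fe in the middle. The function name mac_to_eui64 is hypothetical and is not part of TripleO. Addresses whose low 64 bits do not follow this pattern are the "non-EUI-64" setups the hypothesis refers to.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of TripleO): derive the modified EUI-64
# interface identifier (low 64 bits of a SLAAC address) from a MAC address.
mac_to_eui64() {
  local mac=${1,,}   # lower-case the input (bash 4+)
  local re='^([0-9a-f]{2}:){5}[0-9a-f]{2}$'
  if [[ ! $mac =~ $re ]]; then
    echo "invalid MAC address: $1" >&2
    return 1
  fi

  local -a o
  IFS=: read -r -a o <<< "$mac"

  # Flip the universal/local bit (0x02) of the first octet.
  local first=$(( 16#${o[0]} ^ 0x02 ))

  # Insert ff:fe between the OUI half and the device-specific half.
  printf '%02x%s:%sff:fe%s:%s%s\n' \
    "$first" "${o[1]}" "${o[2]}" "${o[3]}" "${o[4]}" "${o[5]}"
}
```

For example, mac_to_eui64 52:54:00:12:34:56 prints 5054:00ff:fe12:3456, so a SLAAC address on that interface would end in ...:5054:ff:fe12:3456.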

We're not sure whether that is what is happening here. However, we do know that if we continually run an Ansible playbook that connects to the overcloud nodes and has them ping their default gateways during deployment, the deployment succeeds 100% of the time.
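
A single-pass shell version of that gateway-pinging workaround might look like the following. This is a sketch, not the reporters' playbook. It assumes the iproute2 `ip -6 route show default` output format, handles only single-next-hop default routes, and uses a `ping6` binary that can be overridden through PING6_CMD, a variable invented for this sketch. Link-local (fe80::) next hops are scoped to their interface, because they cannot be pinged without one.

```shell
#!/usr/bin/env bash
# Sketch of the gateway-pinging workaround (assumptions: iproute2 output
# format, single-next-hop default routes, ping6 or hypothetical PING6_CMD).

# Read `ip -6 route show default` output on stdin and print one pingable
# gateway per line. Link-local next hops get a %interface scope suffix.
default_gateways() {
  awk '
    $1 == "default" && $2 == "via" {
      gw = $3; dev = ""
      for (i = 4; i < NF; i++) if ($i == "dev") dev = $(i + 1)
      if (gw ~ /^fe80:/ && dev != "") gw = gw "%" dev
      print gw
    }'
}

# Ping each IPv6 default gateway once and report whether it replied.
warm_up_gateways() {
  local gw
  ip -6 route show default | default_gateways | while read -r gw; do
    if "${PING6_CMD:-ping6}" -c 1 -w 5 "$gw" >/dev/null 2>&1; then
      echo "gateway $gw replied"
    else
      echo "gateway $gw did not reply" >&2
    fi
  done
}
```

The reporters ran their equivalent repeatedly during deployment. To mimic that, call warm_up_gateways in a loop on each node rather than once.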

Looking at the all-nodes.sh validation script:

/usr/share/openstack-tripleo-heat-templates/validation-scripts/all-nodes.sh

We noticed that it performs two tests, ping_controller_ips and ping_default_gateways, in that order. We believe swapping them will correct this issue.

This BZ is to have those two tests swapped so that the gateways are pinged before the controllers.
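
The requested ordering can be sketched as a single bash function. The name ping_gateways_first and the PING_CMD override are hypothetical. They do not reflect how all-nodes.sh actually implements ping_default_gateways or ping_controller_ips. The sketch shows only that every gateway is pinged before any controller address, and that this version stops at the first address that does not answer.

```shell
#!/usr/bin/env bash
# Hypothetical sketch (not the actual all-nodes.sh code): ping every
# gateway before any controller address.

# ping_gateways_first GATEWAYS CONTROLLERS
#   GATEWAYS, CONTROLLERS: space-separated address lists, left unquoted in
#   the loops on purpose so they split into individual addresses.
# The ping command defaults to ping6 and can be overridden with PING_CMD.
# Returns 1 at the first address that does not answer, 0 otherwise.
ping_gateways_first() {
  local gateways=$1 controllers=$2 ip
  local ping_cmd=${PING_CMD:-ping6}

  # Gateways first, so the node has talked to its router before the
  # controller checks start.
  for ip in $gateways; do
    if ! "$ping_cmd" -c 1 -w 10 "$ip" >/dev/null 2>&1; then
      echo "FAILURE: gateway $ip is not reachable" >&2
      return 1
    fi
  done

  # Controllers second.
  for ip in $controllers; do
    if ! "$ping_cmd" -c 1 -w 10 "$ip" >/dev/null 2>&1; then
      echo "FAILURE: controller $ip is not reachable" >&2
      return 1
    fi
  done
  return 0
}
```

A caller would pass both lists at once, for example ping_gateways_first "$gateway_ips" "$controller_ips". Keeping the two loops in one function makes the ordering explicit rather than depending on the order of two separate calls.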

Comment 4 Alexander Chuzhoy 2018-10-18 14:52:34 UTC
Environment:
openstack-tripleo-heat-templates-9.0.0-0.20181001174822.90afd18.0rc2.el7ost.noarch

Looking at lines 113-120 of the file:
/usr/share/openstack-tripleo-heat-templates/validation-scripts/all-nodes.sh

    ping_default_gateways
    ping_controller_ips "$ping_test_ips"
    if [[ $validate_fqdn == "True" ]];then
      fqdn_check
    fi
    if [[ $validate_ntp == "True" ]];then
      ntp_check
    fi

Comment 5 Alexander Chuzhoy 2018-10-18 14:55:56 UTC
Verified based on comment #4.

Also deployed the overcloud (OC) successfully.

Comment 6 Harsh 2018-10-18 15:06:35 UTC
This fix does not work for me 100% of the time. It only works sometimes.

Comment 7 Bob Fournier 2018-10-18 16:53:40 UTC
>This fix does not work for me 100% of the time. It only works sometimes.

Can you clarify this?
Did you make the change manually to switch the order of pings?
What kind of network setup are you using - IPv6?
Is the ping success rate different when the order of pings was not switched?
Which of the pings are failing - to the gateway or to controllers?

Comment 8 Harsh 2018-10-18 17:01:23 UTC
Yes, I manually modified the files mentioned. My setup is spine-leaf architecture with IPv6. It fails on the step where compute hosts try to ping the controller IPs.
If I run the Ansible playbook that connects to the overcloud nodes and has them ping their default gateways during deployment, the deployment succeeds 100% of the time.

Comment 10 errata-xmlrpc 2019-01-11 11:53:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045

