Bug 2087231 - ping_test_gateway_ips contains empty strings in multi-cell environments, causing node validation to fail during deployment when the gateway ping tests execute
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.2 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z4
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: Harald Jensås
QA Contact: James Parker
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-05-17 16:00 UTC by James Parker
Modified: 2022-12-07 19:23 UTC (History)

Fixed In Version: openstack-tripleo-heat-templates-11.6.1-2.20220821010130.b1e9bfe.el8ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-12-07 19:22:28 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Launchpad | 1973866 | 0 | None | None | None | 2022-05-18 07:40:08 UTC
OpenStack gerrit | 842274 | 0 | None | MERGED | Filter empty string in PingTestGatewayIPsMap yagl | 2022-06-09 07:09:11 UTC
Red Hat Issue Tracker | OSP-15285 | 0 | None | None | None | 2022-05-17 18:44:55 UTC
Red Hat Product Errata | RHBA-2022:8794 | 0 | None | None | None | 2022-12-07 19:23:16 UTC

Description James Parker 2022-05-17 16:00:04 UTC
Description of problem: ansible_facts.default_ipv4.gateway appears not to be set when scaling up a multi-cell v2 TLS-e environment, which causes the ping command to fail during deployment:

2022-05-17 13:34:43.567462 | 525400f9-7a8a-d346-8ffd-000000000b9e |     TIMING | tripleo_nodes_validation : Check Default IPv4 Gateway availability | cell1-compute-1 | 0:01:35.711596 | 0.35s
2022-05-17 13:34:43.627570 | 525400f9-7a8a-d346-8ffd-000000000b9f |       TASK | Check all networks Gateway availability
2022-05-17 13:34:43.787038 | 525400f9-7a8a-d346-8ffd-000000000b9f |      FATAL | Check all networks Gateway availability | cell1-compute-0 | error={"ansible_loop_var": "gateway_ip", "changed": false, "cmd": ["ping", "-w", "10", "-c", "1"], "delta": "0:00:00.004407", "end": "2022-05-17 09:34:43.657259", "gateway_ip": "", "msg": "non-zero return code", "rc": 2, "start": "2022-05-17 09:34:43.652852", "stderr": "Usage: ping [-aAbBdDfhLnOqrRUvV64] [-c count] [-i interval] [-I interface]\n            [-m mark] [-M pmtudisc_option] [-l preload] [-p pattern] [-Q tos]\n            [-s packetsize] [-S sndbuf] [-t ttl] [-T timestamp_option]\n            [-w deadline] [-W timeout] [hop1 ...] destination\nUsage: ping -6 [-aAbBdDfhLnOqrRUvV] [-c count] [-i interval] [-I interface]\n             [-l preload] [-m mark] [-M pmtudisc_option]\n             [-N nodeinfo_option] [-p pattern] [-Q tclass] [-s packetsize]\n             [-S sndbuf] [-t ttl] [-T timestamp_option] [-w deadline]\n             [-W timeout] destination", "stderr_lines": ["Usage: ping [-aAbBdDfhLnOqrRUvV64] [-c count] [-i interval] [-I interface]", "            [-m mark] [-M pmtudisc_option] [-l preload] [-p pattern] [-Q tos]", "            [-s packetsize] [-S sndbuf] [-t ttl] [-T timestamp_option]", "            [-w deadline] [-W timeout] [hop1 ...] destination", "Usage: ping -6 [-aAbBdDfhLnOqrRUvV] [-c count] [-i interval] [-I interface]", "             [-l preload] [-m mark] [-M pmtudisc_option]", "             [-N nodeinfo_option] [-p pattern] [-Q tclass] [-s packetsize]", "             [-S sndbuf] [-t ttl] [-T timestamp_option] [-w deadline]", "             [-W timeout] destination"], "stdout": "", "stdout_lines": []}


Version-Release number of selected component (if applicable):
RHOS-16.2-RHEL-8-20220513.n.2

How reproducible:
Tried only once, in phase3 CI.

Steps to Reproduce:
1. Deploy a multi-cell v2 environment with TLS-e using the above puddle.

Actual results:
Deployment fails because the ping command is invoked without a target IP address.

Expected results:
Connectivity check is successful

Additional info:
Build: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/DFG-compute-nova-16.2_director-rhel-virthost-1cont_1comp_1cellcont_2cellcomp_1ipa-ipv4-geneve-multi-cell-tls-everywhere-phase3/64 
Failure: http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-compute-nova-16.2_director-rhel-virthost-1cont_1comp_1cellcont_2cellcomp_1ipa-ipv4-geneve-multi-cell-tls-everywhere-phase3/64/undercloud-0/home/stack/overcloud_cell_deployment.log.gz

Comment 1 Harald Jensås 2022-05-17 18:39:29 UTC
For the overcloud deployment:

/var/lib/mistral/overcloud/global_vars.yaml contains:
ping_test_gateway_ips:
  BlockStorage: []
  CephStorage: []
  Compute: []
  Controller:
  - 10.0.0.1
  ObjectStorage: []

For the Cell deployment:

/var/lib/mistral/cell1/global_vars.yaml contains:

ping_test_gateway_ips:
  CellController:
  - ''
  - ''
  - ''
  - ''
  - 10.0.0.1
  Compute:
  - ''
  - ''
  - ''

Instead of empty lists, we end up with lists of empty-string values.

Each empty string is passed as the destination argument to ping, which then fails with a usage error (rc 2) because no address to ping is given.
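
For illustration only: assuming a fix that simply drops the empty strings (as the title of the linked gerrit change suggests), the cell stack's ping_test_gateway_ips would be expected to look roughly like the following. This output is inferred from the values above, not captured from a real deployment.

```yaml
# Assumed shape of /var/lib/mistral/cell1/global_vars.yaml once empty
# strings are filtered out; roles with no gateways become empty lists,
# matching what the overcloud stack already produces.
ping_test_gateway_ips:
  CellController:
  - 10.0.0.1
  Compute: []
```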

Comment 2 Harald Jensås 2022-05-18 07:44:37 UTC
I proposed a fix: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/842274

A workaround is to set the Ansible variable `tripleo_nodes_validation_validate_gateway_icmp` to `false` using the `ExtraAnsibleHostVars` THT parameter.
This will disable the "Check all networks Gateway availability" task[1].


[1] https://opendev.org/openstack/tripleo-ansible/src/branch/master/tripleo_ansible/roles/tripleo_nodes_validation/tasks/main.yml#L47

Comment 11 Martin Schuppert 2022-05-24 07:05:01 UTC
(In reply to Harald Jensås from comment #2)
> I proposed a fix:
> https://review.opendev.org/c/openstack/tripleo-heat-templates/+/842274
> 
> A workaround is to set the ansible var
> `tripleo_nodes_validation_validate_gateway_icmp` to `false` using the
> `ExtraAnsibleHostVars` THT paramter.
> This will disable the "Check all networks Gateway availability" task[1].
> 
> 
> [1]
> https://opendev.org/openstack/tripleo-ansible/src/branch/master/
> tripleo_ansible/roles/tripleo_nodes_validation/tasks/main.yml#L47

can't we just set ValidateGatewaysIcmp: false [1]?

[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/train/common/deploy-steps.j2#L127

Comment 12 Harald Jensås 2022-05-24 08:17:11 UTC
(In reply to Martin Schuppert from comment #11)
> can't we just set ValidateGatewaysIcmp: false [1]?
> 
> [1]
> https://github.com/openstack/tripleo-heat-templates/blob/stable/train/common/
> deploy-steps.j2#L127

Oh, yes! Indeed, that would be the easier way.

Thanks Martin!


@jparker, as Martin points out, the better workaround is to set the THT parameter `ValidateGatewaysIcmp: false`.
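
A minimal sketch of that workaround as a Heat environment file. The file name is hypothetical; `ValidateGatewaysIcmp` is the parameter referenced in deploy-steps.j2 above, and `parameter_defaults` is the usual section for setting THT parameters.

```yaml
# disable-gateway-icmp.yaml (hypothetical file name)
# Skips the gateway ICMP validation so that empty gateway entries
# are never handed to ping.
parameter_defaults:
  ValidateGatewaysIcmp: false
```

The file would be passed with `-e` to the deploy command for the affected stack. It is probably best dropped again once the fixed openstack-tripleo-heat-templates package is installed, so that the gateway check runs as intended.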

Comment 34 errata-xmlrpc 2022-12-07 19:22:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.4), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8794

