
Bug 2087231

Summary: ping_test_gateway_ips contains empty strings in multi-cell environments, which causes node validation to fail during deployment when the gateway ping tests run
Product: Red Hat OpenStack
Reporter: James Parker <jparker>
Component: openstack-tripleo-heat-templates
Assignee: Harald Jensås <hjensas>
Status: CLOSED ERRATA
QA Contact: James Parker <jparker>
Severity: high
Docs Contact:
Priority: high
Version: 16.2 (Train)
CC: bdobreli, hjensas, igallagh, mburns, mkrcmari, mschuppe, owalsh, ramishra
Target Milestone: z4
Keywords: Triaged
Target Release: 16.2 (Train on RHEL 8.4)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-11.6.1-2.20220821010130.b1e9bfe.el8ost
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-12-07 19:22:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description James Parker 2022-05-17 16:00:04 UTC
Description of problem: ansible_facts.default_ipv4.gateway appears not to be set when scaling up a multi-cell v2 TLS-e environment, which causes the ping command to fail during deployment:

2022-05-17 13:34:43.567462 | 525400f9-7a8a-d346-8ffd-000000000b9e |     TIMING | tripleo_nodes_validation : Check Default IPv4 Gateway availability | cell1-compute-1 | 0:01:35.711596 | 0.35s
2022-05-17 13:34:43.627570 | 525400f9-7a8a-d346-8ffd-000000000b9f |       TASK | Check all networks Gateway availability
2022-05-17 13:34:43.787038 | 525400f9-7a8a-d346-8ffd-000000000b9f |      FATAL | Check all networks Gateway availability | cell1-compute-0 | error={"ansible_loop_var": "gateway_ip", "changed": false, "cmd": ["ping", "-w", "10", "-c", "1"], "delta": "0:00:00.004407", "end": "2022-05-17 09:34:43.657259", "gateway_ip": "", "msg": "non-zero return code", "rc": 2, "start": "2022-05-17 09:34:43.652852", "stderr": "Usage: ping [-aAbBdDfhLnOqrRUvV64] [-c count] [-i interval] [-I interface]\n            [-m mark] [-M pmtudisc_option] [-l preload] [-p pattern] [-Q tos]\n            [-s packetsize] [-S sndbuf] [-t ttl] [-T timestamp_option]\n            [-w deadline] [-W timeout] [hop1 ...] destination\nUsage: ping -6 [-aAbBdDfhLnOqrRUvV] [-c count] [-i interval] [-I interface]\n             [-l preload] [-m mark] [-M pmtudisc_option]\n             [-N nodeinfo_option] [-p pattern] [-Q tclass] [-s packetsize]\n             [-S sndbuf] [-t ttl] [-T timestamp_option] [-w deadline]\n             [-W timeout] destination", "stderr_lines": ["Usage: ping [-aAbBdDfhLnOqrRUvV64] [-c count] [-i interval] [-I interface]", "            [-m mark] [-M pmtudisc_option] [-l preload] [-p pattern] [-Q tos]", "            [-s packetsize] [-S sndbuf] [-t ttl] [-T timestamp_option]", "            [-w deadline] [-W timeout] [hop1 ...] destination", "Usage: ping -6 [-aAbBdDfhLnOqrRUvV] [-c count] [-i interval] [-I interface]", "             [-l preload] [-m mark] [-M pmtudisc_option]", "             [-N nodeinfo_option] [-p pattern] [-Q tclass] [-s packetsize]", "             [-S sndbuf] [-t ttl] [-T timestamp_option] [-w deadline]", "             [-W timeout] destination"], "stdout": "", "stdout_lines": []}


Version-Release number of selected component (if applicable):
RHOS-16.2-RHEL-8-20220513.n.2

How reproducible:
Only have tried it once in phase3 CI

Steps to Reproduce:
1. Deploy multi-cells v2 environment with TLS-e with above puddle.

Actual results:
Deployment fails because the ping command is run without a target IP address.

Expected results:
Connectivity check is successful

Additional info:
Build: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/DFG-compute-nova-16.2_director-rhel-virthost-1cont_1comp_1cellcont_2cellcomp_1ipa-ipv4-geneve-multi-cell-tls-everywhere-phase3/64 
Failure: http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-compute-nova-16.2_director-rhel-virthost-1cont_1comp_1cellcont_2cellcomp_1ipa-ipv4-geneve-multi-cell-tls-everywhere-phase3/64/undercloud-0/home/stack/overcloud_cell_deployment.log.gz

Comment 1 Harald Jensås 2022-05-17 18:39:29 UTC
For the overcloud deployment:

/var/lib/mistral/overcloud/global_vars.yaml contains:
ping_test_gateway_ips:
  BlockStorage: []
  CephStorage: []
  Compute: []
  Controller:
  - 10.0.0.1
  ObjectStorage: []

For the Cell deployment:

/var/lib/mistral/cell1/global_vars.yaml contains:

ping_test_gateway_ips:
  CellController:
  - ''
  - ''
  - ''
  - ''
  - 10.0.0.1
  Compute:
  - ''
  - ''
  - ''

Where the overcloud deployment has empty lists, the cell deployment ends up with empty-string entries.

Each empty string is passed as the argument to ping, which then fails because no destination address is given.
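
To illustrate the failure mode, here is a minimal Python sketch. It is not the proposed tripleo-heat-templates fix. It assumes the validation task renders a command string that is then split shell-style, which would explain why the logged `cmd` has no destination at all. The function names `build_ping_argv` and `drop_empty_gateways` are hypothetical.

```python
import shlex
from typing import Dict, List


def build_ping_argv(gateway_ip: str) -> List[str]:
    # Assumption: the command is rendered as a string and split shell-style,
    # so an empty gateway_ip simply disappears from the argument list.
    return shlex.split(f"ping -w 10 -c 1 {gateway_ip}")


def drop_empty_gateways(gateways: Dict[str, List[str]]) -> Dict[str, List[str]]:
    # Remove empty-string entries so that roles without a gateway
    # end up with an empty list, as in the overcloud global_vars.yaml.
    return {role: [ip for ip in ips if ip] for role, ips in gateways.items()}


# Values copied from /var/lib/mistral/cell1/global_vars.yaml above.
cell1 = {
    "CellController": ["", "", "", "", "10.0.0.1"],
    "Compute": ["", "", ""],
}

print(build_ping_argv(""))         # no destination argument
print(drop_empty_gateways(cell1))  # empty strings filtered out
```

With the empty strings filtered out, the Compute role has nothing to loop over and the CellController role pings only 10.0.0.1.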

Comment 2 Harald Jensås 2022-05-18 07:44:37 UTC
I proposed a fix: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/842274

A workaround is to set the Ansible variable `tripleo_nodes_validation_validate_gateway_icmp` to `false` using the `ExtraAnsibleHostVars` THT parameter.
This disables the "Check all networks Gateway availability" task[1].


[1] https://opendev.org/openstack/tripleo-ansible/src/branch/master/tripleo_ansible/roles/tripleo_nodes_validation/tasks/main.yml#L47

Comment 11 Martin Schuppert 2022-05-24 07:05:01 UTC
(In reply to Harald Jensås from comment #2)

Can't we just set `ValidateGatewaysIcmp: false` [1]?

[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/train/common/deploy-steps.j2#L127

Comment 12 Harald Jensås 2022-05-24 08:17:11 UTC
(In reply to Martin Schuppert from comment #11)

Oh, yes! Indeed, that would be the easier way.

Thanks Martin!


@jparker, as Martin points out, the better workaround is to set the THT parameter `ValidateGatewaysIcmp: false`.
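
As a rough sketch of that workaround, the snippet below renders a Heat environment file that sets the parameter under `parameter_defaults`. The helper `render_env` is hypothetical and handles only flat boolean and string values.

```python
from typing import Dict, Union


def render_env(params: Dict[str, Union[bool, str]]) -> str:
    # Render a flat mapping as the parameter_defaults section of a
    # Heat environment file. Booleans are written as YAML true/false.
    lines = ["parameter_defaults:"]
    for name, value in params.items():
        if isinstance(value, bool):
            rendered = "true" if value else "false"
        else:
            rendered = str(value)
        lines.append(f"  {name}: {rendered}")
    return "\n".join(lines) + "\n"


print(render_env({"ValidateGatewaysIcmp": False}), end="")
```

The resulting text can be saved to a file and passed to the deploy command with `-e`, alongside the other environment files for the cell stack.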

Comment 34 errata-xmlrpc 2022-12-07 19:22:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.4), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8794