Bug 1872517 - node-health validation fails because of missing hosts entries in undercloud
Summary: node-health validation fails because of missing hosts entries in undercloud
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-validations
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Gaël Chamoulaud
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-08-26 00:35 UTC by Takashi Kajinami
Modified: 2024-06-13 22:59 UTC (History)

Fixed In Version: openstack-tripleo-validations-8.5.0-7.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-18 13:08:47 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Issue Tracker | OSP-1700 | 0 | None | None | None | 2023-12-15 19:04:18 UTC
Red Hat Product Errata | RHBA-2021:0932 | 0 | None | None | None | 2021-03-18 13:10:06 UTC

Description Takashi Kajinami 2020-08-26 00:35:35 UTC
Description of problem:

The following failures are observed while running the validation script[1] prior to the OSP upgrade from 13 to 16.1:
 [1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/framework_for_upgrades_13_to_16.1/planning-and-preparation-for-an-in-place-openstack-platform-upgrade#validating-red-hat-openstack-platform-oldvernum-before-the-upgrade

~~~
=== Running validation: "node-health" ===

...

Task 'Ping all overcloud nodes' failed:
Host: undercloud
Message: ping: compute-1: Name or service not known

...

Task 'Fail if there are unreachable nodes' failed:
Host: undercloud
Message: The following nodes could not be reached (5 nodes):

* compute-1
    UUID: 5065e75a-e098-4646-a651-d6a42fcbc3e0
    Instance: 5cd56e92-ecaf-49a9-a228-15b87ed12141
    Last Error: 
    Power State: power on
* compute-0
    ...

Failure! The validation failed for all hosts:
* undercloud
~~~

According to the error, the validation tries to ping the overcloud nodes by host name,
but the ping fails because the undercloud node has no entries for the overcloud nodes in its /etc/hosts.

In fact, I can ping compute-1 by its IP address but not by its host name:
~~~
(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| 3bccf8d3-1c26-4c06-9e16-fd328cf53eb7 | controller-0 | ACTIVE | ctlplane=192.168.24.16 | overcloud-full | controller |
| 73b72635-3e66-4b50-9afc-4dbc278f4c59 | compute-1    | ACTIVE | ctlplane=192.168.24.33 | overcloud-full | compute    |
| c7019b40-9f45-496f-9d26-fa2cf2e2f124 | controller-1 | ACTIVE | ctlplane=192.168.24.28 | overcloud-full | controller |
| 5cd56e92-ecaf-49a9-a228-15b87ed12141 | compute-0    | ACTIVE | ctlplane=192.168.24.37 | overcloud-full | compute    |
| 61ff5726-8acb-4484-9d47-046419f2ddf9 | controller-2 | ACTIVE | ctlplane=192.168.24.19 | overcloud-full | controller |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
(undercloud) [stack@undercloud-0 ~]$ ping -c 3 192.168.24.33
PING 192.168.24.33 (192.168.24.33) 56(84) bytes of data.
64 bytes from 192.168.24.33: icmp_seq=1 ttl=64 time=0.460 ms
64 bytes from 192.168.24.33: icmp_seq=2 ttl=64 time=0.232 ms
64 bytes from 192.168.24.33: icmp_seq=3 ttl=64 time=0.246 ms

--- 192.168.24.33 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.232/0.312/0.460/0.106 ms
(undercloud) [stack@undercloud-0 ~]$ ping -c 3 compute-1
ping: compute-1: Name or service not known
(undercloud) [stack@undercloud-0 ~]$ 
~~~
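
As a stopgap, the name-to-address pairs shown by `openstack server list` could be turned into /etc/hosts-style lines. The following is a minimal sketch, not the fix shipped for this bug. It assumes the `-f value -c Name -c Networks` output format with one `ctlplane=` address per node, and the function name `servers_to_hosts` is hypothetical:

```shell
#!/bin/sh
# Convert "name ctlplane=IP" lines (read from stdin) into "IP name" lines
# suitable for review before being added to /etc/hosts.
# Assumed input: output of
#   openstack server list -f value -c Name -c Networks
# for example: "compute-1 ctlplane=192.168.24.33"
servers_to_hosts() {
    awk '$2 ~ /^ctlplane=/ {
        addr = $2
        sub(/^ctlplane=/, "", addr)   # drop the network label
        sub(/,.*$/, "", addr)         # keep only the first address
        print addr, $1
    }'
}
```

For example, `openstack server list -f value -c Name -c Networks | servers_to_hosts` would print lines such as `192.168.24.33 compute-1`, which can be checked by hand before they are appended to /etc/hosts on the undercloud.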

We need some consideration in tripleo-validations or in the documentation to avoid these false errors.

Version-Release number of selected component (if applicable):
RHOSP13z12
~~~
ansible-tripleo-ipsec-8.1.1-0.20190513184007.7eb892c.el7ost.noarch
openstack-tripleo-common-8.7.1-20.el7ost.noarch
openstack-tripleo-common-containers-8.7.1-20.el7ost.noarch
openstack-tripleo-heat-templates-8.4.1-58.1.el7ost.noarch
openstack-tripleo-image-elements-8.0.3-1.el7ost.noarch
openstack-tripleo-puppet-elements-8.1.1-2.el7ost.noarch
openstack-tripleo-ui-8.3.2-3.el7ost.noarch
openstack-tripleo-validations-8.5.0-4.el7ost.noarch
puppet-tripleo-8.5.1-14.el7ost.noarch
python-tripleoclient-9.3.1-7.el7ost.noarch
~~~

How reproducible:
Always

Steps to Reproduce:
1. Run the validation script according to the documentation[1]

Actual results:
The validation script reports failures in the node-health validation

Expected results:
The validation script reports no failures in the node-health validation

Additional info:

Comment 1 Jose Luis Franco 2020-08-31 12:50:56 UTC
Hello Folks,

I am moving it back to DFG:DF as the issue isn't related to the FFU itself. This validation fails on any OSP13 environment, fresh or not, before any part of the FFU process is triggered.

The complaint here is that the node-health validation checks the health of the nodes by pinging their hostnames, but the undercloud in OSP13 doesn't have any information about the overcloud nodes' hostnames:

(undercloud) [stack@undercloud-0 ~]$ cat /etc/hosts
127.0.0.1   undercloud-0.redhat.local undercloud-0
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

10.35.64.93  rhos-qe-mirror-tlv.usersys.redhat.com download.lab.bos.redhat.com download.eng.bos.redhat.com download-node-02.eng.bos.redhat.com

And if you try to ping, for example, the compute node, you get:

(undercloud) [stack@undercloud-0 ~]$ ping compute-1
ping: compute-1: Name or service not known

Which seems to be what the validation is doing (iterating over the ansible groups and pinging them by hostname):

  - name: Check if hosts are IPs
    set_fact: hosts_are_ips="{{ item | ipaddr == item }}"
    with_items: "{{ groups.overcloud }}"
  - name: Ping all overcloud nodes
    icmp_ping:
        host: "{{ item }}"
    with_items: "{{ groups.overcloud }}"
    ignore_errors: true
    register: ping_results

So, imho, it is the validation that needs to be improved.
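
One way to picture such an improvement is to probe each node by its address rather than by its name. The sketch below is only an illustration under stated assumptions, not the change that was eventually released: it reads "IP name" pairs from stdin, and both the function name `unreachable_nodes` and the `PROBE` override are hypothetical.

```shell
#!/bin/sh
# Print the names of nodes that cannot be reached, probing each node by IP
# so that no hostname resolution is needed on the undercloud.
# Input on stdin: one "IP name" pair per line.
# PROBE (hypothetical knob) is the command run against each IP; it defaults
# to a single ICMP echo request with a 2-second timeout.
unreachable_nodes() {
    probe=${PROBE:-"ping -c 1 -W 2"}
    while read -r addr name; do
        [ -n "$addr" ] || continue
        # </dev/null keeps the probe from consuming the loop's stdin
        if ! $probe "$addr" >/dev/null 2>&1 </dev/null; then
            echo "$name"
        fi
    done
}
```

Fed with lines such as `192.168.24.33 compute-1`, the function prints nothing when every address answers and lists the node names otherwise, which mirrors the "Fail if there are unreachable nodes" report without depending on /etc/hosts.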

Comment 2 Jose Luis Franco 2020-09-01 10:33:51 UTC
I can see that the OSP16.1 undercloud has Ansible 2.9, so maybe it's just a matter of changing these Ansible options: https://docs.ansible.com/ansible/latest/reference_appendices/interpreter_discovery.html

Comment 3 Cédric Jeanneret 2020-10-20 08:36:59 UTC
Hello,

IIRC osp-13 doesn't inject things in the /etc/hosts, while it does on osp-16.1 (and maybe with earlier versions, but since they are EOL...).
That's probably "just" the root cause.

Meaning, in short: you can't run this validation on an osp-13 undercloud, unfortunately.

@Jose: you might want to update the doc mentioning it, and maybe modify the command in order to filter out this validation?

Cheers,

C.

Comment 4 Jose Luis Franco 2020-10-20 13:25:59 UTC
(In reply to Cédric Jeanneret from comment #3)
> Hello,
> 
> IIRC osp-13 doesn't inject things in the /etc/hosts, while it does on
> osp-16.1 (and maybe with earlier versions, but since they are EOL...).
> That's probably "just" the root cause.
> 
> Meaning, in short: you can't run this validation on an osp-13 undercloud,
> unfortunately.
> 
> @Jose: you might want to update the doc mentioning it, and maybe modify the
> command in order to filter out this validation?
> 
> Cheers,
> 
> C.

Well, if that's the case, then we need to remove this validation from the group on RHOSP13, since the documentation only suggests running the pre-upgrade group:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/framework_for_upgrades_13_to_16.1/index#validating-red-hat-openstack-platform-oldvernum-before-the-upgrade

How hard is it to change the validation to use the IP instead of the hostname? If I understand correctly, we should have that IP in the environment, shouldn't we?
I am quite sure this validation, when originally written for OSP13, did not expect the hosts to be injected into the undercloud's /etc/hosts. It was not recently backported; it has been there for two years already: https://github.com/openstack/tripleo-validations/commit/a0c06ae7278f7446babd8c8aed92ce9c5a25fa3f#diff-d242bdac83a2b5cb825eaca5c1cde2dda1b1741fc63cb693dc7868776fb44230

If it's easier for you, we can remove it from the pre-upgrade group, but I have the feeling that this is a pretty important validation.

Cheers,
José Luis

Comment 5 Cédric Jeanneret 2020-10-27 06:06:42 UTC
Hmm, I wonder how it was supposed to work, especially since it is launched from within the mistral container at that point (osp-13 doesn't have the new validation framework; everything runs as a mistral workflow).

That's probably a question for Gael in the end; since this is OSP-13, he has more knowledge than I do.

Comment 14 David Rosenfeld 2021-02-19 21:12:39 UTC
Used the procedure from the link in Comment 1. The node-health validation passed:

=== Running validation: "node-health" ===

Success! The validation passed for all hosts:
* undercloud

Comment 18 errata-xmlrpc 2021-03-18 13:08:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 13.0 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0932

