1872517 – node-health validation fails because of missing hosts entries in undercloud

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-tripleo-validations
Sub Component:
Version:	13.0 (Queens)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Gaël Chamoulaud
QA Contact:	nlevinki
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-08-26 00:35 UTC by Takashi Kajinami
Modified:	2024-06-13 22:59 UTC
CC List:	10 users
Fixed In Version:	openstack-tripleo-validations-8.5.0-7.el7ost
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-03-18 13:08:47 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	OSP-1700	0	None	None	None	2023-12-15 19:04:18 UTC
Red Hat Product Errata	RHBA-2021:0932	0	None	None	None	2021-03-18 13:10:06 UTC

Description Takashi Kajinami 2020-08-26 00:35:35 UTC

Description of problem:

The following failures are observed while running the validation script [1] prior to the OSP upgrade from 13 to 16.1:
 [1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/framework_for_upgrades_13_to_16.1/planning-and-preparation-for-an-in-place-openstack-platform-upgrade#validating-red-hat-openstack-platform-oldvernum-before-the-upgrade

~~~
=== Running validation: "node-health" ===

...

Task 'Ping all overcloud nodes' failed:
Host: undercloud
Message: ping: compute-1: Name or service not known

...

Task 'Fail if there are unreachable nodes' failed:
Host: undercloud
Message: The following nodes could not be reached (5 nodes):

* compute-1
    UUID: 5065e75a-e098-4646-a651-d6a42fcbc3e0
    Instance: 5cd56e92-ecaf-49a9-a228-15b87ed12141
    Last Error: 
    Power State: power on
* compute-0
    ...

Failure! The validation failed for all hosts:
* undercloud
~~~

According to the error, the validation tries to ping the overcloud nodes by their host names,
but the ping fails because the undercloud node has no entries for the overcloud nodes in its /etc/hosts.

In fact, I can ping compute-1 by its IP address but not by its hostname:
~~~
(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| 3bccf8d3-1c26-4c06-9e16-fd328cf53eb7 | controller-0 | ACTIVE | ctlplane=192.168.24.16 | overcloud-full | controller |
| 73b72635-3e66-4b50-9afc-4dbc278f4c59 | compute-1    | ACTIVE | ctlplane=192.168.24.33 | overcloud-full | compute    |
| c7019b40-9f45-496f-9d26-fa2cf2e2f124 | controller-1 | ACTIVE | ctlplane=192.168.24.28 | overcloud-full | controller |
| 5cd56e92-ecaf-49a9-a228-15b87ed12141 | compute-0    | ACTIVE | ctlplane=192.168.24.37 | overcloud-full | compute    |
| 61ff5726-8acb-4484-9d47-046419f2ddf9 | controller-2 | ACTIVE | ctlplane=192.168.24.19 | overcloud-full | controller |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
(undercloud) [stack@undercloud-0 ~]$ ping -c 3 192.168.24.33
PING 192.168.24.33 (192.168.24.33) 56(84) bytes of data.
64 bytes from 192.168.24.33: icmp_seq=1 ttl=64 time=0.460 ms
64 bytes from 192.168.24.33: icmp_seq=2 ttl=64 time=0.232 ms
64 bytes from 192.168.24.33: icmp_seq=3 ttl=64 time=0.246 ms

--- 192.168.24.33 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.232/0.312/0.460/0.106 ms
(undercloud) [stack@undercloud-0 ~]$ ping -c 3 compute-1
ping: compute-1: Name or service not known
(undercloud) [stack@undercloud-0 ~]$ 
~~~
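
The session above shows that the name-to-address mapping is known to Nova but is absent from /etc/hosts. A minimal sketch of deriving hosts-style lines from that output, assuming the table layout shown above and that the `ctlplane` address is the one to use. The function name `hosts_entries` is hypothetical, and whether adding such entries is an acceptable workaround is not settled in this report:

```python
import re

# Two sample rows copied from the `openstack server list` output above.
SERVER_LIST = """\
+--------------------------------------+--------------+--------+------------------------+
| ID                                   | Name         | Status | Networks               |
+--------------------------------------+--------------+--------+------------------------+
| 3bccf8d3-1c26-4c06-9e16-fd328cf53eb7 | controller-0 | ACTIVE | ctlplane=192.168.24.16 |
| 73b72635-3e66-4b50-9afc-4dbc278f4c59 | compute-1    | ACTIVE | ctlplane=192.168.24.33 |
+--------------------------------------+--------------+--------+------------------------+
"""


def hosts_entries(table):
    """Return '<ip> <name>' lines for every row that has a ctlplane address."""
    entries = []
    for line in table.splitlines():
        # Split a '| a | b | c |' row into its cells; border rows yield one cell.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) < 4:
            continue
        name, networks = cells[1], cells[3]
        match = re.search(r"ctlplane=([0-9.]+)", networks)
        if match:  # the header row has no address and is skipped here
            entries.append(f"{match.group(1)} {name}")
    return entries


if __name__ == "__main__":
    print("\n".join(hosts_entries(SERVER_LIST)))
```

The printed lines are in the `<address> <hostname>` form that /etc/hosts uses.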

We need some handling in tripleo-validations, or a note in the documentation, to avoid these false errors.

Version-Release number of selected component (if applicable):
RHOSP13z12
~~~
ansible-tripleo-ipsec-8.1.1-0.20190513184007.7eb892c.el7ost.noarch
openstack-tripleo-common-8.7.1-20.el7ost.noarch
openstack-tripleo-common-containers-8.7.1-20.el7ost.noarch
openstack-tripleo-heat-templates-8.4.1-58.1.el7ost.noarch
openstack-tripleo-image-elements-8.0.3-1.el7ost.noarch
openstack-tripleo-puppet-elements-8.1.1-2.el7ost.noarch
openstack-tripleo-ui-8.3.2-3.el7ost.noarch
openstack-tripleo-validations-8.5.0-4.el7ost.noarch
puppet-tripleo-8.5.1-14.el7ost.noarch
python-tripleoclient-9.3.1-7.el7ost.noarch
~~~

How reproducible:
Always

Steps to Reproduce:
1. Run the validation script according to the documentation [1]

Actual results:
The validation script reports failures in the node-health validation

Expected results:
The validation script reports no failures in the node-health validation

Additional info:

Comment 1 Jose Luis Franco 2020-08-31 12:50:56 UTC

Hello Folks,

I am moving it back to DFG:DF as the issue isn't related to the FFU itself. This validation fails on any OSP13 environment, fresh or not, before any part of the FFU process is triggered.

The complaint here is that the node-health validation checks the health of the nodes by pinging their hostnames, but the undercloud in OSP13 doesn't have any information about the overcloud nodes' hostnames:

(undercloud) [stack@undercloud-0 ~]$ cat /etc/hosts
127.0.0.1   undercloud-0.redhat.local undercloud-0
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

10.35.64.93  rhos-qe-mirror-tlv.usersys.redhat.com download.lab.bos.redhat.com download.eng.bos.redhat.com download-node-02.eng.bos.redhat.com

And if you try to ping, for example, the compute node, you get:

(undercloud) [stack@undercloud-0 ~]$ ping compute-1
ping: compute-1: Name or service not known

Which seems to be what the validation is doing (iterating over the ansible groups and pinging them by hostname):

  - name: Check if hosts are IPs
    set_fact: hosts_are_ips="{{ item | ipaddr == item }}"
    with_items: "{{ groups.overcloud }}"
  - name: Ping all overcloud nodes
    icmp_ping:
        host: "{{ item }}"
    with_items: "{{ groups.overcloud }}"
    ignore_errors: true
    register: ping_results

So, imho, it is the validation itself that needs to be improved.
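
For readers less familiar with the `ipaddr` filter used in the first task: the test `item | ipaddr == item` is true only when the inventory entry is already an IP address. A rough Python approximation, using the standard `ipaddress` module rather than the Ansible filter itself (which accepts more input forms); the `inventory` list is a made-up example based on the names in this report:

```python
import ipaddress


def is_ip(host):
    """Return True when `host` parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


# Hypothetical inventory mixing a hostname and an address.
inventory = ["compute-1", "192.168.24.33"]

# Entries that are not addresses need name resolution before they can be
# pinged, which is what fails on the OSP13 undercloud.
needs_resolution = [host for host in inventory if not is_ip(host)]
```

With the inventory from this bug, every entry in `groups.overcloud` is a hostname, so every ping depends on name resolution.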

Comment 2 Jose Luis Franco 2020-09-01 10:33:51 UTC

I can see that the OSP16.1 undercloud has Ansible 2.9, so maybe it's just a matter of changing these Ansible options: https://docs.ansible.com/ansible/latest/reference_appendices/interpreter_discovery.html

Comment 3 Cédric Jeanneret 2020-10-20 08:36:59 UTC

Hello,

IIRC osp-13 doesn't inject things in the /etc/hosts, while it does on osp-16.1 (and maybe with earlier versions, but since they are EOL...).
That's probably "just" the root cause.

Meaning, in short: you can't run this validation on an osp-13 undercloud, unfortunately.

@Jose: you might want to update the doc mentioning it, and maybe modify the command in order to filter out this validation?

Cheers,

C.

Comment 4 Jose Luis Franco 2020-10-20 13:25:59 UTC

(In reply to Cédric Jeanneret from comment #3)
> Hello,
> 
> IIRC osp-13 doesn't inject things in the /etc/hosts, while it does on
> osp-16.1 (and maybe with earlier versions, but since they are EOL...).
> That's probably "just" the root cause.
> 
> Meaning, in short: you can't run this validation on an osp-13 undercloud,
> unfortunately.
> 
> @Jose: you might want to update the doc mentioning it, and maybe modify the
> command in order to filter out this validation?
> 
> Cheers,
> 
> C.

Well, if that's the case, we then need to remove this validation from the group on RHOSP13, since in the documentation we only suggest running the pre-upgrade group:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/framework_for_upgrades_13_to_16.1/index#validating-red-hat-openstack-platform-oldvernum-before-the-upgrade

How hard is it to change the validation to use the IP instead of the hostname? If I understand correctly, we should have that IP in the environment, shouldn't we?
I am quite sure that this validation, when originally written for OSP13, didn't assume the hosts would be injected into the undercloud's /etc/hosts. This is not a validation that was recently backported; it has been there for two years already: https://github.com/openstack/tripleo-validations/commit/a0c06ae7278f7446babd8c8aed92ce9c5a25fa3f#diff-d242bdac83a2b5cb825eaca5c1cde2dda1b1741fc63cb693dc7868776fb44230

If it's easier for you, we can remove it from the pre-upgrade group, but I have the feeling that this is a pretty important validation.

Cheers,
José Luis

Comment 5 Cédric Jeanneret 2020-10-27 06:06:42 UTC

Hmm, I wonder how it was supposed to work, especially since it is launched from within the mistral container at that point (osp-13 doesn't have the new validation framework; everything runs as a mistral workflow).

That's probably a question for Gael in the end; since this is OSP-13, he has more knowledge than me.

Comment 14 David Rosenfeld 2021-02-19 21:12:39 UTC

Used the procedure from the link in Comment 1. The node-health validation passed:

=== Running validation: "node-health" ===

Success! The validation passed for all hosts:
* undercloud

Comment 18 errata-xmlrpc 2021-03-18 13:08:47 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 13.0 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0932
