Description of problem:
The following failures are observed when running the validation script[1] prior to the OSP upgrade from 13 to 16.1.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/framework_for_upgrades_13_to_16.1/planning-and-preparation-for-an-in-place-openstack-platform-upgrade#validating-red-hat-openstack-platform-oldvernum-before-the-upgrade

~~~
=== Running validation: "node-health" ===
...
Task 'Ping all overcloud nodes' failed:
Host: undercloud
Message: ping: compute-1: Name or service not known
...
Task 'Fail if there are unreachable nodes' failed:
Host: undercloud
Message: The following nodes could not be reached (5 nodes):
* compute-1
  UUID: 5065e75a-e098-4646-a651-d6a42fcbc3e0
  Instance: 5cd56e92-ecaf-49a9-a228-15b87ed12141
  Last Error: Power State: power on
* compute-0
...
Failure! The validation failed for all hosts:
* undercloud
~~~

According to the error, the validation tries to ping the overcloud nodes by their host names, but the ping doesn't succeed because the undercloud doesn't have the overcloud nodes in its /etc/hosts.
Actually I can ping compute-1 by its IP but not by its hostname:

~~~
(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| 3bccf8d3-1c26-4c06-9e16-fd328cf53eb7 | controller-0 | ACTIVE | ctlplane=192.168.24.16 | overcloud-full | controller |
| 73b72635-3e66-4b50-9afc-4dbc278f4c59 | compute-1    | ACTIVE | ctlplane=192.168.24.33 | overcloud-full | compute    |
| c7019b40-9f45-496f-9d26-fa2cf2e2f124 | controller-1 | ACTIVE | ctlplane=192.168.24.28 | overcloud-full | controller |
| 5cd56e92-ecaf-49a9-a228-15b87ed12141 | compute-0    | ACTIVE | ctlplane=192.168.24.37 | overcloud-full | compute    |
| 61ff5726-8acb-4484-9d47-046419f2ddf9 | controller-2 | ACTIVE | ctlplane=192.168.24.19 | overcloud-full | controller |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
(undercloud) [stack@undercloud-0 ~]$ ping -c 3 192.168.24.33
PING 192.168.24.33 (192.168.24.33) 56(84) bytes of data.
64 bytes from 192.168.24.33: icmp_seq=1 ttl=64 time=0.460 ms
64 bytes from 192.168.24.33: icmp_seq=2 ttl=64 time=0.232 ms
64 bytes from 192.168.24.33: icmp_seq=3 ttl=64 time=0.246 ms

--- 192.168.24.33 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.232/0.312/0.460/0.106 ms
(undercloud) [stack@undercloud-0 ~]$ ping -c 3 compute-1
ping: compute-1: Name or service not known
(undercloud) [stack@undercloud-0 ~]$
~~~

We need some consideration in tripleo-validations or in the documentation to avoid these false errors.
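As a stopgap on an affected undercloud, the name-resolution gap could be papered over by turning the `openstack server list` output above into /etc/hosts entries. This is only a sketch, not part of any documented procedure; `hosts_from_server_list` is a hypothetical helper, and it assumes the Networks column has the single-value `ctlplane=<ip>` form shown above:

```shell
# Hypothetical workaround sketch: build /etc/hosts lines from the
# ctlplane addresses that Nova already knows, so hostname-based pings
# can resolve. Assumes input lines of the form:
#   "<name> ctlplane=<ip>"
hosts_from_server_list() {
    # strip the "ctlplane=" prefix and emit "<ip> <name>"
    awk '{ sub(/^ctlplane=/, "", $2); print $2, $1 }'
}

# On a real undercloud one would feed it live data, e.g.:
#   openstack server list -f value -c Name -c Networks \
#       | hosts_from_server_list | sudo tee -a /etc/hosts
```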
Version-Release number of selected component (if applicable):
RHOSP13 z12

~~~
ansible-tripleo-ipsec-8.1.1-0.20190513184007.7eb892c.el7ost.noarch
openstack-tripleo-common-8.7.1-20.el7ost.noarch
openstack-tripleo-common-containers-8.7.1-20.el7ost.noarch
openstack-tripleo-heat-templates-8.4.1-58.1.el7ost.noarch
openstack-tripleo-image-elements-8.0.3-1.el7ost.noarch
openstack-tripleo-puppet-elements-8.1.1-2.el7ost.noarch
openstack-tripleo-ui-8.3.2-3.el7ost.noarch
openstack-tripleo-validations-8.5.0-4.el7ost.noarch
puppet-tripleo-8.5.1-14.el7ost.noarch
python-tripleoclient-9.3.1-7.el7ost.noarch
~~~

How reproducible:
Always

Steps to Reproduce:
1. Run the validation script according to the documentation[1]

Actual results:
The validation shell reports failures in the node-health validation

Expected results:
The validation shell reports no failures in the node-health validation

Additional info:
Hello Folks,

I am moving it back to DFG:DF as the issue isn't related to the FFU itself. This validation fails on a fresh (or not fresh) OSP13 environment, before any of the FFU process is triggered.

The complaint here is that the node-health validation tries to check the health of the nodes by pinging their hostnames, but the OSP13 undercloud doesn't have any information about the overcloud nodes' hostnames:

~~~
(undercloud) [stack@undercloud-0 ~]$ cat /etc/hosts
127.0.0.1 undercloud-0.redhat.local undercloud-0
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.35.64.93 rhos-qe-mirror-tlv.usersys.redhat.com download.lab.bos.redhat.com download.eng.bos.redhat.com download-node-02.eng.bos.redhat.com
~~~

And if you try to ping, for example, the compute, you will get:

~~~
(undercloud) [stack@undercloud-0 ~]$ ping compute-1
ping: compute-1: Name or service not known
~~~

Which is exactly what the validation is doing (iterating over the ansible groups and pinging the hosts by hostname):

~~~
- name: Check if hosts are IPs
  set_fact: hosts_are_ips="{{ item | ipaddr == item }}"
  with_items: "{{ groups.overcloud }}"

- name: Ping all overcloud nodes
  icmp_ping:
    host: "{{ item }}"
  with_items: "{{ groups.overcloud }}"
  ignore_errors: true
  register: ping_results
~~~

So, imho, it is the validation that needs to be improved.
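The failing task depends entirely on name resolution: `icmp_ping` has no address to reach if the undercloud cannot resolve the inventory name. A quick way to check that precondition by hand (a sketch only; `resolves` is a hypothetical helper, not part of tripleo-validations, and the node names are the ones from this environment):

```shell
# Check whether the undercloud can resolve a node name at all;
# getent consults the same NSS sources (/etc/hosts, DNS) that ping uses.
resolves() { getent hosts "$1" >/dev/null 2>&1; }

for host in compute-0 compute-1; do
    resolves "$host" || echo "$host: name does not resolve on this undercloud"
done
```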
I can see that the OSP16.1 undercloud has Ansible 2.9, so maybe it's just a matter of changing these ansible options:
https://docs.ansible.com/ansible/latest/reference_appendices/interpreter_discovery.html
Hello,

IIRC osp-13 doesn't inject things into /etc/hosts, while osp-16.1 does (and maybe some earlier versions as well, but since they are EOL...). That's probably "just" the root cause.

Meaning, in short: you can't run this validation on an osp-13 undercloud, unfortunately.

@Jose: you might want to update the doc to mention it, and maybe modify the command in order to filter out this validation?

Cheers,

C.
(In reply to Cédric Jeanneret from comment #3)
> Hello,
>
> IIRC osp-13 doesn't inject things in the /etc/hosts, while it does on
> osp-16.1 (and maybe with earlier versions, but since they are EOL...).
> That's probably "just" the root cause.
>
> Meaning, in short: you can't run this validation on an osp-13 undercloud,
> unfortunately.
>
> @Jose: you might want to update the doc mentioning it, and maybe modify the
> command in order to filter out this validation?
>
> Cheers,
>
> C.

Well, if that's the case we need to remove this validation from the group on RHOSP13, as the documentation only suggests running the pre-upgrade group:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/framework_for_upgrades_13_to_16.1/index#validating-red-hat-openstack-platform-oldvernum-before-the-upgrade

How hard is it to change the validation from using the hostname to using the IP? If I understand correctly, we should have that IP available in the environment, shouldn't we? I am quite sure this validation, when originally written for OSP13, didn't expect the hosts to be injected into the undercloud's /etc/hosts. This is not a validation that got recently backported; it's been there for two years already:
https://github.com/openstack/tripleo-validations/commit/a0c06ae7278f7446babd8c8aed92ce9c5a25fa3f#diff-d242bdac83a2b5cb825eaca5c1cde2dda1b1741fc63cb693dc7868776fb44230

If it's easier for you, we can remove it from the pre-upgrade group, but I have the feeling that this is a pretty important validation.

Cheers,
José Luis
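For what it's worth, the IPs are indeed already available from Nova. A rough sketch of pinging by address instead of by name, under the assumption that the Networks value has the single `ctlplane=<ip>` form shown in the server list above (`ctlplane_ips` is a hypothetical helper, not existing validation code):

```shell
# Extract the ctlplane addresses from `openstack server list -f value
# -c Networks` style lines ("ctlplane=<ip>"), so nodes can be pinged
# directly by IP with no hostname lookup involved.
ctlplane_ips() {
    sed -n 's/^ctlplane=//p'
}

# On a live undercloud one would feed it real data, e.g.:
#   openstack server list -f value -c Networks | ctlplane_ips \
#       | xargs -n1 ping -c 1 -W 2
```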
hmm, wondering how it was supposed to work, especially since it's launched from within the mistral container at that point (osp-13 doesn't have the new validation framework; everything runs as a mistral workflow). That's probably a question for Gael in the end, since this is OSP-13 and he has more knowledge of it than me.
Used procedure from the link in Comment 1. The node-health validation passed:

~~~
=== Running validation: "node-health" ===
Success! The validation passed for all hosts:
* undercloud
~~~
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 13.0 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0932