Bug 1949385 - [OSP16.2] hypervisor hostname no longer matches nova.conf hosts value
Summary: [OSP16.2] hypervisor hostname no longer matches nova.conf hosts value
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.2 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ga
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: OSP Team
QA Contact: James Parker
URL:
Whiteboard:
Duplicates: 1949469
Depends On:
Blocks:
Reported: 2021-04-14 07:32 UTC by Maxim Babushkin
Modified: 2021-09-15 07:14 UTC (History)
CC List: 19 users

Fixed In Version: openstack-tripleo-heat-templates-11.4.1-2.20210326005015.7befdd2.el8ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-15 07:13:57 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 782684 0 None MERGED Set toplevel nova::dhcp_domain for all nova services 2021-04-16 10:22:23 UTC
Red Hat Product Errata RHEA-2021:3483 0 None None None 2021-09-15 07:14:17 UTC

Description Maxim Babushkin 2021-04-14 07:32:52 UTC
Description of problem:
Unable to add a host to an aggregate. The request fails with the error "compute host could not be found".

Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 16.2.0 Beta (Train)
Puddle: RHOS-16.2-RHEL-8-20210409.n.0


Adding a compute host to an aggregate fails with the "compute host could not be found" error.
This happens because the controller sees the compute (hypervisor) name with a different domain suffix.

List the hypervisor hosts:
$ openstack hypervisor list
+--------------------------------------+---------------------------------+-----------------+---------------+-------+
| ID                                   | Hypervisor Hostname             | Hypervisor Type | Host IP       | State |
+--------------------------------------+---------------------------------+-----------------+---------------+-------+
| f26f4523-64b7-4b51-8cab-6cd9de7d0410 | computeovsdpdksriov-1.novalocal | QEMU            | 10.10.130.154 | up    |
| 74113e94-0e67-45b6-937f-b0ba21ea61f4 | computeovsdpdksriov-0.novalocal | QEMU            | 10.10.130.187 | up    |
+--------------------------------------+---------------------------------+-----------------+---------------+-------+

List and show created aggregate:
$ openstack aggregate list
+----+------------------------------+-------------------+
| ID | Name                         | Availability Zone |
+----+------------------------------+-------------------+
| 14 | tempest-aggregate-1965427792 | None              |
+----+------------------------------+-------------------+
$ openstack aggregate show 14
+-------------------+--------------------------------------+
| Field             | Value                                |
+-------------------+--------------------------------------+
| availability_zone | None                                 |
| created_at        | 2021-04-14T06:17:35.000000           |
| deleted           | False                                |
| deleted_at        | None                                 |
| hosts             |                                      |
| id                | 14                                   |
| name              | tempest-aggregate-1965427792         |
| properties        |                                      |
| updated_at        | None                                 |
| uuid              | c008a013-6a12-489a-8ec8-8efdc340c5c0 |
+-------------------+--------------------------------------+

Try to add the host to aggregate:
$ openstack aggregate add host 14 computeovsdpdksriov-1.novalocal
Compute host computeovsdpdksriov-1.novalocal could not be found. (HTTP 404) (Request-ID: req-452f1917-fede-4c0e-8023-17b3f26adb98)

Check the nova-scheduler log on the controller:
2021-04-14 06:24:40.240 15 DEBUG nova.scheduler.host_manager [req-2c3782a0-00e4-4acd-a267-83f629be473e - - - - -] Successfully synced instances from host 'computeovsdpdksriov-1.localdomain'. sync_instance_info /usr/lib/python3.6/site-packages/nova/scheduler/host_manager.py:960


The controller sees the host as "computeovsdpdksriov-1.localdomain", not as "computeovsdpdksriov-1.novalocal" as shown in the hypervisor list output.
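
The check above can be repeated across a whole log file. The following is a minimal sketch, assuming the log lines have the same "Successfully synced instances from host '...'" shape as the one quoted above; the helper name synced_hosts is hypothetical and not part of nova.

```python
import re

# Matches the host name quoted in the nova-scheduler debug message shown above.
SYNC_RE = re.compile(r"Successfully synced instances from host '([^']+)'")


def synced_hosts(log_text):
    """Return the sorted, de-duplicated host names found in the log text."""
    return sorted(set(SYNC_RE.findall(log_text)))


log_text = (
    "2021-04-14 06:24:40.240 15 DEBUG nova.scheduler.host_manager "
    "[req-2c3782a0-00e4-4acd-a267-83f629be473e - - - - -] "
    "Successfully synced instances from host "
    "'computeovsdpdksriov-1.localdomain'. sync_instance_info\n"
)
print(synced_hosts(log_text))
```

The names it returns are the ones the scheduler uses for the compute services, so they can be compared directly with the "Hypervisor Hostname" column.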


Try to add the host with the "localdomain" suffix:
$ openstack aggregate add host 14 computeovsdpdksriov-1.localdomain
+-------------------+--------------------------------------+
| Field             | Value                                |
+-------------------+--------------------------------------+
| availability_zone | None                                 |
| created_at        | 2021-04-14T06:17:35.000000           |
| deleted           | False                                |
| deleted_at        | None                                 |
| hosts             | computeovsdpdksriov-1.localdomain    |
| id                | 14                                   |
| name              | tempest-aggregate-1965427792         |
| properties        |                                      |
| updated_at        | None                                 |
| uuid              | c008a013-6a12-489a-8ec8-8efdc340c5c0 |
+-------------------+--------------------------------------+

The host is added successfully.
The domain suffix shown in the "openstack hypervisor list" output is not correct.

Comment 1 Maxim Babushkin 2021-04-14 07:34:24 UTC
Sosreports available on the following link: http://file.mad.redhat.com/~mbabushk/sosreports/bz1949385/

Comment 2 smooney 2021-04-14 11:36:05 UTC
You are trying to use the hypervisor hostname to add a host to an aggregate, which is incorrect.
Each node has two different values: the hypervisor hostname, which is the hostname as reported by libvirt/glibc, and the host value,
which is set in nova.conf. Here it is set to host=computeovsdpdksriov-0.localdomain.

This is not a nova bug. It is an ooo (TripleO) bug combined with the fact that you were trying to use the hypervisor hostname instead of the compute service host.

(kolla-venv) [sean@workstation kolla-work-dir]$ openstack aggregate create test
+-------------------+----------------------------+
| Field             | Value                      |
+-------------------+----------------------------+
| availability_zone | None                       |
| created_at        | 2021-04-14T11:31:24.952624 |
| deleted           | False                      |
| deleted_at        | None                       |
| hosts             | None                       |
| id                | 1                          |
| name              | test                       |
| properties        | None                       |
| updated_at        | None                       |
+-------------------+----------------------------+
(kolla-venv) [sean@workstation kolla-work-dir]$ openstack aggregate show test
+-------------------+----------------------------+
| Field             | Value                      |
+-------------------+----------------------------+
| availability_zone | None                       |
| created_at        | 2021-04-14T11:31:24.000000 |
| deleted           | False                      |
| deleted_at        | None                       |
| hosts             |                            |
| id                | 1                          |
| name              | test                       |
| properties        |                            |
| updated_at        | None                       |
+-------------------+----------------------------+

[sean@workstation kolla-work-dir]$ openstack compute service list --service nova-compute
+----+--------------+--------------------+------+---------+-------+----------------------------+
| ID | Binary       | Host               | Zone | Status  | State | Updated At                 |
+----+--------------+--------------------+------+---------+-------+----------------------------+
|  4 | nova-compute | workstation        | nova | enabled | up    | 2021-04-14T11:23:24.000000 |
|  5 | nova-compute | workstation-ironic | nova | enabled | up    | 2021-04-14T11:23:24.000000 |
+----+--------------+--------------------+------+---------+-------+----------------------------+

openstack hypervisor list --long
+----+--------------------------------------+-----------------+-------------+-------+------------+-------+----------------+-----------+
| ID | Hypervisor Hostname                  | Hypervisor Type | Host IP     | State | vCPUs Used | vCPUs | Memory MB Used | Memory MB |
+----+--------------------------------------+-----------------+-------------+-------+------------+-------+----------------+-----------+
|  1 | workstation                          | QEMU            | 192.168.3.1 | up    |        162 |    48 |         202240 |    257886 |
|  2 | 31303735-3035-4247-3830-333132534457 | ironic          | 192.168.3.1 | up    |         16 |     0 |          49152 |         0 |
|  3 | 31303735-3934-4247-3830-333132535336 | ironic          | 192.168.3.1 | up    |         16 |     0 |          49152 |         0 |
|  4 | 31303735-3035-4247-3830-323630455630 | ironic          | 192.168.3.1 | up    |          0 |     0 |              0 |         0 |
|  5 | 31303735-3934-4247-3830-323630455930 | ironic          | 192.168.3.1 | up    |         16 |     0 |          49152 |         0 |
+----+--------------------------------------+-----------------+-------------+-------+------------+-------+----------------+-----------+

this should fail
[sean@workstation kolla-work-dir]$ openstack aggregate add host test 31303735-3035-4247-3830-333132534457
Compute host 31303735-3035-4247-3830-333132534457 could not be found. (HTTP 404) (Request-ID: req-d68bfe69-f26d-4511-a8c4-faa17d959da7)

and it does, but using the host value from the compute service record works

 openstack aggregate add host test workstation-ironic
+-------------------+----------------------------+
| Field             | Value                      |
+-------------------+----------------------------+
| availability_zone | None                       |
| created_at        | 2021-04-14T11:31:24.000000 |
| deleted           | False                      |
| deleted_at        | None                       |
| hosts             | workstation-ironic         |
| id                | 1                          |
| name              | test                       |
| properties        |                            |
| updated_at        | None                       |
+-------------------+----------------------------+

this is how the aggregates api was intended to work
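
The distinction between the two names can be sketched as a small lookup. This is only an illustration, assuming the hypervisor hostname and the compute service host share the same short name (the part before the first dot), as they do in the original report. That assumption does not hold for ironic, where the hypervisor hostname is a node UUID, as the listing above shows. The helper names are hypothetical, and the service hosts are assumed to come from "openstack compute service list --service nova-compute".

```python
def short_name(fqdn):
    """Return the part of a host name before the first dot."""
    return fqdn.split(".", 1)[0]


def service_host_for(hypervisor_hostname, service_hosts):
    """Find the compute service host sharing a short name with the hypervisor.

    Returns None when nothing matches, for example for ironic nodes whose
    hypervisor hostname is a node UUID rather than a host name.
    """
    wanted = short_name(hypervisor_hostname)
    for host in service_hosts:
        if short_name(host) == wanted:
            return host
    return None


service_hosts = [
    "computeovsdpdksriov-0.localdomain",
    "computeovsdpdksriov-1.localdomain",
]
# Prints computeovsdpdksriov-1.localdomain, the name the aggregates API accepts.
print(service_host_for("computeovsdpdksriov-1.novalocal", service_hosts))
```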

Comment 3 Maxim Babushkin 2021-04-14 11:41:07 UTC
Hi Sean,

We are using the same tht to deploy environments on 16.1 and 16.2.
We have never used the "CloudDomain" parameter in our tht.

So why, in that case, do I get different output in 16.1 and 16.2?

In 16.1:
$ openstack hypervisor list
+--------------------------------------+-----------------------------------+-----------------+---------------+-------+
| ID                                   | Hypervisor Hostname               | Hypervisor Type | Host IP       | State |
+--------------------------------------+-----------------------------------+-----------------+---------------+-------+
| 3c4ebf0e-3e2a-486b-b119-2b8924769c17 | computeovsdpdksriov-1.localdomain | QEMU            | 10.10.100.111 | up    |
| 216357a2-d499-4aaf-80b0-37cbb3eb4df0 | computeovsdpdksriov-0.localdomain | QEMU            | 10.10.100.150 | up    |
+--------------------------------------+-----------------------------------+-----------------+---------------+-------+

In 16.2:
$ openstack hypervisor list
+--------------------------------------+---------------------------------+-----------------+---------------+-------+
| ID                                   | Hypervisor Hostname             | Hypervisor Type | Host IP       | State |
+--------------------------------------+---------------------------------+-----------------+---------------+-------+
| f26f4523-64b7-4b51-8cab-6cd9de7d0410 | computeovsdpdksriov-1.novalocal | QEMU            | 10.10.130.154 | up    |
| 74113e94-0e67-45b6-937f-b0ba21ea61f4 | computeovsdpdksriov-0.novalocal | QEMU            | 10.10.130.187 | up    |
+--------------------------------------+---------------------------------+-----------------+---------------+-------+

Comment 4 smooney 2021-04-14 11:47:45 UTC
*** Bug 1949469 has been marked as a duplicate of this bug. ***

Comment 5 smooney 2021-04-14 11:56:03 UTC
There can be a number of reasons, but the short answer is that the "Hypervisor Hostname" is the value that libvirt returns to nova for the hostname of the current host.
In 16.1 ooo was configuring the host such that the "hypervisor hostname" and the value configured in nova.conf were the same.

I suspect that something has changed in how /etc/hostname and /etc/hosts are now being configured, resulting in what is effectively a hostname change.
This is a release blocker, as it will break deployments on upgrade, so I am setting the correct flags. I have closed the other bug reported for the OVN job as a duplicate of this one,
so I'll remove the NFV DFG from the devel dashboard since it is not DPDK specific.

Comment 6 smooney 2021-04-14 12:15:55 UTC
this could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1900500

Comment 7 smooney 2021-04-14 12:23:20 UTC
/etc/hosts looks correct

# BEGIN ANSIBLE MANAGED BLOCK
10.10.130.187 computeovsdpdksriov-0.localdomain computeovsdpdksriov-0
10.10.130.187 computeovsdpdksriov-0.internalapi.localdomain computeovsdpdksriov-0.internalapi
10.10.131.102 computeovsdpdksriov-0.tenant.localdomain computeovsdpdksriov-0.tenant
10.10.132.143 computeovsdpdksriov-0.storage.localdomain computeovsdpdksriov-0.storage
192.0.90.22 computeovsdpdksriov-0.ctlplane.localdomain computeovsdpdksriov-0.ctlplane
10.10.130.154 computeovsdpdksriov-1.localdomain computeovsdpdksriov-1
10.10.130.154 computeovsdpdksriov-1.internalapi.localdomain computeovsdpdksriov-1.internalapi
10.10.131.175 computeovsdpdksriov-1.tenant.localdomain computeovsdpdksriov-1.tenant
10.10.132.121 computeovsdpdksriov-1.storage.localdomain computeovsdpdksriov-1.storage
192.0.90.21 computeovsdpdksriov-1.ctlplane.localdomain computeovsdpdksriov-1.ctlplane
10.10.130.184 controller-0.localdomain controller-0
10.10.130.184 controller-0.internalapi.localdomain controller-0.internalapi
10.10.131.127 controller-0.tenant.localdomain controller-0.tenant
10.10.132.114 controller-0.storage.localdomain controller-0.storage
10.10.133.194 controller-0.storagemgmt.localdomain controller-0.storagemgmt
10.35.185.75 controller-0.external.localdomain controller-0.external
192.0.90.19 controller-0.ctlplane.localdomain controller-0.ctlplane
10.10.130.194 controller-1.localdomain controller-1
10.10.130.194 controller-1.internalapi.localdomain controller-1.internalapi
10.10.131.184 controller-1.tenant.localdomain controller-1.tenant
10.10.132.138 controller-1.storage.localdomain controller-1.storage
10.10.133.123 controller-1.storagemgmt.localdomain controller-1.storagemgmt
10.35.185.76 controller-1.external.localdomain controller-1.external
192.0.90.24 controller-1.ctlplane.localdomain controller-1.ctlplane
10.10.130.162 controller-2.localdomain controller-2
10.10.130.162 controller-2.internalapi.localdomain controller-2.internalapi
10.10.131.188 controller-2.tenant.localdomain controller-2.tenant
10.10.132.148 controller-2.storage.localdomain controller-2.storage
10.10.133.140 controller-2.storagemgmt.localdomain controller-2.storagemgmt
10.35.185.67 controller-2.external.localdomain controller-2.external
192.0.90.9 controller-2.ctlplane.localdomain controller-2.ctlplane

192.0.90.1 undercloud-0.ctlplane.localdomain undercloud-0.ctlplane
192.0.90.12  overcloud.ctlplane.localdomain
10.10.130.175  overcloud.internalapi.localdomain
10.10.132.157  overcloud.storage.localdomain
10.10.133.153  overcloud.storagemgmt.localdomain
10.35.185.74  overcloud.localdomain
# END ANSIBLE MANAGED BLOCK
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

but /etc/hostname is wrong

computeovsdpdksriov-0.novalocal

something is causing ooo to generate the hostname incorrectly in 16.2
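
The comparison made by hand above can be sketched as follows. This is a minimal illustration, assuming the contents of /etc/hosts and /etc/hostname have already been read into strings; the function names are hypothetical.

```python
def names_in_hosts(hosts_text):
    """Collect every host name and alias listed in /etc/hosts style text."""
    names = set()
    for line in hosts_text.splitlines():
        # Drop comments such as the "# BEGIN ANSIBLE MANAGED BLOCK" markers.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        # The first field is the IP address, the rest are names.
        names.update(fields[1:])
    return names


def hostname_matches_hosts(hostname, hosts_text):
    """Return True if the hostname appears as a name in the hosts text."""
    return hostname.strip() in names_in_hosts(hosts_text)


hosts_text = """\
# BEGIN ANSIBLE MANAGED BLOCK
10.10.130.187 computeovsdpdksriov-0.localdomain computeovsdpdksriov-0
10.10.130.187 computeovsdpdksriov-0.internalapi.localdomain computeovsdpdksriov-0.internalapi
# END ANSIBLE MANAGED BLOCK
"""
# The value found in /etc/hostname on the 16.2 node is not in /etc/hosts.
print(hostname_matches_hosts("computeovsdpdksriov-0.novalocal\n", hosts_text))
# The name expected from /etc/hosts is.
print(hostname_matches_hosts("computeovsdpdksriov-0.localdomain", hosts_text))
```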

Comment 8 Martin Schuppert 2021-04-14 12:56:33 UTC
sosreport-undercloud-0-2021-04-14-wrfqjon]$ grep tripleo-heat installed-rpms 
openstack-tripleo-heat-templates-11.4.1-2.20210323012110.c3396e2.el8ost.1.noarch Tue Apr 13 13:23:07 2021

sosreport-undercloud-0-2021-04-14-wrfqjon]$ grep dhcp_domain var/lib/config-data/puppet-generated/nova//etc/nova/nova.conf
#dhcp_domain=novalocal

openstack-tripleo-heat-templates-11.4.1-2.20210323012110.c3396e2.el8ost.1.noarch does not include [1], which unsets the default novalocal dhcp_domain in nova.conf on the undercloud.

It is part of openstack-tripleo-heat-templates-11.4.1-2.20210326005015.7befdd2.el8ost

[1] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/782684
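
The grep above shows dhcp_domain only as a commented-out line, which means the option is not set explicitly and the default applies. A minimal sketch of the same check is given below, assuming the nova.conf contents are available as a string. It deliberately ignores which section the option sits in, which is a simplification, and the function name is hypothetical.

```python
def explicit_option(conf_text, option):
    """Return the explicitly set value of an option, or None if it is unset.

    Commented-out lines (starting with '#' or ';') and section headers are
    skipped, so '#dhcp_domain=novalocal' counts as not set.
    """
    value = None
    for line in conf_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", ";", "[")):
            continue
        key, sep, val = stripped.partition("=")
        if sep and key.strip() == option:
            value = val.strip()
    return value


before_fix = "[DEFAULT]\n#dhcp_domain=novalocal\n"
print(explicit_option(before_fix, "dhcp_domain"))
```

A result of None corresponds to the situation in the sosreport, where nothing overrides the novalocal default.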

Comment 9 smooney 2021-04-14 13:16:46 UTC
As Martin pointed out, this is already fixed, but the fix was not included in the compose, so I am setting the bug to MODIFIED and adding the Triaged keyword.

Comment 22 errata-xmlrpc 2021-09-15 07:13:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483

