Bug 1900500

Summary:	Hostname in neutron agent's config doesn't match what is stored in placement by nova
Product:	Red Hat OpenStack	Reporter:	Slawek Kaplonski <skaplons>
Component:	openstack-neutron	Assignee:	Takashi Kajinami <tkajinam>
Status:	CLOSED ERRATA	QA Contact:	Alex Katz <akatz>
Severity:	high	Docs Contact:
Priority:	high
Version:	16.1 (Train)	CC:	bdobreli, ccamposr, chrisw, gregraka, hakhande, igallagh, oblaut, ralonsoh, scohen, smooney, tkajinam, vkhitrin
Target Milestone:	z7	Keywords:	Regression, Reopened, Triaged
Target Release:	16.1 (Train on RHEL 8.2)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	openstack-neutron-15.2.1-1.20210623113310.el8ost	Doc Type:	Enhancement
Doc Text:	The logic to detect the hypervisor hostname has been fixed and now returns the result consistent with `libvirt` driver in the Compute service (nova). With this fix, you no longer need to specify the `resource_provider_hypervisors` option when you use the guaranteed minimum bandwidth QoS feature. + With this update, a new option, `resource_provider_default_hypervisor`, has been added to the Modular Layer 2 with the Open Virtual Network mechanism driver (ML2/OVN) to replace the default hypervisor name. The option locates the root resource provider without giving a complete list of interfaces or bridges in the `resource_provider_hypervisors` option in case it has to be customized by the user. This new option is located in the `[ovs]` ini-section for the `ovs-agent`, and in the `[sriov_nic]` ini-section for the `sriov-agent`.	Story Points:	---
Clone Of:
Clones:	1989820		Environment:
Last Closed:	2021-12-09 20:17:24 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1669584, 1885514, 1989820, 1991746, 2018121

Description Slawek Kaplonski 2020-11-23 08:37:30 UTC

This was originally reported by Takashi Kajinami in the comment to https://bugzilla.redhat.com/show_bug.cgi?id=1788974#c20

A customer tried the minimum bandwidth QoS feature and found some issues with the current configurations described in our networking guide[1]. IIUC the configurations implemented in this bz are based on the above doc, and I'm afraid we hit the same issue with the current implementation.
 [1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/networking_guide/sec-qos#guaranteed-min-bw


What we've observed so far is that neutron fails to report the resource to placement.
We confirmed that resource_synced field in "openstack network agent show <uuid>" does not become True after setting the parameters described.

While investigating the issue, we identified that the configuration which appears in "network agent show" has
 'resource_provider_hypervisors': {'br-tenant': 'compute-0'}
instead of
 'resource_provider_hypervisors': {'br-tenant': 'compute-0.redhat.local'}
and neutron fails to identify the resource provider for that hypervisor because the resource provider record has the FQDN (which comes from the host parameter in nova.conf on compute nodes).
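
To make the mismatch concrete, here is a minimal Python sketch. It assumes the resource provider is looked up by exact name; `find_resource_provider` and the sample provider list are hypothetical and only mirror the names seen in this report, they are not neutron or placement code.

```python
# Hypothetical illustration: providers are assumed to be matched by exact
# name, so a short hostname does not match a provider registered as an FQDN.
RESOURCE_PROVIDERS = ["compute-0.redhat.local", "compute-1.redhat.local"]


def find_resource_provider(name, providers=RESOURCE_PROVIDERS):
    """Return the provider whose name equals `name`, or None if absent."""
    for provider in providers:
        if provider == name:
            return provider
    return None


short_match = find_resource_provider("compute-0")
fqdn_match = find_resource_provider("compute-0.redhat.local")
print(short_match)  # None: the short name is not a registered provider
print(fqdn_match)   # compute-0.redhat.local
```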

It seems that the following patch, which was backported to stable/train, switched the way to obtain the default hypervisor hostname from CONF.host to socket.gethostname(), but I'm afraid this caused the short name to be used instead of the FQDN(*1) in TripleO deployments, which resulted in the inconsistency between neutron and placement.
 https://opendev.org/openstack/neutron/commit/9a6766470ef127ee5495a5b74b7156bd5a80f03c

(*1) In a director deployment we expect gethostname() to return the short name. The method might be able to return the FQDN in a TLS-e deployment, but I've not yet confirmed that.

One possible workaround for the issue would be to set resource_provider_hypervisors to override the hypervisor names with FQDNs, and the customer confirmed that adding the following parameter solves the issue with resource_synced.
~~~
resource_provider_hypervisors=br-tenant:compute-0.redhat.local
~~~

However, IMO this is not really an ideal solution because it requires redundant configuration, with the bridge/device name described multiple times.
I think it's better to fix neutron to depend on the FQDN (or provide a way to use CONF.host instead of gethostname, if possible).

I hope the above information helps you with this topic, and I'd appreciate your thoughts about the issue reported.

Comment 1 Takashi Kajinami 2020-11-23 08:53:11 UTC

FYI. I proposed one patch[1] to neutron, which introduces a new option so that we can override the hypervisor hostname.
 [1] https://review.opendev.org/c/openstack/neutron/+/763563

I submitted this as a new feature, but it might be better to report a bug and resubmit this as a bug fix instead
so that we can backport the change to stable branches...
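
As a rough sketch of how such an override could behave: the option names `resource_provider_bandwidths`, `resource_provider_hypervisors`, and `resource_provider_default_hypervisor` are the ones mentioned in this report, but the `effective_hypervisors` helper, the sample values, and the precedence shown (explicit per-bridge entry, then the default override, then the detected hostname) are assumptions for illustration, not neutron's actual code.

```python
import configparser

# Hypothetical agent configuration fragments; the values are sample data.
BASE_CONF = "[ovs]\nresource_provider_bandwidths = br-tenant:10000000:10000000\n"
OVERRIDE = "resource_provider_default_hypervisor = compute-0.redhat.local\n"


def effective_hypervisors(conf_text, detected_hostname):
    """Map each bridge with a bandwidth entry to the hypervisor name used."""
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(conf_text)
    ovs = parser["ovs"]

    # Explicit '<bridge/device>:<hypervisor>' entries win when present.
    explicit = {}
    for item in ovs.get("resource_provider_hypervisors", "").split(","):
        if item.strip():
            device, hypervisor = item.strip().split(":", 1)
            explicit[device] = hypervisor

    # Assumed fallback order: the override option, then the detected name.
    default = ovs.get("resource_provider_default_hypervisor") or detected_hostname

    result = {}
    for item in ovs.get("resource_provider_bandwidths", "").split(","):
        if item.strip():
            device = item.strip().split(":")[0]
            result[device] = explicit.get(device, default)
    return result


with_override = effective_hypervisors(BASE_CONF + OVERRIDE, "compute-0")
without_override = effective_hypervisors(BASE_CONF, "compute-0")
print(with_override)     # {'br-tenant': 'compute-0.redhat.local'}
print(without_override)  # {'br-tenant': 'compute-0'}
```

With a single override option, the bridge name only has to appear once, in the bandwidth setting.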

Comment 3 smooney 2020-11-24 13:15:29 UTC

the name of the resource provider does not come from the host parameter in the nova.conf
it is the hypervisor_hostname which comes from the virt driver. in this case it comes from libvirt which internally calls gethostname in libc.

it should be the same value returned from the hostname command.

gethostname is the correct thing to call. you need to investigate why libvirt and python's gethostname are no longer returning the same value.

Comment 4 Takashi Kajinami 2020-11-24 13:30:47 UTC

Thanks Sean. I confirmed that we use hypervisor_hostname, which is obtained by the getHostname method,
so it doesn't depend on the host parameter in nova.conf. Sorry that I missed that part.

I checked the current behavior in my RHOSP16.1.2 deployment and confirmed that
libvirt detects the FQDN while hostname returns the short name.
AFAIK we have had the same behavior since RHOSP13, and I'm afraid there is something
wrong with our assumption about how libvirt obtains the hostname.

~~~
[heat-admin@compute-0 ~]$ sudo podman exec -it nova_libvirt virsh hostname
compute-0.redhat.local

[heat-admin@compute-0 ~]$ sudo podman exec -it nova_libvirt hostname
compute-0
[heat-admin@compute-0 ~]$ hostname
compute-0
[heat-admin@compute-0 ~]$ sudo podman ps | grep libvirt
79081b32e9be  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:16.1_20201020.1                kolla_start  4 days ago  Up 11 hours ago         nova_libvirt
56551b6cee23  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:16.1_20201020.1                kolla_start  4 days ago  Up 4 days ago           nova_virtlogd
[heat-admin@compute-0 ~]$ 
~~~

Comment 5 Takashi Kajinami 2020-11-24 13:49:52 UTC

If I understand the current implementation of libvirt correctly, libvirt uses gethostname to obtain the hostname.
It doesn't directly pass the value returned by gethostname, but it tries to convert it to FQDN by getaddr
if gethostname returns a short name (non-FQDN).
 https://github.com/libvirt/libvirt/blob/b67080b3451fced61fa92f2e445d325a4286fa5f/src/util/virutil.c#L469-L488

I checked a simpler implementation in Python and confirmed that it returns the FQDN.

>>> import socket
>>> hostname = socket.gethostname()
>>> hostname
'compute-0'
>>> socket.getaddrinfo(hostname, None, flags=socket.AI_CANONNAME)
[(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, 'compute-0.redhat.local', ('172.17.1.148', 0)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_DGRAM: 2>, 17, '', ('172.17.1.148', 0)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_RAW: 3>, 0, '', ('172.17.1.148', 0))]
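
The lookup described in this comment can be sketched as a small function. This only illustrates the steps as described here; it is not libvirt's or neutron's actual code. `canonical_hostname` is a hypothetical name, and falling back to the short name when resolution fails or yields a `localhost` name is an assumption.

```python
import socket


def canonical_hostname(gethostname=socket.gethostname,
                       getaddrinfo=socket.getaddrinfo):
    """Return an FQDN for this host when one can be resolved.

    The resolver functions are parameters only so the logic can be
    exercised without touching the real network configuration.
    """
    hostname = gethostname()
    if "." in hostname:
        # Already looks like an FQDN; use it as is.
        return hostname
    try:
        results = getaddrinfo(hostname, None, flags=socket.AI_CANONNAME)
    except socket.gaierror:
        # Assumption: fall back to the short name if resolution fails.
        return hostname
    canonname = results[0][3] if results else ""
    if not canonname or canonname.startswith("localhost"):
        # Assumption: a missing or localhost canonical name is not useful.
        return hostname
    return canonname


# Fake resolvers that mimic the compute node shown in this report.
fake_result = [(socket.AF_INET, socket.SOCK_STREAM, 6,
                "compute-0.redhat.local", ("172.17.1.148", 0))]
detected = canonical_hostname(
    gethostname=lambda: "compute-0",
    getaddrinfo=lambda host, port, flags=0: fake_result,
)
print(detected)  # compute-0.redhat.local
```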

Comment 6 Takashi Kajinami 2020-11-24 13:55:13 UTC

one correction.

s/but it tries to convert it to FQDN by getaddr/but it tries to convert it to FQDN by **getaddrinfo**/

Comment 7 smooney 2020-11-24 14:20:08 UTC

the libvirt behaviour or python behavior has changed

if i do

 virsh hostname i get numa-1


ubuntu@numa-1:/opt/repos/nova$ virsh hostname 
numa-1


which is also what i get from python
ubuntu@numa-1:/opt/repos/nova$ python -c "import socket; print(socket.gethostname())"
numa-1

if i do hostname -A i get my real fqdns
ubuntu@numa-1:/opt/repos/nova$ hostname -A
numa-1.cloud.seanmooney.info numa-1.cloud.seanmooney.info numa-1 numa-1 numa-1.cloud.seanmooney.info numa-1

but if i do hostname -f i get just numa-1

this is controlled by the content of /etc/hosts

if i change 

127.0.0.1 numa-1 localhost

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

104.130.246.32 review.opendev.org

to 

127.0.0.1 localhost numa-1

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

104.130.246.32 review.opendev.org

then 
hostname -f will return localhost.

looking at the customer's /etc/hosts, they have the fqdn listed before the host name

so that is likely the issue here
they should have

<ip> <hostname> <fqdn>

if they have them in that order then python and libvirt should both agree

has this changed in ooo recently?
i know ooo templates that file so that is likely the cause of the issue
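
The effect of the field order can be illustrated with a simplified model of an /etc/hosts lookup, in which the first name after the address is treated as the canonical name and the rest as aliases. `canonical_name_from_hosts` is a hypothetical helper, the `192.0.2.10` line is made-up sample data, and real resolver behavior also depends on nsswitch and DNS settings that this sketch ignores.

```python
def canonical_name_from_hosts(hosts_text, hostname):
    """Return the first name on the first hosts line that lists `hostname`.

    Each line is read as "<ip> <canonical name> [aliases...]", and '#'
    starts a comment.
    """
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and hostname in fields[1:]:
            return fields[1]
    return None


hostname_first = canonical_name_from_hosts("127.0.0.1 numa-1 localhost\n", "numa-1")
localhost_first = canonical_name_from_hosts("127.0.0.1 localhost numa-1\n", "numa-1")
fqdn_first = canonical_name_from_hosts("192.0.2.10 numa-1.example.org numa-1\n", "numa-1")

print(hostname_first)   # numa-1
print(localhost_first)  # localhost
print(fqdn_first)       # numa-1.example.org
```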

Comment 8 smooney 2020-11-24 14:20:55 UTC

by the way on an ooo deployment we should never hit the getaddrinfo part since
the hostnames are always set in /etc/hosts

Comment 9 Takashi Kajinami 2020-11-24 14:39:51 UTC

Ah, ok. I tried the same in my deployment and confirmed that updating /etc/hosts makes
the short name appear instead of the FQDN for virsh hostname.

I checked RHOSP10, 13, and 16.1 deployments, and in all of them we have the FQDN before the short name,
so I believe that we have kept the same order at least for recent versions.
~~~
[heat-admin@compute-0 ~]$ cat /etc/rhosp-release 
Red Hat OpenStack Platform release 10.0.10 (Newton)
[heat-admin@compute-0 ~]$ grep -r compute-0 /etc/hosts
172.17.1.16 compute-0.redhat.local compute-0
192.168.24.9 compute-0.external.redhat.local compute-0.external
172.17.1.16 compute-0.internalapi.redhat.local compute-0.internalapi
172.17.3.23 compute-0.storage.redhat.local compute-0.storage
192.168.24.9 compute-0.storagemgmt.redhat.local compute-0.storagemgmt
172.17.2.13 compute-0.tenant.redhat.local compute-0.tenant
192.168.24.9 compute-0.management.redhat.local compute-0.management
192.168.24.9 compute-0.ctlplane.redhat.local compute-0.ctlplane
~~~

Comment 10 Takashi Kajinami 2020-11-25 10:41:41 UTC

So let me summarize the current status, after a discussion with Sean.

 - The problem is that virsh hostname returns a different value
   from the one obtained by socket.gethostname().
   Currently "virsh hostname" returns the FQDN (thus resource providers
   use the FQDN for their name) while gethostname() returns the short name.

 - The current implementation in neutron (and cyborg) to get
   the default hypervisor name is based on the assumption that
   these two agree.

 - We confirmed that the value returned by "virsh hostname"
   is affected by the record in /etc/hosts.
   ~~~
   172.17.1.16 compute-0.redhat.local compute-0
   ~~~
   If I update /etc/hosts to have the short name BEFORE the FQDN like
   ~~~
   172.17.1.16 compute-0 compute-0.redhat.local
   ~~~
   then "virsh hostname" returns the short name instead of the FQDN.

 - In RHOSP10, 13, and 16.1 we have the FQDN before the short name
   in all records of /etc/hosts.

 - In RHOSP10 we use the FQDN for the OS hostname (which is written in /etc/hostname)
   but in RHOSP13 and 16.1 we use the short name.

I've reviewed the change log in libvirt but couldn't find a change which seems to
have changed the behavior of "virsh hostname".
So it is not yet confirmed whether the original assumption (which assumes that
socket.gethostname and "virsh hostname" agree) was correct in the beginning.

Comment 11 smooney 2020-11-25 15:34:15 UTC

it was
this assumption that they agree has been true across multiple distros and installers for years.
i would also not characterise it as an assumption but as a requirement.

if i install openstack with devstack or kolla on ubuntu or centos it still holds true today.

when ooo started using FQDNs, something which from a compute dfg point of view we did not want by the way, it started to change the behavior
of this, which has broken upgrades in the past, particularly with customers going from osp 7 to 10

openstack upstream expected that the hypervisor hostnames would actually be hostnames not fqdns
we allowed fqdns to be supported but we intended hostnames to be the primary mechanism.

the reason for this is nova, even if you use fqdn, does not support two compute hosts having the same hostname but different FQDNs
you can technically get it to work in some cases but if you do you are technically running in an unsupported configuration.

so the short hostname for all hosts must be unique. using the fqdn implies that is not required, which is why the compute dfg did not want this change.

in osp 10 when ooo wrote the FQDN into /etc/hostname it would cause the virsh hostname and socket.gethostname() to be the full fqdn so they matched.
setting the fqdn in /etc/hostname is technically an invalid use of that file, it works but that should only be set to the shortname,
https://www.freedesktop.org/software/systemd/man/hostname.html
this is the important point 
"The hostname may be a free-form string up to 64 characters in length; however, it is recommended that it consists only of 7-bit ASCII lower-case characters and no spaces or dots, and limits itself to the format allowed for DNS domain name labels, even though this is not a strict requirement."
because '.' dots are not meant to be used, it is technically legal to set the fqdn but it does not follow best practices and it may break some applications that expect it to be just the host name.
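
As a rough illustration of that recommendation, the check below accepts only a single lower-case DNS label. `is_recommended_hostname` is a hypothetical helper, and the pattern is one reasonable reading of "DNS domain name label" (letters, digits, and hyphens, at most 63 characters), not something taken from systemd.

```python
import re

# One DNS label: lower-case letters, digits and hyphens, 1 to 63 characters,
# not starting or ending with a hyphen.
_LABEL_RE = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")


def is_recommended_hostname(name):
    """Return True if `name` is a single lower-case DNS label (no dots)."""
    return _LABEL_RE.fullmatch(name) is not None


short_ok = is_recommended_hostname("compute-0")
fqdn_ok = is_recommended_hostname("compute-0.redhat.local")
print(short_ok)  # True
print(fqdn_ok)   # False
```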

in 13/16 since the fqdn is not set in /etc/hostname it uses /etc/hosts + the dhcp dns search domains to determine the fqdn.
/etc/hosts has precedence and if the host name is present on a line in /etc/hosts the first value is used for the hostname.


'hostname' and pythons socket.gethostname() should always be the same

sean@p50:~$ python -c "import socket; print(socket.gethostname());" && hostname
p50
p50

this is a requirement that predates the creation of ooo and it's an invariant we have to maintain or else we will break upgrades

if the output of 'hostname' and 'virsh hostname' differ then your system is misconfigured.

Comment 12 Takashi Kajinami 2020-11-26 12:18:32 UTC

Sorry, but my description might have been wrong.

> 'hostname' and pythons socket.gethostname() should always be the same
These two agree. What doesn't agree with them is "virsh hostname".
~~~
[heat-admin@compute-0 ~]$ hostname
compute-0
[heat-admin@compute-0 ~]$ python -c "import socket;print(socket.gethostname())"
compute-0
[heat-admin@compute-0 ~]$ sudo virsh hostname
compute-0.redhat.local

[heat-admin@compute-0 ~]$ grep compute-0 /etc/hosts
172.17.1.134 compute-0.redhat.local compute-0
172.17.3.129 compute-0.storage.redhat.local compute-0.storage
172.17.1.134 compute-0.internalapi.redhat.local compute-0.internalapi
172.17.2.87 compute-0.tenant.redhat.local compute-0.tenant
192.168.24.24 compute-0.ctlplane.redhat.local compute-0.ctlplane
[heat-admin@compute-0 ~]$ cat /etc/rhosp-release 
Red Hat OpenStack Platform release 13.0.12 (Queens)
~~~

We need to update the content of /etc/hosts like
~~~
172.17.1.134 compute-0 compute-0.redhat.local
~~~
to make "virsh hostname" return the short name.

This might be unexpected behavior from the perspective of nova/neutron, but I believe we need some solution in code.
As I mentioned during our previous discussions, TripleO has been using that "wrong" hosts configuration,
and it's not easy to fix it now to the expected state; we would need a very careful migration of resource provider
records from FQDN to short name.

I might have missed something but IIUC the format of /etc/hosts has not been very clearly enforced,
and I expect there are more deployments with a similar configuration, which causes inconsistencies
between "virsh hostname" and "socket.gethostname()".
These deployments are actually affected by the usage of socket.gethostname() to obtain the default hypervisor name.

Comment 19 Takashi Kajinami 2021-04-30 06:39:29 UTC

I have submitted a potential fix to neutron.
 https://review.opendev.org/c/openstack/neutron/+/788893

This looks like a bit of a hacky implementation, but it is supposed to have the same logic as libvirt,
unless I read the libvirt implementation incorrectly...

Comment 20 Takashi Kajinami 2021-04-30 06:41:43 UTC

copy-pasting what I posted on the launchpad bug.
 https://bugs.launchpad.net/neutron/+bug/1926693

As far as I've tested in an RHOSP16.1 deployment, the new logic returns the FQDN,
which is compatible with libvirt, while socket.gethostname(), which is currently used,
returns a short name.

~~~
[heat-admin@compute-0 ~]$ python
Python 3.6.8 (default, Dec 5 2019, 15:45:45)
[GCC 8.3.1 20191121 (Red Hat 8.3.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> socket.gethostname()
'compute-0'
>>> exit()
[heat-admin@compute-0 ~]$ sudo podman exec -it nova_libvirt virsh hostname
compute-0.redhat.local
~~~

~~~
[heat-admin@compute-0 ~]$ python
Python 3.6.8 (default, Dec 5 2019, 15:45:45)
[GCC 8.3.1 20191121 (Red Hat 8.3.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> socket.gethostname()
'compute-0'
>>> socket.getaddrinfo(host=socket.gethostname(), port=None, family=socket.AF_UNSPEC, flags=socket.AI_CANONNAME)[0][3]
'compute-0.redhat.local'
~~~

Comment 21 smooney 2021-05-04 15:14:23 UTC

you are not using a valid 16.1 deployment

from the output you are using rhel 8.3. rhel 8.2 is the only supported rhel version for 16.1.

you might be hitting https://bugzilla.redhat.com/show_bug.cgi?id=1949385
with 16.2, until that is pulled into a new compose the undercloud was mis-configured which resulted in the overcloud host
having the incorrect name.

but before we change neutron to work around this can you provide the /etc/hosts and /etc/hostname values you have.

if the value of socket.gethostname() has changed between one rhel version and another it will break nova in other ways so i dont think its valid to work around this in neutron.
it will just mask the issue but the deployment will still be incorrect.

Comment 22 Takashi Kajinami 2021-05-04 16:11:29 UTC

I agree that the "Red Hat 8.3.1-5" shown is very strange,
but it is definitely an RHOSP16.1 env with RHEL8.2
~~~
[heat-admin@compute-0 ~]$ cat /etc/redhat-release 
Red Hat Enterprise Linux release 8.2 (Ootpa)
[heat-admin@compute-0 ~]$ cat /etc/rhosp-release 
Red Hat OpenStack Platform release 16.1.5 GA (Train)
[heat-admin@compute-0 ~]$ rpm -q platform-python
platform-python-3.6.8-23.el8.x86_64
~~~

> you might be hitting https://bugzilla.redhat.com/show_bug.cgi?id=1949385
If I read that bug correctly, it causes an inconsistency between the hostnames (especially the domain names) used,
but I don't see any inconsistency as far as I've checked.
~~~
(overcloud) [stack@undercloud-0 ~]$ openstack compute service list
+--------------------------------------+----------------+---------------------------+----------+---------+-------+----------------------------+
| ID                                   | Binary         | Host                      | Zone     | Status  | State | Updated At                 |
+--------------------------------------+----------------+---------------------------+----------+---------+-------+----------------------------+
...
| 984ec96a-b338-4236-9797-fe27e88609cd | nova-compute   | compute-0.redhat.local    | nova     | enabled | up    | 2021-05-04T16:05:28.000000 |
| dd60d897-2d0d-42e7-884f-8a3ecd0afdbe | nova-compute   | compute-1.redhat.local    | nova     | enabled | up    | 2021-05-04T16:05:28.000000 |
+--------------------------------------+----------------+---------------------------+----------+---------+-------+----------------------------+
(overcloud) [stack@undercloud-0 ~]$ nova hypervisor-list
+--------------------------------------+------------------------+-------+---------+
| ID                                   | Hypervisor hostname    | State | Status  |
+--------------------------------------+------------------------+-------+---------+
| 9b5c642f-2dd6-4641-b820-b9076e3b92e8 | compute-0.redhat.local | up    | enabled |
| 064b8ed5-9dc3-40de-8729-9ed065e67061 | compute-1.redhat.local | up    | enabled |
+--------------------------------------+------------------------+-------+---------+
(overcloud) [stack@undercloud-0 ~]$ openstack resource provider list
+--------------------------------------+------------------------+------------+
| uuid                                 | name                   | generation |
+--------------------------------------+------------------------+------------+
| 9b5c642f-2dd6-4641-b820-b9076e3b92e8 | compute-0.redhat.local |          6 |
| 064b8ed5-9dc3-40de-8729-9ed065e67061 | compute-1.redhat.local |          6 |
+--------------------------------------+------------------------+------------+

[heat-admin@compute-0 ~]$ hostname -f
compute-0.redhat.local
[heat-admin@compute-0 ~]$ sudo grep -r ^host /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf 
host=compute-0.redhat.local
~~~

> if the value of socket.gethostname() has changed between one rhel version and another
> it will break nova in other ways so i dont think its valid to work around this in neutron.
I've rechecked the implementation in nova, and IIUC nova doesn't rely on socket.gethostname() but
on the get_hostname() function in libvirt to determine the resource provider name.
Since the logic in neutron is supposed to determine the compute resource provider name,
I believe we should fix the logic to make it compatible with libvirt, as I proposed in the patch...

Comment 23 smooney 2021-05-05 14:45:36 UTC

can you provide the output of /etc/hostname and /etc/hosts

Comment 24 smooney 2021-05-05 14:48:54 UTC

we do not rely on it for the placement RP name, correct. we use the Hypervisor hostname which comes from libvirt for the RP name

but we rely on socket.gethostname() for the default value of host in the nova.conf which is used in the compute service list

libvirt hardcodes it to the server canonical FQDN but in general it is normally set to socket.gethostname()

because ooo overrides that behavior we also expect them to ensure that whatever they set host to is what is returned by both libvirt and socket.gethostname().

Comment 26 Takashi Kajinami 2021-05-05 15:09:05 UTC

~~~
[heat-admin@compute-0 ~]$ cat /etc/hostname
compute-0
[heat-admin@compute-0 ~]$ cat /etc/hosts
# BEGIN ANSIBLE MANAGED BLOCK
172.17.1.17 compute-0.redhat.local compute-0
172.17.3.33 compute-0.storage.redhat.local compute-0.storage
172.17.1.17 compute-0.internalapi.redhat.local compute-0.internalapi
172.17.2.88 compute-0.tenant.redhat.local compute-0.tenant
192.168.24.38 compute-0.ctlplane.redhat.local compute-0.ctlplane
172.17.1.120 compute-1.redhat.local compute-1
172.17.3.93 compute-1.storage.redhat.local compute-1.storage
172.17.1.120 compute-1.internalapi.redhat.local compute-1.internalapi
172.17.2.50 compute-1.tenant.redhat.local compute-1.tenant
192.168.24.8 compute-1.ctlplane.redhat.local compute-1.ctlplane
172.17.1.57 controller-0.redhat.local controller-0
172.17.3.25 controller-0.storage.redhat.local controller-0.storage
172.17.4.145 controller-0.storagemgmt.redhat.local controller-0.storagemgmt
172.17.1.57 controller-0.internalapi.redhat.local controller-0.internalapi
172.17.2.110 controller-0.tenant.redhat.local controller-0.tenant
10.0.0.148 controller-0.external.redhat.local controller-0.external
192.168.24.15 controller-0.ctlplane.redhat.local controller-0.ctlplane
172.17.1.52 controller-1.redhat.local controller-1
172.17.3.137 controller-1.storage.redhat.local controller-1.storage
172.17.4.120 controller-1.storagemgmt.redhat.local controller-1.storagemgmt
172.17.1.52 controller-1.internalapi.redhat.local controller-1.internalapi
172.17.2.124 controller-1.tenant.redhat.local controller-1.tenant
10.0.0.126 controller-1.external.redhat.local controller-1.external
192.168.24.25 controller-1.ctlplane.redhat.local controller-1.ctlplane
172.17.1.87 controller-2.redhat.local controller-2
172.17.3.50 controller-2.storage.redhat.local controller-2.storage
172.17.4.63 controller-2.storagemgmt.redhat.local controller-2.storagemgmt
172.17.1.87 controller-2.internalapi.redhat.local controller-2.internalapi
172.17.2.76 controller-2.tenant.redhat.local controller-2.tenant
10.0.0.138 controller-2.external.redhat.local controller-2.external
192.168.24.54 controller-2.ctlplane.redhat.local controller-2.ctlplane

192.168.24.1 undercloud-0.ctlplane.redhat.local undercloud-0.ctlplane
192.168.24.45  overcloud.ctlplane.localdomain
172.17.3.51  overcloud.storage.localdomain
172.17.4.21  overcloud.storagemgmt.localdomain
172.17.1.148  overcloud.internalapi.localdomain
10.0.0.141  overcloud.localdomain
# END ANSIBLE MANAGED BLOCK
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
[heat-admin@compute-0 ~]$ cat /etc/redhat-release 
Red Hat Enterprise Linux release 8.2 (Ootpa)
[heat-admin@compute-0 ~]$ cat /etc/rhosp-release 
Red Hat OpenStack Platform release 16.1.5 GA (Train)
~~~

Comment 32 smooney 2021-05-05 18:41:12 UTC

actually this is http://lists.openstack.org/pipermail/openstack-discuss/2019-November/011044.html

we should implement the solution that is described there to fix this, where neutron uses the hypervisor api to find the uuid.

Comment 33 smooney 2021-05-05 19:09:15 UTC

*** Bug 1952073 has been marked as a duplicate of this bug. ***

Comment 34 smooney 2021-05-06 14:55:24 UTC

i have just looked at both a 13 env and 16 env.

as it stands ooo is not correctly configuring neutron with the resource_provider_hypervisors map.
i think that we should have ooo template that out by default so that it maps the hostname to the FQDN
that will correct this issue and allow neutron to find the placement RP.

long term we should still likely explore other options but in the short term this is likely the best approach.

Comment 36 Takashi Kajinami 2021-05-06 15:15:31 UTC

@Sean


Thank you for your input and for bringing this topic up in irc.

Using resource_provider_hypervisors is a solution currently available, but I have concerns with that approach,
especially from a UX perspective...
That approach always requires users to define the same bridges in both resource_provider_hypervisors
and resource_provider_bandwidths.
From downstream's perspective we can implement logic in TripleO so that it parses
the resource_provider_bandwidths parameter to generate the resource_provider_hypervisors parameter
automatically, but it brings undesired complexity here.

If the proposed patch is not acceptable (I think so as per our discussion so far) and it takes time
until we implement that long term solution, I believe adding a single option to override the default
hypervisor name still helps users (and TripleO) to solve the inconsistency much more easily.

If that makes sense then I'll restore my previous patch as a short term solution, but I'd like to hear
some feedback from you and Slawek since you once disagreed with that approach...
 https://review.opendev.org/c/openstack/neutron/+/763563

Comment 37 smooney 2021-05-06 18:43:21 UTC

i agree with you that the UX of using resource_provider_hypervisors is not great
but could that not be mitigated by updating ooo to always template it for us automatically,
so that the end user does not have to set it?

i dont think we need to parse the resource_provider_bandwidths parameter to generate the resource_provider_hypervisors parameter

since ooo is generating the contents of /etc/hosts, /etc/hostname and setting the [DEFAULT]/host value in the configs already
it knows what the shortname and FQDN are for each host so it can just populate resource_provider_hypervisors with the shortname to fqdn mapping.

from osp10/newton ish ooo has been using FQDN for [DEFAULT]/host and for the hypervisor hostname.
so we know that in an ooo deployment the FQDN will be the RP name and it will match what is in [DEFAULT]/host
so ooo can just encode that in the neutron config.

Comment 38 smooney 2021-05-06 18:46:48 UTC

ah resource_provider_bandwidths is <network_device>:<hypervisor> not hostname to compute node name.
i think ooo can still template that without parsing based on the network info it has.
i can see why you would like https://review.opendev.org/c/openstack/neutron/+/763563 a little more but im still unsure if neutron should be doing that.

Comment 39 Takashi Kajinami 2021-05-08 05:13:45 UTC

> ah resource_provider_bandwidths is <network_device>:<hypervisor> not hostname to compute node name.
> i think ooo can still template that without parsing based on the network info it has.

Currently the NeutronOvsResourceProviderBandwidths parameter, which sets the resource_provider_bandwidths parameter,
takes an array of '<bridge/device>:<egress_bw>:<ingress_bw>'.

So to set resource_provider_hypervisors (which takes a list of <bridge/device>:<hypervisor>) correctly,
we would have to implement something like:

[ [ bw.split(':').[0], cname ].join(':') for bw in NeutronOvsResourceProviderBandwidths ].join(',')

The above example is written like Python code, but ideally we should implement this in the tht layer using yaql.
I'm not sure how we can implement such logic cleanly.
Another option is to implement the same in the puppet-tripleo layer, but that still leaves
such a dirty implementation in puppet.

Since the problem here is that the default hypervisor name neutron is guessing doesn't match
the real hypervisor name in the deployment, I still feel that somehow fixing that wrong default
would be a more direct and clean approach...

Comment 40 Takashi Kajinami 2021-05-08 05:16:05 UTC

> [ [ bw.split(':').[0], cname ].join(':') for bw in NeutronOvsResourceProviderBandwidths ].join(',')

The last join is currently implemented in puppet-neutron so what we need is the remaining part.

[ [ bw.split(':').[0], cname ].join(':') for bw in NeutronOvsResourceProviderBandwidths ]

Though this still looks too tricky to be implemented by yaml...
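
For reference, the transformation sketched in the pseudocode above could look like this in plain Python. This only illustrates the mapping itself; `hypervisors_from_bandwidths` is a hypothetical helper and the bandwidth numbers are made up. It is not a proposal for how the template or puppet layer should implement it.

```python
def hypervisors_from_bandwidths(bandwidths, hypervisor):
    """Build '<bridge/device>:<hypervisor>' entries from bandwidth entries.

    Each bandwidth entry has the form
    '<bridge/device>:<egress_bw>:<ingress_bw>'.
    """
    return [":".join([bw.split(":")[0], hypervisor]) for bw in bandwidths]


# Sample values for illustration only.
bandwidths = ["br-tenant:10000000:10000000", "br-ex:5000000:5000000"]
entries = hypervisors_from_bandwidths(bandwidths, "compute-0.redhat.local")
print(entries)
print(",".join(entries))  # the final join that is left to the puppet layer
```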

Comment 50 Alex Katz 2021-08-10 16:14:25 UTC

Verified on RHOS-16.1-RHEL-8-20210804.n.0

Comment 72 errata-xmlrpc 2021-12-09 20:17:24 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.7 (Train) bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3762