Created attachment 1846907 [details]
openshift install log

Version:

$ openshift-install version
./openshift-install 4.8.24
built from commit 7123680a2275e9f6f33a0a325ab61a129a54c2af
release image quay.io/openshift-release-dev/ocp-release@sha256:0708475f51e969dd9e6902d958f8ffed668b1b9c8d63b6241e7c9e40d9548eee

Platform: OpenStack

Please specify:
* IPI

What happened?

The cluster is unable to create the worker nodes when using the baremetal service on the overcloud. openshift-install.log is attached.

The afterburn-hostname service is failing on the worker nodes:

[core@ostest-lb9s6-worker-0-6ltt2 ~]$ journalctl -u afterburn-hostname
-- Logs begin at Tue 2021-11-30 20:09:57 UTC, end at Tue 2021-11-30 20:18:09 UTC. --
Nov 30 20:17:12 localhost systemd[1]: Starting Afterburn Hostname...
Nov 30 20:17:54 ostest-lb9s6-worker-0-6ltt2 afterburn[1650]: Nov 30 20:17:54.283 INFO Fetching http://169.254.169.254/latest/meta-data/hostname: Attempt #11
Nov 30 20:17:54 ostest-lb9s6-worker-0-6ltt2 afterburn[1650]: Nov 30 20:17:54.285 INFO Failed to fetch: 503 Service Unavailable
Nov 30 20:17:54 ostest-lb9s6-worker-0-6ltt2 afterburn[1650]: Error: failed to run
Nov 30 20:17:54 ostest-lb9s6-worker-0-6ltt2 afterburn[1650]: Caused by: writing hostname
Nov 30 20:17:54 ostest-lb9s6-worker-0-6ltt2 afterburn[1650]: Caused by: maximum number of retries (10) reached
Nov 30 20:17:54 ostest-lb9s6-worker-0-6ltt2 afterburn[1650]: Caused by: failed to fetch: 503 Service Unavailable
Nov 30 20:17:54 ostest-lb9s6-worker-0-6ltt2 systemd[1]: afterburn-hostname.service: Main process exited, code=exited, status=1/FAILURE
Nov 30 20:17:54 ostest-lb9s6-worker-0-6ltt2 systemd[1]: afterburn-hostname.service: Failed with result 'exit-code'.
Nov 30 20:17:54 ostest-lb9s6-worker-0-6ltt2 systemd[1]: Failed to start Afterburn Hostname.
Nov 30 20:17:54 ostest-lb9s6-worker-0-6ltt2 systemd[1]: afterburn-hostname.service: Consumed 23ms CPU time

The metadata IP itself is reachable:

[core@ostest-lb9s6-worker-0-6ltt2 ~]$ ping 169.254.169.254
PING 169.254.169.254 (169.254.169.254) 56(84) bytes of data.
64 bytes from 169.254.169.254: icmp_seq=1 ttl=64 time=1.15 ms
64 bytes from 169.254.169.254: icmp_seq=2 ttl=64 time=0.635 ms

What did you expect to happen?

The afterburn-hostname service to run successfully.

How to reproduce it (as minimally and precisely as possible)?

1. Deploy an OSP 16.1 overcloud with OVN-DVR and the baremetal service (Ironic).
2. Run an OpenShift cluster installation with baremetal workers.

Anything else we need to know?

The kubelet hits the same 503:

Dec 14 16:19:51 ostest-rhk55-worker-0-rkqc8 hyperkube[15277]: E1214 16:19:51.719552 15277 kubelet.go:1383] "Kubelet failed to get node info" err="unexpected status code when reading instance type from http://169.254.169.254/2009-04-04/meta-data/instance-type: 503 Service Unavailable"
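For reference, the failing fetch can be reproduced by hand with curl. This is a minimal sketch (the `probe` helper is made up for this comment) that prints only the HTTP status code the metadata endpoint returns:

```shell
# Hypothetical helper: print only the HTTP status code for a URL, with a
# short timeout so a dead endpoint does not hang the probe.
probe() {
    code=$(curl -s -m 5 -o /dev/null -w '%{http_code}' "$1") || true
    echo "${code:-000}"
}

# The same EC2-compatible paths that afterburn-hostname and the kubelet
# poll; on the affected workers both return 503 even though ICMP works.
probe http://169.254.169.254/latest/meta-data/hostname
probe http://169.254.169.254/2009-04-04/meta-data/instance-type
```

A `000` here means no HTTP response at all (connection failure), which distinguishes a routing problem from the 503 the proxy is actually returning.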
Hi Udi, if the nova-metadata service returns a 503 error, you should check the nova-metadata logs and paste them somewhere in the BZ. This is also likely an OpenStack bug, not an OpenShift one.
Tentatively setting blocker- because, even if this turns out to be a valid bug, it is not likely to be a regression.
(In reply to Martin André from comment #1)
> Hi Udi, if nova-metadata service returns a 503 error you should check the
> nova-metadata logs and paste them somewhere in the BZ.
> This would also likely be an openstack bug, not an openshift one.

Hi Martin,

Uploaded the debug log from nova-metadata-api.log on controller-2, which holds the metadata IP that returns the 503.

######### metadata route from the instance #########

[core@ostest-rhk55-worker-0-rkqc8 ~]$ ip route
default via 172.27.7.1 dev enp6s0f1 proto dhcp metric 101
169.254.169.254 via 172.27.7.154 dev enp6s0f1 proto dhcp metric 101
172.27.7.0/24 dev enp6s0f1 proto kernel scope link src 172.27.7.158 metric 101

######### dhcp namespace info #########

[root@controller-2 ~]# ip netns exec qdhcp-d544024c-3827-4a19-b393-d74b84733e84 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
17: tapf15afd62-b9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:c7:ef:61 brd ff:ff:ff:ff:ff:ff
    inet 169.254.169.254/16 brd 169.254.255.255 scope global tapf15afd62-b9
       valid_lft forever preferred_lft forever
    inet 172.27.7.154/24 brd 172.27.7.255 scope global tapf15afd62-b9
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fec7:ef61/64 scope link
       valid_lft forever preferred_lft forever
More information:

(shiftstack) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+-----------------------------+--------+---------------------------+--------------------+--------+
| ID                                   | Name                        | Status | Networks                  | Image              | Flavor |
+--------------------------------------+-----------------------------+--------+---------------------------+--------------------+--------+
| bf942e66-6a17-41cc-ac41-ec76792d547e | ostest-rhk55-worker-0-tfbc7 | ACTIVE | provisioning=172.27.7.184 | ostest-rhk55-rhcos |        |
| 44031de0-5916-447a-a5a2-07f341031d63 | ostest-rhk55-worker-0-rkqc8 | ACTIVE | provisioning=172.27.7.158 | ostest-rhk55-rhcos |        |
| 2ac64888-47ed-4380-abda-37bbabf83fb3 | ostest-rhk55-master-2       | ACTIVE | provisioning=172.27.7.196 | ostest-rhk55-rhcos |        |
| 1a3293c4-e515-4ae9-9fcb-7fafa662be5f | ostest-rhk55-master-1       | ACTIVE | provisioning=172.27.7.183 | ostest-rhk55-rhcos |        |
| 5f119850-c9d5-46ab-8c85-1ffa24101780 | ostest-rhk55-master-0       | ACTIVE | provisioning=172.27.7.200 | ostest-rhk55-rhcos |        |
+--------------------------------------+-----------------------------+--------+---------------------------+--------------------+--------+

A few summary points:

* The *master* servers are virtual, CoreOS based, running on the OSP compute nodes, and are using ovn-dhcp as the forwarding agent to the nova-metadata agent.
* The *worker* servers are baremetal, CoreOS based, and are using neutron-dhcp as the forwarding agent to the nova-metadata agent.
* We can reach the workers, as they are provisioned using Ignition (image-based metadata), but the afterburn-hostname service is failing during worker startup as shown in the initial description.
Seems like kubelet has been written to only check the API endpoint, which may or may not be available with a physical baremetal deployment, instead of checking whether local network metadata was passed through a configuration drive. That seems like a deficiency in kubelet. For bare metal we often recommend local configuration metadata via config drives instead of the metadata proxy Neutron may offer, or the metadata service Nova may launch. Ultimately, those two services support different use cases, and it seems you're hitting a proxy that just doesn't know what to make of the request. But realistically, that same information *should* be on the local disk in a config-2 partition. You should validate that it is present.
Err, minor correction: hyperkube and afterburn. It looks like afterburn does have code to read configuration drives, so maybe this is just a configuration issue?
The config drive is indeed there:

```
# lsblk -o NAME,LABEL,SIZE,RO,TYPE,MOUNTPOINT
NAME   LABEL        SIZE RO TYPE MOUNTPOINT
sda               894.3G  0 disk
|-sda1                1M  0 part
|-sda2 EFI-SYSTEM   127M  0 part
|-sda3 boot         384M  0 part /boot
|-sda4 root       893.7G  0 part /sysroot
`-sda5 config-2    64.3M  0 part
sr0                1024M  0 rom
```
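For completeness, once the config-2 partition is mounted (e.g. `mount -o ro /dev/sda5 /mnt`), the OpenStack metadata lives at openstack/latest/meta_data.json. A sketch with a hypothetical, abbreviated sample of that file, showing the hostname field afterburn needs:

```shell
# Hypothetical, abbreviated meta_data.json as a config drive would carry it
# (uuid/hostname borrowed from the worker in this report for illustration).
cat > /tmp/meta_data.json <<'EOF'
{"uuid": "44031de0-5916-447a-a5a2-07f341031d63",
 "hostname": "ostest-rhk55-worker-0-rkqc8",
 "name": "ostest-rhk55-worker-0-rkqc8"}
EOF

# Extract the hostname the same way a config-drive consumer would.
python3 -c 'import json; print(json.load(open("/tmp/meta_data.json"))["hostname"])'
```

If that field is present on disk, the hostname part of the problem should be solvable without the metadata API at all.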
The question then is likely: why are the services trying to reach the API directly? Unfortunately, I suspect that is a question for the OpenShift folks.
The linked machine-config-operator patch makes Afterburn use the `openstack` provider, which[1] tries configdrive before falling back to the metadata service. [1]: https://github.com/coreos/afterburn/blob/main/src/providers/openstack/mod.rs
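A minimal sketch of that ordering (the real logic lives in the Rust source linked above; the helper name here is made up): the provider looks for the config-2 device first and only falls back to the HTTP metadata service when it is absent.

```shell
# Hypothetical sketch of the openstack provider's source selection; the
# device label matches the config-2 partition shown by lsblk in comment #9.
metadata_source() {
    drive="${1:-/dev/disk/by-label/config-2}"
    if [ -e "$drive" ]; then
        echo "config-drive"
    else
        echo "metadata-service"
    fi
}

metadata_source
```

So with the patch in place, a worker with a valid config drive should never need the 503-returning proxy for its hostname.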
Removing the Triaged keyword because: * the QE automation assessment (flag qe_test_coverage) is missing
The bug still exists. It seems that there is an issue getting the metadata from OpenStack. Failing QA.
Indeed, resorting to the config drive for setting the hostname didn't resolve the other dependencies on the Nova metadata. With the hope of an easy fix now lost, we should investigate the lack of connectivity between the bare metal instance and the Nova metadata service. I propose asking for some help from the OpenStack territory.
Agree. Since connectivity is OK, moving this one to Compute to understand why we are getting a 503 response...
Please let me summarize the issue: a config drive is provided, so nova-metadata should normally not be in the loop. But for some reason it fell back from the config drive to the metadata API? And the request is to find the cause of the 503 errors from nova-metadata? Please provide sosreports with Nova DEBUG logs for the involved nodes.
The config drive is only used for the boot of the instances. There are services inside the openshift cluster that still need to use the metadata API. The same flow applies to the master nodes (which are virt instances), and they are able to reach both the config drive and the metadata API. The nodes that are failing on the metadata API are the baremetal instances. The request is to find out why the baremetal nodes are getting 503 from the metadata API.
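To narrow down the 503 on the Nova side, one approach is to filter the metadata API log for the failing worker's fixed IP. The log line below is hypothetical (shaped after a typical nova.api.metadata.handler entry); on an OSP 16 controller the real file lives under /var/log/containers/nova/:

```shell
# Hypothetical sample entry, shaped after a nova metadata handler log line.
cat > /tmp/nova-metadata-api.sample <<'EOF'
2021-12-14 16:19:51.719 15 INFO nova.api.metadata.handler [-] 172.27.7.158 "GET /2009-04-04/meta-data/instance-type HTTP/1.1" status: 503 len: 176 time: 0.0213
EOF

# Filter the log down to requests coming from the failing worker's fixed IP
# (172.27.7.158 is ostest-rhk55-worker-0-rkqc8 in the server list above).
grep '172.27.7.158' /tmp/nova-metadata-api.sample
```

The surrounding DEBUG-level lines for those requests should say whether Nova failed to map the forwarded address/port back to an instance.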
Fix included in https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=2026360

- [OVN] Add baremetal support with Neutron DHCP agent (rhbz#2033953)
...

Quick reminder: in order for it to work we also need:
1) To add the OVNMetadataAgent to the controller templates.
2) To add OVNCMSOptions: "enable-chassis-as-gw" to the controller templates.

This is how Neutron/OVN knows where to schedule the external ports for the baremetal nodes.
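A hedged sketch of what those two reminders could look like in a TripleO environment file. The resource-registry path and parameter nesting are assumptions based on standard tripleo-heat-templates conventions, not taken from this BZ, so double-check them against your OSP release:

```shell
# Write a hypothetical environment file that enables the OVN metadata agent
# on the controllers and marks them as gateway chassis for external ports.
cat > /tmp/ovn-baremetal-env.yaml <<'EOF'
resource_registry:
  # Assumed THT path for the OVN metadata agent service.
  OS::TripleO::Services::OVNMetadataAgent: /usr/share/openstack-tripleo-heat-templates/deployment/ovn/ovn-metadata-container-puppet.yaml

parameter_defaults:
  ControllerParameters:
    # Lets Neutron/OVN schedule external (baremetal) ports on controllers.
    OVNCMSOptions: "enable-chassis-as-gw"
EOF
```

The file would then be passed to the overcloud deploy command with an extra `-e /tmp/ovn-baremetal-env.yaml`.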
*** Bug 1892000 has been marked as a duplicate of this bug. ***