Bug 2033953
| Field | Value | Field | Value |
| --- | --- | --- | --- |
| Summary | [OSP 16.1] [OVN-DVR] [Overcloud BM] Baremetal worker getting 503 from ovs-metadata agent | | |
| Product | Red Hat OpenStack | Reporter | Udi Shkalim <ushkalim> |
| Component | python-networking-ovn | Assignee | Lucas Alvares Gomes <lmartins> |
| Status | CLOSED CURRENTRELEASE | QA Contact | Eran Kuris <ekuris> |
| Severity | medium | Docs Contact | |
| Priority | high | | |
| Version | 16.1 (Train) | CC | aos-bugs, apevec, bdobreli, cjanisze, dasmith, eduen, eglynn, ekuris, imatza, jhakimra, jkreger, jlibosva, joflynn, kchamart, ldenny, lhh, lmartins, majopela, m.andre, mdemaced, pprinett, sbauza, scohen, sgordon, stephenfin, vromanso |
| Target Milestone | z9 | Keywords | TestBlocker, Triaged |
| Target Release | 16.1 (Train on RHEL 8.2) | | |
| Hardware | Unspecified | | |
| OS | Unspecified | | |
| Whiteboard | | | |
| Fixed In Version | python-networking-ovn-7.3.1-1.20220525113339.4e24f4c.el8ost | Doc Type | Bug Fix |
| Doc Text | Before this update, the machine-config-operator passed an afterburn systemd unit to new machines that set the hostname based on the user data passed through the Compute service (nova) metadata service. In some cases, for example bare metal, the instance did not have connectivity to the Compute service metadata. With this update, the afterburn systemd unit attempts to fetch data from the configdrive first and then falls back to the Compute service metadata service. The hostname of instances is set irrespective of the reachability of the Compute service metadata service. | | |
| Story Points | --- | | |
| Clone Of | | | |
| | 2083120 | Environment | |
| Last Closed | 2023-06-19 09:18:01 UTC | Type | Bug |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | --- | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | | |
| Bug Depends On | | | |
| Bug Blocks | 2041364, 2083120 | | |
| Attachments | | | |
Description
Udi Shkalim
2021-12-19 08:17:21 UTC
Hi Udi, if the nova-metadata service returns a 503 error you should check the nova-metadata logs and paste them somewhere in the BZ. This would also likely be an OpenStack bug, not an OpenShift one. Tentatively setting as blocker- because, in case it's a valid bug, it's not likely to be a regression.

(In reply to Martin André from comment #1)
> Hi Udi, if nova-metadata service returns a 503 error you should check the
> nova-metadata logs and paste them somewhere in the BZ.
> This would also likely be an openstack bug, not an openshift one.

Hi Martin,
I uploaded the debug log from nova-metadata-api.log on controller-2, which holds the metadata IP that returns 503.

######### metadata route from the instance: ################

```
[core@ostest-rhk55-worker-0-rkqc8 ~]$ ip route
default via 172.27.7.1 dev enp6s0f1 proto dhcp metric 101
169.254.169.254 via 172.27.7.154 dev enp6s0f1 proto dhcp metric 101
172.27.7.0/24 dev enp6s0f1 proto kernel scope link src 172.27.7.158 metric 101
```

######### dhcp namespace info: ##################

```
[root@controller-2 ~]# ip netns exec qdhcp-d544024c-3827-4a19-b393-d74b84733e84 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
17: tapf15afd62-b9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:c7:ef:61 brd ff:ff:ff:ff:ff:ff
    inet 169.254.169.254/16 brd 169.254.255.255 scope global tapf15afd62-b9
       valid_lft forever preferred_lft forever
    inet 172.27.7.154/24 brd 172.27.7.255 scope global tapf15afd62-b9
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fec7:ef61/64 scope link
       valid_lft forever preferred_lft forever
```

More information:

```
(shiftstack) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+-----------------------------+--------+---------------------------+--------------------+--------+
| ID                                   | Name                        | Status | Networks                  | Image              | Flavor |
+--------------------------------------+-----------------------------+--------+---------------------------+--------------------+--------+
| bf942e66-6a17-41cc-ac41-ec76792d547e | ostest-rhk55-worker-0-tfbc7 | ACTIVE | provisioning=172.27.7.184 | ostest-rhk55-rhcos |        |
| 44031de0-5916-447a-a5a2-07f341031d63 | ostest-rhk55-worker-0-rkqc8 | ACTIVE | provisioning=172.27.7.158 | ostest-rhk55-rhcos |        |
| 2ac64888-47ed-4380-abda-37bbabf83fb3 | ostest-rhk55-master-2       | ACTIVE | provisioning=172.27.7.196 | ostest-rhk55-rhcos |        |
| 1a3293c4-e515-4ae9-9fcb-7fafa662be5f | ostest-rhk55-master-1       | ACTIVE | provisioning=172.27.7.183 | ostest-rhk55-rhcos |        |
| 5f119850-c9d5-46ab-8c85-1ffa24101780 | ostest-rhk55-master-0       | ACTIVE | provisioning=172.27.7.200 | ostest-rhk55-rhcos |        |
+--------------------------------------+-----------------------------+--------+---------------------------+--------------------+--------+
```

A few summary points:

* The *master* servers are virtual, CoreOS based, running on the OSP computes, and use ovn-dhcp as the forwarding agent to the nova-metadata agent.
* The *worker* servers are baremetal, CoreOS based, and use neutron-dhcp as the forwarding agent to the nova-metadata agent.
* We can reach the workers as they are using Ignition (image-based metadata service); the afterburn-hostname service is failing during worker startup, as shown in the initial description.
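For reference, the failure can be checked from an affected node by querying the metadata endpoint directly. This is an illustrative sketch using the standard OpenStack metadata path, not a command taken from the logs above:

```
# Illustrative check from a worker node (standard OpenStack metadata path).
# A healthy node returns HTTP/1.1 200 OK; the failing baremetal workers
# are expected to return HTTP/1.1 503 Service Unavailable.
curl -si http://169.254.169.254/openstack/latest/meta_data.json | head -n 1
```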
Seems like kubelet has been written to only check the API endpoint, which may or may not be available with a physical baremetal deployment, instead of checking whether there is local network metadata passed through a configuration drive. Seems like a deficiency in kubelet. We often recommend local configuration metadata with config drives for bare metal instead of the metadata proxy Neutron may offer, or the metadata service Nova may launch. Ultimately, those two services support different use cases, and it seems you're hitting a proxy, but it just doesn't know what to make of the request. Realistically, though, that same information *should* be on the local disk in a config-2 partition. You should validate that it is present.

Err, minor correction: hyperkube and afterburn.

So it looks like afterburn does have code to read configuration drives, so maybe this is just a configuration issue?

The configdrive is there indeed:

```
# lsblk -o NAME,LABEL,SIZE,RO,TYPE,MOUNTPOINT
NAME   LABEL        SIZE RO TYPE MOUNTPOINT
sda               894.3G  0 disk
|-sda1                1M  0 part
|-sda2 EFI-SYSTEM   127M  0 part
|-sda3 boot         384M  0 part /boot
|-sda4 root       893.7G  0 part /sysroot
`-sda5 config-2    64.3M  0 part
sr0                1024M  0 rom
```

The question is likely: why are the services directly trying to reach out to the API then? Unfortunately, I suspect that is a question for the OpenShift folks.

The linked machine-config-operator patch makes Afterburn use the `openstack` provider, which[1] tries the configdrive before falling back to the metadata service.

[1]: https://github.com/coreos/afterburn/blob/main/src/providers/openstack/mod.rs

Removing the Triaged keyword because:

* the QE automation assessment (flag qe_test_coverage) is missing

The bug still exists. It seems that there is an issue getting the metadata from OpenStack. Failing QA.

Indeed, resorting to config-drive for setting the hostname didn't resolve other dependencies on the Nova metadata. With the hope of an easy fix now lost, we should investigate the lack of connectivity between the bare metal instance and the Nova metadata service. I propose asking for some help from the OpenStack territory.

Agreed. Since connectivity is OK, moving this one to Compute to understand why we are getting a 503 response...

Please let me summarize the issue: there is a config drive provided, and Nova metadata should not normally be in the loop. But for some reason it fell back from the config drive to the metadata API? And the request is to find the cause of the 503 errors from nova-metadata? Please provide sosreports with Nova DEBUG logs for the involved nodes.

The config drive is only for the boot of the instances. There are services inside the OpenShift cluster that still need to use the metadata API. The same is happening for the master nodes (which are virt instances), and they are able to reach both the config drive and the metadata API. The nodes that are failing the metadata API are baremetal instances. The request is to find out why baremetal nodes are getting 503 from the metadata API.

Fix included in https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=2026360

- [OVN] Add baremetal support with Neutron DHCP agent (rhbz#2033953)

...

Quick reminder, in order for it to work we also need:

1) To add the OVNMetadataAgent to the controller templates.
2) To add OVNCMSOptions: "enable-chassis-as-gw" to the controller templates. This is how Neutron/OVN knows where to schedule the external ports for the baremetal nodes. (See the sketch below.)

*** Bug 1892000 has been marked as a duplicate of this bug. ***
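As a rough illustration of the two controller-template changes mentioned above, a minimal sketch follows. It is not the exact change used in this deployment: the file names are placeholders, and the service and parameter names should be confirmed against the tripleo-heat-templates version in use.

```
# Hedged sketch of the two changes above (file names are placeholders).

# 1) Ensure the Controller role deploys the OVN metadata agent, e.g. by adding
#    the service to the Controller role's ServicesDefault list in a custom
#    roles_data.yaml:
#        - OS::TripleO::Services::OVNMetadataAgent

# 2) Mark the controllers as gateway chassis so Neutron/OVN schedules the
#    external (baremetal) ports on them:
cat > ovn-baremetal-metadata.yaml <<'EOF'
parameter_defaults:
  OVNCMSOptions: "enable-chassis-as-gw"
EOF

# Then include the environment file in the overcloud deployment:
#   openstack overcloud deploy ... -e ovn-baremetal-metadata.yaml
```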