1986392 – Kubelet can't find Node after upgrade to external CCM on AWS/OpenStack

Summary: Kubelet can't find Node after upgrade to external CCM on AWS/OpenStack

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Machine Config Operator
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	4.9.0
Assignee:	MCO Team
QA Contact:	Rio Liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-07-27 12:39 UTC by Matthew Booth
Modified:	2021-11-01 01:35 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-10-29 15:19:39 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator issues 2693	None	open	Kubelet can't find Node after upgrade to external CCM on AWS/OpenStack	2021-07-27 12:39:51 UTC
Github	openshift machine-config-operator pull 2694	None	open	Bug 1986392: Persist kubelet node name for OpenStack nodes	2021-07-27 12:41:14 UTC
Red Hat Product Errata	RHSA-2021:3759	None	None	None	2021-11-01 01:35:27 UTC

Description Matthew Booth 2021-07-27 12:39:51 UTC

A kubelet running with --cloud-provider="openstack" or --cloud-provider="aws" may stop working when upgraded to use an external cloud controller manager (CCM) with --cloud-provider="external".

Kubelet’s notion of its node name changes during the upgrade. The in-tree cloud provider does not use the FQDN hostname as the node name, whereas kubelet defaults to the FQDN hostname once the cloud provider is changed to external. If the two do not match, the upgrade breaks.

For OpenStack, the detail is as follows; the issue is analogous on AWS:

The OpenStack in-tree cloud provider returns the node name as the name reported by Nova metadata: openstack:CurrentNodeName().

When we switch to an external cloud provider, cloud is unset and kubelet defaults to using the hostname: kubelet:getNodeName().

The hostname is set by afterburn to the hostname returned by Nova metadata, which contains a domain suffix if one is defined: afterburn service for openstack.

After the upgrade, kubelet can no longer find its own Node because name != hostname: the hostname contains a domain suffix, whereas the name does not.
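
The mismatch can be illustrated with a small model. This is only a sketch of the behaviour described above, not the kubelet source: the helper kubelet_node_name and the example names are hypothetical.

```python
from typing import Optional


def kubelet_node_name(cloud_provider: str, hostname: str,
                      cloud_node_name: Optional[str] = None) -> str:
    """Simplified, hypothetical model of how kubelet picks its node name."""
    if cloud_provider in ("openstack", "aws"):
        # In-tree provider: the name comes from the cloud (e.g. Nova metadata).
        if cloud_node_name is None:
            raise ValueError("in-tree provider needs a cloud-reported name")
        return cloud_node_name
    # "external" (cloud is unset in kubelet): fall back to the hostname.
    return hostname


# Made-up example values: Nova reports a bare name, the hostname has a suffix.
nova_name = "worker-0"
hostname = "worker-0.example.internal"

before = kubelet_node_name("openstack", hostname, nova_name)
after = kubelet_node_name("external", hostname)

# The Node object was registered under `before`; kubelet now looks for `after`.
print(before, after, before == after)
```

With these values the two names differ, which is the condition under which kubelet can no longer find its Node.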

AWS has worked around this issue by using afterburn to set the hostname to the unqualified hostname rather than the fully-qualified one. However, this change potentially has its own upgrade issues, especially with third-party extensions that also rely on the hostname, e.g. Calico. I believe there is a safer solution that can work for both providers.
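
As a rough sketch of that trade-off: shortening the hostname makes it line up with the existing node name, but anything that recorded the old hostname stops matching. The helper unqualified_hostname and all values are hypothetical, and this is not how afterburn or any particular extension is implemented.

```python
def unqualified_hostname(fqdn: str) -> str:
    """Return the part of a hostname before the first dot."""
    return fqdn.split(".", 1)[0]


# Made-up example values.
fqdn = "node-1.example.internal"
in_tree_node_name = "node-1"

new_hostname = unqualified_hostname(fqdn)

# The kubelet side now lines up with the existing Node object...
kubelet_matches = new_hostname == in_tree_node_name

# ...but a component that stored the old hostname no longer matches.
extension_recorded = fqdn
extension_matches = new_hostname == extension_recorded

print(kubelet_matches, extension_matches)
```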

Steps to reproduce the issue:

1. Install OpenShift on an OpenStack cloud that returns a domain name in hostname. This is the default for non-OSP OpenStack installations. OSP does not set a domain name by default, but can be configured to do so.
2. Apply the ExternalCloudProvider feature gate.

Describe the results you received:
The first node to upgrade will fail. kubelet logs are full of errors about being unable to find nodename. Static pods have not started. Heartbeats are not updated on the Node.

Note that this is somewhat similar to, but distinct from, https://github.com/kubernetes/kubernetes/issues/70897.

Comment 7 errata-xmlrpc 2021-11-01 01:35:26 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
