Bug 1986392

Summary:	Kubelet can't find Node after upgrade to external CCM on AWS/OpenStack
Product:	OpenShift Container Platform	Reporter:	Matthew Booth <mbooth>
Component:	Machine Config Operator	Assignee:	MCO Team <team-mco>
Machine Config Operator sub component:	Machine Config Operator	QA Contact:	Rio Liu <rioliu>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	urgent
Priority:	unspecified	CC:	aos-bugs, mkrejci, vlaad
Version:	4.9
Target Milestone:	---
Target Release:	4.9.0
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2021-10-29 15:19:39 UTC	Type:	Bug

Description Matthew Booth 2021-07-27 12:39:51 UTC

A kubelet running with --cloud-provider="openstack" or --cloud-provider="aws" may stop working when upgraded to use CCM with --cloud-provider="external".

Kubelet's notion of the node name changes during the upgrade. The in-tree cloud provider does not use the FQDN hostname as the node name, but kubelet defaults to using the FQDN hostname as the node name once the cloud provider is changed to external. If the two do not match, we have an upgrade problem.

For OpenStack, the detail is as follows; the issue is analogous on AWS:

The OpenStack in-tree cloud provider returns, as the node name, the name returned by Nova metadata: openstack:CurrentNodeName().

When we switch to an external cloud provider, cloud is unset and kubelet defaults to using hostname: kubelet:getNodeName().

Hostname is set by afterburn to the hostname returned by Nova metadata, which contains a domain suffix if one is defined: afterburn service for openstack.

After the upgrade, kubelet can no longer find its own Node, because name != hostname. Hostname contains a domain suffix, whereas name does not.
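The mismatch described above can be illustrated with a minimal sketch. This is a simplified model, not kubelet code: the function names and the sample values ("worker-0", "worker-0.novalocal") are hypothetical and chosen only for illustration.

```python
def in_tree_node_name(nova_name: str) -> str:
    """Model of the in-tree provider: node name is the Nova metadata name."""
    return nova_name


def external_node_name(hostname: str) -> str:
    """Model of the external case: node name defaults to the hostname,
    which may carry a domain suffix."""
    return hostname


def kubelet_finds_node(registered_node_name: str, hostname: str) -> bool:
    """True if the name kubelet looks up after the switch matches the
    Node that was registered before the switch."""
    return registered_node_name == external_node_name(hostname)


if __name__ == "__main__":
    # Hypothetical sample values, assumed for illustration only.
    registered = in_tree_node_name("worker-0")
    print(kubelet_finds_node(registered, "worker-0.novalocal"))
    print(kubelet_finds_node(registered, "worker-0"))
```

In this model the lookup only succeeds when the hostname carries no domain suffix, which matches the observation that deployments without a configured domain name are unaffected.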

AWS has worked round this issue by using afterburn to set hostname to the unqualified hostname rather than the fully-qualified hostname. However, this change potentially has its own upgrade issues, especially when using third-party extensions which also rely on hostname, e.g. Calico. I believe there is a safer solution that can work for both providers.
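As a rough sketch of why changing the hostname itself can affect other components, consider a hypothetical extension that stored per-node state keyed by the hostname it saw before the change. The names `unqualified` and `state_for`, and the sample hostname, are assumptions for illustration and do not describe how afterburn or any particular extension is implemented.

```python
from typing import Dict, Optional


def unqualified(hostname: str) -> str:
    """Return the part of the hostname before the first dot."""
    return hostname.split(".", 1)[0]


def state_for(recorded: Dict[str, str], hostname: str) -> Optional[str]:
    """Look up state that a hypothetical extension keyed by hostname."""
    return recorded.get(hostname)


if __name__ == "__main__":
    # Hypothetical state recorded while the host still used its FQDN.
    old_hostname = "ip-10-0-1-23.ec2.internal"
    recorded = {old_hostname: "per-node state"}

    # After the workaround the host reports only the unqualified name.
    new_hostname = unqualified(old_hostname)

    print(state_for(recorded, old_hostname))
    print(state_for(recorded, new_hostname))
```

The second lookup misses because the key changed along with the hostname, which is the kind of upgrade issue the workaround could introduce for components that rely on hostname.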

Steps to reproduce the issue:

1. Install OpenShift on an OpenStack cloud that returns a domain name in hostname. This is the default for non-OSP OpenStack installations. OSP does not set a domain name by default, but can be configured to do so.
2. Apply the ExternalCloudProvider feature gate.

Describe the results you received:
The first node to upgrade will fail. kubelet logs are full of errors about being unable to find nodename. Static pods have not started. Heartbeats are not updated on the Node.

Note that this is somewhat similar to, but distinct from, https://github.com/kubernetes/kubernetes/issues/70897.

Comment 7 errata-xmlrpc 2021-11-01 01:35:26 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759