Bug 1986392 - Kubelet can't find Node after upgrade to external CCM on AWS/OpenStack
Summary: Kubelet can't find Node after upgrade to external CCM on AWS/OpenStack
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.9.0
Assignee: MCO Team
QA Contact: Rio Liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-07-27 12:39 UTC by Matthew Booth
Modified: 2021-11-01 01:35 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-29 15:19:39 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift machine-config-operator issues 2693 | 0 | None | open | Kubelet can't find Node after upgrade to external CCM on AWS/OpenStack | 2021-07-27 12:39:51 UTC
Github | openshift machine-config-operator pull 2694 | 0 | None | open | Bug 1986392: Persist kubelet node name for OpenStack nodes | 2021-07-27 12:41:14 UTC
Red Hat Product Errata | RHSA-2021:3759 | 0 | None | None | None | 2021-11-01 01:35:27 UTC

Description Matthew Booth 2021-07-27 12:39:51 UTC
A kubelet running with --cloud-provider="openstack" or --cloud-provider="aws" may stop working when it is upgraded to use an external cloud controller manager (CCM) with --cloud-provider="external".

Kubelet's notion of its node name changes during the upgrade. The in-tree cloud provider does not use the FQDN hostname as the node name, but kubelet defaults to the FQDN hostname once the cloud provider is changed to external. If the two names do not match, kubelet looks for a Node that does not exist and the upgrade fails.

For OpenStack, the detail is as follows; the issue is analogous on AWS:

The OpenStack in-tree cloud provider uses the name returned by Nova metadata as the node name: openstack:CurrentNodeName().

When we switch to an external cloud provider, cloud is unset and kubelet defaults to using the hostname: kubelet:getNodeName().

The hostname is set by afterburn to the hostname returned by Nova metadata, which includes a domain suffix if one is defined: afterburn service for openstack.

After the upgrade, kubelet can no longer find its own Node, because name != hostname. Hostname contains a domain suffix, whereas name does not.
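
The mismatch can be illustrated with a minimal sketch. This is a simplified model of the two code paths described above, not the kubelet or cloud provider source; the function names and the example values ("worker-0", "worker-0.example.test") are hypothetical.

```python
def in_tree_node_name(metadata_name: str) -> str:
    """Node name with the in-tree provider: the instance name from Nova metadata."""
    return metadata_name


def external_node_name(os_hostname: str) -> str:
    """Node name with --cloud-provider="external": no in-tree cloud is set,
    so the node name falls back to the host's hostname."""
    return os_hostname


def finds_own_node(registered_name: str, name_after_upgrade: str) -> bool:
    """Kubelet can only find its Node if the name it computes after the upgrade
    equals the name the Node was registered under before the upgrade."""
    return registered_name == name_after_upgrade


if __name__ == "__main__":
    # Hypothetical Nova metadata for one instance (assumed values).
    metadata = {"name": "worker-0", "hostname": "worker-0.example.test"}

    before = in_tree_node_name(metadata["name"])
    after = external_node_name(metadata["hostname"])
    print(f"before={before} after={after} match={finds_own_node(before, after)}")
```

When the cloud does not define a domain suffix, both paths yield the same string and the upgrade is unaffected, which is consistent with the problem appearing only on clouds that return a domain name in the hostname.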

AWS has worked around this issue by using afterburn to set the hostname to the unqualified hostname rather than the fully-qualified hostname. However, this change potentially has its own upgrade issues, especially with third-party extensions that also rely on the hostname, e.g. Calico. I believe there is a safer solution that can work for both providers.

Steps to reproduce the issue:

    1. Install OpenShift on an OpenStack cloud that returns a domain name in hostname. This is the default for non-OSP OpenStack installations. OSP does not set a domain name by default, but can be configured to do so.
    2. Apply the ExternalCloudProvider feature gate.

Describe the results you received:
The first node to upgrade fails. The kubelet logs are full of errors about being unable to find nodename. Static pods do not start, and heartbeats on the Node are no longer updated.
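
As a rough pre-upgrade check, the failure condition can be evaluated for every node before applying the feature gate. The sketch below assumes the caller has already collected, by whatever means, a mapping from each registered Node name to the hostname reported on that host; the function name and sample data are hypothetical.

```python
from typing import Dict, List


def nodes_at_risk(hostname_by_node: Dict[str, str]) -> List[str]:
    """Return the Node names whose host reports a different hostname.

    These are the nodes where a kubelet that falls back to the hostname
    would no longer find its existing Node object.
    """
    return sorted(
        node for node, hostname in hostname_by_node.items() if node != hostname
    )


if __name__ == "__main__":
    # Hypothetical inventory: Node name -> hostname on that host.
    inventory = {
        "master-0": "master-0.example.test",
        "worker-0": "worker-0.example.test",
        "worker-1": "worker-1",
    }
    for name in nodes_at_risk(inventory):
        print(f"{name}: hostname differs from Node name")
```

An empty result suggests the hostname fallback would resolve to the existing Node names; any entry indicates a node that would hit the failure described here.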

Note that this is somewhat similar to, but distinct from, https://github.com/kubernetes/kubernetes/issues/70897.

Comment 7 errata-xmlrpc 2021-11-01 01:35:26 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

