1735180 – master bootstrap credentials are not managed

Keywords:
Status:	CLOSED DEFERRED
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Machine Config Operator
Sub Component:
Version:	4.2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Antonio Murdaca
QA Contact:	Michael Nguyen
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1693951

Reported:	2019-07-31 18:56 UTC by David Eads
Modified:	2023-12-15 16:39 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-09-01 08:29:51 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	4271712	0	None	None	None	2019-09-09 20:41:10 UTC

Description David Eads 2019-07-31 18:56:33 UTC

During startup a kubelet...

 1. tries to use the current client-cert. If it is missing or invalid, it...
 2. uses a bootstrap credential to make a CSR for a new client-cert.

This flow happens on initial startup and when clusters are restarted after being off for an extended period.  On the masters, step 2 fails after one day.
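
As an illustrative sketch only (not the kubelet's actual code), the two-step flow above can be modelled as follows. The names `Credential` and `choose_startup_path` are hypothetical, as is the 30-day client-cert lifetime in the usage note; only the one-day window for the master bootstrap credential comes from this description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Credential:
    """A credential with an expiry time (hypothetical model, not a kubelet type)."""
    name: str
    not_after: datetime

    def is_valid(self, now: datetime) -> bool:
        # Valid strictly before its expiry time.
        return now < self.not_after


def choose_startup_path(
    client_cert: Optional[Credential],
    bootstrap: Credential,
    now: datetime,
) -> str:
    """Return which branch of the startup flow is taken at time `now`."""
    # Step 1: use the current client-cert if it exists and is still valid.
    if client_cert is not None and client_cert.is_valid(now):
        return "use-client-cert"
    # Step 2: otherwise use the bootstrap credential to make a CSR.
    if bootstrap.is_valid(now):
        return "request-new-client-cert"
    # Neither credential is usable: the node cannot rejoin on its own.
    return "stuck"


if __name__ == "__main__":
    installed = datetime(2019, 7, 31, tzinfo=timezone.utc)
    # Assumed lifetimes: 30 days is made up; 1 day is from the description.
    client_cert = Credential("client-cert", installed + timedelta(days=30))
    master_bootstrap = Credential("master-bootstrap", installed + timedelta(days=1))
    for days in (10, 45):
        now = installed + timedelta(days=days)
        print(days, choose_startup_path(client_cert, master_bootstrap, now))
```

Under these assumed lifetimes, a master restarted after its client-cert has expired reaches step 2 with a bootstrap credential that expired long ago, which is the failure described here.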

Master kubelets do not use the same bootstrap credentials as the rest of the cluster.  Because the master bootstrap credential is initially a client-cert, we cannot extend its lifetime indefinitely: client-certs are not individually revocable.

The master kubelets should be updated sometime after the initial boot to use the same serviceaccount token that the rest of the nodes use.  Now that the MCO doesn't reboot a machine for every update, this should work.

To my knowledge, this is the only reason that clusters cannot be shut down shortly after installation.  It's also the only reason that step 9 exists here: https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-3-expired-certs.html

Comment 1 Seth Jennings 2019-07-31 19:42:21 UTC

> Now that the MCO doesn't reboot a machine for every update, this should work.

This would be news to me if the MCD does this now.

If it does not, then we are looking at two reboots per node during install: one for the original pivot and one to apply the changed MC that includes the new bootstrap credentials.

Comment 2 David Eads 2019-07-31 19:53:42 UTC

I misunderstood how the kubelet CA updates were being handled.  If they are rebooting all the machines, I guess you face a similar choice here.

Regardless, this is the only thing I'm aware of that prevents an immediate shutdown of a cluster after installation.

Comment 3 Seth Jennings 2019-08-05 17:09:46 UTC

We really need the MCD to be more feature-rich to make this work.  In particular, we need to be able to reproject files changed in the MC without a reboot.  Rebooting the nodes twice during install is a disruptive change.

For this reason, I'm deferring to 4.3.  I've talked to Antonio and this functionality is a priority for MCO in 4.3.  I'll reference a Jira story tracking the progress when one exists.

Comment 4 Seth Jennings 2019-08-05 17:18:42 UTC

https://jira.coreos.com/browse/PROD-1025

Comment 5 Eric Rich 2019-09-09 20:44:59 UTC

Is this the root cause of bug 1693951? https://bugzilla.redhat.com/show_bug.cgi?id=1693951

Comment 6 Masaki Furuta ( RH ) 2020-05-12 05:21:44 UTC

Hello,

I see that the Target Release was recently changed from 4.5.0 to 4.6.0.
I believe this BZ is related to RFE-297/MSTR-931. If my understanding is correct, is there a possibility that the fully guaranteed/tested shutdown procedure will not land in v4.5 (or 4.5.z)?
(MSTR-931 still seems to point to 4.5, so please correct me if I am mistaken.)

I'd like to get a better understanding of the current situation. Would you please clarify it so that we can set appropriate expectations with customers?

I am grateful for your help and clarification.

Thank you,

BR,
Masaki

Comment 7 Harshal Patil 2020-05-19 06:02:28 UTC

This doesn't seem to be related to node, requesting MCO team to look into it.

Comment 8 Antonio Murdaca 2020-06-16 12:58:00 UTC

Adding UpcomingSprint as this won't make the current sprint. We'll try to work on this bug in the next sprint.

Comment 9 Antonio Murdaca 2020-07-09 12:28:31 UTC

I'm not sure about the status of this. David did some work on it last year (https://github.com/openshift/machine-config-operator/pull/1027), but we lost track of it. Is there still something needed from the MCO?

Adding UpcomingSprint

Comment 10 Antonio Murdaca 2020-07-10 09:40:40 UTC

Adding UpcomingSprint as I won't be able to finish this in the current sprint. I'll revisit it next week.

Comment 11 Antonio Murdaca 2020-08-02 20:29:59 UTC

Adding UpcomingSprint as I've been busy with 4.6 features delivery. We'll attempt this next sprint.

Comment 13 Red Hat Bugzilla 2023-09-14 05:36:34 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
