During startup a kubelet...
1. tries to use its current client-cert. If it is missing or invalid, it...
2. uses a bootstrap credential to make a CSR for a new client-cert
This flow happens on initial startup and when clusters are restarted after being off for an extended period. On the masters, step 2 fails after one day.
Master kubelets do not use the same bootstrap credentials as the rest of the cluster. On the masters the bootstrap credential is initially a client-cert, and we cannot extend its lifetime indefinitely because client-certs are not individually revocable.
The master kubelets should be updated sometime after the initial boot to use the same serviceaccount token that the rest of the nodes use. Now that the MCO doesn't reboot a machine for every update, this should work.
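To illustrate the startup flow and the failure mode described above, here is a minimal sketch. It is a model for discussion, not kubelet code: `Credential`, `choose_startup_action`, and the 30-day client-cert lifetime are hypothetical, and treating the serviceaccount token as having no fixed expiry is an assumption. The one-day lifetime of the master bootstrap client-cert is the only value taken from this report.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Credential:
    """Hypothetical stand-in for a kubelet credential (not a real kubelet type)."""
    name: str
    expires_at: Optional[datetime]  # None models "no fixed expiry" (assumption)

    def is_valid(self, now: datetime) -> bool:
        return self.expires_at is None or now < self.expires_at


def choose_startup_action(client_cert: Optional[Credential],
                          bootstrap: Credential,
                          now: datetime) -> str:
    # Step 1: use the current client-cert if it exists and is still valid.
    if client_cert is not None and client_cert.is_valid(now):
        return "use-client-cert"
    # Step 2: fall back to the bootstrap credential to create a CSR.
    if bootstrap.is_valid(now):
        return "create-csr-with-bootstrap-credential"
    # Neither credential works: the kubelet cannot rejoin without manual recovery.
    return "stuck"


install = datetime(2019, 1, 1)
# From this report: the master bootstrap credential stops working after one day.
master_bootstrap = Credential("master-bootstrap-client-cert", install + timedelta(days=1))
# Assumption: the token used by the other nodes has no fixed expiry.
node_bootstrap = Credential("serviceaccount-token", None)
# Hypothetical lifetime, chosen only so the cert expires while the cluster is off.
client_cert = Credential("kubelet-client-cert", install + timedelta(days=30))

restart = install + timedelta(days=45)  # cluster was off for an extended period
print(choose_startup_action(client_cert, master_bootstrap, restart))
print(choose_startup_action(client_cert, node_bootstrap, restart))
```

Under these assumptions, a master restarted after its client-cert has expired ends up with no usable credential, while a node holding the serviceaccount token can still request a new client-cert.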
To my knowledge, this is the only reason that clusters cannot be shut down shortly after installation. It's also the only reason that step 9 exists here: https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-3-expired-certs.html
> Now that the MCO doesn't reboot a machine for every update, this should work.
It would be news to me if the MCD does this now.
If it does not, then we are looking at two reboots per node during install: one for the original pivot and one to apply the changed MC that includes the new bootstrap credentials.
I misunderstood how the kubelet CA updates were being handled. If they are rebooting all the machines, I guess you face a similar choice here.
Regardless, this is the only thing I'm aware of that prevents an immediate shutdown of a cluster after installation.
We really need the MCD to be more feature-rich to make this work. In particular, we need to be able to reproject files changed in the MC without a reboot. Rebooting the nodes twice during install is a disruptive change.
For this reason, I'm deferring to 4.3. I've talked to Antonio and this functionality is a priority for MCO in 4.3. I'll reference a Jira story tracking the progress when one exists.
Is this the root cause of: https://bugzilla.redhat.com/show_bug.cgi?id=1693951
I noticed that the Target Release was recently changed from 4.5.0 to 4.6.0.
I believe this BZ is related to RFE-297/MSTR-931. If my understanding is correct, is there a possibility that a fully guaranteed/tested shutdown procedure will not land in v4.5 (or 4.5.z)?
(MSTR-931 still seems to point to 4.5, so please correct me if I am mistaken.)
I’d like to get a better understanding of the current situation. Would you please clarify it so that we can set appropriate expectations with customers?
I am grateful for your help and clarification.
This doesn't seem to be related to Node; requesting the MCO team to look into it.
Adding UpcomingSprint as this won't make the current sprint. We'll try to work on this bug in the next sprint.
I'm not sure about the status of this. There was some work last year from David (https://github.com/openshift/machine-config-operator/pull/1027), but we kind of lost track. Is there still something needed from the MCO?
Adding UpcomingSprint as I won't be able to finish this by the current sprint. I'll revisit this next week.
Adding UpcomingSprint as I've been busy with 4.6 feature delivery. We'll attempt this next sprint.