Bug 2034883
Summary: MCO does not sync kubeAPIServerServingCAData to controllerconfig if there are NotReady nodes
Product: OpenShift Container Platform
Reporter: Pablo Alonso Rodriguez <palonsor>
Component: Machine Config Operator
Assignee: John Kyros <jkyros>
Machine Config Operator sub component: Machine Config Operator
QA Contact: Rio Liu <rioliu>
Status: CLOSED ERRATA
Docs Contact: Jeana Routh <jrouth>
Severity: high
Priority: high
CC: dornelas, jiewu, jkyros, mkrejci, openshift-bugs-escalate, palonsor, skumari, soh, sregidor, srengan, waljaber, wking, yaoli
Version: 4.7
Target Milestone: ---
Target Release: 4.12.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Previously, the Machine Config Operator (MCO) `ControllerConfig` resource, which contains important certificates, was only synced if the Operator's daemon sync succeeded. By design, unready nodes during a daemon sync prevent that daemon sync from succeeding, so unready nodes were indirectly preventing the `ControllerConfig` resource, and therefore those certificates, from syncing. This resulted in eventual cluster degradation when there were unready nodes due to inability to rotate the certificates contained in the `ControllerConfig` resource. With this release, the sync of the `ControllerConfig` resource is no longer dependent on the daemon sync succeeding, so the `ControllerConfig` resource now continues to sync if the daemon sync fails. This means that unready nodes no longer prevent the `ControllerConfig` resource from syncing, so certificates continue to be updated even when there are unready nodes.
(link:https://bugzilla.redhat.com/show_bug.cgi?id=2034883[*2034883*])
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-01-17 19:46:49 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Description
Pablo Alonso Rodriguez
2021-12-22 11:33:14 UTC
Just fixing or deleting the NotReady nodes allows the machine-config-operator reconciliation to return to normal, so the new CA is then properly deployed. However, we cannot have the MCO stop the world and cause cluster-wide impact just because some specific nodes are NotReady and experiencing problems.

Marking as not a blocker: this is undesirable behavior, but it does not appear to be a regression; most, if not all, of the relevant code has been this way for years.

Verified using IPI on AWS, version:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.12.0-0.nightly-2022-10-28-001703 True False 15m Cluster version is 4.12.0-0.nightly-2022-10-28-001703
1. Shut down a worker node to make it NotReady (one way to do this is sketched after the node listing below)
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-50-178.us-east-2.compute.internal NotReady worker 22m v1.25.2+4bd0702
ip-10-0-58-207.us-east-2.compute.internal Ready control-plane,master 32m v1.25.2+4bd0702
ip-10-0-64-37.us-east-2.compute.internal Ready worker 21m v1.25.2+4bd0702
ip-10-0-67-209.us-east-2.compute.internal Ready control-plane,master 32m v1.25.2+4bd0702
ip-10-0-69-97.us-east-2.compute.internal Ready worker 27m v1.25.2+4bd0702
ip-10-0-78-41.us-east-2.compute.internal Ready control-plane,master 32m v1.25.2+4bd0702
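One hedged way to make the worker NotReady (the instance ID below is a placeholder, not taken from this run) is to stop the backing EC2 instance, or to stop the kubelet from a debug shell:

# Option A: stop the EC2 instance that backs the node (placeholder instance ID)
aws ec2 stop-instances --region us-east-2 --instance-ids i-0123456789abcdef0

# Option B: stop the kubelet on the node (the debug pod itself dies shortly after)
oc debug node/ip-10-0-50-178.us-east-2.compute.internal -- chroot /host systemctl stop kubelet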
2. Capture the current kubeAPIServerServingCAData value in controllerconfig
$ oc get controllerconfig machine-config-controller -ojsonpath='{.spec.kubeAPIServerServingCAData}' > /tmp/kubeAPIServerServingCAData.orig
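If useful, the captured bundle can be decoded to note the current signer's expiry (openssl x509 only reads the first certificate in the bundle):

base64 -d < /tmp/kubeAPIServerServingCAData.orig | openssl x509 -noout -subject -enddate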
3. Create a MachineConfig that causes the operator's daemon sync to fail (commands to apply it are sketched after the YAML)
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-file
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
        - contents:
            source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
          filesystem: root
          mode: 0644
          path: /etc/test
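To apply the MachineConfig (the file name here is just an example) and watch the worker pool pick up a new rendered config:

oc create -f test-file.yaml
oc get mcp worker -w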
4. Wait about 10 minutes for the operator's daemon sync to fail:
I1028 10:57:44.627844 1 event.go:285] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"aaec3cce-9acf-44d0-9644-794ed156bd41", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'OperatorDegraded: MachineConfigDaemonFailed' Failed to resync 4.12.0-0.nightly-2022-10-28-001703 because: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]
I1028 10:57:44.641080 1 event.go:285] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"aaec3cce-9acf-44d0-9644-794ed156bd41", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MachineConfigDaemonFailed' Cluster not available for [{operator 4.12.0-0.nightly-2022-10-28-001703}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]
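The same failure is also visible on the clusteroperator object, for example:

oc get clusteroperator machine-config
oc get clusteroperator machine-config -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}'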
5. Perform a cert rotation:
oc patch secret -p='{"metadata": {"annotations": {"auth.openshift.io/certificate-not-after": null}}}' kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator
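One way to confirm the signer was regenerated is to check its new not-after annotation:

oc get secret kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator -o jsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-not-after}'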
6. Wait about 20 minutes while the operator's daemon sync continues to fail
7. Observe that the controllerconfig still gets updated and a new MachineConfig is still generated
A new MC was created for the worker and master pools:
$ oc get mc
NAME GENERATEDBYCONTROLLER IGNITIONVERSION AGE
00-master 3bbdfc851bf97737ac251d25987d6d816b64ace9 3.2.0 85m
00-worker 3bbdfc851bf97737ac251d25987d6d816b64ace9 3.2.0 85m
01-master-container-runtime 3bbdfc851bf97737ac251d25987d6d816b64ace9 3.2.0 85m
01-master-kubelet 3bbdfc851bf97737ac251d25987d6d816b64ace9 3.2.0 85m
01-worker-container-runtime 3bbdfc851bf97737ac251d25987d6d816b64ace9 3.2.0 85m
01-worker-kubelet 3bbdfc851bf97737ac251d25987d6d816b64ace9 3.2.0 85m
99-master-generated-registries 3bbdfc851bf97737ac251d25987d6d816b64ace9 3.2.0 85m
99-master-ssh 3.2.0 87m
99-worker-generated-registries 3bbdfc851bf97737ac251d25987d6d816b64ace9 3.2.0 85m
99-worker-ssh 3.2.0 87m
rendered-master-0b9ea8216169f4164eeb3a5c613a423c 3bbdfc851bf97737ac251d25987d6d816b64ace9 3.2.0 6m19s <------ NEW CONFIG
rendered-master-48f904388022db45d30e20f75c8b2809 3bbdfc851bf97737ac251d25987d6d816b64ace9 3.2.0 85m
rendered-worker-9415217d244c0e0bca08566ca1099525 3bbdfc851bf97737ac251d25987d6d816b64ace9 3.2.0 6m19s <------ NEW CONFIG
rendered-worker-d3a7ddebead6e47155549f9f74836aaa 3bbdfc851bf97737ac251d25987d6d816b64ace9 3.2.0 85m
rendered-worker-eb2f768f26133ef8a1257a41b2609839 3bbdfc851bf97737ac251d25987d6d816b64ace9 3.2.0 47m
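The pools can also be checked to see which rendered config they are targeting (the worker pool remains not fully updated here because of the NotReady node):

oc get mcp
oc get mcp worker -o jsonpath='{.spec.configuration.name}'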
The kubeAPIServerServingCAData value in controllerconfig was updated.
$ oc get controllerconfig machine-config-controller -ojsonpath='{.spec.kubeAPIServerServingCAData}' > /tmp/kubeAPIServerServingCAData.new
$ diff /tmp/kubeAPIServerServingCAData.orig /tmp/kubeAPIServerServingCAData.new
1c1
< LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURNRE...............
\ No newline at end of file
---
> LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURNREN...............
\ No newline at end of file
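For completeness, a quick sanity check on the decoded bundles (for example, counting the certificates in each) confirms that each value decodes to a valid PEM bundle:

for f in /tmp/kubeAPIServerServingCAData.orig /tmp/kubeAPIServerServingCAData.new; do
  echo "$f: $(base64 -d "$f" | grep -c 'BEGIN CERTIFICATE') certificate(s)"
done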
I would like to stress that it took 20 minutes for the new MC to be created and the kubeAPIServerServingCAData to be updated.
We move the status to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399