Description of problem:

Situation (starting point):
- There is an ongoing change to the machine-config-daemon daemonset being applied by the machine-config-operator pod. It is waiting for the daemonset to roll out.
- Some nodes are not ready, so the daemonset rollout never finishes and the wait ends in a timeout error.

Problem:
- The machine-config-operator pod stops trying to reconcile anything once the wait for the machine-config-daemon rollout times out.
- This means the `spec.kubeAPIServerServingCAData` field of the controllerconfig/machine-config-controller object is not updated when the kube-apiserver-operator updates the kube-apiserver-to-kubelet-client-ca configmap.
- Without that field updated, a kube-apiserver-to-kubelet-client-ca change is never rolled out to the nodes.
- That ultimately leads to cluster-wide unavailability of "oc logs", "oc rsh", etc. once the kube-apiserver starts using a client cert signed by the new kube-apiserver-to-kubelet-client-ca to access kubelet ports.

Version-Release number of MCO (Machine Config Operator) (if applicable): 4.7.21

Platform (AWS, VSphere, Metal, etc.): (not relevant)

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)? (Y/N/Not sure): Y

How reproducible: Always, if the above conditions are met.

Steps to Reproduce:
1. Have some nodes not ready.
2. Force a change that requires a machine-config-daemon daemonset rollout (I think that changing proxy settings would work for this; see the sketch below).
3. Wait until a new kube-apiserver-to-kubelet-client-ca is rolled out by the kube-apiserver-operator.

Actual results: New kube-apiserver-to-kubelet-client-ca not forwarded to controllerconfig; kube-apiserver-to-kubelet-client-ca not deployed on nodes.

Expected results: kube-apiserver-to-kubelet-client-ca forwarded to controllerconfig; kube-apiserver-to-kubelet-client-ca deployed to nodes.

Additional info: In comments.
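For step 2, a minimal sketch of one way to trigger a machine-config-daemon daemonset rollout via a proxy change. The noProxy value here is a made-up placeholder; this assumes (as suggested above, but not confirmed) that a cluster proxy change forces the operator to re-apply its daemon manifests:

$ oc patch proxy/cluster --type=merge \
    -p '{"spec":{"noProxy":"placeholder.example.com"}}'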
Just fixing or deleting the NotReady nodes allows machine-config-operator reconciliation to return to normal, so the new CA is then properly deployed (see the sketch below). However, we cannot have the MCO stop the world and cause cluster-wide impact just because a few specific nodes are NotReady and experiencing problems.
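For reference, a minimal sketch of that workaround; the node name is a placeholder, and repairing the node so kubelet reports Ready again works equally well:

$ oc delete node <notready-node>

Once no node is NotReady, the machine-config-daemon rollout can complete and the operator resumes syncing the controllerconfig.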
Marking not a blocker: undesirable behavior, but does not appear to be a regression. Most if not all of that code has been that way for years.
Verified using IPI on AWS.

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-10-28-001703   True        False         15m     Cluster version is 4.12.0-0.nightly-2022-10-28-001703

1. Shut down a worker node to make it NotReady:

$ oc get nodes
NAME                                        STATUS     ROLES                  AGE   VERSION
ip-10-0-50-178.us-east-2.compute.internal   NotReady   worker                 22m   v1.25.2+4bd0702
ip-10-0-58-207.us-east-2.compute.internal   Ready      control-plane,master   32m   v1.25.2+4bd0702
ip-10-0-64-37.us-east-2.compute.internal    Ready      worker                 21m   v1.25.2+4bd0702
ip-10-0-67-209.us-east-2.compute.internal   Ready      control-plane,master   32m   v1.25.2+4bd0702
ip-10-0-69-97.us-east-2.compute.internal    Ready      worker                 27m   v1.25.2+4bd0702
ip-10-0-78-41.us-east-2.compute.internal    Ready      control-plane,master   32m   v1.25.2+4bd0702

2. Capture the current kubeAPIServerServingCAData value in controllerconfig:

$ oc get controllerconfig machine-config-controller -ojsonpath='{.spec.kubeAPIServerServingCAData}' > /tmp/kubeAPIServerServingCAData.orig

3. Create a machineconfig to make the operator's daemon sync fail:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-file
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
        filesystem: root
        mode: 0644
        path: /etc/test

4. Wait 10 minutes for the operator's daemon sync to fail:

I1028 10:57:44.627844       1 event.go:285] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"aaec3cce-9acf-44d0-9644-794ed156bd41", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'OperatorDegraded: MachineConfigDaemonFailed' Failed to resync 4.12.0-0.nightly-2022-10-28-001703 because: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]
I1028 10:57:44.641080       1 event.go:285] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"aaec3cce-9acf-44d0-9644-794ed156bd41", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MachineConfigDaemonFailed' Cluster not available for [{operator 4.12.0-0.nightly-2022-10-28-001703}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]

5. Perform a cert rotation:

$ oc patch secret -p='{"metadata": {"annotations": {"auth.openshift.io/certificate-not-after": null}}}' kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator

6. Wait 20 minutes for the operator's daemon sync to fail again.
7. Observe that the controllerconfig still gets updated and a new machineconfig still gets generated.

A new MC was created for the worker and master pools:

$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          3bbdfc851bf97737ac251d25987d6d816b64ace9   3.2.0             85m
00-worker                                          3bbdfc851bf97737ac251d25987d6d816b64ace9   3.2.0             85m
01-master-container-runtime                        3bbdfc851bf97737ac251d25987d6d816b64ace9   3.2.0             85m
01-master-kubelet                                  3bbdfc851bf97737ac251d25987d6d816b64ace9   3.2.0             85m
01-worker-container-runtime                        3bbdfc851bf97737ac251d25987d6d816b64ace9   3.2.0             85m
01-worker-kubelet                                  3bbdfc851bf97737ac251d25987d6d816b64ace9   3.2.0             85m
99-master-generated-registries                     3bbdfc851bf97737ac251d25987d6d816b64ace9   3.2.0             85m
99-master-ssh                                                                                 3.2.0             87m
99-worker-generated-registries                     3bbdfc851bf97737ac251d25987d6d816b64ace9   3.2.0             85m
99-worker-ssh                                                                                 3.2.0             87m
rendered-master-0b9ea8216169f4164eeb3a5c613a423c   3bbdfc851bf97737ac251d25987d6d816b64ace9   3.2.0             6m19s   <------ NEW CONFIG
rendered-master-48f904388022db45d30e20f75c8b2809   3bbdfc851bf97737ac251d25987d6d816b64ace9   3.2.0             85m
rendered-worker-9415217d244c0e0bca08566ca1099525   3bbdfc851bf97737ac251d25987d6d816b64ace9   3.2.0             6m19s   <------ NEW CONFIG
rendered-worker-d3a7ddebead6e47155549f9f74836aaa   3bbdfc851bf97737ac251d25987d6d816b64ace9   3.2.0             85m
rendered-worker-eb2f768f26133ef8a1257a41b2609839   3bbdfc851bf97737ac251d25987d6d816b64ace9   3.2.0             47m

The kubeAPIServerServingCAData value in controllerconfig was updated:

$ oc get controllerconfig machine-config-controller -ojsonpath='{.spec.kubeAPIServerServingCAData}' > /tmp/kubeAPIServerServingCAData.new
$ diff /tmp/kubeAPIServerServingCAData.orig /tmp/kubeAPIServerServingCAData.new
1c1
< LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURNRE...............
\ No newline at end of file
---
> LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURNREN...............
\ No newline at end of file

I would like to stress that it took 20 minutes for the new MC to be created and the kubeAPIServerServingCAData to be updated. We move the status to VERIFIED.
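As an extra check, a minimal sketch for confirming the new CA actually lands on a node; the node name is a placeholder, and this assumes the MCO writes kubeAPIServerServingCAData to /etc/kubernetes/kubelet-ca.crt on the host:

$ oc debug node/<node-name> -- chroot /host cat /etc/kubernetes/kubelet-ca.crt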
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399