Description of problem:
Master node stuck in SchedulingDisabled state after upgrade from 4.10.24 -> 4.11.0-rc.4. Error during syncRequiredMachineConfigPools: controller version mismatch.

Version-Release number of MCO (Machine Config Operator) (if applicable): 4.10.24

Platform (AWS, VSphere, Metal, etc.): AWS

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)? Not sure

How reproducible: So far I have results for AWS only.

Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/loaded-upgrade/579/

Steps to Reproduce:
1. IPI - OVN - AWS installation: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/123249/
2. Scale up to 50 worker nodes: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/cluster-workers-scaling/1125/
3. Run cluster-density test: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/kube-burner/890/
4. Upgrade to 4.11.0-rc.4: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/upgrade/262/

Actual results:
Master node in SchedulingDisabled state.

From oc get co machine-config -o yaml, status conditions:

  conditions:
  - lastTransitionTime: "2022-07-22T09:20:06Z"
    message: Working towards 4.11.0-rc.4
    status: "True"
    type: Progressing
  - lastTransitionTime: "2022-07-22T09:25:56Z"
    message: One or more machine config pools are updating, please see `oc get mcp` for further details
    reason: PoolUpdating
    status: "False"
    type: Upgradeable
  - lastTransitionTime: "2022-07-22T10:28:13Z"
    message: 'Unable to apply 4.11.0-rc.4: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, pool master has not progressed to latest configuration: controller version mismatch for rendered-master-1bcb4b6fd0168e4119f0ee9d9b039f6c expected e722bb791349c5f74548633d5c3407c414ece9bf has be815b38adf693db9f6def6747fe5542bdd67f36: 3 (ready 2) out of 3 nodes are updating to latest configuration rendered-master-3160a1e1a282d0664ac7c4a6dcff6730, retrying]'
    reason: RequiredPoolsFailed
    status: "True"
    type: Degraded

$ oc get mcp master -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: "2022-07-22T07:47:34Z"
  generation: 3
  labels:
    machineconfiguration.openshift.io/mco-built-in: ""
    operator.machineconfiguration.openshift.io/required-for-upgrade: ""
    pools.operator.machineconfiguration.openshift.io/master: ""
  name: master
  resourceVersion: "248554"
  uid: cceafe0e-b059-4e2e-b9f6-a9379f357d5f
spec:
  configuration:
    name: rendered-master-3160a1e1a282d0664ac7c4a6dcff6730
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-master
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-master-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-master-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-generated-crio-seccomp-use-default
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-ssh
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: master
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/master: ""
  paused: false
status:
  conditions:
  - lastTransitionTime: "2022-07-22T07:48:27Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2022-07-22T07:48:36Z"
    message: ""
    reason: ""
    status: "False"
    type: NodeDegraded
  - lastTransitionTime: "2022-07-22T07:48:36Z"
    message: ""
    reason: ""
    status: "False"
    type: Degraded
  - lastTransitionTime: "2022-07-22T09:25:55Z"
    message: ""
    reason: ""
    status: "False"
    type: Updated
  - lastTransitionTime: "2022-07-22T09:25:55Z"
    message: All nodes are updating to rendered-master-3160a1e1a282d0664ac7c4a6dcff6730
    reason: ""
    status: "True"
    type: Updating
  configuration:
    name: rendered-master-1bcb4b6fd0168e4119f0ee9d9b039f6c
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-master
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-master-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-master-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-generated-crio-seccomp-use-default
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-ssh
  degradedMachineCount: 0
  machineCount: 3
  observedGeneration: 3
  readyMachineCount: 2
  unavailableMachineCount: 1
  updatedMachineCount: 3
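For context on that Degraded message: the "controller version mismatch ... expected ... has ..." text suggests the operator compares a generated-by-controller-version annotation on the rendered MachineConfig against the running controller's version hash, and reports the pool as not at the latest configuration while they disagree. A minimal sketch of that kind of check is below; the annotation key is, to my knowledge, the one the MCO stamps on rendered configs, but the function itself is illustrative and not the operator's code.

package main

import "fmt"

// Annotation the MCO places on rendered MachineConfigs to record which controller built them.
const generatedByControllerVersionKey = "machineconfiguration.openshift.io/generated-by-controller-version"

// renderedConfigUpToDate reports whether a rendered config (represented here only by its
// annotations) was produced by the currently running controller version.
func renderedConfigUpToDate(annotations map[string]string, controllerVersion string) error {
	got, ok := annotations[generatedByControllerVersionKey]
	if !ok {
		return fmt.Errorf("rendered config has no generated-by-controller-version annotation")
	}
	if got != controllerVersion {
		// Mirrors the "expected <hash> has <hash>" wording in the Degraded condition above.
		return fmt.Errorf("controller version mismatch: expected %s has %s", controllerVersion, got)
	}
	return nil
}

func main() {
	// Hashes taken verbatim from the Degraded message above, purely for illustration.
	anns := map[string]string{generatedByControllerVersionKey: "be815b38adf693db9f6def6747fe5542bdd67f36"}
	if err := renderedConfigUpToDate(anns, "e722bb791349c5f74548633d5c3407c414ece9bf"); err != nil {
		fmt.Println(err)
	}
}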
Simon, is the "cluster-density" test some type of "torture test"? :)

There is something weird going on in this cluster in terms of apiserver connectivity/apiserver processing; the MCO is having trouble getting timely/reliable responses. That said, thank you for bringing this to our attention. It does look like the MCO let a node "sneak through" without making sure it was uncordoned, and I think there is a small MCO bug here in the drain controller where we check that the uncordon actually worked -- we should probably be checking the updated node object, not the original one:

https://github.com/openshift/machine-config-operator/blob/8a8c20eef73817fc400198d16726cde104d5ae95/pkg/controller/drain/drain_controller.go#L415

Had we checked the updated object and noticed it was still in the wrong state, I think we would have waited/retried instead of moving on.

Supporting details for my suspicions from the must-gather:

It looks like the cluster is wedged because one of the masters is "SchedulingDisabled" even though the MCO thinks it's done/uncordoned.

ip-10-0-129-11.us-east-2.compute.internal    Ready                      master   3h39m   v1.24.0+9546431
ip-10-0-208-204.us-east-2.compute.internal   Ready,SchedulingDisabled   master   3h39m   v1.24.0+9546431   <--
ip-10-0-161-124.us-east-2.compute.internal   Ready                      master   3h39m   v1.24.0+9546431

From the machine-config-controller log in the must-gather: it's trying to finish and uncordon master ip-10-0-208-204, but it looks like the uncordon somehow got reversed?

2022-07-22T10:02:04.409756543Z I0722 10:02:04.409721 1 drain_controller.go:147] node ip-10-0-208-204.us-east-2.compute.internal: initiating uncordon (currently schedulable: true)
2022-07-22T10:02:04.416449688Z I0722 10:02:04.416425 1 drain_controller.go:147] node ip-10-0-208-204.us-east-2.compute.internal: uncordon succeeded (currently schedulable: false)   <-- this is bad, we should have retried here
2022-07-22T10:02:04.416449688Z I0722 10:02:04.416437 1 drain_controller.go:147] node ip-10-0-208-204.us-east-2.compute.internal: operation successful; applying completion annotation

It thinks it succeeded, and then we start getting errors uncordoning a different node:

2022-07-22T10:02:06.832204953Z I0722 10:02:06.832167 1 drain_controller.go:219] Error syncing node ip-10-0-129-142.us-east-2.compute.internal: node ip-10-0-129-142.us-east-2.compute.internal: failed to set node uncordoned annotation: node ip-10-0-129-142.us-east-2.compute.internal: unable to update: Timeout: request did not complete within requested timeout - context deadline exceeded
2022-07-22T10:02:06.832228679Z I0722 10:02:06.832207 1 drain_controller.go:147] node ip-10-0-129-142.us-east-2.compute.internal: cordoning
2022-07-22T10:02:06.832228679Z I0722 10:02:06.832215 1 drain_controller.go:147] node ip-10-0-129-142.us-east-2.compute.internal: initiating cordon (currently schedulable: false)
2022-07-22T10:02:06.843697668Z I0722 10:02:06.843648 1 drain_controller.go:147] node ip-10-0-129-142.us-east-2.compute.internal: cordon succeeded (currently schedulable: false)
2022-07-22T10:02:06.843697668Z I0722 10:02:06.843683 1 drain_controller.go:147] node ip-10-0-129-142.us-east-2.compute.internal: initiating drain

Then we get some more response timeouts:

2022-07-22T10:02:15.569135401Z I0722 10:02:15.569103 1 kubelet_config_nodes.go:123] empty Node resource found
2022-07-22T10:02:15.672781541Z I0722 10:02:15.672742 1 container_runtime_config_controller.go:364] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not find MachineConfig: the server was unable to return a response in the time allotted, but may still be processing the request (get machineconfigs.machineconfiguration.openshift.io 99-worker-generated-registries)
2022-07-22T10:02:15.714225623Z I0722 10:02:15.714199 1 template_controller.go:421] Error syncing controllerconfig machine-config-controller: the server was unable to return a response in the time allotted, but may still be processing the request (get machineconfigs.machineconfiguration.openshift.io 00-master)
2022-07-22T10:02:17.626411127Z E0722 10:02:17.626377 1 drain_controller.go:112] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-s5fd4, openshift-cluster-node-tuning-operator/tuned-9g6gb, openshift-dns/dns-default-mffv9, openshift-dns/node-resolver-qlhns, openshift-image-registry/node-ca-kzx87, openshift-ingress-canary/ingress-canary-l759n, openshift-machine-config-operator/machine-config-daemon-w5bqb, openshift-monitoring/node-exporter-ptmpz, openshift-multus/multus-6z8wk, openshift-multus/multus-additional-cni-plugins-npbf5, openshift-multus/network-metrics-daemon-h4rb8, openshift-network-diagnostics/network-check-target-b952n, openshift-ovn-kubernetes/ovnkube-node-xxwvr
2022-07-22T10:02:17.626411127Z I0722 10:02:17.626398 1 drain_controller.go:147] node ip-10-0-129-142.us-east-2.compute.internal: operation successful; applying completion annotation
2022-07-22T10:02:38.426570497Z I0722 10:02:38.426538 1 drain_controller.go:219] Error syncing node ip-10-0-208-204.us-east-2.compute.internal: node ip-10-0-208-204.us-east-2.compute.internal: failed to set node uncordoned annotation: node ip-10-0-208-204.us-east-2.compute.internal: unable to update: Timeout: request did not complete within requested timeout - context deadline exceeded

Then we try to uncordon again; it looks like it didn't stick, but we think we succeeded and continue:

2022-07-22T10:02:38.431741550Z I0722 10:02:38.431715 1 drain_controller.go:147] node ip-10-0-208-204.us-east-2.compute.internal: uncordoning
2022-07-22T10:02:38.431741550Z I0722 10:02:38.431731 1 drain_controller.go:147] node ip-10-0-208-204.us-east-2.compute.internal: initiating uncordon (currently schedulable: true)
2022-07-22T10:02:38.436085233Z I0722 10:02:38.436063 1 drain_controller.go:147] node ip-10-0-208-204.us-east-2.compute.internal: uncordon succeeded (currently schedulable: false)   <-- this is also bad
2022-07-22T10:02:38.436085233Z I0722 10:02:38.436078 1 drain_controller.go:147] node ip-10-0-208-204.us-east-2.compute.internal: operation successful; applying completion annotation

Then ip-10-0-208-204 is still cordoned even though we supposedly uncordoned it and marked the operation complete:

2022-07-22T10:02:38.470063779Z I0722 10:02:38.470042 1 node_controller.go:446] Pool master[zone=us-east-2c]: node ip-10-0-208-204.us-east-2.compute.internal: Reporting unready: node ip-10-0-208-204.us-east-2.compute.internal is reporting Unschedulable

Then we get more sync errors:

2022-07-22T10:02:43.729761612Z E0722 10:02:43.729731 1 render_controller.go:460] Error updating MachineConfigPool master: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
2022-07-22T10:02:43.729761612Z I0722 10:02:43.729747 1 render_controller.go:377] Error syncing machineconfigpool master: Timeout: request did not complete within requested timeout - context deadline exceeded
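For what it's worth, the kind of re-check suggested above -- trusting a fresh read of the node rather than the object held before the update -- could look roughly like the sketch below. It is a minimal sketch against a plain client-go clientset; the function name, polling intervals, and wiring are illustrative and are not the actual drain_controller.go change.

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// verifyUncordon re-reads the node after an uncordon and only reports success once the
// freshly fetched object is actually schedulable, retrying otherwise, instead of trusting
// the (possibly stale) copy held from before the update.
func verifyUncordon(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	return wait.PollImmediate(5*time.Second, 2*time.Minute, func() (bool, error) {
		node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			// Tolerate the kind of transient API timeouts seen in the logs above.
			fmt.Printf("node %s: error re-fetching node, will retry: %v\n", nodeName, err)
			return false, nil
		}
		if node.Spec.Unschedulable {
			fmt.Printf("node %s: still unschedulable after uncordon, will retry\n", nodeName)
			return false, nil
		}
		return true, nil
	})
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	if err := verifyUncordon(context.Background(), client, "ip-10-0-208-204.us-east-2.compute.internal"); err != nil {
		panic(err)
	}
	fmt.Println("uncordon verified against a fresh read of the node")
}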
Hi, tested it with the steps below:

1. IPI - OVN - AWS installation: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/126189/
2. Scale up to 50 worker nodes: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/cluster-workers-scaling/1150/
3. Run cluster-density test: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/kube-burner/941/
4. Upgrade to 4.12.0-0.nightly-2022-07-31-185642 (the fix PR is merged in):

mac:~ jianzhang$ oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:38ab2a0a0ec17d15fb813b1d0ab4240768ce6f8ee7415083519cc410de493706 --force --allow-explicit-upgrade
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requesting update to release image registry.ci.openshift.org/ocp/release@sha256:38ab2a0a0ec17d15fb813b1d0ab4240768ce6f8ee7415083519cc410de493706

mac:~ jianzhang$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.24   True        True          15s     Working towards 4.12.0-0.nightly-2022-07-31-185642: 6 of 802 done (0% complete)

mac:~ jianzhang$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.24   True        True          7m27s   Working towards 4.12.0-0.nightly-2022-07-31-185642: 106 of 802 done (13% complete)

mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-07-31-185642   True        False         3m21s   Cluster version is 4.12.0-0.nightly-2022-07-31-185642

mac:~ jianzhang$ oc get mcp master
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-080e4b403818d032ddd8fd76af316b1b   True      False      False      3              3                   3                     0                      6h46m

mac:~ jianzhang$ oc get co machine-config -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2022-08-01T03:27:09Z"
  generation: 1
  name: machine-config
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: 8562db36-ad99-4ec3-be61-0940754c27d2
  resourceVersion: "446725"
  uid: 67e177fd-eecb-4a7c-848c-dd67e08b7a36
spec: {}
status:
  conditions:
  - lastTransitionTime: "2022-08-01T09:43:23Z"
    message: Cluster version is 4.12.0-0.nightly-2022-07-31-185642
    status: "False"
    type: Progressing
  - lastTransitionTime: "2022-08-01T09:08:08Z"
    message: One or more machine config pools are updating, please see `oc get mcp` for further details
    reason: PoolUpdating
    status: "False"
    type: Upgradeable
  - lastTransitionTime: "2022-08-01T09:43:22Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2022-08-01T03:34:10Z"
    message: Cluster has deployed [{operator 4.12.0-0.nightly-2022-07-31-185642}]
    status: "True"
    type: Available
  extension:
    master: all 3 nodes are at latest configuration rendered-master-080e4b403818d032ddd8fd76af316b1b
    worker: 19 (ready 19) out of 50 nodes are updating to latest configuration rendered-worker-189d3a81a43ea71c5d0f0218f2234821
  relatedObjects:
  - group: ""
    name: openshift-machine-config-operator
    resource: namespaces
  - group: machineconfiguration.openshift.io
    name: ""
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: ""
    resource: controllerconfigs
  - group: machineconfiguration.openshift.io
    name: ""
    resource: kubeletconfigs
  - group: machineconfiguration.openshift.io
    name: ""
    resource: containerruntimeconfigs
  - group: machineconfiguration.openshift.io
    name: ""
    resource: machineconfigs
  - group: ""
    name: ""
    resource: nodes
  - group: ""
    name: openshift-kni-infra
    resource: namespaces
  - group: ""
    name: openshift-openstack-infra
    resource: namespaces
  - group: ""
    name: openshift-ovirt-infra
    resource: namespaces
  - group: ""
    name: openshift-vsphere-infra
    resource: namespaces
  - group: ""
    name: openshift-nutanix-infra
    resource: namespaces
  versions:
  - name: operator
    version: 4.12.0-0.nightly-2022-07-31-185642

LGTM, verified.
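As an aside, the per-node state behind that oc get mcp master summary can also be spot-checked programmatically, which is handy for catching the "updated but still SchedulingDisabled" situation from this bug. A minimal sketch, assuming kubeconfig access and client-go; the annotation keys are the ones the MCO sets on nodes, while the selector and output format are just for illustration.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Select masters the same way the master pool's nodeSelector does.
	nodes, err := client.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{
		LabelSelector: "node-role.kubernetes.io/master=",
	})
	if err != nil {
		panic(err)
	}

	// Print the MCO's per-node view (current vs desired rendered config and state)
	// alongside whether the node is actually schedulable.
	for _, node := range nodes.Items {
		current := node.Annotations["machineconfiguration.openshift.io/currentConfig"]
		desired := node.Annotations["machineconfiguration.openshift.io/desiredConfig"]
		state := node.Annotations["machineconfiguration.openshift.io/state"]
		fmt.Printf("%s: current=%s desired=%s state=%s schedulable=%t\n",
			node.Name, current, desired, state, !node.Spec.Unschedulable)
	}
}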
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399