Bug 2109963
Summary: | Master node in SchedulingDisabled after upgrade from 4.10.24 -> 4.11.0-rc.4 | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Simon <skordas> |
Component: | Machine Config Operator | Assignee: | John Kyros <jkyros> |
Machine Config Operator sub component: | Machine Config Operator | QA Contact: | Jian Zhang <jiazha> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | medium | ||
Priority: | unspecified | CC: | jerzhang, jiazha, jkyros, mkrejci |
Version: | 4.11 | ||
Target Milestone: | --- | ||
Target Release: | 4.12.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2023-01-17 19:53:12 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 2110737 |
Description
Simon
2022-07-22 14:46:02 UTC
Simon, is the "cluster-density" test some type of "torture test"? :)

There is something weird going on in this cluster in terms of apiserver connectivity/processing; the MCO is having trouble getting timely/reliable responses. That said, thank you for bringing this to our attention. It does look like the MCO let a node "sneak through" without making sure it was uncordoned, and I think there is a small MCO bug here in the drain controller where we check that the uncordon actually worked -- we should probably be checking the updated node object, not the original one:

https://github.com/openshift/machine-config-operator/blob/8a8c20eef73817fc400198d16726cde104d5ae95/pkg/controller/drain/drain_controller.go#L415

I think had we checked the updated object and noticed it was still in the wrong state, it would have waited/retried instead of moving on (a rough sketch of that check follows the log excerpts below).

Supporting details for my suspicions from the must-gather:

It looks like the cluster is wedged because one of the masters is "SchedulingDisabled" even though the MCO thinks it's done/uncordoned.

ip-10-0-129-11.us-east-2.compute.internal    Ready                      master   3h39m   v1.24.0+9546431
ip-10-0-208-204.us-east-2.compute.internal   Ready,SchedulingDisabled   master   3h39m   v1.24.0+9546431   <--
ip-10-0-161-124.us-east-2.compute.internal   Ready                      master   3h39m   v1.24.0+9546431

From the machine-config-controller log in the must-gather:

It's trying to finish and uncordon master ip-10-0-208-204, but it looks like the uncordon somehow got reversed:

2022-07-22T10:02:04.409756543Z I0722 10:02:04.409721 1 drain_controller.go:147] node ip-10-0-208-204.us-east-2.compute.internal: initiating uncordon (currently schedulable: true)
2022-07-22T10:02:04.416449688Z I0722 10:02:04.416425 1 drain_controller.go:147] node ip-10-0-208-204.us-east-2.compute.internal: uncordon succeeded (currently schedulable: false) <-- this is bad, we should have retried here
2022-07-22T10:02:04.416449688Z I0722 10:02:04.416437 1 drain_controller.go:147] node ip-10-0-208-204.us-east-2.compute.internal: operation successful; applying completion annotation

It thinks it succeeded, and then we start getting errors uncordoning a different node:

2022-07-22T10:02:06.832204953Z I0722 10:02:06.832167 1 drain_controller.go:219] Error syncing node ip-10-0-129-142.us-east-2.compute.internal: node ip-10-0-129-142.us-east-2.compute.internal: failed to set node uncordoned annotation: node ip-10-0-129-142.us-east-2.compute.internal: unable to update: Timeout: request did not complete within requested timeout - context deadline exceeded
2022-07-22T10:02:06.832228679Z I0722 10:02:06.832207 1 drain_controller.go:147] node ip-10-0-129-142.us-east-2.compute.internal: cordoning
2022-07-22T10:02:06.832228679Z I0722 10:02:06.832215 1 drain_controller.go:147] node ip-10-0-129-142.us-east-2.compute.internal: initiating cordon (currently schedulable: false)
2022-07-22T10:02:06.843697668Z I0722 10:02:06.843648 1 drain_controller.go:147] node ip-10-0-129-142.us-east-2.compute.internal: cordon succeeded (currently schedulable: false)
2022-07-22T10:02:06.843697668Z I0722 10:02:06.843683 1 drain_controller.go:147] node ip-10-0-129-142.us-east-2.compute.internal: initiating drain

Then we get some more response timeouts:

2022-07-22T10:02:15.569135401Z I0722 10:02:15.569103 1 kubelet_config_nodes.go:123] empty Node resource found
2022-07-22T10:02:15.672781541Z I0722 10:02:15.672742 1 container_runtime_config_controller.go:364] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not find MachineConfig: the server was unable to return a response in the time allotted, but may still be processing the request (get machineconfigs.machineconfiguration.openshift.io 99-worker-generated-registries)
2022-07-22T10:02:15.714225623Z I0722 10:02:15.714199 1 template_controller.go:421] Error syncing controllerconfig machine-config-controller: the server was unable to return a response in the time allotted, but may still be processing the request (get machineconfigs.machineconfiguration.openshift.io 00-master)
2022-07-22T10:02:17.626411127Z E0722 10:02:17.626377 1 drain_controller.go:112] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-s5fd4, openshift-cluster-node-tuning-operator/tuned-9g6gb, openshift-dns/dns-default-mffv9, openshift-dns/node-resolver-qlhns, openshift-image-registry/node-ca-kzx87, openshift-ingress-canary/ingress-canary-l759n, openshift-machine-config-operator/machine-config-daemon-w5bqb, openshift-monitoring/node-exporter-ptmpz, openshift-multus/multus-6z8wk, openshift-multus/multus-additional-cni-plugins-npbf5, openshift-multus/network-metrics-daemon-h4rb8, openshift-network-diagnostics/network-check-target-b952n, openshift-ovn-kubernetes/ovnkube-node-xxwvr
2022-07-22T10:02:17.626411127Z I0722 10:02:17.626398 1 drain_controller.go:147] node ip-10-0-129-142.us-east-2.compute.internal: operation successful; applying completion annotation
2022-07-22T10:02:38.426570497Z I0722 10:02:38.426538 1 drain_controller.go:219] Error syncing node ip-10-0-208-204.us-east-2.compute.internal: node ip-10-0-208-204.us-east-2.compute.internal: failed to set node uncordoned annotation: node ip-10-0-208-204.us-east-2.compute.internal: unable to update: Timeout: request did not complete within requested timeout - context deadline exceeded

Then we try to uncordon again; it looks like it didn't stick, but we think we succeeded and continue:

2022-07-22T10:02:38.431741550Z I0722 10:02:38.431715 1 drain_controller.go:147] node ip-10-0-208-204.us-east-2.compute.internal: uncordoning
2022-07-22T10:02:38.431741550Z I0722 10:02:38.431731 1 drain_controller.go:147] node ip-10-0-208-204.us-east-2.compute.internal: initiating uncordon (currently schedulable: true)
2022-07-22T10:02:38.436085233Z I0722 10:02:38.436063 1 drain_controller.go:147] node ip-10-0-208-204.us-east-2.compute.internal: uncordon succeeded (currently schedulable: false) <-- this is also bad
2022-07-22T10:02:38.436085233Z I0722 10:02:38.436078 1 drain_controller.go:147] node ip-10-0-208-204.us-east-2.compute.internal: operation successful; applying completion annotation

Then ip-10-0-208-204 is still cordoned even though we supposedly uncordoned it and marked it complete:

2022-07-22T10:02:38.470063779Z I0722 10:02:38.470042 1 node_controller.go:446] Pool master[zone=us-east-2c]: node ip-10-0-208-204.us-east-2.compute.internal: Reporting unready: node ip-10-0-208-204.us-east-2.compute.internal is reporting Unschedulable

Then we get more sync errors:

2022-07-22T10:02:43.729761612Z E0722 10:02:43.729731 1 render_controller.go:460] Error updating MachineConfigPool master: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
2022-07-22T10:02:43.729761612Z I0722 10:02:43.729747 1 render_controller.go:377] Error syncing machineconfigpool master: Timeout: request did not complete within requested timeout - context deadline exceeded
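To make the suspected fix concrete, here is a rough Go sketch of the idea above: verify the uncordon against a freshly fetched node object and retry on transient API errors. This is illustrative only, not the actual machine-config-operator change; the package and function names are invented, and it uses a plain client-go clientset and a strategic-merge patch as a stand-in for the controller's own cordon/uncordon helper.

```go
// Illustrative sketch only -- not the actual MCO patch. After an uncordon, check
// the *updated* node object from the API server (and retry) rather than the
// possibly stale copy the controller started with, so a
// "uncordon succeeded (currently schedulable: false)" situation gets caught.
package drainexample

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// uncordonAndVerify clears the unschedulable flag and then re-reads the node
// until the API server actually reports it schedulable again.
func uncordonAndVerify(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	// Clear spec.unschedulable (stand-in for the real cordon/uncordon helper).
	patch := []byte(`{"spec":{"unschedulable":false}}`)
	if _, err := client.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
		return fmt.Errorf("node %s: uncordon request failed: %w", nodeName, err)
	}

	// Verify against a fresh Get, not the object we happened to hold before.
	err := wait.PollImmediate(5*time.Second, 2*time.Minute, func() (bool, error) {
		updated, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			// Transient apiserver timeouts (like those in the logs) just retry.
			return false, nil
		}
		return !updated.Spec.Unschedulable, nil
	})
	if err != nil {
		return fmt.Errorf("node %s is still unschedulable after uncordon: %w", nodeName, err)
	}
	fmt.Printf("node %s: uncordon verified (currently schedulable: true)\n", nodeName)
	return nil
}
```

With a check along these lines, the "uncordon succeeded (currently schedulable: false)" case in the logs above would surface as a retry (and eventually an error) instead of a completion annotation.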
Hi, tested it with the below steps:

1. IPI - OVN - AWS installation: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/126189/
2. Scale up to 50 worker nodes: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/cluster-workers-scaling/1150/
3. Run the cluster-density test: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/kube-burner/941/
4. Upgrade to 4.12.0-0.nightly-2022-07-31-185642 (the build the fix PR merged into):

mac:~ jianzhang$ oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:38ab2a0a0ec17d15fb813b1d0ab4240768ce6f8ee7415083519cc410de493706 --force --allow-explicit-upgrade
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requesting update to release image registry.ci.openshift.org/ocp/release@sha256:38ab2a0a0ec17d15fb813b1d0ab4240768ce6f8ee7415083519cc410de493706

mac:~ jianzhang$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.24   True        True          15s     Working towards 4.12.0-0.nightly-2022-07-31-185642: 6 of 802 done (0% complete)

mac:~ jianzhang$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.24   True        True          7m27s   Working towards 4.12.0-0.nightly-2022-07-31-185642: 106 of 802 done (13% complete)

mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-07-31-185642   True        False         3m21s   Cluster version is 4.12.0-0.nightly-2022-07-31-185642
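For anyone automating the same verification, here is a minimal client-go sketch (not from the MCO codebase or the original test jobs; the package name, function name, and kubeconfig handling are illustrative) that flags any node still left SchedulingDisabled after the upgrade, which was the symptom of this bug. The manual checks with `oc get mcp` and `oc get co machine-config` continue below.

```go
// Illustrative sketch: list nodes and report any still marked unschedulable,
// i.e. the "SchedulingDisabled" state the original bug left a master in.
package verifyexample

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// CordonedNodes returns the names of any nodes whose spec.unschedulable is set.
func CordonedNodes(ctx context.Context, kubeconfig string) ([]string, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return nil, err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var cordoned []string
	for _, n := range nodes.Items {
		if n.Spec.Unschedulable {
			cordoned = append(cordoned, n.Name)
		}
	}
	if len(cordoned) > 0 {
		fmt.Printf("still SchedulingDisabled: %v\n", cordoned)
	}
	return cordoned, nil
}
```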
mac:~ jianzhang$ oc get mcp master
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-080e4b403818d032ddd8fd76af316b1b   True      False      False      3              3                   3                     0                      6h46m

mac:~ jianzhang$ oc get co machine-config -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2022-08-01T03:27:09Z"
  generation: 1
  name: machine-config
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: 8562db36-ad99-4ec3-be61-0940754c27d2
  resourceVersion: "446725"
  uid: 67e177fd-eecb-4a7c-848c-dd67e08b7a36
spec: {}
status:
  conditions:
  - lastTransitionTime: "2022-08-01T09:43:23Z"
    message: Cluster version is 4.12.0-0.nightly-2022-07-31-185642
    status: "False"
    type: Progressing
  - lastTransitionTime: "2022-08-01T09:08:08Z"
    message: One or more machine config pools are updating, please see `oc get mcp` for further details
    reason: PoolUpdating
    status: "False"
    type: Upgradeable
  - lastTransitionTime: "2022-08-01T09:43:22Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2022-08-01T03:34:10Z"
    message: Cluster has deployed [{operator 4.12.0-0.nightly-2022-07-31-185642}]
    status: "True"
    type: Available
  extension:
    master: all 3 nodes are at latest configuration rendered-master-080e4b403818d032ddd8fd76af316b1b
    worker: 19 (ready 19) out of 50 nodes are updating to latest configuration rendered-worker-189d3a81a43ea71c5d0f0218f2234821
  relatedObjects:
  - group: ""
    name: openshift-machine-config-operator
    resource: namespaces
  - group: machineconfiguration.openshift.io
    name: ""
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: ""
    resource: controllerconfigs
  - group: machineconfiguration.openshift.io
    name: ""
    resource: kubeletconfigs
  - group: machineconfiguration.openshift.io
    name: ""
    resource: containerruntimeconfigs
  - group: machineconfiguration.openshift.io
    name: ""
    resource: machineconfigs
  - group: ""
    name: ""
    resource: nodes
  - group: ""
    name: openshift-kni-infra
    resource: namespaces
  - group: ""
    name: openshift-openstack-infra
    resource: namespaces
  - group: ""
    name: openshift-ovirt-infra
    resource: namespaces
  - group: ""
    name: openshift-vsphere-infra
    resource: namespaces
  - group: ""
    name: openshift-nutanix-infra
    resource: namespaces
  versions:
  - name: operator
    version: 4.12.0-0.nightly-2022-07-31-185642

LGTM, verify it.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399