Bug 2109963 - Master node in SchedulingDisabled after upgrade from 4.10.24 -> 4.11.0-rc.4
Summary: Master node in SchedulingDisabled after upgrade from 4.10.24 -> 4.11.0-rc.4
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.12.0
Assignee: John Kyros
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks: 2110737
 
Reported: 2022-07-22 14:46 UTC by Simon
Modified: 2023-01-17 19:53 UTC
CC List: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-17 19:53:12 UTC
Target Upstream Version:
Embargoed:


Links:
- GitHub: openshift/machine-config-operator pull 3264 (open) - Bug 2109963: Master node in SchedulingDisabled after upgrade from 4.10.24 -> 4.11.0-rc.4 (last updated 2022-07-22 21:27:03 UTC)
- Red Hat Product Errata: RHSA-2022:7399 (last updated 2023-01-17 19:53:25 UTC)

Description Simon 2022-07-22 14:46:02 UTC
Description of problem:
Master node in SchedulingDisabled after upgrade from 4.10.24 -> 4.11.0-rc.4
Error during syncRequiredMachineConfigPools: controller version mismatch

Version-Release number of MCO (Machine Config Operator) (if applicable): 4.10.24

Platform (AWS, VSphere, Metal, etc.): AWS

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)?
Not sure

How reproducible: So far reproduced on AWS only.

Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/loaded-upgrade/579/

Steps to Reproduce:
1. IPI - OVN - AWS installation: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/123249/
2. Scale up to 50 working nodes: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/cluster-workers-scaling/1125/
3. Run cluster-density test: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/kube-burner/890/
4. Upgrade to 4.11.0-rc.4 version: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/upgrade/262/

Actual results: Master node in SchedulingDisabled state.
From oc get co machine-config -o yaml
status:
  conditions:
  - lastTransitionTime: "2022-07-22T09:20:06Z"
    message: Working towards 4.11.0-rc.4
    status: "True"
    type: Progressing
  - lastTransitionTime: "2022-07-22T09:25:56Z"
    message: One or more machine config pools are updating, please see `oc get mcp`
      for further details
    reason: PoolUpdating
    status: "False"
    type: Upgradeable
  - lastTransitionTime: "2022-07-22T10:28:13Z"
    message: 'Unable to apply 4.11.0-rc.4: error during syncRequiredMachineConfigPools:
      [timed out waiting for the condition, pool master has not progressed to latest
      configuration: controller version mismatch for rendered-master-1bcb4b6fd0168e4119f0ee9d9b039f6c
      expected e722bb791349c5f74548633d5c3407c414ece9bf has be815b38adf693db9f6def6747fe5542bdd67f36:
      3 (ready 2) out of 3 nodes are updating to latest configuration rendered-master-3160a1e1a282d0664ac7c4a6dcff6730,
      retrying]'
    reason: RequiredPoolsFailed
    status: "True"
    type: Degraded

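Triage aside: the "controller version mismatch" part of that message comes from the machineconfiguration.openshift.io/generated-by-controller-version annotation on the rendered MachineConfig, which the operator compares against the incoming controller's hash, while per-node progress is tracked through the machineconfiguration.openshift.io/currentConfig and desiredConfig node annotations. A quick way to eyeball both, sketched here with standard jsonpath dot-escaping:

$ oc get machineconfig rendered-master-1bcb4b6fd0168e4119f0ee9d9b039f6c -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/generated-by-controller-version}{"\n"}'
$ oc get nodes -l node-role.kubernetes.io/master -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}{" -> "}{.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig}{"\n"}{end}'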
$ oc get mcp master -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: "2022-07-22T07:47:34Z"
  generation: 3
  labels:
    machineconfiguration.openshift.io/mco-built-in: ""
    operator.machineconfiguration.openshift.io/required-for-upgrade: ""
    pools.operator.machineconfiguration.openshift.io/master: ""
  name: master
  resourceVersion: "248554"
  uid: cceafe0e-b059-4e2e-b9f6-a9379f357d5f
spec:
  configuration:
    name: rendered-master-3160a1e1a282d0664ac7c4a6dcff6730
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-master
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-master-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-master-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-generated-crio-seccomp-use-default
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-ssh
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: master
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/master: ""
  paused: false
status:
  conditions:
  - lastTransitionTime: "2022-07-22T07:48:27Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2022-07-22T07:48:36Z"
    message: ""
    reason: ""
    status: "False"
    type: NodeDegraded
  - lastTransitionTime: "2022-07-22T07:48:36Z"
    message: ""
    reason: ""
    status: "False"
    type: Degraded
  - lastTransitionTime: "2022-07-22T09:25:55Z"
    message: ""
    reason: ""
    status: "False"
    type: Updated
  - lastTransitionTime: "2022-07-22T09:25:55Z"
    message: All nodes are updating to rendered-master-3160a1e1a282d0664ac7c4a6dcff6730
    reason: ""
    status: "True"
    type: Updating
  configuration:
    name: rendered-master-1bcb4b6fd0168e4119f0ee9d9b039f6c
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-master
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-master-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-master-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-generated-crio-seccomp-use-default
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-ssh
  degradedMachineCount: 0
  machineCount: 3
  observedGeneration: 3
  readyMachineCount: 2
  unavailableMachineCount: 1
  updatedMachineCount: 3

Comment 2 John Kyros 2022-07-22 20:55:20 UTC
Simon,

Is the "cluster-density" test some type of "torture test"? :) 

There is something weird going on in this cluster in terms of apiserver connectivity/processing; the MCO is having trouble getting timely, reliable responses.

That said, thank you for bringing this to our attention. It does look like the MCO let a node "sneak through" without making sure it was uncordoned, and I think there is a small MCO bug in the drain controller where we check that the uncordon actually worked -- we should probably be checking the updated node object, not the original one:

https://github.com/openshift/machine-config-operator/blob/8a8c20eef73817fc400198d16726cde104d5ae95/pkg/controller/drain/drain_controller.go#L415

I think had we checked the updated object and noticed it was still in the wrong state, it would have waited/retried instead of moving on. 
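Roughly what I mean, as a minimal client-go sketch (hypothetical helper name, not the actual drain controller code):

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// verifyUncordon re-fetches the node after the uncordon so we check the state
// the apiserver actually persisted, not the stale copy we held before the
// update, and returns an error (which requeues the node) if it's still cordoned.
func verifyUncordon(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	updated, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("node %s: failed to re-fetch after uncordon: %w", nodeName, err)
	}
	if updated.Spec.Unschedulable {
		// This is exactly the "uncordon succeeded (currently schedulable: false)"
		// state from the logs below: the update didn't stick, so retry instead
		// of applying the completion annotation.
		return fmt.Errorf("node %s: still unschedulable after uncordon, retrying", nodeName)
	}
	return nil
}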



Supporting details for my suspicions from the must-gather: 


It looks like the cluster is wedged because one of the masters is "SchedulingDisabled" even though the MCO thinks it's done/uncordoned. 

ip-10-0-129-11.us-east-2.compute.internal   Ready                     master  3h39m  v1.24.0+9546431
ip-10-0-208-204.us-east-2.compute.internal  Ready,SchedulingDisabled  master  3h39m  v1.24.0+9546431  <-- 
ip-10-0-161-124.us-east-2.compute.internal  Ready                     master  3h39m  v1.24.0+9546431
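(For what it's worth, the stuck node can usually be cleared by hand in the meantime with `oc adm uncordon ip-10-0-208-204.us-east-2.compute.internal`, though that only removes the symptom, not the controller bug.)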

From the machine-config-controller log in the must-gather: 

It's trying to finish up and uncordon master ip-10-0-208-204, but it looks like the uncordon somehow got reversed:

2022-07-22T10:02:04.409756543Z I0722 10:02:04.409721       1 drain_controller.go:147] node ip-10-0-208-204.us-east-2.compute.internal: initiating uncordon (currently schedulable: true)
2022-07-22T10:02:04.416449688Z I0722 10:02:04.416425       1 drain_controller.go:147] node ip-10-0-208-204.us-east-2.compute.internal: uncordon succeeded (currently schedulable: false)  <-- this is bad, we should have retried here 
2022-07-22T10:02:04.416449688Z I0722 10:02:04.416437       1 drain_controller.go:147] node ip-10-0-208-204.us-east-2.compute.internal: operation successful; applying completion annotation

The controller thinks it succeeded, and then we start getting errors uncordoning a different node:

2022-07-22T10:02:06.832204953Z I0722 10:02:06.832167       1 drain_controller.go:219] Error syncing node ip-10-0-129-142.us-east-2.compute.internal: node ip-10-0-129-142.us-east-2.compute.internal: failed to set node uncordoned annotation: node ip-10-0-129-142.us-east-2.compute.internal: unable to update: Timeout: request did not complete within requested timeout - context deadline exceeded
2022-07-22T10:02:06.832228679Z I0722 10:02:06.832207       1 drain_controller.go:147] node ip-10-0-129-142.us-east-2.compute.internal: cordoning
2022-07-22T10:02:06.832228679Z I0722 10:02:06.832215       1 drain_controller.go:147] node ip-10-0-129-142.us-east-2.compute.internal: initiating cordon (currently schedulable: false)
2022-07-22T10:02:06.843697668Z I0722 10:02:06.843648       1 drain_controller.go:147] node ip-10-0-129-142.us-east-2.compute.internal: cordon succeeded (currently schedulable: false)
2022-07-22T10:02:06.843697668Z I0722 10:02:06.843683       1 drain_controller.go:147] node ip-10-0-129-142.us-east-2.compute.internal: initiating drain

Then we get some more response timeouts: 

2022-07-22T10:02:15.569135401Z I0722 10:02:15.569103       1 kubelet_config_nodes.go:123] empty Node resource found
2022-07-22T10:02:15.672781541Z I0722 10:02:15.672742       1 container_runtime_config_controller.go:364] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not find MachineConfig: the server was unable to return a response in the time allotted, but may still be processing the request (get machineconfigs.machineconfiguration.openshift.io 99-worker-generated-registries)
2022-07-22T10:02:15.714225623Z I0722 10:02:15.714199       1 template_controller.go:421] Error syncing controllerconfig machine-config-controller: the server was unable to return a response in the time allotted, but may still be processing the request (get machineconfigs.machineconfiguration.openshift.io 00-master)
2022-07-22T10:02:17.626411127Z E0722 10:02:17.626377       1 drain_controller.go:112] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-s5fd4, openshift-cluster-node-tuning-operator/tuned-9g6gb, openshift-dns/dns-default-mffv9, openshift-dns/node-resolver-qlhns, openshift-image-registry/node-ca-kzx87, openshift-ingress-canary/ingress-canary-l759n, openshift-machine-config-operator/machine-config-daemon-w5bqb, openshift-monitoring/node-exporter-ptmpz, openshift-multus/multus-6z8wk, openshift-multus/multus-additional-cni-plugins-npbf5, openshift-multus/network-metrics-daemon-h4rb8, openshift-network-diagnostics/network-check-target-b952n, openshift-ovn-kubernetes/ovnkube-node-xxwvr
2022-07-22T10:02:17.626411127Z I0722 10:02:17.626398       1 drain_controller.go:147] node ip-10-0-129-142.us-east-2.compute.internal: operation successful; applying completion annotation
2022-07-22T10:02:38.426570497Z I0722 10:02:38.426538       1 drain_controller.go:219] Error syncing node ip-10-0-208-204.us-east-2.compute.internal: node ip-10-0-208-204.us-east-2.compute.internal: failed to set node uncordoned annotation: node ip-10-0-208-204.us-east-2.compute.internal: unable to update: Timeout: request did not complete within requested timeout - context deadline exceeded

Then we try the uncordon again; it looks like it didn't stick, but we again think we succeeded and continue:

2022-07-22T10:02:38.431741550Z I0722 10:02:38.431715       1 drain_controller.go:147] node ip-10-0-208-204.us-east-2.compute.internal: uncordoning
2022-07-22T10:02:38.431741550Z I0722 10:02:38.431731       1 drain_controller.go:147] node ip-10-0-208-204.us-east-2.compute.internal: initiating uncordon (currently schedulable: true) 
2022-07-22T10:02:38.436085233Z I0722 10:02:38.436063       1 drain_controller.go:147] node ip-10-0-208-204.us-east-2.compute.internal: uncordon succeeded (currently schedulable: false) <-- this is also bad 
2022-07-22T10:02:38.436085233Z I0722 10:02:38.436078       1 drain_controller.go:147] node ip-10-0-208-204.us-east-2.compute.internal: operation successful; applying completion annotation

Then ip-10-0-208-204 is still cordoned even though we supposedly uncordoned it and completed it: 

2022-07-22T10:02:38.470063779Z I0722 10:02:38.470042       1 node_controller.go:446] Pool master[zone=us-east-2c]: node ip-10-0-208-204.us-east-2.compute.internal: Reporting unready: node ip-10-0-208-204.us-east-2.compute.internal is reporting Unschedulable

Then we get more sync errors: 

2022-07-22T10:02:43.729761612Z E0722 10:02:43.729731       1 render_controller.go:460] Error updating MachineConfigPool master: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
2022-07-22T10:02:43.729761612Z I0722 10:02:43.729747       1 render_controller.go:377] Error syncing machineconfigpool master: Timeout: request did not complete within requested timeout - context deadline exceeded

Comment 4 Jian Zhang 2022-08-01 10:27:16 UTC
Hi,

Tested with the steps below:

1. IPI - OVN - AWS installation: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/126189/

2. Scale up to 50 working nodes: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/cluster-workers-scaling/1150/ 

3. Run cluster-density test: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/kube-burner/941/

4. Upgrade to 4.12.0-0.nightly-2022-07-31-185642 (the fix PR is merged in this build):
mac:~ jianzhang$ oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:38ab2a0a0ec17d15fb813b1d0ab4240768ce6f8ee7415083519cc410de493706 --force --allow-explicit-upgrade
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requesting update to release image registry.ci.openshift.org/ocp/release@sha256:38ab2a0a0ec17d15fb813b1d0ab4240768ce6f8ee7415083519cc410de493706
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.24   True        True          15s     Working towards 4.12.0-0.nightly-2022-07-31-185642: 6 of 802 done (0% complete)
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.24   True        True          7m27s   Working towards 4.12.0-0.nightly-2022-07-31-185642: 106 of 802 done (13% complete)
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-07-31-185642   True        False         3m21s   Cluster version is 4.12.0-0.nightly-2022-07-31-185642


mac:~ jianzhang$ oc get mcp master 
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-080e4b403818d032ddd8fd76af316b1b   True      False      False      3              3                   3                     0                      6h46m
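(For completeness, the original symptom can be re-checked directly: `oc get nodes -l node-role.kubernetes.io/master` should list all three masters as Ready, with none in SchedulingDisabled.)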

mac:~ jianzhang$ oc get co machine-config -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2022-08-01T03:27:09Z"
  generation: 1
  name: machine-config
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: 8562db36-ad99-4ec3-be61-0940754c27d2
  resourceVersion: "446725"
  uid: 67e177fd-eecb-4a7c-848c-dd67e08b7a36
spec: {}
status:
  conditions:
  - lastTransitionTime: "2022-08-01T09:43:23Z"
    message: Cluster version is 4.12.0-0.nightly-2022-07-31-185642
    status: "False"
    type: Progressing
  - lastTransitionTime: "2022-08-01T09:08:08Z"
    message: One or more machine config pools are updating, please see `oc get mcp`
      for further details
    reason: PoolUpdating
    status: "False"
    type: Upgradeable
  - lastTransitionTime: "2022-08-01T09:43:22Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2022-08-01T03:34:10Z"
    message: Cluster has deployed [{operator 4.12.0-0.nightly-2022-07-31-185642}]
    status: "True"
    type: Available
  extension:
    master: all 3 nodes are at latest configuration rendered-master-080e4b403818d032ddd8fd76af316b1b
    worker: 19 (ready 19) out of 50 nodes are updating to latest configuration rendered-worker-189d3a81a43ea71c5d0f0218f2234821
  relatedObjects:
  - group: ""
    name: openshift-machine-config-operator
    resource: namespaces
  - group: machineconfiguration.openshift.io
    name: ""
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: ""
    resource: controllerconfigs
  - group: machineconfiguration.openshift.io
    name: ""
    resource: kubeletconfigs
  - group: machineconfiguration.openshift.io
    name: ""
    resource: containerruntimeconfigs
  - group: machineconfiguration.openshift.io
    name: ""
    resource: machineconfigs
  - group: ""
    name: ""
    resource: nodes
  - group: ""
    name: openshift-kni-infra
    resource: namespaces
  - group: ""
    name: openshift-openstack-infra
    resource: namespaces
  - group: ""
    name: openshift-ovirt-infra
    resource: namespaces
  - group: ""
    name: openshift-vsphere-infra
    resource: namespaces
  - group: ""
    name: openshift-nutanix-infra
    resource: namespaces
  versions:
  - name: operator
    version: 4.12.0-0.nightly-2022-07-31-185642

LGTM, marking it VERIFIED.

Comment 7 errata-xmlrpc 2023-01-17 19:53:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

