Bug 2103786 - MCP upgrades can stall waiting for master node reboots since MCC no longer gets drained
Summary: MCP upgrades can stall waiting for master node reboots since MCC no longer gets drained
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Yu Qi Zhang
QA Contact: Sergio
URL:
Whiteboard:
Depends On:
Blocks: 2104687
 
Reported: 2022-07-04 21:59 UTC by Yu Qi Zhang
Modified: 2023-01-17 19:51 UTC
CC List: 2 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-17 19:51:26 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 3212 0 None open Bug 2103786: drain controller: don't skip the MCC pod drain 2022-07-04 22:04:31 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:51:44 UTC

Description Yu Qi Zhang 2022-07-04 21:59:02 UTC
In 4.11, when we switched the MCO over to controller-based draining, we skip draining the MCC pod since we want it to finish the node it is working on.

In practice this means the master node the MCC runs on retains the MCC pod. The pod won't get restarted until that master node has finished OS updates and rebooted. In some setups, the MCO could stall for a long time waiting for the master node to come back up.

This may get resolved with graceful shutdown (?), or maybe I am misunderstanding how pod scheduling works alongside shutdown, but I think in the short term the MCC pod should get drained. With John's recent fix to leader elections, the new controller will pick up where the old one left off.
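
A quick way to observe the intended behavior (a sketch, not an exact procedure): watch the MCC pod while a master pool update runs. With the fix, the old pod should terminate when its node drains and a replacement should start on another master before the reboot, instead of lingering until the node comes back:

$ oc get pods -n openshift-machine-config-operator -w | grep machine-config-controller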

Comment 3 Rio Liu 2022-07-12 04:30:54 UTC
Verified on 4.12.0-0.nightly-2022-07-07-092951.

1. Create a MachineConfig to trigger an update on the master nodes

$ oc create -f change-masters-chrony-configuration.yaml
machineconfig.machineconfiguration.openshift.io/change-masters-chrony-configuration created

$ cat change-masters-chrony-configuration.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
>>    machineconfiguration.openshift.io/role: master
  name: change-masters-chrony-configuration
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 3.2.0
    networkd: {}
    passwd: {}
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,cG9vbCAwLnJoZWwucG9vbC5udHAub3JnIGlidXJzdApkcmlmdGZpbGUgL3Zhci9saWIvY2hyb255L2RyaWZ0Cm1ha2VzdGVwIDEuMCAzCnJ0Y3N5bmMKbG9nZGlyIC92YXIvbG9nL2Nocm9ueQo=
        mode: 420
        overwrite: true
        path: /etc/chrony.conf
  osImageURL: ""
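
For reference, the base64 source above decodes to a minimal chrony.conf:

$ echo 'cG9vbCAwLnJoZWwucG9vbC5udHAub3JnIGlidXJzdApkcmlmdGZpbGUgL3Zhci9saWIvY2hyb255L2RyaWZ0Cm1ha2VzdGVwIDEuMCAzCnJ0Y3N5bmMKbG9nZGlyIC92YXIvbG9nL2Nocm9ueQo=' | base64 -d
pool 0.rhel.pool.ntp.org iburst
driftfile /var/lib/chrony/drift
makestep 1.0 3
rtcsync
logdir /var/log/chrony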

2. Check which node the MCC pod is running on

$ oc get pod -n openshift-machine-config-operator
NAME                                         READY   STATUS    RESTARTS   AGE
machine-config-controller-7fbd48c6fc-lwttm   2/2     Running   0          21m
...

$ oc get pod/machine-config-controller-7fbd48c6fc-lwttm -n openshift-machine-config-operator -o yaml|yq -y '.spec.nodeName'
>> ip-10-0-193-82.ec2.internal
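
The same value can be read without yq using jsonpath:

$ oc get pod machine-config-controller-7fbd48c6fc-lwttm -n openshift-machine-config-operator -o jsonpath='{.spec.nodeName}{"\n"}'
ip-10-0-193-82.ec2.internal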

3. Make sure the node drain happened on this node

$ oc get events -n default --field-selector involvedObject.name=ip-10-0-193-82.ec2.internal,type!=Warning
LAST SEEN   TYPE     REASON                      OBJECT                             MESSAGE
...
23m         Normal   Cordon                      node/ip-10-0-193-82.ec2.internal   Cordoned node to apply update
>> 23m         Normal   Drain                       node/ip-10-0-193-82.ec2.internal   Draining node to update config.
23m         Normal   NodeNotSchedulable          node/ip-10-0-193-82.ec2.internal   Node ip-10-0-193-82.ec2.internal status is now: NodeNotSchedulable
22m         Normal   OSUpdateStarted             node/ip-10-0-193-82.ec2.internal
22m         Normal   OSUpgradeSkipped            node/ip-10-0-193-82.ec2.internal   OS upgrade skipped; new MachineConfig (rendered-master-5beee16903bea1f4444aaa5362b5cca8) has same OS image (quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bbeb2b82e57a51be17177daaffb33b7d557ea7595db2c52c83e8806cc0104100) as old MachineConfig (rendered-master-1341b162472e3d65bde2fa96a3a7e1a8)
22m         Normal   OSUpdateStaged              node/ip-10-0-193-82.ec2.internal   Changes to OS staged
22m         Normal   PendingConfig               node/ip-10-0-193-82.ec2.internal   Written pending config rendered-master-5beee16903bea1f4444aaa5362b5cca8
>> 22m         Normal   Reboot                      node/ip-10-0-193-82.ec2.internal   Node will reboot into config rendered-master-5beee16903bea1f4444aaa5362b5cca8
21m         Normal   NodeNotReady                node/ip-10-0-193-82.ec2.internal   Node ip-10-0-193-82.ec2.internal status is now: NodeNotReady
19m         Normal   Starting                    node/ip-10-0-193-82.ec2.internal   Starting kubelet.
19m         Normal   NodeAllocatableEnforced     node/ip-10-0-193-82.ec2.internal   Updated Node Allocatable limit across pods
19m         Normal   NodeHasSufficientMemory     node/ip-10-0-193-82.ec2.internal   Node ip-10-0-193-82.ec2.internal status is now: NodeHasSufficientMemory
19m         Normal   NodeHasNoDiskPressure       node/ip-10-0-193-82.ec2.internal   Node ip-10-0-193-82.ec2.internal status is now: NodeHasNoDiskPressure
19m         Normal   NodeHasSufficientPID        node/ip-10-0-193-82.ec2.internal   Node ip-10-0-193-82.ec2.internal status is now: NodeHasSufficientPID
19m         Normal   NodeReady                   node/ip-10-0-193-82.ec2.internal   Node ip-10-0-193-82.ec2.internal status is now: NodeReady
19m         Normal   NodeNotSchedulable          node/ip-10-0-193-82.ec2.internal   Node ip-10-0-193-82.ec2.internal status is now: NodeNotSchedulable
19m         Normal   Starting                    node/ip-10-0-193-82.ec2.internal   openshift-sdn done initializing node networking.
19m         Normal   NodeDone                    node/ip-10-0-193-82.ec2.internal   Setting node ip-10-0-193-82.ec2.internal, currentConfig rendered-master-5beee16903bea1f4444aaa5362b5cca8 to Done
19m         Normal   NodeSchedulable             node/ip-10-0-193-82.ec2.internal   Node ip-10-0-193-82.ec2.internal status is now: NodeSchedulable
19m         Normal   Uncordon                    node/ip-10-0-193-82.ec2.internal   Update completed for config rendered-master-5beee16903bea1f4444aaa5362b5cca8 and node has been uncordoned
19m         Normal   ConfigDriftMonitorStarted   node/ip-10-0-193-82.ec2.internal   Config Drift Monitor started, watching against rendered-master-5beee16903bea1f4444aaa5362b5cca8
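
To narrow this listing to just the drain step, the event reason can be added to the field selector (events support field selectors on reason and type); this should return only the Drain entry highlighted above:

$ oc get events -n default --field-selector involvedObject.name=ip-10-0-193-82.ec2.internal,reason=Drain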

4. Check the pods running on the above node once the update has completed on the master pool; no MCC pod is found

$ oc get mcp/master
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-5beee16903bea1f4444aaa5362b5cca8   True      False      False      3              3                   3                     0                      40m
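
Rather than polling oc get mcp, the rollout can also be waited on directly; Updated is the status condition the pool reports once all machines are on the new config:

$ oc wait mcp/master --for=condition=Updated --timeout=30m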

$ oc get pod -n openshift-machine-config-operator --field-selector spec.nodeName=ip-10-0-193-82.ec2.internal
NAME                                       READY   STATUS    RESTARTS   AGE
machine-config-daemon-7nv52                2/2     Running   2          41m
machine-config-operator-7b567bfc64-lbc6x   1/1     Running   0          12m
machine-config-server-bl2sb                1/1     Running   1          39m

5. The MCC pod is rescheduled onto another node

$ oc get pod -n openshift-machine-config-operator |grep machine-config-controller
machine-config-controller-7fbd48c6fc-bmwf2   2/2     Running   0          9m49s

$ oc get pod machine-config-controller-7fbd48c6fc-bmwf2 -n openshift-machine-config-operator -o yaml | yq -y '.spec.nodeName'
>> ip-10-0-144-46.ec2.internal
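
As a one-liner, -o wide prints the NODE column, so the reschedule can be confirmed without inspecting the pod YAML:

$ oc get pod -n openshift-machine-config-operator -o wide | grep machine-config-controller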

Comment 6 errata-xmlrpc 2023-01-17 19:51:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

