Bug 2104687
| Summary: | MCP upgrades can stall waiting for master node reboots since MCC no longer gets drained | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | OpenShift BugZilla Robot <openshift-bugzilla-robot> |
| Component: | Machine Config Operator | Assignee: | Yu Qi Zhang <jerzhang> |
| Machine Config Operator sub component: | Machine Config Operator | QA Contact: | Sergio <sregidor> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | mkrejci, skumari, sregidor |
| Version: | 4.11 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.11.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-23 15:08:36 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2103786 | | |
| Bug Blocks: | | | |
Description
OpenShift BugZilla Robot
2022-07-06 21:35:48 UTC
We decided not to mark this bug as a 4.11 blocker because it doesn't block upgrades or new 4.11 cluster installs. With this bug, upgrades may just be slower on machines that take longer to reboot. The fix should land soon in 4.11.z. Verified using IPI on AWS, version:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-08-16-032235 True False 30m Cluster version is 4.11.0-0.nightly-2022-08-16-032235
Verification steps:
1. Check the node name of the MCC pod
$ oc get pod -n openshift-machine-config-operator
NAME READY STATUS RESTARTS AGE
machine-config-controller-bb5957b5d-h2g87 2/2 Running 0 21m
$ oc get pod/machine-config-controller-bb5957b5d-h2g87 -n openshift-machine-config-operator -o yaml | yq -y '.spec.nodeName'
ip-10-0-223-27.us-east-2.compute.internal
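The same field can be read without yq via a jsonpath query (an equivalent sketch using the pod name from above; it should print the same node name):
$ oc get pod machine-config-controller-bb5957b5d-h2g87 -n openshift-machine-config-operator -o jsonpath='{.spec.nodeName}'
ip-10-0-223-27.us-east-2.compute.internal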
2. Create a MachineConfig to trigger an update on the master nodes
$ cat << EOF | oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: change-masters-chrony-configuration
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 3.2.0
    networkd: {}
    passwd: {}
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,cG9vbCAwLnJoZWwucG9vbC5udHAub3JnIGlidXJzdApkcmlmdGZpbGUgL3Zhci9saWIvY2hyb255L2RyaWZ0Cm1ha2VzdGVwIDEuMCAzCnJ0Y3N5bmMKbG9nZGlyIC92YXIvbG9nL2Nocm9ueQo=
        mode: 420
        overwrite: true
        path: /etc/test.conf
  osImageURL: ""
EOF
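For reference, the base64 payload in the file contents decodes to a small chrony configuration (decoded locally; the node writes it to /etc/test.conf, which is enough to trigger the drain and reboot shown below):
$ echo 'cG9vbCAwLnJoZWwucG9vbC5udHAub3JnIGlidXJzdApkcmlmdGZpbGUgL3Zhci9saWIvY2hyb255L2RyaWZ0Cm1ha2VzdGVwIDEuMCAzCnJ0Y3N5bmMKbG9nZGlyIC92YXIvbG9nL2Nocm9ueQo=' | base64 -d
pool 0.rhel.pool.ntp.org iburst
driftfile /var/lib/chrony/drift
makestep 1.0 3
rtcsync
logdir /var/log/chrony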
3. Make sure a node drain happens on the machine-config-controller's node (ip-10-0-223-27.us-east-2.compute.internal)
$ oc get events -n default --sort-by metadata.creationTimestamp --field-selector involvedObject.name=ip-10-0-223-27.us-east-2.compute.internal
....
111s Normal Cordon node/ip-10-0-223-27.us-east-2.compute.internal Cordoned node to apply update
>> 111s Normal Drain node/ip-10-0-223-27.us-east-2.compute.internal Draining node to update config.
103s Normal NodeNotSchedulable node/ip-10-0-223-27.us-east-2.compute.internal Node ip-10-0-223-27.us-east-2.compute.internal status is now: NodeNotSchedulable
9s Normal OSUpdateStarted node/ip-10-0-223-27.us-east-2.compute.internal
9s Normal OSUpgradeSkipped node/ip-10-0-223-27.us-east-2.compute.internal OS upgrade skipped; new MachineConfig (rendered-master-beddfd51fed4832914e143b7a5c7c8db) has same OS image (quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6e7c8e9e407ebab51eac2482d13c07d071c0be1a5755a36a64f0be1b73b3999a) as old MachineConfig (rendered-master-ac378d4e7670937fac6263c69d7a85c6)
9s Normal OSUpdateStaged node/ip-10-0-223-27.us-east-2.compute.internal Changes to OS staged
9s Normal PendingConfig node/ip-10-0-223-27.us-east-2.compute.internal Written pending config rendered-master-beddfd51fed4832914e143b7a5c7c8db
>> 9s Normal Reboot node/ip-10-0-223-27.us-east-2.compute.internal Node will reboot into config rendered-master-beddfd51fed4832914e143b7a5c7c8db
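If the full event stream is noisy, the drain event can also be selected directly by reason (a sketch; event field selectors can be combined with commas):
$ oc get events -n default --field-selector reason=Drain,involvedObject.name=ip-10-0-223-27.us-east-2.compute.internal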
4. When the update completes on the master pool, check the pods running on the node above; no MCC pod should be found
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-beddfd51fed4832914e143b7a5c7c8db True False False 3 3 3 0 73m
worker rendered-worker-dd1de4a36326291d324ed78cb2ddf723 True False False 3 3 3 0 73m
The machine-config-controller pod is no longer running on ip-10-0-223-27.us-east-2.compute.internal:
$ oc get pod -n openshift-machine-config-operator --field-selector spec.nodeName=ip-10-0-223-27.us-east-2.compute.internal
NAME READY STATUS RESTARTS AGE
machine-config-daemon-5lqz4 2/2 Running 0 71m
machine-config-server-md8v6 1/1 Running 0 70m
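A quick negative check against the same listing (grep -c prints the number of matching lines, so 0 confirms no controller pod remains on that node):
$ oc get pod -n openshift-machine-config-operator --field-selector spec.nodeName=ip-10-0-223-27.us-east-2.compute.internal | grep -c machine-config-controller
0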
5. Verify that the MCC pod is rescheduled on another node
$ oc get pod -n openshift-machine-config-operator |grep machine-config-controller
machine-config-controller-bb5957b5d-trs8h 2/2 Running 0 4m39s
$ oc get pod machine-config-controller-bb5957b5d-trs8h -n openshift-machine-config-operator -o yaml | yq -y '.spec.nodeName'
ip-10-0-130-51.us-east-2.compute.internal
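Since the pod name changes on every reschedule, a label selector avoids hard-coding it (a sketch, assuming the controller pod carries the k8s-app=machine-config-controller label):
$ oc get pod -n openshift-machine-config-operator -l k8s-app=machine-config-controller -o jsonpath='{.items[0].spec.nodeName}'
ip-10-0-130-51.us-east-2.compute.internal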
We move the status to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.11.1 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:6103