Bug 2103786
Summary: | MCP upgrades can stall waiting for master node reboots since MCC no longer gets drained | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Yu Qi Zhang <jerzhang>
Component: | Machine Config Operator | Assignee: | Yu Qi Zhang <jerzhang>
Machine Config Operator sub component: | Machine Config Operator | QA Contact: | Sergio <sregidor>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | high | |
Priority: | high | CC: | mkrejci, rioliu
Version: | 4.11 | |
Target Milestone: | --- | |
Target Release: | 4.12.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2023-01-17 19:51:26 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 2104687 | |
Description
Yu Qi Zhang
2022-07-04 21:59:02 UTC
verified on 4.12.0-0.nightly-2022-07-07-092951

1. Create a MachineConfig to trigger an update on the master nodes:

```
$ oc create -f change-masters-chrony-configuration.yaml
machineconfig.machineconfiguration.openshift.io/change-masters-chrony-configuration created

$ cat change-masters-chrony-configuration.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: change-masters-chrony-configuration
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 3.2.0
    networkd: {}
    passwd: {}
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,cG9vbCAwLnJoZWwucG9vbC5udHAub3JnIGlidXJzdApkcmlmdGZpbGUgL3Zhci9saWIvY2hyb255L2RyaWZ0Cm1ha2VzdGVwIDEuMCAzCnJ0Y3N5bmMKbG9nZGlyIC92YXIvbG9nL2Nocm9ueQo=
        mode: 420
        overwrite: true
        path: /etc/chrony.conf
  osImageURL: ""
```

2. Check which node the MCC pod is running on:

```
$ oc get pod -n openshift-machine-config-operator
NAME                                         READY   STATUS    RESTARTS   AGE
machine-config-controller-7fbd48c6fc-lwttm   2/2     Running   0          21m
...

$ oc get pod/machine-config-controller-7fbd48c6fc-lwttm -n openshift-machine-config-operator -o yaml | yq -y '.spec.nodeName'
>> ip-10-0-193-82.ec2.internal
```

3. Make sure a node drain happened on this node:

```
$ oc get events -n default --field-selector involvedObject.name=ip-10-0-193-82.ec2.internal,type!=Warning
LAST SEEN   TYPE     REASON                      OBJECT                             MESSAGE
...
23m         Normal   Cordon                      node/ip-10-0-193-82.ec2.internal   Cordoned node to apply update
>> 23m      Normal   Drain                       node/ip-10-0-193-82.ec2.internal   Draining node to update config.
23m         Normal   NodeNotSchedulable          node/ip-10-0-193-82.ec2.internal   Node ip-10-0-193-82.ec2.internal status is now: NodeNotSchedulable
22m         Normal   OSUpdateStarted             node/ip-10-0-193-82.ec2.internal
22m         Normal   OSUpgradeSkipped            node/ip-10-0-193-82.ec2.internal   OS upgrade skipped; new MachineConfig (rendered-master-5beee16903bea1f4444aaa5362b5cca8) has same OS image (quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bbeb2b82e57a51be17177daaffb33b7d557ea7595db2c52c83e8806cc0104100) as old MachineConfig (rendered-master-1341b162472e3d65bde2fa96a3a7e1a8)
22m         Normal   OSUpdateStaged              node/ip-10-0-193-82.ec2.internal   Changes to OS staged
22m         Normal   PendingConfig               node/ip-10-0-193-82.ec2.internal   Written pending config rendered-master-5beee16903bea1f4444aaa5362b5cca8
>> 22m      Normal   Reboot                      node/ip-10-0-193-82.ec2.internal   Node will reboot into config rendered-master-5beee16903bea1f4444aaa5362b5cca8
21m         Normal   NodeNotReady                node/ip-10-0-193-82.ec2.internal   Node ip-10-0-193-82.ec2.internal status is now: NodeNotReady
19m         Normal   Starting                    node/ip-10-0-193-82.ec2.internal   Starting kubelet.
19m         Normal   NodeAllocatableEnforced     node/ip-10-0-193-82.ec2.internal   Updated Node Allocatable limit across pods
19m         Normal   NodeHasSufficientMemory     node/ip-10-0-193-82.ec2.internal   Node ip-10-0-193-82.ec2.internal status is now: NodeHasSufficientMemory
19m         Normal   NodeHasNoDiskPressure       node/ip-10-0-193-82.ec2.internal   Node ip-10-0-193-82.ec2.internal status is now: NodeHasNoDiskPressure
19m         Normal   NodeHasSufficientPID        node/ip-10-0-193-82.ec2.internal   Node ip-10-0-193-82.ec2.internal status is now: NodeHasSufficientPID
19m         Normal   NodeReady                   node/ip-10-0-193-82.ec2.internal   Node ip-10-0-193-82.ec2.internal status is now: NodeReady
19m         Normal   NodeNotSchedulable          node/ip-10-0-193-82.ec2.internal   Node ip-10-0-193-82.ec2.internal status is now: NodeNotSchedulable
19m         Normal   Starting                    node/ip-10-0-193-82.ec2.internal   openshift-sdn done initializing node networking.
19m         Normal   NodeDone                    node/ip-10-0-193-82.ec2.internal   Setting node ip-10-0-193-82.ec2.internal, currentConfig rendered-master-5beee16903bea1f4444aaa5362b5cca8 to Done
19m         Normal   NodeSchedulable             node/ip-10-0-193-82.ec2.internal   Node ip-10-0-193-82.ec2.internal status is now: NodeSchedulable
19m         Normal   Uncordon                    node/ip-10-0-193-82.ec2.internal   Update completed for config rendered-master-5beee16903bea1f4444aaa5362b5cca8 and node has been uncordoned
19m         Normal   ConfigDriftMonitorStarted   node/ip-10-0-193-82.ec2.internal   Config Drift Monitor started, watching against rendered-master-5beee16903bea1f4444aaa5362b5c
```

4. When the update has completed on the master pool, check the pods running on the above node; no MCC pod is found there:

```
$ oc get mcp/master
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-5beee16903bea1f4444aaa5362b5cca8   True      False      False      3              3                   3                      0                      40m

$ oc get pod -n openshift-machine-config-operator --field-selector spec.nodeName=ip-10-0-193-82.ec2.internal
NAME                                       READY   STATUS    RESTARTS   AGE
machine-config-daemon-7nv52                2/2     Running   2          41m
machine-config-operator-7b567bfc64-lbc6x   1/1     Running   0          12m
machine-config-server-bl2sb                1/1     Running   1          39m
```

5. The MCC pod has been rescheduled on another node:

```
$ oc get pod -n openshift-machine-config-operator | grep machine-config-controller
machine-config-controller-7fbd48c6fc-bmwf2   2/2     Running   0          9m49s

$ oc get pod machine-config-controller-7fbd48c6fc-bmwf2 -n openshift-machine-config-operator -o yaml | yq -y '.spec.nodeName'
>> ip-10-0-144-46.ec2.internal
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399
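A side note on the MachineConfig used in step 1: the `contents.source` field carries the new `/etc/chrony.conf` as a base64 data URL. It can be decoded locally to confirm exactly what the MCD will write to disk:

```shell
# Decode the data-URL payload from the chrony MachineConfig above to see
# the chrony.conf contents it writes to /etc/chrony.conf.
echo 'cG9vbCAwLnJoZWwucG9vbC5udHAub3JnIGlidXJzdApkcmlmdGZpbGUgL3Zhci9saWIvY2hyb255L2RyaWZ0Cm1ha2VzdGVwIDEuMCAzCnJ0Y3N5bmMKbG9nZGlyIC92YXIvbG9nL2Nocm9ueQo=' | base64 -d
```

This prints a minimal chrony configuration: `pool 0.rhel.pool.ntp.org iburst`, `driftfile /var/lib/chrony/drift`, `makestep 1.0 3`, `rtcsync`, `logdir /var/log/chrony`.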
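Similarly, the `mode: 420` in that file entry can look odd at first glance: Ignition expresses file modes as decimal integers, and 420 decimal is 0644 octal (rw-r--r--), the usual permissions for a config file. A quick conversion shows this:

```shell
# Ignition file modes are decimal; convert 420 to octal to recover the
# familiar Unix permission value.
printf '%o\n' 420    # prints 644, i.e. mode 0644 (rw-r--r--)
```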