Description of problem:

In 4.11, when a new machine configuration is provided, the nodes should be updated in this order:

1. Update nodes in topology.kubernetes.io/zone order, and within a zone by node age (oldest first).
2. If zones are not present (for example, bare metal deployments), update nodes by age, oldest first.

The problem is that when there are several nodes in the same zone, those nodes are updated in a random order (usually newest first, but not always). They should be updated oldest first.

Version-Release number of MCO (Machine Config Operator) (if applicable):

$ oc get co machine-config
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.11.0-0.nightly-2022-04-18-091618   True        False         False      80m

Platform (AWS, VSphere, Metal, etc.):

All platforms where the topology.kubernetes.io/zone label is defined.

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)? (Y/N/Not sure): Yes

How reproducible: Always

Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job:
2. Profile:

Steps to Reproduce:

1. Scale one machineset to 3 replicas (for example), so that there are 3 nodes in the same zone.

$ oc get machineset -n openshift-machine-api
NAME                    DESIRED   CURRENT   READY   AVAILABLE   AGE
test22-b6886-worker-a   1         1         1       1           89m
test22-b6886-worker-b   1         1         1       1           89m
test22-b6886-worker-c   1         1         1       1           89m
test22-b6886-worker-f   0         0                             89m

$ oc scale machineset -n openshift-machine-api test22-b6886-worker-a --replicas=3
machineset.machine.openshift.io/test22-b6886-worker-a scaled

2. Create a MachineConfig to force an update of the nodes' configuration.

$ cat << EOF | oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: mco-test-file
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:,MCO%20test%20file%20order%0A
        path: /etc/mco-test-file-order
EOF

3. Check the order used to update the nodes.

$ watch "oc get node -l node-role.kubernetes.io/worker --sort-by '.metadata.labels.topology\.kubernetes\.io/zone' -ocustom-columns='ZONE:.metadata.labels.topology\.kubernetes\.io/zone,TIMES:.metadata.creationTimestamp,NAME:.metadata.name,MCO-STATE:.metadata.annotations.machineconfiguration\.openshift\.io/state'"

ZONE            TIMES                  NAME                                                  MCO-STATE
us-central1-a   2022-04-19T08:56:44Z   test22-b6886-worker-a-5rndd.c.openshift-qe.internal   Done
us-central1-a   2022-04-19T08:56:52Z   test22-b6886-worker-a-lzglj.c.openshift-qe.internal   Working
us-central1-a   2022-04-19T07:35:56Z   test22-b6886-worker-a-sgvs8.c.openshift-qe.internal   Done
us-central1-b   2022-04-19T07:36:00Z   test22-b6886-worker-b-g7zfv.c.openshift-qe.internal   Done
us-central1-c   2022-04-19T07:35:53Z   test22-b6886-worker-c-jtd7b.c.openshift-qe.internal   Done

Actual results:

The order in which nodes in the same zone are updated is not always "oldest first".

Expected results:

Nodes in the same zone should be updated by node age (oldest first).

Additional info:
Verified using build:

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-05-04-060900   True        False         141m    Cluster version is 4.11.0-0.nightly-2022-05-04-060900

Nodes in the same zone are updated "oldest first". We move the issue to VERIFIED status.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069