Bug 2076521

Summary: Nodes in the same zone are not updated in the right order
Product: OpenShift Container Platform
Reporter: Sergio <sregidor>
Component: Machine Config Operator
Sub Component: Machine Config Operator
Assignee: Kirsten Garrison <kgarriso>
QA Contact: Sergio <sregidor>
Status: CLOSED ERRATA
Docs Contact:
Severity: low
Priority: high
CC: aos-bugs, kgarriso, mkrejci, rioliu
Version: 4.11
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2022-08-10 11:07:40 UTC
Type: Bug

Description Sergio 2022-04-19 09:18:55 UTC
Description of problem:
In 4.11, when a new machine configuration is applied, the nodes should be updated in this order:

1. Upgrade nodes in topology.kubernetes.io/zone order, and then by node age (oldest first) within each zone.
2. If zones are not present (for example, in baremetal deployments), upgrade nodes by age, oldest first.


The problem is that when there are several nodes in the same zone, those nodes are currently updated in no consistent order (usually newest first, but not always). They should be updated oldest first.


Version-Release number of MCO (Machine Config Operator) (if applicable):

$ oc get co machine-config 
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.11.0-0.nightly-2022-04-18-091618   True        False         False      80m     


Platform (AWS, VSphere, Metal, etc.):
All platforms where topology.kubernetes.io/zone is defined.

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)?
(Y/N/Not sure):
Yes

How reproducible:
Always


Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job:

2. Profile:

Steps to Reproduce:
1. Scale one machineset to 3 replicas (for example) so that there are 3 nodes in the same zone.
$ oc get machineset -n openshift-machine-api 
NAME                    DESIRED   CURRENT   READY   AVAILABLE   AGE
test22-b6886-worker-a   1         1         1       1           89m
test22-b6886-worker-b   1         1         1       1           89m
test22-b6886-worker-c   1         1         1       1           89m
test22-b6886-worker-f   0         0                             89m

$ oc scale machineset -n openshift-machine-api test22-b6886-worker-a --replicas=3
machineset.machine.openshift.io/test22-b6886-worker-a scaled

2. Create a MachineConfig to force an update of the nodes' configuration

$ cat << EOF | oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: mco-test-file
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:,MCO%20test%20file%20order%0A
        path: /etc/mco-test-file-order
EOF

3. Check the order used to update the nodes

$ watch "oc get node -l node-role.kubernetes.io/worker --sort-by '.metadata.labels.topology\.kubernetes\.io/zone' -ocustom-columns='ZONE:.metadata.labels.topology\.kubernetes\.io/zone,TIMES:.metadata.creationTimestamp,NAME:.metadata.name,MCO-STATE:.metadata.annotations.machineconfiguration\.openshift\.io/state'"

ZONE            TIMES                  NAME                                                  MCO-STATE
us-central1-a   2022-04-19T08:56:44Z   test22-b6886-worker-a-5rndd.c.openshift-qe.internal   Done
us-central1-a   2022-04-19T08:56:52Z   test22-b6886-worker-a-lzglj.c.openshift-qe.internal   Working
us-central1-a   2022-04-19T07:35:56Z   test22-b6886-worker-a-sgvs8.c.openshift-qe.internal   Done
us-central1-b   2022-04-19T07:36:00Z   test22-b6886-worker-b-g7zfv.c.openshift-qe.internal   Done
us-central1-c   2022-04-19T07:35:53Z   test22-b6886-worker-c-jtd7b.c.openshift-qe.internal   Done


Actual results:
The order in which the nodes in the same zone are updated is not always "oldest first".


Expected results:
Nodes in the same zone should be updated by node age (oldest first).
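
For illustration, the expected order can be sketched as a two-key sort over the node listing from step 3: zone first, then creation timestamp. This is only a sketch and not the MCO's implementation. The expected_order helper is a hypothetical name, and sorting the zone names lexically is an assumption about what "zone order" means.

```shell
#!/bin/sh
# Sample input: zone, creation timestamp, and node name, copied from the
# node listing in step 3 of this report.
nodes='us-central1-a 2022-04-19T08:56:44Z test22-b6886-worker-a-5rndd.c.openshift-qe.internal
us-central1-a 2022-04-19T08:56:52Z test22-b6886-worker-a-lzglj.c.openshift-qe.internal
us-central1-a 2022-04-19T07:35:56Z test22-b6886-worker-a-sgvs8.c.openshift-qe.internal
us-central1-b 2022-04-19T07:36:00Z test22-b6886-worker-b-g7zfv.c.openshift-qe.internal
us-central1-c 2022-04-19T07:35:53Z test22-b6886-worker-c-jtd7b.c.openshift-qe.internal'

# expected_order is a hypothetical helper, not part of oc or the MCO.
# It sorts by zone (field 1, lexical order is an assumption), then by
# creation timestamp (field 2). UTC timestamps in this fixed format sort
# chronologically when compared as plain strings, so oldest comes first.
expected_order() {
  LC_ALL=C sort -t ' ' -k1,1 -k2,2
}

# Keep only the node names, in the order the nodes would be updated.
expected=$(printf '%s\n' "$nodes" | expected_order | awk '{print $3}')
printf '%s\n' "$expected"
```

With this input, the three us-central1-a nodes come out as sgvs8, 5rndd, lzglj (oldest first), followed by the single nodes in us-central1-b and us-central1-c.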


Additional info:

Comment 2 Sergio 2022-05-04 12:23:57 UTC
Verified using build:
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-05-04-060900   True        False         141m    Cluster version is 4.11.0-0.nightly-2022-05-04-060900


Nodes in the same zone are updated "oldest first".


Moving the issue to VERIFIED status.

Comment 4 errata-xmlrpc 2022-08-10 11:07:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069