Bug 2076521 - Nodes in the same zone are not updated in the right order
Summary: Nodes in the same zone are not updated in the right order
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: low
Target Milestone: ---
Target Release: 4.11.0
Assignee: Kirsten Garrison
QA Contact: Sergio
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-04-19 09:18 UTC by Sergio
Modified: 2022-08-10 11:08 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:07:40 UTC
Target Upstream Version:
Embargoed:


Links
Github openshift machine-config-operator pull 3125 (open): Bug 2076521: mcc: update node sort to account for matching zones. Last updated: 2022-04-28 21:11:38 UTC
Red Hat Product Errata RHSA-2022:5069. Last updated: 2022-08-10 11:08:02 UTC

Description Sergio 2022-04-19 09:18:55 UTC
Description of problem:
In 4.11, when a new machine configuration is provided, the nodes should be updated in this order:

1. Upgrade nodes in topology.kubernetes.io/zone order, and within a zone by node age (oldest first).
2. If zones are not present (for example, bare-metal deployments), upgrade nodes by age, oldest first.


The problem is that when there are several nodes in the same zone, those nodes are updated in an arbitrary order (usually newest first, but not always). They should be updated oldest first.
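A minimal sketch of the intended ordering, assuming the comparison is the zone label first and the node creation timestamp (oldest first) within a zone. The package and function names here are illustrative only, not the MCO's actual code; the real fix is in the pull request linked above.

package nodesort

import (
    "sort"

    corev1 "k8s.io/api/core/v1"
)

const zoneLabel = "topology.kubernetes.io/zone"

// sortNodesForUpdate orders nodes by the topology.kubernetes.io/zone label
// first, then by creation timestamp (oldest first) within the same zone.
// Nodes without the label compare with an empty zone, so they also fall
// back to age ordering among themselves.
func sortNodesForUpdate(nodes []corev1.Node) {
    sort.SliceStable(nodes, func(i, j int) bool {
        zi, zj := nodes[i].Labels[zoneLabel], nodes[j].Labels[zoneLabel]
        if zi != zj {
            return zi < zj
        }
        return nodes[i].CreationTimestamp.Before(&nodes[j].CreationTimestamp)
    })
}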


Version-Release number of MCO (Machine Config Operator) (if applicable):

$ oc get co machine-config 
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.11.0-0.nightly-2022-04-18-091618   True        False         False      80m     


Platform (AWS, VSphere, Metal, etc.):
All platforms where topology.kubernetes.io/zone is defined.

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)?
(Y/N/Not sure):
Yes

How reproducible:
Always


Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job:

2. Profile:

Steps to Reproduce:
1. Scale one machineset to 3 replicas (for example), so that there are at least 3 nodes in the same zone.
$ oc get machineset -n openshift-machine-api 
NAME                             DESIRED   CURRENT   READY   AVAILABLE   AGE
test22-b6886-worker-a   1         1         1       1           89m
test22-b6886-worker-b   1         1         1       1           89m
test22-b6886-worker-c   1         1         1       1           89m
test22-b6886-worker-f   0         0                             89m

$ oc scale machineset -n openshift-machine-api test22-b6886-worker-a --replicas=3
machineset.machine.openshift.io/test22-b6886-worker-a scaled

2. Create a machineconfig to force an update in the nodes' configuration

$ cat << EOF | oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: mco-test-file
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:,MCO%20test%20file%20order%0A
        path: /etc/mco-test-file-order
EOF
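
The source field above is a percent-encoded data URL; decoded, it writes the line "MCO test file order" to /etc/mco-test-file-order. A small standalone Go sketch (illustrative only, not part of the reproduction steps) to confirm the decoding:

package main

import (
    "fmt"
    "net/url"
    "strings"
)

func main() {
    // The Ignition file source from the MachineConfig above.
    src := "data:,MCO%20test%20file%20order%0A"
    decoded, err := url.PathUnescape(strings.TrimPrefix(src, "data:,"))
    if err != nil {
        panic(err)
    }
    fmt.Print(decoded) // prints: MCO test file order (with a trailing newline)
}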

3. Check the order used to update the nodes

$ watch "oc get node -l node-role.kubernetes.io/worker --sort-by '.metadata.labels.topology\.kubernetes\.io/zone' -ocustom-columns='ZONE:.metadata.labels.topology\.kubernetes\.io/zone,TIMES:.metadata.creationTimestamp,NAME:.metadata.name,MCO-STATE:.metadata.annotations.machineconfiguration\.openshift\.io/state'"

ZONE            TIMES                  NAME                                                  MCO-STATE
us-central1-a   2022-04-19T08:56:44Z   test22-b6886-worker-a-5rndd.c.openshift-qe.internal   Done
us-central1-a   2022-04-19T08:56:52Z   test22-b6886-worker-a-lzglj.c.openshift-qe.internal   Working
us-central1-a   2022-04-19T07:35:56Z   test22-b6886-worker-a-sgvs8.c.openshift-qe.internal   Done
us-central1-b   2022-04-19T07:36:00Z   test22-b6886-worker-b-g7zfv.c.openshift-qe.internal   Done
us-central1-c   2022-04-19T07:35:53Z   test22-b6886-worker-c-jtd7b.c.openshift-qe.internal   Done


Actual results:
The nodes in the same zone are not always updated "oldest first"; the update order is effectively arbitrary (usually newest first, but not always).


Expected results:
Nodes in the same zone should be updated by node age (oldest first).


Additional info:

Comment 2 Sergio 2022-05-04 12:23:57 UTC
Verified using build:
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-05-04-060900   True        False         141m    Cluster version is 4.11.0-0.nightly-2022-05-04-060900


Nodes in the same zone are updated "oldest first".


We move the issue to VERIFIED status.

Comment 4 errata-xmlrpc 2022-08-10 11:07:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

