2083482 – Avoid update races between old and new NTO operands during cluster upgrades

Summary: Avoid update races between old and new NTO operands during cluster upgrades

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node Tuning Operator
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.9.z
Assignee:	Jiří Mencák
QA Contact:	liqcui
Docs Contact:
URL:
Whiteboard:
Depends On:	2080123
Blocks:	2085245

Reported:	2022-05-10 08:21 UTC by Jiří Mencák
Modified:	2022-05-25 04:30 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	2080123
Environment:
Last Closed:	2022-05-25 04:30:33 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-node-tuning-operator pull 361	0	None	open	Bug 2083482: Ignore Profile updates triggered by old operands	2022-05-10 09:42:20 UTC
Red Hat Product Errata	RHSA-2022:2283	0	None	None	None	2022-05-25 04:30:40 UTC

Comment 2 liqcui 2022-05-16 06:16:12 UTC

Verified Results:

[mirroradmin@ec2-18-217-45-133 ~]$  oc create -f- <<'EOF'
> apiVersion: machineconfiguration.openshift.io/v1
> kind: MachineConfigPool
> metadata:
>   name: worker-rt
>   labels:
>     worker-rt: ""
> spec:
>   machineConfigSelector:
>     matchExpressions:
>       - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-rt]}
>   nodeSelector:
>     matchLabels:
>       node-role.kubernetes.io/worker-rt: ""
> EOF
machineconfigpool.machineconfiguration.openshift.io/worker-rt created
[mirroradmin@ec2-18-217-45-133 ~]$ 
[mirroradmin@ec2-18-217-45-133 ~]$ oc create -f- <<'EOF'
> apiVersion: tuned.openshift.io/v1
> kind: Tuned
> metadata:
>   name: openshift-tuned-fight
>   namespace: openshift-cluster-node-tuning-operator
> spec:
>   profile:
>   - data: |
>       [main]
>       summary=Custom OpenShift profile
>       [bootloader]
>       cmdline=+trigger_tuned_fight=${f:exec:/usr/bin/bash:-c:echo $RELEASE_VERSION}
>     name: openshift-tuned-fight
> 
>   recommend:
>   - machineConfigLabels:
>       machineconfiguration.openshift.io/role: "worker-rt"
>     priority: 20
>     profile: openshift-tuned-fight
> EOF
tuned.tuned.openshift.io/openshift-tuned-fight created
[mirroradmin@ec2-18-217-45-133 ~]$ oc scale machineset liqcui-oc49ngt-vtctw-worker-us-east-2b --replicas=1 -n openshift-machine-api
machineset.machine.openshift.io/liqcui-oc49ngt-vtctw-worker-us-east-2b scaled
[mirroradmin@ec2-18-217-45-133 ~]$  for n in $(oc get no --selector=node-role.kubernetes.io/worker -o name) ; do oc label $n node-role.kubernetes.io/worker-rt= ; done
node/ip-10-0-146-199.us-east-2.compute.internal labeled
node/ip-10-0-150-195.us-east-2.compute.internal labeled
node/ip-10-0-166-87.us-east-2.compute.internal labeled
[mirroradmin@ec2-18-217-45-133 ~]$ oc get mcp
NAME        CONFIG                                                UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master      rendered-master-a6300299af3f648aaebf6c3a1445bb51      True      False      False      1              1                   1                     0                      33m
worker      rendered-worker-68d793f9cdc5eefa5ccdc805d80050b8      True      False      False      0              0                   0                     0                      33m
worker-rt   rendered-worker-rt-68d793f9cdc5eefa5ccdc805d80050b8   False     True       False      3              0                   0                     0                      5m57s
[mirroradmin@ec2-18-217-45-133 ~]$ oc get mcp
NAME        CONFIG                                                UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master      rendered-master-a6300299af3f648aaebf6c3a1445bb51      True      False      False      1              1                   1                     0                      38m
worker      rendered-worker-68d793f9cdc5eefa5ccdc805d80050b8      True      False      False      0              0                   0                     0                      38m
worker-rt   rendered-worker-rt-14c22b3cc8e80c1c61724cdb7c7a3f00   True      False      False      3              3                   3                     0                      10m
[mirroradmin@ec2-18-217-45-133 ~]$ oc edit deploy/cluster-node-tuning-operator 
Error from server (NotFound): deployments.apps "cluster-node-tuning-operator" not found
[mirroradmin@ec2-18-217-45-133 ~]$ oc project openshift-cluster-node-tuning-operator
Now using project "openshift-cluster-node-tuning-operator" on server "https://api.liqcui-oc49ngt.qe.devcluster.openshift.com:6443".
[mirroradmin@ec2-18-217-45-133 ~]$ oc edit deploy/cluster-node-tuning-operator 
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2022-05-16T05:24:12Z"
  generation: 1
  name: cluster-node-tuning-operator
  namespace: openshift-cluster-node-tuning-operator
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: 81215011-9a7e-4170-9296-ceb8cf201bac
  resourceVersion: "7135"
  uid: 1a76653a-2494-47c9-91d6-4f7f45fdd694
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: cluster-node-tuning-operator
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
      creationTimestamp: null
      labels:
        name: cluster-node-tuning-operator
    spec:
      containers:
      - args:
        - -v=0
        command:
        - cluster-node-tuning-operator
        env:
Edit cancelled, no changes made.
[mirroradmin@ec2-18-217-45-133 ~]$ oc scale deploy/cluster-version-operator --replicas=0 -n openshift-cluster-version
deployment.apps/cluster-version-operator scaled
[mirroradmin@ec2-18-217-45-133 ~]$ oc edit deploy/cluster-node-tuning-operator 
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: cluster-node-tuning-operator
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
      creationTimestamp: null
      labels:
        name: cluster-node-tuning-operator
    spec:
      containers:
      - args:
        - -v=0
        command:
        - cluster-node-tuning-operator
        env:
        - name: RELEASE_VERSION
          value: bz-fix-4.9.0-0.nightly-2022-05-14-014831
        - name: WATCH_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: RESYNC_PERIOD
          value: "600"
        - name: CLUSTER_NODE_TUNED_IMAGE
          value: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9c84cd1d454191573c3c55e52a3af494d54e5b0ca3e3ceda53617014ce3be269
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9c84cd1d454191573c3c55e52a3af494d54e5b0ca3e3ceda53617014ce3be269
        imagePullPolicy: IfNotPresent
        name: cluster-node-tuning-operator
        ports:
        - containerPort: 60000
          name: metrics
"/tmp/oc-edit-jnimz.yaml" 139L, 4446C written
deployment.apps/cluster-node-tuning-operator edited
[mirroradmin@ec2-18-217-45-133 ~]$ oc get pods
NAME                                            READY   STATUS        RESTARTS   AGE
cluster-node-tuning-operator-6784455854-p5cbh   1/1     Running       0          3s
cluster-node-tuning-operator-77fbb68bdf-d6s8v   1/1     Terminating   0          41m
tuned-5bc4p                                     1/1     Running       0          39m
tuned-dj72h                                     1/1     Running       1          12m
tuned-jhgh7                                     1/1     Running       1          8m7s
tuned-n7twj                                     1/1     Running       1          32m
[mirroradmin@ec2-18-217-45-133 ~]$ oc logs -f cluster-node-tuning-operator-6784455854-p5cbh
I0516 06:06:43.448441       1 main.go:25] Go Version: go1.16.12
I0516 06:06:43.448688       1 main.go:26] Go OS/Arch: linux/amd64
I0516 06:06:43.448718       1 main.go:27] node-tuning Version: v4.9.0-202205130537.p0.gcadc2f1.assembly.stream-0-g0f57c3a-dirty
I0516 06:06:43.455975       1 controller.go:1055] trying to become a leader
I0516 06:06:43.456589       1 server.go:51] starting metrics server
I0516 06:06:43.463223       1 leaderelection.go:248] attempting to acquire leader lease openshift-cluster-node-tuning-operator/node-tuning-operator-lock...
I0516 06:06:43.464902       1 controller.go:1128] current leader: cluster-node-tuning-operator-77fbb68bdf-d6s8v_fc48642a-3e86-4920-a662-f9857fd5fbc4
I0516 06:07:28.341411       1 leaderelection.go:258] successfully acquired lease openshift-cluster-node-tuning-operator/node-tuning-operator-lock
I0516 06:07:28.341602       1 controller.go:1115] became leader: cluster-node-tuning-operator-6784455854-p5cbh_749147df-e5ea-4ddd-9660-ae2add6b0a59
I0516 06:07:28.341617       1 controller.go:981] starting Tuned controller
I0516 06:07:28.442863       1 controller.go:1033] started events processor/controller
E0516 06:07:28.469442       1 controller.go:181] unable to sync(clusteroperator//node-tuning) requeued (0): failed to sync DaemonSet: failed to update DaemonSet: Operation cannot be fulfilled on daemonsets.apps "tuned": the object has been modified; please apply your changes to the latest version and try again
I0516 06:07:28.504961       1 controller.go:642] refusing to sync MachineConfig "50-nto-worker-rt" due to Profile "ip-10-0-150-195.us-east-2.compute.internal" change generated by operand version "4.9.0-0.nightly-2022-05-14-014831"
E0516 06:07:28.519995       1 status.go:56] unable to update ClusterOperator: Operation cannot be fulfilled on clusteroperators.config.openshift.io "node-tuning": the object has been modified; please apply your changes to the latest version and try again
E0516 06:07:28.520022       1 controller.go:181] unable to sync(profile/openshift-cluster-node-tuning-operator/ip-10-0-146-199.us-east-2.compute.internal) requeued (4): failed to sync Profile ip-10-0-146-199.us-east-2.compute.internal: failed to sync OperatorStatus: Operation cannot be fulfilled on clusteroperators.config.openshift.io "node-tuning": the object has been modified; please apply your changes to the latest version and try again
I0516 06:07:28.523375       1 controller.go:642] refusing to sync MachineConfig "50-nto-worker-rt" due to Profile "ip-10-0-166-87.us-east-2.compute.internal" change generated by operand version "4.9.0-0.nightly-2022-05-14-014831"
I0516 06:07:28.535319       1 controller.go:642] refusing to sync MachineConfig "50-nto-worker-rt" due to Profile "ip-10-0-150-195.us-east-2.compute.internal" change generated by operand version "4.9.0-0.nightly-2022-05-14-014831"
I0516 06:07:28.602487       1 controller.go:642] refusing to sync MachineConfig "50-nto-worker-rt" due to Profile "ip-10-0-146-199.us-east-2.compute.internal" change generated by operand version "4.9.0-0.nightly-2022-05-14-014831"
I0516 06:07:30.335325       1 controller.go:719] updated MachineConfig 50-nto-worker-rt with kernel parameters: [trigger_tuned_fight=bz-fix-4.9.0-0.nightly-2022-05-14-01483]
^C
[mirroradmin@ec2-18-217-45-133 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2022-05-14-014831   True        False         30m     Cluster version is 4.9.0-0.nightly-2022-05-14-014831
[mirroradmin@ec2-18-217-45-133 ~]$ oc get pods
NAME                                            READY   STATUS    RESTARTS   AGE
cluster-node-tuning-operator-6784455854-p5cbh   1/1     Running   0          6m16s
tuned-grjp7                                     1/1     Running   0          5m27s
tuned-mv74w                                     1/1     Running   1          5m23s
tuned-qkvmm                                     1/1     Running   1          5m25s
tuned-rz4x5                                     1/1     Running   1          5m28s
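
The "refusing to sync MachineConfig" lines in the operator log above are the evidence that Profile changes from the old operand version were ignored. As a rough aid for repeating this check, the sketch below counts those lines per Profile and operand version. It assumes the message format is exactly as shown in the log above; the function name `summarize_refusals` is made up for illustration and is not part of `oc` or the operator.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize "refusing to sync MachineConfig" messages
# from Node Tuning Operator logs. Reads log text on stdin and prints one line
# per (Profile, operand version) pair: "<count> <profile> <operand version>".
# Assumes the message format matches the log lines shown in this comment.
summarize_refusals() {
  sed -n 's/.*refusing to sync MachineConfig "[^"]*" due to Profile "\([^"]*\)" change generated by operand version "\([^"]*\)".*/\1 \2/p' \
    | sort | uniq -c | awk '{print $1, $2, $3}'
}

# Example against a live cluster (replace <operator-pod> with the real pod name):
# oc logs -n openshift-cluster-node-tuning-operator <operator-pod> | summarize_refusals
```

A non-empty summary followed by a single "updated MachineConfig" line for the new version would be consistent with the behavior seen in this verification run.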

Comment 5 errata-xmlrpc 2022-05-25 04:30:33 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.35 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:2283
