Description of problem:

During cluster (NTO operator) upgrades, depending on the TuneD configuration and version, TuneD may calculate different kernel arguments. The old and the newly started NTO operands (the latter with a newer version of TuneD) may then fight to push their respective kernel parameters. The operator happily accepts updates from both and keeps flipping the MachineConfig between the old and the new version.

Version-Release number of selected component (if applicable):

OCP 4.5-4.11

How reproducible:

Only in cases where the newer TuneD version generates different kernel parameters for the same profile than the old version did.

Steps to Reproduce:

1. Create a cluster with at least 3 worker nodes to increase the chance of hitting this bug during an upgrade.

2. Move all 3 worker nodes into the same "worker-rt" machine config pool:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-rt
  labels:
    worker-rt: ""
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-rt]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-rt: ""

3. Create this Tuned profile:

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-tuned-fight
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift profile
      [bootloader]
      cmdline=+trigger_tuned_fight=${f:exec:/usr/bin/bash:-c:echo $RELEASE_VERSION}
    name: openshift-tuned-fight
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "worker-rt"
    priority: 20
    profile: openshift-tuned-fight

4. Upgrade NTO or the whole cluster and you will see something like this in the operator logs:

I0426 14:44:41.421647  1 controller.go:736] updated MachineConfig 50-nto-worker-rt with kernel parameters: [trigger_tuned_fight=0.0.1-snapshot-4.1]
I0426 14:44:46.508174  1 controller.go:736] updated MachineConfig 50-nto-worker-rt with kernel parameters: [trigger_tuned_fight=4.11.0-0.nightly-2022-04-26-08534]
I0426 14:44:46.518672  1 controller.go:736] updated MachineConfig 50-nto-worker-rt with kernel parameters: [trigger_tuned_fight=0.0.1-snapshot-4.1]

Actual results:

NTO takes into account updates from the old containerized TuneD operands.

Expected results:

NTO ignores updates from the old containerized TuneD operands.
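For illustration only: the expected behaviour (and the "refusing to sync" messages in the verification log below) amounts to a version gate applied before the operator recalculates the MachineConfig. The following Go sketch is an assumption of what such a guard could look like; the type, field, and function names are hypothetical and not taken from the operator source.

package main

import "fmt"

// profile mirrors, in simplified form, a per-node Tuned Profile status.
// The field names are illustrative assumptions, not the actual CRD fields.
type profile struct {
	nodeName       string
	operandVersion string // version of the containerized TuneD that produced this status
}

// syncMachineConfig is a hypothetical helper showing the guard: Profiles
// written by an operand whose version differs from the running operator's
// version are skipped, so an old operand can no longer flip the
// MachineConfig back during an upgrade.
func syncMachineConfig(operatorVersion, mcName string, profiles []profile) {
	for _, p := range profiles {
		if p.operandVersion != operatorVersion {
			fmt.Printf("refusing to sync MachineConfig %q due to Profile %q change generated by operand version %q\n",
				mcName, p.nodeName, p.operandVersion)
			continue
		}
		fmt.Printf("updated MachineConfig %q with kernel parameters from Profile %q\n", mcName, p.nodeName)
	}
}

func main() {
	// Example data only; the operand versions here are made up for illustration.
	profiles := []profile{
		{nodeName: "ip-10-0-156-138.us-east-2.compute.internal", operandVersion: "4.10.0"},
		{nodeName: "ip-10-0-175-18.us-east-2.compute.internal", operandVersion: "4.11.0"},
	}
	syncMachineConfig("4.11.0", "50-nto-worker-rt", profiles)
}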
Verified Result:

[ocpadmin@ec2-18-217-45-133 sro]$ oc logs -f cluster-node-tuning-operator-84578cc8f4-56dzg
I0506 12:26:18.050842  1 main.go:66] Go Version: go1.17.5
I0506 12:26:18.050923  1 main.go:67] Go OS/Arch: linux/amd64
I0506 12:26:18.050927  1 main.go:68] node-tuning Version: v4.11.0-202205040908.p0.gba8d935.assembly.stream-0-g53eba90-dirty
I0506 12:26:22.212050  1 request.go:665] Waited for 1.039026141s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/autoscaling.openshift.io/v1?timeout=32s
I0506 12:26:27.361254  1 leaderelection.go:248] attempting to acquire leader lease openshift-cluster-node-tuning-operator/node-tuning-operator-lock...
I0506 12:31:47.287145  1 leaderelection.go:258] successfully acquired lease openshift-cluster-node-tuning-operator/node-tuning-operator-lock
I0506 12:31:47.287368  1 controller.go:1007] starting Tuned controller
I0506 12:31:47.297982  1 server.go:54] starting metrics server
I0506 12:31:47.488427  1 controller.go:1059] started events processor/controller
I0506 12:31:47.593523  1 controller.go:667] refusing to sync MachineConfig "50-nto-worker-rt" due to Profile "ip-10-0-156-138.us-east-2.compute.internal" change generated by operand version "4.11.0-0.nightly-2022-05-04-214114"
I0506 12:31:47.604187  1 controller.go:667] refusing to sync MachineConfig "50-nto-worker-rt" due to Profile "ip-10-0-175-18.us-east-2.compute.internal" change generated by operand version "4.11.0-0.nightly-2022-05-04-214114"
I0506 12:31:47.606872  1 controller.go:667] refusing to sync MachineConfig "50-nto-worker-rt" due to Profile "ip-10-0-207-212.us-east-2.compute.internal" change generated by operand version "4.11.0-0.nightly-2022-05-04-214114"
I0506 12:31:51.512743  1 controller.go:745] updated MachineConfig 50-nto-worker-rt with kernel parameters: [trigger_tuned_fight=my-4.11.0-0.nightly-2022-05-04-21411]
^C

[ocpadmin@ec2-18-217-45-133 sro]$ oc get pods
NAME                                            READY   STATUS    RESTARTS   AGE
cluster-node-tuning-operator-84578cc8f4-56dzg   1/1     Running   0          13m
tuned-4zcrv                                     1/1     Running   1          7m55s
tuned-k5bxw                                     1/1     Running   1          7m58s
tuned-m4qjm                                     1/1     Running   1          7m59s
tuned-xt8s2                                     1/1     Running   0          8m1s

[ocpadmin@ec2-18-217-45-133 sro]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-05-04-214114   True        False         5h28m   Cluster version is 4.11.0-0.nightly-2022-05-04-214114

[ocpadmin@ec2-18-217-45-133 sro]$ oc get nodes
NAME                                         STATUS   ROLES              AGE     VERSION
ip-10-0-156-138.us-east-2.compute.internal   Ready    worker,worker-rt   5h37m   v1.23.3+d464c70
ip-10-0-159-172.us-east-2.compute.internal   Ready    master             5h45m   v1.23.3+d464c70
ip-10-0-175-18.us-east-2.compute.internal    Ready    worker,worker-rt   5h37m   v1.23.3+d464c70
ip-10-0-207-212.us-east-2.compute.internal   Ready    worker,worker-rt   5h34m   v1.23.3+d464c70

[ocpadmin@ec2-18-217-45-133 sro]$
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069