Cause:
On startup, TuneD performed a cleanup of the irqbalance configuration file (/etc/sysconfig/irqbalance).
Consequence:
When the TuneD daemon (managed by the cluster Node Tuning Operator) was restarted out of order, i.e. after CRI-O had already configured irqbalance for a pod, the CPU affinity of interrupt handlers was reset and the tuning was compromised.
Fix:
The irqbalance plugin in TuneD was disabled; OCP now relies fully on the logic and interaction between CRI-O and irqbalance.
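Conceptually, that interaction amounts to banning a pod's exclusive CPUs from IRQ balancing through the same config file and restarting the service. A minimal sketch of the idea, run on the node, follows; this is an illustration only, not CRI-O's actual code path, and the mask value is the one observed in the verification below:

  # Ban a set of CPUs from irqbalance by rewriting the banned-CPUs mask,
  # then restart irqbalance so the new ban takes effect.
  sed -i 's/^IRQBALANCE_BANNED_CPUS=.*/IRQBALANCE_BANNED_CPUS="00000000,0000f03c"/' /etc/sysconfig/irqbalance
  systemctl restart irqbalance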
Result:
Restarting the NTO-managed TuneD pods no longer affects IRQ handler CPU affinity.
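A quick way to spot-check this is sketched below; the tuned pod name is the example from the verification steps, the affinity snapshots must run on the node itself (e.g. via oc debug node/... and chroot /host), and the oc commands run from a cluster client:

  # Snapshot IRQ affinities, restart the node's tuned pod, snapshot again.
  for f in /proc/irq/*/smp_affinity_list; do echo "$f: $(cat $f)"; done > /tmp/irq-before
  oc delete pod tuned-zpc8k -n openshift-cluster-node-tuning-operator
  # after the replacement tuned pod is Running:
  for f in /proc/irq/*/smp_affinity_list; do echo "$f: $(cat $f)"; done > /tmp/irq-after
  diff /tmp/irq-before /tmp/irq-after   # no output expected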
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:7399
Comment 19 - Red Hat Bugzilla - 2023-09-18 04:41:26 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
Verification:

Version: OCP 4.12.0-0.nightly-2022-08-22-032534

Steps:

1. Apply the following PerformanceProfile:

  apiVersion: performance.openshift.io/v2
  kind: PerformanceProfile
  metadata:
    name: performance
  spec:
    cpu:
      isolated: "2-6"
      reserved: "0-1"
    realTimeKernel:
      enabled: true
    nodeSelector:
      node-role.kubernetes.io/worker-cnf: ""

2. Create a guaranteed (gu) pod with IRQ load balancing disabled:

  apiVersion: v1
  kind: Pod
  metadata:
    name: test
    annotations:
      irq-load-balancing.crio.io: "disable"
  spec:
    containers:
    - name: test
      image: nginx
      imagePullPolicy: IfNotPresent
      command: ["/bin/sh", "-c"]
      args: [ "while true; do sleep 100000; done;" ]
      resources:
        requests:
          cpu: 4
          memory: "200M"
        limits:
          cpu: 4
          memory: "200M"
    nodeSelector:
      node-role.kubernetes.io/worker-cnf: ""
    runtimeClassName: performance-performance

3. Check the banned CPUs on the node the pod was scheduled on:

  sh-4.4# cat /etc/sysconfig/irqbalance | grep IRQBALANCE_BANNED_CPUS
  # IRQBALANCE_BANNED_CPUS
  #IRQBALANCE_BANNED_CPUS=
  IRQBALANCE_BANNED_CPUS="00000000,0000f03c"

4. Delete the tuned pod scheduled on that same node to trigger its recreation, then check the banned CPUs again to confirm they were not overwritten:

  # oc delete pod tuned-zpc8k -n openshift-cluster-node-tuning-operator
  pod "tuned-zpc8k" deleted

  # oc get pod -A -o wide | grep tun
  openshift-cluster-node-tuning-operator   cluster-node-tuning-operator-758dfc7745-dvfzv   1/1   Running   2 (162m ago)   173m   10.134.0.24       ocp412790959-master-1.libvirt.lab.eng.tlv2.redhat.com   <none>   <none>
  openshift-cluster-node-tuning-operator   tuned-685ph                                     1/1   Running   0              167m   192.168.122.242   ocp412790959-master-1.libvirt.lab.eng.tlv2.redhat.com   <none>   <none>
  openshift-cluster-node-tuning-operator   tuned-fn7jd                                     1/1   Running   0              5s     192.168.122.75    ocp412790959-worker-0.libvirt.lab.eng.tlv2.redhat.com   <none>   <none>   <--------
  openshift-cluster-node-tuning-operator   tuned-g7rdn                                     1/1   Running   0              167m   192.168.122.252   ocp412790959-master-0.libvirt.lab.eng.tlv2.redhat.com   <none>   <none>
  openshift-cluster-node-tuning-operator   tuned-k2jh5                                     1/1   Running   1              146m   192.168.122.8     ocp412790959-worker-1.libvirt.lab.eng.tlv2.redhat.com   <none>   <none>
  openshift-cluster-node-tuning-operator   tuned-l8hvd                                     1/1   Running   0              167m   192.168.122.107   ocp412790959-master-2.libvirt.lab.eng.tlv2.redhat.com   <none>   <none>
  openshift-cluster-node-tuning-operator   tuned-m5ndk                                     1/1   Running   0              146m   192.168.122.14    ocp412790959-worker-2.libvirt.lab.eng.tlv2.redhat.com   <none>   <none>

  [root@ocp-edge89 ~]# oc debug node/ocp412790959-worker-0.libvirt.lab.eng.tlv2.redhat.com
  ...
  To use host binaries, run `chroot /host`
  Pod IP: 192.168.122.75
  If you don't see a command prompt, try pressing enter.
  sh-4.4# chroot /host
  sh-4.4# cat /etc/sysconfig/irqbalance | grep IRQBALANCE_BANNED_CPUS
  # IRQBALANCE_BANNED_CPUS
  #IRQBALANCE_BANNED_CPUS=
  IRQBALANCE_BANNED_CPUS="00000000,0000f03c"
  sh-4.4#

As can be seen, IRQBALANCE_BANNED_CPUS was not overwritten despite the restart.
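For reference, the value of IRQBALANCE_BANNED_CPUS is a hexadecimal CPU bitmask. A small sketch to decode it into a CPU list, assuming python3 is available in the debug shell:

  sh-4.4# python3 -c 'mask = 0x0000f03c; print([cpu for cpu in range(64) if mask >> cpu & 1])'
  [2, 3, 4, 5, 12, 13, 14, 15]

Eight banned CPUs for a pod requesting four exclusive CPUs is consistent with hyper-thread siblings being banned alongside them, though the exact mapping depends on this node's topology (an assumption, not verified above).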