Bug 2105123 - Tuned overwriting IRQBALANCE_BANNED_CPUS
Summary: Tuned overwriting IRQBALANCE_BANNED_CPUS
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Performance Addon Operator
Version: 4.11
Hardware: x86_64
OS: All
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Francesco Romani
QA Contact: Gowrishankar Rajaiyan
URL:
Whiteboard:
Depends On:
Blocks: 2088578 2182337
 
Reported: 2022-07-08 02:14 UTC by browsell
Modified: 2024-01-12 17:09 UTC
CC List: 14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Tuned cleans up the irqbalance config file on startup. Consequence: When the Tuned daemon was restarted out of order (as part of the cluster Node Tuning Operator), the CPU affinity of interrupt handlers was reset and the tuning was compromised. Fix: The irqbalance plugin in Tuned was disabled; OCP now relies fully on the logic and interaction between cri-o and irqbalance. Result: An NTO restart no longer affects IRQ handler CPU affinity.
Clone Of:
Environment:
Last Closed: 2023-01-17 19:51:47 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-node-tuning-operator pull 396 0 None Merged Bug 2105123: tuned: disable irqbalance 2022-08-24 15:08:48 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:52:07 UTC
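
For context on the fix tracked in the linked PR (tuned: disable irqbalance): with the Tuned irqbalance plugin out of the picture, cri-o is the component that updates the irqbalance configuration for pods carrying the irq-load-balancing.crio.io annotation. An illustrative way to confirm which file cri-o manages is to dump its configuration from a node shell (oc debug node/<node>, then chroot /host); the output shown is only the usual default and may differ by cri-o version:

sh-4.4# crio config 2>/dev/null | grep irqbalance
irqbalance_config_file = "/etc/sysconfig/irqbalance"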

Comment 10 Shereen Haj Makhoul 2022-08-23 16:29:52 UTC
Verification:

Version:
OCP: 4.12.0-0.nightly-2022-08-22-032534


Steps:

1. Apply the following PerformanceProfile (an example apply/wait sequence follows the manifest):
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    isolated: "2-6"
    reserved: "0-1"
  realTimeKernel:
    enabled: true
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
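
An example of applying the profile and waiting for the nodes to finish updating (the file name and the MCP name worker-cnf are assumptions derived from the manifest above; adjust to the environment):

# oc apply -f performance-profile.yaml
# oc wait mcp/worker-cnf --for=condition=Updated --timeout=30m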

2. Create a guaranteed (gu) pod with IRQ load balancing disabled (an example creation check follows the manifest):
apiVersion: v1
kind: Pod
metadata:
  name: test
  annotations:
     irq-load-balancing.crio.io: "disable"
spec:
  containers:
  - name: test
    image: nginx   
    imagePullPolicy: IfNotPresent
    command: ["/bin/sh", "-c"]
    args: [ "while true; do sleep 100000; done;" ]
    resources:
      requests:
        cpu: 4
        memory: "200M"
      limits:
        cpu: 4
        memory: "200M"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  runtimeClassName: performance-performance
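
An example of creating the pod and confirming it was scheduled and runs with the Guaranteed QoS class required for exclusive CPUs (the file name is an assumption):

# oc apply -f gu-pod.yaml
# oc get pod test -o wide
# oc get pod test -o jsonpath='{.status.qosClass}{"\n"}'
Guaranteed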

3. Check the banned CPUs on the node the pod was scheduled on (a decode of the mask follows the output):
sh-4.4# cat /etc/sysconfig/irqbalance |grep IRQBALANCE_BANNED_CPUS
# IRQBALANCE_BANNED_CPUS
#IRQBALANCE_BANNED_CPUS=
IRQBALANCE_BANNED_CPUS="00000000,0000f03c"

4. Delete the tuned pod scheduled on that same node to trigger recreation, and check the banned CPUs again to confirm they were not overwritten:

# oc delete pod tuned-zpc8k -n openshift-cluster-node-tuning-operator
pod "tuned-zpc8k" deleted

# oc get pod -A -o wide | grep tun
openshift-cluster-node-tuning-operator             cluster-node-tuning-operator-758dfc7745-dvfzv                                          1/1     Running            2 (162m ago)    173m    10.134.0.24       ocp412790959-master-1.libvirt.lab.eng.tlv2.redhat.com   <none>           <none>
openshift-cluster-node-tuning-operator             tuned-685ph                                                                            1/1     Running            0               167m    192.168.122.242   ocp412790959-master-1.libvirt.lab.eng.tlv2.redhat.com   <none>           <none>
openshift-cluster-node-tuning-operator             tuned-fn7jd                                                                            1/1     Running            0               5s      192.168.122.75    ocp412790959-worker-0.libvirt.lab.eng.tlv2.redhat.com   <none>           <none>   <--------
openshift-cluster-node-tuning-operator             tuned-g7rdn                                                                            1/1     Running            0               167m    192.168.122.252   ocp412790959-master-0.libvirt.lab.eng.tlv2.redhat.com   <none>           <none>
openshift-cluster-node-tuning-operator             tuned-k2jh5                                                                            1/1     Running            1               146m    192.168.122.8     ocp412790959-worker-1.libvirt.lab.eng.tlv2.redhat.com   <none>           <none>
openshift-cluster-node-tuning-operator             tuned-l8hvd                                                                            1/1     Running            0               167m    192.168.122.107   ocp412790959-master-2.libvirt.lab.eng.tlv2.redhat.com   <none>           <none>
openshift-cluster-node-tuning-operator             tuned-m5ndk                                                                            1/1     Running            0               146m    192.168.122.14    ocp412790959-worker-2.libvirt.lab.eng.tlv2.redhat.com   <none>           <none>
[root@ocp-edge89 ~]# oc debug node/ocp412790959-worker-0.libvirt.lab.eng.tlv2.redhat.com
...
To use host binaries, run `chroot /host`
Pod IP: 192.168.122.75
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# cat /etc/sysconfig/irqbalance |grep IRQBALANCE_BANNED_CPUS
# IRQBALANCE_BANNED_CPUS
#IRQBALANCE_BANNED_CPUS=
IRQBALANCE_BANNED_CPUS="00000000,0000f03c"
sh-4.4# 

As can be seen, IRQBALANCE_BANNED_CPUS was not overwritten despite the restart.
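
An equivalent condensed check (illustrative; the node and tuned pod names are the ones from this run, and the tuned pod name changes on every restart) is to compare the file checksum before and after deleting the tuned pod:

# oc debug node/ocp412790959-worker-0.libvirt.lab.eng.tlv2.redhat.com -- chroot /host md5sum /etc/sysconfig/irqbalance
# oc delete pod tuned-fn7jd -n openshift-cluster-node-tuning-operator
# oc debug node/ocp412790959-worker-0.libvirt.lab.eng.tlv2.redhat.com -- chroot /host md5sum /etc/sysconfig/irqbalance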

Comment 11 Shereen Haj Makhoul 2022-08-23 16:34:08 UTC
Moving back to QA, as the final say belongs to OCP-QE.

Comment 12 Shereen Haj Makhoul 2022-08-24 14:14:19 UTC
@liqcui FYI, moving this bug to verified.

Comment 18 errata-xmlrpc 2023-01-17 19:51:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

Comment 19 Red Hat Bugzilla 2023-09-18 04:41:26 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

