Cause:
On startup, TuneD performed a cleanup of the irqbalance configuration file (/etc/sysconfig/irqbalance).
Consequence:
When the TuneD daemon (managed by the cluster Node Tuning Operator) was restarted out of order, i.e. after CRI-O had already configured irqbalance for a pod, the CPU affinity of interrupt handlers was reset and the tuning was compromised.
Fix:
The irqbalance plugin in TuneD was disabled; OCP now relies fully on the logic and interaction between CRI-O and irqbalance.
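Conceptually, that interaction amounts to banning a pod's exclusive CPUs from IRQ balancing through the same config file and restarting the service. A minimal sketch of the idea, run on the node, follows; this is an illustration only, not CRI-O's actual code path, and the mask value is the one observed in the verification below:

  # Ban a set of CPUs from irqbalance by rewriting the banned-CPUs mask,
  # then restart irqbalance so the new ban takes effect.
  sed -i 's/^IRQBALANCE_BANNED_CPUS=.*/IRQBALANCE_BANNED_CPUS="00000000,0000f03c"/' /etc/sysconfig/irqbalance
  systemctl restart irqbalance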
Result:
Restarting the NTO-managed TuneD pods no longer affects IRQ handler CPU affinity.
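A quick way to spot-check this is sketched below; the tuned pod name is the example from the verification steps, the affinity snapshots must run on the node itself (e.g. via oc debug node/... and chroot /host), and the oc commands run from a cluster client:

  # Snapshot IRQ affinities, restart the node's tuned pod, snapshot again.
  for f in /proc/irq/*/smp_affinity_list; do echo "$f: $(cat $f)"; done > /tmp/irq-before
  oc delete pod tuned-zpc8k -n openshift-cluster-node-tuning-operator
  # after the replacement tuned pod is Running:
  for f in /proc/irq/*/smp_affinity_list; do echo "$f: $(cat $f)"; done > /tmp/irq-after
  diff /tmp/irq-before /tmp/irq-after   # no output expected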
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:7399
Comment 19 - Red Hat Bugzilla - 2023-09-18 04:41:26 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
Verification:

Version: OCP 4.12.0-0.nightly-2022-08-22-032534

Steps:

1. Apply the following PerformanceProfile:

  apiVersion: performance.openshift.io/v2
  kind: PerformanceProfile
  metadata:
    name: performance
  spec:
    cpu:
      isolated: "2-6"
      reserved: "0-1"
    realTimeKernel:
      enabled: true
    nodeSelector:
      node-role.kubernetes.io/worker-cnf: ""

2. Create a guaranteed (gu) pod with IRQ load balancing disabled:

  apiVersion: v1
  kind: Pod
  metadata:
    name: test
    annotations:
      irq-load-balancing.crio.io: "disable"
  spec:
    containers:
    - name: test
      image: nginx
      imagePullPolicy: IfNotPresent
      command: ["/bin/sh", "-c"]
      args: [ "while true; do sleep 100000; done;" ]
      resources:
        requests:
          cpu: 4
          memory: "200M"
        limits:
          cpu: 4
          memory: "200M"
    nodeSelector:
      node-role.kubernetes.io/worker-cnf: ""
    runtimeClassName: performance-performance

3. Check the banned CPUs on the node the pod was scheduled on:

  sh-4.4# cat /etc/sysconfig/irqbalance | grep IRQBALANCE_BANNED_CPUS
  # IRQBALANCE_BANNED_CPUS
  #IRQBALANCE_BANNED_CPUS=
  IRQBALANCE_BANNED_CPUS="00000000,0000f03c"

4. Delete the tuned pod scheduled on that same node to trigger its recreation, then check the banned CPUs again to confirm they were not overwritten:

  # oc delete pod tuned-zpc8k -n openshift-cluster-node-tuning-operator
  pod "tuned-zpc8k" deleted

  # oc get pod -A -o wide | grep tun
  openshift-cluster-node-tuning-operator   cluster-node-tuning-operator-758dfc7745-dvfzv   1/1   Running   2 (162m ago)   173m   10.134.0.24       ocp412790959-master-1.libvirt.lab.eng.tlv2.redhat.com   <none>   <none>
  openshift-cluster-node-tuning-operator   tuned-685ph                                     1/1   Running   0              167m   192.168.122.242   ocp412790959-master-1.libvirt.lab.eng.tlv2.redhat.com   <none>   <none>
  openshift-cluster-node-tuning-operator   tuned-fn7jd                                     1/1   Running   0              5s     192.168.122.75    ocp412790959-worker-0.libvirt.lab.eng.tlv2.redhat.com   <none>   <none>   <--------
  openshift-cluster-node-tuning-operator   tuned-g7rdn                                     1/1   Running   0              167m   192.168.122.252   ocp412790959-master-0.libvirt.lab.eng.tlv2.redhat.com   <none>   <none>
  openshift-cluster-node-tuning-operator   tuned-k2jh5                                     1/1   Running   1              146m   192.168.122.8     ocp412790959-worker-1.libvirt.lab.eng.tlv2.redhat.com   <none>   <none>
  openshift-cluster-node-tuning-operator   tuned-l8hvd                                     1/1   Running   0              167m   192.168.122.107   ocp412790959-master-2.libvirt.lab.eng.tlv2.redhat.com   <none>   <none>
  openshift-cluster-node-tuning-operator   tuned-m5ndk                                     1/1   Running   0              146m   192.168.122.14    ocp412790959-worker-2.libvirt.lab.eng.tlv2.redhat.com   <none>   <none>

  [root@ocp-edge89 ~]# oc debug node/ocp412790959-worker-0.libvirt.lab.eng.tlv2.redhat.com
  ...
  To use host binaries, run `chroot /host`
  Pod IP: 192.168.122.75
  If you don't see a command prompt, try pressing enter.
  sh-4.4# chroot /host
  sh-4.4# cat /etc/sysconfig/irqbalance | grep IRQBALANCE_BANNED_CPUS
  # IRQBALANCE_BANNED_CPUS
  #IRQBALANCE_BANNED_CPUS=
  IRQBALANCE_BANNED_CPUS="00000000,0000f03c"
  sh-4.4#

As can be seen, IRQBALANCE_BANNED_CPUS was not overwritten despite the restart.
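For reference, the value of IRQBALANCE_BANNED_CPUS is a hexadecimal CPU bitmask. A small sketch to decode it into a CPU list, assuming python3 is available in the debug shell:

  sh-4.4# python3 -c 'mask = 0x0000f03c; print([cpu for cpu in range(64) if mask >> cpu & 1])'
  [2, 3, 4, 5, 12, 13, 14, 15]

Eight banned CPUs for a pod requesting four exclusive CPUs is consistent with hyper-thread siblings being banned alongside them, though the exact mapping depends on this node's topology (an assumption, not verified above).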