## Description of problem:

Sometimes spec.cpu.reserved is not set correctly by default if only spec.cpu.isolated is defined in the PerformanceProfile.

## Steps to Reproduce:

~~~
apiVersion: performance.openshift.io/v1
kind: PerformanceProfile
metadata:
  name: performance
  namespace: openshift-operators
spec:
  cpu:
    isolated: 0-32    <-----
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 30
      node: 0
      size: 1G
  realTimeKernel:
    enabled: true
  numa:
    topologyPolicy: restricted
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  runtimeClass: performance-performance
  tuned: openshift-cluster-node-tuning-operator/openshift-node-performance-performance
~~~

- The PAO then logs the following errors, and the node never gets the reserved CPU set for kubelet on the kernel command line.

~~~
# oc logs performance-operator-6494d9944-zx6k4
I0727 04:36:57.398251 1 main.go:72] Operator Version:
I0727 04:36:57.398298 1 main.go:73] Git Commit:
I0727 04:36:57.398302 1 main.go:74] Build Date: 2021-06-07T07:14:55+0000
I0727 04:36:57.398307 1 main.go:75] Go Version: go1.13.15
I0727 04:36:57.398310 1 main.go:76] Go OS/Arch: linux/amd64
I0727 04:36:58.449175 1 request.go:621] Throttling request took 1.040254562s, request: GET:https://172.30.0.1:443/apis/scheduling.k8s.io/v1beta1?timeout=32s
2021-07-27T04:37:00.504Z INFO controller-runtime.metrics metrics server is starting to listen {"addr": "0.0.0.0:8383"}
2021-07-27T04:37:00.504Z INFO controller-runtime.builder skip registering a mutating webhook, admission.Defaulter interface is not implemented {"GVK": "performance.openshift.io/v1, Kind=PerformanceProfile"}
2021-07-27T04:37:00.504Z INFO controller-runtime.builder skip registering a validating webhook, admission.Validator interface is not implemented {"GVK": "performance.openshift.io/v1, Kind=PerformanceProfile"}
2021-07-27T04:37:00.504Z INFO controller-runtime.webhook registering webhook {"path": "/convert"}
2021-07-27T04:37:00.504Z INFO controller-runtime.builder conversion webhook enabled {"object": {"metadata":{"creationTimestamp":null},"spec":{},"status":{}}}
I0727 04:37:00.505049 1 main.go:142] Starting the Cmd.
I0727 04:37:00.505168 1 leaderelection.go:242] attempting to acquire leader lease openshift-operators/performance-addon-operators...
2021-07-27T04:37:00.505Z INFO controller-runtime.webhook.webhooks starting webhook server
2021-07-27T04:37:00.505Z INFO controller-runtime.manager starting metrics server {"path": "/metrics"}
2021-07-27T04:37:00.513Z INFO controller-runtime.certwatcher Updated current TLS certificate
2021-07-27T04:37:00.513Z INFO controller-runtime.webhook serving webhook server {"host": "", "port": 4343}
2021-07-27T04:37:00.514Z INFO controller-runtime.certwatcher Starting certificate watcher
I0727 04:37:17.911941 1 leaderelection.go:252] successfully acquired lease openshift-operators/performance-addon-operators
2021-07-27T04:37:17.912Z INFO controller Starting EventSource {"reconcilerGroup": "performance.openshift.io", "reconcilerKind": "PerformanceProfile", "controller": "performanceprofile", "source": "kind source: /, Kind="}
2021-07-27T04:37:17.912Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"ConfigMap","namespace":"openshift-operators","name":"performance-addon-operators","uid":"f07f5899-4140-4bdf-a5de-f92a3ad30a8d","apiVersion":"v1","resourceVersion":"55735609"}, "reason": "LeaderElection", "message": "performance-operator-6494d9944-zx6k4_1ef933b5-21f6-477b-8d31-51a8170939bb became leader"}
2021-07-27T04:37:18.113Z INFO controller Starting EventSource {"reconcilerGroup": "performance.openshift.io", "reconcilerKind": "PerformanceProfile", "controller": "performanceprofile", "source": "kind source: /, Kind="}
2021-07-27T04:37:18.313Z INFO controller Starting EventSource {"reconcilerGroup": "performance.openshift.io", "reconcilerKind": "PerformanceProfile", "controller": "performanceprofile", "source": "kind source: /, Kind="}
2021-07-27T04:37:18.514Z INFO controller Starting EventSource {"reconcilerGroup": "performance.openshift.io", "reconcilerKind": "PerformanceProfile", "controller": "performanceprofile", "source": "kind source: /, Kind="}
2021-07-27T04:37:18.715Z INFO controller Starting EventSource {"reconcilerGroup": "performance.openshift.io", "reconcilerKind": "PerformanceProfile", "controller": "performanceprofile", "source": "kind source: /, Kind="}
2021-07-27T04:37:18.916Z INFO controller Starting EventSource {"reconcilerGroup": "performance.openshift.io", "reconcilerKind": "PerformanceProfile", "controller": "performanceprofile", "source": "kind source: /, Kind="}
2021-07-27T04:37:19.116Z INFO controller Starting EventSource {"reconcilerGroup": "performance.openshift.io", "reconcilerKind": "PerformanceProfile", "controller": "performanceprofile", "source": "kind source: /, Kind="}
2021-07-27T04:37:19.317Z INFO controller Starting Controller {"reconcilerGroup": "performance.openshift.io", "reconcilerKind": "PerformanceProfile", "controller": "performanceprofile"}
2021-07-27T04:37:19.317Z INFO controller Starting workers {"reconcilerGroup": "performance.openshift.io", "reconcilerKind": "PerformanceProfile", "controller": "performanceprofile", "worker count": 1}
I0727 04:37:19.318148 1 performanceprofile_controller.go:216] Reconciling PerformanceProfile
E0727 04:37:19.318263 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 871 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x160b1a0, 0x26331e0)
    /remote-source/app/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /remote-source/app/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
panic(0x160b1a0, 0x26331e0)
    /opt/rh/go-toolset-1.13/root/usr/lib/go-toolset-1.13-golang/src/runtime/panic.go:679 +0x1b2
github.com/openshift-kni/performance-addon-operators/pkg/controller/performanceprofile/components/profile.validateCPUCores(0xc000df3880, 0xc000040238, 0x13fdfa0)
    /remote-source/app/pkg/controller/performanceprofile/components/profile/profile.go:181 +0x2d
github.com/openshift-kni/performance-addon-operators/pkg/controller/performanceprofile/components/profile.ValidateParameters(0xc000ddbd40, 0x1805de5, 0x13)
    /remote-source/app/pkg/controller/performanceprofile/components/profile/profile.go:28 +0x4a
github.com/openshift-kni/performance-addon-operators/controllers.(*PerformanceProfileReconciler).Reconcile(0xc00077fb80, 0x0, 0x0, 0xc00044fd04, 0xb, 0xc000df37e0, 0xc000a72518, 0x1a31460, 0xc000a72510)
    /remote-source/app/controllers/performanceprofile_controller.go:271 +0x3c1
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000ab55f0, 0x166bd00, 0xc0002abd80, 0x0)
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235 +0x27d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000ab55f0, 0x0)
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:209 +0xcb
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc000ab55f0)
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:188 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000443f50)
    /remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000443f50, 0x1a1fcc0, 0xc0008fbb30, 0x1, 0xc00056a360)
    /remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000443f50, 0x3b9aca00, 0x0, 0xc000455d01, 0xc00056a360)
    /remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0xaa
k8s.io/apimachinery/pkg/util/wait.Until(0xc000443f50, 0x3b9aca00, 0xc00056a360)
    /remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:170 +0x431
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x139d52d]

goroutine 871 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /remote-source/app/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x105
panic(0x160b1a0, 0x26331e0)
    /opt/rh/go-toolset-1.13/root/usr/lib/go-toolset-1.13-golang/src/runtime/panic.go:679 +0x1b2
github.com/openshift-kni/performance-addon-operators/pkg/controller/performanceprofile/components/profile.validateCPUCores(0xc000df3880, 0xc000040238, 0x13fdfa0)
    /remote-source/app/pkg/controller/performanceprofile/components/profile/profile.go:181 +0x2d
github.com/openshift-kni/performance-addon-operators/pkg/controller/performanceprofile/components/profile.ValidateParameters(0xc000ddbd40, 0x1805de5, 0x13)
    /remote-source/app/pkg/controller/performanceprofile/components/profile/profile.go:28 +0x4a
github.com/openshift-kni/performance-addon-operators/controllers.(*PerformanceProfileReconciler).Reconcile(0xc00077fb80, 0x0, 0x0, 0xc00044fd04, 0xb, 0xc000df37e0, 0xc000a72518, 0x1a31460, 0xc000a72510)
    /remote-source/app/controllers/performanceprofile_controller.go:271 +0x3c1
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000ab55f0, 0x166bd00, 0xc0002abd80, 0x0)
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235 +0x27d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000ab55f0, 0x0)
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:209 +0xcb
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc000ab55f0)
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:188 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000443f50)
    /remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000443f50, 0x1a1fcc0, 0xc0008fbb30, 0x1, 0xc00056a360)
    /remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000443f50, 0x3b9aca00, 0x0, 0xc000455d01, 0xc00056a360)
    /remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0xaa
k8s.io/apimachinery/pkg/util/wait.Until(0xc000443f50, 0x3b9aca00, 0xc00056a360)
    /remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:170 +0x431
~~~

- If some reserved CPUs are set explicitly in the PerformanceProfile, all CPUs except the isolated ones are set to reserved normally.

~~~
[performanceprofile]
apiVersion: performance.openshift.io/v1
kind: PerformanceProfile
metadata:
  name: performance
  namespace: openshift-operators
spec:
  cpu:
    isolated: 2-15
    reserved: 0-1    <------ just define 2 CPUs here.
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 30
      node: 0
      size: 1G
  realTimeKernel:
    enabled: true
  numa:
    topologyPolicy: restricted
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  runtimeClass: performance-performance
  tuned: openshift-cluster-node-tuning-operator/openshift-node-performance-performance
~~~

Then we can see that CPUs 0-1 and 16-63 were set to reserved:

~~~
[root@worker01 ~]# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-4f77d3fe15031a70b8025d96c85b25358c801e9557bed3a7a3657deed7faa062/vmlinuz-4.18.0-240.22.1.rt7.77.el8_3.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal ostree=/ostree/boot.0/rhcos/4f77d3fe15031a70b8025d96c85b25358c801e9557bed3a7a3657deed7faa062/0 root=UUID=a7f8b4cf-609e-4f60-8402-7ac75322ab24 rw rootflags=prjquota intel_iommu=on iommu=pt skew_tick=1 nohz=on rcu_nocbs=2-15 tuned.non_isolcpus=ffffffff,ffff0003 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,2-15 systemd.cpu_affinity=0,1,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63 default_hugepagesz=1G +
~~~

## Actual results:

Sometimes spec.cpu.reserved fails to be set correctly and the operator logs errors.

## Expected results:

spec.cpu.reserved should always be set correctly by default, even if only spec.cpu.isolated is defined in the PerformanceProfile.

## Additional info:
This is expected behavior. You are supposed to set both reserved and isolated in PerformanceProfile. The sets must not overlap and the sum of all the cpus mentioned must cover all the cpus expected on the workers in the targeted pool.
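The two constraints above (no overlap, full coverage of the worker's online CPUs) can be expressed as a short check. This is an illustrative sketch only, not the operator's actual validation code, and the cpuset parser is a hypothetical minimal one:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUSet expands a kernel-style cpuset string like "0-1,4,6-7"
// into a set of CPU ids. Minimal hypothetical parser for this sketch.
func parseCPUSet(s string) (map[int]bool, error) {
	set := map[int]bool{}
	for _, part := range strings.Split(s, ",") {
		if bounds := strings.SplitN(part, "-", 2); len(bounds) == 2 {
			lo, err1 := strconv.Atoi(bounds[0])
			hi, err2 := strconv.Atoi(bounds[1])
			if err1 != nil || err2 != nil || lo > hi {
				return nil, fmt.Errorf("bad range %q", part)
			}
			for i := lo; i <= hi; i++ {
				set[i] = true
			}
		} else {
			n, err := strconv.Atoi(part)
			if err != nil {
				return nil, fmt.Errorf("bad cpu %q", part)
			}
			set[n] = true
		}
	}
	return set, nil
}

// checkProfile enforces the two rules from the comment above:
// reserved and isolated must not overlap, and together they must
// cover every online CPU on the target workers.
func checkProfile(reserved, isolated string, onlineCPUs int) error {
	r, err := parseCPUSet(reserved)
	if err != nil {
		return err
	}
	iso, err := parseCPUSet(isolated)
	if err != nil {
		return err
	}
	for cpu := range r {
		if iso[cpu] {
			return fmt.Errorf("cpu %d is both reserved and isolated", cpu)
		}
	}
	for cpu := 0; cpu < onlineCPUs; cpu++ {
		if !r[cpu] && !iso[cpu] {
			return fmt.Errorf("cpu %d is neither reserved nor isolated", cpu)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkProfile("0-1", "2-15", 16)) // <nil>
	fmt.Println(checkProfile("0-2", "2-15", 16)) // cpu 2 is both reserved and isolated
}
```

Note that the working reproducer below (reserved: 0-1, isolated: 2-15 on a 64-CPU node) does not actually satisfy the coverage rule; in practice the remaining CPUs 16-63 ended up in the reserved affinity mask anyway.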
(In reply to Martin Sivák from comment #1)
> This is expected behavior. You are supposed to set both reserved and
> isolated in PerformanceProfile. The sets must not overlap and the sum of all
> the cpus mentioned must cover all the cpus expected on the workers in the
> targeted pool.

Martin, thanks for your answer!

According to https://github.com/openshift-kni/performance-addon-operators/blob/master/docs/performance_profile.md#cpu:

| Field    | Description | Scheme  | Required |
|----------|-------------|---------|----------|
| reserved |             | *CPUSet | false    |
| isolated |             | CPUSet  | true     |

it seems the 'reserved' field is not 'required' in the PerformanceProfile, so do you mean that the 'reserved' field is also 'required'?

If kindly correct me if my understanding is wrong. Thanks
> If kindly correct me if my understanding is wrong <-- typo

Please kindly correct me if my understanding is wrong.
Mario, let's double-check our CRD metadata. This is mostly about https://github.com/openshift-kni/performance-addon-operators/blob/master/deploy/olm-catalog/performance-addon-operator/4.10.0/performance.openshift.io_performanceprofiles_crd.yaml: this file and all of its instances upstream and downstream. Yanir should be able to help you.
Verification:

Versions:
- ocp: Server Version: 4.10.0-0.nightly-2022-01-27-104747
- pao: performance-addon-operator-container-v4.10.0-29

Steps:

- Installed PAO:

~~~
[root@cnfdf05-installer performance]# oc get csv
NAME                                 DISPLAY                      VERSION   REPLACES   PHASE
performance-addon-operator.v4.10.0   Performance Addon Operator   4.10.0               Succeeded

[root@cnfdf05-installer performance]# oc describe csv performance-addon-operator.v4.10.0 | grep Image
    containerImage:
    f:containerImage:
    f:relatedImages:
  Image:  registry.redhat.io/openshift4/performance-addon-rhel8-operator@sha256:767ff13075b2f503afb6f26e265282488b79acaf22cbf1c77055c3e008fdda8d
  Related Images:
    Image:  registry.redhat.io/openshift4/performance-addon-rhel8-operator@sha256:767ff13075b2f503afb6f26e265282488b79acaf22cbf1c77055c3e008fdda8d
    Image:  registry.redhat.io/openshift4/performance-addon-rhel8-operator@sha256:767ff13075b2f503afb6f26e265282488b79acaf22cbf1c77055c3e008fdda8d
~~~

- Applied a PerformanceProfile that defines only isolated CPUs:

~~~
# cat pp.yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    isolated: "3-5"
  realTimeKernel:
    enabled: true
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""

# oc apply -f pp.yaml
The PerformanceProfile "performance" is invalid: spec.cpu.reserved: Required value

# oc get performanceprofile
No resources found
~~~

- The CRD now marks both fields as required:

~~~
# oc describe crd performanceprofiles
...
  Cpu:
    Description:  CPU defines a set of CPU related parameters.
    Properties:
      Balance Isolated:
        Description:  BalanceIsolated toggles whether or not the Isolated CPU set is eligible for load balancing work loads. When this option is set to "false", the Isolated CPU set will be static, meaning workloads have to explicitly assign each thread to a specific cpu in order to work across multiple CPUs. Setting this to "true" allows workloads to be balanced across CPUs. Setting this to "false" offers the most predictable performance for guaranteed workloads, but it offloads the complexity of cpu load balancing to the application. Defaults to "true"
        Type:  boolean
      Isolated:
        Description:  Isolated defines a set of CPUs that will be used to give to application threads the most execution time possible, which means removing as many extraneous tasks off a CPU as possible. It is important to notice the CPU manager can choose any CPU to run the workload except the reserved CPUs. In order to guarantee that your workload will run on the isolated CPU: 1. The union of reserved CPUs and isolated CPUs should include all online CPUs 2. The isolated CPUs field should be the complementary to reserved CPUs field
        Type:  string
      Reserved:
        Description:  Reserved defines a set of CPUs that will not be used for any container workloads initiated by kubelet.
        Type:  string
    Required:
      isolated
      reserved    <-------
~~~

Verified successfully.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.10 low-latency extras update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2022:0640