Bug 1986681 - spec.cpu.reserved is not set correctly by default if only spec.cpu.isolated is defined in PerformanceProfile
Summary: spec.cpu.reserved is not set correctly by default if only spec.cpu.isolated is defined in PerformanceProfile
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Performance Addon Operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Mario Fernández
QA Contact: Niranjan Mallapadi Raghavender
URL:
Whiteboard:
Depends On:
Blocks: 2033011
Reported: 2021-07-28 05:40 UTC by Xingbin Li
Modified: 2022-08-04 15:47 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The reserved CPU field was not marked as required in the CRD. Consequence: spec.cpu.reserved could be omitted. Fix: Mark the reserved CPU field as required in the CRD and correct the documentation. Result: Profiles without spec.cpu.reserved are rejected, and the documentation is correct.
Clone Of:
Environment:
Last Closed: 2022-03-10 19:34:25 UTC
Target Upstream Version:
Embargoed:




Links:
- Github openshift-kni/performance-addon-operators pull 792 (Merged): "Set isolated and required CPU parameter in CRD" (last updated 2021-12-15 16:52:50 UTC)
- Github openshift/cluster-node-tuning-operator/blob/8d52656c9a1ec26e406759acd35d730384926dea/pkg/apis/performanceprofile/v2/performanceprofile_validation_test.go#L141 (last updated 2022-08-04 15:47:45 UTC)
- Red Hat Product Errata RHEA-2022:0640 (last updated 2022-03-10 19:34:50 UTC)

Description Xingbin Li 2021-07-28 05:40:43 UTC
## Description of problem:

 - spec.cpu.reserved is not set by default if only spec.cpu.isolated is defined in the PerformanceProfile; instead, the operator panics with a nil pointer dereference.

## Steps to Reproduce:

~~~
apiVersion: performance.openshift.io/v1
kind: PerformanceProfile
metadata:
  name: performance
  namespace: openshift-operators
spec:
  cpu:
    isolated: 0-32  <-----
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 30
      node: 0
      size: 1G
  realTimeKernel:
    enabled: true
  numa:
    topologyPolicy: restricted
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  runtimeClass: performance-performance
  tuned: openshift-cluster-node-tuning-operator/openshift-node-performance-performance
~~~

  - The PAO then logs the following errors, and the node does not get the reserved CPUs for kubelet set in the kernel command line.

~~~
# oc logs performance-operator-6494d9944-zx6k4
I0727 04:36:57.398251       1 main.go:72] Operator Version:
I0727 04:36:57.398298       1 main.go:73] Git Commit:
I0727 04:36:57.398302       1 main.go:74] Build Date: 2021-06-07T07:14:55+0000
I0727 04:36:57.398307       1 main.go:75] Go Version: go1.13.15
I0727 04:36:57.398310       1 main.go:76] Go OS/Arch: linux/amd64
I0727 04:36:58.449175       1 request.go:621] Throttling request took 1.040254562s, request: GET:https://172.30.0.1:443/apis/scheduling.k8s.io/v1beta1?timeout=32s
2021-07-27T04:37:00.504Z	INFO	controller-runtime.metrics	metrics server is starting to listen	{"addr": "0.0.0.0:8383"}
2021-07-27T04:37:00.504Z	INFO	controller-runtime.builder	skip registering a mutating webhook, admission.Defaulter interface is not implemented	{"GVK": "performance.openshift.io/v1, Kind=PerformanceProfile"}
2021-07-27T04:37:00.504Z	INFO	controller-runtime.builder	skip registering a validating webhook, admission.Validator interface is not implemented	{"GVK": "performance.openshift.io/v1, Kind=PerformanceProfile"}
2021-07-27T04:37:00.504Z	INFO	controller-runtime.webhook	registering webhook	{"path": "/convert"}
2021-07-27T04:37:00.504Z	INFO	controller-runtime.builder	conversion webhook enabled	{"object": {"metadata":{"creationTimestamp":null},"spec":{},"status":{}}}
I0727 04:37:00.505049       1 main.go:142] Starting the Cmd.
I0727 04:37:00.505168       1 leaderelection.go:242] attempting to acquire leader lease  openshift-operators/performance-addon-operators...
2021-07-27T04:37:00.505Z	INFO	controller-runtime.webhook.webhooks	starting webhook server
2021-07-27T04:37:00.505Z	INFO	controller-runtime.manager	starting metrics server	{"path": "/metrics"}
2021-07-27T04:37:00.513Z	INFO	controller-runtime.certwatcher	Updated current TLS certificate
2021-07-27T04:37:00.513Z	INFO	controller-runtime.webhook	serving webhook server	{"host": "", "port": 4343}
2021-07-27T04:37:00.514Z	INFO	controller-runtime.certwatcher	Starting certificate watcher
I0727 04:37:17.911941       1 leaderelection.go:252] successfully acquired lease openshift-operators/performance-addon-operators
2021-07-27T04:37:17.912Z	INFO	controller	Starting EventSource	{"reconcilerGroup": "performance.openshift.io", "reconcilerKind": "PerformanceProfile", "controller": "performanceprofile", "source": "kind source: /, Kind="}
2021-07-27T04:37:17.912Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"ConfigMap","namespace":"openshift-operators","name":"performance-addon-operators","uid":"f07f5899-4140-4bdf-a5de-f92a3ad30a8d","apiVersion":"v1","resourceVersion":"55735609"}, "reason": "LeaderElection", "message": "performance-operator-6494d9944-zx6k4_1ef933b5-21f6-477b-8d31-51a8170939bb became leader"}
2021-07-27T04:37:18.113Z	INFO	controller	Starting EventSource	{"reconcilerGroup": "performance.openshift.io", "reconcilerKind": "PerformanceProfile", "controller": "performanceprofile", "source": "kind source: /, Kind="}
2021-07-27T04:37:18.313Z	INFO	controller	Starting EventSource	{"reconcilerGroup": "performance.openshift.io", "reconcilerKind": "PerformanceProfile", "controller": "performanceprofile", "source": "kind source: /, Kind="}
2021-07-27T04:37:18.514Z	INFO	controller	Starting EventSource	{"reconcilerGroup": "performance.openshift.io", "reconcilerKind": "PerformanceProfile", "controller": "performanceprofile", "source": "kind source: /, Kind="}
2021-07-27T04:37:18.715Z	INFO	controller	Starting EventSource	{"reconcilerGroup": "performance.openshift.io", "reconcilerKind": "PerformanceProfile", "controller": "performanceprofile", "source": "kind source: /, Kind="}
2021-07-27T04:37:18.916Z	INFO	controller	Starting EventSource	{"reconcilerGroup": "performance.openshift.io", "reconcilerKind": "PerformanceProfile", "controller": "performanceprofile", "source": "kind source: /, Kind="}
2021-07-27T04:37:19.116Z	INFO	controller	Starting EventSource	{"reconcilerGroup": "performance.openshift.io", "reconcilerKind": "PerformanceProfile", "controller": "performanceprofile", "source": "kind source: /, Kind="}
2021-07-27T04:37:19.317Z	INFO	controller	Starting Controller	{"reconcilerGroup": "performance.openshift.io", "reconcilerKind": "PerformanceProfile", "controller": "performanceprofile"}
2021-07-27T04:37:19.317Z	INFO	controller	Starting workers	{"reconcilerGroup": "performance.openshift.io", "reconcilerKind": "PerformanceProfile", "controller": "performanceprofile", "worker count": 1}
I0727 04:37:19.318148       1 performanceprofile_controller.go:216] Reconciling PerformanceProfile
E0727 04:37:19.318263       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 871 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x160b1a0, 0x26331e0)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
panic(0x160b1a0, 0x26331e0)
	/opt/rh/go-toolset-1.13/root/usr/lib/go-toolset-1.13-golang/src/runtime/panic.go:679 +0x1b2
github.com/openshift-kni/performance-addon-operators/pkg/controller/performanceprofile/components/profile.validateCPUCores(0xc000df3880, 0xc000040238, 0x13fdfa0)
	/remote-source/app/pkg/controller/performanceprofile/components/profile/profile.go:181 +0x2d
github.com/openshift-kni/performance-addon-operators/pkg/controller/performanceprofile/components/profile.ValidateParameters(0xc000ddbd40, 0x1805de5, 0x13)
	/remote-source/app/pkg/controller/performanceprofile/components/profile/profile.go:28 +0x4a
github.com/openshift-kni/performance-addon-operators/controllers.(*PerformanceProfileReconciler).Reconcile(0xc00077fb80, 0x0, 0x0, 0xc00044fd04, 0xb, 0xc000df37e0, 0xc000a72518, 0x1a31460, 0xc000a72510)
	/remote-source/app/controllers/performanceprofile_controller.go:271 +0x3c1
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000ab55f0, 0x166bd00, 0xc0002abd80, 0x0)
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235 +0x27d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000ab55f0, 0x0)
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:209 +0xcb
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc000ab55f0)
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:188 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000443f50)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000443f50, 0x1a1fcc0, 0xc0008fbb30, 0x1, 0xc00056a360)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000443f50, 0x3b9aca00, 0x0, 0xc000455d01, 0xc00056a360)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0xaa
k8s.io/apimachinery/pkg/util/wait.Until(0xc000443f50, 0x3b9aca00, 0xc00056a360)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:170 +0x431
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x139d52d]

goroutine 871 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x105
panic(0x160b1a0, 0x26331e0)
	/opt/rh/go-toolset-1.13/root/usr/lib/go-toolset-1.13-golang/src/runtime/panic.go:679 +0x1b2
github.com/openshift-kni/performance-addon-operators/pkg/controller/performanceprofile/components/profile.validateCPUCores(0xc000df3880, 0xc000040238, 0x13fdfa0)
	/remote-source/app/pkg/controller/performanceprofile/components/profile/profile.go:181 +0x2d
github.com/openshift-kni/performance-addon-operators/pkg/controller/performanceprofile/components/profile.ValidateParameters(0xc000ddbd40, 0x1805de5, 0x13)
	/remote-source/app/pkg/controller/performanceprofile/components/profile/profile.go:28 +0x4a
github.com/openshift-kni/performance-addon-operators/controllers.(*PerformanceProfileReconciler).Reconcile(0xc00077fb80, 0x0, 0x0, 0xc00044fd04, 0xb, 0xc000df37e0, 0xc000a72518, 0x1a31460, 0xc000a72510)
	/remote-source/app/controllers/performanceprofile_controller.go:271 +0x3c1
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000ab55f0, 0x166bd00, 0xc0002abd80, 0x0)
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235 +0x27d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000ab55f0, 0x0)
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:209 +0xcb
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc000ab55f0)
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:188 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000443f50)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000443f50, 0x1a1fcc0, 0xc0008fbb30, 0x1, 0xc00056a360)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000443f50, 0x3b9aca00, 0x0, 0xc000455d01, 0xc00056a360)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0xaa
k8s.io/apimachinery/pkg/util/wait.Until(0xc000443f50, 0x3b9aca00, 0xc00056a360)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:170 +0x431
~~~
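
The panic comes from validateCPUCores at profile.go:181 in the trace above, which crashes on a nil pointer when spec.cpu.reserved is omitted. A minimal, hypothetical Go sketch of this failure mode and the defensive fix; this is an illustration only, not the actual PAO source, and all names here are invented for the example:

~~~
package main

import "fmt"

// CPUSet mirrors the string-based cpuset type used in the
// PerformanceProfile API (name is illustrative).
type CPUSet string

// CPU mirrors spec.cpu: both fields are optional pointers in the CRD,
// so either may be nil when omitted from the manifest.
type CPU struct {
	Reserved *CPUSet
	Isolated *CPUSet
}

// validateCPUCores dereferences Reserved unconditionally; when only
// spec.cpu.isolated is set, Reserved is nil and this panics with
// "invalid memory address or nil pointer dereference".
func validateCPUCores(cpu *CPU) error {
	if *cpu.Reserved == "" { // nil dereference when reserved is omitted
		return fmt.Errorf("reserved CPUs must not be empty")
	}
	return nil
}

// validateCPUCoresFixed checks for nil first and returns a proper
// validation error instead of crashing the operator.
func validateCPUCoresFixed(cpu *CPU) error {
	if cpu.Reserved == nil || *cpu.Reserved == "" {
		return fmt.Errorf("spec.cpu.reserved: Required value")
	}
	return nil
}

func main() {
	isolated := CPUSet("0-32")
	cpu := &CPU{Isolated: &isolated} // Reserved omitted, as in the reproducer

	fmt.Println(validateCPUCoresFixed(cpu)) // validation error, no panic
	fmt.Println(validateCPUCores(cpu))      // panics: nil pointer dereference
}
~~~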

 - If reserved CPUs are defined in the PerformanceProfile, all CPUs other than the isolated ones are set to reserved as expected.

~~~
[performanceprofile]

apiVersion: performance.openshift.io/v1
kind: PerformanceProfile
metadata:
  name: performance
  namespace: openshift-operators
spec:
  cpu:
    isolated: 2-15
    reserved: 0-1  <------ just define 2 CPUs here.
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 30
      node: 0
      size: 1G
  realTimeKernel:
    enabled: true
  numa:
    topologyPolicy: restricted
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  runtimeClass: performance-performance
  tuned: openshift-cluster-node-tuning-operator/openshift-node-performance-performance

 Then we can see that CPUs 0-1 and 16-63 were set to reserved (via systemd.cpu_affinity in the kernel command line below).

[root@worker01 ~]# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-4f77d3fe15031a70b8025d96c85b25358c801e9557bed3a7a3657deed7faa062/vmlinuz-4.18.0-240.22.1.rt7.77.el8_3.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal ostree=/ostree/boot.0/rhcos/4f77d3fe15031a70b8025d96c85b25358c801e9557bed3a7a3657deed7faa062/0 root=UUID=a7f8b4cf-609e-4f60-8402-7ac75322ab24 rw rootflags=prjquota intel_iommu=on iommu=pt skew_tick=1 nohz=on rcu_nocbs=2-15 tuned.non_isolcpus=ffffffff,ffff0003 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,2-15 systemd.cpu_affinity=0,1,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63 default_hugepagesz=1G +
~~~


## Actual results:

 When spec.cpu.reserved is omitted, it is not set, and the operator crashes with the errors above.

## Expected results:
 
 spec.cpu.reserved should always be set correctly by default, even if only spec.cpu.isolated is defined in the PerformanceProfile.

Additional info:

Comment 1 Martin Sivák 2021-07-28 11:22:45 UTC
This is expected behavior. You are supposed to set both reserved and isolated in the PerformanceProfile. The sets must not overlap, and together they must cover all CPUs present on the workers in the targeted pool.
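
A minimal Go sketch of that constraint, using the k8s.io/utils/cpuset helpers (an assumed utility package; the operator's actual validation code differs):

~~~
package main

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

// validateReservedIsolated enforces the two rules from comment 1:
// the sets must not overlap, and their union must cover every
// online CPU on the target workers.
func validateReservedIsolated(reserved, isolated, online string) error {
	res, err := cpuset.Parse(reserved)
	if err != nil {
		return fmt.Errorf("parsing reserved: %w", err)
	}
	iso, err := cpuset.Parse(isolated)
	if err != nil {
		return fmt.Errorf("parsing isolated: %w", err)
	}
	all, err := cpuset.Parse(online)
	if err != nil {
		return fmt.Errorf("parsing online: %w", err)
	}

	if overlap := res.Intersection(iso); overlap.Size() > 0 {
		return fmt.Errorf("reserved and isolated overlap on CPUs %s", overlap)
	}
	if !res.Union(iso).Equals(all) {
		return fmt.Errorf("reserved + isolated (%s) do not cover all online CPUs (%s)",
			res.Union(iso), all)
	}
	return nil
}

func main() {
	// Valid: 0-1 reserved, 2-15 isolated on a 16-CPU worker.
	fmt.Println(validateReservedIsolated("0-1", "2-15", "0-15")) // <nil>
	// Invalid: the sets overlap on CPU 1.
	fmt.Println(validateReservedIsolated("0-1", "1-15", "0-15"))
}
~~~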

Comment 2 Xingbin Li 2021-07-29 01:06:50 UTC
(In reply to Martin Sivák from comment #1)
> This is expected behavior. You are supposed to set both reserved and
> isolated in PerformanceProfile. The sets must not overlap and the sum of all
> the cpus mentioned must cover all the cpus expected on the workers in the
> targeted pool.


Martin, Thanks for your answer !

According to https://github.com/openshift-kni/performance-addon-operators/blob/master/docs/performance_profile.md#cpu, 

Field       Scheme      Required
reserved    *CPUSet     false
isolated    CPUSet      true

it seems the 'reserved' field is not 'required' in the PerformanceProfile, so do you mean that 'reserved' should also be 'required'?

Please kindly correct me if my understanding is wrong

Thanks

Comment 4 Martin Sivák 2021-11-08 09:09:48 UTC
Mario, let's double-check our CRD metadata. This is mostly about https://github.com/openshift-kni/performance-addon-operators/blob/master/deploy/olm-catalog/performance-addon-operator/4.10.0/performance.openshift.io_performanceprofiles_crd.yaml and all its instances upstream and downstream. Yanir should be able to help you.
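
With controller-gen, a CRD field is typically made required by removing the +optional marker and the omitempty JSON tag from the Go API type. A hypothetical sketch of how the CPU type could be annotated to produce the required list seen in comment 8; the actual PAO type definitions may differ:

~~~
// Package v2 sketch: how kubebuilder markers drive the generated
// CRD's required list. Illustrative, not the actual PAO source.
package v2

// CPUSet is a string in cpuset format, e.g. "0-1,16-63".
type CPUSet string

// CPU defines a set of CPU related parameters.
type CPU struct {
	// Reserved defines a set of CPUs that will not be used for any
	// container workloads initiated by kubelet. No +optional marker
	// and no omitempty tag: controller-gen adds "reserved" to the
	// CRD's required list.
	Reserved *CPUSet `json:"reserved"`

	// Isolated defines a set of CPUs used to give application
	// threads the most execution time possible. Also required.
	Isolated *CPUSet `json:"isolated"`

	// BalanceIsolated toggles whether the isolated CPU set is
	// eligible for load balancing workloads. Defaults to "true".
	// +optional
	BalanceIsolated *bool `json:"balanceIsolated,omitempty"`
}
~~~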

Comment 8 Shereen Haj Makhoul 2022-02-22 15:59:10 UTC
Verification:

version: 
ocp: Server Version: 4.10.0-0.nightly-2022-01-27-104747
pao: performance-addon-operator-container-v4.10.0-29

Steps:

-Installed PAO:
[root@cnfdf05-installer performance]# oc get csv 
NAME                                        DISPLAY                      VERSION             REPLACES   PHASE
performance-addon-operator.v4.10.0          Performance Addon Operator   4.10.0                         Succeeded
[root@cnfdf05-installer performance]# oc describe csv performance-addon-operator.v4.10.0 | grep Image 
              containerImage:
          f:containerImage:
        f:relatedImages:
                Image:             registry.redhat.io/openshift4/performance-addon-rhel8-operator@sha256:767ff13075b2f503afb6f26e265282488b79acaf22cbf1c77055c3e008fdda8d
  Related Images:
    Image:  registry.redhat.io/openshift4/performance-addon-rhel8-operator@sha256:767ff13075b2f503afb6f26e265282488b79acaf22cbf1c77055c3e008fdda8d
    Image:  registry.redhat.io/openshift4/performance-addon-rhel8-operator@sha256:767ff13075b2f503afb6f26e265282488b79acaf22cbf1c77055c3e008fdda8d

# cat pp.yaml 
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    isolated: "3-5"
  realTimeKernel:
    enabled: true
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""

# oc apply -f pp.yaml
The PerformanceProfile "performance" is invalid: spec.cpu.reserved: Required value

# oc get performanceprofile 
No resources found

# oc describe crd performanceprofiles
...
 Cpu:
                Description:  CPU defines a set of CPU related parameters.
                Properties:
                  Balance Isolated:
                    Description:  BalanceIsolated toggles whether or not the Isolated CPU set is eligible for load balancing work loads. When this option is set to "false", the Isolated CPU set will be static, meaning workloads have to explicitly assign each thread to a specific cpu in order to work across multiple CPUs. Setting this to "true" allows workloads to be balanced across CPUs. Setting this to "false" offers the most predictable performance for guaranteed workloads, but it offloads the complexity of cpu load balancing to the application. Defaults to "true"
                    Type:         boolean
                  Isolated:
                    Description:  Isolated defines a set of CPUs that will be used to give to application threads the most execution time possible, which means removing as many extraneous tasks off a CPU as possible. It is important to notice the CPU manager can choose any CPU to run the workload except the reserved CPUs. In order to guarantee that your workload will run on the isolated CPU:   1. The union of reserved CPUs and isolated CPUs should include all online CPUs   2. The isolated CPUs field should be the complementary to reserved CPUs field
                    Type:         string
                  Reserved:
                    Description:  Reserved defines a set of CPUs that will not be used for any container workloads initiated by kubelet.
                    Type:         string
                Required:
                  isolated
                  reserved   <-------

Verified successfully.

Comment 10 errata-xmlrpc 2022-03-10 19:34:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10 low-latency extras update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:0640

