Bug 1880282

Summary:	performance-addon-operator should be part of control plane, and run on master nodes
Product:	OpenShift Container Platform	Reporter:	Francesco Romani <fromani>
Component:	Performance Addon Operator	Assignee:	Marcel Apfelbaum <mapfelba>
Status:	CLOSED ERRATA	QA Contact:	Gowrishankar Rajaiyan <grajaiya>
Severity:	high	Docs Contact:
Priority:	urgent
Version:	4.6	CC:	aos-bugs, fromani, grajaiya, mapfelba, marcel, rolove
Target Milestone:	---
Target Release:	4.6.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-10-27 16:42:20 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---

Description Francesco Romani 2020-09-18 07:55:09 UTC

Description of problem:
The performance-addon-operator should be considered part of the control plane and should run on the master nodes. Currently this placement is not enforced, and we have observed the operator running on worker nodes.


Version-Release number of selected component (if applicable):
<= 4.6


How reproducible:
100%


Steps to Reproduce:
1. Install any released version of performance-addon-operator.

Actual results:
The operator runs on worker nodes.

Expected results:
The operator should run on master nodes.

Comment 1 Robert Love 2020-09-18 14:58:29 UTC

If we cannot guarantee the cores that we promise for low-latency workloads, then that is a problem we should fix for the release.

Comment 2 Marcel Apfelbaum 2020-09-21 06:54:48 UTC

PAO will use node affinity (https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity) to ensure that it is scheduled on master nodes.

Comment 3 Marcel Apfelbaum 2020-09-21 11:50:17 UTC

PR: https://github.com/openshift-kni/performance-addon-operators/pull/351

Comment 5 Gowrishankar Rajaiyan 2020-10-01 05:02:25 UTC

The test cluster has 6 nodes (3 masters and 3 workers):

[root@dell-r730-028 ~]# oc get node
NAME       STATUS   ROLES               AGE   VERSION
master-0   Ready    master              13h   v1.19.0+b4ffb45
master-1   Ready    master              13h   v1.19.0+b4ffb45
master-2   Ready    master              13h   v1.19.0+b4ffb45
worker-0   Ready    worker,worker-cnf   13h   v1.19.0+b4ffb45
worker-1   Ready    worker,worker-cnf   13h   v1.19.0+b4ffb45
worker-2   Ready    worker              13h   v1.19.0+b4ffb45
[root@dell-r730-028 ~]#


When the performance-operator is installed, it respects the nodeAffinity defined in its operator spec:

<snip>
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-role.kubernetes.io/master
            operator: Exists
</snip>
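
To illustrate how an "Exists" match expression narrows the candidate nodes, here is a minimal Python sketch. It is only a model of the filtering rule, not the kube-scheduler's code, and it ignores every other scheduling filter (taints, resources, and so on). The node labels are an assumption derived from the ROLES column of the `oc get node` output above; the helper names `matches` and `feasible_nodes` are hypothetical.

```python
# Illustrative model of node-affinity filtering; not the scheduler's implementation.
MASTER_KEY = "node-role.kubernetes.io/master"

# Assumed labels, inferred from the ROLES column of `oc get node`.
NODES = {
    "master-0": {"node-role.kubernetes.io/master": ""},
    "master-1": {"node-role.kubernetes.io/master": ""},
    "master-2": {"node-role.kubernetes.io/master": ""},
    "worker-0": {"node-role.kubernetes.io/worker": "",
                 "node-role.kubernetes.io/worker-cnf": ""},
    "worker-1": {"node-role.kubernetes.io/worker": "",
                 "node-role.kubernetes.io/worker-cnf": ""},
    "worker-2": {"node-role.kubernetes.io/worker": ""},
}

# The nodeSelectorTerms from the operator spec, written as Python data.
NODE_SELECTOR_TERMS = [
    {"matchExpressions": [{"key": MASTER_KEY, "operator": "Exists"}]},
]


def matches(labels, expr):
    """Evaluate one match expression; only 'Exists' is modelled here."""
    if expr["operator"] == "Exists":
        return expr["key"] in labels
    raise NotImplementedError(expr["operator"])


def feasible_nodes(nodes, terms):
    """Terms are ORed together; expressions inside one term are ANDed."""
    return sorted(
        name
        for name, labels in nodes.items()
        if any(all(matches(labels, e) for e in t["matchExpressions"])
               for t in terms)
    )


print(feasible_nodes(NODES, NODE_SELECTOR_TERMS))
# ['master-0', 'master-1', 'master-2']
```

With the six nodes of this test cluster, the sketch leaves exactly the three masters as candidates.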

Based on the nodeAffinity, the kube-scheduler evaluates all 6 nodes and finds only the 3 masters feasible for the pod. The openshift-kube-scheduler log below shows this:

[root@dell-r730-028 ~]# oc logs openshift-kube-scheduler-master-1 -n openshift-kube-scheduler | grep performance-operator
I0930 17:52:21.386853       1 scheduler.go:597] "Successfully bound pod to node" pod="openshift-performance-addon/performance-operator-d964d967f-rbw24" node="master-2" evaluatedNodes=6 feasibleNodes=3
[root@dell-r730-028 ~]#
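
For anyone who wants to check the same fields in a script rather than by eye, the following Python sketch pulls the key=value pairs out of the log line above. The regular expression and the `parse_binding` helper are hypothetical, and the sketch assumes the log format stays exactly as shown in this comment.

```python
import re

# The scheduler log line from this comment, copied verbatim.
LOG_LINE = (
    'I0930 17:52:21.386853       1 scheduler.go:597] '
    '"Successfully bound pod to node" '
    'pod="openshift-performance-addon/performance-operator-d964d967f-rbw24" '
    'node="master-2" evaluatedNodes=6 feasibleNodes=3'
)

# Matches key="quoted value" as well as key=bare_value.
FIELD_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def parse_binding(line):
    """Return the key=value fields of a binding log line as a dict."""
    fields = {}
    for key, quoted, bare in FIELD_RE.findall(line):
        fields[key] = quoted or bare
    # The two node counts are numeric; convert them when present.
    for key in ("evaluatedNodes", "feasibleNodes"):
        if key in fields:
            fields[key] = int(fields[key])
    return fields


info = parse_binding(LOG_LINE)
print(info["node"], info["evaluatedNodes"], info["feasibleNodes"])
# master-2 6 3
```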


[root@dell-r730-028 ~]# oc get pod -o wide
NAME                                   READY   STATUS    RESTARTS   AGE   IP           NODE       NOMINATED NODE   READINESS GATES
performance-operator-d964d967f-rbw24   1/1     Running   0          10h   10.129.0.7   master-2   <none>           <none>
[root@dell-r730-028 ~]#


This verifies that the Performance Addon Operator is deployed on a master node, as expected.
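
The manual check could also be scripted. The Python sketch below assumes its inputs are the parsed JSON lists that `oc get node -o json` and `oc get pod -n openshift-performance-addon -o json` would return; the sample data here is hand-written and trimmed to the fields used, and `pods_off_masters` is a hypothetical helper name.

```python
MASTER_KEY = "node-role.kubernetes.io/master"


def pods_off_masters(pod_list, node_list):
    """Return names of pods whose spec.nodeName is not a master node."""
    masters = {
        node["metadata"]["name"]
        for node in node_list["items"]
        if MASTER_KEY in node["metadata"].get("labels", {})
    }
    return [
        pod["metadata"]["name"]
        for pod in pod_list["items"]
        if pod["spec"].get("nodeName") not in masters
    ]


# Hand-written sample data (assumption), trimmed to the fields read above.
sample_nodes = {"items": [
    {"metadata": {"name": "master-2", "labels": {MASTER_KEY: ""}}},
    {"metadata": {"name": "worker-0",
                  "labels": {"node-role.kubernetes.io/worker": ""}}},
]}
sample_pods = {"items": [
    {"metadata": {"name": "performance-operator-d964d967f-rbw24"},
     "spec": {"nodeName": "master-2"}},
]}

print(pods_off_masters(sample_pods, sample_nodes))
# []
```

An empty list means every operator pod in the input sits on a master node.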

Comment 8 errata-xmlrpc 2020-10-27 16:42:20 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196