Bug 1880282

Summary: performance-addon-operator should be part of control plane, and run on master nodes
Product: OpenShift Container Platform
Reporter: Francesco Romani <fromani>
Component: Performance Addon Operator
Assignee: Marcel Apfelbaum <mapfelba>
Status: CLOSED ERRATA
QA Contact: Gowrishankar Rajaiyan <grajaiya>
Severity: high
Priority: urgent
Version: 4.6
CC: aos-bugs, fromani, grajaiya, mapfelba, marcel, rolove
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Doc Type: If docs needed, set a value
Last Closed: 2020-10-27 16:42:20 UTC
Type: Bug

Description Francesco Romani 2020-09-18 07:55:09 UTC
Description of problem:
The performance-addon-operator should be considered part of the control plane and should run on the master nodes. Currently nothing enforces this, and we have observed the operator running on worker nodes.


Version-Release number of selected component (if applicable):
<= 4.6


How reproducible:
100%


Steps to Reproduce:
1. Install any released version of performance-addon-operator.

Actual results:
The operator runs on worker nodes

Expected results:
The operator should run on master nodes.

Comment 1 Robert Love 2020-09-18 14:58:29 UTC
If we cannot guarantee the cores that we promise for low-latency workloads, then this is a problem we should fix for the release.

Comment 2 Marcel Apfelbaum 2020-09-21 06:54:48 UTC
PAO will use node affinity (https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity) to ensure it is scheduled to master nodes.

Comment 5 Gowrishankar Rajaiyan 2020-10-01 05:02:25 UTC
The test cluster setup has 6 nodes (3 masters & 3 workers)

[root@dell-r730-028 ~]# oc get node
NAME       STATUS   ROLES               AGE   VERSION
master-0   Ready    master              13h   v1.19.0+b4ffb45
master-1   Ready    master              13h   v1.19.0+b4ffb45
master-2   Ready    master              13h   v1.19.0+b4ffb45
worker-0   Ready    worker,worker-cnf   13h   v1.19.0+b4ffb45
worker-1   Ready    worker,worker-cnf   13h   v1.19.0+b4ffb45
worker-2   Ready    worker              13h   v1.19.0+b4ffb45
[root@dell-r730-028 ~]#


When the performance-operator is installed, it respects the nodeAffinity defined in its operator spec:

<snip>
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-role.kubernetes.io/master
            operator: Exists
</snip>
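
To illustrate how this term selects nodes: the `Exists` operator matches any node that carries the label key, whatever its value. Below is a minimal Python sketch of that matching rule, not the scheduler's actual code. The node labels are an assumption inferred from the ROLES column of `oc get node` above, and the helper names are hypothetical. The `Gt`/`Lt` operators and `matchFields` are left out.

```python
# Hypothetical model of the six test-cluster nodes. The label keys are an
# assumption inferred from the ROLES column of `oc get node`.
MASTER = "node-role.kubernetes.io/master"
WORKER = "node-role.kubernetes.io/worker"
WORKER_CNF = "node-role.kubernetes.io/worker-cnf"

NODES = {
    "master-0": {MASTER: ""},
    "master-1": {MASTER: ""},
    "master-2": {MASTER: ""},
    "worker-0": {WORKER: "", WORKER_CNF: ""},
    "worker-1": {WORKER: "", WORKER_CNF: ""},
    "worker-2": {WORKER: ""},
}

# The nodeSelectorTerms from the operator spec, written as plain Python data.
NODE_SELECTOR_TERMS = [
    {"matchExpressions": [{"key": MASTER, "operator": "Exists"}]},
]


def expression_matches(labels, expr):
    """Evaluate one matchExpressions entry against a node's labels."""
    key = expr["key"]
    operator = expr["operator"]
    values = expr.get("values", [])
    if operator == "Exists":
        return key in labels
    if operator == "DoesNotExist":
        return key not in labels
    if operator == "In":
        return key in labels and labels[key] in values
    if operator == "NotIn":
        return key not in labels or labels[key] not in values
    raise ValueError(f"operator not covered by this sketch: {operator}")


def node_is_feasible(labels, terms):
    """Terms are ORed; the expressions inside one term are ANDed."""
    return any(
        all(expression_matches(labels, expr) for expr in term["matchExpressions"])
        for term in terms
    )


feasible = [
    name for name, labels in NODES.items()
    if node_is_feasible(labels, NODE_SELECTOR_TERMS)
]
print(f"evaluatedNodes={len(NODES)} feasibleNodes={len(feasible)} {feasible}")
```

Under these assumed labels, only the three master nodes pass the filter.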

Based on this nodeAffinity, the kube-scheduler evaluates all 6 nodes and finds only the 3 master nodes feasible for scheduling the pod. The openshift-kube-scheduler log below shows this:

[root@dell-r730-028 ~]# oc logs openshift-kube-scheduler-master-1 -n openshift-kube-scheduler | grep performance-operator
I0930 17:52:21.386853       1 scheduler.go:597] "Successfully bound pod to node" pod="openshift-performance-addon/performance-operator-d964d967f-rbw24" node="master-2" evaluatedNodes=6 feasibleNodes=3
[root@dell-r730-028 ~]#
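
When checking several pods, the `key=value` fields of such a log line can be extracted programmatically. The following Python sketch parses the line shown above. The regular expression and the `parse_fields` helper are illustrative assumptions based on this single line, not a documented log format.

```python
import re

# The scheduler log line quoted above, copied verbatim.
LOG_LINE = (
    'I0930 17:52:21.386853       1 scheduler.go:597] '
    '"Successfully bound pod to node" '
    'pod="openshift-performance-addon/performance-operator-d964d967f-rbw24" '
    'node="master-2" evaluatedNodes=6 feasibleNodes=3'
)

# Matches key="quoted value" or key=bare_value pairs.
FIELD_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def parse_fields(line):
    """Return the key=value pairs found in a log line as a dict of strings."""
    fields = {}
    for key, quoted, bare in FIELD_RE.findall(line):
        fields[key] = quoted if quoted else bare
    return fields


fields = parse_fields(LOG_LINE)
namespace, pod_name = fields["pod"].split("/", 1)
evaluated = int(fields["evaluatedNodes"])
feasible = int(fields["feasibleNodes"])

print(f"{pod_name} (namespace {namespace}) bound to {fields['node']}")
print(f"{feasible} of {evaluated} nodes were feasible")
```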


[root@dell-r730-028 ~]# oc get pod -o wide
NAME                                   READY   STATUS    RESTARTS   AGE   IP           NODE       NOMINATED NODE   READINESS GATES
performance-operator-d964d967f-rbw24   1/1     Running   0          10h   10.129.0.7   master-2   <none>           <none>
[root@dell-r730-028 ~]#


This verifies that the Performance Addon Operator is deployed on a master node, as expected.
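
The same check can be scripted by cross-referencing the two command outputs above. This is a rough Python sketch that parses the tabular text exactly as pasted in this comment. The `parse_table` helper is hypothetical and assumes columns are separated by at least two spaces. For anything beyond a quick check, the structured output of `oc get ... -o json` would be a more robust input than the table.

```python
import re

# Output of `oc get node`, as pasted above.
NODE_OUTPUT = """\
NAME       STATUS   ROLES               AGE   VERSION
master-0   Ready    master              13h   v1.19.0+b4ffb45
master-1   Ready    master              13h   v1.19.0+b4ffb45
master-2   Ready    master              13h   v1.19.0+b4ffb45
worker-0   Ready    worker,worker-cnf   13h   v1.19.0+b4ffb45
worker-1   Ready    worker,worker-cnf   13h   v1.19.0+b4ffb45
worker-2   Ready    worker              13h   v1.19.0+b4ffb45
"""

# Output of `oc get pod -o wide`, as pasted above.
POD_OUTPUT = """\
NAME                                   READY   STATUS    RESTARTS   AGE   IP           NODE       NOMINATED NODE   READINESS GATES
performance-operator-d964d967f-rbw24   1/1     Running   0          10h   10.129.0.7   master-2   <none>           <none>
"""


def parse_table(text):
    """Split a column-aligned table on runs of two or more spaces."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    header = re.split(r"\s{2,}", lines[0])
    return [dict(zip(header, re.split(r"\s{2,}", line))) for line in lines[1:]]


nodes = parse_table(NODE_OUTPUT)
pods = parse_table(POD_OUTPUT)

# A node counts as a master if "master" appears in its ROLES column.
masters = {row["NAME"] for row in nodes if "master" in row["ROLES"].split(",")}

for pod in pods:
    placement = "master" if pod["NODE"] in masters else "NOT a master"
    print(f"{pod['NAME']} runs on {pod['NODE']} ({placement})")
```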

Comment 8 errata-xmlrpc 2020-10-27 16:42:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196