Bug 1891856

Summary: ocs-metrics-exporter pod should have tolerations for OCS taint
Product: [Red Hat Storage] Red Hat OpenShift Container Storage Reporter: Neha Berry <nberry>
Component: ocs-operator    Assignee: Jose A. Rivera <jarrpa>
Status: CLOSED ERRATA QA Contact: Shrivaibavi Raghaventhiran <sraghave>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.6    CC: ebenahar, madam, muagarwa, ocs-bugs, sostapov, uchapaga
Target Milestone: ---    Keywords: AutomationBackLog
Target Release: OCS 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.6.0-149.ci Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-12-17 06:25:20 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Neha Berry 2020-10-27 14:20:05 UTC
Description of problem:
-----------------------------
OCS 4.6: The newly added pod "ocs-metrics-exporter-xxx" should have a toleration specified for the OCS taint, similar to the other OCS and operator pods in the openshift-storage namespace.

Current tolerations in the ocs-metrics-exporter pod
=============================
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists

P.S.: The spec -> tolerations section does not even exist in the deployment.apps of the above pod, since no custom toleration has been added to it.
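
For illustration, this is where the tolerations would sit in the deployment manifest once added; a minimal sketch with the surrounding fields abbreviated (container spec and labels omitted):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ocs-metrics-exporter
  namespace: openshift-storage
spec:
  template:
    spec:
      tolerations:
      # OCS taint toleration, currently missing from this deployment
      - effect: NoSchedule
        key: node.ocs.openshift.io/storage
        operator: Equal
        value: "true"
```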

OCS-specific toleration from the rook-ceph-operator pod
----------------------------

 tolerations:
  - effect: NoSchedule
    key: node.ocs.openshift.io/storage
    operator: Equal
    value: "true"
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300



Version-Release number of selected component (if applicable):
-------------------------------------------------------------
OCS = 4.6.0-144.ci 
OCP = 4.6.0-0.nightly-2020-10-22-034051

How reproducible:
==================
Always

Steps to Reproduce:
----------------------
1. Install the OCS 4.6 operator from Operator Hub
2. Check the pods. The following pods will be running (even before the storage cluster creation stage):
   noobaa-operator, rook-ceph-operator, ocs-operator and ocs-metrics-exporter (new in OCS 4.6)

3. Check the tolerations added to the new ocs-metrics-exporter pod. It does not include the OCS taint-specific toleration.



Actual results:
----------------------
The OCS taint-related toleration is absent

Expected results:
---------------------
The following should be added under tolerations for the ocs-metrics-exporter pod and its deployment.apps:

  - effect: NoSchedule
    key: node.ocs.openshift.io/storage
    operator: Equal
    value: "true"


Additional info:
-----------------

Tue Oct 27 14:07:38 UTC 2020
--------------
========CSV ======
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.6.0-144.ci   OpenShift Container Storage   4.6.0-144.ci              Succeeded
--------------
=======PODS ======
NAME                                    READY   STATUS    RESTARTS   AGE   IP             NODE        NOMINATED NODE   READINESS GATES
noobaa-operator-f7789cf94-wp74l         1/1     Running   0          52s   10.131.1.213   compute-1   <none>           <none>
ocs-metrics-exporter-576f474c87-9r7bv   1/1     Running   0          52s   10.129.3.104   compute-2   <none>           <none>
ocs-operator-686fd84dd7-6l45s           1/1     Running   0          52s   10.129.3.102   compute-2   <none>           <none>
rook-ceph-operator-7558fcf89c-wmjr4     1/1     Running   0          52s   10.129.3.103   compute-2   <none>           <none>



$ oc get all
NAME                                        READY   STATUS    RESTARTS   AGE
pod/noobaa-operator-f7789cf94-wp74l         1/1     Running   0          74s
pod/ocs-metrics-exporter-576f474c87-9r7bv   1/1     Running   0          74s
pod/ocs-operator-686fd84dd7-6l45s           1/1     Running   0          74s
pod/rook-ceph-operator-7558fcf89c-wmjr4     1/1     Running   0          74s

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/noobaa-operator        1/1     1            1           76s
deployment.apps/ocs-metrics-exporter   1/1     1            1           76s
deployment.apps/ocs-operator           1/1     1            1           76s
deployment.apps/rook-ceph-operator     1/1     1            1           76s

NAME                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/noobaa-operator-f7789cf94         1         1         1       77s
replicaset.apps/ocs-metrics-exporter-576f474c87   1         1         1       77s
replicaset.apps/ocs-operator-686fd84dd7           1         1         1       77s
replicaset.apps/rook-ceph-operator-7558fcf89c     1         1         1       77s
[nberry@localhost oct27-144.ci]$

Comment 5 Shrivaibavi Raghaventhiran 2020-11-04 10:09:16 UTC
Tested on an infra-nodes setup:

The ocs-metrics-exporter pod had the toleration but was running on a non-OCS node.
I respinned the ocs-metrics-exporter pod; it still runs on the same node and did not migrate to the infra nodes.
Since the ocs-metrics-exporter pod has the toleration for the OCS taint, my expectation was that it would run on the infra nodes after a respin.

If the above is not expected, please clarify other ways to verify the behavior. Raising a needinfo on the same.
@neha @umanga

Versions:
----------
4.6.0-0.nightly-2020-10-14-095718
ocs-operator.v4.6.0-152.ci

Console output:
----------------
$ oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES                     PHASE
ocs-operator.v4.6.0-152.ci   OpenShift Container Storage   4.6.0-152.ci   ocs-operator.v4.6.0-144.ci   Succeeded

$ oc get nodes --show-labels | grep ocs
compute-0         Ready    infra,worker   20d   v1.19.0+d59ce34   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-0,kubernetes.io/os=linux,node-role.kubernetes.io/infra=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack2
compute-1         Ready    infra,worker   20d   v1.19.0+d59ce34   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-1,kubernetes.io/os=linux,node-role.kubernetes.io/infra=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack0
compute-2         Ready    infra,worker   20d   v1.19.0+d59ce34   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-2,kubernetes.io/os=linux,node-role.kubernetes.io/infra=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack1

ocs-metrics-exporter
==============
            f:tolerations: {}
    manager: olm
    operation: Update
    time: "2020-11-03T10:23:35Z"
  name: ocs-metrics-exporter
  namespace: openshift-storage
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: false
    kind: ClusterServiceVersion
    name: ocs-operator.v4.6.0-152.ci
    uid: 5b89c4e1-b273-4dac-83f1-698db1184a1f
  resourceVersion: "28789890"
  selfLink: /apis/apps/v1/namespaces/openshift-storage/deployments/ocs-metrics-exporter
  uid: 6ff5e5ca-c57d-4e0f-8ac9-db487c29d787
spec:
--
      tolerations:
      - effect: NoSchedule
        key: node.ocs.openshift.io/storage
        operator: Equal
        value: "true"
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2020-10-28T07:23:45Z"
    lastUpdateTime: "2020-10-28T07:23:45Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2020-10-15T09:04:03Z"
    lastUpdateTime: "2020-11-03T10:23:02Z"
    message: ReplicaSet "ocs-metrics-exporter-6d9867695b" has successfully progressed.
 
$ oc get nodes
NAME              STATUS   ROLES          AGE   VERSION
compute-0         Ready    infra,worker   19d   v1.19.0+d59ce34
compute-1         Ready    infra,worker   19d   v1.19.0+d59ce34
compute-2         Ready    infra,worker   19d   v1.19.0+d59ce34
compute-3         Ready    worker         19d   v1.19.0+d59ce34
compute-4         Ready    worker         19d   v1.19.0+d59ce34
compute-5         Ready    worker         19d   v1.19.0+d59ce34
control-plane-0   Ready    master         19d   v1.19.0+d59ce34
control-plane-1   Ready    master         19d   v1.19.0+d59ce34
control-plane-2   Ready    master         19d   v1.19.0+d59ce34
(python38) [sraghave@localhost ~]$
(python38) [sraghave@localhost ~]$ oc get pods -n openshift-storage -o wide | grep ocs-metrics
ocs-metrics-exporter-6d9867695b-f4gft                             1/1     Running   0          21h     10.128.3.130   compute-4   <none>           <none>
 
$ oc delete pod ocs-metrics-exporter-6d9867695b-f4gft -n openshift-storage
pod "ocs-metrics-exporter-6d9867695b-f4gft" deleted
 
$ oc get pods -n openshift-storage -o wide | grep ocs-metrics
ocs-metrics-exporter-6d9867695b-q2bqg                             1/1     Running   0          39s     10.128.2.89    compute-4   <none>           <none>

Comment 6 umanga 2020-11-04 15:46:32 UTC
This is expected; that is all taints and tolerations can do.
The pod could run on the infra nodes, but if we want to ensure that it does, we need node affinities, and that is a different issue.
This BZ is verified as per comment 5.
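
To actually pin the exporter to the OCS/infra nodes, the deployment would additionally need a node affinity (or node selector) on the OCS node label; a hedged sketch, assuming the cluster.ocs.openshift.io/openshift-storage label shown in the node listing in comment 5:

```yaml
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          # Require scheduling onto nodes carrying the OCS storage label
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage
                operator: Exists
```

Tolerations only permit scheduling onto tainted nodes; an affinity like the above is what would force it, which is why it is tracked as a separate issue.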

Comment 7 Shrivaibavi Raghaventhiran 2020-11-05 12:13:30 UTC
Test environment:
-------------------
Infra-labelled and OCS-tainted nodes

Test steps:
-----------
1. The ocs-metrics-exporter pod was running on a non-OCS node
2. Cordoned the non-OCS workers
3. Respinned the ocs-metrics-exporter pod
4. The ocs-metrics-exporter pod started running on an OCS node

Console output:
---------------
$ oc get pods -n openshift-storage -o wide | grep ocs-metrics
ocs-metrics-exporter-6d9867695b-q2bqg                             1/1     Running   0          28h     10.128.2.89    compute-4   <none>           <none>

$ oc delete pod ocs-metrics-exporter-6d9867695b-q2bqg -n openshift-storage
pod "ocs-metrics-exporter-6d9867695b-q2bqg" deleted

$ oc get pods -n openshift-storage -o wide | grep ocs-metrics
ocs-metrics-exporter-6d9867695b-6cscf                             1/1     Running   0          18s     10.131.0.28    compute-0   <none>           <none>

$ oc get nodes
NAME              STATUS                     ROLES          AGE   VERSION
compute-0         Ready                      infra,worker   21d   v1.19.0+d59ce34
compute-1         Ready                      infra,worker   21d   v1.19.0+d59ce34
compute-2         Ready                      infra,worker   21d   v1.19.0+d59ce34
compute-3         Ready,SchedulingDisabled   worker         21d   v1.19.0+d59ce34
compute-4         Ready,SchedulingDisabled   worker         21d   v1.19.0+d59ce34
compute-5         Ready,SchedulingDisabled   worker         21d   v1.19.0+d59ce34
control-plane-0   Ready                      master         21d   v1.19.0+d59ce34
control-plane-1   Ready                      master         21d   v1.19.0+d59ce34
control-plane-2   Ready                      master         21d   v1.19.0+d59ce34


With the above verifications, and based on comments #5 and #6, moving this BZ to the verified state.

Comment 10 errata-xmlrpc 2020-12-17 06:25:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605