Description of problem:
-----------------------------
OCS 4.6: The newly added pod "ocs-metrics-exporter-xxx" should have a toleration specified for the OCS taint, similar to the other OCS pods and operator pods in the openshift-storage namespace.

Current tolerations in the ocs-metrics-exporter pod
=============================
tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists

P.S.: The spec -> tolerations section doesn't even exist in the deployment.apps of the above pod, since no custom toleration was added to it (the tolerations shown on the pod are the defaults injected by Kubernetes).

OCS-specific toleration from the rook-ceph-operator pod
----------------------------
tolerations:
- effect: NoSchedule
  key: node.ocs.openshift.io/storage
  operator: Equal
  value: "true"
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 300

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
OCS = 4.6.0-144.ci
OCP = 4.6.0-0.nightly-2020-10-22-034051

How reproducible:
==================
Always

Steps to Reproduce:
----------------------
1. Install the OCS 4.6 operator from Operator Hub.
2. Check the pods. The following pods should be running (even before the storage cluster creation stage): noobaa-operator, rook-ceph-operator, ocs-operator, and ocs-metrics-exporter (new in OCS 4.6).
3. Check the tolerations added to the new ocs-metrics-exporter pod; they do not include the OCS taint-specific toleration.

Actual results:
----------------------
The OCS taint-related toleration is absent.

Expected results:
---------------------
The following should be added under tolerations for the ocs-metrics-exporter pod and deployment.apps:
- effect: NoSchedule
  key: node.ocs.openshift.io/storage
  operator: Equal
  value: "true"

Additional info:
-----------------
Tue Oct 27 14:07:38 UTC 2020
--------------
======== CSV ======
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.6.0-144.ci   OpenShift Container Storage   4.6.0-144.ci              Succeeded
--------------
======= PODS ======
NAME                                    READY   STATUS    RESTARTS   AGE   IP             NODE        NOMINATED NODE   READINESS GATES
noobaa-operator-f7789cf94-wp74l         1/1     Running   0          52s   10.131.1.213   compute-1   <none>           <none>
ocs-metrics-exporter-576f474c87-9r7bv   1/1     Running   0          52s   10.129.3.104   compute-2   <none>           <none>
ocs-operator-686fd84dd7-6l45s           1/1     Running   0          52s   10.129.3.102   compute-2   <none>           <none>
rook-ceph-operator-7558fcf89c-wmjr4     1/1     Running   0          52s   10.129.3.103   compute-2   <none>           <none>

$ oc get all
NAME                                        READY   STATUS    RESTARTS   AGE
pod/noobaa-operator-f7789cf94-wp74l         1/1     Running   0          74s
pod/ocs-metrics-exporter-576f474c87-9r7bv   1/1     Running   0          74s
pod/ocs-operator-686fd84dd7-6l45s           1/1     Running   0          74s
pod/rook-ceph-operator-7558fcf89c-wmjr4     1/1     Running   0          74s

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/noobaa-operator        1/1     1            1           76s
deployment.apps/ocs-metrics-exporter   1/1     1            1           76s
deployment.apps/ocs-operator           1/1     1            1           76s
deployment.apps/rook-ceph-operator     1/1     1            1           76s

NAME                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/noobaa-operator-f7789cf94         1         1         1       77s
replicaset.apps/ocs-metrics-exporter-576f474c87   1         1         1       77s
replicaset.apps/ocs-operator-686fd84dd7           1         1         1       77s
replicaset.apps/rook-ceph-operator-7558fcf89c     1         1         1       77s
[nberry@localhost oct27-144.ci]$
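For anyone triaging before the fix lands: the missing toleration could in principle be added to the deployment by hand. This is only a sketch of a temporary workaround, not the shipped fix, and since the deployment is owned by the CSV, OLM may revert a manual edit; the deployment name and namespace come from the output above:

$ oc patch deployment ocs-metrics-exporter -n openshift-storage --type=json \
    -p='[{"op": "add", "path": "/spec/template/spec/tolerations", "value": [{"effect": "NoSchedule", "key": "node.ocs.openshift.io/storage", "operator": "Equal", "value": "true"}]}]'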
Tested on an infra nodes setup: The ocs-metrics-exporter pod now has the toleration, but it was running on a non-OCS node. I respun the ocs-metrics-exporter pod and it still runs on the same node; it was not migrated to the infra nodes. Since the pod tolerates the OCS taint, my expectation was that it would run on the infra nodes after a respin. If the above is not expected, please clarify other ways to verify this behavior. Raising a needinfo on the same. @neha @umanga

Versions:
----------
4.6.0-0.nightly-2020-10-14-095718
ocs-operator.v4.6.0-152.ci

Console output:
----------------
$ oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES                     PHASE
ocs-operator.v4.6.0-152.ci   OpenShift Container Storage   4.6.0-152.ci   ocs-operator.v4.6.0-144.ci   Succeeded

$ oc get nodes --show-labels | grep ocs
compute-0   Ready   infra,worker   20d   v1.19.0+d59ce34   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-0,kubernetes.io/os=linux,node-role.kubernetes.io/infra=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack2
compute-1   Ready   infra,worker   20d   v1.19.0+d59ce34   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-1,kubernetes.io/os=linux,node-role.kubernetes.io/infra=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack0
compute-2   Ready   infra,worker   20d   v1.19.0+d59ce34   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-2,kubernetes.io/os=linux,node-role.kubernetes.io/infra=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack1

ocs-metrics-exporter deployment (excerpt)
==============
      f:tolerations: {}
    manager: olm
    operation: Update
    time: "2020-11-03T10:23:35Z"
  name: ocs-metrics-exporter
  namespace: openshift-storage
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: false
    kind: ClusterServiceVersion
    name: ocs-operator.v4.6.0-152.ci
    uid: 5b89c4e1-b273-4dac-83f1-698db1184a1f
  resourceVersion: "28789890"
  selfLink: /apis/apps/v1/namespaces/openshift-storage/deployments/ocs-metrics-exporter
  uid: 6ff5e5ca-c57d-4e0f-8ac9-db487c29d787
spec:
--
      tolerations:
      - effect: NoSchedule
        key: node.ocs.openshift.io/storage
        operator: Equal
        value: "true"
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2020-10-28T07:23:45Z"
    lastUpdateTime: "2020-10-28T07:23:45Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2020-10-15T09:04:03Z"
    lastUpdateTime: "2020-11-03T10:23:02Z"
    message: ReplicaSet "ocs-metrics-exporter-6d9867695b" has successfully progressed.
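For reference, a quick way to confirm the toleration is present in the deployment's pod template is a jsonpath query (a sketch; deployment name and namespace are taken from the output above):

$ oc get deployment ocs-metrics-exporter -n openshift-storage \
    -o jsonpath='{.spec.template.spec.tolerations}'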
$ oc get nodes
NAME              STATUS   ROLES          AGE   VERSION
compute-0         Ready    infra,worker   19d   v1.19.0+d59ce34
compute-1         Ready    infra,worker   19d   v1.19.0+d59ce34
compute-2         Ready    infra,worker   19d   v1.19.0+d59ce34
compute-3         Ready    worker         19d   v1.19.0+d59ce34
compute-4         Ready    worker         19d   v1.19.0+d59ce34
compute-5         Ready    worker         19d   v1.19.0+d59ce34
control-plane-0   Ready    master         19d   v1.19.0+d59ce34
control-plane-1   Ready    master         19d   v1.19.0+d59ce34
control-plane-2   Ready    master         19d   v1.19.0+d59ce34

$ oc get pods -n openshift-storage -o wide | grep ocs-metrics
ocs-metrics-exporter-6d9867695b-f4gft   1/1   Running   0   21h   10.128.3.130   compute-4   <none>   <none>

$ oc delete pod ocs-metrics-exporter-6d9867695b-f4gft -n openshift-storage
pod "ocs-metrics-exporter-6d9867695b-f4gft" deleted

$ oc get pods -n openshift-storage -o wide | grep ocs-metrics
ocs-metrics-exporter-6d9867695b-q2bqg   1/1   Running   0   39s   10.128.2.89   compute-4   <none>   <none>
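As context for why the pod stays on compute-4: that node carries no OCS taint, so a toleration alone never moves or evicts a pod. The taint that these tolerations match is normally applied to storage nodes along the lines of the following sketch (the node name here is only an example):

$ oc adm taint nodes compute-0 node.ocs.openshift.io/storage=true:NoSchedule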
This is expected; that is all taints and tolerations can do. With the toleration, the pod can run on the infra nodes, but nothing requires it to. If we want to ensure that it does, we need node affinity, and that is a different issue. This BZ is verified as per comment 5.
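For illustration only, since node affinity is explicitly out of scope for this BZ: pinning the exporter to the storage nodes would need a stanza like the sketch below in the deployment's pod template, using the cluster.ocs.openshift.io/openshift-storage label visible in the node listings above:

spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage
                operator: Exists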
Test environment:
-------------------
Infra-labelled and OCS-tainted nodes

Test steps:
-----------
1. The ocs-metrics-exporter pod was running on a non-OCS node.
2. Cordoned the non-OCS workers.
3. Respun the ocs-metrics-exporter pod.
4. The ocs-metrics-exporter pod started running on an OCS node.

Console output:
---------------
$ oc get pods -n openshift-storage -o wide | grep ocs-metrics
ocs-metrics-exporter-6d9867695b-q2bqg   1/1   Running   0   28h   10.128.2.89   compute-4   <none>   <none>

$ oc delete pod ocs-metrics-exporter-6d9867695b-q2bqg -n openshift-storage
pod "ocs-metrics-exporter-6d9867695b-q2bqg" deleted

$ oc get pods -n openshift-storage -o wide | grep ocs-metrics
ocs-metrics-exporter-6d9867695b-6cscf   1/1   Running   0   18s   10.131.0.28   compute-0   <none>   <none>

$ oc get nodes
NAME              STATUS                     ROLES          AGE   VERSION
compute-0         Ready                      infra,worker   21d   v1.19.0+d59ce34
compute-1         Ready                      infra,worker   21d   v1.19.0+d59ce34
compute-2         Ready                      infra,worker   21d   v1.19.0+d59ce34
compute-3         Ready,SchedulingDisabled   worker         21d   v1.19.0+d59ce34
compute-4         Ready,SchedulingDisabled   worker         21d   v1.19.0+d59ce34
compute-5         Ready,SchedulingDisabled   worker         21d   v1.19.0+d59ce34
control-plane-0   Ready                      master         21d   v1.19.0+d59ce34
control-plane-1   Ready                      master         21d   v1.19.0+d59ce34
control-plane-2   Ready                      master         21d   v1.19.0+d59ce34

With the above verification, and based on comments #5 and #6, moving this BZ to the verified state.
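For reference, steps 2 and 3 above map onto standard oc commands (a sketch using the node and pod names from the listings above; uncordon restores scheduling afterwards):

$ oc adm cordon compute-3 compute-4 compute-5
$ oc delete pod -n openshift-storage ocs-metrics-exporter-6d9867695b-q2bqg
$ oc adm uncordon compute-3 compute-4 compute-5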
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605