Bug 2122980

Summary: [GSS] toleration for "non-ocs" taints on OpenShift Data Foundation pods
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: khover
Component: odf-operator    Assignee: Utkarsh Srivastava <usrivast>
Status: CLOSED CURRENTRELEASE QA Contact: Vishakha Kathole <vkathole>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 4.10    CC: bkunal, djuran, mhackett, midzik, mrajanna, muagarwa, ndevos, nigoyal, ocs-bugs, odf-bz-bot, rar, sostapov, sraghave, tdesala, tnielsen, uchapaga, usrivast
Target Milestone: ---    Keywords: Regression
Target Release: ODF 4.12.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.12.0-65 Doc Type: Bug Fix
Doc Text:
Cause: The ODF operator did not allow custom configuration of the subscriptions it manages. Consequence: Users could not modify the OLM subscriptions controlled by the ODF operator, which also meant they could not add tolerations for custom taints to the ODF child deployments. Fix: The ODF operator now respects the custom config if the user provides one.
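For reference, a minimal sketch of the shape of such a custom config, as it appears under spec.config of a Subscription (the same form shown in the reproducer below):

  config:
    tolerations:
    - effect: NoSchedule
      key: nodename
      operator: Equal
      value: "true"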
Story Points: ---
Clone Of:
: 2125147 (view as bug list) Environment:
Last Closed: 2023-02-08 14:06:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2125147    

Description khover 2022-08-31 13:26:08 UTC
Description of problem (please be as detailed as possible and provide log snippets):

When applying non-ocs taints in ODF 4.10, the csi-addons-controller-manager pods cannot be scheduled.

# oc get subs odf-operator -o yaml | grep -A7 config
  config:
    tolerations:
    - effect: NoSchedule
      key: nodename
      operator: Equal
      value: "true"
  installPlanApproval: Manual
  name: odf-operator

# oc adm taint nodes -l cluster.ocs.openshift.io/openshift-storage= nodename=true:NoSchedule

# oc delete pod csi-addons-controller-manager-7656cbcf45-gzqjm

# oc get pods | grep -v Running
NAME                                                              READY   STATUS      RESTARTS      AGE
csi-addons-controller-manager-7656cbcf45-45r7f                    0/2     Pending     0             7s

# oc get pods csi-addons-controller-manager-7656cbcf45-45r7f -o yaml | grep -A8 status
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-08-31T12:55:37Z"
    message: '0/6 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master:
      }, that the pod didn''t tolerate, 3 node(s) had taint {nodename: true}, that
      the pod didn''t tolerate.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable

# oc get csv odf-csi-addons-operator.v4.10.5 -o yaml | grep -A6 conditions
  conditions:
  - lastTransitionTime: "2022-08-31T13:08:25Z"
    lastUpdateTime: "2022-08-31T13:08:25Z"
    message: 'installing: waiting for deployment csi-addons-controller-manager to
      become ready: deployment "csi-addons-controller-manager" not available: Deployment
      does not have minimum availability.'
    phase: Pending



Version of all relevant components (if applicable):

ocs-operator.v4.10.5              OpenShift Container Storage   4.10.5    ocs-operator.v4.10.4              Succeeded
odf-csi-addons-operator.v4.10.5   CSI Addons                    4.10.5    odf-csi-addons-operator.v4.10.4   Installing
odf-operator.v4.10.5              OpenShift Data Foundation     4.10.5    odf-operator.v4.10.4              Succeeded

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

This may be a blocker for any customer using non-ocs node taints and upgrading to 4.10.

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 5 Shrivaibavi Raghaventhiran 2022-09-01 15:55:34 UTC
ODF QE is looking into the issue. Let us know if we can help you in any way. Thanks

Comment 7 khover 2022-09-02 20:08:03 UTC
Additional pods are affected whose ReplicaSet, Deployment, and CSV/Subscription are all owned by odf-operator.

The 4.9+ section of this solution does not work:

https://access.redhat.com/articles/6408481

# oc get subs odf-operator -o yaml | grep -A7 config
  config:
    tolerations:
    - effect: NoSchedule
      key: nodename
      operator: Equal
      value: "true"
  installPlanApproval: Manual
  name: odf-operator


noobaa-operator-79d9fd4599-5pxss                                  0/1     Pending   0          26m
ocs-metrics-exporter-587db684f-jwgn5                              0/1     Pending   0          26m
ocs-operator-6659bd4695-kbjkr                                     0/1     Pending   0          13m
rook-ceph-operator-66cf469cff-bgv9b                               0/1     Pending   0          11m

# oc get pod noobaa-operator-79d9fd4599-5pxss -o yaml | grep -A8 status:
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-09-02T19:23:02Z"
    message: '0/6 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master:
      }, that the pod didn''t tolerate, 3 node(s) had taint {nodename: true}, that
      the pod didn''t tolerate.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending

# oc get pod ocs-metrics-exporter-587db684f-jwgn5 -o yaml | grep -A8 status:
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-09-02T19:23:03Z"
    message: '0/6 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master:
      }, that the pod didn''t tolerate, 3 node(s) had taint {nodename: true}, that
      the pod didn''t tolerate.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending


# oc get pod ocs-operator-6659bd4695-kbjkr -o yaml | grep -A8 status:
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-09-02T19:36:06Z"
    message: '0/6 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master:
      }, that the pod didn''t tolerate, 3 node(s) had taint {nodename: true}, that
      the pod didn''t tolerate.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending


# oc get pod rook-ceph-operator-66cf469cff-bgv9b -o yaml | grep -A8 status:
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-09-02T19:38:29Z"
    message: '0/6 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master:
      }, that the pod didn''t tolerate, 3 node(s) had taint {nodename: true}, that
      the pod didn''t tolerate.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending

Comment 8 umanga 2022-09-06 04:01:40 UTC
(In reply to khover from comment #7)
> additional pods affected that rs+deployment+csv/subs are all owned by
> odf-operator
> 
> This solution 4.9+ section does not work.
> 
> https://access.redhat.com/articles/6408481
> 
> # oc get subs odf-operator -o yaml | grep -A7 config
>   config:
>     tolerations:
>     - effect: NoSchedule
>       key: nodename
>       operator: Equal
>       value: "true"
>   installPlanApproval: Manual
>   name: odf-operator
> 
> 
This should also be applied to the other subs (mcg-operator-* and ocs-operator-*) in the openshift-storage namespace. It looks like it wasn't applied.
Can you check that?
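For example, something like the following could add the same tolerations to the remaining subscriptions (a sketch only; the exact subscription names vary per cluster, so list them first with oc get subs -n openshift-storage):

  for sub in $(oc -n openshift-storage get subs -o name | grep -E 'mcg-operator|ocs-operator'); do
    oc -n openshift-storage patch "$sub" --type merge \
      -p '{"spec":{"config":{"tolerations":[{"effect":"NoSchedule","key":"nodename","operator":"Equal","value":"true"}]}}}'
  done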

Comment 24 khover 2022-09-07 15:06:41 UTC
When I apply the workaround step by step and delete the pods, I get the following result.

As Bipin also observed:

It is unlikely that the odf-operator-controller-manager pod gets restarted on its own, because we are setting its replica count to zero. But if it does get restarted for any reason, we will have to fix all the subscriptions again.

I also observed that restarting odf-console updates the replica count back to 1 and brings back the odf-operator-controller-manager pod.


**Dependency: the workaround relies on odf-operator-controller-manager staying in the replicas=0 state
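A minimal sketch of the scale-down step this dependency refers to (assuming the default openshift-storage namespace):

# oc -n openshift-storage scale deployment odf-operator-controller-manager --replicas=0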

NAME                                                              READY   STATUS    RESTARTS   AGE
csi-addons-controller-manager-98759dfbb-gw5k9                     2/2     Running   0          20s
csi-cephfsplugin-dw9z6                                            3/3     Running   0          76m
csi-cephfsplugin-provisioner-6596b9c55f-2gll7                     6/6     Running   0          76m
csi-cephfsplugin-provisioner-6596b9c55f-dk79h                     6/6     Running   0          76m
csi-cephfsplugin-qzds2                                            3/3     Running   0          76m
csi-cephfsplugin-x4sch                                            3/3     Running   0          76m
csi-rbdplugin-2j8dh                                               4/4     Running   0          76m
csi-rbdplugin-6flhc                                               4/4     Running   0          76m
csi-rbdplugin-9r6jx                                               4/4     Running   0          76m
csi-rbdplugin-provisioner-76494fb89-5dd8p                         7/7     Running   0          76m
csi-rbdplugin-provisioner-76494fb89-lcn9b                         7/7     Running   0          76m
noobaa-core-0                                                     1/1     Running   0          75m
noobaa-db-pg-0                                                    1/1     Running   0          75m
noobaa-endpoint-5744c75459-dvph7                                  1/1     Running   0          76m
noobaa-operator-5cfc45d674-hnhns                                  1/1     Running   0          23m
ocs-metrics-exporter-55b94f5d76-qxcb5                             1/1     Running   0          76m
ocs-operator-65b67f8674-9cs94                                     1/1     Running   0          25m
odf-console-5d4c666646-wxh7g                                      1/1     Running   0          76m
rook-ceph-crashcollector-ip-10-0-141-133.ec2.internal-74c8djl2m   1/1     Running   0          76m
rook-ceph-crashcollector-ip-10-0-153-201.ec2.internal-65c6hk8f9   1/1     Running   0          76m
rook-ceph-crashcollector-ip-10-0-170-200.ec2.internal-66b49gxrv   1/1     Running   0          76m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7b464946m4559   2/2     Running   0          76m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6986c8b56ggpx   2/2     Running   0          76m
rook-ceph-mgr-a-7786fb48d6-kwpgd                                  2/2     Running   0          76m
rook-ceph-mon-a-85f67c488-j74wv                                   2/2     Running   0          76m
rook-ceph-mon-b-5b7fb6b994-4stph                                  2/2     Running   0          76m
rook-ceph-mon-c-55794bb4dd-drjlr                                  2/2     Running   0          76m
rook-ceph-operator-845fc866fd-8glr2                               1/1     Running   0          76m
rook-ceph-osd-0-7d676996b8-bjj7r                                  2/2     Running   0          76m
rook-ceph-osd-1-cf59c5655-p97mk                                   2/2     Running   0          76m
rook-ceph-osd-2-fc8bf9c74-6z89m                                   2/2     Running   0          76m
rook-ceph-tools-5b87f59449-xq2bx                                  1/1     Running   0          76m

# oc get deployments
NAME                                                    READY   UP-TO-DATE   AVAILABLE   AGE
csi-addons-controller-manager                           1/1     1            1           6d
csi-cephfsplugin-provisioner                            2/2     2            2           6d
csi-rbdplugin-provisioner                               2/2     2            2           6d
noobaa-endpoint                                         1/1     1            1           6d
noobaa-operator                                         1/1     1            1           6d
ocs-metrics-exporter                                    1/1     1            1           6d
ocs-operator                                            1/1     1            1           6d
odf-console                                             1/1     1            1           6d
odf-operator-controller-manager                         0/0     0            0           6d
rook-ceph-crashcollector-ip-10-0-141-133.ec2.internal   1/1     1            1           6d
rook-ceph-crashcollector-ip-10-0-153-201.ec2.internal   1/1     1            1           6d
rook-ceph-crashcollector-ip-10-0-170-200.ec2.internal   1/1     1            1           6d
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a       1/1     1            1           6d
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b       1/1     1            1           6d
rook-ceph-mgr-a                                         1/1     1            1           6d
rook-ceph-mon-a                                         1/1     1            1           6d
rook-ceph-mon-b                                         1/1     1            1           6d
rook-ceph-mon-c                                         1/1     1            1           6d
rook-ceph-operator                                      1/1     1            1           6d
rook-ceph-osd-0                                         1/1     1            1           6d
rook-ceph-osd-1                                         1/1     1            1           6d
rook-ceph-osd-2                                         1/1     1            1           6d
rook-ceph-tools                                         1/1     1            1           5d12h