Bug 2122980 - [GSS]toleration for "non-ocs" taints OpenShift Data Foundation pods
Summary: [GSS]toleration for "non-ocs" taints OpenShift Data Foundation pods
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-operator
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.12.0
Assignee: Utkarsh Srivastava
QA Contact: Vishakha Kathole
URL:
Whiteboard:
Depends On:
Blocks: 2125147
 
Reported: 2022-08-31 13:26 UTC by khover
Modified: 2023-08-09 17:00 UTC
CC List: 17 users

Fixed In Version: 4.12.0-65
Doc Type: Bug Fix
Doc Text:
Cause: The ODF operator did not allow custom configuration of the subscriptions it manages. Consequence: Users could not modify the OLM subscriptions controlled by the ODF operator, which also meant that users could not add tolerations for custom taints to the ODF child deployments. Fix: The ODF operator now respects a custom config if the user provides one.
Clone Of:
: 2125147 (view as bug list)
Environment:
Last Closed: 2023-02-08 14:06:28 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage ocs-ci pull 6571 0 None Merged Test taint toleration 2022-12-21 06:23:16 UTC
Github red-hat-storage odf-operator pull 247 0 None open Prevent subscription config override if it is not empty 2022-09-07 09:21:41 UTC
Red Hat Bugzilla 2059105 1 None None None 2024-06-27 07:14:10 UTC

Description khover 2022-08-31 13:26:08 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

When applying non-OCS taints in ODF 4.10, the csi-addons-controller-manager pods cannot be scheduled.

# oc get subs odf-operator -o yaml | grep -A7 config
  config:
    tolerations:
    - effect: NoSchedule
      key: nodename
      operator: Equal
      value: "true"
  installPlanApproval: Manual
  name: odf-operator

# oc adm taint nodes -l cluster.ocs.openshift.io/openshift-storage= nodename=true:NoSchedule
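For reference, whether the taint actually landed on the storage nodes can be confirmed with something like the following (this assumes the storage nodes carry the cluster.ocs.openshift.io/openshift-storage label, as in the taint command above):

# oc get nodes -l cluster.ocs.openshift.io/openshift-storage= -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'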

# oc delete pod csi-addons-controller-manager-7656cbcf45-gzqjm

# oc get pods | grep -v Running
NAME                                                              READY   STATUS      RESTARTS      AGE
csi-addons-controller-manager-7656cbcf45-45r7f                    0/2     Pending     0             7s

# oc get pods csi-addons-controller-manager-7656cbcf45-45r7f -o yaml | grep -A8 status
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-08-31T12:55:37Z"
    message: '0/6 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master:
      }, that the pod didn''t tolerate, 3 node(s) had taint {nodename: true}, that
      the pod didn''t tolerate.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable

# oc get csv odf-csi-addons-operator.v4.10.5 -o yaml | grep -A6 conditions
  conditions:
  - lastTransitionTime: "2022-08-31T13:08:25Z"
    lastUpdateTime: "2022-08-31T13:08:25Z"
    message: 'installing: waiting for deployment csi-addons-controller-manager to
      become ready: deployment "csi-addons-controller-manager" not available: Deployment
      does not have minimum availability.'
    phase: Pending



Version of all relevant components (if applicable):

ocs-operator.v4.10.5              OpenShift Container Storage   4.10.5    ocs-operator.v4.10.4              Succeeded
odf-csi-addons-operator.v4.10.5   CSI Addons                    4.10.5    odf-csi-addons-operator.v4.10.4   Installing
odf-operator.v4.10.5              OpenShift Data Foundation     4.10.5    odf-operator.v4.10.4              Succeeded

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

This may be a blocker for any customer using non-OCS node taints and upgrading to 4.10.

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 5 Shrivaibavi Raghaventhiran 2022-09-01 15:55:34 UTC
ODF QE is looking into the issue. Let us know if we can help you in any way. Thanks.

Comment 7 khover 2022-09-02 20:08:03 UTC
Additional pods are affected whose ReplicaSets, Deployments, and CSVs/Subscriptions are all owned by odf-operator.

The 4.9+ section of this solution does not work:

https://access.redhat.com/articles/6408481

# oc get subs odf-operator -o yaml | grep -A7 config
  config:
    tolerations:
    - effect: NoSchedule
      key: nodename
      operator: Equal
      value: "true"
  installPlanApproval: Manual
  name: odf-operator


noobaa-operator-79d9fd4599-5pxss                                  0/1     Pending   0          26m
ocs-metrics-exporter-587db684f-jwgn5                              0/1     Pending   0          26m
ocs-operator-6659bd4695-kbjkr                                     0/1     Pending   0          13m
rook-ceph-operator-66cf469cff-bgv9b                               0/1     Pending   0          11m

# oc get pod noobaa-operator-79d9fd4599-5pxss -o yaml | grep -A8 status:
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-09-02T19:23:02Z"
    message: '0/6 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master:
      }, that the pod didn''t tolerate, 3 node(s) had taint {nodename: true}, that
      the pod didn''t tolerate.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending

# oc get pod ocs-metrics-exporter-587db684f-jwgn5 -o yaml | grep -A8 status:
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-09-02T19:23:03Z"
    message: '0/6 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master:
      }, that the pod didn''t tolerate, 3 node(s) had taint {nodename: true}, that
      the pod didn''t tolerate.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending


# oc get pod ocs-operator-6659bd4695-kbjkr -o yaml | grep -A8 status:
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-09-02T19:36:06Z"
    message: '0/6 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master:
      }, that the pod didn''t tolerate, 3 node(s) had taint {nodename: true}, that
      the pod didn''t tolerate.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending


# oc get pod rook-ceph-operator-66cf469cff-bgv9b -o yaml | grep -A8 status:
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-09-02T19:38:29Z"
    message: '0/6 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master:
      }, that the pod didn''t tolerate, 3 node(s) had taint {nodename: true}, that
      the pod didn''t tolerate.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
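
For reference, whether the toleration actually reached one of the affected deployments can be checked with something along these lines (empty output means no tolerations were injected into the pod template; the deployment name and namespace here are the ones from the listing above):

# oc get deployment noobaa-operator -n openshift-storage -o jsonpath='{.spec.template.spec.tolerations}'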

Comment 8 umanga 2022-09-06 04:01:40 UTC
(In reply to khover from comment #7)
> additional pods affected that rs+deployment+csv/subs are all owned by
> odf-operator
> 
> This solution 4.9+ section does not work.
> 
> https://access.redhat.com/articles/6408481
> 
> # oc get subs odf-operator -o yaml | grep -A7 config
>   config:
>     tolerations:
>     - effect: NoSchedule
>       key: nodename
>       operator: Equal
>       value: "true"
>   installPlanApproval: Manual
>   name: odf-operator
> 
> 
This should also be applied to the other subs (mcg-operator-* and ocs-operator-*) in the openshift-storage namespace. It looks like it wasn't applied.
Can you check that?
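
For reference, applying the same config block to the other subscriptions could look roughly like the following sketch (the exact subscription names vary by install, so verify them first with oc get subs -n openshift-storage):

# oc patch subscription mcg-operator -n openshift-storage --type merge -p '{"spec":{"config":{"tolerations":[{"effect":"NoSchedule","key":"nodename","operator":"Equal","value":"true"}]}}}'
# oc patch subscription ocs-operator -n openshift-storage --type merge -p '{"spec":{"config":{"tolerations":[{"effect":"NoSchedule","key":"nodename","operator":"Equal","value":"true"}]}}}'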

Comment 24 khover 2022-09-07 15:06:41 UTC
When I apply the workaround step by step and delete the pods, I get the following result.

As Bipin also observed:

It is unlikely that the odf-operator-controller-manager pod will be restarted manually, since we are setting its replica count to zero. But if it does get restarted for any reason, we will have to fix all the subscriptions again.

I even observed that restarting odf-console updates the replica count to 1 and brings back the odf-operator-controller-manager pod.
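
For reference, the scale-down referred to here is done with something like the following, and is reverted the same way with --replicas=1 once the subscriptions are fixed:

# oc scale deployment odf-operator-controller-manager -n openshift-storage --replicas=0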


**Dependency: odf-operator-controller-manager replicas=0 state

NAME                                                              READY   STATUS    RESTARTS   AGE
csi-addons-controller-manager-98759dfbb-gw5k9                     2/2     Running   0          20s
csi-cephfsplugin-dw9z6                                            3/3     Running   0          76m
csi-cephfsplugin-provisioner-6596b9c55f-2gll7                     6/6     Running   0          76m
csi-cephfsplugin-provisioner-6596b9c55f-dk79h                     6/6     Running   0          76m
csi-cephfsplugin-qzds2                                            3/3     Running   0          76m
csi-cephfsplugin-x4sch                                            3/3     Running   0          76m
csi-rbdplugin-2j8dh                                               4/4     Running   0          76m
csi-rbdplugin-6flhc                                               4/4     Running   0          76m
csi-rbdplugin-9r6jx                                               4/4     Running   0          76m
csi-rbdplugin-provisioner-76494fb89-5dd8p                         7/7     Running   0          76m
csi-rbdplugin-provisioner-76494fb89-lcn9b                         7/7     Running   0          76m
noobaa-core-0                                                     1/1     Running   0          75m
noobaa-db-pg-0                                                    1/1     Running   0          75m
noobaa-endpoint-5744c75459-dvph7                                  1/1     Running   0          76m
noobaa-operator-5cfc45d674-hnhns                                  1/1     Running   0          23m
ocs-metrics-exporter-55b94f5d76-qxcb5                             1/1     Running   0          76m
ocs-operator-65b67f8674-9cs94                                     1/1     Running   0          25m
odf-console-5d4c666646-wxh7g                                      1/1     Running   0          76m
rook-ceph-crashcollector-ip-10-0-141-133.ec2.internal-74c8djl2m   1/1     Running   0          76m
rook-ceph-crashcollector-ip-10-0-153-201.ec2.internal-65c6hk8f9   1/1     Running   0          76m
rook-ceph-crashcollector-ip-10-0-170-200.ec2.internal-66b49gxrv   1/1     Running   0          76m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7b464946m4559   2/2     Running   0          76m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6986c8b56ggpx   2/2     Running   0          76m
rook-ceph-mgr-a-7786fb48d6-kwpgd                                  2/2     Running   0          76m
rook-ceph-mon-a-85f67c488-j74wv                                   2/2     Running   0          76m
rook-ceph-mon-b-5b7fb6b994-4stph                                  2/2     Running   0          76m
rook-ceph-mon-c-55794bb4dd-drjlr                                  2/2     Running   0          76m
rook-ceph-operator-845fc866fd-8glr2                               1/1     Running   0          76m
rook-ceph-osd-0-7d676996b8-bjj7r                                  2/2     Running   0          76m
rook-ceph-osd-1-cf59c5655-p97mk                                   2/2     Running   0          76m
rook-ceph-osd-2-fc8bf9c74-6z89m                                   2/2     Running   0          76m
rook-ceph-tools-5b87f59449-xq2bx                                  1/1     Running   0          76m

# oc get deployments
NAME                                                    READY   UP-TO-DATE   AVAILABLE   AGE
csi-addons-controller-manager                           1/1     1            1           6d
csi-cephfsplugin-provisioner                            2/2     2            2           6d
csi-rbdplugin-provisioner                               2/2     2            2           6d
noobaa-endpoint                                         1/1     1            1           6d
noobaa-operator                                         1/1     1            1           6d
ocs-metrics-exporter                                    1/1     1            1           6d
ocs-operator                                            1/1     1            1           6d
odf-console                                             1/1     1            1           6d
odf-operator-controller-manager                         0/0     0            0           6d
rook-ceph-crashcollector-ip-10-0-141-133.ec2.internal   1/1     1            1           6d
rook-ceph-crashcollector-ip-10-0-153-201.ec2.internal   1/1     1            1           6d
rook-ceph-crashcollector-ip-10-0-170-200.ec2.internal   1/1     1            1           6d
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a       1/1     1            1           6d
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b       1/1     1            1           6d
rook-ceph-mgr-a                                         1/1     1            1           6d
rook-ceph-mon-a                                         1/1     1            1           6d
rook-ceph-mon-b                                         1/1     1            1           6d
rook-ceph-mon-c                                         1/1     1            1           6d
rook-ceph-operator                                      1/1     1            1           6d
rook-ceph-osd-0                                         1/1     1            1           6d
rook-ceph-osd-1                                         1/1     1            1           6d
rook-ceph-osd-2                                         1/1     1            1           6d
rook-ceph-tools                                         1/1     1            1           5d12h

