Bug 2122980
| Summary: | [GSS] toleration for "non-ocs" taints on OpenShift Data Foundation pods | | | |
|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | khover | |
| Component: | odf-operator | Assignee: | Utkarsh Srivastava <usrivast> | |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Vishakha Kathole <vkathole> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | urgent | |||
| Version: | 4.10 | CC: | bkunal, djuran, mhackett, midzik, mrajanna, muagarwa, ndevos, nigoyal, ocs-bugs, odf-bz-bot, rar, sostapov, sraghave, tdesala, tnielsen, uchapaga, usrivast | |
| Target Milestone: | --- | Keywords: | Regression | |
| Target Release: | ODF 4.12.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | 4.12.0-65 | Doc Type: | Bug Fix | |
| Doc Text: | Cause: The ODF operator did not allow custom configuration for its subscription. Consequence: Users could not modify the OLM subscriptions controlled by the ODF operator, which also meant they could not add tolerations for custom taints to the ODF child deployments. Fix: The ODF operator now respects the custom config if the user provides one. | Story Points: | --- | |
| Clone Of: | ||||
| : | 2125147 (view as bug list) | Environment: | ||
| Last Closed: | 2023-02-08 14:06:28 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 2125147 | |||
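
For illustration of the Doc Text above (the ODF operator honoring a user-provided subscription config), here is a minimal sketch of setting the reporter's toleration on the odf-operator Subscription. The namespace, taint key, value, and effect are taken from this report; using `oc patch` with a merge patch is just one way to apply it.

    # Minimal sketch, not taken verbatim from this bug: add the reporter's
    # "nodename" toleration to spec.config of the odf-operator Subscription so
    # the fixed operator (4.12+) can propagate it to the child operators.
    oc -n openshift-storage patch subscription odf-operator --type merge \
      -p '{"spec":{"config":{"tolerations":[{"key":"nodename","operator":"Equal","value":"true","effect":"NoSchedule"}]}}}'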
ODF QE is looking into the issue. Let us know if we can help you in any way. Thanks.

Additional pods are affected whose ReplicaSets, Deployments, and CSVs/Subscriptions are all owned by odf-operator.

The 4.9+ section of this solution does not work: https://access.redhat.com/articles/6408481

    # oc get subs odf-operator -o yaml | grep -A7 config
      config:
        tolerations:
        - effect: NoSchedule
          key: nodename
          operator: Equal
          value: "true"
      installPlanApproval: Manual
      name: odf-operator

    noobaa-operator-79d9fd4599-5pxss       0/1   Pending   0   26m
    ocs-metrics-exporter-587db684f-jwgn5   0/1   Pending   0   26m
    ocs-operator-6659bd4695-kbjkr          0/1   Pending   0   13m
    rook-ceph-operator-66cf469cff-bgv9b    0/1   Pending   0   11m

    # oc get pod noobaa-operator-79d9fd4599-5pxss -o yaml | grep -A8 status:
    status:
      conditions:
      - lastProbeTime: null
        lastTransitionTime: "2022-09-02T19:23:02Z"
        message: '0/6 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: },
          that the pod didn''t tolerate, 3 node(s) had taint {nodename: true}, that the pod didn''t tolerate.'
        reason: Unschedulable
        status: "False"
        type: PodScheduled
      phase: Pending

    # oc get pod ocs-metrics-exporter-587db684f-jwgn5 -o yaml | grep -A8 status:
    status:
      conditions:
      - lastProbeTime: null
        lastTransitionTime: "2022-09-02T19:23:03Z"
        message: '0/6 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: },
          that the pod didn''t tolerate, 3 node(s) had taint {nodename: true}, that the pod didn''t tolerate.'
        reason: Unschedulable
        status: "False"
        type: PodScheduled
      phase: Pending

    # oc get pod ocs-operator-6659bd4695-kbjkr -o yaml | grep -A8 status:
    status:
      conditions:
      - lastProbeTime: null
        lastTransitionTime: "2022-09-02T19:36:06Z"
        message: '0/6 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: },
          that the pod didn''t tolerate, 3 node(s) had taint {nodename: true}, that the pod didn''t tolerate.'
        reason: Unschedulable
        status: "False"
        type: PodScheduled
      phase: Pending

    # oc get pod rook-ceph-operator-66cf469cff-bgv9b -o yaml | grep -A8 status:
    status:
      conditions:
      - lastProbeTime: null
        lastTransitionTime: "2022-09-02T19:38:29Z"
        message: '0/6 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: },
          that the pod didn''t tolerate, 3 node(s) had taint {nodename: true}, that the pod didn''t tolerate.'
        reason: Unschedulable
        status: "False"
        type: PodScheduled
      phase: Pending

(In reply to khover from comment #7)
> Additional pods are affected whose ReplicaSets, Deployments, and
> CSVs/Subscriptions are all owned by odf-operator.
>
> The 4.9+ section of this solution does not work:
> https://access.redhat.com/articles/6408481
>
>     # oc get subs odf-operator -o yaml | grep -A7 config
>       config:
>         tolerations:
>         - effect: NoSchedule
>           key: nodename
>           operator: Equal
>           value: "true"
>       installPlanApproval: Manual
>       name: odf-operator

This should also be applied to the other subscriptions (mcg-operator-* and ocs-operator-*) in the openshift-storage namespace. It looks like it wasn't applied. Can you check that?

When I apply the workaround step by step and delete the pods, I get the result shown below. As Bipin also observed: it is unlikely that the odf-operator-controller-manager pod gets restarted manually, because we are setting its replicas to zero. But if it does get restarted for any reason, we will have to fix all the subscriptions again. I even observed that restarting odf-console sets the replica count back to 1 and brings back the odf-operator-controller-manager pods.
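For reference, a rough sketch of the workaround being discussed here: scale odf-operator-controller-manager down and copy the toleration into the dependent subscriptions. The subscription names in the loop are placeholders; as noted above the real objects are named mcg-operator-* and ocs-operator-* with generated suffixes, so list them first.

    # Rough sketch of the discussed workaround; subscription names below are
    # placeholders (the generated names carry suffixes), so list them first:
    #   oc -n openshift-storage get subscriptions
    oc -n openshift-storage scale deployment odf-operator-controller-manager --replicas=0

    for sub in mcg-operator ocs-operator; do   # placeholder names
      oc -n openshift-storage patch subscription "$sub" --type merge \
        -p '{"spec":{"config":{"tolerations":[{"key":"nodename","operator":"Equal","value":"true","effect":"NoSchedule"}]}}}'
    done

As noted in the comment above, this does not persist if odf-operator-controller-manager comes back (for example after odf-console restores its replica count), in which case the subscriptions have to be fixed again.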
**Dependency:** odf-operator-controller-manager in the replicas=0 state:

    NAME                                                              READY   STATUS    RESTARTS   AGE
    csi-addons-controller-manager-98759dfbb-gw5k9                     2/2     Running   0          20s
    csi-cephfsplugin-dw9z6                                            3/3     Running   0          76m
    csi-cephfsplugin-provisioner-6596b9c55f-2gll7                     6/6     Running   0          76m
    csi-cephfsplugin-provisioner-6596b9c55f-dk79h                     6/6     Running   0          76m
    csi-cephfsplugin-qzds2                                            3/3     Running   0          76m
    csi-cephfsplugin-x4sch                                            3/3     Running   0          76m
    csi-rbdplugin-2j8dh                                               4/4     Running   0          76m
    csi-rbdplugin-6flhc                                               4/4     Running   0          76m
    csi-rbdplugin-9r6jx                                               4/4     Running   0          76m
    csi-rbdplugin-provisioner-76494fb89-5dd8p                         7/7     Running   0          76m
    csi-rbdplugin-provisioner-76494fb89-lcn9b                         7/7     Running   0          76m
    noobaa-core-0                                                     1/1     Running   0          75m
    noobaa-db-pg-0                                                    1/1     Running   0          75m
    noobaa-endpoint-5744c75459-dvph7                                  1/1     Running   0          76m
    noobaa-operator-5cfc45d674-hnhns                                  1/1     Running   0          23m
    ocs-metrics-exporter-55b94f5d76-qxcb5                             1/1     Running   0          76m
    ocs-operator-65b67f8674-9cs94                                     1/1     Running   0          25m
    odf-console-5d4c666646-wxh7g                                      1/1     Running   0          76m
    rook-ceph-crashcollector-ip-10-0-141-133.ec2.internal-74c8djl2m   1/1     Running   0          76m
    rook-ceph-crashcollector-ip-10-0-153-201.ec2.internal-65c6hk8f9   1/1     Running   0          76m
    rook-ceph-crashcollector-ip-10-0-170-200.ec2.internal-66b49gxrv   1/1     Running   0          76m
    rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7b464946m4559   2/2     Running   0          76m
    rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6986c8b56ggpx   2/2     Running   0          76m
    rook-ceph-mgr-a-7786fb48d6-kwpgd                                  2/2     Running   0          76m
    rook-ceph-mon-a-85f67c488-j74wv                                   2/2     Running   0          76m
    rook-ceph-mon-b-5b7fb6b994-4stph                                  2/2     Running   0          76m
    rook-ceph-mon-c-55794bb4dd-drjlr                                  2/2     Running   0          76m
    rook-ceph-operator-845fc866fd-8glr2                               1/1     Running   0          76m
    rook-ceph-osd-0-7d676996b8-bjj7r                                  2/2     Running   0          76m
    rook-ceph-osd-1-cf59c5655-p97mk                                   2/2     Running   0          76m
    rook-ceph-osd-2-fc8bf9c74-6z89m                                   2/2     Running   0          76m
    rook-ceph-tools-5b87f59449-xq2bx                                  1/1     Running   0          76m

    # oc get deployments
    NAME                                                    READY   UP-TO-DATE   AVAILABLE   AGE
    csi-addons-controller-manager                           1/1     1            1           6d
    csi-cephfsplugin-provisioner                            2/2     2            2           6d
    csi-rbdplugin-provisioner                               2/2     2            2           6d
    noobaa-endpoint                                         1/1     1            1           6d
    noobaa-operator                                         1/1     1            1           6d
    ocs-metrics-exporter                                    1/1     1            1           6d
    ocs-operator                                            1/1     1            1           6d
    odf-console                                             1/1     1            1           6d
    odf-operator-controller-manager                         0/0     0            0           6d
    rook-ceph-crashcollector-ip-10-0-141-133.ec2.internal   1/1     1            1           6d
    rook-ceph-crashcollector-ip-10-0-153-201.ec2.internal   1/1     1            1           6d
    rook-ceph-crashcollector-ip-10-0-170-200.ec2.internal   1/1     1            1           6d
    rook-ceph-mds-ocs-storagecluster-cephfilesystem-a       1/1     1            1           6d
    rook-ceph-mds-ocs-storagecluster-cephfilesystem-b       1/1     1            1           6d
    rook-ceph-mgr-a                                         1/1     1            1           6d
    rook-ceph-mon-a                                         1/1     1            1           6d
    rook-ceph-mon-b                                         1/1     1            1           6d
    rook-ceph-mon-c                                         1/1     1            1           6d
    rook-ceph-operator                                      1/1     1            1           6d
    rook-ceph-osd-0                                         1/1     1            1           6d
    rook-ceph-osd-1                                         1/1     1            1           6d
    rook-ceph-osd-2                                         1/1     1            1           6d
    rook-ceph-tools                                         1/1     1            1           5d12h
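A possible way to spot-check the state shown above. The deployment names come from the listings; the jsonpath queries are only one way to inspect this.

    # Spot-check sketch: confirm the controller-manager is still scaled down
    # and see whether the toleration reached the child operator deployments.
    oc -n openshift-storage get deployment odf-operator-controller-manager \
      -o jsonpath='{.spec.replicas}{"\n"}'

    for d in noobaa-operator ocs-operator ocs-metrics-exporter rook-ceph-operator; do
      echo "== ${d}"
      oc -n openshift-storage get deployment "${d}" \
        -o jsonpath='{.spec.template.spec.tolerations}{"\n"}'
    done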
Description of problem (please be as detailed as possible and provide log snippets):

When applying non-ocs taints in ODF 4.10, the csi-addons-controller-manager pods cannot be scheduled.

    # oc get subs odf-operator -o yaml | grep -A7 config
      config:
        tolerations:
        - effect: NoSchedule
          key: nodename
          operator: Equal
          value: "true"
      installPlanApproval: Manual
      name: odf-operator

    # oc adm taint nodes -l cluster.ocs.openshift.io/openshift-storage= nodename=true:NoSchedule

    # oc delete pod csi-addons-controller-manager-7656cbcf45-gzqjm

    # oc get pods | grep -v Running
    NAME                                             READY   STATUS    RESTARTS   AGE
    csi-addons-controller-manager-7656cbcf45-45r7f   0/2     Pending   0          7s

    # oc get pods csi-addons-controller-manager-7656cbcf45-45r7f -o yaml | grep -A8 status
    status:
      conditions:
      - lastProbeTime: null
        lastTransitionTime: "2022-08-31T12:55:37Z"
        message: '0/6 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: },
          that the pod didn''t tolerate, 3 node(s) had taint {nodename: true}, that the pod didn''t tolerate.'
        reason: Unschedulable
        status: "False"
        type: PodScheduled
      phase: Pending
      qosClass: Burstable

    # oc get csv odf-csi-addons-operator.v4.10.5 -o yaml | grep -A6 conditions
      conditions:
      - lastTransitionTime: "2022-08-31T13:08:25Z"
        lastUpdateTime: "2022-08-31T13:08:25Z"
        message: 'installing: waiting for deployment csi-addons-controller-manager to become
          ready: deployment "csi-addons-controller-manager" not available: Deployment does
          not have minimum availability.'
        phase: Pending

Version of all relevant components (if applicable):

    ocs-operator.v4.10.5              OpenShift Container Storage   4.10.5   ocs-operator.v4.10.4              Succeeded
    odf-csi-addons-operator.v4.10.5   CSI Addons                    4.10.5   odf-csi-addons-operator.v4.10.4   Installing
    odf-operator.v4.10.5              OpenShift Data Foundation     4.10.5   odf-operator.v4.10.4              Succeeded

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
This may be a blocker for any customer who uses non-ocs node taints and upgrades to 4.10.

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
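Not part of the report, but for completeness: a hypothetical recovery step while the fix or workaround is being applied is to remove the custom taint again; the command below is simply the inverse of the taint command in the reproduction above, using the same label selector, key, value, and effect.

    # Hypothetical recovery sketch (inverse of the reproduction's taint
    # command); the trailing "-" removes the taint from the selected nodes.
    oc adm taint nodes -l cluster.ocs.openshift.io/openshift-storage= nodename=true:NoSchedule-

    # Then confirm nothing in openshift-storage is left Pending:
    oc -n openshift-storage get pods --field-selector=status.phase=Pending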