Bug 1992472
| Summary: | How to add toleration to OCS pods for any non OCS taints? | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Bipin Kunal <bkunal> |
| Component: | rook | Assignee: | Subham Rai <srai> |
| Status: | CLOSED ERRATA | QA Contact: | Shrivaibavi Raghaventhiran <sraghave> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.8 | CC: | assingh, borazem, djuran, ebenahar, fherrman, jelopez, jrivera, kbg, madam, muagarwa, nbecker, ocs-bugs, odf-bz-bot, prpandey, sapillai, sraghave, srai, tdesala, tnielsen, vumrao |
| Target Milestone: | --- | | |
| Target Release: | ODF 4.9.0 | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | v4.9.0-158.ci | Doc Type: | Bug Fix |
| Doc Text: | Adding toleration to OpenShift Container Storage pods for any non OpenShift Container Storage taints. Previously, pods could not be scheduled on nodes with non OpenShift Container Storage taints because the tolerations were not applied. With this update, the tolerations are applied successfully and the pods can be scheduled on those nodes. | Story Points: | --- |
| Clone Of: | | | |
| Clones: | 1999158 (view as bug list) | Environment: | |
| Last Closed: | 2021-12-13 17:44:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2005937 | | |
| Bug Blocks: | 1999158 | | |
Description
Bipin Kunal
2021-08-11 07:43:22 UTC
*** Bug 1949553 has been marked as a duplicate of this bug. ***

Hi Bipin, try `storageClassDeviceSets` instead of `storageDeviceSets`. In Rook, we set the placement for the osd/prepare-osd pods under `storageClassDeviceSets`; see the field definition `StorageClassDeviceSets []StorageClassDeviceSet` (json tag `storageClassDeviceSets,omitempty`) at https://github.com/rook/rook/blob/master/pkg/apis/ceph.rook.io/v1/types.go#L1913

The workaround that I commented on in the google chat (link mentioned in c6) will not work, as it reads the `"osd"` key only if `noPlacement` is true and `supportTSC` is false, which is not the case here.

When using a StorageCluster to create and manage a CephCluster, you *MUST NOT* try to modify the CephCluster CR directly. ocs-operator will always revert any and all changes on each iteration of its reconcile loop. To have any changes persist, you must either:

1. Only interact with the StorageCluster, or
2. Scale the ocs-operator Deployment to 0 so the operator is no longer running.

By default we specify three Placements: all, mon, and arbiter. See: https://github.com/openshift/ocs-operator/blob/release-4.8/controllers/storagecluster/cephcluster.go#L306-L308

With the "all" Placement, rook-ceph-operator should be trying to merge it with any other Placements for more specific components, giving preference to the values in the more specific Placements. As such, even *if* we specify or generate Placements for the osd and osd-prepare Pods, the values in "all" (specifically the Tolerations) should be included in the Placements calculated by rook-ceph-operator.

For completeness, here is where we generate the OSD Placements: https://github.com/openshift/ocs-operator/blob/release-4.8/controllers/storagecluster/cephcluster.go#L523-L624

@tnielsen Could you have a look to see if ocs-operator is doing something wrong? If not, we may have a bug with StorageClassDeviceSets in Rook.

Bipin, can you attach the full CephCluster CR? It will help us understand what Rook is actually trying to reconcile. If the tolerations are specified both on the "all" placement and on the storageClassDeviceSet placement or preparePlacement, only one of them will be applied. I believe the storageClassDeviceSet tolerations have higher precedence than the "all" tolerations, but the tolerations are not merged; for OSDs, only nodeAffinity is merged with "all".

Setting the needinfo on Vaibhavi to provide the requested information in comment 16.

From the cluster.yaml attached, I see that the tolerations are specified both in "all" and under the storageClassDeviceSets, and the tolerations from "all" are not being applied:
```
all:
  tolerations:
  - effect: NoSchedule
    key: xyz
    operator: Equal
    value: "true"
storageClassDeviceSets:
- count: 3
  encrypted: true
  name: ocs-deviceset-localblock-0
  placement:
    tolerations:
    - effect: NoSchedule
      key: node.ocs.openshift.io/storage
      operator: Equal
      value: "true"
```
Subham, can you take a look at merging the tolerations when they are specified in both places? ApplyToPodSpec() only takes one set and ignores the tolerations from "all". But, similar to the node affinity, if onlyApplyOSDPlacement: true we would not want to merge them.
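For illustration, once the merge is implemented the OSD pods from the spec above should carry both tolerations. A hedged sketch of the expected result (the relative ordering of the entries is an assumption):

```
# Expected tolerations on the rook-ceph-osd pod spec after the fix,
# combining the storageClassDeviceSets placement with the "all" placement:
tolerations:
- effect: NoSchedule
  key: node.ocs.openshift.io/storage   # from the storageClassDeviceSets placement
  operator: Equal
  value: "true"
- effect: NoSchedule
  key: xyz                             # merged in from the "all" placement
  operator: Equal
  value: "true"
```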
Yes Bipin, it is available in the latest 4.9 downstream build.

Can you please let us know the steps to perform on OCS 4.9?

(In reply to Shrivaibavi Raghaventhiran from comment #29)
> Can you please let us know the steps to perform on OCS 4.9 ?

1. Add taints on the nodes (e.g. oc adm taint nodes node1 xyz=true:NoSchedule)
2. Add tolerations under placement in storagecluster.yaml

Note:
1) If you want the toleration to be applied to all the Ceph pods (MON, OSD, MGR, etc.), you can directly add the toleration under placement.all.
2) If you just want to add a toleration for the OSDs, add the toleration under Storage.StorageClassDeviceSets.Placement.

Also see https://bugzilla.redhat.com/show_bug.cgi?id=1992472#c20 for the placement that was causing the tolerations specified in "all" to not be applied.

There could be an issue with OCS when we only pass `tolerations` for `mds` in the StorageCluster spec and don't pass any podAntiAffinity along with it.
For example, the below spec is added to the StorageCluster yaml:
```
placement:
all:
tolerations:
- effect: NoSchedule
key: xyz
operator: Equal
value: "true"
mds:
tolerations:
- effect: NoSchedule
key: xyz
operator: Equal
value: "true"
```
This causes a nil pointer exception while ensuring the CephFileSystem in ocs-operator (https://github.com/red-hat-storage/ocs-operator/blob/32158124bba496f625d6c2f01c31affde8713fa7/controllers/storagecluster/placement.go#L66).
Error:
```
{"level":"info","ts":1631864507.7231574,"logger":"controllers.StorageCluster","msg":"Adding topology label from Node.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","Node":"ip-10-0-175-201.ec2.internal","Label":"failure-domain.beta.kubernetes.io/zone","Value":"us-east-1c"}
{"level":"info","ts":1631864507.7231665,"logger":"controllers.StorageCluster","msg":"Adding topology label from Node.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","Node":"ip-10-0-139-228.ec2.internal","Label":"failure-domain.beta.kubernetes.io/zone","Value":"us-east-1a"}
{"level":"info","ts":1631864507.7235525,"logger":"controllers.StorageCluster","msg":"Restoring original CephBlockPool.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","CephBlockPool":"openshift-storage/ocs-storagecluster-cephblockpool"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x163b1b9]
goroutine 1200 [running]:
github.com/openshift/ocs-operator/controllers/storagecluster.getPlacement(0xc000877180, 0x1a76153, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/remote-source/app/controllers/storagecluster/placement.go:66 +0x299
github.com/openshift/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).newCephFilesystemInstances(0xc0005ee3c0, 0xc000877180, 0xc00012a008, 0x1ce8cc0, 0xc000401680, 0x0, 0x0)
/remote-source/app/controllers/storagecluster/cephfilesystem.go:42 +0x1fd
github.com/openshift/ocs-operator/controllers/storagecluster.(*ocsCephFilesystems).ensureCreated(0x283e6d0, 0xc0005ee3c0, 0xc000877180, 0x0, 0x0)
/remote-source/app/controllers/storagecluster/cephfilesystem.go:68 +0x85
github.com/openshift/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).reconcilePhases(0xc0005ee3c0, 0xc000877180, 0xc000fcf6c8, 0x11, 0xc000fcf6b0, 0x12, 0x0, 0x0, 0xc000877180, 0x0)
/remote-source/app/controllers/storagecluster/reconcile.go:375 +0xc7f
github.com/openshift/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).Reconcile(0xc0005ee3c0, 0x1cc58d8, 0xc002414030, 0xc000fcf6c8, 0x11, 0xc000fcf6b0, 0x12, 0xc002414000, 0x0, 0x0, ...)
/remote-source/app/controllers/storagecluster/reconcile.go:160 +0x6c5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000178d20, 0x1cc5830, 0xc000d37900, 0x18d32c0, 0xc000af77a0)
/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298 +0x30d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000178d20, 0x1cc5830, 0xc000d37900, 0xc000a75f00)
/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2(0xc0002bd090, 0xc000178d20, 0x1cc5830, 0xc000d37900)
/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214 +0x6b
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:210 +0x425
```
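Until the ocs-operator fix lands, a possible way to avoid the crash, based on the trigger described above (an untested assumption, not a confirmed workaround), would be to pass an explicit podAntiAffinity along with the mds tolerations:

```
# Hypothetical workaround sketch: supplying podAntiAffinity together with the
# mds tolerations so the default that is otherwise dereferenced is not nil.
# The app label and topologyKey below are assumptions for illustration.
placement:
  mds:
    tolerations:
    - effect: NoSchedule
      key: xyz
      operator: Equal
      value: "true"
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: rook-ceph-mds
          topologyKey: kubernetes.io/hostname
```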
We need a fix from the OCS operator, as tracked in the new BZ for applying the tolerations for MDS: https://bugzilla.redhat.com/show_bug.cgi?id=2005937. The CSI driver can have tolerations applied via the operator settings override, as mentioned previously. This issue tracks the fix for merging the tolerations for all other Ceph daemons besides the mds.

@Bipin Anything else needed from this BZ besides QE validation?

The fix can be verified only after BZ #2005937 is fixed.

Tested environment:
-------------------
VMWARE 3M, 3W
Versions:
----------
OCP - 4.9.0-0.nightly-2021-10-01-034521
ODF - odf-operator.v4.9.0-164.ci
Steps Performed :
------------------
1. Tainted all nodes masters and workers with taint 'xyz'
2. Edited storagecluster yaml with below values
```
placement:
  all:
    tolerations:
    - effect: NoSchedule
      key: xyz
      operator: Equal
      value: "true"
    - effect: NoSchedule
      key: node.ocs.openshift.io/storage
      operator: Equal
      value: "true"
  mds:
    tolerations:
    - effect: NoSchedule
      key: xyz
      operator: Equal
      value: "true"
    - effect: NoSchedule
      key: node.ocs.openshift.io/storage
      operator: Equal
      value: "true"
  noobaa-core:
    tolerations:
    - effect: NoSchedule
      key: xyz
      operator: Equal
      value: "true"
    - effect: NoSchedule
      key: node.ocs.openshift.io/storage
      operator: Equal
      value: "true"
  rgw:
    tolerations:
    - effect: NoSchedule
      key: xyz
      operator: Equal
      value: "true"
    - effect: NoSchedule
      key: node.ocs.openshift.io/storage
      operator: Equal
      value: "true"
```
3. Noticed the pods under openshift-storage respin after updating the storagecluster yaml (Expected)
4. Also edited the rook-ceph-operator-config configmap with the below values
```
CSI_PLUGIN_TOLERATIONS: |2-
  - key: node.ocs.openshift.io/storage
    operator: Equal
    value: "true"
    effect: NoSchedule
  - key: xyz
    operator: Equal
    value: "true"
    effect: NoSchedule
CSI_PROVISIONER_TOLERATIONS: |2-
  - key: node.ocs.openshift.io/storage
    operator: Equal
    value: "true"
    effect: NoSchedule
  - key: xyz
    operator: Equal
    value: "true"
    effect: NoSchedule
```
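For context, these keys live under data in the rook-ceph-operator-config ConfigMap in the openshift-storage namespace. A minimal sketch of the edited ConfigMap, assuming the values from the step above:

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-ceph-operator-config
  namespace: openshift-storage
data:
  # Rook reads these keys to render tolerations into the CSI plugin
  # daemonsets and provisioner deployments on its next reconcile.
  CSI_PLUGIN_TOLERATIONS: |2-
    - key: xyz
      operator: Equal
      value: "true"
      effect: NoSchedule
  CSI_PROVISIONER_TOLERATIONS: |2-
    - key: xyz
      operator: Equal
      value: "true"
      effect: NoSchedule
```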
Few points to be noted:
------------------------
1. Before editing the storagecluster yaml, the OCS tolerations were present on almost every pod (except noobaa-operator, odf-console, and odf-operator-controller-manager). But after applying non-OCS tolerations via the storagecluster, the existing OCS toleration was overridden on the pods that fall under the "all" placement group.
>> So the current solution is to apply both the default and the new tolerations in the storagecluster yaml, as shown in step 2.
2. The pods odf-console, odf-operator-controller-manager, and noobaa-operator don't have the OCS tolerations on them by default.
3. Where to set tolerations for a non-OCS taint for the tool-box pod?
4. When setting tolerations under placement "all", the osd-prepare pods are not updated with the 'xyz' toleration.
5. Setting tolerations for the operators is currently unknown (see the Subscription sketch below).
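Regarding point 5: the verification comment later in this thread notes that operator tolerations were eventually added via the subscription. A hedged sketch of how that is typically expressed with OLM's Subscription spec.config (the subscription name here is an assumption):

```
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: odf-operator          # hypothetical subscription name
  namespace: openshift-storage
spec:
  config:
    # OLM propagates these tolerations to the operator deployments it manages.
    tolerations:
    - key: xyz
      operator: Equal
      value: "true"
      effect: NoSchedule
```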
Console o/p
------------
After editing storagecluster yaml below are the pods which are not updated
$ oc get pods -n openshift-storage | grep 4d1h
ocs-metrics-exporter-7566789b65-5n25b 1/1 Running 0 4d1h
ocs-operator-8588554d5-dlrzp 1/1 Running 0 4d1h
odf-console-5c7446d49f-nk7v7 1/1 Running 0 4d1h
odf-operator-controller-manager-67fc478859-fj6rm 2/2 Running 8 (7h41m ago) 4d1h
rook-ceph-operator-749d46bd8-5jbg4 1/1 Running 0 4d1h
rook-ceph-osd-prepare-ocs-deviceset-0-data-05xb9n--1-rvdx7 0/1 Completed 0 4d1h
rook-ceph-osd-prepare-ocs-deviceset-1-data-0hgnq9--1-t5gl4 0/1 Completed 0 4d1h
rook-ceph-osd-prepare-ocs-deviceset-2-data-0sdsd2--1-bckwq 0/1 Completed 0 4d1h
rook-ceph-tools-f57d97cc6-q4thh 1/1 Running 0 4d1h
Storagecluster yaml before edit : http://pastebin.test.redhat.com/998781
Storagecluster yaml after edit : http://pastebin.test.redhat.com/998782
Console o/p command before[1] and after edit[2]:
for i in $(oc get deployment.apps -n openshift-storage|awk '{print$1}') ; do echo $i; echo "==============" ;oc -n openshift-storage get deployment.apps $i -o yaml | grep -v NAME| grep tolerations -A 16 ; done
[1] http://pastebin.test.redhat.com/998783
[2] http://pastebin.test.redhat.com/998784
(In reply to Shrivaibavi Raghaventhiran from comment #57)

> 3. Where to set tolerations for non-ocs taint for tool-box pod ??

I'll check and update for the rook-ceph-operator and the toolbox.

> 5. Setting tolerations for operators is currently unknown

I'll update.

> 1. Tainted all nodes masters and workers with taint 'xyz'

According to this step, I see the `rook-ceph-osd-prepare` pods are running.

> for i in $(oc get deployment.apps -n openshift-storage|awk '{print$1}') ; do echo $i; echo "==============" ;oc -n openshift-storage get deployment.apps $i -o yaml | grep -v NAME| grep tolerations -A 16 ; done

This command covers deployments only, and `osd-prepare` is a Job, so I don't see enough logs to confirm that the toleration wasn't applied to the `rook-ceph-osd-prepare` pod. Can I get the prepare pod yaml and the operator logs to confirm? I have tested multiple times with Rook upstream and the tolerations are applied to the prepare pod and the other pods.

(In reply to Subham Rai from comment #58)

I checked the pod yaml too, but did not see a toleration applied for the 'xyz' taint:

```
tolerations:
- effect: NoSchedule
  key: node.ocs.openshift.io/storage
  operator: Equal
  value: "true"
- effect: NoSchedule
  key: node.ocs.openshift.io/storage
  operator: Equal
  value: "true"
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 300
```

Pod yaml: http://pastebin.test.redhat.com/998865

Let me know if you need a separate BZ to track this. For the other logs, I have sent the live cluster details in gchat.

After talking with Sebastien about how we reconcile `osd-prepare`: it is expected that updating the placement will not update the `osd-prepare` pod, since it is just a Job that has already finished its work. To test this scenario on a live cluster (where a storagecluster already exists), we first have to add a new OSD; the new `osd-prepare` pod will have the latest tolerations, but the older `osd-prepare` pods will not be updated.

Version used:
ODF - 4.9.0-164.ci
OCP - 4.9.0-0.nightly-2021-09-27-105859

Test steps:
-----------
1. Added tolerations for taint 'xyz' in the subscription (for operators), the storagecluster (for OCS pods), and the configmap rook-ceph-operator-config (for the CSI plugin and provisioner pods)
2. Added capacity by editing the storagecluster (increasing the count to 2 in StorageDeviceSets)
3. Respinned the operator pods and other pods
4. Rebooted the nodes one by one

All the above test steps passed and no issues were noticed. The newly added OSDs also had the tolerations and were up and running, and node reboots did not cause any issue; all tolerations remained intact. With the above verifications, moving the BZ to VERIFIED.

The toolbox was in the Pending state because it did not have any tolerations; this will be tracked in a separate BZ.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:5086