Bug 2005937 - Not able to add toleration for MDS pods via StorageCluster yaml
Summary: Not able to add toleration for MDS pods via StorageCluster yaml
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ODF 4.9.0
Assignee: Subham Rai
QA Contact: Shrivaibavi Raghaventhiran
URL:
Whiteboard:
Depends On:
Blocks: 1992472 2005970
 
Reported: 2021-09-20 14:04 UTC by Santosh Pillai
Modified: 2023-08-09 17:00 UTC
CC: 10 users

Fixed In Version: v4.9.0-164.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2005970 (view as bug list)
Environment:
Last Closed: 2021-12-13 17:46:30 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage ocs-operator pull 1350 0 None open add check for PodAntiAffinity not nil 2021-09-28 06:17:06 UTC
Github red-hat-storage ocs-operator pull 1351 0 None open Bug 2005937: [release-4.9] add check for PodAntiAffinity not nil 2021-09-28 15:46:06 UTC
Red Hat Product Errata RHSA-2021:5086 0 None None None 2021-12-13 17:46:50 UTC

Description Santosh Pillai 2021-09-20 14:04:41 UTC
Description of problem (please be as detailed as possible and provide log snippets):

Use case: allow OCS pods to run on worker nodes that carry some non-OCS taints.

The OCS operator panics when tolerations for the `mds` component are added in the StorageCluster yaml to allow OCS pods to run on such tainted worker nodes.

For example, the following spec is added to the StorageCluster yaml:

```
placement:
    all:
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"
    mds:
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"
```


This causes a nil pointer dereference while ensuring the CephFilesystem in ocs-operator (https://github.com/red-hat-storage/ocs-operator/blob/32158124bba496f625d6c2f01c31affde8713fa7/controllers/storagecluster/placement.go#L66).


Error:

```
{"level":"info","ts":1631864507.7231574,"logger":"controllers.StorageCluster","msg":"Adding topology label from Node.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","Node":"ip-10-0-175-201.ec2.internal","Label":"failure-domain.beta.kubernetes.io/zone","Value":"us-east-1c"}
{"level":"info","ts":1631864507.7231665,"logger":"controllers.StorageCluster","msg":"Adding topology label from Node.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","Node":"ip-10-0-139-228.ec2.internal","Label":"failure-domain.beta.kubernetes.io/zone","Value":"us-east-1a"}
{"level":"info","ts":1631864507.7235525,"logger":"controllers.StorageCluster","msg":"Restoring original CephBlockPool.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","CephBlockPool":"openshift-storage/ocs-storagecluster-cephblockpool"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x163b1b9]

goroutine 1200 [running]:
github.com/openshift/ocs-operator/controllers/storagecluster.getPlacement(0xc000877180, 0x1a76153, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/remote-source/app/controllers/storagecluster/placement.go:66 +0x299
github.com/openshift/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).newCephFilesystemInstances(0xc0005ee3c0, 0xc000877180, 0xc00012a008, 0x1ce8cc0, 0xc000401680, 0x0, 0x0)
	/remote-source/app/controllers/storagecluster/cephfilesystem.go:42 +0x1fd
github.com/openshift/ocs-operator/controllers/storagecluster.(*ocsCephFilesystems).ensureCreated(0x283e6d0, 0xc0005ee3c0, 0xc000877180, 0x0, 0x0)
	/remote-source/app/controllers/storagecluster/cephfilesystem.go:68 +0x85
github.com/openshift/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).reconcilePhases(0xc0005ee3c0, 0xc000877180, 0xc000fcf6c8, 0x11, 0xc000fcf6b0, 0x12, 0x0, 0x0, 0xc000877180, 0x0)
	/remote-source/app/controllers/storagecluster/reconcile.go:375 +0xc7f
github.com/openshift/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).Reconcile(0xc0005ee3c0, 0x1cc58d8, 0xc002414030, 0xc000fcf6c8, 0x11, 0xc000fcf6b0, 0x12, 0xc002414000, 0x0, 0x0, ...)
	/remote-source/app/controllers/storagecluster/reconcile.go:160 +0x6c5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000178d20, 0x1cc5830, 0xc000d37900, 0x18d32c0, 0xc000af77a0)
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298 +0x30d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000178d20, 0x1cc5830, 0xc000d37900, 0xc000a75f00)
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2(0xc0002bd090, 0xc000178d20, 0x1cc5830, 0xc000d37900)
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214 +0x6b
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:210 +0x425

```
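
The linked PR for this bug is titled "add check for PodAntiAffinity not nil", which points at the missing guard in the placement merge logic. Below is a minimal, self-contained Go sketch of that pattern, not the actual ocs-operator code: the `Placement` type and `mergePodAntiAffinity` function are hypothetical stand-ins for the real rook/ocs types, but they show how a component placement built only from user tolerations leaves `PodAntiAffinity` nil, and why a nil check is needed before appending the default anti-affinity terms.

```
// Minimal sketch (assumption, not the real ocs-operator code) of the failure:
// a component placement built only from user tolerations has a nil
// PodAntiAffinity, and merging the default anti-affinity into it without a
// nil check dereferences that nil pointer.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// Placement mirrors only the fields relevant here (hypothetical type).
type Placement struct {
	Tolerations     []corev1.Toleration
	PodAntiAffinity *corev1.PodAntiAffinity
}

// mergePodAntiAffinity (hypothetical name) appends the default required
// anti-affinity terms onto the custom placement. The nil checks are the kind
// of guard the linked PR introduces.
func mergePodAntiAffinity(custom, defaults *Placement) {
	if defaults == nil || defaults.PodAntiAffinity == nil {
		return
	}
	// Without this check, reading custom.PodAntiAffinity below panics with
	// "invalid memory address or nil pointer dereference".
	if custom.PodAntiAffinity == nil {
		custom.PodAntiAffinity = &corev1.PodAntiAffinity{}
	}
	custom.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution = append(
		custom.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution,
		defaults.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution...)
}

func main() {
	// mds placement from the StorageCluster spec above: tolerations only,
	// so Affinity/PodAntiAffinity are left nil.
	mds := &Placement{
		Tolerations: []corev1.Toleration{{
			Key:      "xyz",
			Operator: corev1.TolerationOpEqual,
			Value:    "true",
			Effect:   corev1.TaintEffectNoSchedule,
		}},
	}
	defaults := &Placement{PodAntiAffinity: &corev1.PodAntiAffinity{}}

	mergePodAntiAffinity(mds, defaults)
	fmt.Printf("merged mds placement: %+v\n", mds)
}
```

With a guard like this in place, the tolerations-only mds placement from the spec above merges cleanly instead of crashing the reconcile.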



Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Set up an OCS/ODF cluster.
2. Add some non-OCS taints to the nodes.
3. Update the StorageCluster yaml to add tolerations for the taints added in step 2.
4. For example:
 ```
   placement:
    all:
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"
    mds:
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"
  ```



Actual results: The OCS operator panics, the toleration is not passed to the CephFilesystem resource, and the MDS pods remain in Pending state.


Expected results: The OCS operator should update the CephFilesystem resource with the correct tolerations.


Additional info: see BZ 1992472 (https://bugzilla.redhat.com/show_bug.cgi?id=1992472) for additional details.

Comment 6 Martin Bukatovic 2021-09-23 15:57:55 UTC
QE will verify this bug using the following procedure, suggested during today's bug triage meeting:

- Set up an OCS/ODF cluster.
- Update the StorageCluster CR to add tolerations for taints as shown in the bug description.

QE does not need to configure taints on the nodes to reproduce the problem, since the
operator used to crash whenever placement tolerations like those shown in the bug
description were provided in the StorageCluster yaml.
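
The reasoning above can be illustrated with a small, self-contained Go snippet (a hypothetical illustration, not QE's actual test): dereferencing a nil *corev1.PodAntiAffinity panics on its own, so node taints play no part in triggering the crash; a tolerations-only placement in the spec is enough.

```
// Hypothetical illustration: the crash needs no node taints, only a nil
// PodAntiAffinity, which is exactly what a tolerations-only placement has.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// As in a tolerations-only mds placement, no anti-affinity was set.
	var antiAffinity *corev1.PodAntiAffinity

	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered from:", r) // invalid memory address or nil pointer dereference
		}
	}()

	// Reading a field through the nil pointer panics regardless of any
	// taints or tolerations configured on the cluster nodes.
	fmt.Println(len(antiAffinity.RequiredDuringSchedulingIgnoredDuringExecution))
}
```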

Comment 18 Shrivaibavi Raghaventhiran 2021-10-06 13:18:13 UTC
Tested environment:
-------------------
VMware, 3 masters and 3 workers (3M, 3W)

Versions:
----------
OCP - 4.9.0-0.nightly-2021-10-01-034521
ODF - odf-operator.v4.9.0-164.ci

Steps performed:
------------------
1. Tainted all nodes (masters and workers) with the taint 'xyz'.

2. Edited the StorageCluster yaml with the values below:

    mds:
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"
      - effect: NoSchedule
        key: node.ocs.openshift.io/storage
        operator: Equal
        value: "true"

    rgw:
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"
      - effect: NoSchedule
        key: node.ocs.openshift.io/storage
        operator: Equal
        value: "true"

3. Observed the pods under openshift-storage respin after the StorageCluster yaml was updated (expected).

4. Respun the MDS and RGW pods; they came back up in the Running state.

No issues were seen in the above-mentioned pods.

It is working fine for the MDS and RGW pods, hence moving the BZ to the Verified state.

Comment 21 errata-xmlrpc 2021-12-13 17:46:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:5086

