Bug 2005937

Summary: Not able to add toleration for MDS pods via StorageCluster yaml
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Santosh Pillai <sapillai>
Component: ocs-operator
Assignee: Subham Rai <srai>
Status: CLOSED ERRATA
QA Contact: Shrivaibavi Raghaventhiran <sraghave>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.9
CC: bkunal, djuran, madam, mbukatov, muagarwa, nberry, ocs-bugs, odf-bz-bot, sostapov, srai
Target Milestone: ---   
Target Release: ODF 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: v4.9.0-164.ci
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Clones: 2005970 (view as bug list)
Environment:
Last Closed: 2021-12-13 17:46:30 UTC
Type: Bug
Embargoed:
Bug Depends On:    
Bug Blocks: 1992472, 2005970    

Description Santosh Pillai 2021-09-20 14:04:41 UTC
Description of problem (please be as detailed as possible and provide log snippets):

Use case: allow OCS pods to run on worker nodes that carry some non-OCS taints.

The ocs-operator panics when a toleration is added for the `mds` component in the StorageCluster yaml so that OCS pods can run on worker nodes that have non-OCS taints.

For example, the following placement spec is added to the StorageCluster yaml:

```
placement:
    all:
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"
    mds:
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"
```


This causes a nil pointer dereference while ensuring the CephFilesystem resource in ocs-operator (https://github.com/red-hat-storage/ocs-operator/blob/32158124bba496f625d6c2f01c31affde8713fa7/controllers/storagecluster/placement.go#L66).
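
For context, getPlacement is expected to merge the `mds` placement from the StorageCluster with the operator defaults before it is applied to the CephFilesystem. A rough sketch of the merged result is below; it assumes the usual default OCS toleration (node.ocs.openshift.io/storage) that the operator adds for its components, so treat it as illustrative rather than actual operator output:

```
# Illustrative merged mds placement (default OCS toleration plus the custom one)
tolerations:
- effect: NoSchedule
  key: node.ocs.openshift.io/storage
  operator: Equal
  value: "true"
- effect: NoSchedule
  key: xyz
  operator: Equal
  value: "true"
```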


Error:

```
{"level":"info","ts":1631864507.7231574,"logger":"controllers.StorageCluster","msg":"Adding topology label from Node.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","Node":"ip-10-0-175-201.ec2.internal","Label":"failure-domain.beta.kubernetes.io/zone","Value":"us-east-1c"}
{"level":"info","ts":1631864507.7231665,"logger":"controllers.StorageCluster","msg":"Adding topology label from Node.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","Node":"ip-10-0-139-228.ec2.internal","Label":"failure-domain.beta.kubernetes.io/zone","Value":"us-east-1a"}
{"level":"info","ts":1631864507.7235525,"logger":"controllers.StorageCluster","msg":"Restoring original CephBlockPool.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","CephBlockPool":"openshift-storage/ocs-storagecluster-cephblockpool"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x163b1b9]

goroutine 1200 [running]:
github.com/openshift/ocs-operator/controllers/storagecluster.getPlacement(0xc000877180, 0x1a76153, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/remote-source/app/controllers/storagecluster/placement.go:66 +0x299
github.com/openshift/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).newCephFilesystemInstances(0xc0005ee3c0, 0xc000877180, 0xc00012a008, 0x1ce8cc0, 0xc000401680, 0x0, 0x0)
	/remote-source/app/controllers/storagecluster/cephfilesystem.go:42 +0x1fd
github.com/openshift/ocs-operator/controllers/storagecluster.(*ocsCephFilesystems).ensureCreated(0x283e6d0, 0xc0005ee3c0, 0xc000877180, 0x0, 0x0)
	/remote-source/app/controllers/storagecluster/cephfilesystem.go:68 +0x85
github.com/openshift/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).reconcilePhases(0xc0005ee3c0, 0xc000877180, 0xc000fcf6c8, 0x11, 0xc000fcf6b0, 0x12, 0x0, 0x0, 0xc000877180, 0x0)
	/remote-source/app/controllers/storagecluster/reconcile.go:375 +0xc7f
github.com/openshift/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).Reconcile(0xc0005ee3c0, 0x1cc58d8, 0xc002414030, 0xc000fcf6c8, 0x11, 0xc000fcf6b0, 0x12, 0xc002414000, 0x0, 0x0, ...)
	/remote-source/app/controllers/storagecluster/reconcile.go:160 +0x6c5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000178d20, 0x1cc5830, 0xc000d37900, 0x18d32c0, 0xc000af77a0)
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298 +0x30d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000178d20, 0x1cc5830, 0xc000d37900, 0xc000a75f00)
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2(0xc0002bd090, 0xc000178d20, 0x1cc5830, 0xc000d37900)
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214 +0x6b
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:210 +0x425

```



Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Set up an OCS/ODF cluster.
2. Add some non-OCS taints to the nodes (see the taint sketch after the example below).
3. Update the StorageCluster yaml to add tolerations for the taints added in step 2.
4. For example:
```
placement:
    all:
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"
    mds:
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"
```
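
As mentioned in step 2, a matching non-OCS taint on a worker node would look roughly like the sketch below (the key `xyz` and value `"true"` are the same placeholders used in the toleration example above, not values from a real cluster):

```
# Sketch: a non-OCS taint on a Node object that the toleration above matches
spec:
  taints:
  - key: xyz
    value: "true"
    effect: NoSchedule
```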



Actual results: The ocs-operator panics and does not pass the toleration to the CephFilesystem resource, so the MDS pods remain in the Pending state.


Expected results: The ocs-operator should update the CephFilesystem resource with the correct tolerations.
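
For reference, a sketch of where the toleration is expected to land on the generated CephFilesystem CR is shown below. The resource name and the metadataServer placement path reflect the usual ODF/Rook defaults and are given for illustration, not copied from an actual cluster:

```
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: ocs-storagecluster-cephfilesystem
  namespace: openshift-storage
spec:
  metadataServer:
    placement:
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"
```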


Additional info: see BZ https://bugzilla.redhat.com/show_bug.cgi?id=1992472 for more details.

Comment 6 Martin Bukatovic 2021-09-23 15:57:55 UTC
QE will verify this bug using the following procedure, suggested during today's bug triage meeting:

- setup OCS/ODF cluster
- update StorageCluster CR to add toleration for taints as shown in the bug description

QE does not need to configure taints on the nodes to reproduce the problem, since the
operator used to crash as soon as placement tolerations like the ones shown in the bug
description were provided in the StorageCluster yaml.

Comment 18 Shrivaibavi Raghaventhiran 2021-10-06 13:18:13 UTC
Tested environment:
-------------------
VMware, 3 masters and 3 workers

Versions:
----------
OCP - 4.9.0-0.nightly-2021-10-01-034521
ODF - odf-operator.v4.9.0-164.ci

Steps Performed:
------------------
1. Tainted all nodes (masters and workers) with the taint 'xyz'

2. Edited the StorageCluster yaml with the values below:

```
    mds:
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"
      - effect: NoSchedule
        key: node.ocs.openshift.io/storage
        operator: Equal
        value: "true"

    rgw:
      tolerations:
      - effect: NoSchedule
        key: xyz
        operator: Equal
        value: "true"
      - effect: NoSchedule
        key: node.ocs.openshift.io/storage
        operator: Equal
        value: "true"
```

3. Noticed that the pods under openshift-storage respin after the StorageCluster yaml is updated (expected)

4. Respun the MDS and RGW pods; they came up and reached the Running state

No issues were seen in the above-mentioned pods.

It is working fine for the MDS and RGW pods, hence moving the BZ to the verified state.

Comment 21 errata-xmlrpc 2021-12-13 17:46:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0
enhancement, security, and bug fix update), and where to find the updated files,
follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:5086