Description of problem (please be as detailed as possible and provide log snippets):

Use case: allow OCS pods to run on worker nodes that have some non-OCS taints.

OCS panics when tolerations are added for the `mds` component in the StorageCluster yaml to allow OCS pods to run on worker nodes that have some non-OCS taints. For example, the below spec is added to the StorageCluster yaml:

```
placement:
  all:
    tolerations:
    - effect: NoSchedule
      key: xyz
      operator: Equal
      value: "true"
  mds:
    tolerations:
    - effect: NoSchedule
      key: xyz
      operator: Equal
      value: "true"
```

This causes a nil pointer exception while ensuring the CephFilesystem in ocs-operator (https://github.com/red-hat-storage/ocs-operator/blob/32158124bba496f625d6c2f01c31affde8713fa7/controllers/storagecluster/placement.go#L66).

Error:
```
{"level":"info","ts":1631864507.7231574,"logger":"controllers.StorageCluster","msg":"Adding topology label from Node.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","Node":"ip-10-0-175-201.ec2.internal","Label":"failure-domain.beta.kubernetes.io/zone","Value":"us-east-1c"}
{"level":"info","ts":1631864507.7231665,"logger":"controllers.StorageCluster","msg":"Adding topology label from Node.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","Node":"ip-10-0-139-228.ec2.internal","Label":"failure-domain.beta.kubernetes.io/zone","Value":"us-east-1a"}
{"level":"info","ts":1631864507.7235525,"logger":"controllers.StorageCluster","msg":"Restoring original CephBlockPool.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","CephBlockPool":"openshift-storage/ocs-storagecluster-cephblockpool"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x163b1b9]

goroutine 1200 [running]:
github.com/openshift/ocs-operator/controllers/storagecluster.getPlacement(0xc000877180, 0x1a76153, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/remote-source/app/controllers/storagecluster/placement.go:66 +0x299
github.com/openshift/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).newCephFilesystemInstances(0xc0005ee3c0, 0xc000877180, 0xc00012a008, 0x1ce8cc0, 0xc000401680, 0x0, 0x0)
	/remote-source/app/controllers/storagecluster/cephfilesystem.go:42 +0x1fd
github.com/openshift/ocs-operator/controllers/storagecluster.(*ocsCephFilesystems).ensureCreated(0x283e6d0, 0xc0005ee3c0, 0xc000877180, 0x0, 0x0)
	/remote-source/app/controllers/storagecluster/cephfilesystem.go:68 +0x85
github.com/openshift/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).reconcilePhases(0xc0005ee3c0, 0xc000877180, 0xc000fcf6c8, 0x11, 0xc000fcf6b0, 0x12, 0x0, 0x0, 0xc000877180, 0x0)
	/remote-source/app/controllers/storagecluster/reconcile.go:375 +0xc7f
github.com/openshift/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).Reconcile(0xc0005ee3c0, 0x1cc58d8, 0xc002414030, 0xc000fcf6c8, 0x11, 0xc000fcf6b0, 0x12, 0xc002414000, 0x0, 0x0, ...)
	/remote-source/app/controllers/storagecluster/reconcile.go:160 +0x6c5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000178d20, 0x1cc5830, 0xc000d37900, 0x18d32c0, 0xc000af77a0)
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298 +0x30d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000178d20, 0x1cc5830, 0xc000d37900, 0xc000a75f00)
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2(0xc0002bd090, 0xc000178d20, 0x1cc5830, 0xc000d37900)
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214 +0x6b
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:210 +0x425
```

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Set up an OCS/ODF cluster.
2. Add some non-OCS taints to the nodes.
3. Update the StorageCluster yaml to add tolerations for the taints added in step 2, for example:
```
placement:
  all:
    tolerations:
    - effect: NoSchedule
      key: xyz
      operator: Equal
      value: "true"
  mds:
    tolerations:
    - effect: NoSchedule
      key: xyz
      operator: Equal
      value: "true"
```

Actual results:
OCS panics, the toleration is not passed to the CephFilesystem resource, and the MDS pods remain in Pending state.

Expected results:
OCS should update the CephFilesystem resource with the correct tolerations.

Additional info:
See BZ https://bugzilla.redhat.com/show_bug.cgi?id=1992472 for additional information.
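For illustration only, below is a minimal Go sketch of the kind of nil pointer dereference that can produce a SIGSEGV like the one above. This is not the actual ocs-operator code; the `Placement` struct and the function names are hypothetical. The point it demonstrates: when a component such as `mds` overrides only `tolerations`, code that copies node-affinity settings from a default placement without a nil check will panic.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// Placement loosely mirrors the shape of a rook-style placement for
// illustration; it is NOT the actual ocs-operator type.
type Placement struct {
	Tolerations  []corev1.Toleration
	NodeAffinity *corev1.NodeAffinity
}

// getPlacementUnsafe shows the failure mode: when only tolerations are set,
// defaults.NodeAffinity may be nil, and reading a field through it panics
// with "invalid memory address or nil pointer dereference".
func getPlacementUnsafe(defaults, override Placement) Placement {
	out := override
	out.NodeAffinity = &corev1.NodeAffinity{
		// Panics if defaults.NodeAffinity is nil.
		RequiredDuringSchedulingIgnoredDuringExecution: defaults.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution,
	}
	return out
}

// getPlacementSafe is the guarded variant: the default node affinity is only
// copied when it is actually set.
func getPlacementSafe(defaults, override Placement) Placement {
	out := override
	if defaults.NodeAffinity != nil {
		out.NodeAffinity = defaults.NodeAffinity.DeepCopy()
	}
	return out
}

func main() {
	defaults := Placement{} // no node affinity configured
	override := Placement{
		Tolerations: []corev1.Toleration{{
			Key:      "xyz",
			Operator: corev1.TolerationOpEqual,
			Value:    "true",
			Effect:   corev1.TaintEffectNoSchedule,
		}},
	}

	fmt.Printf("safe result: %+v\n", getPlacementSafe(defaults, override))
	// getPlacementUnsafe(defaults, override) // would panic: nil pointer dereference
}
```

The actual fix in the operator may differ; the sketch only shows why supplying per-component tolerations, with no node affinity configured, can hit an unguarded pointer during placement merging.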
QE will verify this bug using the following procedure, suggested during today's bug triage meeting:
- set up an OCS/ODF cluster
- update the StorageCluster CR to add tolerations for taints as shown in the bug description

QE doesn't need to configure taints on the nodes in any way to reproduce the problem, since the operator used to crash whenever placement tolerations like the ones shown in the bug description were provided in the StorageCluster yaml.
Tested environment:
-------------------
VMWARE 3M, 3W

Versions:
---------
OCP - 4.9.0-0.nightly-2021-10-01-034521
ODF - odf-operator.v4.9.0-164.ci

Steps Performed:
----------------
1. Tainted all nodes (masters and workers) with the taint 'xyz'.
2. Edited the StorageCluster yaml with the values below:
```
mds:
  tolerations:
  - effect: NoSchedule
    key: xyz
    operator: Equal
    value: "true"
  - effect: NoSchedule
    key: node.ocs.openshift.io/storage
    operator: Equal
    value: "true"
rgw:
  tolerations:
  - effect: NoSchedule
    key: xyz
    operator: Equal
    value: "true"
  - effect: NoSchedule
    key: node.ocs.openshift.io/storage
    operator: Equal
    value: "true"
```
3. Noticed the pods under openshift-storage respin after updating the StorageCluster yaml (expected).
4. The respun MDS and RGW pods came back up and reached Running state.

No issues were seen in the above pods. The tolerations work fine for the MDS and RGW pods, hence moving the BZ to verified state.
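As a supplementary check (not part of the original verification steps), the tolerations that the operator propagates to the generated CephFilesystem can also be inspected programmatically. A minimal sketch using the client-go dynamic client follows; the resource name `ocs-storagecluster-cephfilesystem`, the `openshift-storage` namespace, and the field path `spec.metadataServer.placement.tolerations` are assumptions based on a default deployment:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// CephFilesystem GVR and the default resource created by ocs-operator
	// (name and namespace are assumptions based on a default deployment).
	gvr := schema.GroupVersionResource{Group: "ceph.rook.io", Version: "v1", Resource: "cephfilesystems"}
	fs, err := client.Resource(gvr).Namespace("openshift-storage").
		Get(context.TODO(), "ocs-storagecluster-cephfilesystem", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// The MDS placement is expected under spec.metadataServer.placement.tolerations.
	tolerations, found, err := unstructured.NestedSlice(fs.Object, "spec", "metadataServer", "placement", "tolerations")
	if err != nil || !found {
		fmt.Println("no tolerations found on the CephFilesystem MDS placement")
		return
	}
	fmt.Printf("CephFilesystem MDS tolerations: %v\n", tolerations)
}
```

The same information can also be checked by fetching the CephFilesystem CR with `oc` in the openshift-storage namespace and inspecting the MDS placement section of its spec.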
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:5086