Bug 1930466

Summary:	OCS-Operator waiting on ceph cluster to initialize before starting noobaa.
Product:	[Red Hat Storage] Red Hat OpenShift Container Storage	Reporter:	scott2
Component:	ocs-operator	Assignee:	umanga <uchapaga>
Status:	CLOSED NEXTRELEASE	QA Contact:	Raz Tamir <ratamir>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	4.6	CC:	madam, muagarwa, ocs-bugs, scott2, sostapov, uchapaga
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-06-01 10:14:22 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description scott2 2021-02-18 23:29:23 UTC

Description of problem (please be as detailed as possible and provide log
snippets):

After 2 out of the 3 original OCS nodes failed, I followed the directions listed below to remove the failed nodes and add new nodes:

https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.6/html/scaling_storage/scaling-out-storage-capacity_rhocs#adding-a-node-using-a-local-storage-device_rhocs

At the time of failure, the ocs-operator pod was running on one of the failed nodes, so the ocs-operator deployment tried to reschedule it to a healthy node. I had to delete the configmap lock that was preventing it from being rescheduled, but I am still unable to get the ocs-operator pod to become ready.

# ocs-operator logs
{"level":"info","ts":"2021-02-18T23:06:19.395Z","logger":"cmd","msg":"Go Version: go1.15.5"}
{"level":"info","ts":"2021-02-18T23:06:19.396Z","logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":"2021-02-18T23:06:19.396Z","logger":"cmd","msg":"operator-sdk Version: v0.17.0"}
{"level":"info","ts":"2021-02-18T23:06:19.396Z","logger":"cmd","msg":"Running in development mode: false"}
{"level":"info","ts":"2021-02-18T23:06:19.396Z","logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":"2021-02-18T23:06:22.217Z","logger":"leader","msg":"No pre-existing lock was found."}
{"level":"info","ts":"2021-02-18T23:06:22.222Z","logger":"leader","msg":"Became the leader."}
{"level":"info","ts":"2021-02-18T23:06:25.032Z","logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":"2021-02-18T23:06:25.032Z","logger":"cmd","msg":"Registering Components."}
{"level":"info","ts":"2021-02-18T23:06:25.043Z","logger":"cmd","msg":"OCSInitialization resource already exists"}
{"level":"info","ts":"2021-02-18T23:06:25.043Z","logger":"cmd","msg":"Starting the Cmd."}
{"level":"info","ts":"2021-02-18T23:06:25.044Z","logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
{"level":"info","ts":"2021-02-18T23:06:25.044Z","logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"storagecluster-controller","source":"kind source: /, Kind="}
{"level":"info","ts":"2021-02-18T23:06:25.044Z","logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"ocsinitialization-controller","source":"kind source: /, Kind="}
{"level":"info","ts":"2021-02-18T23:06:25.044Z","logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"persistentvolume-controller","source":"kind source: /, Kind="}
{"level":"info","ts":"2021-02-18T23:06:25.144Z","logger":"controller-runtime.controller","msg":"Starting Controller","controller":"persistentvolume-controller"}
{"level":"info","ts":"2021-02-18T23:06:25.245Z","logger":"controller-runtime.controller","msg":"Starting workers","controller":"persistentvolume-controller","worker count":1}
{"level":"info","ts":"2021-02-18T23:06:25.245Z","logger":"controller-runtime.controller","msg":"Starting Controller","controller":"ocsinitialization-controller"}
{"level":"info","ts":"2021-02-18T23:06:25.345Z","logger":"controller-runtime.controller","msg":"Starting workers","controller":"ocsinitialization-controller","worker count":1}
{"level":"info","ts":"2021-02-18T23:06:25.345Z","logger":"controller_ocsinitialization","msg":"Reconciling OCSInitialization","Request.Namespace":"openshift-storage","Request.Name":"ocsinit"}
{"level":"info","ts":"2021-02-18T23:06:25.349Z","logger":"controller_ocsinitialization","msg":"Updating rook-ceph SecurityContextConstraint","Request.Namespace":"openshift-storage","Request.Name":"ocsinit"}
{"level":"info","ts":"2021-02-18T23:06:25.360Z","logger":"controller_ocsinitialization","msg":"Updating rook-ceph-csi SecurityContextConstraint","Request.Namespace":"openshift-storage","Request.Name":"ocsinit"}
{"level":"info","ts":"2021-02-18T23:06:25.371Z","logger":"controller_ocsinitialization","msg":"Updating noobaa SecurityContextConstraint","Request.Namespace":"openshift-storage","Request.Name":"ocsinit"}
{"level":"info","ts":"2021-02-18T23:06:25.397Z","logger":"controller_ocsinitialization","msg":"Reconciling OCSInitialization","Request.Namespace":"openshift-storage","Request.Name":"ocsinit"}
{"level":"info","ts":"2021-02-18T23:06:25.399Z","logger":"controller_ocsinitialization","msg":"Updating rook-ceph SecurityContextConstraint","Request.Namespace":"openshift-storage","Request.Name":"ocsinit"}
{"level":"info","ts":"2021-02-18T23:06:25.410Z","logger":"controller_ocsinitialization","msg":"Updating rook-ceph-csi SecurityContextConstraint","Request.Namespace":"openshift-storage","Request.Name":"ocsinit"}
{"level":"info","ts":"2021-02-18T23:06:25.417Z","logger":"controller_ocsinitialization","msg":"Updating noobaa SecurityContextConstraint","Request.Namespace":"openshift-storage","Request.Name":"ocsinit"}
{"level":"info","ts":"2021-02-18T23:06:25.445Z","logger":"controller-runtime.controller","msg":"Starting Controller","controller":"storagecluster-controller"}
{"level":"info","ts":"2021-02-18T23:06:25.445Z","logger":"controller-runtime.controller","msg":"Starting workers","controller":"storagecluster-controller","worker count":1}
{"level":"info","ts":"2021-02-18T23:06:25.445Z","logger":"controller_storagecluster","msg":"Reconciling StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2021-02-18T23:06:25.876Z","logger":"controller_storagecluster","msg":"Restoring original cephObjectStore ocs-storagecluster-cephobjectstore","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2021-02-18T23:06:26.006Z","logger":"controller_storagecluster","msg":"Restoring original cephObjectStoreUser ocs-storagecluster-cephobjectstoreuser","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2021-02-18T23:06:26.120Z","logger":"controller_storagecluster","msg":"Restoring original cephBlockPool ocs-storagecluster-cephblockpool","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2021-02-18T23:06:26.234Z","logger":"controller_storagecluster","msg":"Restoring original cephFilesystem ocs-storagecluster-cephfilesystem","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2021-02-18T23:06:26.368Z","logger":"controller_storagecluster","msg":"Waiting on ceph cluster to initialize before starting noobaa","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2021-02-18T23:06:26.503Z","logger":"controller_storagecluster","msg":"Reconciling metrics exporter service","NamespacedName":{"namespace":"openshift-storage","name":"ocs-metrics-exporter"}}
{"level":"info","ts":"2021-02-18T23:06:26.610Z","logger":"controller_storagecluster","msg":"Reconciling metrics exporter service monitor","NamespacedName":{"namespace":"openshift-storage","name":"ocs-metrics-exporter"}}
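In a long operator log like the one above, the relevant entry can be pulled out mechanically. The following is a minimal sketch, assuming the log is one JSON object per line with the `ts`, `logger` and `msg` keys seen above; the function name `find_messages` is hypothetical and not part of any OCS tooling.

```python
import json

def find_messages(log_text, logger, needle):
    """Return (ts, msg) pairs for entries from `logger` whose msg contains `needle`."""
    hits = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip blank lines and headers such as "# ocs-operator logs"
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or malformed lines
        msg = entry.get("msg", "")
        if entry.get("logger") == logger and needle in msg:
            hits.append((entry.get("ts"), msg))
    return hits

# Two entries copied from the log above.
sample = "\n".join([
    '{"level":"info","ts":"2021-02-18T23:06:25.445Z","logger":"controller_storagecluster","msg":"Reconciling StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}',
    '{"level":"info","ts":"2021-02-18T23:06:26.368Z","logger":"controller_storagecluster","msg":"Waiting on ceph cluster to initialize before starting noobaa","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}',
])

for ts, msg in find_messages(sample, "controller_storagecluster", "Waiting on ceph cluster"):
    print(ts, msg)
```

Filtering on the logger name as well as the message keeps entries from the ocsinitialization controller out of the result.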



Version of all relevant components (if applicable): 
OCS 4.6.2
OCP 4.6.15


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?  
No storage is available for the OCP cluster.


Is there any workaround available to the best of your knowledge?  
No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2


Is this issue reproducible?
Yes


Can this issue reproduce from the UI?
Yes


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy OCS
2. Kill two RHCOS VMs running the OCS nodes
3. Remove the ocs-operator configmap lock so that ocs-operator can be rescheduled to a healthy node


Actual results:
ocs-operator pod never obtains ready status


Expected results:
ocs-operator pod obtains ready status so that it can properly fix ceph-mon-* and ceph-osd-*


Additional info:

Comment 2 scott2 2021-02-18 23:36:05 UTC

I'll add that the rook-ceph-mon-a/b are in pending status (as they were on the failed nodes).  rook-ceph-mon-c is in healthy status.

rook-ceph-mon-a-6bbfbf5999-ddmgm                                  0/1     Pending            0          143m
rook-ceph-mon-b-755568566d-4sgsj                                  0/1     Pending            0          24h
rook-ceph-mon-c-5cf658f954-mfr2r                                  1/1     Running            13         14d
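A minimal sketch of picking the non-running pods out of a listing like the one above, assuming the whitespace-separated NAME, READY, STATUS, RESTARTS, AGE column layout shown there; the helper name `pods_not_running` is hypothetical.

```python
def pods_not_running(listing):
    """Return (name, status) for every pod line whose STATUS column is not "Running"."""
    result = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] == "NAME":
            continue  # skip blank lines, short lines, and an optional header row
        name, status = fields[0], fields[2]
        if status != "Running":
            result.append((name, status))
    return result

# The three lines from the listing above.
listing = """
rook-ceph-mon-a-6bbfbf5999-ddmgm   0/1   Pending   0    143m
rook-ceph-mon-b-755568566d-4sgsj   0/1   Pending   0    24h
rook-ceph-mon-c-5cf658f954-mfr2r   1/1   Running   13   14d
"""

for name, status in pods_not_running(listing):
    print(name, status)
```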


But normally the nodeSelector on those deployments is managed by ocs-operator...which is waiting for ceph to finish initializing...

This is circular...

Comment 3 Jose A. Rivera 2021-02-23 16:58:48 UTC

This is all expected behavior.

The whole ConfigMap lock deletion thing is a known problem that will be resolved for OCS 4.7 (it's out of our control: it's part of the framework we're using, and we're upgrading it for OCS 4.7).

As far as the ocs-operator not being Ready, this is intentional as the operator should not report Ready until all StorageClusters (and their components) are healthy. And indeed, NooBaa should not be created until the CephCluster is healthy, since NooBaa relies on a Ceph volume for its operation. You should inspect the CephCluster CR and the rook-ceph-operator logs to determine what is actually going on.
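As a rough sketch of that inspection step: the output of something like `oc get cephcluster -n openshift-storage -o json` can be summarized in a few lines of Python. This assumes the CephCluster status exposes `status.phase` and `status.ceph.health`, which should be verified against the Rook version in use. The function name `summarize_cephclusters` is hypothetical, and the sample document holds illustrative values only, not data from this cluster.

```python
def summarize_cephclusters(doc):
    """Return (name, phase, health) per CephCluster in a parsed `-o json` list document."""
    rows = []
    for item in doc.get("items", []):
        status = item.get("status") or {}
        ceph = status.get("ceph") or {}
        rows.append((
            item.get("metadata", {}).get("name", "<unknown>"),
            status.get("phase", "<unset>"),   # assumed field: status.phase
            ceph.get("health", "<unset>"),    # assumed field: status.ceph.health
        ))
    return rows

# Illustrative document only; the values are made up for the example.
example_doc = {
    "items": [
        {
            "metadata": {"name": "example-cephcluster"},
            "status": {"phase": "Ready", "ceph": {"health": "HEALTH_WARN"}},
        },
        {"metadata": {"name": "example-without-status"}},
    ]
}

for row in summarize_cephclusters(example_doc):
    print(row)
```

A CR with no status at all is reported with `<unset>` placeholders rather than raising, since a cluster that never initialized may not have those fields populated.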

Dealing with failed nodes in Kubernetes is a pain in general. Pods will remain Pending and/or Terminated until either the exact node comes back healthy or the admin intervenes. In this case, you probably have to force delete any stuck Pods. We're considering ways to address this, but nothing is available for OCS 4.6.
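For reference, a force delete is usually issued with the `--force --grace-period=0` flags of `oc delete pod`. The sketch below only builds that command line so it can be reviewed before anything is run. The helper name `force_delete_command` and the pod name `example-stuck-pod` are hypothetical; which pods are actually safe to force delete has to be decided per cluster.

```python
def force_delete_command(pod_name, namespace="openshift-storage"):
    """Build the argument list for force deleting one pod; nothing is executed here."""
    return [
        "oc", "delete", "pod", pod_name,
        "-n", namespace,
        "--force", "--grace-period=0",
    ]

# "example-stuck-pod" is a placeholder, not a pod from this report.
cmd = force_delete_command("example-stuck-pod")
print(" ".join(cmd))

# To actually run it, after reviewing the command:
# import subprocess
# subprocess.run(cmd, check=True)
```

Force deletion removes the pod object from the API without waiting for the node to confirm that the containers have stopped, so it is best reserved for pods on nodes that are known to be gone.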

Comment 4 Jose A. Rivera 2021-02-23 16:59:21 UTC

Since this is not a crucial bug, moving to OCS 4.8.

Comment 5 umanga 2021-06-01 08:37:13 UTC

Starting from OCS 4.7 we do not use configmap locks. Operator readiness and waiting for the CephCluster before creating NooBaa are both working as expected.
So this bug doesn't exist anymore.

Is it critical enough to have a 4.6 only fix? If not, we can close this.