Bug 2096910
| Summary: | [GSS][OCS 4.8][External] After controller cluster shutdown/restart, Noobaa object storage is offline | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | khover |
| Component: | Multi-Cloud Object Gateway | Assignee: | Nimrod Becker <nbecker> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Ben Eli <belimele> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.8 | CC: | assingh, dguthrie, etamir, hnallurv, mhackett, nbecker, ocs-bugs, odf-bz-bot |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-07-19 19:43:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
khover
2022-06-14 14:29:03 UTC
Some steps taken so far:
1. Restarted all the pods related to the MCG:
# oc delete pods <noobaa-operator> -n openshift-storage
# oc delete pods <noobaa-core> -n openshift-storage
# oc delete pods <noobaa-endpoint> -n openshift-storage
# oc delete pods <noobaa-db> -n openshift-storage
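To confirm the MCG pods come back healthy after the restarts above, a minimal check could be (assuming the default openshift-storage namespace, the standard app=noobaa pod label, and the default NooBaa CR name "noobaa"):
# oc get pods -n openshift-storage -l app=noobaa
# oc get noobaa noobaa -n openshift-storage -o jsonpath='{.status.phase}'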
2. From the customer:
We determined that the issue (based on my last comment) may have been due to networking issues. Rebooting our nmstate-handler pod seems to have resolved these issues. We were, however, still left with an alert:
NooBaaResourceErrorState, coming from openshift-monitoring. Rebooting noobaa as per your suggestion seems to have resolved that.
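For reference, restarting the nmstate-handler pod as described above could look like the following (a sketch only, assuming the NMState operator's default openshift-nmstate namespace; the pod name is a placeholder):
# oc get pods -n openshift-nmstate | grep nmstate-handler
# oc delete pod <nmstate-handler-pod> -n openshift-nmstate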
3. Since 2022-05-25, the external Ceph cluster configured for storageclass=ocs-external-storagecluster-ceph-rbd has been unreachable from the ODF operators:
2022-05-25T17:54:52.034749102Z 2022-05-25 17:54:52.034660 I | op-mon: parsing mon endpoints: mon04=10.255.116.14:6789,mon05=10.255.116.15:6789,mon01=10.255.116.11:6789,mon02=10.255.116.12:6789,mon03=10.255.116.13:6789
--SNIP--
2022-05-25T17:55:07.570767568Z {"level":"info","ts":1653501307.5707145,"logger":"controllers.StorageCluster","msg":"Reconciling external StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster"}
2022-05-25T17:55:08.797488310Z {"level":"info","ts":1653501308.7973511,"logger":"controllers.StorageCluster","msg":"Waiting for the external ceph cluster to be connected before starting noobaa","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster"}
2022-05-25T17:55:10.396425979Z {"level":"info","ts":1653501310.3963194,"logger":"controllers.StorageCluster","msg":"Status Update Error","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster","StatusUpdateErr":"Could not update storagecluster status"}
2022-05-25T17:55:13.584460672Z 2022-05-25 17:55:13.584305 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1
2022-05-25T17:55:28.849772969Z 2022-05-25 17:55:28.849655 E | ceph-cluster-controller: failed to get ceph daemons versions. failed to run 'ceph versions'. timed out
2022-05-25T17:55:28.849772969Z . : timed out
2022-05-25T17:55:28.849772969Z .
2022-05-25T17:55:58.751895972Z 2022-05-25 17:55:58.751755 W | op-mon: failed to check mon health. failed to get external mon quorum status: mon quorum status failed: timed out
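Given the repeated timeouts above, one way to narrow this down to a network-path problem is to test raw TCP reachability to the mon endpoints parsed earlier (10.255.116.11-15:6789) from inside the cluster, for example via a node debug pod (the node name is a placeholder; repeat for each mon IP):
# oc debug node/<worker-node> -- bash -c 'timeout 5 bash -c "</dev/tcp/10.255.116.11/6789" && echo mon01 reachable || echo mon01 unreachable'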
4. From the customer on 2022-06-06:
Connectivity to ceph seems to be working fine:
$ ceph --id provisioner-moc-rbd-1 df | sed -n -e '/POOL/p' -e '/moc_rbd_1/p'
POOLS:
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
moc_rbd_1 28 64 3.7 TiB 976.51k 3.7 TiB 0.14 906 TiB
Our PVs/PVCs are created successfully using this storageclass.
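Since the RBD (block) path looks healthy while the object path is still failing, the remaining checks would focus on the MCG resources themselves; a minimal sketch, assuming the default openshift-storage namespace and that the noobaa CLI is installed on the workstation:
# oc get backingstore,bucketclass -n openshift-storage
# noobaa status -n openshift-storage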