Bug 2096910 - [GSS][OCS 4.8][External] After controller cluster shutdown/restart, Noobaa object storage is offline
Summary: [GSS][OCS 4.8][External] After controller cluster shutdown/restart, Noobaa object storage is offline
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: Multi-Cloud Object Gateway
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Nimrod Becker
QA Contact: Ben Eli
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-14 14:29 UTC by khover
Modified: 2023-08-09 16:49 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-07-19 19:43:24 UTC
Embargoed:



Description khover 2022-06-14 14:29:03 UTC
Description of problem (please be as detailed as possible and provide log snippets):

After controller cluster shutdown/restart, Noobaa object storage is offline

The external Ceph cluster is upstream version 14.2.12 (2f3caa3b8b3d5c5f2719a1e9d8e7deea5ae1a5c6) nautilus (stable).


Buckets cannot be created, and existing buckets cannot be accessed.
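
A quick way to confirm this symptom from the OpenShift side is to look at the MCG bucket resources; this is only a sketch, and it assumes a default install in the openshift-storage namespace:

# oc get backingstore,bucketclass -n openshift-storage
# oc get objectbucketclaims --all-namespaces

If the default backing store or existing ObjectBucketClaims report a phase other than Ready/Bound, that matches the behaviour described above.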


Version of all relevant components (if applicable):

ocs-operator.v4.8.12

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 2 khover 2022-06-14 14:48:02 UTC
Some steps taken so far:

1. Restart all the pods related to the MCG.

# oc delete pods <noobaa-operator> -n openshift-storage
# oc delete pods <noobaa-core> -n openshift-storage
# oc delete pods <noobaa-endpoint> -n openshift-storage
# oc delete pods <noobaa-db> -n openshift-storage
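
The same restart can usually be done by label selector instead of deleting pods one by one; the app=noobaa label used below is an assumption and may not cover the operator pod, so verify the labels first:

# oc get pods -n openshift-storage --show-labels | grep noobaa
# oc delete pods -n openshift-storage -l app=noobaa

The owning Deployments/StatefulSet recreate the pods automatically.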


2. From the customer:

We determined that the issue (based on my last comment) may have been due to networking issues. Rebooting our nmstate-handler pod seems to have resolved these issues. However, we were still left with an alert:

NooBaaResourceErrorState, coming from openshift-monitoring. Rebooting noobaa as per your suggestion seems to have resolved this.
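
To confirm whether the NooBaaResourceErrorState alert really cleared, the NooBaa CR and the MCG CLI (if installed) can be checked; the CR name noobaa below is the default and is an assumption:

# oc get noobaa -n openshift-storage
# oc describe noobaa noobaa -n openshift-storage | grep -i -A3 phase
# noobaa status -n openshift-storage

A healthy MCG should report the NooBaa phase as Ready.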


3. Since 2022-05-25, the external Ceph cluster configured for storageclass=ocs-external-storagecluster-ceph-rbd has been unreachable by the ODF operators.

2022-05-25T17:54:52.034749102Z 2022-05-25 17:54:52.034660 I | op-mon: parsing mon endpoints: mon04=10.255.116.14:6789,mon05=10.255.116.15:6789,mon01=10.255.116.11:6789,mon02=10.255.116.12:6789,mon03=10.255.116.13:6789
--SNIP--
2022-05-25T17:55:07.570767568Z {"level":"info","ts":1653501307.5707145,"logger":"controllers.StorageCluster","msg":"Reconciling external StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster"}
2022-05-25T17:55:08.797488310Z {"level":"info","ts":1653501308.7973511,"logger":"controllers.StorageCluster","msg":"Waiting for the external ceph cluster to be connected before starting noobaa","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster"}
2022-05-25T17:55:10.396425979Z {"level":"info","ts":1653501310.3963194,"logger":"controllers.StorageCluster","msg":"Status Update Error","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster","StatusUpdateErr":"Could not update storagecluster status"}
2022-05-25T17:55:13.584460672Z 2022-05-25 17:55:13.584305 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1
2022-05-25T17:55:28.849772969Z 2022-05-25 17:55:28.849655 E | ceph-cluster-controller: failed to get ceph daemons versions. failed to run 'ceph versions'. timed out
2022-05-25T17:55:28.849772969Z . : timed out
2022-05-25T17:55:28.849772969Z . 
2022-05-25T17:55:58.751895972Z 2022-05-25 17:55:58.751755 W | op-mon: failed to check mon health. failed to get external mon quorum status: mon quorum status failed: timed out
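
The log above suggests the external mons are unreachable from the cluster. A minimal connectivity check, assuming the mon endpoints are still recorded in the rook-ceph-mon-endpoints ConfigMap and that curl is available in the debug image, would be:

# oc get configmap rook-ceph-mon-endpoints -n openshift-storage -o jsonpath='{.data.data}'
# oc debug node/<any-worker-node> -- curl -v --connect-timeout 5 --max-time 5 telnet://10.255.116.11:6789

If the TCP connection to port 6789 (or 3300 for msgr2) fails from the nodes, the problem is on the network path to the external Ceph cluster rather than inside the ODF operators.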


4. From the customer on 2022-06-06:

Connectivity to Ceph seems to be working fine:


$ ceph --id provisioner-moc-rbd-1 df  | sed -n -e '/POOL/p' -e '/moc_rbd_1/p'
POOLS:
    POOL                         ID     PGS      STORED      OBJECTS     USED        %USED     MAX AVAIL
    moc_rbd_1                    28       64     3.7 TiB     976.51k     3.7 TiB      0.14       906 TiB

Our PVs/PVCs are created successfully using this storage class.
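
Since the ceph CLI works with the provisioner keyring, it is worth comparing what the operators see; a sketch, assuming the default external-mode resource names:

# oc get cephcluster -n openshift-storage
# oc get storagecluster -n openshift-storage
# oc logs deploy/rook-ceph-operator -n openshift-storage | grep -i 'ceph status' | tail -5

In a healthy external-mode deployment the CephCluster should show Connected and the StorageCluster should be Ready; if these still show the timeouts from step 3 above while direct ceph access works, the connectivity problem is likely specific to the OpenShift nodes rather than the Ceph cluster itself.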

