Bug 2096910
| Summary: | [GSS][OCS 4.8][External] After controller cluster shutdown/restart, Noobaa object storage is offline | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | khover |
| Component: | Multi-Cloud Object Gateway | Assignee: | Nimrod Becker <nbecker> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Ben Eli <belimele> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.8 | CC: | assingh, dguthrie, etamir, hnallurv, mhackett, nbecker, ocs-bugs, odf-bz-bot |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-07-19 19:43:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
khover
2022-06-14 14:29:03 UTC
Some steps taken so far:
1. Restarted all the pods related to the MCG:
# oc delete pods <noobaa-operator> -n openshift-storage
# oc delete pods <noobaa-core> -n openshift-storage
# oc delete pods <noobaa-endpoint> -n openshift-storage
# oc delete pods <noobaa-db> -n openshift-storage
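To confirm the MCG pods come back healthy after the restarts above, a minimal check could be (assuming the default openshift-storage namespace, the standard app=noobaa pod label, and the default NooBaa CR name "noobaa"):
# oc get pods -n openshift-storage -l app=noobaa
# oc get noobaa noobaa -n openshift-storage -o jsonpath='{.status.phase}'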
2. From the customer:
We determined that the issue (based on my last comment) may have been due to networking issues. Rebooting our nmstate-handler pod seems to have resolved these issues. We were, however, still left with an alert:
NooBaaResourceErrorState, coming from openshift-monitoring. Rebooting noobaa as per your suggestion seems to have resolved that.
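For reference, restarting the nmstate-handler pod as described above could look like the following (a sketch only, assuming the NMState operator's default openshift-nmstate namespace; the pod name is a placeholder):
# oc get pods -n openshift-nmstate | grep nmstate-handler
# oc delete pod <nmstate-handler-pod> -n openshift-nmstate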
3. Since 2022-05-25, the external Ceph cluster configured for storageclass=ocs-external-storagecluster-ceph-rbd has been unreachable from the ODF operators:
2022-05-25T17:54:52.034749102Z 2022-05-25 17:54:52.034660 I | op-mon: parsing mon endpoints: mon04=10.255.116.14:6789,mon05=10.255.116.15:6789,mon01=10.255.116.11:6789,mon02=10.255.116.12:6789,mon03=10.255.116.13:6789
--SNIP--
2022-05-25T17:55:07.570767568Z {"level":"info","ts":1653501307.5707145,"logger":"controllers.StorageCluster","msg":"Reconciling external StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster"}
2022-05-25T17:55:08.797488310Z {"level":"info","ts":1653501308.7973511,"logger":"controllers.StorageCluster","msg":"Waiting for the external ceph cluster to be connected before starting noobaa","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster"}
2022-05-25T17:55:10.396425979Z {"level":"info","ts":1653501310.3963194,"logger":"controllers.StorageCluster","msg":"Status Update Error","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster","StatusUpdateErr":"Could not update storagecluster status"}
2022-05-25T17:55:13.584460672Z 2022-05-25 17:55:13.584305 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1
2022-05-25T17:55:28.849772969Z 2022-05-25 17:55:28.849655 E | ceph-cluster-controller: failed to get ceph daemons versions. failed to run 'ceph versions'. timed out
2022-05-25T17:55:28.849772969Z . : timed out
2022-05-25T17:55:28.849772969Z .
2022-05-25T17:55:58.751895972Z 2022-05-25 17:55:58.751755 W | op-mon: failed to check mon health. failed to get external mon quorum status: mon quorum status failed: timed out
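Given the repeated timeouts above, one way to narrow this down to a network-path problem is to test raw TCP reachability to the mon endpoints parsed earlier (10.255.116.11-15:6789) from inside the cluster, for example via a node debug pod (the node name is a placeholder; repeat for each mon IP):
# oc debug node/<worker-node> -- bash -c 'timeout 5 bash -c "</dev/tcp/10.255.116.11/6789" && echo mon01 reachable || echo mon01 unreachable'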
4. From the customer on 2022-06-06:
Connectivity to ceph seems to be working fine:
$ ceph --id provisioner-moc-rbd-1 df | sed -n -e '/POOL/p' -e '/moc_rbd_1/p'
POOLS:
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
moc_rbd_1 28 64 3.7 TiB 976.51k 3.7 TiB 0.14 906 TiB
Our PVs/PVCs are created successfully using this storageclass.
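Since the RBD (block) path looks healthy while the object path is still failing, the remaining checks would focus on the MCG resources themselves; a minimal sketch, assuming the default openshift-storage namespace and that the noobaa CLI is installed on the workstation:
# oc get backingstore,bucketclass -n openshift-storage
# noobaa status -n openshift-storage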