Description of problem (please be as detailed as possible and provide log snippets):

After controller cluster shutdown/restart, NooBaa object storage is offline.
Ceph is upstream version 14.2.12 (2f3caa3b8b3d5c5f2719a1e9d8e7deea5ae1a5c6) nautilus (stable).
Buckets cannot be created, and existing buckets cannot be accessed.

Version of all relevant components (if applicable):
ocs-operator.v4.8.12

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
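For reference, a minimal sketch of how the bucket symptoms can be confirmed against the MCG S3 endpoint. The `s3` route and `noobaa-admin` secret names below are assumptions based on a default openshift-storage deployment (adjust to the actual objects in this cluster), and `test-bucket-bz` is just a placeholder bucket name:

$ S3_ENDPOINT="https://$(oc get route s3 -n openshift-storage -o jsonpath='{.spec.host}')"
$ export AWS_ACCESS_KEY_ID=$(oc get secret noobaa-admin -n openshift-storage -o jsonpath='{.data.AWS_ACCESS_KEY_ID}' | base64 -d)
$ export AWS_SECRET_ACCESS_KEY=$(oc get secret noobaa-admin -n openshift-storage -o jsonpath='{.data.AWS_SECRET_ACCESS_KEY}' | base64 -d)
$ aws s3 ls --endpoint-url "$S3_ENDPOINT" --no-verify-ssl                       # list existing buckets
$ aws s3 mb s3://test-bucket-bz --endpoint-url "$S3_ENDPOINT" --no-verify-ssl   # attempt bucket creation

Per the description above, both commands are expected to fail or hang while the issue is present.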
Some steps taken so far:

1. Restarted all the pods related to the MCG (an MCG status check sketch follows this list):

# oc delete pods <noobaa-operator> -n openshift-storage
# oc delete pods <noobaa-core> -n openshift-storage
# oc delete pods <noobaa-endpoint> -n openshift-storage
# oc delete pods <noobaa-db> -n openshift-storage

2. From the customer:

We determined that the issue (based on my last comment) may have been due to networking issues. Rebooting our nmstate-handler pod seems to have resolved them. However, we were still left with a NooBaaResourceErrorState alert coming from openshift-monitoring. Rebooting NooBaa as per your suggestion seems to have resolved this.

3. Since 2022-05-25, the external Ceph cluster configured for storageclass=ocs-external-storagecluster-ceph-rbd has been unreachable by the ODF operators (see the mon connectivity probe after this list):

2022-05-25T17:54:52.034749102Z 2022-05-25 17:54:52.034660 I | op-mon: parsing mon endpoints: mon04=10.255.116.14:6789,mon05=10.255.116.15:6789,mon01=10.255.116.11:6789,mon02=10.255.116.12:6789,mon03=10.255.116.13:6789
--SNIP--
2022-05-25T17:55:07.570767568Z {"level":"info","ts":1653501307.5707145,"logger":"controllers.StorageCluster","msg":"Reconciling external StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster"}
2022-05-25T17:55:08.797488310Z {"level":"info","ts":1653501308.7973511,"logger":"controllers.StorageCluster","msg":"Waiting for the external ceph cluster to be connected before starting noobaa","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster"}
2022-05-25T17:55:10.396425979Z {"level":"info","ts":1653501310.3963194,"logger":"controllers.StorageCluster","msg":"Status Update Error","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster","StatusUpdateErr":"Could not update storagecluster status"}
2022-05-25T17:55:13.584460672Z 2022-05-25 17:55:13.584305 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1
2022-05-25T17:55:28.849772969Z 2022-05-25 17:55:28.849655 E | ceph-cluster-controller: failed to get ceph daemons versions. failed to run 'ceph versions'. timed out
2022-05-25T17:55:28.849772969Z . : timed out
2022-05-25T17:55:28.849772969Z .
2022-05-25T17:55:58.751895972Z 2022-05-25 17:55:58.751755 W | op-mon: failed to check mon health. failed to get external mon quorum status: mon quorum status failed: timed out

4. From the customer on 2022-06-06: connectivity to Ceph seems to be working fine:

$ ceph --id provisioner-moc-rbd-1 df | sed -n -e '/POOL/p' -e '/moc_rbd_1/p'
POOLS:
    POOL          ID   PGS   STORED    OBJECTS   USED      %USED   MAX AVAIL
    moc_rbd_1     28   64    3.7 TiB   976.51k   3.7 TiB    0.14     906 TiB

PVs/PVCs are created successfully using this storageclass.
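Given the mon health-check timeouts in item 3, it may be worth probing the external mon endpoints from inside the rook-ceph-operator pod. The sketch below is an illustration under assumptions, not output from this cluster: the mon IPs come from the op-mon log line above, the app=rook-ceph-operator label is the default Rook label, and the bash /dev/tcp probe requires bash to be present in the operator image.

$ OP_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-operator -o jsonpath='{.items[0].metadata.name}')
$ for mon in 10.255.116.11 10.255.116.12 10.255.116.13 10.255.116.14 10.255.116.15; do
>   # the exit status of the /dev/tcp redirect shows whether TCP 6789 is reachable from the operator pod
>   oc exec -n openshift-storage "$OP_POD" -- timeout 5 bash -c "echo > /dev/tcp/$mon/6789" \
>     && echo "$mon:6789 reachable" || echo "$mon:6789 unreachable"
> done

If the external cluster also exposes the msgr2 port, the same probe can be repeated against port 3300.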
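On the MCG side, the condition behind the NooBaaResourceErrorState alert from item 2 can be inspected on the NooBaa CR itself; the noobaa CLI, if available, gives a fuller health report. The CR name "noobaa" and the Conditions:/Events: markers in the describe output are assumptions based on a default deployment:

$ oc get noobaa -n openshift-storage
$ oc describe noobaa noobaa -n openshift-storage | sed -n '/Conditions:/,/Events:/p'   # show only the status conditions
$ noobaa status -n openshift-storage   # optional, requires the noobaa CLI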