Bug 2141915
| Summary: | Storagecluster moves to an error state after ODF installation, although all pods are running in the openshift-storage namespace | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Bhavana <bhavanax.kasula> |
| Component: | rook | Assignee: | Travis Nielsen <tnielsen> |
| Status: | CLOSED NOTABUG | QA Contact: | Martin Bukatovic <mbukatov> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.11 | CC: | ishwaryax.munesh, jrivera, madam, muagarwa, ocs-bugs, odf-bz-bot, sostapov |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | Flags: | tnielsen: needinfo? (bhavanax.kasula); tnielsen: needinfo? (bhavanax.kasula) |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-12-07 15:00:07 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |

Description

Bhavana, 2022-11-11 05:43:37 UTC

I will attach the must-gather logs to the ticket shortly.

Hello Bhavana, the reconciler is waiting for a few storage classes to become ready; the state will resolve itself in a few minutes once all storage classes are ready. We have already fixed this bug in 4.12. Marking this as a duplicate.

*** This bug has been marked as a duplicate of bug 2004027 ***

Please find the Dropbox link for the must-gather logs: https://www.dropbox.com/s/dmvg1cr5nl3wxmt/must-gather.tar.gz?dl=0

Created attachment 1923719 [details]
oc describe storagecluster -n openshift-storage
Created attachment 1923720 [details]
oc get storagecluster -n openshift-storage
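
For reference, a minimal sketch of how the reconcile progress mentioned above can be checked from the CLI; the storage cluster name `ocs-storagecluster` is assumed from the CephCluster name that appears later in this report:

```sh
# Watch the StorageCluster phase; it should move to Ready once the
# reconciler sees all of the expected storage classes.
oc get storagecluster -n openshift-storage

# List the storage classes the reconciler is waiting for
# (the ceph-rbd, cephfs, rgw and noobaa classes in a default install).
oc get storageclass

# The status conditions show the exact reason the phase is not Ready.
oc describe storagecluster ocs-storagecluster -n openshift-storage
```
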
Hello there, Bhavana. You appear to have attached the OCP must-gather. Could you please collect the OCS must-gather as well?

Please find the Dropbox link for the OCS must-gather logs: https://www.dropbox.com/s/dqv5y2d4kub6iy7/ocs-must-gather.tar.gz?dl=0

Hello there, Bhavana. This is still not an OCS must-gather; could you please confirm that you uploaded the correct one?

Created attachment 1923756 [details]
ocs-must-gather-logs
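
For completeness, a minimal sketch of how the OCS/ODF must-gather is typically collected; the image tag below assumes the 4.11 release noted in this report and should be matched to the installed version:

```sh
# Collect ODF-specific logs with the ODF must-gather image
# (distinct from the default OCP must-gather attached earlier).
oc adm must-gather \
  --image=registry.redhat.io/odf4/ocs-must-gather-rhel8:v4.11 \
  --dest-dir=ocs-must-gather
```
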
The rook operator log shows that while the mon pods are running, they are not reaching quorum:

2022-11-11T10:08:17.781446300Z 2022-11-11 10:08:17.781359 I | op-mon: mons running: [c b a]

2022-11-11T10:08:32.914686743Z 2022-11-11 10:08:32.914549 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to create cluster: failed to start ceph monitors: failed to start mon pods: failed to check mon quorum c: failed to wait for mon quorum: exceeded max retry count waiting for monitors to reach quorum

This usually indicates that the network configuration is not allowing the mons to communicate with each other, or that the operator cannot communicate with the mons. Here is an upstream topic to help troubleshoot mon quorum: https://rook.io/docs/rook/latest/Troubleshooting/ceph-common-issues/#monitors-are-the-only-pods-running

Hi Travis, thanks for your inputs. We verified the network configuration to check whether the operator can reach the mons, and the check returned 'ceph v2' as described in the troubleshooting guide (a sketch of this connectivity check is included after the attachment list below). We then performed an ODF cleanup and reinstalled it. This time ODF and the storage cluster installed successfully, and Ceph health is OK with 3 OSDs added as expected.

On another cluster where we saw the storage cluster in the Error phase, we performed the same ODF cleanup and reinstall steps, but the storage cluster is still in the Error phase and Ceph health is WARN because no OSDs were added. We don't see any errors in the mon-a/b/c pods, and 'oc get cephcluster' shows 'Cluster installed successfully'. One observation is that the ceph-rgw pod keeps going into CrashLoopBackOff, and there are three crash collector pods, one for each worker. We have placed the must-gather of this cluster in Dropbox: https://www.dropbox.com/home/scale_out_cluster. Screenshots of our observations are attached. Could you please share your inputs?

Thanks, Ishwarya M

Created attachment 1924283 [details]
cephcluster_storagecluster_status
Created attachment 1924285 [details]
getpods
Created attachment 1924286 [details]
storage cluster describe
Created attachment 1924287 [details]
ceph_health_no_osd
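
As referenced in the comments above, a rough sketch of the mon connectivity check from the Rook troubleshooting guide; the service names and the msgr2 port 3300 follow upstream defaults, and the mon IP is a placeholder to be taken from the service list:

```sh
# Find the mon service endpoints.
oc get svc -n openshift-storage | grep rook-ceph-mon

# Probe a mon endpoint from inside the operator pod; a reachable mon
# answers with a banner containing "ceph v2".
oc rsh -n openshift-storage deploy/rook-ceph-operator \
  curl --max-time 5 <mon-cluster-ip>:3300
```
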
Since there are zero OSDs, the rgw pod won't be able to run, though I would have expected that Rook wouldn't even create the rgw pod if no OSDs were created. I can't seem to access the must-gather in Dropbox; the URL looks like your personal folder rather than a publicly accessible folder. The operator log must show some cause for no OSDs being created.

Hi Travis, here is the link for the must-gather from Dropbox: https://www.dropbox.com/s/50s2bbqjfchy0qm/must-gather.logs.zip?dl=0, and for the OCS must-gather: https://www.dropbox.com/s/e4owrgcw1fwq27w/ocs_must-gather.zip?dl=0. The state of that cluster has since changed: we once again performed a complete cleanup of ODF, including deleting the openshift-local-storage namespace, and reinstalled ODF. This time we did not see any issues; the storage cluster came up in the Ready state and the OSDs were added in Ceph without issues.

On the other OCP 4.11 cluster, after cleaning up and reinstalling ODF, Ceph was healthy and all OSDs were up. But after some time we noticed that Ceph health went to the WARN state with a 'Slow OSD heartbeats on back' error. We are not sure what causes this error. The ODF must-gather logs are placed here: https://www.dropbox.com/s/8cuuqb06chl6b8t/odf_must_gather.tar.gz?dl=0. You should be able to access the must-gather files now.

Thanks, Ishwarya M

Great to hear the cluster is coming up now with OSDs and everything looks healthy. I'm not sure about the warning about slow OSD heartbeats, though I see an upstream topic that points to a slow network or a related configuration issue: https://docs.ceph.com/en/latest/rados/operations/monitoring/#network-performance-checks. Anything else, or shall we close this issue?

Hi Travis, this ticket can be closed.
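
For anyone who hits the same 'Slow OSD heartbeats on back' warning, a minimal sketch of how to inspect it from the Ceph side; this assumes the rook-ceph-tools (toolbox) deployment has been enabled, which is not part of a default ODF install:

```sh
# Open a shell in the Ceph toolbox pod.
oc rsh -n openshift-storage deploy/rook-ceph-tools

# Inside the toolbox: health detail lists the OSD pairs with slow
# back/front network heartbeats and the measured ping times.
ceph health detail
ceph -s
ceph osd tree
```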