Bug 2141915
| Summary: | Storagecluster moves to an error state after ODF installation, although all pods are running in the openshift-storage namespace | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Bhavana <bhavanax.kasula> |
| Component: | rook | Assignee: | Travis Nielsen <tnielsen> |
| Status: | CLOSED NOTABUG | QA Contact: | Martin Bukatovic <mbukatov> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.11 | CC: | ishwaryax.munesh, jrivera, madam, muagarwa, ocs-bugs, odf-bz-bot, sostapov |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | Flags: | tnielsen: needinfo? (bhavanax.kasula); tnielsen: needinfo? (bhavanax.kasula) |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-12-07 15:00:07 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |

Description

Bhavana, 2022-11-11 05:43:37 UTC

I will attach the must-gather logs to the ticket shortly.

Hello Bhavana, the reconciler is waiting for a few storage classes to become ready; the state will resolve itself in a few minutes once all storage classes are ready. We have already fixed this bug in 4.12. Marking this as a duplicate.

*** This bug has been marked as a duplicate of bug 2004027 ***

Please find the Dropbox link for the must-gather logs: https://www.dropbox.com/s/dmvg1cr5nl3wxmt/must-gather.tar.gz?dl=0

Created attachment 1923719 [details]
oc describe storagecluster -n openshift-storage
Created attachment 1923720 [details]
oc get storagecluster -n openshift-storage
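
For reference, a minimal sketch of how the reconcile progress mentioned above can be checked from the CLI; the storage cluster name `ocs-storagecluster` is assumed from the CephCluster name that appears later in this report:

```sh
# Watch the StorageCluster phase; it should move to Ready once the
# reconciler sees all of the expected storage classes.
oc get storagecluster -n openshift-storage

# List the storage classes the reconciler is waiting for
# (the ceph-rbd, cephfs, rgw and noobaa classes in a default install).
oc get storageclass

# The status conditions show the exact reason the phase is not Ready.
oc describe storagecluster ocs-storagecluster -n openshift-storage
```
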
Hello there, Bhavana. You appear to have attached the OCP must-gather. Could you please collect the OCS must-gather as well?

Please find the Dropbox link for the OCS must-gather logs: https://www.dropbox.com/s/dqv5y2d4kub6iy7/ocs-must-gather.tar.gz?dl=0

Hello there, Bhavana. This is still not an OCS must-gather; could you please confirm that you uploaded the correct one?

Created attachment 1923756 [details]
ocs-must-gather-logs
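
For completeness, a minimal sketch of how the OCS/ODF must-gather is typically collected; the image tag below assumes the 4.11 release noted in this report and should be matched to the installed version:

```sh
# Collect ODF-specific logs with the ODF must-gather image
# (distinct from the default OCP must-gather attached earlier).
oc adm must-gather \
  --image=registry.redhat.io/odf4/ocs-must-gather-rhel8:v4.11 \
  --dest-dir=ocs-must-gather
```
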
The rook operator log shows that while the mon pods are running, they are not reaching quorum:

2022-11-11T10:08:17.781446300Z 2022-11-11 10:08:17.781359 I | op-mon: mons running: [c b a]

2022-11-11T10:08:32.914686743Z 2022-11-11 10:08:32.914549 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to create cluster: failed to start ceph monitors: failed to start mon pods: failed to check mon quorum c: failed to wait for mon quorum: exceeded max retry count waiting for monitors to reach quorum

This usually indicates that the network configuration is not allowing the mons to communicate with each other, or that the operator cannot communicate with the mons. Here is an upstream topic to help troubleshoot mon quorum: https://rook.io/docs/rook/latest/Troubleshooting/ceph-common-issues/#monitors-are-the-only-pods-running

Hi Travis, thanks for your inputs. We verified the network configuration to check whether the operator can reach the mons, and the check returned 'ceph v2' as described in the troubleshooting guide (a sketch of this connectivity check is included after the attachment list below). We then performed an ODF cleanup and reinstalled it. This time ODF and the storage cluster installed successfully, and Ceph health is OK with 3 OSDs added as expected.

On another cluster where we saw the storage cluster in the Error phase, we performed the same ODF cleanup and reinstall steps, but the storage cluster is still in the Error phase and Ceph health is WARN because no OSDs were added. We don't see any errors in the mon-a/b/c pods, and 'oc get cephcluster' shows 'Cluster installed successfully'. One observation is that the ceph-rgw pod keeps going into CrashLoopBackOff, and there are three crash collector pods, one for each worker. We have placed the must-gather of this cluster in Dropbox: https://www.dropbox.com/home/scale_out_cluster. Screenshots of our observations are attached. Could you please share your inputs?

Thanks, Ishwarya M

Created attachment 1924283 [details]
cephcluster_storagecluster_status
Created attachment 1924285 [details]
getpods
Created attachment 1924286 [details]
storage cluster describe
Created attachment 1924287 [details]
ceph_health_no_osd
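
As referenced in the comments above, a rough sketch of the mon connectivity check from the Rook troubleshooting guide; the service names and the msgr2 port 3300 follow upstream defaults, and the mon IP is a placeholder to be taken from the service list:

```sh
# Find the mon service endpoints.
oc get svc -n openshift-storage | grep rook-ceph-mon

# Probe a mon endpoint from inside the operator pod; a reachable mon
# answers with a banner containing "ceph v2".
oc rsh -n openshift-storage deploy/rook-ceph-operator \
  curl --max-time 5 <mon-cluster-ip>:3300
```
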
Since there are zero OSDs, the rgw pod won't be able to run, though I would have expected that Rook wouldn't even create the rgw pod if no OSDs were created. I can't seem to access the must-gather in Dropbox; the URL looks like your personal folder rather than a publicly accessible folder. The operator log must show some cause for no OSDs being created.

Hi Travis, here is the link for the must-gather from Dropbox: https://www.dropbox.com/s/50s2bbqjfchy0qm/must-gather.logs.zip?dl=0, and for the OCS must-gather: https://www.dropbox.com/s/e4owrgcw1fwq27w/ocs_must-gather.zip?dl=0. The state of that cluster has since changed: we once again performed a complete cleanup of ODF, including deleting the openshift-local-storage namespace, and reinstalled ODF. This time we did not see any issues; the storage cluster came up in the Ready state and the OSDs were added in Ceph without issues.

On the other OCP 4.11 cluster, after cleaning up and reinstalling ODF, Ceph was healthy and all OSDs were up. But after some time we noticed that Ceph health went to the WARN state with a 'Slow OSD heartbeats on back' error. We are not sure what causes this error. The ODF must-gather logs are placed here: https://www.dropbox.com/s/8cuuqb06chl6b8t/odf_must_gather.tar.gz?dl=0. You should be able to access the must-gather files now.

Thanks, Ishwarya M

Great to hear the cluster is coming up now with OSDs and everything looks healthy. I'm not sure about the warning about slow OSD heartbeats, though I see an upstream topic that points to a slow network or a related configuration issue: https://docs.ceph.com/en/latest/rados/operations/monitoring/#network-performance-checks. Anything else, or shall we close this issue?

Hi Travis, this ticket can be closed.
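
For anyone who hits the same 'Slow OSD heartbeats on back' warning, a minimal sketch of how to inspect it from the Ceph side; this assumes the rook-ceph-tools (toolbox) deployment has been enabled, which is not part of a default ODF install:

```sh
# Open a shell in the Ceph toolbox pod.
oc rsh -n openshift-storage deploy/rook-ceph-tools

# Inside the toolbox: health detail lists the OSD pairs with slow
# back/front network heartbeats and the measured ping times.
ceph health detail
ceph -s
ceph osd tree
```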