Bug 2169631
| Summary: | cephFS MDS crashed and cephFS volumes fail to attach | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | akgunjal <akgunjal> |
| Component: | ceph | Assignee: | Venky Shankar <vshankar> |
| ceph sub component: | CephFS | QA Contact: | Elad <ebenahar> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | bniver, gfarnum, mchangir, mduasope, muagarwa, ocs-bugs, odf-bz-bot, sostapov, vshankar |
| Version: | 4.9 | Flags: | vshankar: needinfo? (akgunjal) |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-03-20 05:25:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Created attachment 1943998 [details]
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5479fbf6jjcjs
Created attachment 1943999 [details]
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7bddf54cxpmn7
Reprinting Cluster State: When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:

ClusterID: 4abe4d20-5a51-457a-ab62-4ad6d974971e
ClusterVersion: Stable at "4.10.47"
ClusterOperators:
clusteroperator/authentication is missing
clusteroperator/cloud-credential is missing
clusteroperator/cluster-autoscaler is missing
clusteroperator/config-operator is missing
clusteroperator/etcd is missing
clusteroperator/insights is missing
clusteroperator/machine-api is missing
clusteroperator/machine-approver is missing
clusteroperator/machine-config is missing

Created attachment 1944000 [details]
ceph_status
Hi Akash,

The logs uploaded in c#2 and c#3 barely have anything to debug. You mention that the MDS crashed; however, I do not see the crash. The `ceph status' output in c#5 shows "1/1 daemons up" in the mds section, which implies that the ceph-mds daemon is up and active. The reason for the cluster warning is that there are no standby MDS daemons (see "insufficient standby MDS daemons available" in the warning).

Please share the must-gather logs, as those are essential for debugging the MDS going into CLBO.

Cheers,
Venky
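For context, the MDS state and the standby warning described above can be checked directly from a Ceph shell; a minimal sketch, assuming the ODF toolbox pod is enabled (the deployment name, namespace, and filesystem name are the ODF defaults and may differ on this cluster):

```
# Open a shell in the rook-ceph toolbox (deployment name/namespace may differ)
oc -n openshift-storage rsh deploy/rook-ceph-tools

# Health detail surfaces warnings such as FS_DEGRADED and
# "insufficient standby MDS daemons available"
ceph health detail

# Per-filesystem view of active MDS ranks and standby daemons
ceph fs status ocs-storagecluster-cephfilesystem

# Full MDS map, including standby_count_wanted
ceph fs dump
```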
Created attachment 1944008 [details]
must-gather-aa
Created attachment 1944009 [details]
must-gather-ab
Created attachment 1944010 [details]
must-gather-ac
Created attachment 1944011 [details]
must-gather-ad
Created attachment 1944013 [details]
must-gather-ae
Created attachment 1944015 [details]
must-gather-af
I have attached the ODF must-gather logs by splitting them into smaller files; the original zip can be recreated after downloading them. I followed https://access.redhat.com/solutions/4109 to split the archive, and the same article describes how to reassemble it.

Please download all files with the prefix must-gather-a* from the attachments and recreate the archive using the command "cat must-gather-a* > odflogs.zip".

Hi Akash,

(In reply to akgunjal.com from comment #13)
> I have attached the ODF must-gather logs by splitting them into smaller
> files; the original zip can be recreated after downloading them. I followed
> https://access.redhat.com/solutions/4109 to split the archive, and the same
> article describes how to reassemble it.
>
> Please download all files with the prefix must-gather-a* from the
> attachments and recreate the archive using the command
> "cat must-gather-a* > odflogs.zip".

I cannot find the MDS crash dump. Could you point me to one? Normally, you'd find one via `ceph crash ls', and that is part of the must-gather logs, which isn't present in the logs you shared.

Undoing private comment since Akash is unable to view those.

Although there are a few thousand ceph-mon oom-kill entries, I couldn't find even a single ceph-mds oom-kill in /var/log/messages. The following oom-kill instances were found in 0470-messages.tar.gz/var/log/messages:

(ceph-mon): 966
(ceph): 1306
(python3): 68
(ibmcloud-storag): 7

No stack backtrace was found in the MDS logs either.

Akash,

Engineering would require the MDS log with the crash backtrace to debug the issue. Please put needinfo on me when the relevant sosreports are available in supportshell.

Cheers,
Venky
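For reference, the crash-dump lookup referred to above can be run from any Ceph shell on the cluster; a minimal sketch, where the crash ID argument is a placeholder:

```
# List all crashes recorded by the cluster's crash module
ceph crash ls

# Show metadata and the backtrace for a specific crash (ID is a placeholder)
ceph crash info <crash-id>
```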
@vshankar : The MDS pod logs of both pods are already attached in this BZ, and sosreports are available in case https://access.redhat.com/support/cases/#/case/03419217. So I believe all sosreports and must-gather logs are present now. If any further logs are needed, please post the relevant commands or docs describing which logs are required.

Hi Akash,

(In reply to akgunjal.com from comment #20)
> @vshankar : The MDS pod logs of both pods are already attached in this BZ,
> and sosreports are available in case
> https://access.redhat.com/support/cases/#/case/03419217. So I believe all
> sosreports and must-gather logs are present now. If any further logs are
> needed, please post the relevant commands or docs describing which logs are
> required.

Milind has gone through the logs and those do not have any MDS crash backtraces. I do not understand what there is to debug.

@vshankar : We have now used a community tool to fetch the requested core dump and posted it in the case here: https://access.redhat.com/support/cases/#/case/03419217. Please check if this helps.

(In reply to akgunjal.com from comment #22)
> @vshankar : We have now used a community tool to fetch the requested core
> dump and posted it in the case here:
> https://access.redhat.com/support/cases/#/case/03419217. Please check if
> this helps.

The core files are not readable; however, they did give a hint on where the crash seems to be happening - ms_dispatch in the call stack.

Using that to grep the logs, file ./0470-messages.tar.gz/var/log/messages has:

```
Jan 25 16:11:11 kube-cao8j9ad0a5sr77e1qd0-keybanksbx-sandbox-00002151 kernel: msgr-worker-0[40715]: segfault at 7fd87078fff8 ip 000055da2c58ee69 sp 00007fd870790000 error 6 in ceph-mds[55da2c3ed000+718000]
Jan 25 16:11:11 kube-cao8j9ad0a5sr77e1qd0-keybanksbx-sandbox-00002151 kernel: Code: 0f 84 88 fc ff ff 41 0f b6 54 05 00 88 14 07 48 83 c0 01 48 39 c6 75 ee e9 71 fc ff ff 0f 1f 40 00 48 89 ea 4c 89 ee 4c 89 c7 <e8> e2 a8 fb ff 4c 8b 73 40 48 03 6b 48 48 89 6b 48 4d 8d 2c 2e e9
```

This is the same as the crash in BZ2164385#c50. This BZ should be marked as a duplicate of that one, as the discussion is continuing there.

@vshankar : I don't have access to https://bugzilla.redhat.com/show_bug.cgi?id=2164385#c50. Can you please provide me access?

(In reply to akgunjal.com from comment #24)
> @vshankar : I don't have access to
> https://bugzilla.redhat.com/show_bug.cgi?id=2164385#c50.
> Can you please provide me access?

You'd need to contact Red Hat BZ support/admin for this.

Akash, please close this BZ since it is a duplicate, as mentioned in c#24.
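For reference, the log search described in the comments above (tallying oom-kills and locating the ceph-mds segfault) can be reproduced against the extracted node log with standard tools; a minimal sketch, where the path and grep patterns are approximations:

```
# Count oom-kill entries that mention each daemon (patterns are approximate)
grep oom-kill var/log/messages | grep -c ceph-mon
grep oom-kill var/log/messages | grep -c ceph-mds

# Locate the kernel segfault report pointing at ceph-mds
grep segfault var/log/messages | grep ceph-mds
```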
Created attachment 1943997 [details]
ceph_health_detail

Description of problem (please be detailed as possible and provide log snippets):
We have deployed ODF 4.9 on OCP 4.10. I am getting an FS_DEGRADED error and the rook-ceph-mds-ocs-storagecluster-cephfilesystem-xxx pods are in CrashLoopBackOff state.

Version of all relevant components (if applicable):
ODF 4.9

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes. We have CephFS-based volumes created and cannot attach them to pods in the cluster due to this issue.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy ODF 4.9 on an OCP 4.10 cluster.
2. ODF is healthy; create a CephFS-based PVC and read/write data.
3. The ODF CephFS MDS pods go into a crashed state and all CephFS-based PVCs fail to attach to pods.

Actual results:
Failure to attach the CephFS volumes to pods.

Expected results:
The CephFS volumes should be accessible and mount to pods.

Additional info:
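As an aside, when the MDS pods are in CrashLoopBackOff as described above, the backtrace requested earlier in this thread can often be captured from the previous container instance; a minimal sketch, assuming the default openshift-storage namespace and the MDS pod name from the attachments (the label selector is the usual Rook convention and may differ by version):

```
# List the CephFS MDS pods (label is the usual Rook convention for MDS daemons)
oc -n openshift-storage get pods -l app=rook-ceph-mds

# Dump logs from the previously crashed container instance of an MDS pod
oc -n openshift-storage logs --previous rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5479fbf6jjcjs

# Inspect restart counts and the last termination reason
oc -n openshift-storage describe pod rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5479fbf6jjcjs
```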