Bug 2169631 - cephFS MDS crashed and cephFS volumes fail to attach [NEEDINFO]
Summary: cephFS MDS crashed and cephFS volumes fail to attach
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.9
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Venky Shankar
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-02-14 06:48 UTC by akgunjal@in.ibm.com
Modified: 2023-08-09 16:37 UTC
9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-03-20 05:25:29 UTC
Embargoed:
vshankar: needinfo? (akgunjal)


Attachments
ceph_health_detail (4.24 KB, text/plain) - 2023-02-14 06:48 UTC, akgunjal@in.ibm.com
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5479fbf6jjcjs (88.74 KB, text/plain) - 2023-02-14 06:49 UTC, akgunjal@in.ibm.com
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7bddf54cxpmn7 (91.82 KB, text/plain) - 2023-02-14 06:50 UTC, akgunjal@in.ibm.com
ceph_status (683 bytes, text/plain) - 2023-02-14 06:54 UTC, akgunjal@in.ibm.com
must-gather-aa (18.00 MB, application/zip) - 2023-02-14 07:21 UTC, akgunjal@in.ibm.com
must-gather-ab (18.00 MB, application/octet-stream) - 2023-02-14 07:27 UTC, akgunjal@in.ibm.com
must-gather-ac (18.00 MB, application/octet-stream) - 2023-02-14 07:31 UTC, akgunjal@in.ibm.com
must-gather-ad (18.00 MB, application/octet-stream) - 2023-02-14 07:35 UTC, akgunjal@in.ibm.com
must-gather-ae (18.00 MB, application/octet-stream) - 2023-02-14 07:41 UTC, akgunjal@in.ibm.com
must-gather-af (12.47 MB, application/octet-stream) - 2023-02-14 07:44 UTC, akgunjal@in.ibm.com

Description akgunjal@in.ibm.com 2023-02-14 06:48:54 UTC
Created attachment 1943997 [details]
ceph_health_detail

Description of problem (please be as detailed as possible and provide log
snippets):
We have deployed ODF 4.9 on OCP 4.10. I am getting an FS_DEGRADED error, and the rook-ceph-mds-ocs-storagecluster-cephfilesystem-xxx pods are in CrashLoopBackOff state.

Version of all relevant components (if applicable):
ODF 4.9

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes. We have CephFS-based volumes created and cannot attach them to pods in the cluster due to this issue.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy ODF 4.9 on an OCP 4.10 cluster.
2. ODF is healthy; CephFS-based PVCs are created and data can be read and written.
3. The ODF CephFS MDS pods crash and all CephFS-based PVCs fail to attach to pods.


Actual results:
The CephFS volumes fail to attach to pods.

Expected results:
The CephFS volumes should be accessible and mountable in pods.

Additional info:

Comment 2 akgunjal@in.ibm.com 2023-02-14 06:49:45 UTC
Created attachment 1943998 [details]
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5479fbf6jjcjs

Comment 3 akgunjal@in.ibm.com 2023-02-14 06:50:07 UTC
Created attachment 1943999 [details]
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7bddf54cxpmn7

Comment 4 akgunjal@in.ibm.com 2023-02-14 06:50:50 UTC
Reprinting Cluster State:
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 4abe4d20-5a51-457a-ab62-4ad6d974971e
ClusterVersion: Stable at "4.10.47"
ClusterOperators:
	clusteroperator/authentication is missing
	clusteroperator/cloud-credential is missing
	clusteroperator/cluster-autoscaler is missing
	clusteroperator/config-operator is missing
	clusteroperator/etcd is missing
	clusteroperator/insights is missing
	clusteroperator/machine-api is missing
	clusteroperator/machine-approver is missing
	clusteroperator/machine-config is missing

Comment 5 akgunjal@in.ibm.com 2023-02-14 06:54:22 UTC
Created attachment 1944000 [details]
ceph_status

Comment 6 Venky Shankar 2023-02-14 07:03:15 UTC
Hi Akash,

The logs uploaded in c#2 and c#3 barely have anything to debug. You mention that the MDS crashed; however, I do not see the crash. The `ceph status' command in c#5 shows "1/1 daemons up" in the mds section, which implies that the ceph-mds daemon is up and active. The reason for the cluster warning is that there are no standby MDS daemons (see "insufficient standby MDS daemons available" in the warning).

Please share the must-gather logs as that's essential to debug the MDS going into CLBO.

Cheers,
Venky
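
For reference, a minimal sketch of commands that can confirm the MDS state and the standby shortfall described above (run from the rook-ceph toolbox pod; the filesystem name is inferred from the MDS pod names and is an assumption):

```
# Show active MDS ranks and any standby daemons for the filesystem
ceph fs status

# Show health detail, including FS_DEGRADED / "insufficient standby MDS daemons available"
ceph health detail

# Check how many standbys the filesystem wants (filesystem name inferred from the MDS pod names)
ceph fs get ocs-storagecluster-cephfilesystem | grep standby_count_wanted
```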

Comment 7 akgunjal@in.ibm.com 2023-02-14 07:21:36 UTC
Created attachment 1944008 [details]
must-gather-aa

Comment 8 akgunjal@in.ibm.com 2023-02-14 07:27:39 UTC
Created attachment 1944009 [details]
must-gather-ab

Comment 9 akgunjal@in.ibm.com 2023-02-14 07:31:35 UTC
Created attachment 1944010 [details]
must-gather-ac

Comment 10 akgunjal@in.ibm.com 2023-02-14 07:35:59 UTC
Created attachment 1944011 [details]
must-gather-ad

Comment 11 akgunjal@in.ibm.com 2023-02-14 07:41:17 UTC
Created attachment 1944013 [details]
must-gather-ae

Comment 12 akgunjal@in.ibm.com 2023-02-14 07:44:01 UTC
Created attachment 1944015 [details]
must-gather-af

Comment 13 akgunjal@in.ibm.com 2023-02-14 07:47:54 UTC
I have attached the ODF must-gather logs, split into smaller files. They can be downloaded and the zip file recreated. I followed this link, https://access.redhat.com/solutions/4109, to split the archive; recreating it can be done using the same link.

Please download all files with the prefix must-gather-a* from the attachments and recreate the archive using the command "cat must-gather-a* > odflogs.zip".
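
For reference, a minimal sketch of the split/reassembly flow described above, assuming the approach from the linked article (the split invocation and chunk size are assumptions; the cat command is the one given in this comment):

```
# Split the must-gather archive into ~18 MB pieces for attachment (assumed invocation)
split -b 18M odflogs.zip must-gather-

# After downloading all must-gather-a* pieces, reassemble and verify the archive
cat must-gather-a* > odflogs.zip
unzip -t odflogs.zip
```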

Comment 14 Venky Shankar 2023-02-14 10:06:46 UTC
Hi Akash,

(In reply to akgunjal@in.ibm.com from comment #13)
> I have attached the ODF must-gather logs, split into smaller files. They
> can be downloaded and the zip file recreated. I followed this link,
> https://access.redhat.com/solutions/4109, to split the archive; recreating
> it can be done using the same link.
> 
> Please download all files with the prefix must-gather-a* from the
> attachments and recreate the archive using the command
> "cat must-gather-a* > odflogs.zip".

I cannot find the MDS crash dump. Could you point me to one? Normally, you'd find one via `ceph crash ls', and that output is part of the must-gather logs, but it isn't present in the logs you shared.
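
For reference, a minimal sketch of how a crash dump is normally located (run from the rook-ceph toolbox pod; <crash-id> is a placeholder):

```
# List crashes recorded by the cluster's crash module
ceph crash ls

# Show the metadata and backtrace for a specific crash
ceph crash info <crash-id>
```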

Comment 15 Venky Shankar 2023-02-14 13:34:57 UTC
Undoing private comment since Akash is unable to view those.

Comment 18 Milind Changire 2023-02-20 08:48:29 UTC
Although there are a few thousand ceph-mon oom-kill log entries, I couldn't find even a single ceph-mds oom-kill in /var/log/messages.

Following are the oom-kill instances found in 0470-messages.tar.gz/var/log/messages
(ceph-mon):966
(ceph):1306
(python3):68
(ibmcloud-storag):7

No stack backtrace was found in the MDS logs either.
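
A minimal sketch of how such per-process oom-kill counts can be gathered from the extracted messages file (the kernel message format matched here is an assumption):

```
# Count oom-kill victims per process name in var/log/messages (assumed message format)
grep 'Out of memory: Killed process' var/log/messages \
  | grep -oE '\([a-zA-Z0-9_-]+\)' | sort | uniq -c | sort -rn
```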

Comment 19 Venky Shankar 2023-02-21 02:58:55 UTC
Akash,

Engineering would require the MDS log with the crash backtrace to debug the issue. Please put needinfo on me when the relevant sosreports are available in supportshell.

Cheers,
Venky

Comment 20 akgunjal@in.ibm.com 2023-02-21 10:06:34 UTC
@vshankar: The MDS pod logs of both pods are already attached to this BZ, and sosreports are available in case https://access.redhat.com/support/cases/#/case/03419217
So I guess all sosreports and must-gather logs are present now. If any further logs are needed, please post the relevant commands or docs describing which logs are needed.

Comment 21 Venky Shankar 2023-02-22 02:28:25 UTC
Hi Akash,

(In reply to akgunjal@in.ibm.com from comment #20)
> @vshankar: The MDS pod logs of both pods are already attached to this BZ,
> and sosreports are available in case
> https://access.redhat.com/support/cases/#/case/03419217
> So I guess all sosreports and must-gather logs are present now. If any
> further logs are needed, please post the relevant commands or docs
> describing which logs are needed.

Milind has gone through the logs and they do not contain any MDS crash backtraces. Without a backtrace, I do not see what there is to debug.

Comment 22 akgunjal@in.ibm.com 2023-02-22 04:07:53 UTC
@vshankar: We have now used a community tool to fetch the requested core dump and posted it in the case here: https://access.redhat.com/support/cases/#/case/03419217
Please check if this helps.

Comment 23 Venky Shankar 2023-02-22 05:09:28 UTC
(In reply to akgunjal@in.ibm.com from comment #22)
> @vshankar: We have now used a community tool to fetch the requested core
> dump and posted it in the case here:
> https://access.redhat.com/support/cases/#/case/03419217
> Please check if this helps.

The core files are not readable; however, they did give a hint as to where the crash seems to be happening: ms_dispatch in the call stack. Using that to grep the logs, the file ./0470-messages.tar.gz/var/log/messages has:

```
Jan 25 16:11:11 kube-cao8j9ad0a5sr77e1qd0-keybanksbx-sandbox-00002151 kernel: msgr-worker-0[40715]: segfault at 7fd87078fff8 ip 000055da2c58ee69 sp 00007fd870790000 error 6 in ceph-mds[55da2c3ed000+718000]
Jan 25 16:11:11 kube-cao8j9ad0a5sr77e1qd0-keybanksbx-sandbox-00002151 kernel: Code: 0f 84 88 fc ff ff 41 0f b6 54 05 00 88 14 07 48 83 c0 01 48 39 c6 75 ee e9 71 fc ff ff 0f 1f 40 00 48 89 ea 4c 89 ee 4c 89 c7 <e8> e2 a8 fb ff 4c 8b 73 40 48 03 6b 48 48 89 6b 48 4d 8d 2c 2e e9
```

This is the same as the crash in BZ 2164385#c50.

This BZ should be marked as a duplicate of that one, as the discussion is happening there.
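
For reference, a minimal sketch of the kind of search used to locate the segfault above (the file path is the one referenced in this comment; the patterns are assumptions):

```
# Look for ceph-mds segfaults and ms_dispatch references in the node's messages file
grep -E 'segfault.*ceph-mds|ms_dispatch' ./0470-messages.tar.gz/var/log/messages
```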

Comment 24 akgunjal@in.ibm.com 2023-02-22 06:31:59 UTC
@vshankar: I don't have access to https://bugzilla.redhat.com/show_bug.cgi?id=2164385#c50
Can you please provide me access?

Comment 25 Venky Shankar 2023-02-22 07:01:58 UTC
(In reply to akgunjal@in.ibm.com from comment #24)
> @vshankar: I don't have access to
> https://bugzilla.redhat.com/show_bug.cgi?id=2164385#c50
> Can you please provide me access?

You'd need to contact Red Hat BZ support/admin for this.

Comment 26 Venky Shankar 2023-03-02 05:31:21 UTC
Akash,

Please close this BZ since it's a duplicate, as mentioned in c#23?

