Bug 2169631
| Summary: | cephFS MDS crashed and cephFS volumes fail to attach | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | akgunjal <akgunjal> |
| Component: | ceph | Assignee: | Venky Shankar <vshankar> |
| ceph sub component: | CephFS | QA Contact: | Elad <ebenahar> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | bniver, gfarnum, mchangir, mduasope, muagarwa, ocs-bugs, odf-bz-bot, sostapov, vshankar |
| Version: | 4.9 | Flags: | vshankar: needinfo? (akgunjal) |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-03-20 05:25:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Created attachment 1943998 [details]
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5479fbf6jjcjs
Created attachment 1943999 [details]
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7bddf54cxpmn7
Reprinting Cluster State: When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:

ClusterID: 4abe4d20-5a51-457a-ab62-4ad6d974971e
ClusterVersion: Stable at "4.10.47"
ClusterOperators:
clusteroperator/authentication is missing
clusteroperator/cloud-credential is missing
clusteroperator/cluster-autoscaler is missing
clusteroperator/config-operator is missing
clusteroperator/etcd is missing
clusteroperator/insights is missing
clusteroperator/machine-api is missing
clusteroperator/machine-approver is missing
clusteroperator/machine-config is missing

Created attachment 1944000 [details]
ceph_status
Hi Akash,

The logs uploaded in c#2 and c#3 barely have anything to debug. You mention that the MDS crashed; however, I do not see the crash. The `ceph status' output in c#5 shows "1/1 daemons up" in the mds section, which implies that the ceph-mds daemon is up and active. The reason for the cluster warning is that there are no standby MDS daemons (see "insufficient standby MDS daemons available" in the warning).

Please share the must-gather logs, as those are essential for debugging the MDS going into CLBO.

Cheers,
Venky
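For context, the MDS state and the standby warning described above can be checked directly from a Ceph shell; a minimal sketch, assuming the ODF toolbox pod is enabled (the deployment name, namespace, and filesystem name are the ODF defaults and may differ on this cluster):

```
# Open a shell in the rook-ceph toolbox (deployment name/namespace may differ)
oc -n openshift-storage rsh deploy/rook-ceph-tools

# Health detail surfaces warnings such as FS_DEGRADED and
# "insufficient standby MDS daemons available"
ceph health detail

# Per-filesystem view of active MDS ranks and standby daemons
ceph fs status ocs-storagecluster-cephfilesystem

# Full MDS map, including standby_count_wanted
ceph fs dump
```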
Created attachment 1944008 [details]
must-gather-aa
Created attachment 1944009 [details]
must-gather-ab
Created attachment 1944010 [details]
must-gather-ac
Created attachment 1944011 [details]
must-gather-ad
Created attachment 1944013 [details]
must-gather-ae
Created attachment 1944015 [details]
must-gather-af
I have attached the ODF must-gather logs by splitting them into smaller files; the original zip can be recreated after downloading them. I followed https://access.redhat.com/solutions/4109 to split the archive, and the same article describes how to reassemble it.

Please download all files with the prefix must-gather-a* from the attachments and recreate the archive using the command "cat must-gather-a* > odflogs.zip".

Hi Akash,

(In reply to akgunjal.com from comment #13)
> I have attached the ODF must-gather logs by splitting them into smaller
> files; the original zip can be recreated after downloading them. I followed
> https://access.redhat.com/solutions/4109 to split the archive, and the same
> article describes how to reassemble it.
>
> Please download all files with the prefix must-gather-a* from the
> attachments and recreate the archive using the command
> "cat must-gather-a* > odflogs.zip".

I cannot find the MDS crash dump. Could you point me to one? Normally, you'd find one via `ceph crash ls', and that is part of the must-gather logs, which isn't present in the logs you shared.

Undoing private comment since Akash is unable to view those.

Although there are a few thousand ceph-mon oom-kill entries, I couldn't find even a single ceph-mds oom-kill in /var/log/messages. The following oom-kill instances were found in 0470-messages.tar.gz/var/log/messages:

(ceph-mon): 966
(ceph): 1306
(python3): 68
(ibmcloud-storag): 7

No stack backtrace was found in the MDS logs either.

Akash,

Engineering would require the MDS log with the crash backtrace to debug the issue. Please put needinfo on me when the relevant sosreports are available in supportshell.

Cheers,
Venky
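For reference, the crash-dump lookup referred to above can be run from any Ceph shell on the cluster; a minimal sketch, where the crash ID argument is a placeholder:

```
# List all crashes recorded by the cluster's crash module
ceph crash ls

# Show metadata and the backtrace for a specific crash (ID is a placeholder)
ceph crash info <crash-id>
```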
@vshankar : The MDS pod logs of both pods are already attached in this BZ, and sosreports are available in case https://access.redhat.com/support/cases/#/case/03419217. So I believe all sosreports and must-gather logs are present now. If any further logs are needed, please post the relevant commands or docs describing which logs are required.

Hi Akash,

(In reply to akgunjal.com from comment #20)
> @vshankar : The MDS pod logs of both pods are already attached in this BZ,
> and sosreports are available in case
> https://access.redhat.com/support/cases/#/case/03419217. So I believe all
> sosreports and must-gather logs are present now. If any further logs are
> needed, please post the relevant commands or docs describing which logs are
> required.

Milind has gone through the logs and those do not have any MDS crash backtraces. I do not understand what there is to debug.

@vshankar : We have now used a community tool to fetch the requested core dump and posted it in the case here: https://access.redhat.com/support/cases/#/case/03419217. Please check if this helps.

(In reply to akgunjal.com from comment #22)
> @vshankar : We have now used a community tool to fetch the requested core
> dump and posted it in the case here:
> https://access.redhat.com/support/cases/#/case/03419217. Please check if
> this helps.

The core files are not readable; however, they did give a hint on where the crash seems to be happening - ms_dispatch in the call stack.

Using that to grep the logs, file ./0470-messages.tar.gz/var/log/messages has:

```
Jan 25 16:11:11 kube-cao8j9ad0a5sr77e1qd0-keybanksbx-sandbox-00002151 kernel: msgr-worker-0[40715]: segfault at 7fd87078fff8 ip 000055da2c58ee69 sp 00007fd870790000 error 6 in ceph-mds[55da2c3ed000+718000]
Jan 25 16:11:11 kube-cao8j9ad0a5sr77e1qd0-keybanksbx-sandbox-00002151 kernel: Code: 0f 84 88 fc ff ff 41 0f b6 54 05 00 88 14 07 48 83 c0 01 48 39 c6 75 ee e9 71 fc ff ff 0f 1f 40 00 48 89 ea 4c 89 ee 4c 89 c7 <e8> e2 a8 fb ff 4c 8b 73 40 48 03 6b 48 48 89 6b 48 4d 8d 2c 2e e9
```

This is the same as the crash in BZ2164385#c50. This BZ should be marked as a duplicate of that one, as the discussion is continuing there.

@vshankar : I don't have access to https://bugzilla.redhat.com/show_bug.cgi?id=2164385#c50. Can you please provide me access?

(In reply to akgunjal.com from comment #24)
> @vshankar : I don't have access to
> https://bugzilla.redhat.com/show_bug.cgi?id=2164385#c50.
> Can you please provide me access?

You'd need to contact Red Hat BZ support/admin for this.

Akash, please close this BZ since it is a duplicate, as mentioned in c#24.
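For reference, the log search described in the comments above (tallying oom-kills and locating the ceph-mds segfault) can be reproduced against the extracted node log with standard tools; a minimal sketch, where the path and grep patterns are approximations:

```
# Count oom-kill entries that mention each daemon (patterns are approximate)
grep oom-kill var/log/messages | grep -c ceph-mon
grep oom-kill var/log/messages | grep -c ceph-mds

# Locate the kernel segfault report pointing at ceph-mds
grep segfault var/log/messages | grep ceph-mds
```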
Created attachment 1943997 [details]
ceph_health_detail

Description of problem (please be detailed as possible and provide log snippets):
We have deployed ODF 4.9 on OCP 4.10. I am getting an FS_DEGRADED error and the rook-ceph-mds-ocs-storagecluster-cephfilesystem-xxx pods are in CrashLoopBackOff state.

Version of all relevant components (if applicable):
ODF 4.9

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes. We have CephFS-based volumes created and cannot attach them to pods in the cluster due to this issue.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy ODF 4.9 on an OCP 4.10 cluster.
2. ODF is healthy; create a CephFS-based PVC and read/write data.
3. The ODF CephFS MDS pods go into a crashed state and all CephFS-based PVCs fail to attach to pods.

Actual results:
Failure to attach the CephFS volumes to pods.

Expected results:
The CephFS volumes should be accessible and mount to pods.

Additional info:
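As an aside, when the MDS pods are in CrashLoopBackOff as described above, the backtrace requested earlier in this thread can often be captured from the previous container instance; a minimal sketch, assuming the default openshift-storage namespace and the MDS pod name from the attachments (the label selector is the usual Rook convention and may differ by version):

```
# List the CephFS MDS pods (label is the usual Rook convention for MDS daemons)
oc -n openshift-storage get pods -l app=rook-ceph-mds

# Dump logs from the previously crashed container instance of an MDS pod
oc -n openshift-storage logs --previous rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5479fbf6jjcjs

# Inspect restart counts and the last termination reason
oc -n openshift-storage describe pod rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5479fbf6jjcjs
```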