Bug 2141720

Summary: [cephfs] MDS daemon crashes repeatedly after upgrade of ODF from 4.8 to 4.9
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Matthew Secaur <msecaur>
Component: ceph
ceph sub component: CephFS
Assignee: Venky Shankar <vshankar>
QA Contact: Elad <ebenahar>
Status: CLOSED INSUFFICIENT_DATA
Docs Contact:
Severity: high
Priority: unspecified
CC: bniver, hnallurv, hyelloji, muagarwa, ocs-bugs, odf-bz-bot, vshankar
Version: 4.9
Flags: vshankar: needinfo? (msecaur)
       vshankar: needinfo? (msecaur)
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-04-10 07:03:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Matthew Secaur 2022-11-10 15:46:22 UTC
Description of problem (please be as detailed as possible and provide log snippets):
The customer upgraded OpenShift from 4.8.36 to 4.9 and also upgraded ODF. After the upgrade, the MDS daemons began crashing repeatedly (around three times per day). After trying many things without success, the customer upgraded again from 4.9 to 4.10. The MDS daemons continue to crash.

The MDS pods show restarts, but they always return to the "Running" state (i.e., they are never in CrashLoopBackOff or any other error state).
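
For reference, the restart counts can be checked from the OpenShift side; a minimal sketch, assuming the default openshift-storage namespace and the app=rook-ceph-mds label that Rook applies to the MDS pods:

$ oc get pods -n openshift-storage -l app=rook-ceph-mds
$ oc describe pod -n openshift-storage <mds-pod-name> | grep -A 8 'Last State'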

The error reported by Ceph Crash (now on OCP4.10):

sh-4.4$ ceph crash info 2022-11-09T22:08:40.143859Z_4c04a689-d1d7-457c-97c5-2d579518da57
{
    "assert_condition": "g_conf()->mds_wipe_sessions",
    "assert_file": "/builddir/build/BUILD/ceph-16.2.7/src/mds/journal.cc",
    "assert_func": "void EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)",
    "assert_line": 1618,
    "assert_msg": "/builddir/build/BUILD/ceph-16.2.7/src/mds/journal.cc: In function 'void EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)' thread 7f90c0b25700 time 2022-11-09T22:08:40.137925+0000\n/builddir/build/BUILD/ceph-16.2.7/src/mds/journal.cc: 1618: FAILED ceph_assert(g_conf()->mds_wipe_sessions)\n",
    "assert_thread_name": "md_log_replay",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12ce0) [0x7f90cfb3bce0]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f90d0b4dd4f]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x276f18) [0x7f90d0b4df18]",
        "(EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5ae5) [0x562541693065]",
        "(EUpdate::replay(MDSRank*)+0x40) [0x562541694a80]",
        "(MDLog::_replay_thread()+0xcd1) [0x56254161adb1]",
        "(MDLog::ReplayThread::entry()+0x11) [0x56254131c941]",
        "/lib64/libpthread.so.0(+0x81cf) [0x7f90cfb311cf]",
        "clone()"
    ],
    "ceph_version": "16.2.7-126.el8cp",
    "crash_id": "2022-11-09T22:08:40.143859Z_4c04a689-d1d7-457c-97c5-2d579518da57",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "8.6 (Ootpa)",
    "os_version_id": "8.6",
    "process_name": "ceph-mds",
    "stack_sig": "52ebd581300a13e6933b6db0f2b6a61d1132bb285ec25dbeb28e31658f657a01",
    "timestamp": "2022-11-09T22:08:40.143859Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6597794f54496",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-305.62.1.el8_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Thu Aug 11 12:07:27 EDT 2022"
}
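
The failed assert is on the mds_wipe_sessions config option, which defaults to false. A quick way to confirm it has not been overridden on this cluster (a sketch, run from the rook-ceph toolbox; the daemon name is taken from the crash entry above):

sh-4.4$ ceph config get mds mds_wipe_sessions
sh-4.4$ ceph config show mds.ocs-storagecluster-cephfilesystem-a mds_wipe_sessions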

Version of all relevant components (if applicable):
OCP 4.8 had no issues. Problems started the same day OCP was upgraded from 4.8 to 4.9. After a further upgrade from 4.9 to 4.10, the issue persists.

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes. The MDS daemons crashing can cause the CephFS filesystem to go offline, which impacts applications in the environment.
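
When the crashes do take the filesystem offline, the state can be confirmed with the standard status commands from the toolbox (a sketch; the filesystem name is inferred from the MDS daemon names above):

sh-4.4$ ceph health detail
sh-4.4$ ceph fs status ocs-storagecluster-cephfilesystem
sh-4.4$ ceph fs dump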

Is there any workaround available to the best of your knowledge?
None.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1 - just an upgrade

Is this issue reproducible?
Unknown.

Can this issue be reproduced from the UI?
Unknown.

If this is a regression, please provide more details to justify this:
Unknown.

Steps to Reproduce:
1. Run OCP 4.8 with OCS.
2. Upgrade to OCP 4.9.
3. Wait for the MDS daemons to start crashing.


Actual results:
MDS daemons crash about three times per day. Occasionally, the CephFS filesystem has become corrupted and required repair.

Expected results:
MDS daemons should not crash.

Additional info:
This issue appears to be the same as BZ 2056935.

Here is a list of most of the MDS crashes (the ones closest to the upgrade have already been purged). All of the crashes show essentially the same output.

sh-4.4$ ceph crash ls
ID                                                                ENTITY                                   NEW
2022-10-03T18:53:32.381587Z_bdbd732e-ff61-4aa4-9837-83199f84a7c1  mds.ocs-storagecluster-cephfilesystem-b
2022-10-05T10:50:44.778476Z_4f069073-9f97-4ce7-9954-2a9585b467e1  mds.ocs-storagecluster-cephfilesystem-b
2022-10-06T10:40:25.297537Z_93408c06-981a-4e1c-a1c3-1a8b899ab085  mds.ocs-storagecluster-cephfilesystem-b
2022-10-07T17:16:27.262168Z_bda8dc9b-9da1-4f4b-a6af-15c583d8fed4  mds.ocs-storagecluster-cephfilesystem-b
2022-10-07T20:07:47.310891Z_f51f7a12-4174-495c-b362-4fbd00e28816  mds.ocs-storagecluster-cephfilesystem-b
2022-10-08T18:17:31.609481Z_22b9eccc-3ee3-4519-9b48-195173d6271c  mds.ocs-storagecluster-cephfilesystem-b
2022-10-08T22:16:30.086527Z_07963168-d4c1-4ff5-a5cb-d68802442061  mds.ocs-storagecluster-cephfilesystem-b
2022-10-09T02:43:17.565650Z_003b9811-118a-4eae-b456-df9330c3e2c9  mds.ocs-storagecluster-cephfilesystem-b
2022-10-09T05:38:38.911928Z_ee024171-894e-4c88-9286-fdc8313d4d1a  mds.ocs-storagecluster-cephfilesystem-b
2022-10-09T11:46:16.590673Z_9bc5e1ac-cffc-41bc-b0c8-9c3c56f6abe0  mds.ocs-storagecluster-cephfilesystem-b
2022-10-10T00:10:05.999169Z_e06ef522-eec3-4b1d-b626-4554552ecccb  mds.ocs-storagecluster-cephfilesystem-b
2022-10-10T10:08:36.686005Z_db3a983a-acc5-4b69-9678-664f8ee597f0  mds.ocs-storagecluster-cephfilesystem-b
2022-10-11T04:16:33.038254Z_b9950bb6-d1c6-4ea6-af25-757a70cc29b5  mds.ocs-storagecluster-cephfilesystem-b
2022-10-11T07:06:58.985809Z_5a98cd12-1d0c-4fa7-bbcf-c8c0e1af7d06  mds.ocs-storagecluster-cephfilesystem-b
2022-10-11T23:37:26.973810Z_bc8569b7-2a6b-44e2-8dc2-45111f98a52b  mds.ocs-storagecluster-cephfilesystem-b
2022-10-12T13:37:19.459402Z_3a583ace-0cbb-4654-baa3-d6051d9d9dd7  mds.ocs-storagecluster-cephfilesystem-b
2022-10-12T13:37:22.507612Z_5fd6ff2c-101e-4795-a3ad-f63b77d5c687  mds.ocs-storagecluster-cephfilesystem-b
2022-10-12T13:38:31.365456Z_a33fc777-4595-4127-b3f8-edee351ab1b2  mds.ocs-storagecluster-cephfilesystem-b
2022-10-12T13:39:00.190922Z_14cc3bc9-6723-4b21-9502-2ac2f6a6c62a  mds.ocs-storagecluster-cephfilesystem-b
2022-10-12T13:39:49.053235Z_c9d391c8-c22b-4f82-88de-7100024e6367  mds.ocs-storagecluster-cephfilesystem-b
2022-10-12T13:41:17.560776Z_d638be1f-7f6d-4908-b084-6b3fd90ef534  mds.ocs-storagecluster-cephfilesystem-b
2022-10-12T13:44:02.872969Z_6a2d47b7-baff-4966-b43e-80258429572a  mds.ocs-storagecluster-cephfilesystem-b
2022-10-12T23:47:48.381060Z_03b19db5-b25a-460c-94ff-49265dbdc7af  mds.ocs-storagecluster-cephfilesystem-b
2022-10-13T07:16:04.379333Z_ecf68561-6ce7-460e-8d16-59b517a25077  mds.ocs-storagecluster-cephfilesystem-b
2022-10-14T01:15:49.768824Z_69431f01-5c92-43e2-9798-4e5eb83c88af  mds.ocs-storagecluster-cephfilesystem-b
2022-10-14T13:11:27.619303Z_2350cdc7-8fdf-4d75-a88d-8c5e64e9ca1c  mds.ocs-storagecluster-cephfilesystem-b
2022-10-15T20:38:39.696705Z_b7829023-a745-4af2-a5bc-d3744ef0ab73  mds.ocs-storagecluster-cephfilesystem-b
2022-10-15T23:22:53.837330Z_052120ad-6a9e-474c-8461-0e0e632806a2  mds.ocs-storagecluster-cephfilesystem-b
2022-10-16T04:09:57.956446Z_2388d4ea-b78f-4d68-87b7-dc71c48da7dd  mds.ocs-storagecluster-cephfilesystem-b
2022-10-16T10:37:19.093492Z_d10e980d-c926-48c9-9ef2-beb3d60367ad  mds.ocs-storagecluster-cephfilesystem-b
2022-10-18T08:03:37.247281Z_83cdcdb0-62bd-4294-bc5f-b1ed622f5fa1  mds.ocs-storagecluster-cephfilesystem-b
2022-10-18T20:45:40.365667Z_d4b6548f-95f6-47a1-95dd-f9b8ceb08101  mds.ocs-storagecluster-cephfilesystem-b
2022-10-19T03:16:32.387364Z_c2698703-b209-4825-8eaa-86e7d1794cda  mds.ocs-storagecluster-cephfilesystem-b
2022-10-22T03:04:33.600885Z_9900ff19-f328-4ac4-a90a-4213474e05c3  mds.ocs-storagecluster-cephfilesystem-b
2022-10-22T07:59:10.834201Z_3eb1ef50-f4fb-451f-978c-24c5a377248a  mds.ocs-storagecluster-cephfilesystem-b
2022-10-22T11:03:32.833396Z_6e8f4f91-3d90-4ba2-abe0-83a190aab285  mds.ocs-storagecluster-cephfilesystem-b
2022-11-02T13:31:48.620757Z_aa8b6658-bcd8-48fb-93be-8312a47a4bca  mds.ocs-storagecluster-cephfilesystem-b
2022-11-03T05:06:23.371895Z_695c2dac-538a-4d7b-8f59-45724fb2d1e2  mds.ocs-storagecluster-cephfilesystem-b
2022-11-03T21:53:19.368170Z_ad1ec006-dfcb-42e8-9316-cb27baa02a40  mds.ocs-storagecluster-cephfilesystem-b
2022-11-04T01:30:11.849954Z_03d13983-7f87-4b99-97e4-476347bc30a4  mds.ocs-storagecluster-cephfilesystem-b
2022-11-04T03:49:25.907894Z_a6d4a662-8520-4005-9e72-a87714d1d058  mds.ocs-storagecluster-cephfilesystem-b
2022-11-04T11:52:09.209665Z_2a9b081b-c89a-4c6f-8745-a5688111f03d  mds.ocs-storagecluster-cephfilesystem-b
2022-11-05T01:51:12.988572Z_3ea8b80b-5ad1-431c-81cd-416e2214c09c  mds.ocs-storagecluster-cephfilesystem-b
2022-11-05T18:00:38.810350Z_8794997b-dc5d-4d55-9c8b-aef79d8bf0f0  mds.ocs-storagecluster-cephfilesystem-b
2022-11-06T15:06:09.812841Z_55ba0505-d200-44bc-a5cd-993a196ae14a  mds.ocs-storagecluster-cephfilesystem-a
2022-11-09T22:08:40.143859Z_4c04a689-d1d7-457c-97c5-2d579518da57  mds.ocs-storagecluster-cephfilesystem-a   *

Comment 10 Venky Shankar 2022-11-18 13:57:53 UTC
Thanks, Steve. I'll have a look.

(keeping NI)