Bug 2056935 - [cee/sd][cephfs] MDS daemon crashes during and after upgrade of Ceph from RHCS 5.0z3 to RHCS 5.0z4
Summary: [cee/sd][cephfs] MDS daemon crashes during and after upgrade of Ceph fro...
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 5.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 5.1z1
Assignee: Venky Shankar
QA Contact: Hemanth Kumar
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-02-22 11:54 UTC by Prasanth M V
Modified: 2022-09-29 13:34 UTC
8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-07 15:17:12 UTC
Embargoed:
hyelloji: needinfo-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 2061672 1 None None None 2022-03-16 17:06:41 UTC
Red Hat Issue Tracker RHCEPH-3564 0 None None None 2022-02-22 11:58:22 UTC
Red Hat Knowledge Base (Solution) 6952101 0 None None None 2022-05-03 09:52:25 UTC

Description Prasanth M V 2022-02-22 11:54:50 UTC
Description of problem:

The customer was upgrading the cluster from RHCS 5.0z3 (ceph version 16.2.0-146.el8cp) to RHCS 5.0z4 (ceph version 16.2.0-152.el8cp).
The upgrade stalled because the MDS daemon crashed during the upgrade.

The upgrade was started around "Feb  9 15:51", and shortly afterwards the MDS daemon crashed with a segmentation fault in thread "md_log_replay".
The first MDS crash during the upgrade occurred at "Feb  9 16:09:36":

Feb  9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: *** Caught signal (Segmentation fault) **
Feb  9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: in thread 7fb911bcf700 thread_name:md_log_replay
Feb  9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)
Feb  9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: 1: /lib64/libpthread.so.0(+0x12c20) [0x7fb920df7c20]
Feb  9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: 2: /usr/lib64/ceph/libceph-common.so.2(+0x8ebb400) [0x7fb92ac93400]
Feb  9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: debug 2022-02-09T15:09:36.553+0000 7fb911bcf700 -1 *** Caught signal (Segmentation fault) **
Feb  9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: in thread 7fb911bcf700 thread_name:md_log_replay
Feb  9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: 
Feb  9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)
Feb  9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: 1: /lib64/libpthread.so.0(+0x12c20) [0x7fb920df7c20]
Feb  9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: 2: /usr/lib64/ceph/libceph-common.so.2(+0x8ebb400) [0x7fb92ac93400]
Feb  9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

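For reference, crash reports like the one above are normally also recorded by the cluster's crash module and can be pulled for analysis. This is a generic sketch (the crash IDs on this cluster are not known here):

  # List crashes recorded by the cluster and dump the metadata/backtrace of one of them
  ceph crash ls
  ceph crash info <crash-id>
  # Archive a crash once collected, so it no longer raises RECENT_CRASH health warnings
  ceph crash archive <crash-id>
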

Because the crashed MDS daemon was blocking the upgrade, the customer removed it and completed the upgrade of the cluster to ceph version 16.2.0-152.el8cp.

After removing the failed MDS daemon and completing the upgrade, the customer recreated a new MDS daemon around "Feb 10 13:37". The new MDS daemon also crashed with a segmentation fault in thread "md_log_replay" in the same time frame:

Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: *** Caught signal (Segmentation fault) **
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: in thread 7f7f763b2700 thread_name:md_log_replay
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 1: /lib64/libpthread.so.0(+0x12c20) [0x7f7f855dac20]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 2: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x3a5e) [0x5649a418c59e]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 3: (EUpdate::replay(MDSRank*)+0x40) [0x5649a4190070]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 4: (MDLog::_replay_thread()+0xbd9) [0x5649a4117149]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 5: (MDLog::ReplayThread::entry()+0x11) [0x5649a3e1a101]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 6: /lib64/libpthread.so.0(+0x817a) [0x7f7f855d017a]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 7: clone()
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: debug 2022-02-10T12:37:21.773+0000 7f7f763b2700 -1 *** Caught signal (Segmentation fault) **
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: in thread 7f7f763b2700 thread_name:md_log_replay
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 1: /lib64/libpthread.so.0(+0x12c20) [0x7f7f855dac20]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 2: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x3a5e) [0x5649a418c59e]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 3: (EUpdate::replay(MDSRank*)+0x40) [0x5649a4190070]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 4: (MDLog::_replay_thread()+0xbd9) [0x5649a4117149]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 5: (MDLog::ReplayThread::entry()+0x11) [0x5649a3e1a101]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 6: /lib64/libpthread.so.0(+0x817a) [0x7f7f855d017a]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 7: clone()
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

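To capture more detail from the journal replay crash (EMetaBlob::replay), MDS debug logging can be raised before the daemon restarts and replays the log again. A minimal sketch, not something the customer ran:

  # Raise MDS debug logging centrally (very verbose; revert afterwards)
  ceph config set mds debug_mds 20
  ceph config set mds debug_ms 1
  # Watch the MDS state while it goes through replay
  ceph fs status
  ceph health detail
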

The upgrade was performed with the command "# ceph orch upgrade start --image registry.uppmax.uu.se/registry.redhat.ui/rceph/rhceph-5-rhel8:5-103". This is a local mirror, as the customer does not have internet access from the cluster.
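For context, a cephadm upgrade of this kind can be monitored and, if needed, paused while a failing daemon is investigated. A sketch of the relevant orchestrator commands (the image URL is the customer's local mirror from above):

  # Start the upgrade against the local mirror, then follow its progress
  ceph orch upgrade start --image registry.uppmax.uu.se/registry.redhat.ui/rceph/rhceph-5-rhel8:5-103
  ceph orch upgrade status
  # Pause/resume the upgrade while investigating a failing daemon, or stop it entirely
  ceph orch upgrade pause
  ceph orch upgrade resume
  ceph orch upgrade stop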

The failed MDS was removed with "ceph orch rm [mds-name]".

The new MDS was deployed with "ceph orch apply mds".
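For illustration, the remove/redeploy sequence with the orchestrator would look roughly as follows; the service name mds.test_root is inferred from the daemon names in the logs above, and the placement count is a placeholder:

  # Remove the failed MDS service, then re-apply it with an explicit placement
  ceph orch rm mds.test_root
  ceph orch apply mds test_root --placement=2
  # Confirm the daemons come back and the filesystem has an active MDS
  ceph orch ps --daemon-type mds
  ceph fs status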



Version-Release number of selected component (if applicable):

Red Hat Ceph Storage 5.0z3 - 5.0.3   ceph version 16.2.0-146.el8cp pacific (stable)   --> Before Upgrade
Red Hat Ceph Storage 5.0z4 - 5.0.4   ceph version 16.2.0-152.el8cp pacific (stable)   --> After Upgrade
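Whether all daemons actually landed on the post-upgrade build can be confirmed from the cluster itself; a small sketch:

  # Per-daemon-type version summary, plus the individual MDS daemons and their images
  ceph versions
  ceph orch ps --daemon-type mds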

Comment 3 Venky Shankar 2022-02-23 04:43:10 UTC
Will take a look.

Comment 18 Greg Farnum 2022-04-06 15:23:09 UTC
Venky, is there anything more we can do with this bz?

Comment 20 Venky Shankar 2022-04-07 06:49:59 UTC
Prasanth,

Can this bz be closed?

Comment 21 Prasanth M V 2022-04-07 15:14:14 UTC
Hi Venky,

Yes, we can close the BZ, as the case corresponding to this BZ was resolved by the workaround we provided and has now been closed.

Thanks for your support.

Regards,
Prasanth.M.V

