Bug 2056935

Summary: [cee/sd][cephfs] MDS daemon crashes during and after upgrade of Ceph from RHCS 5.0z3 to RHCS 5.0z4
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Prasanth M V <pmv>
Component: CephFS
Assignee: Venky Shankar <vshankar>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Hemanth Kumar <hyelloji>
Severity: medium
Priority: unspecified
Version: 5.0
CC: amk, ceph-eng-bugs, ceph-qe-bugs, gfarnum, gjose, hyelloji, mmuench, vshankar
Target Milestone: ---
Flags: hyelloji: needinfo-
Target Release: 5.1z1
Hardware: x86_64
OS: Linux
Last Closed: 2022-04-07 15:17:12 UTC
Type: Bug

Description Prasanth M V 2022-02-22 11:54:50 UTC
Description of problem:

The customer was upgrading RHCS from 5.0z3 (ceph version 16.2.0-146.el8cp) to 5.0z4 (ceph version 16.2.0-152.el8cp).
The upgrade could not complete because the MDS daemon crashed during the upgrade.

The upgrade was started around "Feb  9 15:51", and the MDS daemon later hit a Segmentation fault in the "md_log_replay" thread.
The first MDS crash during the upgrade occurred at "Feb  9 16:09:36":

Feb  9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: *** Caught signal (Segmentation fault) **
Feb  9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: in thread 7fb911bcf700 thread_name:md_log_replay
Feb  9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)
Feb  9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: 1: /lib64/libpthread.so.0(+0x12c20) [0x7fb920df7c20]
Feb  9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: 2: /usr/lib64/ceph/libceph-common.so.2(+0x8ebb400) [0x7fb92ac93400]
Feb  9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: debug 2022-02-09T15:09:36.553+0000 7fb911bcf700 -1 *** Caught signal (Segmentation fault) **
Feb  9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: in thread 7fb911bcf700 thread_name:md_log_replay
Feb  9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: 
Feb  9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)
Feb  9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: 1: /lib64/libpthread.so.0(+0x12c20) [0x7fb920df7c20]
Feb  9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: 2: /usr/lib64/ceph/libceph-common.so.2(+0x8ebb400) [0x7fb92ac93400]
Feb  9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
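For backtraces like the one above that only show raw offsets, the cluster's crash module and binutils can usually recover more detail. A minimal sketch, assuming the crash module is enabled and the matching ceph-debuginfo package is installed on the node; the helper name and crash-ID argument are illustrative:

```shell
# Sketch only: pulling crash details for a crashed MDS daemon.
# Requires a live Ceph cluster and, for addr2line, the matching
# debuginfo package; "$1" is a crash ID taken from "ceph crash ls".
inspect_mds_crash() {
    ceph crash ls                 # list crashes recorded by the crash module
    ceph crash info "$1"          # full metadata and backtrace for one crash

    # Resolve a raw frame such as libceph-common.so.2(+0x8ebb400)
    # to a source location, using the offset from the backtrace:
    addr2line -e /usr/lib64/ceph/libceph-common.so.2 0x8ebb400
}
```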


Since the crashed MDS daemon was blocking the upgrade, the customer removed it and completed the upgrade of the cluster to ceph version 16.2.0-152.el8cp.

After removing the failed MDS daemon and completing the upgrade, the customer recreated the MDS daemon around "Feb 10 13:37". The new daemon also crashed with a Segmentation fault in the "md_log_replay" thread around that time:

Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: *** Caught signal (Segmentation fault) **
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: in thread 7f7f763b2700 thread_name:md_log_replay
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 1: /lib64/libpthread.so.0(+0x12c20) [0x7f7f855dac20]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 2: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x3a5e) [0x5649a418c59e]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 3: (EUpdate::replay(MDSRank*)+0x40) [0x5649a4190070]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 4: (MDLog::_replay_thread()+0xbd9) [0x5649a4117149]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 5: (MDLog::ReplayThread::entry()+0x11) [0x5649a3e1a101]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 6: /lib64/libpthread.so.0(+0x817a) [0x7f7f855d017a]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 7: clone()
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: debug 2022-02-10T12:37:21.773+0000 7f7f763b2700 -1 *** Caught signal (Segmentation fault) **
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: in thread 7f7f763b2700 thread_name:md_log_replay
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 1: /lib64/libpthread.so.0(+0x12c20) [0x7f7f855dac20]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 2: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x3a5e) [0x5649a418c59e]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 3: (EUpdate::replay(MDSRank*)+0x40) [0x5649a4190070]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 4: (MDLog::_replay_thread()+0xbd9) [0x5649a4117149]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 5: (MDLog::ReplayThread::entry()+0x11) [0x5649a3e1a101]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 6: /lib64/libpthread.so.0(+0x817a) [0x7f7f855d017a]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 7: clone()
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


The upgrade was started with the command "ceph orch upgrade start --image registry.uppmax.uu.se/registry.redhat.ui/rceph/rhceph-5-rhel8:5-103". The image comes from a local mirror, as the customer's systems do not have internet access.

The failed MDS daemon was removed with "ceph orch rm [mds-name]".

The new MDS daemon was deployed with "ceph orch apply mds".
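The upgrade / remove / redeploy cycle described above can be sketched as a single helper. The function name is illustrative, and the image and filesystem arguments are placeholders for the customer's actual values:

```shell
# Sketch of the orchestrator steps described above. Arguments are
# placeholders; in practice, verify cluster health ("ceph -s")
# between steps before proceeding.
upgrade_and_redeploy_mds() {
    local image="$1"   # e.g. the local-mirror rhceph-5-rhel8:5-103 image
    local fs="$2"      # CephFS filesystem name

    ceph orch upgrade start --image "$image"   # begin the staged upgrade
    ceph orch upgrade status                   # poll until the upgrade finishes

    ceph orch rm "mds.$fs"                     # remove the failed MDS service
    ceph orch apply mds "$fs"                  # redeploy MDS daemons for the fs
}
```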



Version-Release number of selected component (if applicable):

Red Hat Ceph Storage 5.0z3 - 5.0.4   ceph version 16.2.0-146.el8cp pacific (stable)   --> Before Upgrade
Red Hat Ceph Storage 5.0z4 - 5.0.4   ceph version 16.2.0-152.el8cp pacific (stable)   --> After Upgrade

Comment 3 Venky Shankar 2022-02-23 04:43:10 UTC
Will take a look.

Comment 18 Greg Farnum 2022-04-06 15:23:09 UTC
Venky, is there anything more we can do with this bz?

Comment 20 Venky Shankar 2022-04-07 06:49:59 UTC
Prasanth,

Can this bz be closed?

Comment 21 Prasanth M V 2022-04-07 15:14:14 UTC
Hi Venky,

Yes, we can close the BZ. The customer case associated with it was resolved by the workaround we provided, and the case is now closed.

Thanks for your support.

Regards,
Prasanth.M.V