Description of problem:

The customer was upgrading from RHCS 5.0z3 (ceph version 16.2.0-146.el8cp) to RHCS 5.0z4 (ceph version 16.2.0-152.el8cp). The upgrade stalled because the MDS daemon crashed during the upgrade activity. The upgrade was started around "Feb 9 15:51", and the MDS daemon later asserted with a segmentation fault in the "md_log_replay" thread. The first MDS crash during the upgrade was at "Feb 9 16:09:36":

Feb 9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: *** Caught signal (Segmentation fault) **
Feb 9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: in thread 7fb911bcf700 thread_name:md_log_replay
Feb 9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)
Feb 9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: 1: /lib64/libpthread.so.0(+0x12c20) [0x7fb920df7c20]
Feb 9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: 2: /usr/lib64/ceph/libceph-common.so.2(+0x8ebb400) [0x7fb92ac93400]
Feb 9 16:09:36 <host> ceph-<id>-mds-test_root-<host>-syzymc[853437]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

(The same backtrace is repeated in the daemon's debug log at 2022-02-09T15:09:36.553+0000.)

Because the crashed MDS daemon was blocking the upgrade, the customer removed it and completed the upgrade of the cluster to ceph version 16.2.0-152.el8cp. After removing the failed MDS daemon and finishing the upgrade, the customer recreated a new MDS daemon around "Feb 10 13:37". It also crashed, again asserting with a segmentation fault in the "md_log_replay" thread:

Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: *** Caught signal (Segmentation fault) **
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: in thread 7f7f763b2700 thread_name:md_log_replay
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 1: /lib64/libpthread.so.0(+0x12c20) [0x7f7f855dac20]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 2: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x3a5e) [0x5649a418c59e]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 3: (EUpdate::replay(MDSRank*)+0x40) [0x5649a4190070]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 4: (MDLog::_replay_thread()+0xbd9) [0x5649a4117149]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 5: (MDLog::ReplayThread::entry()+0x11) [0x5649a3e1a101]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 6: /lib64/libpthread.so.0(+0x817a) [0x7f7f855d017a]
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: 7: clone()
Feb 10 13:37:21 <host> ceph-<id>-mds-test_root-<host>-kswjjc[1548667]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

(The same backtrace is repeated in the daemon's debug log at 2022-02-10T12:37:21.773+0000.)

The upgrade was started with "ceph orch upgrade start --image registry.uppmax.uu.se/registry.redhat.ui/rceph/rhceph-5-rhel8:5-103". This is a local mirror, as the customer's systems do not have internet access. The failed MDS daemon was removed with "ceph orch rm [mds-name]", and the new MDS was deployed with "ceph orch apply mds".
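For reference, the remove-and-redeploy workaround described above can be sketched as the following orchestrator command sequence. This is a sketch under assumptions: the service name mds.test_root and the absence of a --placement spec are placeholders, not confirmed details from the customer environment.

```shell
# Assumed service name; the customer's actual MDS service name may differ.
ceph orch upgrade status      # confirm the upgrade is stalled on the MDS
ceph orch rm mds.test_root    # remove the crashing MDS service
ceph orch upgrade start --image registry.uppmax.uu.se/registry.redhat.ui/rceph/rhceph-5-rhel8:5-103
ceph orch upgrade status      # wait until the upgrade reports complete
ceph orch apply mds test_root # redeploy the MDS service afterwards
```

These commands must be run against the live cluster (e.g. from a cephadm shell); they are shown only to consolidate the steps scattered through the description.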
Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 5.0z3 - 5.0.4, ceph version 16.2.0-146.el8cp pacific (stable) --> before upgrade
Red Hat Ceph Storage 5.0z4 - 5.0.4, ceph version 16.2.0-152.el8cp pacific (stable) --> after upgrade
Will take a look.
Venky, is there anything more we can do with this bz?
Prasanth, can this BZ be closed?
Hi Venky, Yes, we can close this BZ. The case corresponding to it has been resolved by the workaround we provided, and the case is now closed. Thanks for your support. Regards, Prasanth.M.V