Bug 2002140

Summary: cephfs-mirror: terminating a mirror daemon can cause a crash at times
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Venky Shankar <vshankar>
Component: CephFS Assignee: Venky Shankar <vshankar>
Status: CLOSED ERRATA QA Contact: Hemanth Kumar <hyelloji>
Severity: medium Docs Contact: Mary Frances Hull <mhull>
Priority: medium    
Version: 5.0 CC: agunn, ceph-eng-bugs, dfuller, pdonnell, sweil, tserlin, vereddy, ymane
Target Milestone: ---   
Target Release: 5.0z1   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: ceph-16.2.0-133.el8cp Doc Type: Bug Fix
Doc Text:
.Stopping the `cephfs-mirror` daemon can result in an unclean shutdown
Previously, the `cephfs-mirror` process would terminate uncleanly due to a race condition during the `cephfs-mirror` shutdown process. With this release, the race condition was resolved, and as a result, the `cephfs-mirror` daemon shuts down gracefully.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-11-02 16:39:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1959686    

Description Venky Shankar 2021-09-08 05:03:08 UTC
Seen in this teuthology run which thrashes the mirror daemon for active/active HA test: https://pulpito.ceph.com/vshankar-2021-08-05_02:19:15-fs-wip-cephfs-mirror-ha-active-active-20210802-054956-distro-basic-smithi/

2021-08-05T02:39:35.991+0000 7fddebb3c700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fddebb3c700 thread_name:msgr-worker-1

 ceph version 17.0.0-6593-gede67e63 (ede67e630d11e5f6758fa1e18b166b29d499c421) quincy (dev)
 1: /lib64/libpthread.so.0(+0x12b20) [0x7fddf032db20]
 2: (ProtocolV2::send_message(Message*)+0xa1) [0x7fddf14e0f37]
 3: (AsyncConnection::send_message(Message*)+0x813) [0x7fddf14ad0db]
 4: (Connection::send_message2(boost::intrusive_ptr<Message>)+0x1e) [0x7fddf14ade22]
 5: (MonClient::_send_mon_message(boost::intrusive_ptr<Message>)+0x8a) [0x7fddf1588568]
 6: (MonClient::_finish_hunting(int)+0x5f9) [0x7fddf1593eb5]
 7: (MonClient::handle_auth_done(Connection*, AuthConnectionMeta*, unsigned long, unsigned int, ceph::buffer::v15_2_0::list const&, CryptoKey*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)+0x2ce) [0x7fddf1595400]
 8: (ProtocolV2::handle_auth_done(ceph::buffer::v15_2_0::list&)+0x4b2) [0x7fddf14f15b4]
 9: (ProtocolV2::handle_frame_payload()+0x1f6) [0x7fddf1500130]                                                                                                                                                                                                                                                            
 10: (ProtocolV2::handle_read_frame_dispatch()+0x179) [0x7fddf15003cf]                                                                                                                                                                                                                                                     
 11: (ProtocolV2::_handle_read_frame_epilogue_main()+0xc2) [0x7fddf15005e8]                                                                                                                                                                                                                                                
 12: (ProtocolV2::_handle_read_frame_segment()+0xa6) [0x7fddf1500938]                                                                                                                                                                                                                                                      
 13: (ProtocolV2::handle_read_frame_segment(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0xc7) [0x7fddf1501f13]                                                                                                                                                     
 14: (CtRxNode<ProtocolV2>::call(ProtocolV2*) const+0x31) [0x7fddf1502621]                                                                                                                                                                                                                                                 
 15: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x3b) [0x7fddf14e7eaf]                                                                                                                                                                                                                                                 
 16: /usr/lib64/ceph/libceph-common.so.2(+0x65d495) [0x7fddf14e8495]                                                                                                                                                                                                                                                       
 17: (std::function<void (char*, long)>::operator()(char*, long) const+0x23) [0x7fddf14ae307]                                                                                                                                                                                                                              
 18: (AsyncConnection::process()+0xeb5) [0x7fddf14ac099]                                                                                                                                                                                                                                                                   
 19: (C_handle_read::do_request(unsigned long)+0x16) [0x7fddf14aee24]                                                                                                                                                                                                                                                      
 20: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x594) [0x7fddf150a7a0]                                                                                                                                                                               
 21: /usr/lib64/ceph/libceph-common.so.2(+0x68a4e7) [0x7fddf15154e7]                                                                                                                                                                                                                                                       
 22: (std::function<void ()>::operator()() const+0x12) [0x7fddf1513ba6]                                                                                                                                                                                                                                                    
 23: (std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void ()> > > >::_M_run()+0x11) [0x7fddf1513bc1]                                                                                                                                                                                              
 24: /lib64/libstdc++.so.6(+0xc2ba3) [0x7fddef562ba3]                                                                                                                                                                                                                                                                      
 25: /lib64/libpthread.so.0(+0x814a) [0x7fddf032314a]                                                                                                                                                                                                                                                                      
 26: clone()                                                                                                                                                                                                                                                                                                               
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.           


Basically, there is a race between the mirror daemon shutting down and the mirror daemon receiving an fs_map update:

2021-08-05T02:39:05.977+0000 7fddf302af80 10 cephfs::mirror::Mirror run: canceling timer task=0x55e35491d4e0
2021-08-05T02:39:05.977+0000 7fddf302af80 10 cephfs::mirror::Mirror run: trying to shutdown filesystem={fscid=2, fs_name=cephfs}
2021-08-05T02:39:05.977+0000 7fddf302af80 20 cephfs::mirror::FSMirror shutdown
2021-08-05T02:39:05.977+0000 7fddf302af80 20 cephfs::mirror::FSMirror shutdown_peer_replayers
2021-08-05T02:39:05.977+0000 7fddf302af80  5 cephfs::mirror::FSMirror shutdown_peer_replayers: shutting down replayer for peer={uuid=3aeddb3f-3d31-4db5-9da0-aeed11538b3c, remote_cluster={client_name=client.mirror_remote, cluster_name=ceph, fs_name=backup_fs}}
2021-08-05T02:39:05.977+0000 7fddf302af80 20 cephfs::mirror::PeerReplayer(3aeddb3f-3d31-4db5-9da0-aeed11538b3c) shutdown
2021-08-05T02:39:05.977+0000 7fddcbafc700  5 cephfs::mirror::PeerReplayer(3aeddb3f-3d31-4db5-9da0-aeed11538b3c) run: exiting
2021-08-05T02:39:05.977+0000 7fddcb2fb700  5 cephfs::mirror::PeerReplayer(3aeddb3f-3d31-4db5-9da0-aeed11538b3c) run: exiting
2021-08-05T02:39:05.977+0000 7fddcc2fd700  5 cephfs::mirror::PeerReplayer(3aeddb3f-3d31-4db5-9da0-aeed11538b3c) run: exiting
2021-08-05T02:39:05.980+0000 7fddf302af80 20 cephfs::mirror::FSMirror shutdown_mirror_watcher
2021-08-05T02:39:05.980+0000 7fddf302af80 20 cephfs::mirror::MirrorWatcher shutdown
2021-08-05T02:39:05.980+0000 7fddf302af80 20 cephfs::mirror::MirrorWatcher unregister_watcher
2021-08-05T02:39:05.980+0000 7fddf302af80 20 cephfs::mirror::Watcher unregister_watch
2021-08-05T02:39:05.982+0000 7fdde9337700 20 cephfs::mirror::MirrorWatcher handle_unregister_watcher: r=0
2021-08-05T02:39:05.982+0000 7fdde9337700 20 cephfs::mirror::FSMirror handle_shutdown_mirror_watcher: r=0
2021-08-05T02:39:05.982+0000 7fdde9337700 20 cephfs::mirror::FSMirror shutdown_instance_watcher
2021-08-05T02:39:05.982+0000 7fdde9337700 20 cephfs::mirror::InstanceWatcher shutdown
2021-08-05T02:39:05.982+0000 7fdde9337700 20 cephfs::mirror::InstanceWatcher unregister_watcher
2021-08-05T02:39:05.982+0000 7fdde9337700 20 cephfs::mirror::Watcher unregister_watch
2021-08-05T02:39:05.983+0000 7fdde9337700 20 cephfs::mirror::InstanceWatcher handle_unregister_watcher: r=0
2021-08-05T02:39:05.983+0000 7fdde9337700 20 cephfs::mirror::InstanceWatcher remove_instance
2021-08-05T02:39:05.985+0000 7fdde0325700 20 cephfs::mirror::InstanceWatcher handle_remove_instance: r=0
2021-08-05T02:39:05.985+0000 7fdde9337700 20 cephfs::mirror::FSMirror handle_shutdown_instance_watcher: r=0
2021-08-05T02:39:05.985+0000 7fdde9337700 20 cephfs::mirror::FSMirror cleanup
2021-08-05T02:39:06.435+0000 7fddebb3c700 20 cephfs::mirror::ClusterWatcher handle_fsmap
2021-08-05T02:39:06.435+0000 7fddebb3c700  5 cephfs::mirror::ClusterWatcher handle_fsmap: mirroring enabled=[], mirroring_disabled=[{fscid=2, fs_name=cephfs}]
2021-08-05T02:39:06.435+0000 7fddebb3c700 10 cephfs::mirror::ServiceDaemon: 0x55e35494c4e0 remove_filesystem: fscid=2
2021-08-05T02:39:06.435+0000 7fddebb3c700 10 cephfs::mirror::ServiceDaemon: 0x55e35494c4e0 schedule_update_status
2021-08-05T02:39:06.435+0000 7fddebb3c700 10 cephfs::mirror::Mirror mirroring_disabled: filesystem={fscid=2, fs_name=cephfs}
2021-08-05T02:39:07.435+0000 7fdde4b2e700 20 cephfs::mirror::ServiceDaemon: 0x55e35494c4e0 update_status: 0 filesystem(s)
2021-08-05T02:39:35.986+0000 7fddf302af80 10 cephfs::mirror::Mirror run: shutdown filesystem={fscid=2, fs_name=cephfs}, r=0
2021-08-05T02:39:35.986+0000 7fddf302af80 20 cephfs::mirror::FSMirror ~FSMirror
2021-08-05T02:39:35.986+0000 7fddf302af80 10 cephfs::mirror::Mirror ~Mirror
2021-08-05T02:39:35.986+0000 7fddebb3c700  5 cephfs::mirror::Mirror mirroring_disabledshutting down
2021-08-05T02:39:35.986+0000 7fddebb3c700  5 cephfs::mirror::ClusterWatcher handle_fsmap: peers added={}, peers removed={}
2021-08-05T02:39:35.986+0000 7fddf302af80 10 cephfs::mirror::ServiceDaemon: 0x55e35494c4e0 ~ServiceDaemon
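
In the log above, thread 7fddebb3c700 enters mirroring_disabled at 02:39:06 and only continues at 02:39:35.986, after thread 7fddf302af80 has already run ~Mirror. One common way to close this kind of window is to have shutdown set a flag under the lock and wait for in-flight handlers to drain, while handlers that arrive later are dropped. The following is a minimal standalone sketch of that pattern, not the actual fix that went into cephfs-mirror. All names here (MirrorSketch, handle_update, simulate_shutdown_race) are hypothetical and do not exist in the Ceph tree.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a daemon object that receives map updates on
// one thread while another thread shuts it down. Not Ceph code.
class MirrorSketch {
public:
  // Called from an "updater" thread. Returns false if the update was
  // dropped because shutdown has already begun.
  bool handle_update() {
    std::unique_lock<std::mutex> l(lock_);
    if (stopping_) {
      return false;
    }
    ++in_flight_;
    l.unlock();
    std::this_thread::yield();  // stand-in for work done without the lock
    l.lock();
    ++handled_;
    if (--in_flight_ == 0) {
      cond_.notify_all();
    }
    return true;
  }

  // Marks the object as stopping, then blocks until every handler that
  // was already running has finished.
  void shutdown() {
    std::unique_lock<std::mutex> l(lock_);
    stopping_ = true;
    cond_.wait(l, [this] { return in_flight_ == 0; });
    assert(in_flight_ == 0);
  }

  int handled() {
    std::lock_guard<std::mutex> l(lock_);
    return handled_;
  }

private:
  std::mutex lock_;
  std::condition_variable cond_;
  bool stopping_ = false;
  int in_flight_ = 0;
  int handled_ = 0;
};

struct SimulationResult {
  int handled_at_shutdown;    // updates handled when shutdown() returned
  int handled_after_join;     // updates handled after all threads exited
  bool late_update_accepted;  // whether an update after shutdown got in
};

// Runs several updater threads against one object and shuts it down
// while they are still sending updates.
SimulationResult simulate_shutdown_race() {
  MirrorSketch mirror;
  std::atomic<bool> done{false};
  std::vector<std::thread> updaters;
  for (int i = 0; i < 4; ++i) {
    updaters.emplace_back([&] {
      while (!done.load()) {
        mirror.handle_update();
      }
    });
  }

  mirror.shutdown();
  SimulationResult r;
  r.handled_at_shutdown = mirror.handled();
  r.late_update_accepted = mirror.handle_update();

  done.store(true);
  for (auto& t : updaters) {
    t.join();
  }
  r.handled_after_join = mirror.handled();
  return r;
}
```

Calling simulate_shutdown_race() should show that no update is handled after shutdown() returns. Note that the sketch keeps the object alive until all updater threads are joined; a late handler still has to read the flag, so in a real daemon the source of callbacks would also need to be stopped before the object is destroyed.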

Comment 11 errata-xmlrpc 2021-11-02 16:39:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4105