Description of problem (please be as detailed as possible and provide log snippets):

When running the whole workload suite [1], the test "tests.e2e.workloads.ocp.registry.test_registry_reboot_node.TestRegistryRebootNode" failed on teardown at the health check because the rook-ceph-mds-ocs-storagecluster-cephfilesystem-a pod crashed.

{
    "cluster_fingerprint": "f1aebac8-0412-45ec-92b7-1043c2cfed3b",
    "version": "16.2.8-59.el8cp",
    "commit": "4e10c5fa8a9bc0a421a4dd0833f951ab1cfdcfa7",
    "timestamp": "2022-06-30T22:56:19.317455+0000",
    "tag": "",
    "health": {
        "status": "HEALTH_WARN",
        "checks": {
            "RECENT_CRASH": {
                "severity": "HEALTH_WARN",
                "summary": {
                    "message": "1 daemons have recently crashed",
                    "count": 1
                },
                "detail": [
                    {
                        "message": "mds.ocs-storagecluster-cephfilesystem-a crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-76647b8czhpp5 at 2022-06-30T21:51:01.775596Z"
                    }
                ],
                "muted": false
            }

The mds pod crashed for the reason below:

2022-06-30T21:50:47.422596523Z debug 2022-06-30T21:50:47.421+0000 7f41de37f900 0 pidfile_write: ignore empty --pid-file
2022-06-30T21:50:47.423097588Z starting mds.ocs-storagecluster-cephfilesystem-a at
2022-06-30T21:50:47.428799237Z debug 2022-06-30T21:50:47.428+0000 7f41cc4ca700 1 mds.ocs-storagecluster-cephfilesystem-a Updating MDS map to version 52 from mon.1
2022-06-30T21:50:48.412591616Z debug 2022-06-30T21:50:48.412+0000 7f41cc4ca700 1 mds.ocs-storagecluster-cephfilesystem-a Updating MDS map to version 53 from mon.1
2022-06-30T21:50:48.412655414Z debug 2022-06-30T21:50:48.412+0000 7f41cc4ca700 1 mds.ocs-storagecluster-cephfilesystem-a Monitors have assigned me to become a standby.
2022-06-30T21:50:48.423681377Z debug 2022-06-30T21:50:48.423+0000 7f41cc4ca700 1 mds.ocs-storagecluster-cephfilesystem-a Updating MDS map to version 54 from mon.1
2022-06-30T21:50:48.426313648Z debug 2022-06-30T21:50:48.423+0000 7f41cc4ca700 1 mds.0.0 handle_mds_map i am now mds.83389.0 replaying mds.0.0
2022-06-30T21:50:48.426313648Z debug 2022-06-30T21:50:48.423+0000 7f41cc4ca700 1 mds.0.0 handle_mds_map state change up:boot --> up:standby-replay
2022-06-30T21:50:48.426313648Z debug 2022-06-30T21:50:48.423+0000 7f41cc4ca700 1 mds.0.0 replay_start
2022-06-30T21:50:48.438768590Z debug 2022-06-30T21:50:48.438+0000 7f41c64be700 0 mds.0.cache creating system inode with ino:0x100
2022-06-30T21:50:48.438815024Z debug 2022-06-30T21:50:48.438+0000 7f41c64be700 0 mds.0.cache creating system inode with ino:0x1
2022-06-30T21:50:59.055822268Z debug 2022-06-30T21:50:59.055+0000 7f41ce4ce700 1 mds.ocs-storagecluster-cephfilesystem-a asok_command: status {prefix=status} (starting...)
2022-06-30T21:51:01.772802133Z /builddir/build/BUILD/ceph-16.2.8/src/mds/MDLog.cc: In function 'void MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)' thread 7f41cc4ca700 time 2022-06-30T21:51:01.772738+0000
2022-06-30T21:51:01.772802133Z /builddir/build/BUILD/ceph-16.2.8/src/mds/MDLog.cc: 281: FAILED ceph_assert(!mds->is_any_replay())
2022-06-30T21:51:01.774204679Z ceph version 16.2.8-59.el8cp (4e10c5fa8a9bc0a421a4dd0833f951ab1cfdcfa7) pacific (stable)
2022-06-30T21:51:01.774204679Z 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f41d4ef7eb8]
2022-06-30T21:51:01.774204679Z 2: /usr/lib64/ceph/libceph-common.so.2(+0x2780d2) [0x7f41d4ef80d2]
2022-06-30T21:51:01.774204679Z 3: (MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)+0x3f) [0x5565f835324f]
2022-06-30T21:51:01.774204679Z 4: (Server::journal_close_session(Session*, int, Context*)+0x78c) [0x5565f80816dc]
2022-06-30T21:51:01.774204679Z 5: (Server::kill_session(Session*, Context*)+0x212) [0x5565f8081e32]
2022-06-30T21:51:01.774204679Z 6: (Server::apply_blocklist()+0x10d) [0x5565f80820ed]
2022-06-30T21:51:01.774204679Z 7: (MDSRank::apply_blocklist(std::set<entity_addr_t, std::less<entity_addr_t>, std::allocator<entity_addr_t> > const&, unsigned int)+0x34) [0x5565f803ea24]
2022-06-30T21:51:01.774204679Z 8: (MDSRankDispatcher::handle_osd_map()+0xf6) [0x5565f803ed66]
2022-06-30T21:51:01.774204679Z 9: (MDSDaemon::handle_core_message(boost::intrusive_ptr<Message const> const&)+0x33b) [0x5565f8028d9b]
2022-06-30T21:51:01.774204679Z 10: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0xc3) [0x5565f8029683]
2022-06-30T21:51:01.774204679Z 11: (DispatchQueue::entry()+0x126a) [0x7f41d513feca]
2022-06-30T21:51:01.774204679Z 12: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f41d51f26d1]
2022-06-30T21:51:01.774204679Z 13: /lib64/libpthread.so.0(+0x81cf) [0x7f41d3eda1cf]
2022-06-30T21:51:01.774204679Z 14: clone()
2022-06-30T21:51:01.774236825Z debug *** Caught signal (Aborted) **
2022-06-30T21:51:01.774236825Z in thread 7f41cc4ca700 thread_name:ms_dispatch
2022-06-30T21:51:01.774254086Z 2022-06-30T21:51:01.772+0000 7f41cc4ca700 -1 /builddir/build/BUILD/ceph-16.2.8/src/mds/MDLog.cc: In function 'void MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)' thread 7f41cc4ca700 time 2022-06-30T21:51:01.772738+0000
2022-06-30T21:51:01.774254086Z /builddir/build/BUILD/ceph-16.2.8/src/mds/MDLog.cc: 281: FAILED ceph_assert(!mds->is_any_replay())

TC - https://github.com/red-hat-storage/ocs-ci/blob/master/tests/e2e/workloads/ocp/registry/test_registry_reboot_node.py

[1]: jenkins job link - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/14170/
[2]: must-gather log link - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/mashetty-wl29/mashetty-wl29_20220630T061327/logs/failed_testcase_ocs_logs_1656572860/test_registry_rolling_reboot_node%5bmaster%5d_ocs_logs/ocs_must_gather/
[3]: Console output - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/mashetty-wl29/mashetty-wl29_20220630T061327/logs/

This test case reboots the nodes one by one in a rolling manner.

Version of all relevant components (if applicable):
ODF-4.11
OCP-4.11

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible? Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Run the workload suite.
2. When the tests.e2e.workloads.ocp.registry.test_registry_reboot_node.TestRegistryRebootNode test runs, it fails on teardown because mds-a crashed.

Actual results:
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a crashes.

Expected results:
The rook-ceph-mds-ocs-storagecluster-cephfilesystem-a pod should not crash.

Additional info:
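For anyone triaging a similar failure, here is a minimal, hypothetical sketch (not part of ocs-ci; it assumes the default openshift-storage namespace, the rook label app=rook-ceph-mds, and a rook-ceph-tools toolbox deployment, which may differ on a given cluster) that checks whether the previous MDS container log carries this assert and lists the crash reports behind the RECENT_CRASH warning:

#!/usr/bin/env python3
# Triage sketch: confirm the MDS crash signature and list recent Ceph crash
# reports. Names below (namespace, label, toolbox deployment) are the usual
# ODF defaults and are assumptions, not taken from the must-gather.
import subprocess

NAMESPACE = "openshift-storage"
ASSERT_SIG = "FAILED ceph_assert(!mds->is_any_replay())"

def run(cmd):
    # Run a command and return stdout; a non-zero exit raises CalledProcessError.
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# 1. Find the MDS pods and grep the previous (crashed) container log of mds-a.
#    --previous only helps if the container restarted in place rather than the
#    pod being recreated.
pods = run(["oc", "-n", NAMESPACE, "get", "pods", "-o", "name",
            "-l", "app=rook-ceph-mds"]).split()
for pod in pods:
    if "cephfilesystem-a" in pod:
        log = run(["oc", "-n", NAMESPACE, "logs", "--previous", pod])
        print(f"{pod}: assert signature present: {ASSERT_SIG in log}")

# 2. List the crash reports the cluster has recorded.
print(run(["oc", "-n", NAMESPACE, "rsh", "deploy/rook-ceph-tools",
           "ceph", "crash", "ls"]))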
To help understand the impact:
- Is this a new test suite?
- How often does the crash happen? Is it rare, frequent, or consistent?
- After the mds crashes, does the test continue successfully after the mds comes back up?
- This shouldn't affect the functionality since this mds is the standby, per the log: "Monitors have assigned me to become a standby." (A quick way to confirm which daemon is the standby is sketched below.)

Scott, could someone from cephfs take a look at the mds crash?
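For reference, a minimal sketch of that standby check, assuming the default rook-ceph-tools toolbox deployment in openshift-storage (an assumption, not something taken from this cluster's must-gather):

# Sketch: show which MDS daemon is active and which is standby-replay for the
# ocs-storagecluster-cephfilesystem. Assumes the rook-ceph toolbox is deployed.
import subprocess

out = subprocess.run(
    ["oc", "-n", "openshift-storage", "rsh", "deploy/rook-ceph-tools",
     "ceph", "fs", "status", "ocs-storagecluster-cephfilesystem"],
    check=True, capture_output=True, text=True).stdout
print(out)  # the STATE column lists "active" vs. "standby-replay" per daemon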
@tnielsen Please find answers below.

To help understand the impact:
- Is this a new test suite? --> No
- How often does the crash happen? Is it rare, frequent, or consistent? --> Rare
- After the mds crashes, does the test continue successfully after the mds comes back up? --> Not clear, because the cluster gets destroyed.
- This shouldn't affect the functionality since this mds is the standby. --> Can you elaborate on that?
Hi tnielsen, is there any update on this?
Duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2105881
Which version of RHCS did it make it into, so we can track whether it's fixed in ODF?
(In reply to Travis Nielsen from comment #6)
> Which version of RHCS did it make it into, so we can track whether it's
> fixed in ODF?

RHCS 5.2
Thanks, we should be able to move this to ON_QA since the latest 5.2 should be there for the 4.11 build.
This is already in RHCS 5.2, so we just need acks for 4.11