Created attachment 1735080 [details]
text of run-ci output

Description of problem:
A modified version of tests/e2e/performance/test_small_file_workload.py produces the following message:

WARNING - Ceph cluster health is not OK. Health: HEALTH_WARN 1 daemons have recently crashed

Attached are:
- the must-gather tar.gz file
- the shell output
- the modified test case that reproduces this (a Python file), warts and all

The cluster-id is 9f44fb85-bc4c-4e7c-b0f4-2c22f2177453

The command that I ran was:
run-ci -m performance --cluster-name wusui-scale-dc13 --cluster-path /home/wusui/ocs-ci tests/e2e/performance/test_small_file_workload.py

This hampers the ability to develop further workload tests. There is no known workaround. User test code should not be able to crash a Ceph daemon.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
4-ish

I have reproduced this multiple times.

Steps to Reproduce:
1. Using the attached Python file, run the run-ci command above.

Actual results:
pytest fails with "Time out waiting for benchmark to complete", and many Ceph-unhealthy log.info lines appear.

Expected results:
pytest passes.

Additional info:
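For triage context, a minimal sketch (not part of the attached test) of how the "recently crashed" warning can be queried around a workload run. It assumes the ceph CLI is reachable, e.g. from the rook-ceph toolbox pod, and the helper names here are hypothetical, not ocs-ci API:

import json
import subprocess

def cluster_health():
    """Return the overall Ceph health status, e.g. HEALTH_OK or HEALTH_WARN."""
    out = subprocess.check_output(["ceph", "health", "--format", "json"])
    return json.loads(out)["status"]

def new_crashes():
    """Return crash reports that have not yet been archived."""
    out = subprocess.check_output(["ceph", "crash", "ls-new", "--format", "json"])
    return json.loads(out)

if __name__ == "__main__":
    print("health:", cluster_health())
    for crash in new_crashes():
        print("crash:", crash["crash_id"], crash.get("entity_name", "?"))

Running this before and after the benchmark would show whether the crash happens during the workload itself.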
Created attachment 1735081 [details]
pytest file

Created attachment 1735082 [details]
pytest file
A copy of the must-gather information is also available at: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/test_ceph_pod_respin_in_scaled_cluster_dir/must-gather.tar.gz
Warren, I assume it's the MDS that's crashing. Can you specify which daemon is crashing?
I have collected all of the Ceph logs I found on this cluster and copied them to: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/ceph-crash-dumps-wusui.tar

@pdonnell
This is the mon crashing; the backtrace is:

    "(()+0x12dd0) [0x7f08a488cdd0]",
    "(PerfCounters::inc(int, unsigned long)+0x7) [0x7f08a7d995c7]",
    "(Paxos::restart()+0x2b9) [0x558b716ffc39]",
    "(Monitor::_reset()+0x268) [0x558b71602e68]",
    "(Monitor::join_election()+0xdb) [0x558b71602ffb]",
    "(Elector::bump_epoch(unsigned int)+0x158) [0x558b7169c018]",
    "(Elector::victory()+0x2ba) [0x558b7169c48a]",
    "(Elector::handle_ack(boost::intrusive_ptr<MonOpRequest>)+0x524) [0x558b7169e114]",
    "(Elector::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xbb7) [0x558b7169f3b7]",
    "(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x11c6) [0x558b7161e266]",
    "(Monitor::_ms_dispatch(Message*)+0xa23) [0x558b7161efe3]",
    "(Monitor::ms_dispatch(Message*)+0x2a) [0x558b7165078a]",
    "(Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x2a) [0x558b7164cf2a]",
    "(DispatchQueue::entry()+0x134a) [0x7f08a7dbee3a]",
    "(DispatchQueue::DispatchThread::entry()+0x11) [0x7f08a7e73821]",
    "(()+0x82de) [0x7f08a48822de]",
    "(clone()+0x43) [0x7f08a35b9e83]"
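For reference, a minimal sketch of how backtraces like this can be pulled from the cluster programmatically (assumes the ceph CLI is reachable and the mgr crash module is enabled; the function name is hypothetical):

import json
import subprocess

def dump_crash_backtraces():
    """Print the backtrace of every crash report known to the crash module."""
    crashes = json.loads(
        subprocess.check_output(["ceph", "crash", "ls", "--format", "json"]))
    for crash in crashes:
        cid = crash["crash_id"]
        info = json.loads(
            subprocess.check_output(
                ["ceph", "crash", "info", cid, "--format", "json"]))
        # entity_name identifies the crashed daemon, e.g. "mon.a" or "mds.x"
        print(info.get("entity_name", "?"), cid)
        for frame in info.get("backtrace", []):
            print("    " + frame)

if __name__ == "__main__":
    dump_crash_backtraces()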
*** This bug has been marked as a duplicate of bug 1896040 ***
Removing my needinfo on this closed bug.