Created attachment 1735080 [details]
text of run-ci output

Description of problem:
A modified version of tests/e2e/performance/test_small_file_workload.py produces the following message:

WARNING - Ceph cluster health is not OK. Health: HEALTH_WARN 1 daemons have recently crashed

Attached are:
- the must-gather tar.gz file
- the shell output
- the modified test case that reproduces this (a Python file), warts and all

The cluster-id is 9f44fb85-bc4c-4e7c-b0f4-2c22f2177453

The command that I ran was:
run-ci -m performance --cluster-name wusui-scale-dc13 --cluster-path /home/wusui/ocs-ci tests/e2e/performance/test_small_file_workload.py

This hampers the ability to develop further workload tests. There is no known workaround. User test code should not be able to crash a Ceph daemon.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
4-ish

I have reproduced this multiple times.

Steps to Reproduce:
1. Using the attached Python file, run the run-ci command above.

Actual results:
pytest fails with "Time out waiting for benchmark to complete", and many Ceph-unhealthy log.info lines appear.

Expected results:
pytest passes.

Additional info:
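For triage context, a minimal sketch (not part of the attached test) of how the "recently crashed" warning can be queried around a workload run. It assumes the ceph CLI is reachable, e.g. from the rook-ceph toolbox pod, and the helper names here are hypothetical, not ocs-ci API:

import json
import subprocess

def cluster_health():
    """Return the overall Ceph health status, e.g. HEALTH_OK or HEALTH_WARN."""
    out = subprocess.check_output(["ceph", "health", "--format", "json"])
    return json.loads(out)["status"]

def new_crashes():
    """Return crash reports that have not yet been archived."""
    out = subprocess.check_output(["ceph", "crash", "ls-new", "--format", "json"])
    return json.loads(out)

if __name__ == "__main__":
    print("health:", cluster_health())
    for crash in new_crashes():
        print("crash:", crash["crash_id"], crash.get("entity_name", "?"))

Running this before and after the benchmark would show whether the crash happens during the workload itself.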
Created attachment 1735081 [details]
pytest file

Created attachment 1735082 [details]
pytest file
A copy of the must-gather information is also available at: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/test_ceph_pod_respin_in_scaled_cluster_dir/must-gather.tar.gz
Warren, I assume it's the MDS that's crashing. Can you specify which daemon is crashing?
I have collected all of the Ceph logs I found on this cluster and copied them to: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/ceph-crash-dumps-wusui.tar

@pdonnell
This is the mon crashing; the backtrace is:

    "(()+0x12dd0) [0x7f08a488cdd0]",
    "(PerfCounters::inc(int, unsigned long)+0x7) [0x7f08a7d995c7]",
    "(Paxos::restart()+0x2b9) [0x558b716ffc39]",
    "(Monitor::_reset()+0x268) [0x558b71602e68]",
    "(Monitor::join_election()+0xdb) [0x558b71602ffb]",
    "(Elector::bump_epoch(unsigned int)+0x158) [0x558b7169c018]",
    "(Elector::victory()+0x2ba) [0x558b7169c48a]",
    "(Elector::handle_ack(boost::intrusive_ptr<MonOpRequest>)+0x524) [0x558b7169e114]",
    "(Elector::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xbb7) [0x558b7169f3b7]",
    "(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x11c6) [0x558b7161e266]",
    "(Monitor::_ms_dispatch(Message*)+0xa23) [0x558b7161efe3]",
    "(Monitor::ms_dispatch(Message*)+0x2a) [0x558b7165078a]",
    "(Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x2a) [0x558b7164cf2a]",
    "(DispatchQueue::entry()+0x134a) [0x7f08a7dbee3a]",
    "(DispatchQueue::DispatchThread::entry()+0x11) [0x7f08a7e73821]",
    "(()+0x82de) [0x7f08a48822de]",
    "(clone()+0x43) [0x7f08a35b9e83]"
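For reference, a minimal sketch of how backtraces like this can be pulled from the cluster programmatically (assumes the ceph CLI is reachable and the mgr crash module is enabled; the function name is hypothetical):

import json
import subprocess

def dump_crash_backtraces():
    """Print the backtrace of every crash report known to the crash module."""
    crashes = json.loads(
        subprocess.check_output(["ceph", "crash", "ls", "--format", "json"]))
    for crash in crashes:
        cid = crash["crash_id"]
        info = json.loads(
            subprocess.check_output(
                ["ceph", "crash", "info", cid, "--format", "json"]))
        # entity_name identifies the crashed daemon, e.g. "mon.a" or "mds.x"
        print(info.get("entity_name", "?"), cid)
        for frame in info.get("backtrace", []):
            print("    " + frame)

if __name__ == "__main__":
    dump_crash_backtraces()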
*** This bug has been marked as a duplicate of bug 1896040 ***
Removing my needinfo on this closed bug.