Bug 1869411

Summary:	capture full crash information from ceph
Product:	[Red Hat Storage] Red Hat OpenShift Container Storage	Reporter:	Josh Durgin <jdurgin>
Component:	must-gather	Assignee:	Pulkit Kundra <pkundra>
Status:	CLOSED ERRATA	QA Contact:	Svetlana Avetisyan <savetisy>
Severity:	high	Docs Contact:
Priority:	medium
Version:	4.5	CC:	assingh, bhubbard, ebenahar, edonnell, kramdoss, muagarwa, ocs-bugs, pkundra, sabose, savetisy, shan
Target Milestone:	---	Keywords:	Automation
Target Release:	OCS 4.6.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	4.6.0-137.ci	Doc Type:	Enhancement
Doc Text:	.Red Hat Ceph Storage crash collection The Ceph crash collection feature has been added to OpenShift Container Storage 4.6. This feature collects backtrace and core dump for properly debugging a Ceph crash. It collects the core dump from every node from the `/var/lib/rook/<namespace>/crash/` folder, and provides outputs of following Ceph commands: * `ceph crash ls` * `ceph crash info <id>`	Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-12-17 06:23:47 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Josh Durgin 2020-08-17 21:59:45 UTC

Ceph captures the recent log and coredump with each crash, in the /var/lib/ceph/crash directory.

AIUI within OCS, these go to the crash collector pod.

must-gather should scrape these extra log and core files, so we can debug crashes.

Often the backtrace alone is not enough to determine the cause, and the log and/or coredump are needed as well.

Comment 2 Mudit Agarwal 2020-08-18 12:47:01 UTC

Doesn't look like a 4.5 candidate to me, moving it to 4.6. Please retarget if required.

Comment 5 Sébastien Han 2020-09-18 07:39:11 UTC

Josh,

The crashes are already collected by the ceph-crash script, it runs as a deployment on every host where ceph daemons are running.
So whenever a daemon crashes the ceph-crash picks the dump and stores it in the mgr.

So I suppose that if must-gather runs the "ceph crash info" for each "ceph crash ls" that should be enough, right?
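
The "ceph crash info" for each "ceph crash ls" loop suggested above could be sketched roughly as follows. This is only an illustration, not the must-gather implementation: the helper names run_ceph and collect_crash_info are hypothetical, and it assumes that "ceph crash ls --format json" returns a JSON list whose entries carry a "crash_id" field.

```python
import json
import subprocess
from typing import Callable, Dict, List


def run_ceph(args: List[str]) -> str:
    """Run a ceph CLI command and return its stdout (needs a reachable cluster)."""
    return subprocess.run(
        ["ceph", *args], check=True, capture_output=True, text=True
    ).stdout


def collect_crash_info(run: Callable[[List[str]], str] = run_ceph) -> Dict[str, str]:
    """Return {crash_id: output of `ceph crash info <id>`} for every listed crash."""
    # Assumption: `--format json` makes `ceph crash ls` emit a JSON list of
    # objects that each carry a "crash_id" field.
    listing = json.loads(run(["crash", "ls", "--format", "json"]) or "[]")
    return {
        entry["crash_id"]: run(["crash", "info", entry["crash_id"]])
        for entry in listing
    }
```

The command runner is passed in as a parameter so the loop can be exercised without a cluster. As the following comments point out, this only yields the backtrace and metadata, not the log or core dump.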

Comment 6 Josh Durgin 2020-09-18 13:30:47 UTC

(In reply to leseb from comment #5)
> Josh,
> 
> The crashes are already collected by the ceph-crash script, it runs as a
> deployment on every host where ceph daemons are running.
> So whenever a daemon crashes the ceph-crash picks the dump and stores it in
> the mgr.
> 
> So I suppose that if must-gather runs the "ceph crash info" for each "ceph
> crash ls" that should be enough, right?

That gets the backtrace, but not the log or core dump. ceph-crash stores these in /var/lib/ceph/crash on baremetal. Where does this go with rook?

We need a way to collect these logs and coredumps from customer and QE environments. Many issues are impossible to debug (or identify) without them.

Comment 7 Sébastien Han 2020-09-18 13:35:10 UTC

Hmm, I thought we had everything in the mgr. Why don't we put the logs/core dumps there too? Are they too big?

To answer your question, with Rook, this goes on the host filesystem under /var/lib/rook/<namespace>/crash/

Comment 8 Josh Durgin 2020-09-18 13:59:34 UTC

Yes, coredumps can be multiple GBs, too big for the mgr to store (all the mgr state is stored in the monitor).

Comment 9 Sébastien Han 2020-09-18 14:23:11 UTC

Understood, in that case, the coredumps are available on all the hosts under /var/lib/rook/<namespace>/crash/

Is it enough?

Comment 10 Josh Durgin 2020-09-18 14:53:47 UTC

Grabbing everything from /var/lib/rook/<namespace>/crash/ would work. Pulkit, does that sound good to you?

Comment 12 Sébastien Han 2020-09-21 07:27:58 UTC

Just collect everything from "/var/lib/rook/<namespace>/crash/", it's simple.
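
A minimal sketch of "collect everything" on one node, assuming the host filesystem is readable at the given root. The function name collect_crash_dir and the destination layout are hypothetical, and the actual must-gather change may work differently.

```python
import shutil
from pathlib import Path


def collect_crash_dir(
    namespace: str, dest: Path, rook_root: Path = Path("/var/lib/rook")
) -> Path:
    """Copy <rook_root>/<namespace>/crash/ into dest/crash and return that path."""
    src = rook_root / namespace / "crash"
    target = dest / "crash"
    if not src.is_dir():
        # No crash directory on this node: leave an empty placeholder.
        target.mkdir(parents=True, exist_ok=True)
        return target
    # Copies core dumps, logs and crash metadata recursively.
    # dirs_exist_ok requires Python 3.8 or newer.
    shutil.copytree(src, target, dirs_exist_ok=True)
    return target
```

For example, collect_crash_dir("openshift-storage", Path("/tmp/must-gather")) would copy /var/lib/rook/openshift-storage/crash/ to /tmp/must-gather/crash. Since core dumps can be multiple GBs, the destination needs enough free space.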

Comment 13 krishnaram Karthick 2020-10-07 09:59:08 UTC

Proposing as a blocker. 
We are occasionally seeing crashes (1885136, 1869372), and they are not being fixed due to the lack of proper logs. Having a fix for this is key to understanding and fixing such issues. If a customer hits such a crash, we don't get a chance to ask them to reproduce it.

Comment 17 Mudit Agarwal 2020-10-19 12:40:50 UTC

Backport PR: https://github.com/openshift/ocs-operator/pull/830

Comment 18 Svetlana Avetisyan 2020-10-27 05:50:44 UTC

@Josh Durgin, can you please leave reproduction steps in the comment section? I am trying to check whether the bug is solved.

Comment 19 Josh Durgin 2020-10-28 00:39:10 UTC

(In reply to Svetlana Avetisyan from comment #18)
> @Josh Durgin, can you please leave reproduction steps in the comment
> section? I am trying to check whether the bug is solved.

You can crash a ceph daemon by sending it SIGABRT, e.g. kill -6 on the process.
You should see /host/var/lib/rook/openshift-storage/crash/ on the ceph-crash-collector pod populated with the coredump, log, and crash metadata, and must-gather should be saving all of these files.
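
For illustration, the signal part of these steps can be sketched as below. The sketch aborts a throwaway child process rather than a real ceph daemon; finding the daemon's PID inside its pod is not shown, and the helper names abort_process and demo are hypothetical.

```python
import os
import signal
import subprocess
import sys


def abort_process(pid: int) -> None:
    """Send SIGABRT (signal 6) to pid, the same as `kill -6 <pid>`."""
    os.kill(pid, signal.SIGABRT)


def demo() -> int:
    """Start a throwaway child process, abort it, and return its exit status."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    abort_process(child.pid)
    # On POSIX, a child killed by a signal reports the negated signal number.
    return child.wait()
```

On a POSIX system, demo() should return the negated value of signal.SIGABRT, which confirms the process was terminated by the abort signal rather than exiting normally.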

Comment 21 Svetlana Avetisyan 2020-11-11 02:33:59 UTC

http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/savetisy/musgather.tar.gz

You can find the must-gather here.
After executing kill -6 on a ceph daemon and collecting logs, we can find evidence of the ceph crash.

Comment 23 Mudit Agarwal 2020-12-01 13:31:15 UTC

Pulkit, please add the doc text

Comment 26 errata-xmlrpc 2020-12-17 06:23:47 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605