Bug 1869411 - capture full crash information from ceph
Summary: capture full crash information from ceph
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: must-gather
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: OCS 4.6.0
Assignee: Pulkit Kundra
QA Contact: Svetlana Avetisyan
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-08-17 21:59 UTC by Josh Durgin
Modified: 2020-12-17 06:24 UTC (History)
11 users

Fixed In Version: 4.6.0-137.ci
Doc Type: Enhancement
Doc Text:
.Red Hat Ceph Storage crash collection

The Ceph crash collection feature has been added to OpenShift Container Storage 4.6. This feature collects the backtrace and core dump needed to properly debug a Ceph crash. It collects the core dump from every node from the `/var/lib/rook/<namespace>/crash/` folder, and provides the output of the following Ceph commands:

* `ceph crash ls`
* `ceph crash info <id>`
Clone Of:
Environment:
Last Closed: 2020-12-17 06:23:47 UTC
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift ocs-operator pull 818 0 None closed must-gather: add more info for ceph crash 2021-01-12 15:03:25 UTC
Red Hat Product Errata RHSA-2020:5605 0 None None None 2020-12-17 06:24:01 UTC

Description Josh Durgin 2020-08-17 21:59:45 UTC
Ceph captures the recent log and coredump with each crash, in the /var/lib/ceph/crash directory.

AIUI within OCS, these go to the crash collector pod.

must-gather should scrape these extra log and core files, so we can debug crashes.

Often the backtrace alone is not enough to determine the cause, and the log and/or coredump are needed as well.

Comment 2 Mudit Agarwal 2020-08-18 12:47:01 UTC
Doesn't look like a 4.5 candidate to me, moving it to 4.6. Please retarget if required.

Comment 5 Sébastien Han 2020-09-18 07:39:11 UTC
Josh,

The crashes are already collected by the ceph-crash script, it runs as a deployment on every host where ceph daemons are running.
So whenever a daemon crashes, ceph-crash picks up the dump and stores it in the mgr.

So I suppose that if must-gather runs the "ceph crash info" for each "ceph crash ls" that should be enough, right?
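The loop proposed here can be sketched as follows. This is a minimal sketch only: the `ceph` shell function below is a stub that emits sample output so the parsing is self-contained; in a real must-gather the actual `ceph` CLI would be invoked (e.g. via the toolbox pod), and the crash ID shown is made up.

```shell
#!/bin/sh
# Stub standing in for the real ceph CLI, with sample output
# so the ID-extraction logic can be demonstrated end to end.
ceph() {
  case "$1 $2" in
    "crash ls")
      printf 'ID                                       ENTITY  NEW\n'
      printf '2020-09-18T07:39:11.000000Z_0000-aaaa    osd.3   *\n'
      ;;
    "crash info")
      printf '{"crash_id": "%s"}\n' "$3"
      ;;
  esac
}

# Skip the header line of `ceph crash ls`, take the first column
# (the crash ID), and dump detailed info for each crash.
ceph crash ls | awk 'NR > 1 {print $1}' | while read -r id; do
  ceph crash info "$id"
done
```

As comment 6 points out, this gathers only the backtrace metadata held by the mgr, not the log or core dump.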

Comment 6 Josh Durgin 2020-09-18 13:30:47 UTC
(In reply to leseb from comment #5)
> Josh,
> 
> The crashes are already collected by the ceph-crash script, it runs as a
> deployment on every host where ceph daemons are running.
> So whenever a daemon crashes the ceph-crash picks the dump and stores it in
> the mgr.
> 
> So I suppose that if must-gather runs the "ceph crash info" for each "ceph
> crash ls" that should be enough, right?

That gets the backtrace, but not the log or core dump. ceph-crash stores these in /var/lib/ceph/crash on bare metal. Where does this go with rook?

We need a way to collect these logs and coredumps from customer and QE environments. Many issues are impossible to debug (or identify) without them.

Comment 7 Sébastien Han 2020-09-18 13:35:10 UTC
Hmm, I thought we had everything in the mgr. Why don't we put the log/core dumps there too? Too big?

To answer your question, with Rook, this goes on the host filesystem under /var/lib/rook/<namespace>/crash/

Comment 8 Josh Durgin 2020-09-18 13:59:34 UTC
Yes, coredumps can be multiple GBs, too big for the mgr to store (all the mgr state is stored in the monitor).

Comment 9 Sébastien Han 2020-09-18 14:23:11 UTC
Understood, in that case, the coredumps are available on all the hosts under /var/lib/rook/<namespace>/crash/

Is it enough?

Comment 10 Josh Durgin 2020-09-18 14:53:47 UTC
Grabbing everything from /var/lib/rook/<namespace>/crash/ would work. Pulkit, does that sound good to you?

Comment 12 Sébastien Han 2020-09-21 07:27:58 UTC
Just collect everything from "/var/lib/rook/<namespace>/crash/", it's simple.
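The "collect everything" approach can be sketched as archiving the whole crash directory while preserving its layout. In this sketch, `CRASH_DIR` and `DEST` are hypothetical temporary stand-ins: in OCS must-gather the source would be `/var/lib/rook/<namespace>/crash/` on each node, reached through the crash-collector pod or a node debug pod.

```shell
#!/bin/sh
CRASH_DIR=$(mktemp -d)   # stand-in for /var/lib/rook/<namespace>/crash/
DEST=$(mktemp -d)        # stand-in for the must-gather output directory

# Simulate one crash entry: metadata and the recent log
# (a real entry would also hold a multi-GB core dump).
mkdir -p "$CRASH_DIR/2020-09-21T07:27:58Z_demo"
: > "$CRASH_DIR/2020-09-21T07:27:58Z_demo/meta"
: > "$CRASH_DIR/2020-09-21T07:27:58Z_demo/log"

# Grab everything under the crash directory into one archive,
# preserving the per-crash subdirectory layout.
tar -C "$CRASH_DIR" -cf "$DEST/crash.tar" .
tar -tf "$DEST/crash.tar" | sort
```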

Comment 13 krishnaram Karthick 2020-10-07 09:59:08 UTC
Proposing as a blocker. 
We are seeing crashes (1885136, 1869372) occasionally, and they are not being fixed due to the lack of proper logs. Having a fix for this is key to understanding and fixing such issues. If such a crash is hit by a customer, we don't have a chance to ask them to reproduce it.

Comment 17 Mudit Agarwal 2020-10-19 12:40:50 UTC
Backport PR: https://github.com/openshift/ocs-operator/pull/830

Comment 18 Svetlana Avetisyan 2020-10-27 05:50:44 UTC
@Josh Durgin can you please leave reproduction steps in the comment section? I am trying to check whether the bug is solved or not.

Comment 19 Josh Durgin 2020-10-28 00:39:10 UTC
(In reply to Svetlana Avetisyan from comment #18)
> @Josh Durgin can you please  leave a reproducible steps in the comment
> section, I am trying to check if bug is solved or not

You can crash a ceph daemon by sending it SIGABRT, e.g. kill -6 on the process.
You should see /host/var/lib/rook/openshift-storage/crash/ on the ceph-crash-collector pod populated with the coredump, log, and crash metadata, and must-gather should be saving all of these files.
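The SIGABRT step above can be sketched as follows. A plain `sleep` stands in for the ceph daemon so the sketch is self-contained; on a real cluster you would signal e.g. an OSD process inside its pod, then check `/host/var/lib/rook/openshift-storage/crash/` on the crash-collector pod.

```shell
#!/bin/sh
# Start a stand-in "daemon" in the background.
sleep 60 &
pid=$!

# Crash it with SIGABRT (signal 6), same as `kill -SIGABRT <pid>`.
kill -6 "$pid"
wait "$pid"
status=$?             # 128 + 6 = 134 indicates death by SIGABRT
echo "exit status: $status"
```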

Comment 21 Svetlana Avetisyan 2020-11-11 02:33:59 UTC
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/savetisy/musgather.tar.gz

You can find the must-gather here.
After executing kill -6 on a ceph daemon and collecting logs, we can find evidence of the ceph crash.

Comment 23 Mudit Agarwal 2020-12-01 13:31:15 UTC
Pulkit, please add the doc text

Comment 26 errata-xmlrpc 2020-12-17 06:23:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605

