Bug 1869411 - capture full crash information from ceph
Summary: capture full crash information from ceph
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: must-gather
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: OCS 4.6.0
Assignee: Pulkit Kundra
QA Contact: Svetlana Avetisyan
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-08-17 21:59 UTC by Josh Durgin
Modified: 2020-12-17 06:24 UTC (History)
11 users

Fixed In Version: 4.6.0-137.ci
Doc Type: Enhancement
Doc Text:
.Red Hat Ceph Storage crash collection

The Ceph crash collection feature has been added to OpenShift Container Storage 4.6. This feature collects the backtrace and core dump needed to properly debug a Ceph crash. It collects the core dump from every node from the `/var/lib/rook/<namespace>/crash/` folder, and provides the output of the following Ceph commands:

* `ceph crash ls`
* `ceph crash info <id>`
Clone Of:
Environment:
Last Closed: 2020-12-17 06:23:47 UTC
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift ocs-operator pull 818 0 None closed must-gather: add more info for ceph crash 2021-01-12 15:03:25 UTC
Red Hat Product Errata RHSA-2020:5605 0 None None None 2020-12-17 06:24:01 UTC

Description Josh Durgin 2020-08-17 21:59:45 UTC
Ceph captures the recent log and coredump with each crash, in the /var/lib/ceph/crash directory.

AIUI within OCS, these go to the crash collector pod.

must-gather should scrape these extra log and core files, so we can debug crashes.

Often the backtrace alone is not enough to determine the cause, and the log and/or coredump are needed as well.

Comment 2 Mudit Agarwal 2020-08-18 12:47:01 UTC
Doesn't look like a 4.5 candidate to me, moving it to 4.6. Please retarget if required.

Comment 5 Sébastien Han 2020-09-18 07:39:11 UTC
Josh,

The crashes are already collected by the ceph-crash script, it runs as a deployment on every host where ceph daemons are running.
So whenever a daemon crashes, ceph-crash picks up the dump and stores it in the mgr.

So I suppose that if must-gather runs the "ceph crash info" for each "ceph crash ls" that should be enough, right?
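The loop proposed here can be sketched as follows. This is a minimal sketch only: the `ceph` shell function below is a stub that emits sample output so the parsing is self-contained; in a real must-gather the actual `ceph` CLI would be invoked (e.g. via the toolbox pod), and the crash ID shown is made up.

```shell
#!/bin/sh
# Stub standing in for the real ceph CLI, with sample output
# so the ID-extraction logic can be demonstrated end to end.
ceph() {
  case "$1 $2" in
    "crash ls")
      printf 'ID                                       ENTITY  NEW\n'
      printf '2020-09-18T07:39:11.000000Z_0000-aaaa    osd.3   *\n'
      ;;
    "crash info")
      printf '{"crash_id": "%s"}\n' "$3"
      ;;
  esac
}

# Skip the header line of `ceph crash ls`, take the first column
# (the crash ID), and dump detailed info for each crash.
ceph crash ls | awk 'NR > 1 {print $1}' | while read -r id; do
  ceph crash info "$id"
done
```

As comment 6 points out, this gathers only the backtrace metadata held by the mgr, not the log or core dump.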

Comment 6 Josh Durgin 2020-09-18 13:30:47 UTC
(In reply to leseb from comment #5)
> Josh,
> 
> The crashes are already collected by the ceph-crash script, it runs as a
> deployment on every host where ceph daemons are running.
> So whenever a daemon crashes the ceph-crash picks the dump and stores it in
> the mgr.
> 
> So I suppose that if must-gather runs the "ceph crash info" for each "ceph
> crash ls" that should be enough, right?

That gets the backtrace, but not the log or core dump. ceph-crash stores these in /var/lib/ceph/crash on bare metal. Where does this go with rook?

We need a way to collect these logs and coredumps from customer and QE environments. Many issues are impossible to debug (or identify) without them.

Comment 7 Sébastien Han 2020-09-18 13:35:10 UTC
Hmm, I thought we had everything in the mgr. Why don't we put the log/core dumps there too? Too big?

To answer your question, with Rook, this goes on the host filesystem under /var/lib/rook/<namespace>/crash/

Comment 8 Josh Durgin 2020-09-18 13:59:34 UTC
Yes, coredumps can be multiple GBs, too big for the mgr to store (all the mgr state is stored in the monitor).

Comment 9 Sébastien Han 2020-09-18 14:23:11 UTC
Understood, in that case, the coredumps are available on all the hosts under /var/lib/rook/<namespace>/crash/

Is it enough?

Comment 10 Josh Durgin 2020-09-18 14:53:47 UTC
Grabbing everything from /var/lib/rook/<namespace>/crash/ would work. Pulkit, does that sound good to you?

Comment 12 Sébastien Han 2020-09-21 07:27:58 UTC
Just collect everything from "/var/lib/rook/<namespace>/crash/", it's simple.
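The "collect everything" approach can be sketched as archiving the whole crash directory while preserving its layout. In this sketch, `CRASH_DIR` and `DEST` are hypothetical temporary stand-ins: in OCS must-gather the source would be `/var/lib/rook/<namespace>/crash/` on each node, reached through the crash-collector pod or a node debug pod.

```shell
#!/bin/sh
CRASH_DIR=$(mktemp -d)   # stand-in for /var/lib/rook/<namespace>/crash/
DEST=$(mktemp -d)        # stand-in for the must-gather output directory

# Simulate one crash entry: metadata and the recent log
# (a real entry would also hold a multi-GB core dump).
mkdir -p "$CRASH_DIR/2020-09-21T07:27:58Z_demo"
: > "$CRASH_DIR/2020-09-21T07:27:58Z_demo/meta"
: > "$CRASH_DIR/2020-09-21T07:27:58Z_demo/log"

# Grab everything under the crash directory into one archive,
# preserving the per-crash subdirectory layout.
tar -C "$CRASH_DIR" -cf "$DEST/crash.tar" .
tar -tf "$DEST/crash.tar" | sort
```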

Comment 13 krishnaram Karthick 2020-10-07 09:59:08 UTC
Proposing as a blocker. 
We are seeing crashes (1885136, 1869372) occasionally, and they are not being fixed due to the lack of proper logs. Having a fix for this is key to understanding and fixing such issues. If such a crash is hit by a customer, we don't have a chance to ask them to reproduce it.

Comment 17 Mudit Agarwal 2020-10-19 12:40:50 UTC
Backport PR: https://github.com/openshift/ocs-operator/pull/830

Comment 18 Svetlana Avetisyan 2020-10-27 05:50:44 UTC
@Josh Durgin can you please leave reproduction steps in the comment section? I am trying to check whether the bug is solved or not.

Comment 19 Josh Durgin 2020-10-28 00:39:10 UTC
(In reply to Svetlana Avetisyan from comment #18)
> @Josh Durgin can you please  leave a reproducible steps in the comment
> section, I am trying to check if bug is solved or not

You can crash a ceph daemon by sending it SIGABRT, e.g. kill -6 on the process.
You should see /host/var/lib/rook/openshift-storage/crash/ on the ceph-crash-collector pod populated with the coredump, log, and crash metadata, and must-gather should be saving all of these files.
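The SIGABRT step above can be sketched as follows. A plain `sleep` stands in for the ceph daemon so the sketch is self-contained; on a real cluster you would signal e.g. an OSD process inside its pod, then check `/host/var/lib/rook/openshift-storage/crash/` on the crash-collector pod.

```shell
#!/bin/sh
# Start a stand-in "daemon" in the background.
sleep 60 &
pid=$!

# Crash it with SIGABRT (signal 6), same as `kill -SIGABRT <pid>`.
kill -6 "$pid"
wait "$pid"
status=$?             # 128 + 6 = 134 indicates death by SIGABRT
echo "exit status: $status"
```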

Comment 21 Svetlana Avetisyan 2020-11-11 02:33:59 UTC
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/savetisy/musgather.tar.gz

You can find the must-gather here.
After executing kill -6 on a ceph daemon and collecting logs, we can find evidence of the ceph crash.

Comment 23 Mudit Agarwal 2020-12-01 13:31:15 UTC
Pulkit, please add the doc text

Comment 26 errata-xmlrpc 2020-12-17 06:23:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605

