Bug 1869411
Summary: | capture full crash information from ceph | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Josh Durgin <jdurgin> |
Component: | must-gather | Assignee: | Pulkit Kundra <pkundra> |
Status: | CLOSED ERRATA | QA Contact: | Svetlana Avetisyan <savetisy> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 4.5 | CC: | assingh, bhubbard, ebenahar, edonnell, kramdoss, muagarwa, ocs-bugs, pkundra, sabose, savetisy, shan |
Target Milestone: | --- | Keywords: | Automation |
Target Release: | OCS 4.6.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | 4.6.0-137.ci | Doc Type: | Enhancement |
Doc Text: |
.Red Hat Ceph Storage crash collection
The Ceph crash collection feature has been added to OpenShift Container Storage 4.6. This feature collects the backtrace and core dump needed to debug a Ceph crash properly. It collects the core dumps from every node from the `/var/lib/rook/<namespace>/crash/` folder, and provides the output of the following Ceph commands:
* `ceph crash ls`
* `ceph crash info <id>`
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2020-12-17 06:23:47 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Josh Durgin
2020-08-17 21:59:45 UTC
Doesn't look like a 4.5 candidate to me, so I'm moving it to 4.6. Please retarget if required.

Josh,

The crashes are already collected by the ceph-crash script, which runs as a deployment on every host where Ceph daemons are running. Whenever a daemon crashes, ceph-crash picks up the dump and stores it in the mgr.

So I suppose that if must-gather runs "ceph crash info" for each entry in "ceph crash ls", that should be enough, right?

(In reply to leseb from comment #5)
> Josh,
>
> The crashes are already collected by the ceph-crash script, it runs as a
> deployment on every host where ceph daemons are running.
> So whenever a daemon crashes the ceph-crash picks the dump and stores it in
> the mgr.
>
> So I suppose that if must-gather runs the "ceph crash info" for each "ceph
> crash ls" that should be enough, right?

That gets the backtrace, but not the log or core dump. ceph-crash stores these in /var/lib/ceph/crash on bare metal. Where does this go with Rook? We need a way to collect these logs and core dumps from customer and QE environments. Many issues are impossible to debug (or identify) without them.

Hmm, I thought we had everything in the mgr. Why don't we put the logs and core dumps there too? Are they too big? To answer your question: with Rook, this goes on the host filesystem under /var/lib/rook/<namespace>/crash/

Yes, core dumps can be multiple GBs, too big for the mgr to store (all the mgr state is stored in the monitor).

Understood. In that case, the core dumps are available on all the hosts under /var/lib/rook/<namespace>/crash/. Is that enough?

Grabbing everything from /var/lib/rook/<namespace>/crash/ would work.

Pulkit, does that sound good to you? Just collect everything from "/var/lib/rook/<namespace>/crash/", it's simple.

Proposing as a blocker. We are seeing crashes (1885136, 1869372) occasionally and they are not being fixed due to lack of proper logs. Having a fix for this is key to understanding and fixing such issues. If a customer hits such a crash, we don't have a chance to ask them to reproduce it.

Backport PR: https://github.com/openshift/ocs-operator/pull/830

@Josh Durgin, can you please leave reproduction steps in the comment section? I am trying to check whether the bug is solved.

(In reply to Svetlana Avetisyan from comment #18)
> @Josh Durgin can you please leave a reproducible steps in the comment
> section, I am trying to check if bug is solved or not

You can crash a Ceph daemon by sending it SIGABRT, e.g. kill -6 on the process. You should see /host/var/lib/rook/openshift-storage/crash/ on the ceph-crash-collector pod populated with the core dump, log, and crash metadata, and must-gather should be saving all of these files.

http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/savetisy/musgather.tar.gz
You can find the must-gather here. After executing kill -6 on a Ceph daemon and collecting logs, we can find evidence of the Ceph crash.

Pulkit, please add the doc text.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605
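
For illustration only, the two collection steps discussed in this bug (running "ceph crash info" for each entry of "ceph crash ls", and grabbing everything under /var/lib/rook/<namespace>/crash/) could be sketched roughly as below. This is not the must-gather implementation from the backport PR. The function names, the /host mount prefix default, and the destination layout are hypothetical, and the sketch assumes that `ceph crash ls --format json` prints a JSON list whose entries carry a `crash_id` key.

```python
import json
import shutil
import subprocess
from pathlib import Path
from typing import Callable, Dict, List


def run_ceph(args: List[str]) -> str:
    """Run a ceph CLI command and return its stdout (needs a working ceph client)."""
    result = subprocess.run(
        ["ceph", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def collect_crash_reports(run: Callable[[List[str]], str] = run_ceph) -> Dict[str, str]:
    """Return {crash_id: raw `ceph crash info <id>` output} for every listed crash."""
    # Assumption: the JSON listing is a list of objects with a "crash_id" key.
    listing = json.loads(run(["crash", "ls", "--format", "json"]))
    return {
        entry["crash_id"]: run(["crash", "info", entry["crash_id"]])
        for entry in listing
    }


def copy_crash_dir(namespace: str, dest: Path, host_root: Path = Path("/host")) -> Path:
    """Copy <host_root>/var/lib/rook/<namespace>/crash/ into dest/crash."""
    # Hypothetical layout: host_root is wherever the node filesystem is mounted.
    src = host_root / "var" / "lib" / "rook" / namespace / "crash"
    target = dest / "crash"
    # dirs_exist_ok requires Python 3.8 or newer.
    shutil.copytree(src, target, dirs_exist_ok=True)
    return target
```

The `run` parameter exists so the listing logic can be exercised without a cluster. On a real node, `copy_crash_dir("openshift-storage", Path("/tmp/out"))` would copy the directory mentioned in the reproduction steps, and because core dumps can be multiple GBs the copy may need considerable free space.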