Ceph captures the recent log and a coredump with each crash, in the /var/lib/ceph/crash directory. As I understand it, within OCS these go to the crash collector pod. must-gather should scrape these extra log and core files so we can debug crashes. Often the backtrace alone is not enough to determine the cause, and the log and/or coredump are needed as well.
Doesn't look like a 4.5 candidate to me, moving it to 4.6. Please retarget if required.
Josh, the crashes are already collected by the ceph-crash script, which runs as a deployment on every host where Ceph daemons are running. Whenever a daemon crashes, ceph-crash picks up the dump and stores it in the mgr. So I suppose that if must-gather runs "ceph crash info" for each entry in "ceph crash ls", that should be enough, right?
(In reply to leseb from comment #5)
> So I suppose that if must-gather runs the "ceph crash info" for each "ceph
> crash ls" that should be enough, right?

That gets the backtrace, but not the log or core dump. ceph-crash stores these in /var/lib/ceph/crash on bare metal. Where does this go with Rook? We need a way to collect these logs and coredumps from customer and QE environments. Many issues are impossible to debug (or even identify) without them.
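As a rough sketch of the suggestion above, must-gather could iterate over the output of "ceph crash ls" and run "ceph crash info" for each ID. The function name and the awk column parsing below are illustrative assumptions; the exact "ceph crash ls" column layout may differ between Ceph releases.

```shell
# Sketch: print `ceph crash info` for every crash ID reported by
# `ceph crash ls`. Assumes a working `ceph` CLI; NR>1 skips the
# header row and $1 is assumed to be the crash ID column.
dump_all_crash_info() {
    ceph crash ls | awk 'NR>1 {print $1}' | while read -r id; do
        ceph crash info "$id"
    done
}
```

As the thread notes, this only captures the crash metadata and backtrace, not the log files or coredumps themselves.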
Hmm, I thought we had everything in the mgr; why don't we put the logs/coredumps there too? Too big? To answer your question: with Rook, these go on the host filesystem under /var/lib/rook/<namespace>/crash/.
Yes, coredumps can be multiple GBs, too big for the mgr to store (all the mgr state is stored in the monitor).
Understood. In that case, the coredumps are available on all hosts under /var/lib/rook/<namespace>/crash/. Is that enough?
Grabbing everything from /var/lib/rook/<namespace>/crash/ would work. Pulkit, does that sound good to you?
Just collect everything from "/var/lib/rook/<namespace>/crash/"; it's simple.
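For illustration, collecting that directory from each node could look something like the sketch below. The function name, the pod label selector, and the output-file naming are assumptions for this example, not the actual must-gather implementation.

```shell
# Hypothetical sketch: archive /var/lib/rook/<namespace>/crash from
# every crash-collector pod in the given namespace. The label
# selector `app=rook-ceph-crashcollector` is an assumption.
collect_crash_dirs() {
    ns=$1
    for pod in $(oc -n "$ns" get pods -l app=rook-ceph-crashcollector -o name); do
        # Stream a tarball of the crash directory out of the pod.
        oc -n "$ns" exec "$pod" -- \
            tar czf - "/var/lib/rook/$ns/crash" > "$(basename "$pod")-crash.tgz"
    done
}
```

Usage would be e.g. `collect_crash_dirs openshift-storage`, producing one tarball per crash-collector pod.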
Proposing as a blocker. We occasionally see crashes (1885136, 1869372) that are not being fixed due to lack of proper logs. Having a fix for this is key to understanding and fixing such issues. If such a crash is hit at a customer site, we don't have a chance to ask them to reproduce it.
Backport PR: https://github.com/openshift/ocs-operator/pull/830
@Josh Durgin, can you please leave reproduction steps in the comments? I am trying to check whether the bug is solved.
(In reply to Svetlana Avetisyan from comment #18)
> @Josh Durgin can you please leave a reproducible steps in the comment
> section, I am trying to check if bug is solved or not

You can crash a Ceph daemon by sending it SIGABRT, e.g. kill -6 on the process. You should then see /host/var/lib/rook/openshift-storage/crash/ on the ceph-crash-collector pod populated with the coredump, log, and crash metadata, and must-gather should be saving all of these files.
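As a self-contained illustration of the kill -6 step (using a sleep process as a stand-in for a Ceph daemon):

```shell
# Abort a background process with SIGABRT (signal 6), the same signal
# used to crash a ceph daemon for testing. A process killed by a
# signal exits with status 128 + signal number, so SIGABRT gives 134.
sleep 30 &
pid=$!
kill -6 "$pid"
wait "$pid"
status=$?
echo "exit status: $status"
```

On a real cluster you would target the ceph daemon's PID instead, then check the crash directory on the crash-collector pod.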
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/savetisy/musgather.tar.gz You can find the must-gather output here. After executing kill -6 on a Ceph daemon and collecting logs, we can find evidence of the Ceph crash.
Pulkit, please add the doc text
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5605