Bug 2098118

| Field | Value |
|---|---|
| Summary | [Tracker for Ceph BZ #2099463] Ceph OSD crashed while running FIO |
| Product | [Red Hat Storage] Red Hat OpenShift Data Foundation |
| Component | ceph |
| Ceph sub component | RADOS |
| Version | 4.10 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED WONTFIX |
| Severity | high |
| Priority | high |
| Reporter | Anant Malhotra <anamalho> |
| Assignee | Prashant Dhange <pdhange> |
| QA Contact | avdhoot <asagare> |
| CC | amagrawa, asagare, bniver, hnallurv, kmanohar, kramdoss, muagarwa, nojha, odf-bz-bot, pdhange, pdhiran, pnataraj, prsurve, tdesala |
| Keywords | Reopened, Tracking |
| Flags | pdhange: needinfo- |
| Type | Bug |
| Last Closed | 2023-08-14 07:59:54 UTC |
| Clones | 2099463 (view as bug list) |
| Bug Depends On | 2176845, 2185784 |
| Bug Blocks | 2099463 |
Description
Anant Malhotra, 2022-06-17 10:21:09 UTC
Below are the coredump-related outputs for every ODF node:

```
[anamalho@anamalho ~]$ oc debug node/compute-3
Starting pod/compute-3-debug ...
To use host binaries, run chroot /host
Pod IP: 10.1.161.25
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ls -ltr /var/lib/systemd/coredump/
total 0
sh-4.4# coredumpctl list
TIME                         PID  UID  GID SIG COREFILE EXE
Tue 2022-06-07 18:36:55 UTC 7522  167  167   6 missing  /usr/bin/ceph-osd
sh-4.4# exit
exit
sh-4.4# exit
exit
Removing debug pod ...

[anamalho@anamalho ~]$ oc debug node/compute-4
Starting pod/compute-4-debug ...
To use host binaries, run chroot /host
Pod IP: 10.1.160.163
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ls -ltr /var/lib/systemd/coredump/
total 0
sh-4.4# exit
exit
sh-4.4# exit
exit
Removing debug pod ...

[anamalho@anamalho ~]$ oc debug node/compute-5
Starting pod/compute-5-debug ...
To use host binaries, run chroot /host
Pod IP: 10.1.161.45
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ls -ltr /var/lib/systemd/coredump/
total 0
sh-4.4# exit
exit
sh-4.4# exit
exit
Removing debug pod ...
```

It looks like the core dumps are missing or were not collected.

Okay. Can you try to generate a coredump for the OSD daemon using the steps below and see if a coredump is generated?
1. rsh into one of the OSD pods:
# oc rsh <osd-pod>
2. Get the ceph-osd PID and send a SIGSEGV signal to the ceph-osd daemon:
# ps -aef|grep ceph-osd
# kill -11 <pidof-ceph-osd>
3. Log in to the ODF node hosting the OSD for which we generated the coredump (by sending SIGSEGV) and verify that a coredump exists for the OSD daemon:
# oc debug node/<odf-node>
sh-4.4# chroot /host
sh-4.4# ls -ltr /var/lib/systemd/coredump/
If no coredump for the OSD is generated in the /var/lib/systemd/coredump/ directory, open a BZ for it. Also verify that the OSD pod is sharing the host PID namespace ("oc get pod <osd> -o yaml" should show "hostPID: true" in the pod spec).
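As a rough illustration of these steps, a minimal shell sketch follows. The namespace (openshift-storage), the ps/awk pipeline inside the osd container, and the argument names are assumptions for illustration, not part of the original instructions:

```bash
#!/usr/bin/env bash
# Sketch of the coredump check described above. Assumptions (not from the original
# comment): namespace openshift-storage, OSD pod name as $1, hosting node name as $2.
set -euo pipefail

OSD_POD="$1"   # e.g. rook-ceph-osd-0-7bbd558ccd-t56jx
NODE="$2"      # e.g. compute-4, the node hosting that OSD
NS="openshift-storage"

# Step 0: confirm the OSD pod shares the host PID namespace (spec.hostPID).
oc -n "$NS" get pod "$OSD_POD" -o jsonpath='{.spec.hostPID}{"\n"}'

# Steps 1-2: find the ceph-osd PID inside the pod and send SIGSEGV to it.
OSD_PID=$(oc -n "$NS" exec "$OSD_POD" -c osd -- \
  sh -c "ps -aef | grep 'ceph-osd --foreground' | grep -v grep | awk '{print \$2}'" | head -n1)
oc -n "$NS" exec "$OSD_POD" -c osd -- kill -11 "$OSD_PID"

# Step 3: check the hosting node for the collected coredump.
oc debug "node/$NODE" -- chroot /host ls -ltr /var/lib/systemd/coredump/
```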
The coredump is getting generated using the above steps.

```
[anamalho@anamalho ~]$ oc rsh rook-ceph-osd-0-7bbd558ccd-t56jx
Defaulted container "osd" out of: osd, log-collector, blkdevmapper (init), encryption-open (init), blkdevmapper-encryption (init), encrypted-block-status (init), expand-encrypted-bluefs (init), activate (init), expand-bluefs (init), chown-container-data-dir (init)
sh-4.4# ps -aef|grep ceph-osd
ceph         445       0  1 Jun14 ?        02:54:37 ceph-osd --foreground --id 0 --fsid cde709b2-fec0-4331-bc51-f50a0ee11237 --setuser ceph --setgroup ceph --crush-location=root=default host=ocs-deviceset-thin-0-data-0kjltj rack=rack1 --log-to-stderr=true --err-to-stderr=true --mon-cluster-log-to-stderr=true --log-stderr-prefix=debug --default-log-to-file=false --default-mon-cluster-log-to-file=false --ms-learn-addr-from-peer=false
root         465       0  0 Jun14 pts/0    00:00:00 /bin/bash -x -e -m -c CEPH_CLIENT_ID=ceph-osd.0 PERIODICITY=24h LOG_ROTATE_CEPH_FILE=/etc/logrotate.d/ceph if [ -z "$PERIODICITY" ]; then .PERIODICITY=24h fi # edit the logrotate file to only rotate a specific daemon log # otherwise we will logrotate log files without reloading certain daemons # this might happen when multiple daemons run on the same machine sed -i "s|*.log|$CEPH_CLIENT_ID.log|" "$LOG_ROTATE_CEPH_FILE" while true; do .sleep "$PERIODICITY" .echo "starting log rotation" .logrotate --verbose --force "$LOG_ROTATE_CEPH_FILE" .echo "I am going to sleep now, see you in $PERIODICITY" done
root      529859  529583  0 06:25 pts/0    00:00:00 grep ceph-osd
sh-4.4# kill -11 445
sh-4.4# exit
exit

[anamalho@anamalho ~]$ oc debug node/compute-4
Starting pod/compute-4-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.1.160.163
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ls -ltr /var/lib/systemd/coredump/
total 690236
-rw-r-----. 1 root root 706794214 Jun 23 06:29 core.ceph-osd.167.13e9301b7d5d4c57aff8e735d9c3b64c.4040567.1655965792000000.lz4
sh-4.4# exit
exit
sh-4.4# exit
exit
```

As for the output of `oc get pod rook-ceph-osd-0-7bbd558ccd-t56jx -o yaml`, there was no hostPID parameter in the YAML output.

Who's looking at the coredump?

(In reply to Yaniv Kaul from comment #7)
> Who's looking at the coredump?

Hi Yaniv, I am investigating this issue. The recent coredump from comment #6 was generated by a manual trigger. The coredump for the OSD crash reported in the BZ description (during the FIO run) is not available:

```
[anamalho@anamalho ~]$ oc debug node/compute-3
Starting pod/compute-3-debug ...
To use host binaries, run chroot /host
Pod IP: 10.1.161.25
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ls -ltr /var/lib/systemd/coredump/
total 0
sh-4.4# coredumpctl list
TIME                         PID  UID  GID SIG COREFILE EXE
Tue 2022-06-07 18:36:55 UTC 7522  167  167   6 missing  /usr/bin/ceph-osd   <----- corefile is missing on ODF node
```

I will be requesting Anant (the BZ reporter) to reproduce this issue again and get us an OSD coredump for further investigation.

(In reply to Anant Malhotra from comment #6)
> The coredump is getting generated using the above steps.
> [...]

Thanks Anant. It confirms that there is no issue with coredump generation for the OSD daemon in the event of a crash. Can you try to reproduce the issue reported in the BZ description again and get us a coredump if any OSD is still crashing? Also, is this issue consistently reproducible during the FIO run?

> As for the output of `oc get pod rook-ceph-osd-0-7bbd558ccd-t56jx -o yaml`, there was no hostPID parameter in the YAML output.

Okay. I will check it on my end.

Sure Harish. I tried to reproduce this bug on a VM setup with the same test case used by Anant, but the issue is not reproduced. Now I am trying the same scenario on bare metal, i.e. running the I/Os for more than a week and checking whether the issue reproduces.

(In reply to avdhoot from comment #22)
> I tried to reproduce this bug on a VM setup with the same test case used by Anant, but the issue is not reproduced.
> Now I am trying the same scenario on bare metal, i.e. running the I/Os for more than a week and checking whether the issue reproduces.

Thanks Avdhoot. Is it fine if we close this BZ for now? You can re-open it later if you have a successful reproducer for this issue.

Hi Prashant, I have checked the same scenario on bare metal (running the I/Os) but I did not find any crash-related issue on the cluster. So yes, it is fine if we close this BZ for now.

(In reply to avdhoot from comment #24)
> I have checked the same scenario on bare metal (running the I/Os) but I did not find any crash-related issue on the cluster.
> So yes, it is fine if we close this BZ for now.

Thanks Avdhoot. I am closing this BZ now. Feel free to reopen it if you encounter this issue again.

https://bugzilla.redhat.com/show_bug.cgi?id=2099463 moved to 7.0
Hi Prashant,
Observed the same issue in an RDR Longevity cluster. Workloads have been running for the past 2-3 weeks, and we are hitting OSD crashes.
```
ceph crash ls
ID                                                                ENTITY  NEW
2023-07-14T20:04:57.339567Z_cc9c176b-3991-4b92-aec1-dbf1dee98520  osd.1   *
2023-07-14T20:05:04.597812Z_42744795-b51a-4169-82f6-f29639d2e150  osd.0   *
```
```
ceph crash info 2023-07-14T20:04:57.339567Z_cc9c176b-3991-4b92-aec1-dbf1dee98520
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54df0) [0x7fa4b6eeadf0]",
        "/lib64/libc.so.6(+0x9c560) [0x7fa4b6f32560]",
        "pthread_mutex_lock()",
        "(PG::lock(bool) const+0x2b) [0x559763bee91b]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x45d) [0x559763bc5dfd]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x2a3) [0x5597640e7ad3]",
        "ceph-osd(+0xa89074) [0x5597640e8074]",
        "/lib64/libc.so.6(+0x9f802) [0x7fa4b6f35802]",
        "/lib64/libc.so.6(+0x3f450) [0x7fa4b6ed5450]"
    ],
    "ceph_version": "17.2.6-70.0.TEST.bz2119217.el9cp",
    "crash_id": "2023-07-14T20:04:57.339567Z_cc9c176b-3991-4b92-aec1-dbf1dee98520",
    "entity_name": "osd.1",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.2 (Plow)",
    "os_version_id": "9.2",
    "process_name": "ceph-osd",
    "stack_sig": "5c7afd3067dc17bd22ffd5987b09913e4018bf079244d12c2db1c472317a24d8",
    "timestamp": "2023-07-14T20:04:57.339567Z",
    "utsname_hostname": "rook-ceph-osd-1-5f946675bc-hhjwk",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.16.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Thu May 18 19:03:13 EDT 2023"
}
```
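If deeper analysis is needed, one possible way to inspect the matching coredump from the hosting node is sketched below. This is only a sketch: the core file name is a placeholder, and it assumes lz4 and gdb (with matching ceph-osd debuginfo) are available in the debug environment.

```bash
# From a debug shell on the node that hosted the crashed OSD:
#   oc debug node/<node>   followed by   chroot /host
coredumpctl list                      # look for the ceph-osd entry
ls -l /var/lib/systemd/coredump/      # lz4-compressed core files are stored here

# Placeholder name; substitute the actual core.ceph-osd.*.lz4 file.
CORE=/var/lib/systemd/coredump/core.ceph-osd.167.<hash>.<pid>.<timestamp>.lz4
lz4 -d "$CORE" /tmp/ceph-osd.core     # decompress (assumes lz4 is installed)

# Load the core against the ceph-osd binary; matching debuginfo is needed to
# symbolize frames like PG::lock() from the `ceph crash info` backtrace.
gdb /usr/bin/ceph-osd /tmp/ceph-osd.core -batch -ex 'thread apply all bt'
```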
Coredumps for the nodes can be seen in the must-gather:
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/rbd-mirror/c2/must-gather.local.1781396938003127686/quay-io-rhceph-dev-ocs-must-gather-sha256-9ce39944596cbc4966404fb1ceb24be21093a708b1691e78453ab1b9a7a10f7b/ceph/
Complete must-gather logs:
c1 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/rbd-mirror/c1/
c2 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/rbd-mirror/c2/
hub - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/rbd-mirror/hub/
A live setup is available for debugging:
c1 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25313/
c2 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25312/
hub - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25311/
Setting needinfo on Prashant as per comment 71 by Keerthana.

Had an offline discussion with Prashant and opened a new bug to track the issue. Clearing needinfo on me.