Bug 1856430
| Summary: | OSD crashed with abort | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Nag Pavan Chilakam <nchilaka> |
| Component: | RADOS | Assignee: | Neha Ojha <nojha> |
| Status: | CLOSED NOTABUG | QA Contact: | Manohar Murthy <mmurthy> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.1 | CC: | akupczyk, bhubbard, cbodley, ceph-eng-bugs, dzafman, jdurgin, kbader, kchai, mbenjamin, nojha, rzarzyns, sseshasa, sweil |
| Target Milestone: | z2 | ||
| Target Release: | 4.1 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-07-14 21:25:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
[root@rhsqa13 ceph]# ceph health
HEALTH_ERR 1 full osd(s); 2 nearfull osd(s); 5 pool(s) full; 2 scrub errors; Low space hindering backfill (add storage if this doesn't resolve itself): 84 pgs backfill_toofull; Possible data damage: 2 pgs inconsistent; Degraded data redundancy: 548665/2509545 objects degraded (21.863%), 114 pgs degraded, 107 pgs undersized; Full OSDs blocking recovery: 30 pgs recovery_toofull; 1 daemons have recently crashed; too many PGs per OSD (256 > max 250)
[root@rhsqa13 ceph]# ceph -s
cluster:
id: 178e9007-0be2-4aed-9a97-c7400ed595f7
health: HEALTH_ERR
1 full osd(s)
2 nearfull osd(s)
5 pool(s) full
2 scrub errors
Low space hindering backfill (add storage if this doesn't resolve itself): 84 pgs backfill_toofull
Possible data damage: 2 pgs inconsistent
Degraded data redundancy: 548665/2509545 objects degraded (21.863%), 114 pgs degraded, 107 pgs undersized
Full OSDs blocking recovery: 30 pgs recovery_toofull
1 daemons have recently crashed
too many PGs per OSD (256 > max 250)
services:
mon: 3 daemons, quorum rhsqa13,rhsqa14,host2 (age 42h)
mgr: rhsqa14(active, since 3d), standbys: rhsqa13, host2
osd: 4 osds: 3 up (since 42h), 3 in (since 42h); 107 remapped pgs
rgw: 2 daemons active (constantine.rgw0, rhs-client44.rgw0)
task status:
data:
pools: 5 pools, 256 pgs
objects: 836.51k objects, 3.2 TiB
usage: 7.5 TiB used, 723 GiB / 8.2 TiB avail
pgs: 548665/2509545 objects degraded (21.863%)
142 active+clean
82 active+undersized+degraded+remapped+backfill_toofull
23 active+recovery_toofull+undersized+degraded+remapped
7 active+recovery_toofull+degraded
2 active+undersized+degraded+remapped+inconsistent+backfill_toofull
[root@rhsqa13 ceph]# ceph df
RAW STORAGE:
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 8.2 TiB 723 GiB 7.5 TiB 7.5 TiB 91.38
TOTAL 8.2 TiB 723 GiB 7.5 TiB 7.5 TiB 91.38
POOLS:
POOL ID STORED OBJECTS USED %USED MAX AVAIL
userpool1 1 4.3 TiB 836.29k 10 TiB 100.00 0 B
.rgw.root 2 3.1 KiB 6 1.1 MiB 100.00 0 B
default.rgw.control 3 0 B 8 0 B 0 0 B
default.rgw.meta 4 0 B 0 0 B 0 0 B
default.rgw.log 5 4.6 KiB 206 6 MiB 100.00 0 B
[root@rhsqa13 ceph]#
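Two of the figures in the health output above can be reproduced by hand. The sketch below is only a reading aid: it assumes a replica size of 3 (inferred from the `rados df` COPIES/OBJECTS ratio, 2508885 / 836295, not stated anywhere in the report) and assumes the PG-per-OSD warning is computed as total PG replicas divided by the number of "in" OSDs.

```python
# Figures copied from the `ceph -s` output in this report.
total_pgs = 256                 # "5 pools, 256 pgs"
in_osds = 3                     # "4 osds: 3 up, 3 in"
degraded_objects = 548665
total_object_copies = 2509545

# Assumption: replica size 3, inferred from COPIES / OBJECTS in `rados df`.
replica_size = 3

# Each PG is stored replica_size times, spread over the OSDs that are "in".
pgs_per_osd = total_pgs * replica_size / in_osds

# Share of object copies that are currently degraded.
degraded_pct = 100 * degraded_objects / total_object_copies

print(f"PGs per OSD: {pgs_per_osd:.0f}")
print(f"Degraded:    {degraded_pct:.3f}%")
```

Under these assumptions, losing one of the four OSDs is enough to push the per-OSD PG count past the limit of 250 quoted in the warning.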
Note that the IOs went on for 3 days, even though the OSD crashed after about a day or so. I stopped the IOs only now, and the above command output was taken now.

[root@rhs-client44 ceph-ansible]# cat hosts
[mons]
rhsqa13.lab.eng.blr.redhat.com
rhsqa14.lab.eng.blr.redhat.com
host2.lab.eng.blr.redhat.com
[osds]
rhsqa13.lab.eng.blr.redhat.com
rhsqa14.lab.eng.blr.redhat.com
host2.lab.eng.blr.redhat.com
constantine.lab.eng.blr.redhat.com
[mgrs]
rhsqa13.lab.eng.blr.redhat.com
rhsqa14.lab.eng.blr.redhat.com
host2.lab.eng.blr.redhat.com
[grafana-server]
rhs-client44.lab.eng.blr.redhat.com
[rgws]
#constantine.lab.eng.blr.redhat.com
constantine.lab.eng.blr.redhat.com radosgw_interface="enp5s0f0"
rhs-client44.lab.eng.blr.redhat.com radosgw_interface="enp7s0f0"

ceph-logs, /var/lib/ceph copy, sosreports available @ http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/ceph/bug.1856430/

[root@rhs-client44 ~]# rados df
POOL_NAME            USED     OBJECTS  CLONES  COPIES   MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS  RD      WR_OPS  WR       USED COMPR  UNDER COMPR
.rgw.root            1.1 MiB  6        0       18       0                   0        4         9       9 KiB   6       6 KiB    0 B         0 B
default.rgw.control  0 B      8        0       24       0                   0        4         0       0 B     0       0 B      0 B         0 B
default.rgw.log      6 MiB    206      0       618      0                   0        155       62159   60 MiB  41214   32 KiB   0 B         0 B
default.rgw.meta     0 B      0        0       0        0                   0        0         0       0 B     0       0 B      0 B         0 B
userpool1            10 TiB   836295   0       2508885  0                   0        548502    3       3 KiB   838000  3.2 TiB  0 B         0 B

Hello, this error means that the OSD has received an I/O error from the disk, which usually means the disk is failing. That's what this message means: "Unexpected IO error. This may suggest a hardware issue. Please check your kernel log!" This typically means you need to replace the disk.

Based on the logs, it is probable that the disk was bad; however, I don't think a crash is a good way for an OSD to go down.
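The `io_error_code` of -5 in the crash report is a negated errno value, which is why the comment above reads it as a disk I/O error. A minimal sketch of that decoding, using only the Python standard library (the variable names are illustrative and not part of any Ceph tooling):

```python
import errno
import os

# Value of "io_error_code" in the `ceph crash info` output of this report.
io_error_code = -5

# The code is a negated errno, so flip the sign before looking it up.
errno_name = errno.errorcode[-io_error_code]
errno_text = os.strerror(-io_error_code)

print(f"io_error_code {io_error_code} -> {errno_name} ({errno_text})")
```

On Linux this resolves to `EIO`, the generic error a block device returns when a read or write fails, which matches the abort message's advice to check the kernel log for the underlying device error.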
Description of problem:
=======================
I was running some object creates over the weekend, when an OSD crashed on the ceph cluster

[root@rhsqa13 ceph]# ceph crash info 2020-07-11_21:33:18.380234Z_cfd66a23-26c8-43b9-96f0-e3826c614147
{
    "os_version_id": "7.8",
    "utsname_machine": "x86_64",
    "entity_name": "osd.0",
    "io_error": true,
    "backtrace": [
        "(()+0xf630) [0x7f26a5870630]",
        "(gsignal()+0x37) [0x7f26a4664387]",
        "(abort()+0x148) [0x7f26a4665a78]",
        "(ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0x1a5) [0x55f90057d9d0]",
        "(KernelDevice::_aio_thread()+0xebe) [0x55f900bc84ce]",
        "(KernelDevice::AioCompletionThread::entry()+0xd) [0x55f900bcab8d]",
        "(()+0x7ea5) [0x7f26a5868ea5]",
        "(clone()+0x6d) [0x7f26a472c8dd]"
    ],
    "io_error_optype": 8,
    "io_error_length": 4096,
    "assert_line": 534,
    "utsname_release": "3.10.0-1062.12.1.el7.x86_64",
    "io_error_offset": 1441381797888,
    "assert_file": "/builddir/build/BUILD/ceph-14.2.8/src/os/bluestore/KernelDevice.cc",
    "io_error_devname": "dm-4",
    "utsname_sysname": "Linux",
    "os_version": "7.8 (Maipo)",
    "os_id": "rhel",
    "assert_thread_name": "bstore_aio",
    "assert_msg": "/builddir/build/BUILD/ceph-14.2.8/src/os/bluestore/KernelDevice.cc: In function 'void KernelDevice::_aio_thread()' thread 7f2698e22700 time 2020-07-12 03:03:18.375513\n/builddir/build/BUILD/ceph-14.2.8/src/os/bluestore/KernelDevice.cc: 534: ceph_abort_msg(\"Unexpected IO error. This may suggest a hardware issue. Please check your kernel log!\")\n",
    "assert_func": "void KernelDevice::_aio_thread()",
    "ceph_version": "14.2.8-50.el7cp",
    "io_error_path": "/var/lib/ceph/osd/ceph-0/block",
    "os_name": "Red Hat Enterprise Linux Server",
    "timestamp": "2020-07-11 21:33:18.380234Z",
    "process_name": "ceph-osd",
    "utsname_hostname": "rhsqa13.lab.eng.blr.redhat.com",
    "crash_id": "2020-07-11_21:33:18.380234Z_cfd66a23-26c8-43b9-96f0-e3826c614147",
    "assert_condition": "abort",
    "utsname_version": "#1 SMP Thu Dec 12 06:44:49 EST 2019",
    "io_error_code": -5
}

Version-Release number of selected component (if applicable):
===================================================
[root@rhsqa13 ceph]# ceph -v
ceph version 14.2.8-50.el7cp (53387608e81e6aa2487c952a604db06faa5b2cd0) nautilus (stable)
[root@rhsqa13 ceph]# uname -a
Linux rhsqa13.lab.eng.blr.redhat.com 3.10.0-1062.12.1.el7.x86_64 #1 SMP Thu Dec 12 06:44:49 EST 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@rhsqa13 ceph]# cat /etc/red*
Red Hat Enterprise Linux Server release 7.8 (Maipo)
[root@rhsqa13 ceph]#

How reproducible:
===============
hit it once

Steps to Reproduce:
=====================
1. installed ceph on a 4 node cluster (configs file in logs)
2. after install was successful, created a replica pool(userpool1) with pg_num =200
3. To setup RGW, selected 2 of the nodes and prepared the config files
4. post that, from one of the RGW gateways, started to put objects using "rados bench"
5. as I was getting a warning to set pgnum to power of 2, changed it to 256
6. From one of the rgwgw, start IO as below
>rados bench 200000 write -p userpool1 --no-cleanup

Actual results:
==============
osd.0 crashed after a day or so.

NOTE: Given that I am new to ceph and trying it out, please let me know if you need any logs, details, etc

Additional info:
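As a reading aid for the crash report above, the sketch below pulls out the `io_error_*` fields and converts the failing offset into more familiar units. The JSON is a trimmed copy of the report; the summary format and variable names are purely illustrative and not part of any Ceph tool.

```python
import json

# Trimmed copy of the `ceph crash info` output from this report.
crash_json = """
{
    "entity_name": "osd.0",
    "io_error_devname": "dm-4",
    "io_error_path": "/var/lib/ceph/osd/ceph-0/block",
    "io_error_offset": 1441381797888,
    "io_error_length": 4096,
    "io_error_code": -5
}
"""

crash = json.loads(crash_json)

offset = crash["io_error_offset"]
length = crash["io_error_length"]

# Express the byte offset in TiB and as an index of length-sized blocks.
offset_tib = offset / 2**40
block_index, remainder = divmod(offset, length)

print(f"{crash['entity_name']} on {crash['io_error_devname']} "
      f"({crash['io_error_path']})")
print(f"failed {length}-byte I/O at offset {offset} (~{offset_tib:.2f} TiB)")
print(f"block index {block_index}, aligned: {remainder == 0}")
```

The failing request is a single 4096-byte, block-aligned I/O roughly 1.31 TiB into the device, which is the kind of detail worth matching against errors for `dm-4` in the kernel log.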