Bug 1856430

Summary:	OSD crashed with abort
Product:	[Red Hat Storage] Red Hat Ceph Storage	Reporter:	Nag Pavan Chilakam <nchilaka>
Component:	RADOS	Assignee:	Neha Ojha <nojha>
Status:	CLOSED NOTABUG	QA Contact:	Manohar Murthy <mmurthy>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.1	CC:	akupczyk, bhubbard, cbodley, ceph-eng-bugs, dzafman, jdurgin, kbader, kchai, mbenjamin, nojha, rzarzyns, sseshasa, sweil
Target Milestone:	z2
Target Release:	4.1
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-07-14 21:25:45 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Nag Pavan Chilakam 2020-07-13 15:55:29 UTC

Description of problem:
=======================
I was running object-creation workloads over the weekend when an OSD crashed on the Ceph cluster.

[root@rhsqa13 ceph]# ceph crash info 2020-07-11_21:33:18.380234Z_cfd66a23-26c8-43b9-96f0-e3826c614147
{
    "os_version_id": "7.8", 
    "utsname_machine": "x86_64", 
    "entity_name": "osd.0", 
    "io_error": true, 
    "backtrace": [
        "(()+0xf630) [0x7f26a5870630]", 
        "(gsignal()+0x37) [0x7f26a4664387]", 
        "(abort()+0x148) [0x7f26a4665a78]", 
        "(ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0x1a5) [0x55f90057d9d0]", 
        "(KernelDevice::_aio_thread()+0xebe) [0x55f900bc84ce]", 
        "(KernelDevice::AioCompletionThread::entry()+0xd) [0x55f900bcab8d]", 
        "(()+0x7ea5) [0x7f26a5868ea5]", 
        "(clone()+0x6d) [0x7f26a472c8dd]"
    ], 
    "io_error_optype": 8, 
    "io_error_length": 4096, 
    "assert_line": 534, 
    "utsname_release": "3.10.0-1062.12.1.el7.x86_64", 
    "io_error_offset": 1441381797888, 
    "assert_file": "/builddir/build/BUILD/ceph-14.2.8/src/os/bluestore/KernelDevice.cc", 
    "io_error_devname": "dm-4", 
    "utsname_sysname": "Linux", 
    "os_version": "7.8 (Maipo)", 
    "os_id": "rhel", 
    "assert_thread_name": "bstore_aio", 
    "assert_msg": "/builddir/build/BUILD/ceph-14.2.8/src/os/bluestore/KernelDevice.cc: In function 'void KernelDevice::_aio_thread()' thread 7f2698e22700 time 2020-07-12 03:03:18.375513\n/builddir/build/BUILD/ceph-14.2.8/src/os/bluestore/KernelDevice.cc: 534: ceph_abort_msg(\"Unexpected IO error. This may suggest a hardware issue. Please check your kernel log!\")\n", 
    "assert_func": "void KernelDevice::_aio_thread()", 
    "ceph_version": "14.2.8-50.el7cp", 
    "io_error_path": "/var/lib/ceph/osd/ceph-0/block", 
    "os_name": "Red Hat Enterprise Linux Server", 
    "timestamp": "2020-07-11 21:33:18.380234Z", 
    "process_name": "ceph-osd", 
    "utsname_hostname": "rhsqa13.lab.eng.blr.redhat.com", 
    "crash_id": "2020-07-11_21:33:18.380234Z_cfd66a23-26c8-43b9-96f0-e3826c614147", 
    "assert_condition": "abort", 
    "utsname_version": "#1 SMP Thu Dec 12 06:44:49 EST 2019", 
    "io_error_code": -5
}
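
For readers unfamiliar with the numeric fields in the crash report, the following Python sketch decodes them. It assumes that `io_error_code` is a negated Linux errno value and that `io_error_offset` and `io_error_length` are byte counts; the report itself does not state either. The helper name `describe_io_error` is hypothetical.

```python
import errno
import os

# Values copied from the crash report above.
crash = {
    "io_error_code": -5,
    "io_error_offset": 1441381797888,
    "io_error_length": 4096,
    "io_error_devname": "dm-4",
}


def describe_io_error(info):
    """Summarize the I/O error fields (hypothetical helper)."""
    code = -info["io_error_code"]     # assumption: negated errno
    offset = info["io_error_offset"]  # assumption: bytes
    length = info["io_error_length"]  # assumption: bytes
    return {
        "device": info["io_error_devname"],
        "errno_name": errno.errorcode.get(code, "unknown"),
        "errno_text": os.strerror(code),
        # Index of the failed I/O when the device is viewed as
        # consecutive chunks of `length` bytes.
        "block_index": offset // length,
        "aligned": offset % length == 0,
        "offset_tib": offset / 2**40,
    }


summary = describe_io_error(crash)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Under those assumptions, errno 5 is `EIO`, the generic input/output error, and the failing 4096-byte operation sits at a 4096-aligned offset roughly 1.31 TiB into the device. That is consistent with the "hardware issue" hint in `assert_msg`.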






Version-Release number of selected component (if applicable):
===================================================
[root@rhsqa13 ceph]# ceph -v
ceph version 14.2.8-50.el7cp (53387608e81e6aa2487c952a604db06faa5b2cd0) nautilus (stable)
[root@rhsqa13 ceph]# uname -a
Linux rhsqa13.lab.eng.blr.redhat.com 3.10.0-1062.12.1.el7.x86_64 #1 SMP Thu Dec 12 06:44:49 EST 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@rhsqa13 ceph]# cat /etc/red*
Red Hat Enterprise Linux Server release 7.8 (Maipo)
[root@rhsqa13 ceph]# 



How reproducible:
===============
hit it once


Steps to Reproduce:
=====================
1. Installed Ceph on a 4-node cluster (config files are in the logs).
2. After the install succeeded, created a replicated pool (userpool1) with pg_num = 200.
3. To set up RGW, selected 2 of the nodes and prepared the config files.
4. After that, from one of the RGW gateways, started writing objects using "rados bench".
5. Because I was getting a warning to set pg_num to a power of 2, changed it to 256.
6. From one of the RGW gateways, started I/O as below:
>rados bench 200000 write -p userpool1 --no-cleanup
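
The pg_num change in step 5 (200 to 256) is a move to the next power of two. A minimal Python sketch of that arithmetic follows; the function names are illustrative and are not part of any Ceph tooling.

```python
def is_power_of_two(n):
    """True when n is a positive power of two."""
    return n > 0 and n & (n - 1) == 0


def next_power_of_two(n):
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


print(is_power_of_two(200), next_power_of_two(200))  # False 256
```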



Actual results:
==============
osd.0 crashed after a day or so.



NOTE:
I am new to Ceph and still trying it out, so please let me know if you need any logs or other details.

Additional info:

Comment 1 Nag Pavan Chilakam 2020-07-13 15:56:38 UTC

[root@rhsqa13 ceph]# ceph health
HEALTH_ERR 1 full osd(s); 2 nearfull osd(s); 5 pool(s) full; 2 scrub errors; Low space hindering backfill (add storage if this doesn't resolve itself): 84 pgs backfill_toofull; Possible data damage: 2 pgs inconsistent; Degraded data redundancy: 548665/2509545 objects degraded (21.863%), 114 pgs degraded, 107 pgs undersized; Full OSDs blocking recovery: 30 pgs recovery_toofull; 1 daemons have recently crashed; too many PGs per OSD (256 > max 250)
[root@rhsqa13 ceph]# ceph -s
  cluster:
    id:     178e9007-0be2-4aed-9a97-c7400ed595f7
    health: HEALTH_ERR
            1 full osd(s)
            2 nearfull osd(s)
            5 pool(s) full
            2 scrub errors
            Low space hindering backfill (add storage if this doesn't resolve itself): 84 pgs backfill_toofull
            Possible data damage: 2 pgs inconsistent
            Degraded data redundancy: 548665/2509545 objects degraded (21.863%), 114 pgs degraded, 107 pgs undersized
            Full OSDs blocking recovery: 30 pgs recovery_toofull
            1 daemons have recently crashed
            too many PGs per OSD (256 > max 250)
 
  services:
    mon: 3 daemons, quorum rhsqa13,rhsqa14,host2 (age 42h)
    mgr: rhsqa14(active, since 3d), standbys: rhsqa13, host2
    osd: 4 osds: 3 up (since 42h), 3 in (since 42h); 107 remapped pgs
    rgw: 2 daemons active (constantine.rgw0, rhs-client44.rgw0)
 
  task status:
 
  data:
    pools:   5 pools, 256 pgs
    objects: 836.51k objects, 3.2 TiB
    usage:   7.5 TiB used, 723 GiB / 8.2 TiB avail
    pgs:     548665/2509545 objects degraded (21.863%)
             142 active+clean
             82  active+undersized+degraded+remapped+backfill_toofull
             23  active+recovery_toofull+undersized+degraded+remapped
             7   active+recovery_toofull+degraded
             2   active+undersized+degraded+remapped+inconsistent+backfill_toofull
 
[root@rhsqa13 ceph]# ceph df
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED 
    hdd       8.2 TiB     723 GiB     7.5 TiB      7.5 TiB         91.38 
    TOTAL     8.2 TiB     723 GiB     7.5 TiB      7.5 TiB         91.38 
 
POOLS:
    POOL                    ID     STORED      OBJECTS     USED        %USED      MAX AVAIL 
    userpool1                1     4.3 TiB     836.29k      10 TiB     100.00           0 B 
    .rgw.root                2     3.1 KiB           6     1.1 MiB     100.00           0 B 
    default.rgw.control      3         0 B           8         0 B          0           0 B 
    default.rgw.meta         4         0 B           0         0 B          0           0 B 
    default.rgw.log          5     4.6 KiB         206       6 MiB     100.00           0 B 
[root@rhsqa13 ceph]#
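
The figures in the health output can be cross-checked with a little arithmetic. The Python sketch below assumes every pool uses 3 replicas and that the PGs-per-OSD warning divides PG replicas across the OSDs currently marked "in"; neither assumption is stated in the output above.

```python
# Numbers copied from the `ceph -s` output above.
degraded_copies = 548665
total_copies = 2509545
total_pgs = 256
osds_in = 3

replicas = 3  # assumption: every pool has size 3

# Share of object copies that are degraded.
degraded_pct = 100 * degraded_copies / total_copies

# Logical object count implied by the copy count.
logical_objects, remainder = divmod(total_copies, replicas)

# Assumption: PG replicas are spread over the OSDs that are "in".
pgs_per_osd = total_pgs * replicas / osds_in

print(f"degraded: {degraded_pct:.3f}%")
print(f"logical objects: {logical_objects} (remainder {remainder})")
print(f"PGs per OSD: {pgs_per_osd:.0f}")
```

With these assumptions the results line up with the reported values: 21.863% degraded, about 836.51k objects, and 256 PGs per OSD, which exceeds the limit of 250.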

Comment 2 Nag Pavan Chilakam 2020-07-13 15:57:37 UTC

Note that the I/O ran for 3 days in total, even though the OSD crashed after about a day. I stopped the I/O only now, and the command output above was taken just now.

Comment 3 Nag Pavan Chilakam 2020-07-13 15:58:10 UTC

[root@rhs-client44 ceph-ansible]# cat hosts 
[mons]
rhsqa13.lab.eng.blr.redhat.com
rhsqa14.lab.eng.blr.redhat.com
host2.lab.eng.blr.redhat.com
[osds]
rhsqa13.lab.eng.blr.redhat.com
rhsqa14.lab.eng.blr.redhat.com
host2.lab.eng.blr.redhat.com
constantine.lab.eng.blr.redhat.com
[mgrs]
rhsqa13.lab.eng.blr.redhat.com
rhsqa14.lab.eng.blr.redhat.com
host2.lab.eng.blr.redhat.com
[grafana-server]
rhs-client44.lab.eng.blr.redhat.com
[rgws]
#constantine.lab.eng.blr.redhat.com
constantine.lab.eng.blr.redhat.com radosgw_interface="enp5s0f0"
rhs-client44.lab.eng.blr.redhat.com radosgw_interface="enp7s0f0"

Comment 4 Nag Pavan Chilakam 2020-07-13 16:07:14 UTC

ceph-logs, /var/lib/ceph copy, sosreports available @ http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/ceph/bug.1856430/

Comment 5 Nag Pavan Chilakam 2020-07-13 17:02:15 UTC

[root@rhs-client44 ~]# rados df
POOL_NAME              USED OBJECTS CLONES  COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS     RD WR_OPS      WR USED COMPR UNDER COMPR 
.rgw.root           1.1 MiB       6      0      18                  0       0        4      9  9 KiB      6   6 KiB        0 B         0 B 
default.rgw.control     0 B       8      0      24                  0       0        4      0    0 B      0     0 B        0 B         0 B 
default.rgw.log       6 MiB     206      0     618                  0       0      155  62159 60 MiB  41214  32 KiB        0 B         0 B 
default.rgw.meta        0 B       0      0       0                  0       0        0      0    0 B      0     0 B        0 B         0 B 
userpool1            10 TiB  836295      0 2508885                  0       0   548502      3  3 KiB 838000 3.2 TiB        0 B         0 B

Comment 7 Josh Durgin 2020-07-14 21:25:45 UTC

Hello, this error means that the OSD received an I/O error from the disk, which usually indicates that the disk is failing. That is what this message is reporting:

"Unexpected IO error. This may suggest a hardware issue. Please check your kernel log!"

This typically means you need to replace the disk.

Comment 8 Nag Pavan Chilakam 2020-07-15 06:05:59 UTC

Based on the logs, it is probable that the disk was bad. However, I don't think a crash is a good way for an OSD to go down.