Bug 1848016 - [RADOS]:OSDs are crashing with aborted in thread_name:ms_dispatch
Summary: [RADOS]:OSDs are crashing with aborted in thread_name:ms_dispatch
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 4.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: z2
Target Release: 4.1
Assignee: Neha Ojha
QA Contact: Pawan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-17 14:30 UTC by skanta
Modified: 2020-09-30 17:25 UTC
CC: 12 users

Fixed In Version: ceph-14.2.8-100.el8cp, ceph-14.2.8-100.el7cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-30 17:25:33 UTC
Embargoed:


Attachments
ceph crash info details (76.04 KB, text/plain)
2020-06-17 14:30 UTC, skanta
Log files (328.55 KB, application/gzip)
2020-06-17 14:43 UTC, skanta


Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph pull 35632 0 None closed tools: Add statfs operation to ceph-objecstore-tool 2021-01-15 15:27:27 UTC
Red Hat Product Errata RHBA-2020:4144 0 None None None 2020-09-30 17:25:55 UTC

Description skanta 2020-06-17 14:30:41 UTC
Created attachment 1697829 [details]
ceph crash info details

Description of problem: 
    Observed that OSDs are crashing with an abort (SIGABRT) in thread_name:ms_dispatch.

                   
Version-Release number of selected component (if applicable):
  
ceph version 14.2.8-68.el7cp (c3d1f04bd7aa9ccc99ffd545ff2c5431b2df316e) nautilus (stable)

How reproducible:
Steps to Reproduce:
1. Create a pool:
    sudo ceph osd pool create test_pool514 128

2. Run the rados bench write command with 4 MiB objects for 360 seconds against the pool (a combined sketch of steps 1 and 2 follows this list):

    sudo rados --no-log-to-stderr -b 4194304 -p test_pool58 bench 360 write

3. Noticed that the OSDs crash. Observed roughly 180 GB occupied (see the ceph df output below); even after cleaning up, the OSDs stayed down.
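A combined sketch of steps 1 and 2, kept close to the commands above; the pool name, PG count, and bench duration are simply the values from this report (a single pool name is used here), and can be adjusted:

    # Sketch of the reproduction; pool name, PG count and duration follow the report above.
    POOL=test_pool514
    sudo ceph osd pool create "${POOL}" 128
    # Write 4 MiB (4194304-byte) objects for 360 seconds against the pool.
    sudo rados --no-log-to-stderr -b 4194304 -p "${POOL}" bench 360 write
    # Afterwards, check for new OSD crashes and overall usage.
    sudo ceph crash ls
    sudo ceph df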
  

Actual results:
     OSDs crash 
     

Expected results:

    OSDs should not crash.

Additional info:

[root@ceph-bharat-1592319447950-node13-osd ceph]# ceph crash ls
ID                                                               ENTITY NEW 
2020-06-16_17:08:55.746313Z_bd255010-4ad7-4ff0-81d9-75bb30d52dbc osd.16  *  
2020-06-16_17:09:00.627586Z_5e9e7d9d-c665-4163-a4b5-0548ab47b7d9 osd.16  *  
2020-06-16_17:09:04.894484Z_aa19e655-0780-4d86-a3a3-d64f38b3f8fc osd.16  *  
2020-06-16_17:09:09.290174Z_5184c0b4-6f65-4ec3-9c4e-19630ddbac3a osd.16  *  
2020-06-16_17:09:17.732047Z_0a989691-a0b9-4217-81d8-5296ae53104d osd.15  *  
2020-06-16_17:09:21.039004Z_4b682265-7be4-4dc6-959e-70c423f66bc6 osd.15  *  
2020-06-16_17:09:24.214661Z_285d46bd-aa97-4628-ac4b-693f8f9abf2a osd.15  *  
2020-06-16_17:09:27.630156Z_d04f9f74-ea52-444a-9e6e-f4bb76440e89 osd.15  *  
2020-06-16_17:09:58.889344Z_1cf66286-c37c-4cb0-8aea-693fe7e6436b osd.1   *  
2020-06-16_17:10:02.426099Z_ed19570a-0740-4bc4-96f9-95e25b79dafc osd.1   *  
2020-06-16_17:10:05.867627Z_50887ca6-90cd-4852-856f-cfb7ef44c40a osd.1   *  
2020-06-16_17:10:09.141085Z_ec537be0-3430-46de-9dc5-c6108a61696f osd.1   *  
2020-06-16_17:10:56.610033Z_b950383d-c2b0-4e0f-a7e5-f1b191af0d8c osd.14  *  
2020-06-16_17:10:56.704379Z_711bc216-4c42-45e7-bfec-9b71ecda11da osd.9   *  
2020-06-16_17:11:00.868746Z_f07d8336-905b-43af-9f30-0a6ded42767f osd.9   *  
2020-06-16_17:11:01.881965Z_20232102-e43a-43a3-977f-202262abd5c1 osd.14  *  
2020-06-16_17:11:03.928712Z_dd3f7a8e-72f4-4e5d-bbac-57e6fe6b76cb osd.9   *  
2020-06-16_17:11:05.078065Z_a1575c56-7f12-45bd-81e4-c077650fe6e5 osd.14  *  
2020-06-16_17:11:06.136871Z_629689db-d771-48d8-aed6-21cbf749a0cc osd.9   *  
2020-06-16_17:11:08.344013Z_b6b4e950-411c-424d-adbd-9d70d30ea028 osd.14  *  
2020-06-16_17:22:38.781667Z_b5889638-e75e-48e9-bee9-a2398e717282 osd.6   *  
2020-06-16_17:23:40.076285Z_cf01ab90-8047-4ff6-9e98-e8adbd4da1ff osd.6   *  
2020-06-16_17:23:42.767327Z_71ef7898-a443-40a1-b3ea-e6dcd4eebb18 osd.6   *  
2020-06-16_17:23:44.804608Z_ca932fc7-9943-4182-a9d4-409186f5ad5c osd.6   *  
[root@ceph-bharat-1592319447950-node13-osd ceph]#
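A sketch for collecting the full details of each crash listed above; `ceph crash info <id>` prints the stored metadata and backtrace as JSON, and the awk expression simply skips the header line of `ceph crash ls`:

    # Sketch: dump the stored details of every crash ID reported by 'ceph crash ls'.
    for id in $(sudo ceph crash ls | awk 'NR > 1 {print $1}'); do
        sudo ceph crash info "$id" > "/tmp/crash-${id}.json"
    done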


===============================================================================================================================================


[cephuser@ceph-bharat-1592319447950-node9-clientnfs ~]$ sudo ceph df
RAW STORAGE:
    CLASS     SIZE        AVAIL      USED        RAW USED     %RAW USED 
    hdd       282 GiB     80 GiB     180 GiB      202 GiB         71.60 
    TOTAL     282 GiB     80 GiB     180 GiB      202 GiB         71.60 
 
POOLS:
    POOL                    ID     STORED      OBJECTS     USED        %USED      MAX AVAIL 
    cephfs_data              1         0 B           0         0 B          0           0 B 
    cephfs_metadata          2     2.3 KiB          22     1.6 MiB     100.00           0 B 
    .rgw.root                3     2.4 KiB           6     1.1 MiB     100.00           0 B 
    default.rgw.control      4         0 B           8         0 B          0           0 B 
    default.rgw.meta         5       493 B           2     512 KiB     100.00           0 B 
    default.rgw.log          6     3.8 KiB         206     6.6 MiB     100.00           0 B 
    rbd                      7         0 B           0         0 B          0           0 B 
    test_pool135             8      64 GiB      14.82k     191 GiB     100.00           0 B 
[cephuser@ceph-bharat-1592319447950-node9-clientnfs ~]$
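The output above shows the cluster at 71.60% raw used, with MAX AVAIL reported as 0 B for every pool, i.e. effectively full. A small guard like the following could be run before (or alongside) the bench to stop writing once raw usage crosses a threshold; the 85% threshold and the awk parsing of the TOTAL line are assumptions, not part of the original report:

    # Sketch: refuse to keep writing once raw utilisation crosses a chosen threshold.
    # The awk pattern matches the TOTAL line of the 'ceph df' output shown above.
    THRESHOLD=85
    USED=$(sudo ceph df | awk '/TOTAL/ {print int($NF); exit}')
    if [ "${USED:-0}" -ge "$THRESHOLD" ]; then
        echo "Raw usage ${USED}% >= ${THRESHOLD}%; not safe to keep writing" >&2
        exit 1
    fi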



===============================================================================

[cephuser@ceph-bharat-1592319447950-node9-clientnfs ~]$ sudo ceph versions
{
    "mon": {
        "ceph version 14.2.8-68.el7cp (c3d1f04bd7aa9ccc99ffd545ff2c5431b2df316e) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.8-68.el7cp (c3d1f04bd7aa9ccc99ffd545ff2c5431b2df316e) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.8-68.el7cp (c3d1f04bd7aa9ccc99ffd545ff2c5431b2df316e) nautilus (stable)": 22
    },
    "mds": {
        "ceph version 14.2.8-68.el7cp (c3d1f04bd7aa9ccc99ffd545ff2c5431b2df316e) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.8-68.el7cp (c3d1f04bd7aa9ccc99ffd545ff2c5431b2df316e) nautilus (stable)": 2
    },
    "rgw-nfs": {
        "ceph version 14.2.8-68.el7cp (c3d1f04bd7aa9ccc99ffd545ff2c5431b2df316e) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.8-68.el7cp (c3d1f04bd7aa9ccc99ffd545ff2c5431b2df316e) nautilus (stable)": 31
    }
}
[cephuser@ceph-bharat-1592319447950-node9-clientnfs ~]$ 

=============================================================================

Crash dump snippet:


2020-06-17 08:33:22.638 7f2dbf4f3700 -1 *** Caught signal (Aborted) **
 in thread 7f2dbf4f3700 thread_name:ms_dispatch

 ceph version 14.2.8-68.el7cp (c3d1f04bd7aa9ccc99ffd545ff2c5431b2df316e) nautilus (stable)
 1: (()+0xf630) [0x7f2dd475a630]
 2: (gsignal()+0x37) [0x7f2dd354e387]
 3: (abort()+0x148) [0x7f2dd354fa78]
 4: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0x1a5) [0x5573629b8b60]
 5: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0xc8d) [0x557362f0e3ad]
 6: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x370) [0x557362f230d0]
 7: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x7f) [0x557362b19d1f]
 8: (OSD::handle_osd_map(MOSDMap*)+0x3234) [0x557362aadd44]
 9: (OSD::_dispatch(Message*)+0xa1) [0x557362abc411]
 10: (OSD::ms_dispatch(Message*)+0x69) [0x557362abc779]
 11: (DispatchQueue::entry()+0x129c) [0x55736336a28c]
 12: (DispatchQueue::DispatchThread::entry()+0xd) [0x5573631cd47d]
 13: (()+0x7ea5) [0x7f2dd4752ea5]
 14: (clone()+0x6d) [0x7f2dd36168dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
 -1128> 2020-06-17 08:33:20.123 7f2dd770ca80  5 asok(0x55736d4c6000) register_command assert hook 0x55736d426500
 -1127> 2020-06-17 08:33:20.123 7f2dd770ca80  5 asok(0x55736d4c6000) register_command abort hook 0x55736d426500
 -1126> 2020-06-17 08:33:20.123 7f2dd770ca80  5 asok(0x55736d4c6000) register_command perfcounters_dump hook 0x55736d426500
 -1125> 2020-06-17 08:33:20.123 7f2dd770ca80  5 asok(0x55736d4c6000) register_command 1 hook 0x55736d426500
 -1124> 2020-06-17 08:33:20.123 7f2dd770ca80  5 asok(0x55736d4c6000) register_command perf dump hook 0x55736d426500
 -1123> 2020-06-17 08:33:20.123 7f2dd770ca80  5 asok(0x55736d4c6000) register_command perfcounters_schema hook 0x55736d426500
 -1122> 2020-06-17 08:33:20.123 7f2dd770ca80  5 asok(0x55736d4c6000) register_command perf histogram dump hook 0x55736d426500
 -1121> 2020-06-17 08:33:20.123 7f2dd770ca80  5 asok(0x55736d4c6000) register_command 2 hook 0x55736d426500
 -1120> 2020-06-17 08:33:20.123 7f2dd770ca80  5 asok(0x55736d4c6000) register_command perf schema hook 0x55736d426500
 -1119> 2020-06-17 08:33:20.123 7f2dd770ca80  5 asok(0x55736d4c6000) register_command perf histogram schema hook 0x55736d426500
 -1118> 2020-06-17 08:33:20.123 7f2dd770ca80  5 asok(0x55736d4c6000) register_command perf reset hook 0x55736d426500
 -1117> 2020-06-17 08:33:20.123 7f2dd770ca80  5 asok(0x55736d4c6000) register_command config show hook 0x55736d426500
 -1116> 2020-06-17 08:33:20.123 7f2dd770ca80  5 asok(0x55736d4c6000) register_command config help hook 0x55736d426500
 -1115> 2020-06-17 08:33:20.123 7f2dd770ca80  5 asok(0x55736d4c6000) register_command config set hook 0x55736d426500
 -1114> 2020-06-17 08:33:20.123 7f2dd770ca80  5 asok(0x55736d4c6000) register_command config unset hook 0x55736d426500
 -1113> 2020-06-17 08:33:20.123 7f2dd770ca80  5 asok(0x55736d4c6000) register_command config get hook 0x55736d426500
 -1112> 2020-06-17 08:33:20.123 7f2dd770ca80  5 asok(0x55736d4c6000) register_command config diff hook 0x55736d426500
 -1111> 2020-06-17 08:33:20.123 7f2dd770ca80  5 asok(0x55736d4c6000) register_command config diff get hook 0x55736d426500
 -1110> 2020-06-17 08:33:20.123 7f2dd770ca80  5 asok(0x55736d4c6000) register_command log flush hook 0x55736d426500
 -1109> 2020-06-17 08:33:20.123 7f2dd770ca80  5 asok(0x55736d4c6000) register_command log dump hook 0x55736d426500
 -1108> 2020-06-17 08:33:20.123 7f2dd770ca80  5 asok(0x55736d4c6000) register_command log reopen hook 0x55736d426500
 -1107> 2020-06-17 08:33:20.124 7f2dd770ca80  5 asok(0x55736d4c6000) register_command dump_mempools hook 0x55736d4747c8
 -1106> 2020-06-17 08:33:20.128 7f2dd770ca80 10 monclient: get_monmap_and_config

Log files and ceph crash info details are attached.
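
The upstream change linked to this bug (ceph pull 35632) adds a statfs operation to ceph-objectstore-tool, which allows checking how much space an OSD's object store is using while the daemon is stopped. A rough sketch of running it against one of the crashed OSDs, assuming a build that contains that change and the default data path layout:

    # Sketch: with osd.16 stopped, report object store space usage
    # (requires a ceph-objectstore-tool build that includes the statfs op from PR 35632).
    sudo systemctl stop ceph-osd@16
    sudo ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 --op statfs
    sudo systemctl start ceph-osd@16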

Comment 2 skanta 2020-06-17 14:43:51 UTC
Created attachment 1697832 [details]
Log files

Comment 14 errata-xmlrpc 2020-09-30 17:25:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 4.1 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4144

