Bug 2246002

Summary: MDS daemons crash while creating a CephFS volume
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Manisha Saini <msaini>
Component: CephFS
Assignee: Rishabh Dave <ridave>
Status: CLOSED CURRENTRELEASE
QA Contact: Amarnath <amk>
Severity: urgent
Priority: unspecified
Version: 7.0
CC: adking, amagrawa, amk, ceph-eng-bugs, cephqe-warriors, gfarnum, hyelloji, kdreyer, ngangadh, pdonnell, ridave, vdas, vereddy
Target Milestone: ---
Keywords: Automation, Regression, TestBlocker
Target Release: 7.0
Flags: ridave: needinfo-
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Type: Bug
Regression: ---
Last Closed: 2023-10-30 16:10:38 UTC

Description Manisha Saini 2023-10-24 22:54:54 UTC
Description of problem:
====================

MDS daemons crash after creating a CephFS volume

--------------------
# ceph -s
  cluster:
    id:     1c537092-72b2-11ee-899e-fa163eb88291
    health: HEALTH_WARN
            2 failed cephadm daemon(s)
            insufficient standby MDS daemons available
            10 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum ceph-mani-o7fdxp-node1-installer,ceph-mani-o7fdxp-node3,ceph-mani-o7fdxp-node2 (age 20m)
    mgr: ceph-mani-o7fdxp-node1-installer.zidoxy(active, since 84m), standbys: ceph-mani-o7fdxp-node3.epdzad
    mds: 1/1 daemons up
    osd: 18 osds: 18 up (since 20m), 18 in (since 21m)
 
  data:
    volumes: 1/1 healthy
    pools:   3 pools, 49 pgs
    objects: 24 objects, 451 KiB
    usage:   494 MiB used, 269 GiB / 270 GiB avail
    pgs:     49 active+clean


-----------------

# ceph health detail
HEALTH_WARN 2 failed cephadm daemon(s); insufficient standby MDS daemons available; 10 daemons have recently crashed
[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
    daemon mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha on ceph-mani-o7fdxp-node1-installer is in error state
    daemon mds.cephfs01.ceph-mani-o7fdxp-node3.htaqzw on ceph-mani-o7fdxp-node3 is in error state
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
    have 0; want 1 more
[WRN] RECENT_CRASH: 10 daemons have recently crashed
    mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha crashed on host ceph-mani-o7fdxp-node1-installer at 2023-10-24T22:19:14.007629Z
    mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha crashed on host ceph-mani-o7fdxp-node1-installer at 2023-10-24T22:19:39.938760Z
    mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha crashed on host ceph-mani-o7fdxp-node1-installer at 2023-10-24T22:19:55.901449Z
    mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha crashed on host ceph-mani-o7fdxp-node1-installer at 2023-10-24T22:20:11.644713Z
    mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha crashed on host ceph-mani-o7fdxp-node1-installer at 2023-10-24T22:20:27.396603Z
    mds.cephfs01.ceph-mani-o7fdxp-node3.htaqzw crashed on host ceph-mani-o7fdxp-node3 at 2023-10-24T22:19:21.148486Z
    mds.cephfs01.ceph-mani-o7fdxp-node3.htaqzw crashed on host ceph-mani-o7fdxp-node3 at 2023-10-24T22:20:47.368772Z
    mds.cephfs01.ceph-mani-o7fdxp-node3.htaqzw crashed on host ceph-mani-o7fdxp-node3 at 2023-10-24T22:21:03.613015Z
    mds.cephfs01.ceph-mani-o7fdxp-node3.htaqzw crashed on host ceph-mani-o7fdxp-node3 at 2023-10-24T22:21:19.828147Z
    mds.cephfs01.ceph-mani-o7fdxp-node3.htaqzw crashed on host ceph-mani-o7fdxp-node3 at 2023-10-24T22:21:36.090236Z
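
For the raw daemon output behind these crash entries, a rough sketch for pulling the failed MDS logs on a cephadm cluster (fsid and daemon name are taken from the output above; the systemd unit name assumes the usual ceph-<fsid>@<daemon> convention and may differ on other layouts):

# cephadm logs --name mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha
# journalctl -u ceph-1c537092-72b2-11ee-899e-fa163eb88291@mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha --no-pager | tail -n 200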

----------------

# ceph crash ls
ID                                                                ENTITY                                                NEW  
2023-10-24T22:19:14.007629Z_7939c431-6264-4fe8-b1a3-90020172da3b  mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha   *   
2023-10-24T22:19:21.148486Z_a049e9d6-20ac-4f97-aba7-c5b54dd7690c  mds.cephfs01.ceph-mani-o7fdxp-node3.htaqzw             *   
2023-10-24T22:19:39.938760Z_9d3f892d-5eec-4020-b9af-a22f85d38c55  mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha   *   
2023-10-24T22:19:55.901449Z_68e82f22-3d6e-44de-9bfa-6327571baa9f  mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha   *   
2023-10-24T22:20:11.644713Z_63851c60-2b91-4f18-ac4b-67f57304c693  mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha   *   
2023-10-24T22:20:27.396603Z_c18abc01-d54e-49c0-8269-a0f4a0557d05  mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha   *   
2023-10-24T22:20:47.368772Z_daf6ef2c-7dd2-4317-a869-3be915537f5d  mds.cephfs01.ceph-mani-o7fdxp-node3.htaqzw             *   
2023-10-24T22:21:03.613015Z_b1610818-b2ce-44ea-b1c4-c313ec28108d  mds.cephfs01.ceph-mani-o7fdxp-node3.htaqzw             *   
2023-10-24T22:21:19.828147Z_96e1a843-7879-4835-b649-a42ebdcb0a76  mds.cephfs01.ceph-mani-o7fdxp-node3.htaqzw             *   
2023-10-24T22:21:36.090236Z_b823b211-c44c-45fe-8f28-a0654c8088dd  mds.cephfs01.ceph-mani-o7fdxp-node3.htaqzw             *   


---------

# ceph crash info 2023-10-24T22:20:27.396603Z_c18abc01-d54e-49c0-8269-a0f4a0557d05
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54df0) [0x7fcc29b9fdf0]",
        "/lib64/libc.so.6(+0xa154c) [0x7fcc29bec54c]",
        "raise()",
        "abort()",
        "/lib64/libstdc++.so.6(+0xa1a01) [0x7fcc29eeda01]",
        "/lib64/libstdc++.so.6(+0xad37c) [0x7fcc29ef937c]",
        "/lib64/libstdc++.so.6(+0xad3e7) [0x7fcc29ef93e7]",
        "/lib64/libstdc++.so.6(+0xad649) [0x7fcc29ef9649]",
        "/usr/bin/ceph-mds(+0xc2c4f) [0x565170eeac4f]",
        "/usr/bin/ceph-mds(+0xc2c73) [0x565170eeac73]",
        "/usr/bin/ceph-mds(+0xc403f) [0x565170eec03f]",
        "(Server::find_idle_sessions()+0x1bb) [0x565170fc7c0b]",
        "(MDSRankDispatcher::tick()+0x25e) [0x565170f7690e]",
        "/usr/bin/ceph-mds(+0x1200fd) [0x565170f480fd]",
        "(CommonSafeTimer<ceph::fair_mutex>::timer_thread()+0x15e) [0x7fcc2a30bb7e]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x253441) [0x7fcc2a30c441]",
        "/lib64/libc.so.6(+0x9f802) [0x7fcc29bea802]",
        "/lib64/libc.so.6(+0x3f450) [0x7fcc29b8a450]"
    ],
    "ceph_version": "18.2.0-98.el9cp",
    "crash_id": "2023-10-24T22:20:27.396603Z_c18abc01-d54e-49c0-8269-a0f4a0557d05",
    "entity_name": "mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.2 (Plow)",
    "os_version_id": "9.2",
    "process_name": "ceph-mds",
    "stack_sig": "c6412d54539859c8fc7f05a3dc8e6e1bf6c32b0bdce9b27d39107a402f4a617a",
    "timestamp": "2023-10-24T22:20:27.396603Z",
    "utsname_hostname": "ceph-mani-o7fdxp-node1-installer",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.30.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Fri Aug 25 09:13:12 EDT 2023"
}
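
Once the crash reports have been collected for analysis, the RECENT_CRASH warning can normally be cleared by archiving them; a minimal sketch using the standard ceph crash commands (archive a single report by ID, or all of them at once):

# ceph crash archive 2023-10-24T22:20:27.396603Z_c18abc01-d54e-49c0-8269-a0f4a0557d05
# ceph crash archive-all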

-------
# ceph orch ls
NAME                       PORTS        RUNNING  REFRESHED  AGE  PLACEMENT  
alertmanager               ?:9093,9094      1/1  7m ago     88m  count:1    
ceph-exporter                               3/3  7m ago     88m  *          
crash                                       3/3  7m ago     88m  *          
grafana                    ?:3000           1/1  7m ago     88m  count:1    
mds.cephfs01                                0/2  7m ago     23m  count:2    
mgr                                         2/2  7m ago     88m  count:2    
mon                                         3/5  7m ago     88m  count:5    
node-exporter              ?:9100           3/3  7m ago     88m  *          
osd.all-available-devices                    18  7m ago     25m  *          
prometheus                 ?:9095           1/1  7m ago     88m  count:1    

--------
# ceph orch ps | grep mds
mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha  ceph-mani-o7fdxp-node1-installer                    error             7m ago  23m        -        -  <unknown>        <unknown>     <unknown>     
mds.cephfs01.ceph-mani-o7fdxp-node3.htaqzw            ceph-mani-o7fdxp-node3                              error             5m ago  23m        -        -  <unknown>        <unknown>     <unknown>     
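
To check whether the crash recurs on a clean start, the errored daemons can be restarted through the orchestrator; a sketch using the standard ceph orch commands (daemon names taken from the output above):

# ceph orch daemon restart mds.cephfs01.ceph-mani-o7fdxp-node1-installer.divpha
# ceph orch daemon restart mds.cephfs01.ceph-mani-o7fdxp-node3.htaqzw
# ceph orch ps --daemon-type mds --refresh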


Version-Release number of selected component (if applicable):
=======================

# ceph --version
ceph version 18.2.0-98.el9cp (3f7cc32d87ca3dcf8fd0ace7c3d6c63dd734c195) reef (stable)


How reproducible:
==============
2/2


Steps to Reproduce:
=============
1. Create a Ceph cluster

# ceph -s
  cluster:
    id:     1c537092-72b2-11ee-899e-fa163eb88291
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum ceph-mani-o7fdxp-node1-installer,ceph-mani-o7fdxp-node3,ceph-mani-o7fdxp-node2 (age 2s)
    mgr: ceph-mani-o7fdxp-node1-installer.zidoxy(active, since 64m), standbys: ceph-mani-o7fdxp-node3.epdzad
    osd: 18 osds: 18 up (since 2s), 18 in (since 46s)
 
  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 449 KiB
    usage:   1.2 GiB used, 254 GiB / 255 GiB avail
    pgs:     1 active+clean
 
  io:
    recovery: 59 KiB/s, 0 objects/s

2. Create a CephFS volume (a consolidated reproduction sketch follows the output below)

# ceph fs volume create cephfs01

# ceph fs volume ls
[
    {
        "name": "cephfs01"
    }
]
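
For reference, a consolidated sketch of the reproduction flow on a cephadm cluster (the monitor IP and added host names are placeholders; OSDs are deployed with all-available-devices to match the service list above):

# cephadm bootstrap --mon-ip <MON_IP>
# ceph orch host add <node2>
# ceph orch host add <node3>
# ceph orch apply osd --all-available-devices
# ceph fs volume create cephfs01
# ceph fs status cephfs01
# ceph orch ps --daemon-type mds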




Actual results:
===========
MDS daemons crashed 


Expected results:
===========
No crashes should be observed


Additional info:

Comment 3 Manisha Saini 2023-10-25 06:17:57 UTC
This issue is getting hit frequently in our automation runs, and in our NFS testing as well: the crash occurs when creating a CephFS volume for NFS exports, which makes it a critical test blocker for NFS.
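
For reference, the NFS test flow hits the crash at its very first step, when the backing CephFS volume is created; a rough sketch of that flow, assuming the reef-style ceph nfs CLI (cluster ID, placement, pseudo path and volume name below are placeholders):

# ceph fs volume create cephfs-nfs
# ceph nfs cluster create nfs-cluster "1 ceph-mani-o7fdxp-node2"
# ceph nfs export create cephfs --cluster-id nfs-cluster --pseudo-path /export1 --fsname cephfs-nfs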

Comment 6 Veera Raghava Reddy 2023-10-26 14:11:15 UTC
@pdonnell, @gfarnum We need help with a fix or a workaround for https://bugzilla.redhat.com/show_bug.cgi?id=2246002. It is a test blocker for both NFS and CephFS testing.

Comment 8 Rishabh Dave 2023-10-26 17:19:46 UTC
Thanks to Amarnath, we have figured out how to install gdb and are working on this again.
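
For anyone following along, a rough sketch of the kind of gdb setup being referred to, assuming the core is captured by systemd-coredump on the host and the usual RHEL debuginfo repositories are enabled (package names and the coredumpctl match are illustrative; a containerized MDS may need the tools installed inside the container instead):

# dnf install -y gdb
# dnf debuginfo-install -y ceph-mds
# coredumpctl list ceph-mds
# coredumpctl debug ceph-mds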

Comment 11 Greg Farnum 2023-10-30 16:10:38 UTC
This was caused by an incorrect conflict resolution in https://bugzilla.redhat.com/show_bug.cgi?id=2238663, so it is resolved and is being tracked under that bug.