Good afternoon, I have an odd situation here, the first I've seen where everything looks the way it does. The customer's ODF seems healthy: all deployments and pods are up, all PVs/PVCs are bound, and BOTH mds pods are up and running. Basically, ODF looks almost perfect, however...

Description of problem (please be as detailed as possible and provide log snippets):

Ceph is currently in HEALTH_ERR, with ceph status showing the following:

sh-4.4$ ceph -s
  cluster:
    id:     <omitted>
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum d,e,f (age 12d)
    mgr: a(active, since 29h)
    mds: 0/1 daemons up, 2 standby
    osd: 3 osds: 3 up (since 12d), 3 in (since 12d)

  data:
    volumes: 0/1 healthy, 1 recovering; 1 damaged
    pools:   11 pools, 177 pgs
    objects: 58.35k objects, 85 GiB
    usage:   260 GiB used, 1.2 TiB / 1.5 TiB avail
    pgs:     177 active+clean

sh-4.4$ ceph health detail
HEALTH_ERR 1 filesystem is degraded; 1 filesystem is offline; 1 mds daemon damaged; 1 daemons have recently crashed
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs ocs-storagecluster-cephfilesystem is degraded
[ERR] MDS_ALL_DOWN: 1 filesystem is offline
    fs ocs-storagecluster-cephfilesystem is offline because no MDS is active for it.
[ERR] MDS_DAMAGE: 1 mds daemon damaged
    fs ocs-storagecluster-cephfilesystem mds.0 is damaged
[WRN] RECENT_CRASH: 1 daemons have recently crashed
    mds.ocs-storagecluster-cephfilesystem-b crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5896fbb5k7r8z at 2023-02-10T03:40:44.301123Z

Version of all relevant components (if applicable):

OCP:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.45   True        False         21d     Cluster version is 4.10.45

ODF:
NAME                              DISPLAY                       VERSION   REPLACES                          PHASE
mcg-operator.v4.10.9              NooBaa Operator               4.10.9    mcg-operator.v4.9.13              Succeeded
ocs-operator.v4.10.9              OpenShift Container Storage   4.10.9    ocs-operator.v4.9.13              Succeeded
odf-csi-addons-operator.v4.10.9   CSI Addons                    4.10.9    odf-csi-addons-operator.v4.10.8   Succeeded
odf-operator.v4.10.9              OpenShift Data Foundation     4.10.9    odf-operator.v4.9.13              Succeeded

$ ceph versions
{
    "mon": {
        "ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 3
    },
    "mds": {
        "ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 2
    },
    "overall": {
        "ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 9
    }
}

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, the customer cannot access any PVCs backed by CephFS.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
4

Is this issue reproducible?
No

Can this issue be reproduced from the UI?
No
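For anyone picking this up, the damaged rank and the recent crash above can be re-checked from the rook-ceph-tools pod with something like the following (a sketch; the crash ID is a placeholder for whatever ceph crash ls returns):

sh-4.4$ ceph fs status ocs-storagecluster-cephfilesystem   # shows rank 0 marked damaged and the two standbys
sh-4.4$ ceph crash ls                                      # lists the crash IDs known to the cluster
sh-4.4$ ceph crash info <crash_id>                         # full metadata and backtrace for a given crash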
Additional info:

We've tried the following:
- Restarted ceph-mgr
- Marked the filesystem repaired
- Collected debug logs (since the mds pods are not crashing, nothing of importance was in them)
- Restarted the mds pods
- Scaled the ocs/rook-ceph operators along with the mds deployments down, then back up (see the sketch after this list)
- Ran the following command:
  $ ceph fs set ocs-storagecluster-cephfilesystem max_mds 1
- Collected the output of dump_ops_in_flight, which only yielded this error:
  ERROR: (38) Function not implemented

I spoke with Michael Kidd and Greg Farnum, who agreed it is odd that the mds pods aren't crashing and are in Running status, yet Ceph is in this state. I asked the customer for a detailed description of what preceded this issue (upgrade, machine outage, etc.). I will upload the logs along with the customer's response soon.
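For completeness, here is a sketch of the scale-down/repair/scale-up sequence referenced above, assuming the default openshift-storage namespace and the usual ODF deployment names; the "marked filesystem repaired" step is assumed to be ceph mds repaired on rank 0, and the exact commands actually run may have differed:

$ oc -n openshift-storage scale deployment ocs-operator --replicas=0
$ oc -n openshift-storage scale deployment rook-ceph-operator --replicas=0
$ oc -n openshift-storage scale deployment rook-ceph-mds-ocs-storagecluster-cephfilesystem-a --replicas=0
$ oc -n openshift-storage scale deployment rook-ceph-mds-ocs-storagecluster-cephfilesystem-b --replicas=0

# from the rook-ceph-tools pod: clear the damaged flag on rank 0 and keep a single active rank
sh-4.4$ ceph mds repaired ocs-storagecluster-cephfilesystem:0
sh-4.4$ ceph fs set ocs-storagecluster-cephfilesystem max_mds 1

# scale the mds deployments and the operators back up and let them reconcile
$ oc -n openshift-storage scale deployment rook-ceph-mds-ocs-storagecluster-cephfilesystem-a --replicas=1
$ oc -n openshift-storage scale deployment rook-ceph-mds-ocs-storagecluster-cephfilesystem-b --replicas=1
$ oc -n openshift-storage scale deployment rook-ceph-operator --replicas=1
$ oc -n openshift-storage scale deployment ocs-operator --replicas=1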
This is the crash backtrace (at least one of the crashes):

{
    "crash_id": "2023-02-01T05:44:04.993341Z_ff9d8815-10e1-4304-90fd-32d91f7bbcdb",
    "timestamp": "2023-02-01T05:44:04.993341Z",
    "process_name": "ceph-mon",
    "entity_name": "mon.b",
    "ceph_version": "16.2.0-152.el8cp",
    "utsname_hostname": "rook-ceph-mon-b-549b6df65f-2s5dw",
    "utsname_sysname": "Linux",
    "utsname_release": "4.18.0-305.72.1.el8_4.x86_64",
    "utsname_version": "#1 SMP Thu Nov 17 09:15:11 EST 2022",
    "utsname_machine": "x86_64",
    "os_name": "Red Hat Enterprise Linux",
    "os_id": "rhel",
    "os_version_id": "8.5",
    "os_version": "8.5 (Ootpa)",
    "assert_condition": "fs->mds_map.compat.compare(compat) == 0",
    "assert_func": "void FSMap::sanity() const",
    "assert_file": "/builddir/build/BUILD/ceph-16.2.0/src/mds/FSMap.cc",
    "assert_line": 857,
    "assert_thread_name": "ceph-mon",
    "assert_msg": "/builddir/build/BUILD/ceph-16.2.0/src/mds/FSMap.cc: In function 'void FSMap::sanity() const' thread 7f11298e8700 time 2023-02-01T05:44:04.989175+0000\n/builddir/build/BUILD/ceph-16.2.0/src/mds/FSMap.cc: 857: FAILED ceph_assert(fs->mds_map.compat.compare(compat) == 0)\n",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12c20) [0x7f111e577c20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f1120a7abb1]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x274d7a) [0x7f1120a7ad7a]",
        "(FSMap::sanity() const+0xcd) [0x7f1120fbd9dd]",
        "(MDSMonitor::update_from_paxos(bool*)+0x378) [0x55c6dacfe838]",
        "(PaxosService::refresh(bool*)+0x10e) [0x55c6dac1fc9e]",
        "(Monitor::refresh_from_paxos(bool*)+0x18c) [0x55c6daad147c]",
        "(Monitor::init_paxos()+0x10c) [0x55c6daad178c]",
        "(Monitor::preinit()+0xd30) [0x55c6daafec40]",
        "main()",
        "__libc_start_main()",
        "_start()"
    ]
}

This does ring a bell; I've seen it before (probably during an upgrade). Checking...
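Two things stand out in that crash: it comes from ceph-mon (mon.b) hitting the FSMap::sanity() compat assert, not from an MDS, and it was recorded on 2023-02-01 with ceph_version 16.2.0-152.el8cp while the cluster now reports 16.2.7-126.el8cp, which fits the upgrade theory. To compare the compat flags the FSMap carries against what the daemons report, something like the following from the toolbox should help (a sketch; the grep is only there to narrow the output, and exact command availability can vary by build):

sh-4.4$ ceph fs dump | grep -i compat      # compat/incompat flags recorded in the FSMap for each filesystem
sh-4.4$ ceph mds compat show               # default mdsmap compat flags
sh-4.4$ ceph mds metadata                  # per-daemon metadata (ceph_version, hostname) for the standbys
sh-4.4$ ceph versions                      # confirm every daemon is now on the same build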