Bug 1860939

Summary:	[BAREMETAL] When the OCP cluster was upgraded, ceph status reported 2 daemons have recently crashed
Product:	[Red Hat Storage] Red Hat OpenShift Container Storage	Reporter:	Pratik Surve <prsurve>
Component:	ceph	Assignee:	Scott Ostapovicz <sostapov>
Status:	CLOSED ERRATA	QA Contact:	Pratik Surve <prsurve>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.5	CC:	aeyal, assingh, bkunal, bniver, giridhar.ramaraju, jijoy, madam, mhackett, muagarwa, nberry, ocs-bugs, pdonnell, ratamir, rcyriac, sabose, sostapov, tdesala
Target Milestone:	---	Keywords:	AutomationTriaged
Target Release:	OCS 4.5.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	4.5.0-526.ci	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-09-15 10:18:25 UTC	Type:	Bug
Bug Depends On:	1850720
Bug Blocks:

Description Pratik Surve 2020-07-27 14:02:53 UTC

Description of problem (please be as detailed as possible and provide log
snippets):
When OCP was upgraded from `4.5.0-0.nightly-2020-07-23-201307` to `4.5.0-0.nightly-2020-07-25-031342`, ceph status reported 2 daemons have recently crashed.


Version of all relevant components (if applicable):
OCS :- ocs-operator.v4.5.0-494.ci
OCP :- 4.5.0-0.nightly-2020-07-25-031342

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
No

Steps to Reproduce:
1. Deploy OCP with OCS over BM LSO
2. Run some i/o
3. Perform ocp upgrade


Actual results:
# ceph -s
  cluster:
    id:     b6a04f01-f8f2-437e-a69c-6607e7e0f68a
    health: HEALTH_WARN
            2 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum a,b,c (age 4h)
    mgr: a(active, since 5h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 4h), 3 in (since 3d)
    rgw: 2 daemons active (ocs.storagecluster.cephobjectstore.a, ocs.storagecluster.cephobjectstore.b)
 
  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle
 
  data:
    pools:   10 pools, 176 pgs
    objects: 20.84k objects, 4.0 GiB
    usage:   14 GiB used, 2.7 TiB / 2.7 TiB avail
    pgs:     176 active+clean
 
  io:
    client:   9.4 KiB/s rd, 20 KiB/s wr, 2 op/s rd, 1 op/s wr


# ceph health detail 
HEALTH_WARN 2 daemons have recently crashed
RECENT_CRASH 2 daemons have recently crashed
    mds.ocs-storagecluster-cephfilesystem-a crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-78555bfbw27nk at 2020-07-27 08:15:11.179356Z
    mds.ocs-storagecluster-cephfilesystem-a crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-78555bfbw27nk at 2020-07-27 08:15:11.262603Z

Expected results:
Ceph health should be HEALTH_OK


Additional info:
# ceph crash ls
ID                                                               ENTITY                                  NEW 
2020-07-27_08:15:11.179356Z_59da44ac-6a5c-4079-9ade-04c6f1f0a042 mds.ocs-storagecluster-cephfilesystem-a  *  
2020-07-27_08:15:11.262603Z_dbde3d6c-5201-4858-b1d9-ee8ecb5f5a06 mds.ocs-storagecluster-cephfilesystem-a  *  
[root@argo006 /]# ceph crash info 2020-07-27_08:15:11.179356Z_59da44ac-6a5c-4079-9ade-04c6f1f0a042
{
    "crash_id": "2020-07-27_08:15:11.179356Z_59da44ac-6a5c-4079-9ade-04c6f1f0a042",
    "timestamp": "2020-07-27 08:15:11.179356Z",
    "process_name": "ceph-mds",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
    "ceph_version": "14.2.8-81.el8cp",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-78555bfbw27nk",
    "utsname_sysname": "Linux",
    "utsname_release": "4.18.0-193.13.2.el8_2.x86_64",
    "utsname_version": "#1 SMP Mon Jul 13 23:17:28 UTC 2020",
    "utsname_machine": "x86_64",
    "os_name": "Red Hat Enterprise Linux",
    "os_id": "rhel",
    "os_version_id": "8.2",
    "os_version": "8.2 (Ootpa)",
    "assert_condition": "r == 0",
    "assert_func": "virtual void C_MDS_rename_finish::finish(int)",
    "assert_file": "/builddir/build/BUILD/ceph-14.2.8/src/mds/Server.cc",
    "assert_line": 7380,
    "assert_thread_name": "fn_anonymous",
    "assert_msg": "/builddir/build/BUILD/ceph-14.2.8/src/mds/Server.cc: In function 'virtual void C_MDS_rename_finish::finish(int)' thread 7f196e5ad700 time 2020-07-27 08:15:11.177416\n/builddir/build/BUILD/ceph-14.2.8/src/mds/Server.cc: 7380: FAILED ceph_assert(r == 0)\n",
    "backtrace": [
        "(()+0x12dd0) [0x7f197c937dd0]",
        "(gsignal()+0x10f) [0x7f197b38970f]",
        "(abort()+0x127) [0x7f197b373b25]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a7) [0x7f197eb245ab]",
        "(()+0x26c774) [0x7f197eb24774]",
        "(()+0x21a431) [0x564b84108431]",
        "(MDSContext::complete(int)+0x7f) [0x564b842f20df]",
        "(MDSIOContextBase::complete(int)+0x168) [0x564b842f2358]",
        "(MDSLogContextBase::complete(int)+0x44) [0x564b842f2624]",
        "(Finisher::finisher_thread_entry()+0x18d) [0x7f197ebb474d]",
        "(()+0x82de) [0x7f197c92d2de]",
        "(clone()+0x43) [0x7f197b44de83]"
    ]
}
[root@argo006 /]# ceph crash info 2020-07-27_08:15:11.262603Z_dbde3d6c-5201-4858-b1d9-ee8ecb5f5a06
{
    "crash_id": "2020-07-27_08:15:11.262603Z_dbde3d6c-5201-4858-b1d9-ee8ecb5f5a06",
    "timestamp": "2020-07-27 08:15:11.262603Z",
    "process_name": "ceph-mds",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
    "ceph_version": "14.2.8-81.el8cp",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-78555bfbw27nk",
    "utsname_sysname": "Linux",
    "utsname_release": "4.18.0-193.13.2.el8_2.x86_64",
    "utsname_version": "#1 SMP Mon Jul 13 23:17:28 UTC 2020",
    "utsname_machine": "x86_64",
    "os_name": "Red Hat Enterprise Linux",
    "os_id": "rhel",
    "os_version_id": "8.2",
    "os_version": "8.2 (Ootpa)",
    "assert_condition": "_head.empty()",
    "assert_func": "elist<T>::~elist() [with T = MDSIOContextBase*]",
    "assert_file": "/builddir/build/BUILD/ceph-14.2.8/src/include/elist.h",
    "assert_line": 91,
    "assert_thread_name": "fn_anonymous",
    "assert_msg": "/builddir/build/BUILD/ceph-14.2.8/src/include/elist.h: In function 'elist<T>::~elist() [with T = MDSIOContextBase*]' thread 7f196e5ad700 time 2020-07-27 08:15:11.260009\n/builddir/build/BUILD/ceph-14.2.8/src/include/elist.h: 91: FAILED ceph_assert(_head.empty())\n",
    "backtrace": [
        "(()+0x12dd0) [0x7f197c937dd0]",
        "(abort()+0x203) [0x7f197b373c01]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a7) [0x7f197eb245ab]",
        "(()+0x26c774) [0x7f197eb24774]",
        "(()+0x404887) [0x564b842f2887]",
        "(()+0x39e9c) [0x7f197b38be9c]",
        "(on_exit()+0) [0x7f197b38bfd0]",
        "(()+0x48ef40) [0x564b8437cf40]",
        "(()+0x12dd0) [0x7f197c937dd0]",
        "(gsignal()+0x10f) [0x7f197b38970f]",
        "(abort()+0x127) [0x7f197b373b25]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a7) [0x7f197eb245ab]",
        "(()+0x26c774) [0x7f197eb24774]",
        "(()+0x21a431) [0x564b84108431]",
        "(MDSContext::complete(int)+0x7f) [0x564b842f20df]",
        "(MDSIOContextBase::complete(int)+0x168) [0x564b842f2358]",
        "(MDSLogContextBase::complete(int)+0x44) [0x564b842f2624]",
        "(Finisher::finisher_thread_entry()+0x18d) [0x7f197ebb474d]",
        "(()+0x82de) [0x7f197c92d2de]",
        "(clone()+0x43) [0x7f197b44de83]"
    ]
}
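
Both crash reports name the same thread (7f196e5ad700) and are about 83 ms apart, so it may be worth checking whether they are really one failure reported twice. The sketch below is only a triage aid, not part of any Ceph tooling: `embedded_offset` is a hypothetical helper name, and the frame lists are copied from the two `ceph crash info` outputs above. It tests whether the first backtrace appears unchanged at the tail of the second.

```python
def embedded_offset(outer, inner):
    """Return the index where `inner` starts as the trailing run of `outer`, else None."""
    if not inner or len(inner) > len(outer):
        return None
    start = len(outer) - len(inner)
    return start if outer[start:] == inner else None


# Backtrace of the first crash (assert r == 0 in C_MDS_rename_finish::finish).
first_bt = [
    "(()+0x12dd0) [0x7f197c937dd0]",
    "(gsignal()+0x10f) [0x7f197b38970f]",
    "(abort()+0x127) [0x7f197b373b25]",
    "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a7) [0x7f197eb245ab]",
    "(()+0x26c774) [0x7f197eb24774]",
    "(()+0x21a431) [0x564b84108431]",
    "(MDSContext::complete(int)+0x7f) [0x564b842f20df]",
    "(MDSIOContextBase::complete(int)+0x168) [0x564b842f2358]",
    "(MDSLogContextBase::complete(int)+0x44) [0x564b842f2624]",
    "(Finisher::finisher_thread_entry()+0x18d) [0x7f197ebb474d]",
    "(()+0x82de) [0x7f197c92d2de]",
    "(clone()+0x43) [0x7f197b44de83]",
]

# Backtrace of the second crash (assert _head.empty() in elist<T>::~elist()).
second_bt = [
    "(()+0x12dd0) [0x7f197c937dd0]",
    "(abort()+0x203) [0x7f197b373c01]",
    "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a7) [0x7f197eb245ab]",
    "(()+0x26c774) [0x7f197eb24774]",
    "(()+0x404887) [0x564b842f2887]",
    "(()+0x39e9c) [0x7f197b38be9c]",
    "(on_exit()+0) [0x7f197b38bfd0]",
    "(()+0x48ef40) [0x564b8437cf40]",
    "(()+0x12dd0) [0x7f197c937dd0]",
    "(gsignal()+0x10f) [0x7f197b38970f]",
    "(abort()+0x127) [0x7f197b373b25]",
    "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a7) [0x7f197eb245ab]",
    "(()+0x26c774) [0x7f197eb24774]",
    "(()+0x21a431) [0x564b84108431]",
    "(MDSContext::complete(int)+0x7f) [0x564b842f20df]",
    "(MDSIOContextBase::complete(int)+0x168) [0x564b842f2358]",
    "(MDSLogContextBase::complete(int)+0x44) [0x564b842f2624]",
    "(Finisher::finisher_thread_entry()+0x18d) [0x7f197ebb474d]",
    "(()+0x82de) [0x7f197c92d2de]",
    "(clone()+0x43) [0x7f197b44de83]",
]

offset = embedded_offset(second_bt, first_bt)
print(offset)  # 8: the last 12 frames of the second trace are the first trace
```

The eight extra frames at the top of the second trace include `on_exit()`, which suggests (but does not prove) that the second assertion fired while the process was already aborting from the first one.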

Comment 2 Neha Berry 2020-07-27 14:05:43 UTC

Since it involves a crash, proposing as a blocker to keep it in 4.5 until we have an initial analysis.

Thanks

Comment 4 Yaniv Kaul 2020-07-30 13:52:25 UTC

Patrick, can anyone from the CephFS team take a look at the above?

Comment 5 Patrick Donnelly 2020-07-30 19:07:10 UTC

(In reply to Pratik Surve from comment #0)
> [root@argo006 /]# ceph crash info
> 2020-07-27_08:15:11.179356Z_59da44ac-6a5c-4079-9ade-04c6f1f0a042
> {
>     "crash_id":
> "2020-07-27_08:15:11.179356Z_59da44ac-6a5c-4079-9ade-04c6f1f0a042",
>     "timestamp": "2020-07-27 08:15:11.179356Z",
>     "process_name": "ceph-mds",
>     "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
>     "ceph_version": "14.2.8-81.el8cp",
>     "utsname_hostname":
> "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-78555bfbw27nk",
>     "utsname_sysname": "Linux",
>     "utsname_release": "4.18.0-193.13.2.el8_2.x86_64",
>     "utsname_version": "#1 SMP Mon Jul 13 23:17:28 UTC 2020",
>     "utsname_machine": "x86_64",
>     "os_name": "Red Hat Enterprise Linux",
>     "os_id": "rhel",
>     "os_version_id": "8.2",
>     "os_version": "8.2 (Ootpa)",
>     "assert_condition": "r == 0",
>     "assert_func": "virtual void C_MDS_rename_finish::finish(int)",
>     "assert_file": "/builddir/build/BUILD/ceph-14.2.8/src/mds/Server.cc",
>     "assert_line": 7380,
>     "assert_thread_name": "fn_anonymous",
>     "assert_msg": "/builddir/build/BUILD/ceph-14.2.8/src/mds/Server.cc: In
> function 'virtual void C_MDS_rename_finish::finish(int)' thread 7f196e5ad700
> time 2020-07-27
> 08:15:11.177416\n/builddir/build/BUILD/ceph-14.2.8/src/mds/Server.cc: 7380:
> FAILED ceph_assert(r == 0)\n",
>     "backtrace": [
>         "(()+0x12dd0) [0x7f197c937dd0]",
>         "(gsignal()+0x10f) [0x7f197b38970f]",
>         "(abort()+0x127) [0x7f197b373b25]",
>         "(ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x1a7) [0x7f197eb245ab]",
>         "(()+0x26c774) [0x7f197eb24774]",
>         "(()+0x21a431) [0x564b84108431]",
>         "(MDSContext::complete(int)+0x7f) [0x564b842f20df]",
>         "(MDSIOContextBase::complete(int)+0x168) [0x564b842f2358]",
>         "(MDSLogContextBase::complete(int)+0x44) [0x564b842f2624]",
>         "(Finisher::finisher_thread_entry()+0x18d) [0x7f197ebb474d]",
>         "(()+0x82de) [0x7f197c92d2de]",
>         "(clone()+0x43) [0x7f197b44de83]"
>     ]
> }

This one looks like an ordinary OSD op failure causing the MDS to abort (which should trigger a respawn by systemd). Ideally the MDS should exit more gracefully. I'll add a ticket for this.

> [root@argo006 /]# ceph crash info
> 2020-07-27_08:15:11.262603Z_dbde3d6c-5201-4858-b1d9-ee8ecb5f5a06
> {
>     "crash_id":
> "2020-07-27_08:15:11.262603Z_dbde3d6c-5201-4858-b1d9-ee8ecb5f5a06",
>     "timestamp": "2020-07-27 08:15:11.262603Z",
>     "process_name": "ceph-mds",
>     "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
>     "ceph_version": "14.2.8-81.el8cp",
>     "utsname_hostname":
> "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-78555bfbw27nk",
>     "utsname_sysname": "Linux",
>     "utsname_release": "4.18.0-193.13.2.el8_2.x86_64",
>     "utsname_version": "#1 SMP Mon Jul 13 23:17:28 UTC 2020",
>     "utsname_machine": "x86_64",
>     "os_name": "Red Hat Enterprise Linux",
>     "os_id": "rhel",
>     "os_version_id": "8.2",
>     "os_version": "8.2 (Ootpa)",
>     "assert_condition": "_head.empty()",
>     "assert_func": "elist<T>::~elist() [with T = MDSIOContextBase*]",
>     "assert_file": "/builddir/build/BUILD/ceph-14.2.8/src/include/elist.h",
>     "assert_line": 91,
>     "assert_thread_name": "fn_anonymous",
>     "assert_msg": "/builddir/build/BUILD/ceph-14.2.8/src/include/elist.h: In
> function 'elist<T>::~elist() [with T = MDSIOContextBase*]' thread
> 7f196e5ad700 time 2020-07-27
> 08:15:11.260009\n/builddir/build/BUILD/ceph-14.2.8/src/include/elist.h: 91:
> FAILED ceph_assert(_head.empty())\n",
>     "backtrace": [
>         "(()+0x12dd0) [0x7f197c937dd0]",
>         "(abort()+0x203) [0x7f197b373c01]",
>         "(ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x1a7) [0x7f197eb245ab]",
>         "(()+0x26c774) [0x7f197eb24774]",
>         "(()+0x404887) [0x564b842f2887]",
>         "(()+0x39e9c) [0x7f197b38be9c]",
>         "(on_exit()+0) [0x7f197b38bfd0]",
>         "(()+0x48ef40) [0x564b8437cf40]",
>         "(()+0x12dd0) [0x7f197c937dd0]",
>         "(gsignal()+0x10f) [0x7f197b38970f]",
>         "(abort()+0x127) [0x7f197b373b25]",
>         "(ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x1a7) [0x7f197eb245ab]",
>         "(()+0x26c774) [0x7f197eb24774]",
>         "(()+0x21a431) [0x564b84108431]",
>         "(MDSContext::complete(int)+0x7f) [0x564b842f20df]",
>         "(MDSIOContextBase::complete(int)+0x168) [0x564b842f2358]",
>         "(MDSLogContextBase::complete(int)+0x44) [0x564b842f2624]",
>         "(Finisher::finisher_thread_entry()+0x18d) [0x7f197ebb474d]",
>         "(()+0x82de) [0x7f197c92d2de]",
>         "(clone()+0x43) [0x7f197b44de83]"
>     ]
> }

Looks like https://tracker.ceph.com/issues/44294. That is fixed by https://github.com/ceph/ceph/pull/34343 in v14.2.10. There is already a BZ ON_QA backporting that downstream: bz1850720
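
As a first-pass check of whether a reported `ceph_version` predates the upstream fix release mentioned above (v14.2.10), the version string can be compared numerically. This is a rough sketch under stated assumptions: `parse_ceph_version` and `predates_upstream_fix` are hypothetical helper names, and the parsing assumes the `X.Y.Z-REL.dist` shape seen in the crash metadata. Downstream builds can carry backports, so a `True` result here only means the upstream base is older than the fix release, not that the fix is absent.

```python
import re

# Upstream release that contains the fix, per the comment above.
FIXED_UPSTREAM = (14, 2, 10)


def parse_ceph_version(version):
    """Split e.g. '14.2.8-81.el8cp' into ((14, 2, 8), 81); release is None if absent."""
    m = re.match(r"^(\d+)\.(\d+)\.(\d+)(?:-(\d+))?", version)
    if m is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    upstream = tuple(int(part) for part in m.group(1, 2, 3))
    release = int(m.group(4)) if m.group(4) is not None else None
    return upstream, release


def predates_upstream_fix(version, fixed=FIXED_UPSTREAM):
    """True if the upstream base version is older than the fixed upstream release."""
    upstream, _release = parse_ceph_version(version)
    return upstream < fixed


print(predates_upstream_fix("14.2.8-81.el8cp"))  # True
print("14.2.8" < "14.2.10")  # False: plain string comparison gets this wrong
```

Comparing integer tuples rather than raw strings matters here because "14.2.8" sorts after "14.2.10" lexicographically.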

Comment 6 Mudit Agarwal 2020-07-31 12:57:05 UTC

Thanks Patrick.

@Raz, I guess we still need this open here for OCS testing? If yes, can we have the acks, please?

Comment 7 Neha Berry 2020-07-31 13:33:16 UTC

(In reply to Mudit Agarwal from comment #6)
> Thanks Patrick.
> 
> @Raz, I guess we still need this open here for ocs testing? If yes, can we
> have the acks please.

@mudit

We can ack it if we are ready to consume ceph-14.2.8-85.el8cp and ceph-14.2.8-85.el7cp in our OCS 4.5 builds (which does seem likely, though). Can you confirm?

Comment 8 Scott Ostapovicz 2020-08-03 13:57:17 UTC

This has a fix ON_QA for RHCS 4.1 z2, which will be available in OCS 4.6, NOT 4.5.

Comment 9 Mudit Agarwal 2020-08-03 14:13:05 UTC

Thanks Scott.

Raz/Neha, can this be moved to 4.6?

Comment 10 Mudit Agarwal 2020-08-04 05:45:38 UTC

AFAICT, we are not blocking the OCS release for a Ceph issue. Moving it to 4.6; please revert if you think that is not correct.

Comment 11 Raz Tamir 2020-08-04 07:08:33 UTC

Hi Scott, Mudit,

The question is when RHCS 4.1.z2 is planned to be shipped.
If the timelines of 4.5 and 4.1.z2 are close enough, we might want to consider having this fix in 4.5.
Regarding "we are not blocking OCS for Ceph issues", that's not right: we should aim to release asyncs for Ceph when an issue impacts OCS.

Comment 13 Mudit Agarwal 2020-08-04 07:58:13 UTC

https://pp.engineering.redhat.com/pp/product/ceph/release/ceph-4-1/status/trend

If I read this correctly, the target date for 4.1z2 is 09/15, which is almost 3 weeks after the OCS 4.5 GA date (08/26), so I don't know if we can wait until then.
Also, we need to check how easy this is to hit.

But yes, as Neha mentioned we should document it properly.

Comment 19 Neha Berry 2020-08-04 15:20:31 UTC

(In reply to Scott Ostapovicz from comment #8)
> This has a fix ON_QA for RHCS 4.1 z2, which will be available in OCS 4.6,
> NOT 4.5.

@scott is the RHCS side issue https://bugzilla.redhat.com/show_bug.cgi?id=1847685 ?

Comment 21 Scott Ostapovicz 2020-08-04 16:16:31 UTC

https://bugzilla.redhat.com/show_bug.cgi?id=1850720

Comment 29 errata-xmlrpc 2020-09-15 10:18:25 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3754

Comment 30 Jilju Joy 2021-08-24 07:57:57 UTC

Covered in test tests/ecosystem/upgrade/test_upgrade_ocp.py

Comment 31 Red Hat Bugzilla 2023-09-15 00:34:45 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days