Description of problem (please be as detailed as possible and provide log snippets):
When OCP was upgraded from `4.5.0-0.nightly-2020-07-23-201307` to `4.5.0-0.nightly-2020-07-25-031342`, ceph status reported that 2 daemons had recently crashed.

Version of all relevant components (if applicable):
OCS: ocs-operator.v4.5.0-494.ci
OCP: 4.5.0-0.nightly-2020-07-25-031342

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:
No

Steps to Reproduce:
1. Deploy OCP with OCS over BM LSO
2. Run some I/O
3. Perform an OCP upgrade

Actual results:

# ceph -s
  cluster:
    id:     b6a04f01-f8f2-437e-a69c-6607e7e0f68a
    health: HEALTH_WARN
            2 daemons have recently crashed

  services:
    mon: 3 daemons, quorum a,b,c (age 4h)
    mgr: a(active, since 5h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 4h), 3 in (since 3d)
    rgw: 2 daemons active (ocs.storagecluster.cephobjectstore.a, ocs.storagecluster.cephobjectstore.b)

  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle

  data:
    pools:   10 pools, 176 pgs
    objects: 20.84k objects, 4.0 GiB
    usage:   14 GiB used, 2.7 TiB / 2.7 TiB avail
    pgs:     176 active+clean

  io:
    client:   9.4 KiB/s rd, 20 KiB/s wr, 2 op/s rd, 1 op/s wr

# ceph health detail
HEALTH_WARN 2 daemons have recently crashed
RECENT_CRASH 2 daemons have recently crashed
    mds.ocs-storagecluster-cephfilesystem-a crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-78555bfbw27nk at 2020-07-27 08:15:11.179356Z
    mds.ocs-storagecluster-cephfilesystem-a crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-78555bfbw27nk at 2020-07-27 08:15:11.262603Z

Expected results:
Ceph health should be HEALTH_OK

Additional info:

# ceph crash ls
ID                                                                 ENTITY                                    NEW
2020-07-27_08:15:11.179356Z_59da44ac-6a5c-4079-9ade-04c6f1f0a042   mds.ocs-storagecluster-cephfilesystem-a   *
2020-07-27_08:15:11.262603Z_dbde3d6c-5201-4858-b1d9-ee8ecb5f5a06   mds.ocs-storagecluster-cephfilesystem-a   *

[root@argo006 /]# ceph crash info 2020-07-27_08:15:11.179356Z_59da44ac-6a5c-4079-9ade-04c6f1f0a042
{
    "crash_id": "2020-07-27_08:15:11.179356Z_59da44ac-6a5c-4079-9ade-04c6f1f0a042",
    "timestamp": "2020-07-27 08:15:11.179356Z",
    "process_name": "ceph-mds",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
    "ceph_version": "14.2.8-81.el8cp",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-78555bfbw27nk",
    "utsname_sysname": "Linux",
    "utsname_release": "4.18.0-193.13.2.el8_2.x86_64",
    "utsname_version": "#1 SMP Mon Jul 13 23:17:28 UTC 2020",
    "utsname_machine": "x86_64",
    "os_name": "Red Hat Enterprise Linux",
    "os_id": "rhel",
    "os_version_id": "8.2",
    "os_version": "8.2 (Ootpa)",
    "assert_condition": "r == 0",
    "assert_func": "virtual void C_MDS_rename_finish::finish(int)",
    "assert_file": "/builddir/build/BUILD/ceph-14.2.8/src/mds/Server.cc",
    "assert_line": 7380,
    "assert_thread_name": "fn_anonymous",
    "assert_msg": "/builddir/build/BUILD/ceph-14.2.8/src/mds/Server.cc: In function 'virtual void C_MDS_rename_finish::finish(int)' thread 7f196e5ad700 time 2020-07-27 08:15:11.177416\n/builddir/build/BUILD/ceph-14.2.8/src/mds/Server.cc: 7380: FAILED ceph_assert(r == 0)\n",
    "backtrace": [
        "(()+0x12dd0) [0x7f197c937dd0]",
        "(gsignal()+0x10f) [0x7f197b38970f]",
        "(abort()+0x127) [0x7f197b373b25]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a7) [0x7f197eb245ab]",
        "(()+0x26c774) [0x7f197eb24774]",
        "(()+0x21a431) [0x564b84108431]",
        "(MDSContext::complete(int)+0x7f) [0x564b842f20df]",
        "(MDSIOContextBase::complete(int)+0x168) [0x564b842f2358]",
        "(MDSLogContextBase::complete(int)+0x44) [0x564b842f2624]",
        "(Finisher::finisher_thread_entry()+0x18d) [0x7f197ebb474d]",
        "(()+0x82de) [0x7f197c92d2de]",
        "(clone()+0x43) [0x7f197b44de83]"
    ]
}

[root@argo006 /]# ceph crash info 2020-07-27_08:15:11.262603Z_dbde3d6c-5201-4858-b1d9-ee8ecb5f5a06
{
    "crash_id": "2020-07-27_08:15:11.262603Z_dbde3d6c-5201-4858-b1d9-ee8ecb5f5a06",
    "timestamp": "2020-07-27 08:15:11.262603Z",
    "process_name": "ceph-mds",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
    "ceph_version": "14.2.8-81.el8cp",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-78555bfbw27nk",
    "utsname_sysname": "Linux",
    "utsname_release": "4.18.0-193.13.2.el8_2.x86_64",
    "utsname_version": "#1 SMP Mon Jul 13 23:17:28 UTC 2020",
    "utsname_machine": "x86_64",
    "os_name": "Red Hat Enterprise Linux",
    "os_id": "rhel",
    "os_version_id": "8.2",
    "os_version": "8.2 (Ootpa)",
    "assert_condition": "_head.empty()",
    "assert_func": "elist<T>::~elist() [with T = MDSIOContextBase*]",
    "assert_file": "/builddir/build/BUILD/ceph-14.2.8/src/include/elist.h",
    "assert_line": 91,
    "assert_thread_name": "fn_anonymous",
    "assert_msg": "/builddir/build/BUILD/ceph-14.2.8/src/include/elist.h: In function 'elist<T>::~elist() [with T = MDSIOContextBase*]' thread 7f196e5ad700 time 2020-07-27 08:15:11.260009\n/builddir/build/BUILD/ceph-14.2.8/src/include/elist.h: 91: FAILED ceph_assert(_head.empty())\n",
    "backtrace": [
        "(()+0x12dd0) [0x7f197c937dd0]",
        "(abort()+0x203) [0x7f197b373c01]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a7) [0x7f197eb245ab]",
        "(()+0x26c774) [0x7f197eb24774]",
        "(()+0x404887) [0x564b842f2887]",
        "(()+0x39e9c) [0x7f197b38be9c]",
        "(on_exit()+0) [0x7f197b38bfd0]",
        "(()+0x48ef40) [0x564b8437cf40]",
        "(()+0x12dd0) [0x7f197c937dd0]",
        "(gsignal()+0x10f) [0x7f197b38970f]",
        "(abort()+0x127) [0x7f197b373b25]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a7) [0x7f197eb245ab]",
        "(()+0x26c774) [0x7f197eb24774]",
        "(()+0x21a431) [0x564b84108431]",
        "(MDSContext::complete(int)+0x7f) [0x564b842f20df]",
        "(MDSIOContextBase::complete(int)+0x168) [0x564b842f2358]",
        "(MDSLogContextBase::complete(int)+0x44) [0x564b842f2624]",
        "(Finisher::finisher_thread_entry()+0x18d) [0x7f197ebb474d]",
        "(()+0x82de) [0x7f197c92d2de]",
        "(clone()+0x43) [0x7f197b44de83]"
    ]
}
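For reference, the RECENT_CRASH warning persists until the crash reports are archived; once the dumps above have been collected for analysis, Ceph health can be returned to HEALTH_OK from the toolbox. A minimal sketch - the toolbox label and namespace below are the usual OCS defaults, assumed here rather than taken from this cluster:

# Locate the rook-ceph toolbox pod (assumes label app=rook-ceph-tools in openshift-storage).
TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)

# List crash reports and dump the details of each one for attachment to the BZ.
oc -n openshift-storage rsh "$TOOLS_POD" ceph crash ls
oc -n openshift-storage rsh "$TOOLS_POD" ceph crash info <crash-id>

# After the reports have been analyzed, archive them so the warning clears.
oc -n openshift-storage rsh "$TOOLS_POD" ceph crash archive-all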
Since it involves a crash, proposing as a blocker to keep it in 4.5 until we have an initial analysis. Thanks
Patrick, can anyone from the CephFS team take a look at the above?
(In reply to Pratik Surve from comment #0)
> [snip: crash 2020-07-27_08:15:11.179356Z - FAILED ceph_assert(r == 0) in
>  virtual void C_MDS_rename_finish::finish(int), src/mds/Server.cc:7380]

This one looks like an ordinary OSD op failure causing the MDS to abort (which should trigger a respawn by systemd). Ideally the MDS should exit more gracefully. I'll add a ticket for this.

> [snip: crash 2020-07-27_08:15:11.262603Z - FAILED ceph_assert(_head.empty()) in
>  elist<T>::~elist() [with T = MDSIOContextBase*], src/include/elist.h:91]

Looks like https://tracker.ceph.com/issues/44294. That is fixed by https://github.com/ceph/ceph/pull/34343 in v14.2.10.
There is already a BZ ON_QA backporting that downstream: bz1850720
Thanks Patrick. @Raz, I guess we still need this open here for OCS testing? If yes, can we have the acks please.
(In reply to Mudit Agarwal from comment #6)
> Thanks Patrick.
>
> @Raz, I guess we still need this open here for OCS testing? If yes, can we
> have the acks please.

@mudit We can ack it in case we are ready to consume ceph-14.2.8-85.el8cp / ceph-14.2.8-85.el7cp in our OCS 4.5 builds (which does seem likely, though). Can you confirm?
This has a fix ON_QA for RHCS 4.1 z2, which will be available in OCS 4.6, NOT 4.5.
Thanks Scott. Raz/Neha, can this be moved to 4.6?
AFAICT, we are not blocking the OCS release for a Ceph issue. Moving it to 4.6; please revert if you think that is not correct.
Hi Scott, Mudit,

The question is when RHCS 4.1.z2 is planned to ship. If the timelines of OCS 4.5 and 4.1.z2 are close enough, we might want to consider having this fix in 4.5.

Regarding "we are not blocking OCS for Ceph issues", that's not right - we should aim to release Ceph asyncs when an issue impacts OCS.
https://pp.engineering.redhat.com/pp/product/ceph/release/ceph-4-1/status/trend

If I read this correctly, the target date for 4.1.z2 is 09/15, which is almost 3 weeks after the OCS 4.5 GA date (08/26), so I don't know if we can wait till then. Also, we need to check how easy this is to hit. But yes, as Neha mentioned, we should document it properly.
(In reply to Scott Ostapovicz from comment #8)
> This has a fix ON_QA for RHCS 4.1 z2, which will be available in OCS 4.6,
> NOT 4.5.

@scott is the RHCS-side issue https://bugzilla.redhat.com/show_bug.cgi?id=1847685 ?
https://bugzilla.redhat.com/show_bug.cgi?id=1850720
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3754
Covered by tests/ecosystem/upgrade/test_upgrade_ocp.py
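For anyone re-running that verification by hand, the post-upgrade check boils down to confirming the OCP upgrade completed and that Ceph health is clean. A minimal sketch, assuming a standard rook-ceph toolbox is available; the exact assertions in test_upgrade_ocp.py may differ:

# Confirm the OCP upgrade has completed before checking storage health.
oc adm upgrade

# From the rook-ceph toolbox pod:
ceph health detail   # expect HEALTH_OK with no RECENT_CRASH warning
ceph crash ls-new    # expect no new crash entries after the upgrade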
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days