Bug 2104790 - rook-ceph-mds-ocs-storagecluster-cephfilesystem-a crashing
Summary: rook-ceph-mds-ocs-storagecluster-cephfilesystem-a crashing
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.11.0
Assignee: Venky Shankar
QA Contact: avdhoot
URL:
Whiteboard:
Depends On: 2105881
Blocks:
 
Reported: 2022-07-07 07:11 UTC by avdhoot
Modified: 2024-04-05 17:02 UTC
CC List: 9 users

Fixed In Version: 4.11.0-137
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-04-05 17:02:33 UTC
Embargoed:



Description avdhoot 2022-07-07 07:11:29 UTC
Description of problem (please be as detailed as possible and provide log snippets):

When running the whole workload suite [1], the test "tests.e2e.workloads.ocp.registry.test_registry_reboot_node.TestRegistryRebootNode" failed during teardown on the cluster health check because the rook-ceph-mds-ocs-storagecluster-cephfilesystem-a pod crashed.


{
    "cluster_fingerprint": "f1aebac8-0412-45ec-92b7-1043c2cfed3b",
    "version": "16.2.8-59.el8cp",
    "commit": "4e10c5fa8a9bc0a421a4dd0833f951ab1cfdcfa7",
    "timestamp": "2022-06-30T22:56:19.317455+0000",
    "tag": "",
    "health": {
        "status": "HEALTH_WARN",
        "checks": {
            "RECENT_CRASH": {
                "severity": "HEALTH_WARN",
                "summary": {
                    "message": "1 daemons have recently crashed",
                    "count": 1
                },
                "detail": [
                    {
                        "message": "mds.ocs-storagecluster-cephfilesystem-a crashed on host rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-76647b8czhpp5 at 2022-06-30T21:51:01.775596Z"
                    }
                ],
                "muted": false
            }
        }
    }
}
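For reference, the teardown health check trips on exactly this kind of warning by reading the cluster health JSON. A minimal sketch of such a check follows (this is not the ocs-ci implementation; the openshift-storage namespace and the rook-ceph-tools toolbox deployment name are assumptions based on a default ODF install):

import json
import subprocess

# Run ceph commands through the toolbox deployment (names are assumptions).
TOOLBOX = ["oc", "-n", "openshift-storage", "exec", "deploy/rook-ceph-tools", "--"]

def ceph_health() -> dict:
    out = subprocess.check_output(
        [*TOOLBOX, "ceph", "health", "detail", "--format", "json"])
    return json.loads(out)

def assert_no_recent_crash() -> None:
    # Fail the check if any daemon crash was reported recently.
    checks = ceph_health().get("checks", {})
    if "RECENT_CRASH" in checks:
        messages = [d.get("message", "")
                    for d in checks["RECENT_CRASH"].get("detail", [])]
        raise AssertionError(f"daemons recently crashed: {messages}")

if __name__ == "__main__":
    assert_no_recent_crash()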

The mds pod crashed for the following reason:

2022-06-30T21:50:47.422596523Z debug 2022-06-30T21:50:47.421+0000 7f41de37f900  0 pidfile_write: ignore empty --pid-file
2022-06-30T21:50:47.423097588Z starting mds.ocs-storagecluster-cephfilesystem-a at 
2022-06-30T21:50:47.428799237Z debug 2022-06-30T21:50:47.428+0000 7f41cc4ca700  1 mds.ocs-storagecluster-cephfilesystem-a Updating MDS map to version 52 from mon.1
2022-06-30T21:50:48.412591616Z debug 2022-06-30T21:50:48.412+0000 7f41cc4ca700  1 mds.ocs-storagecluster-cephfilesystem-a Updating MDS map to version 53 from mon.1
2022-06-30T21:50:48.412655414Z debug 2022-06-30T21:50:48.412+0000 7f41cc4ca700  1 mds.ocs-storagecluster-cephfilesystem-a Monitors have assigned me to become a standby.
2022-06-30T21:50:48.423681377Z debug 2022-06-30T21:50:48.423+0000 7f41cc4ca700  1 mds.ocs-storagecluster-cephfilesystem-a Updating MDS map to version 54 from mon.1
2022-06-30T21:50:48.426313648Z debug 2022-06-30T21:50:48.423+0000 7f41cc4ca700  1 mds.0.0 handle_mds_map i am now mds.83389.0 replaying mds.0.0
2022-06-30T21:50:48.426313648Z debug 2022-06-30T21:50:48.423+0000 7f41cc4ca700  1 mds.0.0 handle_mds_map state change up:boot --> up:standby-replay
2022-06-30T21:50:48.426313648Z debug 2022-06-30T21:50:48.423+0000 7f41cc4ca700  1 mds.0.0 replay_start
2022-06-30T21:50:48.438768590Z debug 2022-06-30T21:50:48.438+0000 7f41c64be700  0 mds.0.cache creating system inode with ino:0x100
2022-06-30T21:50:48.438815024Z debug 2022-06-30T21:50:48.438+0000 7f41c64be700  0 mds.0.cache creating system inode with ino:0x1
2022-06-30T21:50:59.055822268Z debug 2022-06-30T21:50:59.055+0000 7f41ce4ce700  1 mds.ocs-storagecluster-cephfilesystem-a asok_command: status {prefix=status} (starting...)
2022-06-30T21:51:01.772802133Z /builddir/build/BUILD/ceph-16.2.8/src/mds/MDLog.cc: In function 'void MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)' thread 7f41cc4ca700 time 2022-06-30T21:51:01.772738+0000
2022-06-30T21:51:01.772802133Z /builddir/build/BUILD/ceph-16.2.8/src/mds/MDLog.cc: 281: FAILED ceph_assert(!mds->is_any_replay())
2022-06-30T21:51:01.774204679Z  ceph version 16.2.8-59.el8cp (4e10c5fa8a9bc0a421a4dd0833f951ab1cfdcfa7) pacific (stable)
2022-06-30T21:51:01.774204679Z  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f41d4ef7eb8]
2022-06-30T21:51:01.774204679Z  2: /usr/lib64/ceph/libceph-common.so.2(+0x2780d2) [0x7f41d4ef80d2]
2022-06-30T21:51:01.774204679Z  3: (MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)+0x3f) [0x5565f835324f]
2022-06-30T21:51:01.774204679Z  4: (Server::journal_close_session(Session*, int, Context*)+0x78c) [0x5565f80816dc]
2022-06-30T21:51:01.774204679Z  5: (Server::kill_session(Session*, Context*)+0x212) [0x5565f8081e32]
2022-06-30T21:51:01.774204679Z  6: (Server::apply_blocklist()+0x10d) [0x5565f80820ed]
2022-06-30T21:51:01.774204679Z  7: (MDSRank::apply_blocklist(std::set<entity_addr_t, std::less<entity_addr_t>, std::allocator<entity_addr_t> > const&, unsigned int)+0x34) [0x5565f803ea24]
2022-06-30T21:51:01.774204679Z  8: (MDSRankDispatcher::handle_osd_map()+0xf6) [0x5565f803ed66]
2022-06-30T21:51:01.774204679Z  9: (MDSDaemon::handle_core_message(boost::intrusive_ptr<Message const> const&)+0x33b) [0x5565f8028d9b]
2022-06-30T21:51:01.774204679Z  10: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0xc3) [0x5565f8029683]
2022-06-30T21:51:01.774204679Z  11: (DispatchQueue::entry()+0x126a) [0x7f41d513feca]
2022-06-30T21:51:01.774204679Z  12: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f41d51f26d1]
2022-06-30T21:51:01.774204679Z  13: /lib64/libpthread.so.0(+0x81cf) [0x7f41d3eda1cf]
2022-06-30T21:51:01.774204679Z  14: clone()
2022-06-30T21:51:01.774236825Z debug *** Caught signal (Aborted) **
2022-06-30T21:51:01.774236825Z  in thread 7f41cc4ca700 thread_name:ms_dispatch
2022-06-30T21:51:01.774254086Z 2022-06-30T21:51:01.772+0000 7f41cc4ca700 -1 /builddir/build/BUILD/ceph-16.2.8/src/mds/MDLog.cc: In function 'void MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)' thread 7f41cc4ca700 time 2022-06-30T21:51:01.772738+0000
2022-06-30T21:51:01.774254086Z /builddir/build/BUILD/ceph-16.2.8/src/mds/MDLog.cc: 281: FAILED ceph_assert(!mds->is_any_replay())
2022-06-30T21:51:01.774254086Z 
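From the backtrace, the assertion fires because an OSD map carrying new blocklist entries reached this MDS while it was still in up:standby-replay: killing the blocklisted sessions tries to submit a journal entry, and MDLog::_submit_entry asserts that the MDS is not replaying. For triage, the same crash metadata can be pulled with the ceph crash interface; a rough Python sketch (again assuming the default openshift-storage namespace and rook-ceph-tools toolbox deployment):

import json
import subprocess

TOOLBOX = ["oc", "-n", "openshift-storage", "exec", "deploy/rook-ceph-tools", "--"]

def ceph(*args: str) -> str:
    return subprocess.check_output([*TOOLBOX, "ceph", *args], text=True)

# List crash reports and dump the assert/backtrace of any MDS crash.
for crash in json.loads(ceph("crash", "ls", "--format", "json")):
    info = json.loads(ceph("crash", "info", crash["crash_id"]))
    if info.get("entity_name", "").startswith("mds."):
        print(info.get("assert_condition"), info.get("assert_file"),
              info.get("assert_line"))
        print("\n".join(info.get("backtrace", [])))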

TC - https://github.com/red-hat-storage/ocs-ci/blob/master/tests/e2e/workloads/ocp/registry/test_registry_reboot_node.py


[1]: Jenkins job link - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/14170/

[2]: must-gather log link - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/mashetty-wl29/mashetty-wl29_20220630T061327/logs/failed_testcase_ocs_logs_1656572860/test_registry_rolling_reboot_node%5bmaster%5d_ocs_logs/ocs_must_gather/

[3]: Console output - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/mashetty-wl29/mashetty-wl29_20220630T061327/logs/

This test case reboots the cluster nodes one by one, in a rolling manner.
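For illustration only (this is not the ocs-ci implementation), the shape of such a rolling reboot is sketched below; the node selection, reboot method, and timeout are assumptions:

import subprocess

def oc(*args: str) -> str:
    return subprocess.check_output(["oc", *args], text=True)

nodes = oc("get", "nodes",
           "-o", "jsonpath={.items[*].metadata.name}").split()

for node in nodes:
    oc("adm", "cordon", node)
    # Reboot from a debug pod; the exec connection drops when the node goes
    # down, so the non-zero return code is ignored.
    subprocess.run(["oc", "debug", f"node/{node}", "--",
                    "chroot", "/host", "systemctl", "reboot"], check=False)
    # A real test would first wait for the node to go NotReady before
    # waiting for Ready again; omitted here for brevity.
    oc("wait", f"node/{node}", "--for=condition=Ready", "--timeout=15m")
    oc("adm", "uncordon", node)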

Version of all relevant components (if applicable):
ODF-4.11
OCP-4.11

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Run the whole workload suite.
2. When the tests.e2e.workloads.ocp.registry.test_registry_reboot_node.TestRegistryRebootNode test runs, it fails on teardown because mds-a crashed.


Actual results:
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a crashing

Expected results:
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a pod should not crash.

Additional info:

Comment 2 Travis Nielsen 2022-07-07 21:35:00 UTC
To help understand the impact:
- Is this a new test suite?
- How often does the crash happen? Is it rare, frequent, or consistent?
- After the mds crashes, does the test continue successfully after the mds comes back up? 
- This shouldn't affect the functionality since this mds is the standby, per the log:
"Monitors have assigned me to become a standby."

Scott, could someone from cephfs take a look at the mds crash?

Comment 3 avdhoot 2022-07-18 07:10:07 UTC
@tnielsen
Please find the answers below:

To help understand the impact:
- Is this a new test suite? --> No
- How often does the crash happen? Is it rare, frequent, or consistent? --> Rare
- After the mds crashes, does the test continue successfully after the mds comes back up? --> Not clear, because the cluster gets destroyed afterwards.
- This shouldn't affect the functionality since this mds is the standby. --> Can you elaborate on this?

Comment 4 avdhoot 2022-08-09 06:50:52 UTC
Hi tnielsen 
Is there any update on this?

Comment 5 Venky Shankar 2022-08-09 07:04:15 UTC
Duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2105881

Comment 6 Travis Nielsen 2022-08-09 20:54:49 UTC
Which version of RHCS did the fix make it into, so we can track whether it's fixed in ODF?

Comment 7 Venky Shankar 2022-08-10 14:54:54 UTC
(In reply to Travis Nielsen from comment #6)
> Which version of RHCS did the fix make it into, so we can track whether it's
> fixed in ODF?

RHCS 5.2

Comment 8 Travis Nielsen 2022-08-10 20:01:39 UTC
Thanks, we should be able to move this to ON_QA since the latest 5.2 should be there for the 4.11 build.

Comment 9 Travis Nielsen 2022-08-11 18:05:58 UTC
This is already in RHCS 5.2, so we just need acks for 4.11

