RDR is Regional Disaster Recovery, where we have 3 OCP clusters tied together. The 1st is the hub, where RHACM is installed; it manages the other 2 OCP clusters using the Submariner add-on. The 2nd and 3rd clusters have ODF installed and are connected via Submariner. In case of a disaster, failover/relocate operations are performed between them to recover and ensure business continuity with minimal data loss and minimal application downtime. At any given time, one of the managed clusters with ODF acts as the primary cluster where IOs are run, and the other acts as the secondary (standby) cluster to which data is continuously replicated and which can take over in case of such an event.
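For reference, a rough way to inspect this topology from the hub cluster; this is only a sketch assuming the usual RHACM/ODF DR resources are present, and output will differ per environment:

  # Run on the hub cluster
  oc get managedclusters              # the two ODF-backed managed clusters imported into RHACM
  oc get drpolicy                     # DRPolicy pairing the two managed clusters for failover/relocate
  oc get drplacementcontrol -A        # per-workload DR state (which cluster is primary, failover progress)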
Milind, PTAL.
Milind, did you get a chance to take a look?
Milind, this question should be directed to the ceph build team (Ken Dryer). We just consume the tag provided by the ceph build team when building ODF. If you want build details, this link can help: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=2543192
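For completeness, a quick way to confirm which ceph build an ODF cluster is actually running before looking it up in Brew; the toolbox label and namespace below are the usual ODF defaults, so treat them as assumptions:

  # From the ODF toolbox pod
  oc -n openshift-storage rsh "$(oc get po -n openshift-storage -l app=rook-ceph-tools -o name)" ceph versions
  # The reported NVR (e.g. 18.2.1-136.el9cp) is the ceph build to search for in brewweb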
The full impact of this BZ on an RDR setup is yet to be assessed, as we couldn't perform failover/relocate operations on CephFS-based workloads due to submariner connectivity issues (https://issues.redhat.com/browse/ACM-7600). However, both the active and the standby MDS crash multiple times while running IOs, as reported above, and the issue reproduces consistently, which makes it crucial.
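For anyone trying to confirm the crash loop on a similar setup, these are roughly the checks we rely on; the namespace and labels are the ODF defaults, so adjust if your install differs:

  # Restart counts on the MDS pods keep climbing while IOs are running
  oc get pods -n openshift-storage -l app=rook-ceph-mds -o wide
  # New crash entries keep appearing for both filesystem-a and filesystem-b MDS daemons
  oc -n openshift-storage rsh "$(oc get po -n openshift-storage -l app=rook-ceph-tools -o name)" ceph crash ls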
@vshankar, would it be possible to provide an ETA for this fix?
This crash was reproduced as reported in https://bugzilla.redhat.com/show_bug.cgi?id=2282346#c3

ceph version 18.2.1-136.el9cp (e7edde2b655d0dd9f860dda675f9d7954f07e6e3) reef (stable)
OCP 4.16.0-0.nightly-2024-04-26-145258
ODF 4.16.0-89.stable
ACM 2.10.2
MCE 2.5.2

Steps to Reproduce:
*****Active hub co-situated with primary managed cluster*****
1. On an RDR setup, perform a site failure by bringing the active hub and the primary managed cluster down, then move to the passive hub by performing hub recovery.
2. Failover all the workloads running on the down managed cluster to the surviving managed cluster.
3. After successful failover, recover the down managed cluster.
4. Failover one of the CephFS workloads where PeerReady is marked as true but the replication destination isn't created because of the eviction period (24 hrs as of now).
5. Ensure the cluster is cleaned up after the eviction period times out, the failover is successful, and data sync is resumed between the managed clusters.

Actual results:
MDS crash is seen on the surviving cluster C2, to which the workloads were failed over.

pods | grep mds
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c   2/2   Running   1023 (9m33s ago)   7d13h   10.128.2.63    compute-2   <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dd58665vqjsp   2/2   Running   1006 (17m ago)     7d13h   10.131.0.241   compute-1   <none>   <none>

oc -n openshift-storage rsh "$(oc get po -n openshift-storage -l app=rook-ceph-tools -o name)" ceph crash ls
ID                                                                 ENTITY                                    NEW
2024-05-14T17:33:05.811016Z_b5585b5b-3a3a-4838-93d3-13ccfb04b669   mds.ocs-storagecluster-cephfilesystem-a
2024-05-14T18:52:04.860762Z_c49454e8-2180-42d0-b247-6b31619ecd12   mds.ocs-storagecluster-cephfilesystem-b
2024-05-15T11:11:58.726060Z_bf8483a9-8f22-465f-83c3-93b2f34710f3   mds.ocs-storagecluster-cephfilesystem-a   *
2024-05-15T12:26:15.553277Z_eae1ba47-82fd-4bf1-88c1-8a816f67ab65   mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-16T07:07:03.233208Z_2a86dc2e-e6cf-4f07-aa8b-9d7b9eec803f   mds.ocs-storagecluster-cephfilesystem-a   *
2024-05-17T10:56:29.510669Z_e8ff0c2f-d956-4af8-a540-6cb64374e4fd   mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-17T20:34:57.807764Z_dd0521e6-e92f-40e2-afe9-3c3f1768df07   mds.ocs-storagecluster-cephfilesystem-a   *
2024-05-17T21:56:39.158029Z_12c1efa9-ecfc-4c32-9024-77423ae09ecf   mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-18T00:36:35.787255Z_40eb624a-a7ed-4415-b17f-4085bd9eac9b   mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-18T03:32:14.163891Z_84c9b511-ac45-4bf2-9040-8014883e80a9   mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-19T12:56:35.497622Z_b3c014e8-cba6-496d-a673-e2b2f026be21   mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-20T00:27:27.924980Z_8085a329-6afb-4b49-a76c-b86db9d94109   mds.ocs-storagecluster-cephfilesystem-a   *
2024-05-20T08:06:13.398353Z_5ebabaee-807c-4910-ba27-14eeee4b4fba   mds.ocs-storagecluster-cephfilesystem-a   *
2024-05-20T17:27:47.267271Z_2c553b1c-d6db-43e8-9421-2994c1745e40   mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-20T18:18:11.530034Z_bf609bf3-25c1-4342-ad7b-f942579daeeb   mds.ocs-storagecluster-cephfilesystem-a   *
2024-05-21T22:21:43.087757Z_06bdcccb-e7f5-4f95-a310-a053aed2fc0a   mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-22T01:07:49.214137Z_bd01b408-376c-4c80-822e-ba3ae76e371f   mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-22T01:13:58.832386Z_9da0f105-acb7-4026-a497-0f0da1b77f08   mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-22T03:54:30.740192Z_0dc95829-aa4a-4e34-aaeb-39cd3cf4d7c7   mds.ocs-storagecluster-cephfilesystem-a   *

Out of all these crashes, the one below is the crash actually reported in this BZ.

bash-5.1$ ceph crash info 2024-05-17T10:56:29.510669Z_e8ff0c2f-d956-4af8-a540-6cb64374e4fd
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7ff74acafdb0]",
        "ceph-mds(+0x22d2ed) [0x557c354f42ed]",
        "ceph-mds(+0x5a7d02) [0x557c3586ed02]",
        "(EMetaBlob::fullbit::update_inode(MDSRank*, CInode*)+0x51) [0x557c35778fc1]",
        "(EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x79e) [0x557c3577fb6e]",
        "(EOpen::replay(MDSRank*)+0x55) [0x557c3578c6c5]",
        "(MDLog::_replay_thread()+0x75e) [0x557c356ea52e]",
        "ceph-mds(+0x16cf21) [0x557c35433f21]",
        "/lib64/libc.so.6(+0x9f802) [0x7ff74acfa802]",
        "/lib64/libc.so.6(+0x3f450) [0x7ff74ac9a450]"
    ],
    "ceph_version": "18.2.1-136.el9cp",
    "crash_id": "2024-05-17T10:56:29.510669Z_e8ff0c2f-d956-4af8-a540-6cb64374e4fd",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-b",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "534595eadbe3cd5e36a861179a9d229df6085a48ed4bf3ee7982825650a239f5",
    "timestamp": "2024-05-17T10:56:29.510669Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dd58665vqjsp",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
}

This crash may have repeated on this setup, but I am pasting just one of its occurrences.

Must-gather logs from the cluster are kept here:
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/20may24-mds-high-log-level/
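For context on the "mds-high-log-level" must-gather above: logs were collected with a raised MDS debug level. A rough sketch of the kind of commands used for that from the toolbox pod is below; the specific levels (20/1) are common triage values and should be treated as illustrative, not necessarily the exact settings applied on this cluster:

  # Run from the rook-ceph-tools pod
  ceph config set mds debug_mds 20
  ceph config set mds debug_ms 1
  # ...let the crash reproduce, then pull details for the new crash entry
  ceph crash ls
  ceph crash info <crash-id>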