+++ This bug was initially created as a clone of Bug #2182184 +++

--- Additional comment from on 2023-03-27 19:53:48 UTC ---

Description of problem (please be detailed as possible and provide log snippets):

After a migration using 'Zerto', the customer is seeing MDS pods crashing and Ceph reporting 1 fs offline, mds damaged:

$ cat ../../../ceph/must_gather_commands/ceph_status
  cluster:
    id:     c813d12f-6968-4ac2-8a1c-cd17c74a900f
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged
            28 daemons have recently crashed

Version of all relevant components (if applicable):
ODF 4.10.11

Additional info:

This is what's observed in the MDS logs:

2023-03-24T13:46:59.725030498Z debug 2023-03-24T13:46:59.723+0000 7f8019e10700 10 mds.ocs-storagecluster-cephfilesystem-a handle_mds_map: handling map in rankless mode

This error indicates the MDS does not hold a rank, but why? (See the diagnostic sketch after this run of comments.)
ref: https://github.com/ceph/ceph/blob/main/src/mds/MDSDaemon.cc#L749

Reviewing [1], I tried to follow the troubleshooting steps to figure out why, but we don't have MDS logs that go back to when the issue started. I'm checking to see if we can get these logs so we can see what caused the corruption.

Actions taken thus far:

Tried to mark the MDS as active, but it didn't help:
# ceph mds repaired ocs-storagecluster-cephfilesystem:0
(this should restore the rank and mark the MDS as active)

Tried to reset the filesystem following [2], but no luck:
# ceph fs reset ocs-storagecluster-cephfilesystem --yes-i-really-mean-it

[1] https://www.spinics.net/lists/ceph-users/msg39362.html
[2] https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.12/html-single/troubleshooting_openshift_data_foundation/index#restoring_the_cephfs

All logs we have thus far are yanked and on supportshell:

$ ls -lt /cases/03470375
total 112
drwxrwxrwx. 2 yank yank    72 Mar 27 12:22 0030-ceph_fs_dump.tar.gz
drwxrwxrwx. 3 yank yank    59 Mar 27 12:16 0080-ODF-must-gather.local.6974069168264631887.tar.gz
drwxrwxrwx. 2 yank yank  4096 Mar 27 11:10 0070-rook-ceph-mds-ocs-storagecluster-cephfilesystem.tar.gz
drwxrwxrwx. 3 yank yank    59 Mar 26 11:29 0070-must-gather.local.5894513357504944487.tar.gz
-rw-rw-rw-. 1 yank yank   801 Mar 25 13:56 0060-damage_ls.out
-rw-rw-rw-. 1 yank yank   124 Mar 25 12:28 0050-ceph_tell_mds.0_damage_ls.out
drwxrwxrwx. 3 yank yank    59 Mar 24 14:32 0040-must-gather.local.5432157191433132400.tar.gz
-rw-rw-rw-. 1 yank yank 99855 Mar 24 13:19 0020-Screenshot_2023-03-24_at_09.15.28.png
drwxrwxrwx. 3 yank yank    59 Mar 24 00:16 0010-must-gather.local.2180438445997783092.tar.gz

Please let us know if anything else is required.

--- Additional comment from RHEL Program Management on 2023-03-27 19:53:57 UTC ---

This bug previously had no release flag set; the release flag 'odf-4.13.0' has now been set to '?', so the bug is proposed to be fixed in the ODF 4.13.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were set while the release flag was missing, have been reset, since Acks are set against a release flag.

--- Additional comment from RHEL Program Management on 2023-03-27 19:53:57 UTC ---

Since this bug has severity set to 'urgent', it is being proposed as a blocker for the currently set release flag. Please resolve ASAP.

--- Additional comment from Venky Shankar on 2023-03-28 04:33:22 UTC ---

Milind, PTAL.

--- Additional comment from on 2023-03-30 23:13:34 UTC ---

Hello,

Do we have an update we can share with the customer?
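For reference, a minimal sketch of the checks that would show why no MDS holds rank 0. This assumes the commands are run from the rook-ceph-tools pod; the filesystem name is taken from the log snippet above, and mds.0 only responds once a rank-0 daemon is up:

$ ceph fs dump                                      # which ranks exist, and whether rank 0 is marked failed/damaged
$ ceph fs status ocs-storagecluster-cephfilesystem  # current MDS-to-rank assignments and standbys
$ ceph tell mds.0 damage ls                         # recorded metadata damage entries (cf. 0050-ceph_tell_mds.0_damage_ls.out)
$ ceph crash ls                                     # timestamps for the "28 daemons have recently crashed" entries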
--- Additional comment from on 2023-04-05 16:02:11 UTC ---

Hi,

The DR exercise has ended and the customer no longer has access to the DR environment, so they will not be able to provide any further logs or outputs. However, they are still looking for an RCA for this issue.

--- Additional comment from on 2023-04-10 16:21:29 UTC ---

Hello,

It's me again. Any updates?

--- Additional comment from Milind Changire on 2023-04-11 11:47:06 UTC ---

(In reply to kelwhite from comment #7)
> Hello,
>
> It's me again. Any updates?

So far: the MDS assigned rank 0 (mds.b) was unable to find the journal during startup, and hence declared itself damaged and respawned as a standby.
-----
It's difficult to speculate about what could have happened in the containerized configuration; only more logs will help identify the root cause. Why the journal went missing is the question we'd like answered.

--- Additional comment from Venky Shankar on 2023-04-11 13:35:16 UTC ---

(In reply to Milind Changire from comment #8)
> (In reply to kelwhite from comment #7)
> > Hello,
> >
> > It's me again. Any updates?
>
> So far: the MDS assigned rank 0 (mds.b) was unable to find the journal during
> startup, and hence declared itself damaged and respawned as a standby.

Do we know why the journal objects are missing? Were any PGs lost that were possibly storing the journal objects? Check the case history (https://access.redhat.com/support/cases/03470375) to see if that gives any hint.

--- Additional comment from Milind Changire on 2023-04-11 14:05:33 UTC ---

The only explicit mention of PGs in the case history is that all the PGs are active+clean.
-----
fneloscp401:/u/an81 $ oc rsh rook-ceph-tools-5789888bc5-jnbzg
sh-4.4$ ceph mds repaired ocs-storagecluster-cephfilesystem:0
repaired: restoring rank 1:0
sh-4.4$ exit
exit
fneloscp401:/u/an81 $ oc exec rook-ceph-tools-5789888bc5-jnbzg -- ceph -s
  cluster:
    id:     c813d12f-6968-4ac2-8a1c-cd17c74a900f
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged

  services:
    mon: 3 daemons, quorum b,c,g (age 39h)
    mgr: a(active, since 39h)
    mds: 0/1 daemons up, 2 standby
    osd: 6 osds: 6 up (since 3d), 6 in (since 8w)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 0/1 healthy, 1 recovering; 1 damaged
    pools:   11 pools, 273 pgs
    objects: 532.04k objects, 2.0 TiB
    usage:   5.9 TiB used, 6.1 TiB / 12 TiB avail
    pgs:     273 active+clean

  io:
    client: 45 MiB/s rd, 2.4 MiB/s wr, 12 op/s rd, 130 op/s wr

--- Additional comment from on 2023-04-11 17:30:18 UTC ---

Hello,

Thank you! If we have missing objects, I'd assume this is from the way Zerto handled the failover. I'll ask the customer if they can provide any background on this.

--- Additional comment from Mudit Agarwal on 2023-04-24 10:39:26 UTC ---

Priority is reduced, removing the 'blocker?' keyword. Will open a Ceph clone.
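For completeness, a sketch of the checks that would confirm whether the rank-0 journal objects were actually lost. The metadata pool name below assumes the ODF default (<fsname>-metadata); rank 0's journal is inode 0x200, so its header object is 200.00000000:

$ ceph health detail                                                      # reports unfound/lost objects, if any
$ rados -p ocs-storagecluster-cephfilesystem-metadata stat 200.00000000   # rank-0 journal header object
$ rados -p ocs-storagecluster-cephfilesystem-metadata ls | grep '^200\.'  # remaining rank-0 journal segments
$ cephfs-journal-tool --rank=ocs-storagecluster-cephfilesystem:0 journal inspect   # journal integrity / missing ranges

If the journal objects are absent while all 273 PGs are active+clean, that would point to the objects having been removed or never replicated during the Zerto failover rather than to PG-level data loss within Ceph.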