Bug 2189135 - [GSS] mds is reporting 'handling map in rankless mode'
Summary: [GSS] mds is reporting 'handling map in rankless mode'
Keywords:
Status: NEW
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 4.2
Hardware: All
OS: All
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 6.1z2
Assignee: Venky Shankar
QA Contact: Amarnath
URL:
Whiteboard:
Depends On: 2182184
Blocks:
 
Reported: 2023-04-24 10:40 UTC by Mudit Agarwal
Modified: 2023-08-03 08:29 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2182184
Environment:
Last Closed:
Embargoed:




Links:
Red Hat Issue Tracker RHCEPH-6514 (last updated 2023-04-24 10:41:03 UTC)

Description Mudit Agarwal 2023-04-24 10:40:12 UTC
+++ This bug was initially created as a clone of Bug #2182184 +++



--- Additional comment from  on 2023-03-27 19:53:48 UTC ---

Description of problem (please be as detailed as possible and provide log
snippets):
After migration using 'Zerto', the customer is seeing MDS pods crashing and Ceph reporting 1 filesystem offline with the MDS daemon damaged:

$  cat ../../../ceph/must_gather_commands/ceph_status
  cluster:
    id:     c813d12f-6968-4ac2-8a1c-cd17c74a900f
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged
            28 daemons have recently crashed

Version of all relevant components (if applicable):
ODF 4.10.11

Additional info:
This is what's observed in the MDS logs:

2023-03-24T13:46:59.725030498Z debug 2023-03-24T13:46:59.723+0000 7f8019e10700 10 mds.ocs-storagecluster-cephfilesystem-a handle_mds_map: handling map in rankless mode

This message indicates that the MDS does not hold a rank, but why?
ref: https://github.com/ceph/ceph/blob/main/src/mds/MDSDaemon.cc#L749
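For context, whether a rank is assigned at all (and which daemon holds it) can be checked from the toolbox pod; a minimal sketch, using the filesystem name from the log line above:

$ ceph mds stat
$ ceph fs status ocs-storagecluster-cephfilesystem
$ ceph fs dump

The fs dump output includes the full MDSMap (the 'up', 'failed' and 'damaged' sets), which should show whether rank 0 is sitting in the damaged set rather than being assigned to this daemon.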

Reviewing [1], I tried to follow the troubleshooting steps to figure out why, but we don't have MDS logs that go back to when the issue started. I'm checking to see if we can get those logs so we can see what caused the corruption.

Actions taken thus far:
Tried to mark the mds as active, but it didn't help:
# ceph mds repaired ocs-storagecluster-cephfilesystem:0   (this should restore the rank and mark the mds as active)

Tried to reset the filesystem following [2], but no luck:
# ceph fs reset ocs-storagecluster-cephfilesystem --yes-i-really-mean-it
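
Before re-running the reset, the state of the rank 0 journal could be inspected and backed up with cephfs-journal-tool; a sketch, assuming rank 0 and that the tool plus an admin keyring are available in the pod it is run from:

$ cephfs-journal-tool --rank=ocs-storagecluster-cephfilesystem:0 journal inspect
$ cephfs-journal-tool --rank=ocs-storagecluster-cephfilesystem:0 journal export /tmp/rank0-journal.bin

'journal inspect' reports whether the journal is readable or corrupt, and the export gives a backup to fall back on before any destructive recovery step.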

[1] https://www.spinics.net/lists/ceph-users/msg39362.html
[2] https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.12/html-single/troubleshooting_openshift_data_foundation/index#restoring_the_cephfs

All logs we have thus far are yanked and on supportshell:
$  ls -lt /cases/03470375
total 112
drwxrwxrwx. 2 yank yank    72 Mar 27 12:22 0030-ceph_fs_dump.tar.gz
drwxrwxrwx. 3 yank yank    59 Mar 27 12:16 0080-ODF-must-gather.local.6974069168264631887.tar.gz
drwxrwxrwx. 2 yank yank  4096 Mar 27 11:10 0070-rook-ceph-mds-ocs-storagecluster-cephfilesystem.tar.gz
drwxrwxrwx. 3 yank yank    59 Mar 26 11:29 0070-must-gather.local.5894513357504944487.tar.gz
-rw-rw-rw-. 1 yank yank   801 Mar 25 13:56 0060-damage_ls.out
-rw-rw-rw-. 1 yank yank   124 Mar 25 12:28 0050-ceph_tell_mds.0_damage_ls.out
drwxrwxrwx. 3 yank yank    59 Mar 24 14:32 0040-must-gather.local.5432157191433132400.tar.gz
-rw-rw-rw-. 1 yank yank 99855 Mar 24 13:19 0020-Screenshot_2023-03-24_at_09.15.28.png
drwxrwxrwx. 3 yank yank    59 Mar 24 00:16 0010-must-gather.local.2180438445997783092.tar.gz

Please let us know if anything else is required.

--- Additional comment from RHEL Program Management on 2023-03-27 19:53:57 UTC ---

This bug previously had no release flag set; the release flag 'odf-4.13.0' has now been set to '?', so the bug is proposed to be fixed in the ODF 4.13.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were set while the release flag was missing, have now been reset, since Acks are to be set against a release flag.

--- Additional comment from RHEL Program Management on 2023-03-27 19:53:57 UTC ---

Since this bug has severity set to 'urgent', it is being proposed as a blocker for the currently set release flag. Please resolve ASAP.

--- Additional comment from Venky Shankar on 2023-03-28 04:33:22 UTC ---

Milind, PTAL.

--- Additional comment from  on 2023-03-30 23:13:34 UTC ---

Hello,

Do we have an update we can share with the customer?

--- Additional comment from  on 2023-04-05 16:02:11 UTC ---

Hi,

The DR exercise has ended; the customer no longer has access to the DR environment and will not be able to provide any further logs or outputs. However, they are still looking for an RCA for this issue.

--- Additional comment from  on 2023-04-10 16:21:29 UTC ---

Hello,

It's me again. Any updates?

--- Additional comment from Milind Changire on 2023-04-11 11:47:06 UTC ---

(In reply to kelwhite from comment #7)
> Hello,
> 
> It's me again. Any updates?

so far ...
the MDS assigned rank 0 (mds.b) was unable to find the journal during startup, and hence declared itself damaged and respawned as a standby
-----

it's difficult to speculate about what could have happened in the containerized configuration;
only more logs will help identify the root cause


why the journal went missing is the question we'd like answered
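
If more logs do turn up, one quick check is whether the rank 0 journal objects are still present in the metadata pool (the rank 0 journal lives in inode 0x200, so its objects are named 200.*); a sketch, assuming the default ODF metadata pool name:

$ rados -p ocs-storagecluster-cephfilesystem-metadata ls | grep '^200\.'
$ rados -p ocs-storagecluster-cephfilesystem-metadata stat 200.00000000

An empty listing (or a missing 200.00000000 header object) would confirm that the journal objects themselves are gone rather than merely unreadable.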

--- Additional comment from Venky Shankar on 2023-04-11 13:35:16 UTC ---

(In reply to Milind Changire from comment #8)
> (In reply to kelwhite from comment #7)
> > Hello,
> > 
> > It's me again. Any updates?
> 
> so far ...
> the mds assigned as rank 0 (mds.b) was unable to find the journal during
> startup and hence declared itself as damaged and respawned as standby

Do we know why the journal objects are missing? Were any PGs lost that were possibly storing the journal objects?

Check the case history (https://access.redhat.com/support/cases/03470375) to see if it gives any hint.
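
To rule that out, a few checks from the toolbox pod would show whether any PGs backing the metadata pool were ever lost or have unfound objects; a sketch (metadata pool name assumed to be the ODF default):

$ ceph health detail
$ ceph osd pool stats ocs-storagecluster-cephfilesystem-metadata
$ ceph pg ls-by-pool ocs-storagecluster-cephfilesystem-metadata | grep -v 'active+clean'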

--- Additional comment from Milind Changire on 2023-04-11 14:05:33 UTC ---

the only explicit mention of PGs in the case history is that all the PGs are active+clean
-----
fneloscp401:/u/an81
$ oc rsh rook-ceph-tools-5789888bc5-jnbzg
sh-4.4$ ceph mds repaired ocs-storagecluster-cephfilesystem:0
repaired: restoring rank 1:0
sh-4.4$ exit
exit
fneloscp401:/u/an81
$ oc exec rook-ceph-tools-5789888bc5-jnbzg -- ceph -s
  cluster:
    id:     c813d12f-6968-4ac2-8a1c-cd17c74a900f
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged

  services:
    mon: 3 daemons, quorum b,c,g (age 39h)
    mgr: a(active, since 39h)
    mds: 0/1 daemons up, 2 standby
    osd: 6 osds: 6 up (since 3d), 6 in (since 8w)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 0/1 healthy, 1 recovering; 1 damaged
    pools:   11 pools, 273 pgs
    objects: 532.04k objects, 2.0 TiB
    usage:   5.9 TiB used, 6.1 TiB / 12 TiB avail
    pgs:     273 active+clean

  io:
    client:   45 MiB/s rd, 2.4 MiB/s wr, 12 op/s rd, 130 op/s wr
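
Worth noting: 'ceph mds repaired' only clears the rank from the damaged set in the MDSMap; if the MDS that then picks up rank 0 still cannot read the journal, the rank is immediately marked damaged again, which matches the status above. A sketch of checks that would confirm this (filesystem name taken from the case):

$ ceph fs get ocs-storagecluster-cephfilesystem | grep -i damaged
$ ceph crash ls-new

'ceph crash ls-new' lists any recent daemon crash reports that might carry the corresponding MDS backtraces.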

--- Additional comment from  on 2023-04-11 17:30:18 UTC ---

Hello,

Thank you! If we have missing objects, I'd assume this is due to the way Zerto handled the failover. I'll ask the customer if they can provide any background on this.

--- Additional comment from Mudit Agarwal on 2023-04-24 10:39:26 UTC ---

Priority has been reduced, so I am removing the blocker? keyword. Will open a Ceph clone.

