Found a QA run where an MDS was stuck in up:resolve: https://pulpito.ceph.com/pdonnell-2021-11-05_19:13:39-fs:upgrade-wip-pdonnell-testing-20211105.172813-distro-basic-smithi/6488028/

This occurs in a multi-MDS cluster. The cause is that the other active MDS drops the new MDS's messages:

```
2021-11-05T20:08:26.796+0000 7fb4eae68700 1 --2- [v2:172.21.15.98:6826/647991324,v1:172.21.15.98:6827/647991324] >> [v2:172.21.15.125:6826/1162235965,v1:172.21.15.125:6827/1162235965] conn(0x562dc992c400 0x562dcc09f400 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rev1=0 rx=0 tx=0)._handle_peer_banner_payload supported=1 required=0
2021-11-05T20:08:26.796+0000 7fb4eae68700 1 --2- [v2:172.21.15.98:6826/647991324,v1:172.21.15.98:6827/647991324] >> [v2:172.21.15.125:6826/1162235965,v1:172.21.15.125:6827/1162235965] conn(0x562dc992c400 0x562dcc09f400 crc :-1 s=READY pgs=6 cs=0 l=0 rev1=1 rx=0 tx=0).ready entity=mds.? client_cookie=25cbe7aa447d9f35 server_cookie=33eddd17bae5e981 in_seq=0 out_seq=0
...
2021-11-05T20:08:31.634+0000 7fb4e8663700 1 -- [v2:172.21.15.98:6826/647991324,v1:172.21.15.98:6827/647991324] <== mds.? v2:172.21.15.125:6826/1162235965 1 ==== mdsmap(e 21) v2 ==== 933+0+0 (crc 0 0 0) 0x562dc51424e0 con 0x562dc992c400
2021-11-05T20:08:31.634+0000 7fb4e8663700 5 mds.cephfs.smithi098.pucypu handle_mds_map old map epoch 21 <= 21, discarding
2021-11-05T20:08:31.634+0000 7fb4e8663700 1 -- [v2:172.21.15.98:6826/647991324,v1:172.21.15.98:6827/647991324] <== mds.? v2:172.21.15.125:6826/1162235965 2 ==== mds_table_request(snaptable server_ready) v1 ==== 16+0+0 (crc 0 0 0) 0x562dc95be300 con 0x562dc992c400
2021-11-05T20:08:31.634+0000 7fb4e8663700 5 mds.1.6 got mds_table_request(snaptable server_ready) v1 from down/old/bad/imposter mds mds.?, dropping
2021-11-05T20:08:31.634+0000 7fb4e8663700 1 -- [v2:172.21.15.98:6826/647991324,v1:172.21.15.98:6827/647991324] <== mds.? v2:172.21.15.125:6826/1162235965 3 ==== mds_resolve(2+0 subtrees +0 peer requests) v1 ==== 89+0+0 (crc 0 0 0) 0x562dcc33e5a0 con 0x562dc992c400
2021-11-05T20:08:31.634+0000 7fb4e8663700 5 mds.1.6 got mds_resolve(2+0 subtrees +0 peer requests) v1 from down/old/bad/imposter mds mds.?, dropping
```

From: /ceph/teuthology-archive/pdonnell-2021-11-05_19:13:39-fs:upgrade-wip-pdonnell-testing-20211105.172813-distro-basic-smithi/6488028/remote/smithi098/log/cb9d093a-3e72-11ec-8c28-001a4aab830c/ceph-mds.cephfs.smithi098.pucypu.log-20211106.gz

Rank 1 opened a connection with rank 0 while rank 0 was still up:replay. This happened before rank 0 had processed its state change from mdsmap e19 and updated its "myname" with the messenger:

https://github.com/ceph/ceph/blob/fb8671c5733dc4dfed79e42deafd33c46e78c519/src/mds/MDSRank.cc#L2250-L2257

Messenger ProtocolV2 now associates the daemon type / rank with the connection at creation time, so any later updates by rank 0 to its name are no longer propagated to its peers.
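For anyone re-checking this against the archived log, the two pieces of evidence above (the connection becoming ready with entity=mds.? and the subsequent drops) can be pulled out with a couple of greps. This is just a convenience sketch; the path is the archive file quoted above and the search strings are taken verbatim from the log excerpt:

```
#!/bin/sh
# Convenience sketch for re-checking the evidence above against the archived
# rank 1 MDS log (the gzipped file referenced in this comment); adjust LOG if
# the archive has moved.
LOG=/ceph/teuthology-archive/pdonnell-2021-11-05_19:13:39-fs:upgrade-wip-pdonnell-testing-20211105.172813-distro-basic-smithi/6488028/remote/smithi098/log/cb9d093a-3e72-11ec-8c28-001a4aab830c/ceph-mds.cephfs.smithi098.pucypu.log-20211106.gz

# The connection from the recovering rank comes up still identified as "mds.?" ...
zgrep -F 'ready entity=mds.?' "$LOG"

# ... and rank 1 then drops everything sent over it, including the resolve messages.
zgrep -F 'from down/old/bad/imposter mds' "$LOG"
```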
Patrick, please create an MR.
Hi,

I have tested this with the below steps:

1. Created a filesystem with 2 active MDS and 1 standby.
2. Mounted a client and started IOs on it.
3. Initiated "ceph mds fail 0".
4. The standby MDS became active with the following state changes: replay -> resolve -> reconnect -> rejoin -> clientreplay -> active.
5. I did not observe any message drops in the MDS logs.

```
[root@ceph-amk-bz-2-c61ql9-node7 ~]# ceph fs status
cephfs - 2 clients
======
RANK  STATE   MDS                                       ACTIVITY    DNS    INOS   DIRS  CAPS
 0    active  cephfs.ceph-amk-bz-2-c61ql9-node4.mdufsx  Reqs: 0 /s  33.1k  33.2k  6293   533
 1    active  cephfs.ceph-amk-bz-2-c61ql9-node6.twwsmw  Reqs: 0 /s  27.6k  27.6k  4452   561
       POOL           TYPE     USED   AVAIL
cephfs.cephfs.meta  metadata   1193M  54.1G
cephfs.cephfs.data    data     3586M  54.1G
          STANDBY MDS
cephfs.ceph-amk-bz-2-c61ql9-node5.tyjkke
MDS version: ceph version 16.2.8-42.el8cp (c15e56a8d2decae9230567653130d1e31a36fe0a) pacific (stable)
[root@ceph-amk-bz-2-c61ql9-node7 ~]# ceph config set mds mds_sleep_rank_change 10000000.0
[root@ceph-amk-bz-2-c61ql9-node7 ~]# ceph config set mds mds_connect_bootstrapping True
[root@ceph-amk-bz-2-c61ql9-node7 ~]# ceph -s
  cluster:
    id:     ef320eb6-eb1c-11ec-8277-fa163eb9a4e9
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-amk-bz-2-c61ql9-node1-installer,ceph-amk-bz-2-c61ql9-node2,ceph-amk-bz-2-c61ql9-node3 (age 18h)
    mgr: ceph-amk-bz-2-c61ql9-node1-installer.txjreu(active, since 18h), standbys: ceph-amk-bz-2-c61ql9-node2.jbmoqy
    mds: 2/2 daemons up, 1 standby
    osd: 12 osds: 12 up (since 18h), 12 in (since 18h)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 51.77k objects, 1.6 GiB
    usage:   6.6 GiB used, 173 GiB / 180 GiB avail
    pgs:     97 active+clean

[root@ceph-amk-bz-2-c61ql9-node7 ~]# ls -lrt /mnt/cephfs_fusebo39r14l2j
total 1024003
drwxr-xr-x. 2 root root          0 Jun 13 09:44 98nr4kvgtr
drwxr-xr-x. 3 root root   81920020 Jun 13 09:46 mt5vbcwd2j
drwxr-xr-x. 2 root root          0 Jun 13 09:47 vibx51oy92
drwxr-xr-x. 3 root root   81920020 Jun 13 09:47 fidsc7yzrm
-rw-r--r--. 1 root root 1048576000 Jun 13 09:47 ceph-amk-bz-2-c61ql9-node7.txt
drwxr-xr-x. 5 root root   40960010 Jun 13 09:49 dir
drwxr-xr-x. 2 root root          0 Jun 14 03:53 run_ios
[root@ceph-amk-bz-2-c61ql9-node7 ~]# df /mnt/cephfs_fusebo39r14l2j
Filesystem     1K-blocks    Used Available Use% Mounted on
ceph-fuse       57700352 2043904  55656448   4% /mnt/cephfs_fusebo39r14l2j
[root@ceph-amk-bz-2-c61ql9-node7 ~]# df /mnt/cephfs_fusebo39r14l2j
Filesystem     1K-blocks    Used Available Use% Mounted on
ceph-fuse       57397248 2265088  55132160   4% /mnt/cephfs_fusebo39r14l2j
[root@ceph-amk-bz-2-c61ql9-node7 ~]# df /mnt/cephfs_fusebo39r14l2j
Filesystem     1K-blocks    Used Available Use% Mounted on
ceph-fuse       57372672 2809856  54562816   5% /mnt/cephfs_fusebo39r14l2j
[root@ceph-amk-bz-2-c61ql9-node7 ~]# ceph fs status
cephfs - 2 clients
======
RANK  STATE   MDS                                       ACTIVITY    DNS    INOS   DIRS  CAPS
 0    active  cephfs.ceph-amk-bz-2-c61ql9-node4.mdufsx  Reqs: 6 /s  33.2k  33.2k  6314   591
 1    active  cephfs.ceph-amk-bz-2-c61ql9-node6.twwsmw  Reqs: 4 /s  27.6k  27.6k  4493   639
       POOL           TYPE     USED   AVAIL
cephfs.cephfs.meta  metadata   1201M  53.5G
cephfs.cephfs.data    data     5442M  53.5G
          STANDBY MDS
cephfs.ceph-amk-bz-2-c61ql9-node5.tyjkke
MDS version: ceph version 16.2.8-42.el8cp (c15e56a8d2decae9230567653130d1e31a36fe0a) pacific (stable)
[root@ceph-amk-bz-2-c61ql9-node7 ~]# ceph fail mds 0
no valid command found; 10 closest matches:
pg stat
pg getmap
pg dump [all|summary|sum|delta|pools|osds|pgs|pgs_brief...]
pg dump_json [all|summary|sum|pools|osds|pgs...]
pg dump_pools_json
pg ls-by-pool <poolstr> [<states>...]
pg ls-by-primary <id|osd.id> [<pool:int>] [<states>...]
pg ls-by-osd <id|osd.id> [<pool:int>] [<states>...]
pg ls [<pool:int>] [<states>...]
pg dump_stuck [inactive|unclean|stale|undersized|degraded...] [<threshold:int>]
Error EINVAL: invalid command
[root@ceph-amk-bz-2-c61ql9-node7 ~]# ceph mds fail 0
failed mds gid 14520
[root@ceph-amk-bz-2-c61ql9-node7 ~]# ceph fs status
cephfs - 2 clients
======
RANK  STATE   MDS                                       ACTIVITY    DNS    INOS   DIRS  CAPS
 0    replay  cephfs.ceph-amk-bz-2-c61ql9-node5.tyjkke                 0      0      0     0
 1    active  cephfs.ceph-amk-bz-2-c61ql9-node6.twwsmw  Reqs: 0 /s    92     96     55    62
       POOL           TYPE     USED   AVAIL
cephfs.cephfs.meta  metadata   1265M  52.9G
cephfs.cephfs.data    data     5986M  52.9G
          STANDBY MDS
cephfs.ceph-amk-bz-2-c61ql9-node4.mdufsx
MDS version: ceph version 16.2.8-42.el8cp (c15e56a8d2decae9230567653130d1e31a36fe0a) pacific (stable)
[root@ceph-amk-bz-2-c61ql9-node7 ~]# ceph fs status
cephfs - 2 clients
======
RANK  STATE   MDS                                       ACTIVITY    DNS    INOS   DIRS  CAPS
 0    replay  cephfs.ceph-amk-bz-2-c61ql9-node5.tyjkke                 0      0      0     0
 1    active  cephfs.ceph-amk-bz-2-c61ql9-node6.twwsmw  Reqs: 0 /s    92     96     55    62
       POOL           TYPE     USED   AVAIL
cephfs.cephfs.meta  metadata   1265M  52.9G
cephfs.cephfs.data    data     5986M  52.9G
          STANDBY MDS
cephfs.ceph-amk-bz-2-c61ql9-node4.mdufsx
MDS version: ceph version 16.2.8-42.el8cp (c15e56a8d2decae9230567653130d1e31a36fe0a) pacific (stable)
[root@ceph-amk-bz-2-c61ql9-node7 ~]# ceph fs status
cephfs - 3 clients
======
RANK  STATE   MDS                                       ACTIVITY    DNS    INOS   DIRS  CAPS
 0    replay  cephfs.ceph-amk-bz-2-c61ql9-node5.tyjkke                 0      0      0     0
 1    active  cephfs.ceph-amk-bz-2-c61ql9-node6.twwsmw  Reqs: 0 /s    92     96     55    62
       POOL           TYPE     USED   AVAIL
cephfs.cephfs.meta  metadata   1265M  52.9G
cephfs.cephfs.data    data     5986M  52.9G
          STANDBY MDS
cephfs.ceph-amk-bz-2-c61ql9-node4.mdufsx
MDS version: ceph version 16.2.8-42.el8cp (c15e56a8d2decae9230567653130d1e31a36fe0a) pacific (stable)
[root@ceph-amk-bz-2-c61ql9-node7 ~]# ceph fs status
cephfs - 3 clients
======
RANK  STATE    MDS                                       ACTIVITY    DNS    INOS   DIRS  CAPS
 0    resolve  cephfs.ceph-amk-bz-2-c61ql9-node5.tyjkke                 0      0      0     0
 1    active   cephfs.ceph-amk-bz-2-c61ql9-node6.twwsmw  Reqs: 0 /s    92     96     55    62
       POOL           TYPE     USED   AVAIL
cephfs.cephfs.meta  metadata   1265M  52.9G
cephfs.cephfs.data    data     5986M  52.9G
          STANDBY MDS
cephfs.ceph-amk-bz-2-c61ql9-node4.mdufsx
MDS version: ceph version 16.2.8-42.el8cp (c15e56a8d2decae9230567653130d1e31a36fe0a) pacific (stable)
[root@ceph-amk-bz-2-c61ql9-node7 ~]# ceph fs status
cephfs - 2 clients
======
RANK  STATE    MDS                                       ACTIVITY    DNS    INOS   DIRS  CAPS
 0    resolve  cephfs.ceph-amk-bz-2-c61ql9-node5.tyjkke              60.7k  60.3k  10.2k     0
 1    active   cephfs.ceph-amk-bz-2-c61ql9-node6.twwsmw  Reqs: 0 /s    92     96     55     62
       POOL           TYPE     USED   AVAIL
cephfs.cephfs.meta  metadata   1265M  52.9G
cephfs.cephfs.data    data     5986M  52.9G
          STANDBY MDS
cephfs.ceph-amk-bz-2-c61ql9-node4.mdufsx
MDS version: ceph version 16.2.8-42.el8cp (c15e56a8d2decae9230567653130d1e31a36fe0a) pacific (stable)
[root@ceph-amk-bz-2-c61ql9-node7 ~]# ceph fs status
cephfs - 2 clients
======
RANK  STATE      MDS                                       ACTIVITY    DNS    INOS   DIRS  CAPS
 0    reconnect  cephfs.ceph-amk-bz-2-c61ql9-node5.tyjkke              60.7k  60.3k  10.2k     0
 1    active     cephfs.ceph-amk-bz-2-c61ql9-node6.twwsmw  Reqs: 4 /s    92     96     55     62
       POOL           TYPE     USED   AVAIL
cephfs.cephfs.meta  metadata   1265M  52.9G
cephfs.cephfs.data    data     5986M  52.9G
          STANDBY MDS
cephfs.ceph-amk-bz-2-c61ql9-node4.mdufsx
MDS version: ceph version 16.2.8-42.el8cp (c15e56a8d2decae9230567653130d1e31a36fe0a) pacific (stable)
[root@ceph-amk-bz-2-c61ql9-node7 ~]# ceph fs status
cephfs - 2 clients
======
RANK  STATE      MDS                                       ACTIVITY    DNS    INOS   DIRS  CAPS
 0    reconnect  cephfs.ceph-amk-bz-2-c61ql9-node5.tyjkke              60.7k  60.3k  10.2k     0
 1    active     cephfs.ceph-amk-bz-2-c61ql9-node6.twwsmw  Reqs: 4 /s    92     96     55     59
       POOL           TYPE     USED   AVAIL
cephfs.cephfs.meta  metadata   1265M  52.9G
cephfs.cephfs.data    data     5986M  52.9G
          STANDBY MDS
cephfs.ceph-amk-bz-2-c61ql9-node4.mdufsx
MDS version: ceph version 16.2.8-42.el8cp (c15e56a8d2decae9230567653130d1e31a36fe0a) pacific (stable)
[root@ceph-amk-bz-2-c61ql9-node7 ~]# ceph fs status
cephfs - 2 clients
======
RANK  STATE   MDS                                       ACTIVITY    DNS    INOS   DIRS  CAPS
 0    rejoin  cephfs.ceph-amk-bz-2-c61ql9-node5.tyjkke              60.7k  60.3k  10.2k    82
 1    active  cephfs.ceph-amk-bz-2-c61ql9-node6.twwsmw  Reqs: 0 /s    92     96     55     59
       POOL           TYPE     USED   AVAIL
cephfs.cephfs.meta  metadata   1265M  52.9G
cephfs.cephfs.data    data     5986M  52.9G
          STANDBY MDS
cephfs.ceph-amk-bz-2-c61ql9-node4.mdufsx
MDS version: ceph version 16.2.8-42.el8cp (c15e56a8d2decae9230567653130d1e31a36fe0a) pacific (stable)
[root@ceph-amk-bz-2-c61ql9-node7 ~]# ceph fs status
cephfs - 2 clients
======
RANK  STATE         MDS                                       ACTIVITY    DNS    INOS   DIRS  CAPS
 0    clientreplay  cephfs.ceph-amk-bz-2-c61ql9-node5.tyjkke              60.7k  60.3k  10.2k    82
 1    active        cephfs.ceph-amk-bz-2-c61ql9-node6.twwsmw  Reqs: 0 /s    92     96     55     59
       POOL           TYPE     USED   AVAIL
cephfs.cephfs.meta  metadata   1265M  52.9G
cephfs.cephfs.data    data     5986M  52.9G
          STANDBY MDS
cephfs.ceph-amk-bz-2-c61ql9-node4.mdufsx
MDS version: ceph version 16.2.8-42.el8cp (c15e56a8d2decae9230567653130d1e31a36fe0a) pacific (stable)
[root@ceph-amk-bz-2-c61ql9-node7 ~]# ceph fs status
cephfs - 2 clients
======
RANK  STATE         MDS                                       ACTIVITY    DNS    INOS   DIRS  CAPS
 0    clientreplay  cephfs.ceph-amk-bz-2-c61ql9-node5.tyjkke              60.7k  60.3k  10.2k    83
 1    active        cephfs.ceph-amk-bz-2-c61ql9-node6.twwsmw  Reqs: 0 /s    92     96     55     59
       POOL           TYPE     USED   AVAIL
cephfs.cephfs.meta  metadata   1265M  52.9G
cephfs.cephfs.data    data     5986M  52.9G
          STANDBY MDS
cephfs.ceph-amk-bz-2-c61ql9-node4.mdufsx
MDS version: ceph version 16.2.8-42.el8cp (c15e56a8d2decae9230567653130d1e31a36fe0a) pacific (stable)
[root@ceph-amk-bz-2-c61ql9-node7 ~]# ceph fs status
cephfs - 2 clients
======
RANK  STATE   MDS                                       ACTIVITY    DNS    INOS   DIRS  CAPS
 0    active  cephfs.ceph-amk-bz-2-c61ql9-node5.tyjkke  Reqs: 0 /s  60.7k  60.3k  10.2k    44
 1    active  cephfs.ceph-amk-bz-2-c61ql9-node6.twwsmw  Reqs: 0 /s    92     96     55     59
       POOL           TYPE     USED   AVAIL
cephfs.cephfs.meta  metadata   1266M  52.9G
cephfs.cephfs.data    data     5986M  52.9G
          STANDBY MDS
cephfs.ceph-amk-bz-2-c61ql9-node4.mdufsx
MDS version: ceph version 16.2.8-42.el8cp (c15e56a8d2decae9230567653130d1e31a36fe0a) pacific (stable)
[root@ceph-amk-bz-2-c61ql9-node7 ~]# ceph orch host ls
HOST                                  ADDR          LABELS                    STATUS
ceph-amk-bz-2-c61ql9-node1-installer  10.0.208.208  _admin mgr installer mon
ceph-amk-bz-2-c61ql9-node2            10.0.211.204  mgr osd mon
ceph-amk-bz-2-c61ql9-node3            10.0.210.106  osd mon
ceph-amk-bz-2-c61ql9-node4            10.0.211.133  mds nfs
ceph-amk-bz-2-c61ql9-node5            10.0.211.244  osd mds
ceph-amk-bz-2-c61ql9-node6            10.0.211.227  mds nfs
6 hosts in cluster
[root@ceph-amk-bz-2-c61ql9-node7 ~]#
```

Regards,
Amarnath
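For reference, the verification above condenses to the following sequence. This is a sketch only: the two config options and the sleep value are taken verbatim from the transcript, while the polling loop and the final log check are just one way of watching the rank 0 state transitions and confirming the absence of the "dropping" messages shown in the original report.

```
#!/bin/sh
# Condensed sketch of the verification run above.
ceph config set mds mds_sleep_rank_change 10000000.0
ceph config set mds mds_connect_bootstrapping true

# Fail rank 0 so the standby has to walk
# replay -> resolve -> reconnect -> rejoin -> clientreplay -> active.
ceph mds fail 0

# On a fixed build, rank 0 keeps progressing to active instead of getting
# stuck in up:resolve.
until ceph fs status cephfs | grep -Eq '^ *0 +active'; do
    ceph fs status cephfs | grep -E '^ *0 +' || true
    sleep 5
done

# While this runs, the MDS logs should contain no
# "... from down/old/bad/imposter mds ..., dropping" lines.
```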
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Ceph Storage Security, Bug Fix, and Enhancement Update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5997