Bug 2021311 - mds opening connection to up:replay/up:creating daemon causes message drop
Summary: mds opening connection to up:replay/up:creating daemon causes message drop
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 5.1
Hardware: All
OS: All
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 5.1
Assignee: Patrick Donnelly
QA Contact: Amarnath
Docs Contact: Ranjini M N
URL:
Whiteboard:
Depends On:
Blocks: 2031073
 
Reported: 2021-11-08 19:04 UTC by Patrick Donnelly
Modified: 2022-04-04 10:22 UTC
CC List: 6 users

Fixed In Version: ceph-16.2.7-22.el8cp
Doc Type: Bug Fix
Doc Text:
.Inter-MDS connections to a replacement Ceph Metadata Server (MDS) are now delayed until an identity state is established
Previously, an active Ceph Metadata Server (MDS) would initiate a connection with a replacement MDS before the replacement's identity state was established. The active MDS would then refuse to process further messages from the apparent impostor MDS, halting the failover. With this release, the connection to the replacement MDS is delayed until its identity state is established, resulting in no message drops or failover issues.
Clone Of:
Environment:
Last Closed: 2022-04-04 10:22:28 UTC
Embargoed:




Links
System ID Last Updated
Ceph Project Bug Tracker 53445 2022-01-05 18:57:25 UTC
Red Hat Issue Tracker RHCEPH-2191 2021-11-08 19:07:05 UTC
Red Hat Product Errata RHSA-2022:1174 2022-04-04 10:22:44 UTC

Description Patrick Donnelly 2021-11-08 19:04:06 UTC
Description of problem:

See: https://tracker.ceph.com/issues/53194

Comment 4 Amarnath 2022-01-18 20:52:15 UTC
Hi @pdonnell,
Can you please help in verifying this bug?
Are there any specific steps to follow, or a specific teuthology script that needs to be executed?

Comment 5 Patrick Donnelly 2022-01-19 13:38:32 UTC
Just run the failover test in teuthology (--filter "failover").
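
For reference, a minimal sketch of such an invocation (the branch, machine-type, and e-mail values are placeholders that depend on your lab setup):

./teuthology-suite -n 10 -c <ceph-branch> -s fs -m <machine-type> -e <email> --filter failover

The --filter option limits the scheduled jobs to those whose description matches "failover", so only the fs suite's failover tests are run.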

Comment 7 Amarnath 2022-02-02 08:59:50 UTC
Hi @pdonnell,

Teuthology runs are failing for one reason or another.

Command used:
./teuthology-suite -n 10 -c master -s fs --ceph-repo https://github.com/AmarnatReddy/ceph.git --suite-repo https://github.com/AmarnatReddy/ceph.git --suite-branch master /home/amk/rh8x_5.1.yaml -e amk -m clara --distro-version 8.5 --distro rhel -t rh --filter failover

Recent failure:
Command failed on clara003 with status 1: 'sudo yum remove cephadm ceph-mon ceph-mgr ceph-osd ceph-mds ceph-radosgw ceph-test ceph-selinux ceph-fuse python-rados python-rbd python-cephfs rbd-mirror bison flex elfutils-libelf-devel openssl-devel NetworkManager iproute util-linux libacl-devel libaio-devel libattr-devel libtool libuuid-devel xfsdump xfsprogs xfsprogs-devel libaio-devel libtool libuuid-devel xfsprogs-devel python3-cephfs cephfs-top cephfs-mirror bison flex elfutils-libelf-devel openssl-devel NetworkManager iproute util-linux libacl-devel libaio-devel libattr-devel libtool libuuid-devel xfsdump xfsprogs xfsprogs-devel libaio-devel libtool libuuid-devel xfsprogs-devel python3-cephfs cephfs-top cephfs-mirror -y'

Could you please let me know if there is any other way I can validate this?


Regards,
Amarnath

Comment 9 Amarnath 2022-02-08 04:21:18 UTC
Hi Patrick,

I have verified it on the latest build (16.2.7-48.el8cp).
I see rank 0 coming back to the active state after running `ceph mds fail 0`; it does not get stuck in the up:resolve state.
I don't see any messages being dropped in the logs.

Commands executed:

[root@ceph-bz-mds-9ozvvy-node7 ~]# ceph fs status
cephfs - 1 clients
======
RANK  STATE                    MDS                       ACTIVITY     DNS    INOS   DIRS   CAPS  
 0    replay  cephfs.ceph-bz-mds-9ozvvy-node6.mllnlu                    0      0      0      0   
 1    active  cephfs.ceph-bz-mds-9ozvvy-node5.btiybz  Reqs:    4 /s   285    169     58    141   
       POOL           TYPE     USED  AVAIL  
cephfs.cephfs.meta  metadata
[root@ceph-bz-mds-9ozvvy-node7 ~]# ceph config set mds mds_sleep_rank_change 10000000.0
[root@ceph-bz-mds-9ozvvy-node7 ~]# ceph config set mds mds_connect_bootstrapping True
[root@ceph-bz-mds-9ozvvy-node7 ~]# ceph -s
  cluster:
    id:     4041e752-888c-11ec-9ac6-fa163e1e31c2
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum ceph-bz-mds-9ozvvy-node1-installer,ceph-bz-mds-9ozvvy-node2,ceph-bz-mds-9ozvvy-node3 (age 40m)
    mgr: ceph-bz-mds-9ozvvy-node1-installer.fzndpb(active, since 43m), standbys: ceph-bz-mds-9ozvvy-node2.znbodr
    mds: 2/2 daemons up, 1 standby
    osd: 12 osds: 12 up (since 39m), 12 in (since 39m)
 
  data:
    volumes: 1/1 healthy
    pools:   3 pools, 65 pgs
    objects: 1.45k objects, 1.9 GiB
    usage:   6.0 GiB used, 174 GiB / 180 GiB avail
    pgs:     65 active+clean
 
  io:
    client:   21 MiB/s rd, 63 MiB/s wr, 22 op/s rd, 55 op/s wr
 

[root@ceph-bz-mds-9ozvvy-node7 ~]# ceph fs status
cephfs - 1 clients
======
RANK  STATE                    MDS                       ACTIVITY     DNS    INOS   DIRS   CAPS  
 0    active  cephfs.ceph-bz-mds-9ozvvy-node4.varcwu  Reqs:   11 /s  1330   1331    283   1312   
 1    active  cephfs.ceph-bz-mds-9ozvvy-node5.btiybz  Reqs:   11 /s   109    113     61     93   
       POOL           TYPE     USED  AVAIL  
cephfs.cephfs.meta  metadata   101M  54.9G  
cephfs.cephfs.data    data    3631M  54.9G  
             STANDBY MDS                
cephfs.ceph-bz-mds-9ozvvy-node6.mllnlu  
MDS version: ceph version 16.2.7-48.el8cp (49480538844c9255f03e5b0dccc609ea8fbf2656) pacific (stable)
[root@ceph-bz-mds-9ozvvy-node7 ~]# ceph mds fail 0
failed mds gid 14463
cephfs.cephfs.meta  metadata   111M  54.9G  
cephfs.cephfs.data    data    3751M  54.9G  
             STANDBY MDS                
cephfs.ceph-bz-mds-9ozvvy-node4.varcwu  
MDS version: ceph version 16.2.7-48.el8cp (49480538844c9255f03e5b0dccc609ea8fbf2656) pacific (stable)
[root@ceph-bz-mds-9ozvvy-node7 ~]# ceph fs status
cephfs - 1 clients
======
RANK  STATE                    MDS                       ACTIVITY     DNS    INOS   DIRS   CAPS  
 0    replay  cephfs.ceph-bz-mds-9ozvvy-node6.mllnlu                    0      0      0      0   
 1    active  cephfs.ceph-bz-mds-9ozvvy-node5.btiybz  Reqs:    0 /s   267    151     58    123   
       POOL           TYPE     USED  AVAIL  
cephfs.cephfs.meta  metadata   111M  54.9G  
cephfs.cephfs.data    data    3273M  54.9G  
             STANDBY MDS                
cephfs.ceph-bz-mds-9ozvvy-node4.varcwu  
MDS version: ceph version 16.2.7-48.el8cp (49480538844c9255f03e5b0dccc609ea8fbf2656) pacific (stable)
[root@ceph-bz-mds-9ozvvy-node7 ~]# ceph fs status
cephfs - 1 clients
======
RANK   STATE                    MDS                       ACTIVITY     DNS    INOS   DIRS   CAPS  
 0    resolve  cephfs.ceph-bz-mds-9ozvvy-node6.mllnlu                 2943   1356    290      0   
 1     active  cephfs.ceph-bz-mds-9ozvvy-node5.btiybz  Reqs:    0 /s   267    151     58    123   
       POOL           TYPE     USED  AVAIL  
cephfs.cephfs.meta  metadata   111M  55.1G  
cephfs.cephfs.data    data    3211M  55.1G  
             STANDBY MDS                
cephfs.ceph-bz-mds-9ozvvy-node4.varcwu  
MDS version: ceph version 16.2.7-48.el8cp (49480538844c9255f03e5b0dccc609ea8fbf2656) pacific (stable)
[root@ceph-bz-mds-9ozvvy-node7 ~]# ceph fs status
cephfs - 1 clients
======
RANK   STATE                    MDS                       ACTIVITY     DNS    INOS   DIRS   CAPS  
 0    resolve  cephfs.ceph-bz-mds-9ozvvy-node6.mllnlu                 2943   1356    290      0   
 1     active  cephfs.ceph-bz-mds-9ozvvy-node5.btiybz  Reqs:    0 /s   267    151     58    123   
       POOL           TYPE     USED  AVAIL  
cephfs.cephfs.meta  metadata   111M  55.1G  
cephfs.cephfs.data    data    3211M  55.1G  
             STANDBY MDS                
cephfs.ceph-bz-mds-9ozvvy-node4.varcwu  
MDS version: ceph version 16.2.7-48.el8cp (49480538844c9255f03e5b0dccc609ea8fbf2656) pacific (stable)
[root@ceph-bz-mds-9ozvvy-node7 ~]# ceph fs status
cephfs - 1 clients
======
RANK    STATE                     MDS                       ACTIVITY     DNS    INOS   DIRS   CAPS  
 0    reconnect  cephfs.ceph-bz-mds-9ozvvy-node6.mllnlu                 2943   1356    290      0   
 1      active   cephfs.ceph-bz-mds-9ozvvy-node5.btiybz  Reqs:    0 /s   267    151     58    119   
       POOL           TYPE     USED  AVAIL  
cephfs.cephfs.meta  metadata   111M  55.1G  
cephfs.cephfs.data    data    3211M  55.1G  
             STANDBY MDS                
cephfs.ceph-bz-mds-9ozvvy-node4.varcwu  
MDS version: ceph version 16.2.7-48.el8cp (49480538844c9255f03e5b0dccc609ea8fbf2656) pacific (stable)
[root@ceph-bz-mds-9ozvvy-node7 ~]# ceph fs status
cephfs - 1 clients
======
RANK    STATE                     MDS                       ACTIVITY     DNS    INOS   DIRS   CAPS  
 0    reconnect  cephfs.ceph-bz-mds-9ozvvy-node6.mllnlu                 2943   1356    290      0   
 1      active   cephfs.ceph-bz-mds-9ozvvy-node5.btiybz  Reqs:    0 /s   267    151     58    119   
       POOL           TYPE     USED  AVAIL  
cephfs.cephfs.meta  metadata   111M  55.1G  
cephfs.cephfs.data    data    3211M  55.1G  
             STANDBY MDS                
cephfs.ceph-bz-mds-9ozvvy-node4.varcwu  
MDS version: ceph version 16.2.7-48.el8cp (49480538844c9255f03e5b0dccc609ea8fbf2656) pacific (stable)
[root@ceph-bz-mds-9ozvvy-node7 ~]# ceph fs status
cephfs - 1 clients
======
RANK  STATE                    MDS                       ACTIVITY     DNS    INOS   DIRS   CAPS  
 0    active  cephfs.ceph-bz-mds-9ozvvy-node6.mllnlu  Reqs:   30 /s  2988   1356    290     21   
 1    active  cephfs.ceph-bz-mds-9ozvvy-node5.btiybz  Reqs:    0 /s   270    142     58    120   
       POOL           TYPE     USED  AVAIL  
cephfs.cephfs.meta  metadata   112M  55.3G  
cephfs.cephfs.data    data    2941M  55.3G  
             STANDBY MDS                
cephfs.ceph-bz-mds-9ozvvy-node4.varcwu  
MDS version: ceph version 16.2.7-48.el8cp (49480538844c9255f03e5b0dccc609ea8fbf2656) pacific (stable)
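
Once verification is complete, the overrides applied above can be cleared so the mds options fall back to their defaults (a minimal sketch using the standard `ceph config rm` command):

ceph config rm mds mds_sleep_rank_change
ceph config rm mds mds_connect_bootstrapping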

Comment 15 errata-xmlrpc 2022-04-04 10:22:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.1 Security, Enhancement, and Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1174

