Description of problem:
ceph health is in ERROR state with a damaged MDS, while running CephFS sanity tests on Ubuntu with an LVM configuration and BlueStore.

Version-Release number of selected component (if applicable):
ceph-ansible 3.2.0~rc3-2redhat1
ceph version 12.2.8-31redhat1xenial

How reproducible:
1/1

Steps to Reproduce:
1. Have 4 MDSs (2 active + 2 standby) and 4 clients (2 FUSE + 2 kernel); create 1000 directories.
2. Pin the directories to the active MDSs (most directories pinned to one MDS, fewer to the other) and start client IO on the directories.
3. Fail over the active MDSs.

Actual results:
===================
Filesystem 'cephfs_new' (5)
fs_name cephfs_new
epoch   855
flags   c
created 2018-11-20 20:53:31.865953
modified        2018-11-20 23:52:44.334002
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
last_failure    0
last_failure_osd_epoch  222
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2}
max_mds 2
in      0,1
up      {0=5594}
failed
damaged 1
stopped
data_pools      [5]
metadata_pool   2
inline_data     disabled
balancer
standby_count_wanted    1
5594:   172.16.115.40:6800/3369263764 'ceph-xeniallvm-1542736791764-node6-mds' mds.0.843 up:active seq 64

Standby daemons:

5467:   172.16.115.76:6800/128262811 'ceph-xeniallvm-1542736791764-node3-mds' mds.-1.0 up:standby seq 109
5476:   172.16.115.66:6800/4128937827 'ceph-xeniallvm-1542736791764-node4-mds' mds.-1.0 up:standby seq 2
5883:   172.16.115.29:6800/1229698542 'ceph-xeniallvm-1542736791764-node12-mds' mds.-1.0 up:standby seq 2
==============
  cluster:
    id:     71618eef-ed70-45ff-a191-a2255c5904e2
    health: HEALTH_ERR
            1 filesystem is degraded
            1 mds daemon damaged
            clock skew detected on mon.ceph-xeniallvm-1542736791764-node15-monmgr, mon.ceph-xeniallvm-1542736791764-node14-monmgr

  services:
    mon: 3 daemons, quorum ceph-xeniallvm-1542736791764-node1-monmgrinstaller,ceph-xeniallvm-1542736791764-node15-monmgr,ceph-xeniallvm-1542736791764-node14-monmgr
    mgr: ceph-xeniallvm-1542736791764-node15-monmgr(active), standbys: ceph-xeniallvm-1542736791764-node14-monmgr, ceph-xeniallvm-1542736791764-node1-monmgrinstaller
    mds: cephfs_new-1/2/2 up {0=ceph-xeniallvm-1542736791764-node6-mds=up:resolve}, 3 up:standby, 1 damaged
    osd: 33 osds: 33 up, 33 in

  data:
    pools:   6 pools, 384 pgs
    objects: 28.81k objects, 6.47GiB
    usage:   66.8GiB used, 923GiB / 990GiB avail
    pgs:     384 active+clean
==================
cephfs_new - 2 clients
==========
+------+---------+----------------------------------------+----------+-------+-------+
| Rank | State   | MDS                                    | Activity | dns   | inos  |
+------+---------+----------------------------------------+----------+-------+-------+
|  0   | resolve | ceph-xeniallvm-1542736791764-node6-mds |          |   0   |   0   |
|  1   | failed  |                                        |          |       |       |
+------+---------+----------------------------------------+----------+-------+-------+
+-----------------+----------+-------+-------+
|       Pool      |   type   |  used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata |  367M |  287G |
|    data_pool    |   data   | 6258M |  627G |
+-----------------+----------+-------+-------+
+-----------------------------------------+
|               Standby MDS               |
+-----------------------------------------+
|  ceph-xeniallvm-1542736791764-node3-mds |
|  ceph-xeniallvm-1542736791764-node4-mds |
| ceph-xeniallvm-1542736791764-node12-mds |
+-----------------------------------------+
=======================

Expected results:
No damage message should be seen.

Additional info:
Logs kept in: magna002.ceph.redhat.com:/home/sshreeka/bz_logs/ceph-xeniallvm-1542736791764

In leader MON log, found this:
2018-11-20 23:45:47.331407 7f5959fa4700  1 mds.0.736 skipping upkeep work because connection to Monitors appears laggy
2018-11-20 23:45:48.873857 7f595c7a9700  5 mds.0.736 laggy, deferring client_session(request_renewcaps seq 79) v1
2018-11-20 23:45:49.266075 7f595c7a9700  5 mds.0.736 laggy, deferring client_session(request_renewcaps seq 79) v1
2018-11-20 23:45:49.302020 7f59597a3700  5 mds.beacon.ceph-xeniallvm-1542736791764-node6-mds Sending beacon up:active seq 859
2018-11-20 23:45:51.530410 7f595779f700 -1 mds.0.journaler.pq(rw) _finish_write_head got (108) Cannot send after transport endpoint shutdown
2018-11-20 23:45:51.530428 7f595779f700 -1 mds.0.journaler.pq(rw) handle_write_error (108) Cannot send after transport endpoint shutdown
2018-11-20 23:45:51.530468 7f595779f700  5 mds.beacon.ceph-xeniallvm-1542736791764-node6-mds set_want_state: up:active -> down:damaged
2018-11-20 23:45:51.532007 7f595779f700  5 mds.beacon.ceph-xeniallvm-1542736791764-node6-mds Sending beacon down:damaged seq 860
2018-11-20 23:45:55.530367 7f59597a3700  5 mds.beacon.ceph-xeniallvm-1542736791764-node6-mds Sending beacon down:damaged seq 861
2018-11-20 23:45:56.282268 7f595779f700  1 mds.ceph-xeniallvm-1542736791764-node6-mds respawn!
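For reference, the pinning and failover in the reproduction steps can be driven with standard CephFS commands. A minimal sketch, assuming the filesystem is mounted at /mnt/cephfs and the test directories are named dir_0 .. dir_999 (the mount point, directory names, and the 900/100 split are illustrative, not taken from the test run):

```
#!/bin/sh
# Pin most directories to rank 0 and the rest to rank 1 via the
# ceph.dir.pin extended attribute (multi-active CephFS export pins).
for i in $(seq 0 899); do
    setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/dir_$i
done
for i in $(seq 900 999); do
    setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/dir_$i
done

# Fail over the active MDSs by rank; a standby should take over each rank.
ceph mds fail 0
ceph mds fail 1
```

These commands require a live cluster and a mounted filesystem, so they are only a sketch of the test procedure, not a standalone script.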
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0475