Bug 2269700 - MDS goes into laggy state when MDS fail operations are performed
Summary: MDS goes into laggy state when MDS fail operations are performed
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 7.1
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 7.1
Assignee: Kotresh HR
QA Contact: sumr
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-03-15 12:36 UTC by Amarnath
Modified: 2024-05-08 11:04 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-04-30 11:14:25 UTC
Embargoed:
khiremat: needinfo-




Links:
Red Hat Issue Tracker RHCEPH-8541 - 2024-03-15 12:36:27 UTC

Description Amarnath 2024-03-15 12:36:05 UTC
Description of problem:
The MDS goes into a laggy state when MDS fail operations are performed.

We are observing the laggy state in two scenarios in our weekly runs:

Test case 1:
CEPH-83573462 - Validate the fs command to increase/decrease the max mds daemons
In this test we gradually increase max_mds to 5 and then decrease it back.
Increasing works fine and all 5 MDS daemons become active. When we start decreasing,
after executing the command ceph fs set cephfs max_mds 4, the MDS goes into the stopping(laggy) state.
[root@ceph-amk-tier-3-r9qz8t-node8 ~]# ceph fs status
cephfs - 2 clients
======
RANK  STATE                      MDS                         ACTIVITY     DNS    INOS   DIRS   CAPS  
 0    active  cephfs.ceph-amk-tier-3-r9qz8t-node5.stsalf  Reqs:    0 /s    33     34     32     12   
 1    active  cephfs.ceph-amk-tier-3-r9qz8t-node6.nwztiz  Reqs:    0 /s    11     14     12      1   
 2    active  cephfs.ceph-amk-tier-3-r9qz8t-node3.zlvkdz  Reqs:    0 /s    10     13     11      0   
 3    active  cephfs.ceph-amk-tier-3-r9qz8t-node4.lazamt  Reqs:    0 /s    10     13     11      0   
 4    active  cephfs.ceph-amk-tier-3-r9qz8t-node7.ueyqhh  Reqs:    0 /s    10     13     11      0   
       POOL           TYPE     USED  AVAIL  
cephfs.cephfs.meta  metadata  35.1M  51.3G  
cephfs.cephfs.data    data    14.6G  51.3G  
MDS version: ceph version 18.2.1-52.el9cp (89a4ea154d7ed38760fa0c569d5fa1b569650513) reef (stable)
[root@ceph-amk-tier-3-r9qz8t-node8 ~]# ceph fs status
cephfs - 2 clients
======
RANK       STATE                          MDS                         ACTIVITY     DNS    INOS   DIRS   CAPS  
 0         active      cephfs.ceph-amk-tier-3-r9qz8t-node5.stsalf  Reqs:    0 /s    33     34     32     12   
 1         active      cephfs.ceph-amk-tier-3-r9qz8t-node6.nwztiz  Reqs:    0 /s    11     14     12      1   
 2         active      cephfs.ceph-amk-tier-3-r9qz8t-node3.zlvkdz  Reqs:    0 /s    10     13     11      0   
 3         active      cephfs.ceph-amk-tier-3-r9qz8t-node4.lazamt  Reqs:    0 /s    10     13     11      0   
 4    stopping(laggy)  cephfs.ceph-amk-tier-3-r9qz8t-node7.ueyqhh                   10     13     11      0   
       POOL           TYPE     USED  AVAIL  
cephfs.cephfs.meta  metadata  35.1M  51.3G  
cephfs.cephfs.data    data    14.6G  51.3G  
MDS version: ceph version 18.2.1-52.el9cp (89a4ea154d7ed38760fa0c569d5fa1b569650513) reef (stable)
Test Run Logs : http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-1Y2OSW/
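
For reference, a minimal reproduction sketch of the test flow, assuming the filesystem is named cephfs (as in the output above); the loop and polling are illustrative rather than the exact test-harness commands:

# Grow the MDS cluster to 5 active ranks, one step at a time.
for n in 2 3 4 5; do
    ceph fs set cephfs max_mds $n
    ceph fs status cephfs
done

# Shrink back; the highest rank should move to "stopping" and, in the
# failing runs, was observed stuck in "stopping(laggy)".
ceph fs set cephfs max_mds 4
watch -n 2 'ceph fs status cephfs'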

MDS Log Snippet: 
2024-03-15T12:26:21.171+0000 7f5849840640  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
2024-03-15T12:26:21.171+0000 7f5849840640  0 mds.beacon.cephfs.ceph-amk-tier-3-r9qz8t-node7.ueyqhh Skipping beacon heartbeat to monitors (last acked 1457.58s ago); MDS internal heartbeat is not healthy!
2024-03-15T12:26:21.671+0000 7f5849840640  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
2024-03-15T12:26:21.671+0000 7f5849840640  0 mds.beacon.cephfs.ceph-amk-tier-3-r9qz8t-node7.ueyqhh Skipping beacon heartbeat to monitors (last acked 1458.08s ago); MDS internal heartbeat is not healthy!
2024-03-15T12:26:22.171+0000 7f5849840640  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
2024-03-15T12:26:22.171+0000 7f5849840640  0 mds.beacon.cephfs.ceph-amk-tier-3-r9qz8t-node7.ueyqhh Skipping beacon heartbeat to monitors (last acked 1458.58s ago); MDS internal heartbeat is not healthy!
2024-03-15T12:26:22.671+0000 7f5849840640  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
2024-03-15T12:26:22.671+0000 7f5849840640  0 mds.beacon.cephfs.ceph-amk-tier-3-r9qz8t-node7.ueyqhh Skipping beacon heartbeat to monitors (last acked 1459.08s ago); MDS internal heartbeat is not healthy!
2024-03-15T12:26:22.764+0000 7f5844836640  1 -- [v2:10.0.209.240:6800/2464511221,v1:10.0.209.240:6801/2464511221] send_to--> mds [v2:10.0.210.187:6832/2782618038,v1:10.0.210.187:6833/2782618038] -- mds_metrics from rank=4 carrying 0 metric updates -- ?+0 0x55f0b052b380
2024-03-15T12:26:22.764+0000 7f5844836640  1 -- [v2:10.0.209.240:6800/2464511221,v1:10.0.209.240:6801/2464511221] --> [v2:10.0.210.187:6832/2782618038,v1:10.0.210.187:6833/2782618038] -- mds_metrics from rank=4 carrying 0 metric updates -- 0x55f0b052b380 con 0x55f0b04cfc00
2024-03-15T12:26:23.171+0000 7f5849840640  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
2024-03-15T12:26:23.171+0000 7f5849840640  0 mds.beacon.cephfs.ceph-amk-tier-3-r9qz8t-node7.ueyqhh Skipping beacon heartbeat to monitors (last acked 1459.58s ago); MDS internal heartbeat is not healthy!
2024-03-15T12:26:23.672+0000 7f5849840640  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
2024-03-15T12:26:23.672+0000 7f5849840640  0 mds.beacon.cephfs.ceph-amk-tier-3-r9qz8t-node7.ueyqhh Skipping beacon heartbeat to monitors (last acked 1460.08s ago); MDS internal heartbeat is not healthy!
2024-03-15T12:26:24.144+0000 7f584903f640  1 -- [v2:10.0.209.240:6800/2464511221,v1:10.0.209.240:6801/2464511221] --> [v2:10.0.210.146:6800/2938502705,v1:10.0.210.146:6801/2938502705] -- mgrreport(unknown.cephfs.ceph-amk-tier-3-r9qz8t-node7.ueyqhh +0-0 packed 1590) -- 0x55f0b0428a80 con 0x55f0b03dec00
2024-03-15T12:26:24.172+0000 7f5849840640  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
2024-03-15T12:26:24.172+0000 7f5849840640  0 mds.beacon.cephfs.ceph-amk-tier-3-r9qz8t-node7.ueyqhh Skipping beacon heartbeat to monitors (last acked 1460.58s ago); MDS internal heartbeat is not healthy!

MDS Logs : http://magna002.ceph.redhat.com/ceph-qe-logs/amar/laggy/ceph-mds.cephfs.ceph-amk-tier-3-r9qz8t-node7.ueyqhh.log
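
The repeated heartbeat_map messages show the MDS internal heartbeat (15s timeout here) failing, which causes the daemon to skip its beacons to the monitors; once beacons stop arriving for longer than the beacon grace, the monitors mark the MDS laggy. A sketch of commands to inspect the standard beacon settings and the resulting health state, shown for context only (whether tuning them is relevant depends on the RCA):

# Monitor-side grace before an MDS that stops sending beacons is marked laggy.
ceph config get mds mds_beacon_grace
# How often the MDS sends beacons to the monitors.
ceph config get mds mds_beacon_interval
# A laggy or failed MDS shows up in cluster health and fs status.
ceph health detail
ceph fs status cephfs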

Test Case 2 :
CEPH-11227 - Active-active multi MDS without any subtree pinning to a particular MDS. Make sure to have more than 50k directories and a million files, then check MDS load balancing:
1. Check default MDS load balancing without pinning.
2. Pin 10 directories to one MDS, continue IO on other directories, and check MDS load balancing; check the migration of the subtree to the pinned MDS.
3. Pin 50% of directories to each MDS, continue IO, and check MDS load balancing.

The test case passes, but it drives the MDS into a laggy state, and we are hitting this consistently.
As part of this test we pin directories to their respective MDS ranks, run IOs, and fail the MDS in parallel; a minimal sketch of these operations follows.
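
A sketch of the pinning and fail operations, assuming a CephFS mount at /mnt/cephfs and illustrative directory names (the actual test iterates over many more directories and uses its own paths):

# Pin directory subtrees to specific MDS ranks via the ceph.dir.pin xattr.
setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/dir_0
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/dir_1

# While client IO continues, fail an active MDS by rank (or by daemon name).
ceph mds fail 1
ceph fs status cephfs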

Test logs: http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-7469NI/
MDS Log snippet : 
2024-03-14T20:19:16.592+0000 7f80918b0640  1 --2- [v2:10.0.209.28:6832/3012660119,v1:10.0.209.28:6833/3012660119] >> [v2:10.0.209.24:6800/2210445915,v1:10.0.209.24:6801/2210445915] conn(0x55a874d8b400 0x55a874ddab00 unknown :-1 s=READY pgs=15 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0)._fault with nothing to send, going to standby
2024-03-14T20:19:16.706+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh  e: '/usr/bin/ceph-mds'
2024-03-14T20:19:16.706+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh  0: '/usr/bin/ceph-mds'
2024-03-14T20:19:16.706+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh  1: '-n'
2024-03-14T20:19:16.706+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh  2: 'mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh'
2024-03-14T20:19:16.706+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh  3: '-f'
2024-03-14T20:19:16.706+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh  4: '--setuser'
2024-03-14T20:19:16.706+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh  5: 'ceph'
2024-03-14T20:19:16.707+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh  6: '--setgroup'
2024-03-14T20:19:16.707+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh  7: 'ceph'
2024-03-14T20:19:16.707+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh  8: '--default-log-to-file=false'
2024-03-14T20:19:16.707+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh  9: '--default-log-to-journald=true'
2024-03-14T20:19:16.707+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh  10: '--default-log-to-stderr=false'
2024-03-14T20:19:16.707+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh respawning with exe /usr/bin/ceph-mds
2024-03-14T20:19:16.707+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh  exe_path /proc/self/exe

MDS Logs : http://magna002.ceph.redhat.com/ceph-qe-logs/amar/laggy/tc2/ceph-mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh.log

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 3 Venky Shankar 2024-03-21 04:48:32 UTC
The MDS heartbeat timing out is not a good sign.

Handing this out to Milind for analysis. Milind, PTAL on priority.

Comment 4 Amarnath 2024-03-21 05:58:08 UTC
Hi Greg,

In the latest test runs I see this test case is passing.
It had failed on consecutive builds, so I raised this BZ.

Test case 1 passed logs : http://magna002.ceph.redhat.com/cephci-jenkins/test-runs/18.2.1-67/Weekly/cephfs/33/tier-2_cephfs_generic/
Test Case 2 Passed Logs : http://magna002.ceph.redhat.com/cephci-jenkins/test-runs/18.2.1-67/Weekly/cephfs/33/tier-2_cephfs_test_mds_pinning/

Failed Versions: 18.2.1-22, 18.2.1-52

Passed Version: 18.2.1-67

Moving this BZ to Closed

Regards,
Amarnath

Comment 5 Venky Shankar 2024-03-21 06:01:52 UTC
(In reply to Amarnath from comment #4)
> Hi Greg,
> 
> In Latest test runs I see this test case is passing. 
> on consecutive builds it has failed so raised this BZ 
> 
> Test case 1 passed logs :
> http://magna002.ceph.redhat.com/cephci-jenkins/test-runs/18.2.1-67/Weekly/
> cephfs/33/tier-2_cephfs_generic/
> Test Case 2 Passed Logs :
> http://magna002.ceph.redhat.com/cephci-jenkins/test-runs/18.2.1-67/Weekly/
> cephfs/33/tier-2_cephfs_test_mds_pinning/
> 
> Failed Versions : 18.2.1-22 , 	18.2.1-52
> 
> Passed Version: 18.2.1-67
> 
> Moving this BZ to Closed

This still needs to be RCA'd - just because the test passed on a subsequent run doesn't mean that there aren't any bugs.

Milind, please continue as usual.

Comment 13 Amarnath 2024-04-01 11:50:24 UTC
Hi Kotresh,

These are the builds on which it was failing:
--rhs-ceph-repo http://download.eng.bos.redhat.com/rhel-9/composes/auto/ceph-7.1-rhel-9/RHCEPH-7.1-RHEL-9-20240308.ci.0 
--docker-registry registry-proxy.engineering.redhat.com 
--docker-image rh-osbs/rhceph 
--docker-tag ceph-7.1-rhel-9-containers-candidate-62103-20240308195345 
Now, as part of the clean-up, I have lost the setup.

I am not able to reproduce this on the latest builds.

Regards
Amarnath

