Description of problem:

The MDS goes into a laggy state when we perform MDS fail operations. We are observing this laggy state in two scenarios in our weekly runs.

Test case 1: CEPH-83573462 - Validate the fs command to increase/decrease the max mds daemons

In this test we gradually increase max_mds to 5 and then decrease it back. Increasing works fine and all 5 MDS daemons become active. When decreasing, after executing "ceph fs set cephfs max_mds 4", the stopping MDS goes into the stopping(laggy) state.

Before the decrease:

[root@ceph-amk-tier-3-r9qz8t-node8 ~]# ceph fs status
cephfs - 2 clients
======
RANK  STATE    MDS                                          ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active   cephfs.ceph-amk-tier-3-r9qz8t-node5.stsalf   Reqs: 0 /s    33     34     32     12
 1    active   cephfs.ceph-amk-tier-3-r9qz8t-node6.nwztiz   Reqs: 0 /s    11     14     12      1
 2    active   cephfs.ceph-amk-tier-3-r9qz8t-node3.zlvkdz   Reqs: 0 /s    10     13     11      0
 3    active   cephfs.ceph-amk-tier-3-r9qz8t-node4.lazamt   Reqs: 0 /s    10     13     11      0
 4    active   cephfs.ceph-amk-tier-3-r9qz8t-node7.ueyqhh   Reqs: 0 /s    10     13     11      0
       POOL           TYPE     USED  AVAIL
cephfs.cephfs.meta  metadata  35.1M  51.3G
cephfs.cephfs.data    data    14.6G  51.3G
MDS version: ceph version 18.2.1-52.el9cp (89a4ea154d7ed38760fa0c569d5fa1b569650513) reef (stable)

After the decrease, rank 4 moves to stopping(laggy):

[root@ceph-amk-tier-3-r9qz8t-node8 ~]# ceph fs status
cephfs - 2 clients
======
RANK       STATE                             MDS                              ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active            cephfs.ceph-amk-tier-3-r9qz8t-node5.stsalf   Reqs: 0 /s    33     34     32     12
 1    active            cephfs.ceph-amk-tier-3-r9qz8t-node6.nwztiz   Reqs: 0 /s    11     14     12      1
 2    active            cephfs.ceph-amk-tier-3-r9qz8t-node3.zlvkdz   Reqs: 0 /s    10     13     11      0
 3    active            cephfs.ceph-amk-tier-3-r9qz8t-node4.lazamt   Reqs: 0 /s    10     13     11      0
 4    stopping(laggy)   cephfs.ceph-amk-tier-3-r9qz8t-node7.ueyqhh                 10     13     11      0
       POOL           TYPE     USED  AVAIL
cephfs.cephfs.meta  metadata  35.1M  51.3G
cephfs.cephfs.data    data    14.6G  51.3G
MDS version: ceph version 18.2.1-52.el9cp (89a4ea154d7ed38760fa0c569d5fa1b569650513) reef (stable)

Test run logs: http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-1Y2OSW/

MDS log snippet:

2024-03-15T12:26:21.171+0000 7f5849840640  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
2024-03-15T12:26:21.171+0000 7f5849840640  0 mds.beacon.cephfs.ceph-amk-tier-3-r9qz8t-node7.ueyqhh Skipping beacon heartbeat to monitors (last acked 1457.58s ago); MDS internal heartbeat is not healthy!
2024-03-15T12:26:21.671+0000 7f5849840640  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
2024-03-15T12:26:21.671+0000 7f5849840640  0 mds.beacon.cephfs.ceph-amk-tier-3-r9qz8t-node7.ueyqhh Skipping beacon heartbeat to monitors (last acked 1458.08s ago); MDS internal heartbeat is not healthy!
2024-03-15T12:26:22.171+0000 7f5849840640  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
2024-03-15T12:26:22.171+0000 7f5849840640  0 mds.beacon.cephfs.ceph-amk-tier-3-r9qz8t-node7.ueyqhh Skipping beacon heartbeat to monitors (last acked 1458.58s ago); MDS internal heartbeat is not healthy!
2024-03-15T12:26:22.671+0000 7f5849840640  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
2024-03-15T12:26:22.671+0000 7f5849840640  0 mds.beacon.cephfs.ceph-amk-tier-3-r9qz8t-node7.ueyqhh Skipping beacon heartbeat to monitors (last acked 1459.08s ago); MDS internal heartbeat is not healthy!
2024-03-15T12:26:22.764+0000 7f5844836640  1 -- [v2:10.0.209.240:6800/2464511221,v1:10.0.209.240:6801/2464511221] send_to--> mds [v2:10.0.210.187:6832/2782618038,v1:10.0.210.187:6833/2782618038] -- mds_metrics from rank=4 carrying 0 metric updates -- ?+0 0x55f0b052b380
2024-03-15T12:26:22.764+0000 7f5844836640  1 -- [v2:10.0.209.240:6800/2464511221,v1:10.0.209.240:6801/2464511221] --> [v2:10.0.210.187:6832/2782618038,v1:10.0.210.187:6833/2782618038] -- mds_metrics from rank=4 carrying 0 metric updates -- 0x55f0b052b380 con 0x55f0b04cfc00
2024-03-15T12:26:23.171+0000 7f5849840640  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
2024-03-15T12:26:23.171+0000 7f5849840640  0 mds.beacon.cephfs.ceph-amk-tier-3-r9qz8t-node7.ueyqhh Skipping beacon heartbeat to monitors (last acked 1459.58s ago); MDS internal heartbeat is not healthy!
2024-03-15T12:26:23.672+0000 7f5849840640  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
2024-03-15T12:26:23.672+0000 7f5849840640  0 mds.beacon.cephfs.ceph-amk-tier-3-r9qz8t-node7.ueyqhh Skipping beacon heartbeat to monitors (last acked 1460.08s ago); MDS internal heartbeat is not healthy!
2024-03-15T12:26:24.144+0000 7f584903f640  1 -- [v2:10.0.209.240:6800/2464511221,v1:10.0.209.240:6801/2464511221] --> [v2:10.0.210.146:6800/2938502705,v1:10.0.210.146:6801/2938502705] -- mgrreport(unknown.cephfs.ceph-amk-tier-3-r9qz8t-node7.ueyqhh +0-0 packed 1590) -- 0x55f0b0428a80 con 0x55f0b03dec00
2024-03-15T12:26:24.172+0000 7f5849840640  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
2024-03-15T12:26:24.172+0000 7f5849840640  0 mds.beacon.cephfs.ceph-amk-tier-3-r9qz8t-node7.ueyqhh Skipping beacon heartbeat to monitors (last acked 1460.58s ago); MDS internal heartbeat is not healthy!

Note that by this point the beacon had not been acked for roughly 1460s (over 24 minutes), far beyond the 15s internal heartbeat timeout, which is why the rank is flagged as laggy.

MDS logs: http://magna002.ceph.redhat.com/ceph-qe-logs/amar/laggy/ceph-mds.cephfs.ceph-amk-tier-3-r9qz8t-node7.ueyqhh.log

Test Case 2: CEPH-11227 - Active-active multi-MDS without any subtree pinning to a particular MDS. Make sure to have more than 50k directories and a million files, then check MDS load balancing:
1. Check default MDS load balancing without pinning.
2. Pin 10 directories to one MDS, continue IO on the other directories, and check MDS load balancing; check the migration of subtrees to the pinned MDS.
3. Pin 50% of directories to each MDS, continue IO, and check MDS load balancing.

The test case passes, but it drives the MDS into the laggy state, and we are hitting this consistently.
As part of this test we pin the directories to their respective MDS ranks, run IOs, and fail the MDS daemons in parallel.

Test logs: http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-7469NI/

MDS log snippet:

2024-03-14T20:19:16.592+0000 7f80918b0640  1 --2- [v2:10.0.209.28:6832/3012660119,v1:10.0.209.28:6833/3012660119] >> [v2:10.0.209.24:6800/2210445915,v1:10.0.209.24:6801/2210445915] conn(0x55a874d8b400 0x55a874ddab00 unknown :-1 s=READY pgs=15 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0)._fault with nothing to send, going to standby
2024-03-14T20:19:16.706+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh e: '/usr/bin/ceph-mds'
2024-03-14T20:19:16.706+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh 0: '/usr/bin/ceph-mds'
2024-03-14T20:19:16.706+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh 1: '-n'
2024-03-14T20:19:16.706+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh 2: 'mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh'
2024-03-14T20:19:16.706+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh 3: '-f'
2024-03-14T20:19:16.706+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh 4: '--setuser'
2024-03-14T20:19:16.706+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh 5: 'ceph'
2024-03-14T20:19:16.707+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh 6: '--setgroup'
2024-03-14T20:19:16.707+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh 7: 'ceph'
2024-03-14T20:19:16.707+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh 8: '--default-log-to-file=false'
2024-03-14T20:19:16.707+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh 9: '--default-log-to-journald=true'
2024-03-14T20:19:16.707+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh 10: '--default-log-to-stderr=false'
2024-03-14T20:19:16.707+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh respawning with exe /usr/bin/ceph-mds
2024-03-14T20:19:16.707+0000 7f808f0ab640  1 mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh exe_path /proc/self/exe

MDS logs: http://magna002.ceph.redhat.com/ceph-qe-logs/amar/laggy/tc2/ceph-mds.cephfs.ceph-amk-week-7469ni-node5.jquwxh.log

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
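For reference, a minimal sketch of the command flow the two tests exercise, reconstructed from the description above. The fs name and ranks come from this report; the mount path and directory names are illustrative placeholders, not the paths used by the actual test automation:

  # Test case 1: grow to 5 active MDS daemons, then shrink back
  ceph fs set cephfs max_mds 5
  ceph fs status                    # wait until ranks 0-4 are all active
  ceph fs set cephfs max_mds 4      # rank 4 should go stopping -> stopped; here it entered stopping(laggy)

  # Test case 2: pin directories to MDS ranks, then fail MDS daemons while IO runs
  setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/dir_0    # pin dir_0 to rank 0 (path is illustrative)
  setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/dir_1    # pin dir_1 to rank 1
  ceph mds fail cephfs:1                             # fail a rank while client IO continues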
The MDS heartbeat timing out is not a good sign. Handing this over to Milind for analysis. Milind, PTAL on priority.
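In case it helps the triage, these are the standard config knobs and admin-socket queries for narrowing down a stuck MDSRank heartbeat. The daemon name below is the one from test case 1; treat the exact invocations as a suggestion rather than the prescribed RCA procedure:

  # How long may the internal heartbeat / mon beacon lapse before trouble?
  ceph config get mds mds_heartbeat_grace   # internal MDSRank heartbeat, 15s by default
  ceph config get mds mds_beacon_grace      # mon-side beacon grace

  # On the MDS host, inspect what the daemon is stuck on via its admin socket
  ceph daemon mds.cephfs.ceph-amk-tier-3-r9qz8t-node7.ueyqhh dump_ops_in_flight
  ceph daemon mds.cephfs.ceph-amk-tier-3-r9qz8t-node7.ueyqhh dump_historic_ops
  ceph daemon mds.cephfs.ceph-amk-tier-3-r9qz8t-node7.ueyqhh perf dump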
Hi Greg,

In the latest test runs I see this test case passing. It had failed on consecutive builds, which is why I raised this BZ.

Test case 1 passed logs: http://magna002.ceph.redhat.com/cephci-jenkins/test-runs/18.2.1-67/Weekly/cephfs/33/tier-2_cephfs_generic/
Test case 2 passed logs: http://magna002.ceph.redhat.com/cephci-jenkins/test-runs/18.2.1-67/Weekly/cephfs/33/tier-2_cephfs_test_mds_pinning/

Failed versions: 18.2.1-22, 18.2.1-52
Passed version: 18.2.1-67

Moving this BZ to Closed.

Regards,
Amarnath
(In reply to Amarnath from comment #4)
> Hi Greg,
>
> In the latest test runs I see this test case passing. It had failed on
> consecutive builds, which is why I raised this BZ.
>
> Test case 1 passed logs:
> http://magna002.ceph.redhat.com/cephci-jenkins/test-runs/18.2.1-67/Weekly/cephfs/33/tier-2_cephfs_generic/
> Test case 2 passed logs:
> http://magna002.ceph.redhat.com/cephci-jenkins/test-runs/18.2.1-67/Weekly/cephfs/33/tier-2_cephfs_test_mds_pinning/
>
> Failed versions: 18.2.1-22, 18.2.1-52
>
> Passed version: 18.2.1-67
>
> Moving this BZ to Closed

This still needs to be RCA'd - just because the test passed on a subsequent run doesn't mean that there aren't any bugs. Milind, please continue as usual.
Hi Kotresh,

These are the builds on which it was failing:

--rhs-ceph-repo http://download.eng.bos.redhat.com/rhel-9/composes/auto/ceph-7.1-rhel-9/RHCEPH-7.1-RHEL-9-20240308.ci.0
--docker-registry registry-proxy.engineering.redhat.com
--docker-image rh-osbs/rhceph
--docker-tag ceph-7.1-rhel-9-containers-candidate-62103-20240308195345

As part of the cleanup I have since lost the setup, and I am not able to reproduce this on the latest builds.

Regards,
Amarnath