.Load on MDS daemons is not always balanced fairly or evenly in multiple active MDS configurations
Previously, in certain cases, the MDS balancers offloaded too much metadata to another active daemon, or offloaded none at all. With this update to {product}, several balancer fixes and optimizations address this issue, and metadata load is now distributed more evenly across active MDS daemons.
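One way to check whether metadata load is spread evenly is to compare the per-rank inode counts reported by `ceph --admin-daemon <asok> dump tree` (as is done later in this bug). The helper below is a hypothetical sketch, not part of Ceph: it simply flags a rank that holds far more than its fair share of inodes.

```python
# Hypothetical helper (not part of Ceph): judge whether per-rank inode
# counts gathered from the MDS admin sockets look balanced.
def is_balanced(inode_counts, tolerance=0.25):
    """Return True if no rank holds more than (1/n + tolerance) of all inodes."""
    total = sum(inode_counts)
    if total == 0:
        return True
    fair_share = 1.0 / len(inode_counts)
    return all(count / total <= fair_share + tolerance for count in inode_counts)

# Counts observed in this report: mds.0 held 1978 inodes, mds.1 held 100906.
print(is_balanced([1978, 100906]))   # clearly unbalanced
print(is_balanced([51000, 49000]))   # roughly even split
```

The 25% tolerance is an arbitrary illustration threshold; with it, the counts observed in this report (1978 vs. 100906) are flagged as unbalanced.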
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2018:2819
Description of problem:

MDS default load balancing does not divide the load between 2 active MDS daemons in the case of a single directory with multiple subdirectories and files. Because of this, "MDS_SLOW_REQUEST" messages were observed for mds.1 in the "ceph -w" console output, and too many objecter requests and inodes were associated with mds.1 in the --admin-daemon command output.

================================
2017-09-21 19:56:04.444695 mon.magna028 [INF] Cluster is now healthy
2017-09-21 19:56:23.918095 mon.magna028 [WRN] Health check failed: 1 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-09-21 19:56:27.919163 mon.magna028 [WRN] Health check update: 39 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-09-21 19:56:39.638662 osd.1 [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.013676 secs
2017-09-21 19:56:39.638666 osd.1 [WRN] slow request 30.013676 seconds old, received at 2017-09-21 19:56:09.624941: osd_op(mds.354658.0:315906 2.1f 2:f9f21277:::201.0000369d:head [stat] snapc 0=[] ondisk+read+rwordered+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:40.638964 osd.1 [WRN] 16 slow requests, 5 included below; oldest blocked for > 31.013907 secs
2017-09-21 19:56:40.638969 osd.1 [WRN] slow request 30.614480 seconds old, received at 2017-09-21 19:56:10.024368: osd_op(mds.1.8333:761150 2.1f 2:f9f21277:::201.0000369d:head [write 779414~1869 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:40.638976 osd.1 [WRN] slow request 30.606327 seconds old, received at 2017-09-21 19:56:10.032521: osd_op(mds.1.8333:761151 2.1f 2:f9f21277:::201.0000369d:head [write 781283~9194 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:40.638978 osd.1 [WRN] slow request 30.599890 seconds old, received at 2017-09-21 19:56:10.038959: osd_op(mds.1.8333:761152 2.1f 2:f9f21277:::201.0000369d:head [write 790477~1871 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:40.638980 osd.1 [WRN] slow request 30.599818 seconds old, received at 2017-09-21 19:56:10.039030: osd_op(mds.1.8333:761153 2.1f 2:f9f21277:::201.0000369d:head [write 792348~2698 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:40.638983 osd.1 [WRN] slow request 30.599748 seconds old, received at 2017-09-21 19:56:10.039101: osd_op(mds.1.8333:761154 2.1f 2:f9f21277:::201.0000369d:head [write 795046~1871 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:35.147818 mon.magna028 [WRN] MDS health message (mds.1): 665 slow requests are blocked > 30 sec
2017-09-21 19:56:35.147850 mon.magna028 [INF] daemon mds.magna063 is now active in filesystem cephfs as rank 1
2017-09-21 19:56:35.239621 mon.magna028 [WRN] Health check failed: 1 MDSs report slow requests (MDS_SLOW_REQUEST)
2017-09-21 19:56:37.921894 mon.magna028 [WRN] Health check update: 38 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-09-21 19:56:41.922931 mon.magna028 [WRN] Health check update: 76 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-09-21 19:56:41.950431 mon.magna028 [WRN] overall HEALTH_WARN 1 MDSs report slow requests; 38 slow requests are blocked > 32 sec
2017-09-21 19:56:41.639208 osd.1 [WRN] 17 slow requests, 5 included below; oldest blocked for > 32.014169 secs
2017-09-21 19:56:41.639213 osd.1 [WRN] slow request 30.232190 seconds old, received at 2017-09-21 19:56:11.406920: osd_op(mds.1.8333:761167 2.1f 2:f9f21277:::201.0000369d:head [write 858188~1785849 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:41.639215 osd.1 [WRN] slow request 31.599940 seconds old, received at 2017-09-21 19:56:10.039170: osd_op(mds.1.8333:761155 2.1f 2:f9f21277:::201.0000369d:head [write 796917~1871 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:41.639218 osd.1 [WRN] slow request 31.592728 seconds old, received at 2017-09-21 19:56:10.046382: osd_op(mds.1.8333:761156 2.1f 2:f9f21277:::201.0000369d:head [write 798788~24897 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:41.639220 osd.1 [WRN] slow request 31.587236 seconds old, received at 2017-09-21 19:56:10.051874: osd_op(mds.1.8333:761157 2.1f 2:f9f21277:::201.0000369d:head [write 823685~8443 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:41.639235 osd.1 [WRN] slow request 31.587170 seconds old, received at 2017-09-21 19:56:10.051940: osd_op(mds.1.8333:761158 2.1f 2:f9f21277:::201.0000369d:head [write 832128~1379 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
===========================================
mds.0 object_requests and total ino's
=============================================
[ubuntu@host058 ~]$ sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host058.asok dump tree | grep \"ino\" | wc -l | uniq
1978
[ubuntu@host058 ~]$ sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host058.asok objecter_requests
{ "ops": [], "linger_ops": [], "pool_ops": [], "pool_stat_ops": [], "statfs_ops": [], "command_ops": [] }
[ubuntu@magna058 ~]$
============================================
mds.1 object_requests and total ino's
===========================================
[ubuntu@host063 ~]$ sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host063.asok dump tree | grep \"ino\" | wc -l | uniq
100906
[ubuntu@host063 ~]$ sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host063.asok objecter_requests
{ "ops": [
{ "tid": 907210, "pg": "1.92928168", "osd": 0, "object_id": "100006f2a77.00000000", "object_locator": "@1", "target_object_id": "100006f2a77.00000000", "target_object_locator": "@1", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "last_sent": "272328s", "attempts": 1, "snapid": "head", "snap_context": "1=[]", "mtime": "2017-09-21 20:11:37.0.024656s", "osd_ops": [ "delete" ] },
{ "tid": 907199, "pg": "1.a22128d3", "osd": 1, "object_id": "100006f2a89.00000000", "object_locator": "@1", "target_object_id": "100006f2a89.00000000", "target_object_locator": "@1", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "last_sent": "272328s", "attempts": 1, "snapid": "head", "snap_context": "1=[]", "mtime": "2017-09-21 20:11:36.0.889298s", "osd_ops": [ "delete" ] },
{ "tid": 907201, "pg": "1.f83c78ac", "osd": 3, "object_id": "100006f2a46.00000000", "object_locator": "@1", "target_object_id": "100006f2a46.00000000", "target_object_locator": "@1", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "last_sent": "272328s", "attempts": 1, "snapid": "head", "snap_context": "1=[]", "mtime": "2017-09-21 20:11:36.0.900143s", "osd_ops": [ "delete" ] },
{ "tid": 907206, "pg": "1.75a3d4ec", "osd": 3, "object_id": "100006f2a84.00000000", "object_locator": "@1", "target_object_id": "100006f2a84.00000000", "target_object_locator": "@1", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "last_sent": "272328s", "attempts": 1, "snapid": "head", "snap_context": "1=[]", "mtime": "2017-09-21 20:11:36.0.94171s", "osd_ops": [ "delete" ] },
{ "tid": 907209, "pg": "2.391b2878", "osd": 4, "object_id": "201.0000371e", "object_locator": "@2", "target_object_id": "201.0000371e", "target_object_locator": "@2", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "last_sent": "272328s", "attempts": 1, "snapid": "head", "snap_context": "0=[]", "mtime": "2017-09-21 20:11:37.0.017147s", "osd_ops": [ "delete" ] },
{ "tid": 907211, "pg": "1.892c3cb8", "osd": 4, "object_id": "100006f2977.00000000", "object_locator": "@1", "target_object_id": "100006f2977.00000000", "target_object_locator": "@1", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "last_sent": "272328s", "attempts": 1, "snapid": "head", "snap_context": "1=[]", "mtime": "2017-09-21 20:11:37.0.024859s", "osd_ops": [ "delete" ] },
{ "tid": 907204, "pg": "1.cd7e27cf", "osd": 6, "object_id": "100006f2a4a.00000000", "object_locator": "@1", "target_object_id": "100006f2a4a.00000000", "target_object_locator": "@1", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "last_sent": "272328s", "attempts": 1, "snapid": "head", "snap_context": "1=[]", "mtime": "2017-09-21 20:11:36.0.905073s", "osd_ops": [ "delete" ] },
{ "tid": 907208, "pg": "1.4ad7d6a7", "osd": 6, "object_id": "100006f2792.00000000", "object_locator": "@1", "target_object_id": "100006f2792.00000000", "target_object_locator": "@1", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "last_sent": "272328s", "attempts": 1, "snapid": "head", "snap_context": "1=[]", "mtime": "2017-09-21 20:11:36.0.941854s", "osd_ops": [ "delete" ] },
{ "tid": 907212, "pg": "1.47544307", "osd": 6, "object_id": "100006f2a70.00000000", "object_locator": "@1", "target_object_id": "100006f2a70.00000000", "target_object_locator": "@1", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "last_sent": "272328s", "attempts": 1, "snapid": "head", "snap_context": "1=[]", "mtime": "2017-09-21 20:11:37.0.024935s", "osd_ops": [ "delete" ] }
], "linger_ops": [], "pool_ops": [], "pool_stat_ops": [], "statfs_ops": [], "command_ops": [] }
[ubuntu@host063 ~]$
======================================================
Scenario:
Created a parent directory dir1, then created 3 subdirectories dir1, dir2 and dir3 under it.
Started IOs from 3 different clients (2 kernel and 1 FUSE) on the above 3 directories.

Version-Release number of selected component (if applicable):
ceph version 12.2.0-1.el7cp (b661348f156f148d764b998b65b90451f096cb27) luminous (rc)

How reproducible:
5/5

Steps to Reproduce:
1. Configure a CephFS mount with at least 3 clients.
2. Create a directory and 3 subdirectories.
3. Run IOs from the 3 clients in the created directories.

Actual results:
NA

Expected results:
NA

Additional info:
This behaviour was different when 50000 directories and 1 million files were created directly on the CephFS mount: in that case the number of inodes was almost equal between the active MDS daemons.
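The `objecter_requests` dump above can be summarized mechanically to show how unevenly the in-flight work lands on OSDs. The sketch below is illustrative only; it parses an abbreviated copy of the mds.1 dump shown above, keeping just the fields it actually uses.

```python
import json
from collections import Counter

# Abbreviated sample of the mds.1 `objecter_requests` output above;
# only the "tid", "osd", and "osd_ops" fields are kept.
dump = json.loads("""
{"ops": [
  {"tid": 907210, "osd": 0, "osd_ops": ["delete"]},
  {"tid": 907199, "osd": 1, "osd_ops": ["delete"]},
  {"tid": 907201, "osd": 3, "osd_ops": ["delete"]},
  {"tid": 907206, "osd": 3, "osd_ops": ["delete"]},
  {"tid": 907209, "osd": 4, "osd_ops": ["delete"]},
  {"tid": 907211, "osd": 4, "osd_ops": ["delete"]},
  {"tid": 907204, "osd": 6, "osd_ops": ["delete"]},
  {"tid": 907208, "osd": 6, "osd_ops": ["delete"]},
  {"tid": 907212, "osd": 6, "osd_ops": ["delete"]}
]}
""")

# Count the in-flight objecter ops per target OSD.
per_osd = Counter(op["osd"] for op in dump["ops"])
print(dict(per_osd))  # {0: 1, 1: 1, 3: 2, 4: 2, 6: 3}
```

Against the full dump (fed from `ceph --admin-daemon <asok> objecter_requests` instead of the inline string), the same two lines would show all 9 pending ops coming from mds.1 while mds.0 has none, matching the imbalance reported above.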