Description of problem:
MDS default load balancing is not dividing the load between the 2 active MDS daemons in the case of a single directory with multiple subdirectories and files. Because of this, "MDS_SLOW_REQUEST" messages were observed for mds.1 in the "ceph -w" console output, and a disproportionately large number of objecter requests and inodes were associated with mds.1 in the --admin-daemon command output.

================================
2017-09-21 19:56:04.444695 mon.magna028 [INF] Cluster is now healthy
2017-09-21 19:56:23.918095 mon.magna028 [WRN] Health check failed: 1 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-09-21 19:56:27.919163 mon.magna028 [WRN] Health check update: 39 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-09-21 19:56:39.638662 osd.1 [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.013676 secs
2017-09-21 19:56:39.638666 osd.1 [WRN] slow request 30.013676 seconds old, received at 2017-09-21 19:56:09.624941: osd_op(mds.354658.0:315906 2.1f 2:f9f21277:::201.0000369d:head [stat] snapc 0=[] ondisk+read+rwordered+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:40.638964 osd.1 [WRN] 16 slow requests, 5 included below; oldest blocked for > 31.013907 secs
2017-09-21 19:56:40.638969 osd.1 [WRN] slow request 30.614480 seconds old, received at 2017-09-21 19:56:10.024368: osd_op(mds.1.8333:761150 2.1f 2:f9f21277:::201.0000369d:head [write 779414~1869 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:40.638976 osd.1 [WRN] slow request 30.606327 seconds old, received at 2017-09-21 19:56:10.032521: osd_op(mds.1.8333:761151 2.1f 2:f9f21277:::201.0000369d:head [write 781283~9194 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:40.638978 osd.1 [WRN] slow request 30.599890 seconds old, received at 2017-09-21 19:56:10.038959: osd_op(mds.1.8333:761152 2.1f 2:f9f21277:::201.0000369d:head [write 790477~1871 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:40.638980 osd.1 [WRN] slow request 30.599818 seconds old, received at 2017-09-21 19:56:10.039030: osd_op(mds.1.8333:761153 2.1f 2:f9f21277:::201.0000369d:head [write 792348~2698 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:40.638983 osd.1 [WRN] slow request 30.599748 seconds old, received at 2017-09-21 19:56:10.039101: osd_op(mds.1.8333:761154 2.1f 2:f9f21277:::201.0000369d:head [write 795046~1871 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:35.147818 mon.magna028 [WRN] MDS health message (mds.1): 665 slow requests are blocked > 30 sec
2017-09-21 19:56:35.147850 mon.magna028 [INF] daemon mds.magna063 is now active in filesystem cephfs as rank 1
2017-09-21 19:56:35.239621 mon.magna028 [WRN] Health check failed: 1 MDSs report slow requests (MDS_SLOW_REQUEST)
2017-09-21 19:56:37.921894 mon.magna028 [WRN] Health check update: 38 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-09-21 19:56:41.922931 mon.magna028 [WRN] Health check update: 76 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-09-21 19:56:41.950431 mon.magna028 [WRN] overall HEALTH_WARN 1 MDSs report slow requests; 38 slow requests are blocked > 32 sec
2017-09-21 19:56:41.639208 osd.1 [WRN] 17 slow requests, 5 included below; oldest blocked for > 32.014169 secs
2017-09-21 19:56:41.639213 osd.1 [WRN] slow request 30.232190 seconds old, received at 2017-09-21 19:56:11.406920: osd_op(mds.1.8333:761167 2.1f 2:f9f21277:::201.0000369d:head [write 858188~1785849 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:41.639215 osd.1 [WRN] slow request 31.599940 seconds old, received at 2017-09-21 19:56:10.039170: osd_op(mds.1.8333:761155 2.1f 2:f9f21277:::201.0000369d:head [write 796917~1871 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:41.639218 osd.1 [WRN] slow request 31.592728 seconds old, received at 2017-09-21 19:56:10.046382: osd_op(mds.1.8333:761156 2.1f 2:f9f21277:::201.0000369d:head [write 798788~24897 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:41.639220 osd.1 [WRN] slow request 31.587236 seconds old, received at 2017-09-21 19:56:10.051874: osd_op(mds.1.8333:761157 2.1f 2:f9f21277:::201.0000369d:head [write 823685~8443 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:41.639235 osd.1 [WRN] slow request 31.587170 seconds old, received at 2017-09-21 19:56:10.051940: osd_op(mds.1.8333:761158 2.1f 2:f9f21277:::201.0000369d:head [write 832128~1379 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks

===========================================
mds.0 objecter_requests and total ino's
=============================================
[ubuntu@host058 ~]$ sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host058.asok dump tree | grep \"ino\" | wc -l | uniq
1978
[ubuntu@host058 ~]$ sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host058.asok objecter_requests
{
    "ops": [],
    "linger_ops": [],
    "pool_ops": [],
    "pool_stat_ops": [],
    "statfs_ops": [],
    "command_ops": []
}
[ubuntu@magna058 ~]$
============================================
mds.1 objecter_requests and total ino's
===========================================
[ubuntu@host063 ~]$ sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host063.asok dump tree | grep \"ino\" | wc -l | uniq
100906
[ubuntu@host063 ~]$ sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host063.asok objecter_requests
{
    "ops": [
        { "tid": 907210, "pg": "1.92928168", "osd": 0, "object_id": "100006f2a77.00000000", "object_locator": "@1", "target_object_id": "100006f2a77.00000000", "target_object_locator": "@1", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "last_sent": "272328s", "attempts": 1, "snapid": "head", "snap_context": "1=[]", "mtime": "2017-09-21 20:11:37.0.024656s", "osd_ops": [ "delete" ] },
        { "tid": 907199, "pg": "1.a22128d3", "osd": 1, "object_id": "100006f2a89.00000000", "object_locator": "@1", "target_object_id": "100006f2a89.00000000", "target_object_locator": "@1", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "last_sent": "272328s", "attempts": 1, "snapid": "head", "snap_context": "1=[]", "mtime": "2017-09-21 20:11:36.0.889298s", "osd_ops": [ "delete" ] },
        { "tid": 907201, "pg": "1.f83c78ac", "osd": 3, "object_id": "100006f2a46.00000000", "object_locator": "@1", "target_object_id": "100006f2a46.00000000", "target_object_locator": "@1", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "last_sent": "272328s", "attempts": 1, "snapid": "head", "snap_context": "1=[]", "mtime": "2017-09-21 20:11:36.0.900143s", "osd_ops": [ "delete" ] },
        { "tid": 907206, "pg": "1.75a3d4ec", "osd": 3, "object_id": "100006f2a84.00000000", "object_locator": "@1", "target_object_id": "100006f2a84.00000000", "target_object_locator": "@1", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "last_sent": "272328s", "attempts": 1, "snapid": "head", "snap_context": "1=[]", "mtime": "2017-09-21 20:11:36.0.94171s", "osd_ops": [ "delete" ] },
        { "tid": 907209, "pg": "2.391b2878", "osd": 4, "object_id": "201.0000371e", "object_locator": "@2", "target_object_id": "201.0000371e", "target_object_locator": "@2", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "last_sent": "272328s", "attempts": 1, "snapid": "head", "snap_context": "0=[]", "mtime": "2017-09-21 20:11:37.0.017147s", "osd_ops": [ "delete" ] },
        { "tid": 907211, "pg": "1.892c3cb8", "osd": 4, "object_id": "100006f2977.00000000", "object_locator": "@1", "target_object_id": "100006f2977.00000000", "target_object_locator": "@1", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "last_sent": "272328s", "attempts": 1, "snapid": "head", "snap_context": "1=[]", "mtime": "2017-09-21 20:11:37.0.024859s", "osd_ops": [ "delete" ] },
        { "tid": 907204, "pg": "1.cd7e27cf", "osd": 6, "object_id": "100006f2a4a.00000000", "object_locator": "@1", "target_object_id": "100006f2a4a.00000000", "target_object_locator": "@1", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "last_sent": "272328s", "attempts": 1, "snapid": "head", "snap_context": "1=[]", "mtime": "2017-09-21 20:11:36.0.905073s", "osd_ops": [ "delete" ] },
        { "tid": 907208, "pg": "1.4ad7d6a7", "osd": 6, "object_id": "100006f2792.00000000", "object_locator": "@1", "target_object_id": "100006f2792.00000000", "target_object_locator": "@1", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "last_sent": "272328s", "attempts": 1, "snapid": "head", "snap_context": "1=[]", "mtime": "2017-09-21 20:11:36.0.941854s", "osd_ops": [ "delete" ] },
        { "tid": 907212, "pg": "1.47544307", "osd": 6, "object_id": "100006f2a70.00000000", "object_locator": "@1", "target_object_id": "100006f2a70.00000000", "target_object_locator": "@1", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "last_sent": "272328s", "attempts": 1, "snapid": "head", "snap_context": "1=[]", "mtime": "2017-09-21 20:11:37.0.024935s", "osd_ops": [ "delete" ] }
    ],
    "linger_ops": [],
    "pool_ops": [],
    "pool_stat_ops": [],
    "statfs_ops": [],
    "command_ops": []
}
[ubuntu@host063 ~]$
======================================================
Scenario:
Created a parent directory with 3 subdirectories (dir1, dir2 and dir3) under it.
Started I/O from 3 different clients (2 kernel and 1 FUSE) on the above 3 directories.

Version-Release number of selected component (if applicable):
ceph version 12.2.0-1.el7cp (b661348f156f148d764b998b65b90451f096cb27) luminous (rc)

How reproducible:
5/5

Steps to Reproduce:
1. Configure a CephFS mount and configure at least 3 clients.
2. Create a directory and 3 subdirectories.
3. Run I/O from the 3 clients in the directories created above (a shell sketch of these steps follows below).

Actual results:
NA

Expected results:
NA

Additional info:
This behaviour was different when 50,000 directories and 1 million files were created directly on the CephFS mount: the number of inodes was almost equal across the active MDS daemons.
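A minimal reproduction sketch of the steps above, assuming a monitor at mon-host, a mount point of /mnt/cephfs and a parent directory named "parent" (all host names, paths and directory names here are illustrative, not taken from this cluster):

# Kernel-mount CephFS on two clients and ceph-fuse on a third (paths and names are hypothetical).
sudo mount -t ceph mon-host:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
sudo ceph-fuse -m mon-host:6789 /mnt/cephfs

# Create one parent directory with three subdirectories.
mkdir -p /mnt/cephfs/parent/dir1 /mnt/cephfs/parent/dir2 /mnt/cephfs/parent/dir3

# From each client, write many small files into a different subdirectory, for example:
for i in $(seq 1 100000); do
    dd if=/dev/zero of=/mnt/cephfs/parent/dir1/file_$i bs=4k count=16 2>/dev/null
done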
Pinning details: there is no pinning done.

[ubuntu@host058 ~]$ sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host058.asok get subtrees | grep \"export_pin\"\:\ 1 | wc -l | uniq
2

These 2 pins are for the following stray paths:
"path": "~mds0/stray5/100000003ec"
"path": "~mds0/stray3/1000001d849"

[ubuntu@host058 ~]$ sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host058.asok get subtrees | grep \"export_pin\"\:\ 0 | wc -l | uniq
0
[ubuntu@host058 ~]$ sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host058.asok get subtrees | grep \"export_pin\"\:\ -1 | wc -l | uniq
20
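For context, since no pinning was used in this test: a subtree can be pinned to a specific MDS rank with the ceph.dir.pin virtual extended attribute, which takes it out of the default balancer's hands. A sketch using the same illustrative /mnt/cephfs/parent layout as in the reproduction steps above (directory paths are hypothetical):

# Pin dir1 to rank 0 and dir2/dir3 to rank 1.
setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/parent/dir1
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/parent/dir2
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/parent/dir3
# A value of -1 (the default, shown as "export_pin": -1 in the subtree dump above) returns a directory to the balancer.
setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/parent/dir2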
This is a known issue that will not be addressed in the 3.0 timeframe.
Moving this bug to the verified state. Automation regression tests passed; CI automation run link: https://polarion.engineering.redhat.com/polarion/#/project/CEPH/testrun?id=12_2_5-35_el7_sanity_fs_rhel-7_5-server-x86_64-released_Automated_Smoke_Runs
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2819