Bug 1494256 - [CephFS] Default load balancing is not sharing the load among 2 active MDSs
Status: CLOSED ERRATA
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: CephFS
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: rc
Target Release: 3.1
Assigned To: Yan, Zheng
QA Contact: Ramakrishnan Periyasamy
Docs Contact: Bara Ancincova
Keywords: TestOnly
Depends On:
Blocks: 1494421 1584264
Reported: 2017-09-21 16:36 EDT by Ramakrishnan Periyasamy
Modified: 2018-09-26 14:18 EDT
CC List: 11 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
.Load on MDS daemons is not always balanced fairly or evenly in multiple active MDS configurations
Previously, in certain cases, the MDS balancers offloaded too much metadata to another active daemon, or none at all. With this update to {product}, several balancer fixes and optimizations address the issue, so it no longer occurs.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-26 14:16:44 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---




External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:2819 None None None 2018-09-26 14:18 EDT

Description Ramakrishnan Periyasamy 2017-09-21 16:36:09 EDT
Description of problem:
MDS default load balancing is not dividing the load between the 2 active MDSs in the case of a single directory with multiple subdirectories and files.

Because of this, "MDS_SLOW_REQUEST" messages were observed in the "ceph -w" console output for mds.1. Too many object requests and inodes associated with mds.1 were also observed in the --admin-daemon command output.
================================
2017-09-21 19:56:04.444695 mon.magna028 [INF] Cluster is now healthy
2017-09-21 19:56:23.918095 mon.magna028 [WRN] Health check failed: 1 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-09-21 19:56:27.919163 mon.magna028 [WRN] Health check update: 39 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-09-21 19:56:39.638662 osd.1 [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.013676 secs
2017-09-21 19:56:39.638666 osd.1 [WRN] slow request 30.013676 seconds old, received at 2017-09-21 19:56:09.624941: osd_op(mds.354658.0:315906 2.1f 2:f9f21277:::201.0000369d:head [stat] snapc 0=[] ondisk+read+rwordered+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:40.638964 osd.1 [WRN] 16 slow requests, 5 included below; oldest blocked for > 31.013907 secs
2017-09-21 19:56:40.638969 osd.1 [WRN] slow request 30.614480 seconds old, received at 2017-09-21 19:56:10.024368: osd_op(mds.1.8333:761150 2.1f 2:f9f21277:::201.0000369d:head [write 779414~1869 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:40.638976 osd.1 [WRN] slow request 30.606327 seconds old, received at 2017-09-21 19:56:10.032521: osd_op(mds.1.8333:761151 2.1f 2:f9f21277:::201.0000369d:head [write 781283~9194 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:40.638978 osd.1 [WRN] slow request 30.599890 seconds old, received at 2017-09-21 19:56:10.038959: osd_op(mds.1.8333:761152 2.1f 2:f9f21277:::201.0000369d:head [write 790477~1871 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:40.638980 osd.1 [WRN] slow request 30.599818 seconds old, received at 2017-09-21 19:56:10.039030: osd_op(mds.1.8333:761153 2.1f 2:f9f21277:::201.0000369d:head [write 792348~2698 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:40.638983 osd.1 [WRN] slow request 30.599748 seconds old, received at 2017-09-21 19:56:10.039101: osd_op(mds.1.8333:761154 2.1f 2:f9f21277:::201.0000369d:head [write 795046~1871 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:35.147818 mon.magna028 [WRN] MDS health message (mds.1): 665 slow requests are blocked > 30 sec
2017-09-21 19:56:35.147850 mon.magna028 [INF] daemon mds.magna063 is now active in filesystem cephfs as rank 1
2017-09-21 19:56:35.239621 mon.magna028 [WRN] Health check failed: 1 MDSs report slow requests (MDS_SLOW_REQUEST)
2017-09-21 19:56:37.921894 mon.magna028 [WRN] Health check update: 38 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-09-21 19:56:41.922931 mon.magna028 [WRN] Health check update: 76 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-09-21 19:56:41.950431 mon.magna028 [WRN] overall HEALTH_WARN 1 MDSs report slow requests; 38 slow requests are blocked > 32 sec
2017-09-21 19:56:41.639208 osd.1 [WRN] 17 slow requests, 5 included below; oldest blocked for > 32.014169 secs
2017-09-21 19:56:41.639213 osd.1 [WRN] slow request 30.232190 seconds old, received at 2017-09-21 19:56:11.406920: osd_op(mds.1.8333:761167 2.1f 2:f9f21277:::201.0000369d:head [write 858188~1785849 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:41.639215 osd.1 [WRN] slow request 31.599940 seconds old, received at 2017-09-21 19:56:10.039170: osd_op(mds.1.8333:761155 2.1f 2:f9f21277:::201.0000369d:head [write 796917~1871 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:41.639218 osd.1 [WRN] slow request 31.592728 seconds old, received at 2017-09-21 19:56:10.046382: osd_op(mds.1.8333:761156 2.1f 2:f9f21277:::201.0000369d:head [write 798788~24897 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:41.639220 osd.1 [WRN] slow request 31.587236 seconds old, received at 2017-09-21 19:56:10.051874: osd_op(mds.1.8333:761157 2.1f 2:f9f21277:::201.0000369d:head [write 823685~8443 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:41.639235 osd.1 [WRN] slow request 31.587170 seconds old, received at 2017-09-21 19:56:10.051940: osd_op(mds.1.8333:761158 2.1f 2:f9f21277:::201.0000369d:head [write 832128~1379 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
===========================================
mds.0 objecter_requests and total inode count:
=============================================
[ubuntu@host058 ~]$ sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host058.asok dump tree | grep \"ino\" | wc -l | uniq
1978
[ubuntu@host058 ~]$ sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host058.asok  objecter_requests
{
    "ops": [],
    "linger_ops": [],
    "pool_ops": [],
    "pool_stat_ops": [],
    "statfs_ops": [],
    "command_ops": []
}
[ubuntu@magna058 ~]$ 
============================================
mds.1 objecter_requests and total inode count:
===========================================
[ubuntu@host063 ~]$ sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host063.asok dump tree | grep \"ino\" | wc -l | uniq
100906
[ubuntu@host063 ~]$ sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host063.asok objecter_requests
{
    "ops": [
        {
            "tid": 907210,
            "pg": "1.92928168",
            "osd": 0,
            "object_id": "100006f2a77.00000000",
            "object_locator": "@1",
            "target_object_id": "100006f2a77.00000000",
            "target_object_locator": "@1",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "272328s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "1=[]",
            "mtime": "2017-09-21 20:11:37.0.024656s",
            "osd_ops": [
                "delete"
            ]
        },
        {
            "tid": 907199,
            "pg": "1.a22128d3",
            "osd": 1,
            "object_id": "100006f2a89.00000000",
            "object_locator": "@1",
            "target_object_id": "100006f2a89.00000000",
            "target_object_locator": "@1",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "272328s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "1=[]",
            "mtime": "2017-09-21 20:11:36.0.889298s",
            "osd_ops": [
                "delete"
            ]
        },
        {
            "tid": 907201,
            "pg": "1.f83c78ac",
            "osd": 3,
            "object_id": "100006f2a46.00000000",
            "object_locator": "@1",
            "target_object_id": "100006f2a46.00000000",
            "target_object_locator": "@1",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "272328s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "1=[]",
            "mtime": "2017-09-21 20:11:36.0.900143s",
            "osd_ops": [
                "delete"
            ]
        },
        {
            "tid": 907206,
            "pg": "1.75a3d4ec",
            "osd": 3,
            "object_id": "100006f2a84.00000000",
            "object_locator": "@1",
            "target_object_id": "100006f2a84.00000000",
            "target_object_locator": "@1",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "272328s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "1=[]",
            "mtime": "2017-09-21 20:11:36.0.94171s",
            "osd_ops": [
                "delete"
            ]
        },
        {
            "tid": 907209,
            "pg": "2.391b2878",
            "osd": 4,
            "object_id": "201.0000371e",
            "object_locator": "@2",
            "target_object_id": "201.0000371e",
            "target_object_locator": "@2",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "272328s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "0=[]",
            "mtime": "2017-09-21 20:11:37.0.017147s",
            "osd_ops": [
                "delete"
            ]
        },
        {
            "tid": 907211,
            "pg": "1.892c3cb8",
            "osd": 4,
            "object_id": "100006f2977.00000000",
            "object_locator": "@1",
            "target_object_id": "100006f2977.00000000",
            "target_object_locator": "@1",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "272328s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "1=[]",
            "mtime": "2017-09-21 20:11:37.0.024859s",
            "osd_ops": [
                "delete"
            ]
        },
        {
            "tid": 907204,
            "pg": "1.cd7e27cf",
            "osd": 6,
            "object_id": "100006f2a4a.00000000",
            "object_locator": "@1",
            "target_object_id": "100006f2a4a.00000000",
            "target_object_locator": "@1",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "272328s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "1=[]",
            "mtime": "2017-09-21 20:11:36.0.905073s",
            "osd_ops": [
                "delete"
            ]
        },
        {
            "tid": 907208,
            "pg": "1.4ad7d6a7",
            "osd": 6,
            "object_id": "100006f2792.00000000",
            "object_locator": "@1",
            "target_object_id": "100006f2792.00000000",
            "target_object_locator": "@1",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "272328s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "1=[]",
            "mtime": "2017-09-21 20:11:36.0.941854s",
            "osd_ops": [
                "delete"
            ]
        },
        {
            "tid": 907212,
            "pg": "1.47544307",
            "osd": 6,
            "object_id": "100006f2a70.00000000",
            "object_locator": "@1",
            "target_object_id": "100006f2a70.00000000",
            "target_object_locator": "@1",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "272328s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "1=[]",
            "mtime": "2017-09-21 20:11:37.0.024935s",
            "osd_ops": [
                "delete"
            ]
        }
    ],
    "linger_ops": [],
    "pool_ops": [],
    "pool_stat_ops": [],
    "statfs_ops": [],
    "command_ops": []
}
[ubuntu@host063 ~]$ 
======================================================
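To compare load between the two active MDS daemons without dumping the full cache tree, the per-rank counters can be used instead. A minimal sketch, assuming the status mgr module is enabled and reusing the example admin socket paths from the output above (adjust to your cluster); this is not the exact method used for this report:

# Per-rank request, dentry, and inode counts summarized by the mgr.
ceph fs status cephfs

# Inode and request perf counters for each active MDS, read over its
# admin socket (run on the node hosting that daemon).
sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host058.asok perf dump mds
sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host063.asok perf dump mds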

Scenario:
Created a parent directory dir1, then created 3 subdirectories (dir1, dir2, and dir3) under it.

Started I/O from 3 different clients (2 kernel and 1 FUSE) on the above 3 directories.

Version-Release number of selected component (if applicable):
ceph version 12.2.0-1.el7cp (b661348f156f148d764b998b65b90451f096cb27) luminous (rc)

How reproducible:
5/5

Steps to Reproduce:
1. Configure a CephFS mount with at least 3 clients.
2. Create a parent directory and 3 subdirectories.
3. Run I/O from the 3 clients in the created directories (a minimal command sketch follows below).
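
A minimal command-level sketch of these steps, assuming two kernel clients and one FUSE client; the monitor address, secret file, mount point, and the small-file write loop are illustrative placeholders rather than the exact tooling used in this report:

# On each kernel client (mon-host and the secret file are placeholders).
sudo mount -t ceph mon-host:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret

# On the FUSE client.
sudo ceph-fuse /mnt/cephfs

# From any one client, create the parent directory and 3 subdirectories.
mkdir -p /mnt/cephfs/dir1/dir1 /mnt/cephfs/dir1/dir2 /mnt/cephfs/dir1/dir3

# From each client, drive metadata-heavy I/O into its own subdirectory,
# e.g. a small-file create loop (clients 2 and 3 target dir2 and dir3).
for i in $(seq 1 100000); do
    dd if=/dev/zero of=/mnt/cephfs/dir1/dir1/file_${i} bs=4K count=1 status=none
done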

Actual results:
NA

Expected results:
NA

Additional info:
This behaviour was different when 50,000 directories and 1 million files were created directly on the CephFS mount: in that case the number of inodes was almost equal across the active MDSs.
Comment 2 Ramakrishnan Periyasamy 2017-09-21 16:40:37 EDT
Pinning details:

No explicit pinning was done.

[ubuntu@host058 ~]$ sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host058.asok  get subtrees | grep \"export_pin\"\:\ 1 | wc -l | uniq
2

These 2 pins are for the following stray paths:

"path": "~mds0/stray5/100000003ec"
"path": "~mds0/stray3/1000001d849"

[ubuntu@host058 ~]$ sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host058.asok  get subtrees | grep \"export_pin\"\:\ 0 | wc -l | uniq
0

[ubuntu@host058 ~]$ sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host058.asok  get subtrees | grep \"export_pin\"\:\ -1 | wc -l | uniq
20
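
For reference, the balancer can be bypassed for specific directories with explicit export pins. A minimal sketch, assuming a client mount at /mnt/cephfs and the directory layout from the description (the paths are illustrative):

# Pin dir1/dir1 to MDS rank 0 and dir1/dir2 to rank 1; dir1/dir3 stays with the balancer.
setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/dir1/dir1
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/dir1/dir2

# A value of -1 (the default) removes the pin and hands the subtree back to the balancer.
setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/dir1/dir2

# Effective pins then show up as "export_pin" in the subtree dump queried above.
sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host058.asok get subtrees | grep export_pin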
Comment 5 Brett Niver 2017-10-03 13:54:55 EDT
This is a known issue that will not be addressed in the 3.0 timeframe.
Comment 21 Ramakrishnan Periyasamy 2018-07-28 11:03:16 EDT
Moving this bug to verified state.

Automation regression tests passed. CI automation run link: https://polarion.engineering.redhat.com/polarion/#/project/CEPH/testrun?id=12_2_5-35_el7_sanity_fs_rhel-7_5-server-x86_64-released_Automated_Smoke_Runs
Comment 24 errata-xmlrpc 2018-09-26 14:16:44 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2819
