Bug 1494256 - [CephFS] Default load balancing is not sharing the load among 2 active MDSs
Summary: [CephFS] Default load balancing is not sharing the load among 2 active MDSs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: rc
Target Release: 3.1
Assignee: Yan, Zheng
QA Contact: Ramakrishnan Periyasamy
Docs Contact: Bara Ancincova
URL:
Whiteboard:
Depends On:
Blocks: 1494421 1584264
 
Reported: 2017-09-21 20:36 UTC by Ramakrishnan Periyasamy
Modified: 2018-09-26 18:18 UTC (History)
CC: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
.Load on MDS daemons is not always balanced fairly or evenly in multiple active MDS configurations
Previously, in certain cases, the MDS balancers offloaded too much metadata to another active daemon, or none at all. With this update to {product}, several balancer fixes and optimizations address the issue, so this is no longer a problem.
Clone Of:
Environment:
Last Closed: 2018-09-26 18:16:44 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1554593 0 high CLOSED Kernel and fuse client mount hangs forever 2021-09-09 13:31:14 UTC
Red Hat Product Errata RHBA-2018:2819 0 None None None 2018-09-26 18:18:17 UTC

Internal Links: 1554593

Description Ramakrishnan Periyasamy 2017-09-21 20:36:09 UTC
Description of problem:
The default MDS load balancing does not divide the load between the 2 active MDSs in the case of a single directory with multiple subdirectories and files.

Because of this, "MDS_SLOW_REQUEST" messages were observed in the "ceph -w" console output for mds.1, and the --admin-daemon commands showed too many objecter requests and inodes associated with mds.1.
================================
2017-09-21 19:56:04.444695 mon.magna028 [INF] Cluster is now healthy
2017-09-21 19:56:23.918095 mon.magna028 [WRN] Health check failed: 1 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-09-21 19:56:27.919163 mon.magna028 [WRN] Health check update: 39 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-09-21 19:56:39.638662 osd.1 [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.013676 secs
2017-09-21 19:56:39.638666 osd.1 [WRN] slow request 30.013676 seconds old, received at 2017-09-21 19:56:09.624941: osd_op(mds.354658.0:315906 2.1f 2:f9f21277:::201.0000369d:head [stat] snapc 0=[] ondisk+read+rwordered+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:40.638964 osd.1 [WRN] 16 slow requests, 5 included below; oldest blocked for > 31.013907 secs
2017-09-21 19:56:40.638969 osd.1 [WRN] slow request 30.614480 seconds old, received at 2017-09-21 19:56:10.024368: osd_op(mds.1.8333:761150 2.1f 2:f9f21277:::201.0000369d:head [write 779414~1869 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:40.638976 osd.1 [WRN] slow request 30.606327 seconds old, received at 2017-09-21 19:56:10.032521: osd_op(mds.1.8333:761151 2.1f 2:f9f21277:::201.0000369d:head [write 781283~9194 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:40.638978 osd.1 [WRN] slow request 30.599890 seconds old, received at 2017-09-21 19:56:10.038959: osd_op(mds.1.8333:761152 2.1f 2:f9f21277:::201.0000369d:head [write 790477~1871 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:40.638980 osd.1 [WRN] slow request 30.599818 seconds old, received at 2017-09-21 19:56:10.039030: osd_op(mds.1.8333:761153 2.1f 2:f9f21277:::201.0000369d:head [write 792348~2698 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:40.638983 osd.1 [WRN] slow request 30.599748 seconds old, received at 2017-09-21 19:56:10.039101: osd_op(mds.1.8333:761154 2.1f 2:f9f21277:::201.0000369d:head [write 795046~1871 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:35.147818 mon.magna028 [WRN] MDS health message (mds.1): 665 slow requests are blocked > 30 sec
2017-09-21 19:56:35.147850 mon.magna028 [INF] daemon mds.magna063 is now active in filesystem cephfs as rank 1
2017-09-21 19:56:35.239621 mon.magna028 [WRN] Health check failed: 1 MDSs report slow requests (MDS_SLOW_REQUEST)
2017-09-21 19:56:37.921894 mon.magna028 [WRN] Health check update: 38 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-09-21 19:56:41.922931 mon.magna028 [WRN] Health check update: 76 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-09-21 19:56:41.950431 mon.magna028 [WRN] overall HEALTH_WARN 1 MDSs report slow requests; 38 slow requests are blocked > 32 sec
2017-09-21 19:56:41.639208 osd.1 [WRN] 17 slow requests, 5 included below; oldest blocked for > 32.014169 secs
2017-09-21 19:56:41.639213 osd.1 [WRN] slow request 30.232190 seconds old, received at 2017-09-21 19:56:11.406920: osd_op(mds.1.8333:761167 2.1f 2:f9f21277:::201.0000369d:head [write 858188~1785849 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:41.639215 osd.1 [WRN] slow request 31.599940 seconds old, received at 2017-09-21 19:56:10.039170: osd_op(mds.1.8333:761155 2.1f 2:f9f21277:::201.0000369d:head [write 796917~1871 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:41.639218 osd.1 [WRN] slow request 31.592728 seconds old, received at 2017-09-21 19:56:10.046382: osd_op(mds.1.8333:761156 2.1f 2:f9f21277:::201.0000369d:head [write 798788~24897 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:41.639220 osd.1 [WRN] slow request 31.587236 seconds old, received at 2017-09-21 19:56:10.051874: osd_op(mds.1.8333:761157 2.1f 2:f9f21277:::201.0000369d:head [write 823685~8443 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
2017-09-21 19:56:41.639235 osd.1 [WRN] slow request 31.587170 seconds old, received at 2017-09-21 19:56:10.051940: osd_op(mds.1.8333:761158 2.1f 2:f9f21277:::201.0000369d:head [write 832128~1379 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e1393) currently waiting for rw locks
===========================================
mds.0 objecter_requests and total inode count.
=============================================
[ubuntu@host058 ~]$ sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host058.asok dump tree | grep \"ino\" | wc -l | uniq
1978
[ubuntu@host058 ~]$ sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host058.asok  objecter_requests
{
    "ops": [],
    "linger_ops": [],
    "pool_ops": [],
    "pool_stat_ops": [],
    "statfs_ops": [],
    "command_ops": []
}
[ubuntu@magna058 ~]$ 
============================================
mds.1 objecter_requests and total inode count
===========================================
[ubuntu@host063 ~]$ sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host063.asok dump tree | grep \"ino\" | wc -l | uniq
100906
[ubuntu@host063 ~]$ sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host063.asok objecter_requests
{
    "ops": [
        {
            "tid": 907210,
            "pg": "1.92928168",
            "osd": 0,
            "object_id": "100006f2a77.00000000",
            "object_locator": "@1",
            "target_object_id": "100006f2a77.00000000",
            "target_object_locator": "@1",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "272328s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "1=[]",
            "mtime": "2017-09-21 20:11:37.0.024656s",
            "osd_ops": [
                "delete"
            ]
        },
        {
            "tid": 907199,
            "pg": "1.a22128d3",
            "osd": 1,
            "object_id": "100006f2a89.00000000",
            "object_locator": "@1",
            "target_object_id": "100006f2a89.00000000",
            "target_object_locator": "@1",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "272328s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "1=[]",
            "mtime": "2017-09-21 20:11:36.0.889298s",
            "osd_ops": [
                "delete"
            ]
        },
        {
            "tid": 907201,
            "pg": "1.f83c78ac",
            "osd": 3,
            "object_id": "100006f2a46.00000000",
            "object_locator": "@1",
            "target_object_id": "100006f2a46.00000000",
            "target_object_locator": "@1",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "272328s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "1=[]",
            "mtime": "2017-09-21 20:11:36.0.900143s",
            "osd_ops": [
                "delete"
            ]
        },
        {
            "tid": 907206,
            "pg": "1.75a3d4ec",
            "osd": 3,
            "object_id": "100006f2a84.00000000",
            "object_locator": "@1",
            "target_object_id": "100006f2a84.00000000",
            "target_object_locator": "@1",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "272328s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "1=[]",
            "mtime": "2017-09-21 20:11:36.0.94171s",
            "osd_ops": [
                "delete"
            ]
        },
        {
            "tid": 907209,
            "pg": "2.391b2878",
            "osd": 4,
            "object_id": "201.0000371e",
            "object_locator": "@2",
            "target_object_id": "201.0000371e",
            "target_object_locator": "@2",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "272328s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "0=[]",
            "mtime": "2017-09-21 20:11:37.0.017147s",
            "osd_ops": [
                "delete"
            ]
        },
        {
            "tid": 907211,
            "pg": "1.892c3cb8",
            "osd": 4,
            "object_id": "100006f2977.00000000",
            "object_locator": "@1",
            "target_object_id": "100006f2977.00000000",
            "target_object_locator": "@1",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "272328s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "1=[]",
            "mtime": "2017-09-21 20:11:37.0.024859s",
            "osd_ops": [
                "delete"
            ]
        },
        {
            "tid": 907204,
            "pg": "1.cd7e27cf",
            "osd": 6,
            "object_id": "100006f2a4a.00000000",
            "object_locator": "@1",
            "target_object_id": "100006f2a4a.00000000",
            "target_object_locator": "@1",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "272328s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "1=[]",
            "mtime": "2017-09-21 20:11:36.0.905073s",
            "osd_ops": [
                "delete"
            ]
        },
        {
            "tid": 907208,
            "pg": "1.4ad7d6a7",
            "osd": 6,
            "object_id": "100006f2792.00000000",
            "object_locator": "@1",
            "target_object_id": "100006f2792.00000000",
            "target_object_locator": "@1",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "272328s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "1=[]",
            "mtime": "2017-09-21 20:11:36.0.941854s",
            "osd_ops": [
                "delete"
            ]
        },
        {
            "tid": 907212,
            "pg": "1.47544307",
            "osd": 6,
            "object_id": "100006f2a70.00000000",
            "object_locator": "@1",
            "target_object_id": "100006f2a70.00000000",
            "target_object_locator": "@1",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "272328s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "1=[]",
            "mtime": "2017-09-21 20:11:37.0.024935s",
            "osd_ops": [
                "delete"
            ]
        }
    ],
    "linger_ops": [],
    "pool_ops": [],
    "pool_stat_ops": [],
    "statfs_ops": [],
    "command_ops": []
}
[ubuntu@host063 ~]$ 
======================================================

Scenario:
Created a parent directory (dir1) with 3 subdirectories (dir1, dir2, and dir3) under it.

Started I/O from 3 different clients (2 kernel and 1 FUSE) against the above 3 subdirectories.
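
For illustration, a minimal sketch of this setup, assuming the file system is mounted at /mnt/cephfs on each client (the client host names, mount path, file count, and file sizes are placeholders, not values from the test run):

[ubuntu@client1 ~]$ mkdir -p /mnt/cephfs/dir1/dir1 /mnt/cephfs/dir1/dir2 /mnt/cephfs/dir1/dir3
# from each of the 3 clients, write into one of the subdirectories, for example:
[ubuntu@client1 ~]$ for i in $(seq 1 1000); do dd if=/dev/zero of=/mnt/cephfs/dir1/dir1/file_$i bs=4M count=16 2>/dev/null; done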

Version-Release number of selected component (if applicable):
ceph version 12.2.0-1.el7cp (b661348f156f148d764b998b65b90451f096cb27) luminous (rc)

How reproducible:
5/5

Steps to Reproduce:
1. Configure a CephFS mount with 2 active MDSs and at least 3 clients (see the sketch after this list).
2. Create a directory and 3 subdirectories.
3. Run I/O from the 3 clients in the directories created above.
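
For reference, a minimal sketch of requesting two active MDS daemons before the test, for the file system named cephfs shown in the logs above (on this Luminous build the allow_multimds file system flag may also need to be enabled first):

$ sudo ceph fs set cephfs max_mds 2      # request a second active MDS rank
$ sudo ceph fs status cephfs             # confirm that both rank 0 and rank 1 are active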

Actual results:
NA

Expected results:
NA

Additional info:
This behaviour was different when 50,000 directories and 1 million files were created directly on the CephFS mount.

In that case, the number of inodes was almost equal across the active MDSs.

Comment 2 Ramakrishnan Periyasamy 2017-09-21 20:40:37 UTC
Pinning details:

No pinning was done.

[ubuntu@host058 ~]$ sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host058.asok  get subtrees | grep \"export_pin\"\:\ 1 | wc -l | uniq
2

These 2 pins are for the following stray paths:

"path": "~mds0/stray5/100000003ec"
"path": "~mds0/stray3/1000001d849"

[ubuntu@host058 ~]$ sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host058.asok  get subtrees | grep \"export_pin\"\:\ 0 | wc -l | uniq
0

[ubuntu@host058 ~]$ sudo ceph --admin-daemon /var/run/ceph/qetest-mds.host058.asok  get subtrees | grep \"export_pin\"\:\ -1 | wc -l | uniq
20
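
For comparison, subtrees can also be distributed manually by pinning a directory to a specific MDS rank with the ceph.dir.pin virtual extended attribute (not done in this test); a minimal sketch, assuming the file system is mounted at /mnt/cephfs:

[ubuntu@client1 ~]$ sudo setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/dir1/dir2    # pin this subtree to rank 1
[ubuntu@client1 ~]$ sudo setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/dir1/dir2   # clear the pin and return to default balancing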

Comment 5 Brett Niver 2017-10-03 17:54:55 UTC
This is a known issue that will not be addressed in the 3.0 timeframe.

Comment 21 Ramakrishnan Periyasamy 2018-07-28 15:03:16 UTC
Moving this bug to verified state.

Automation regression tests passed, CI automation run link: https://polarion.engineering.redhat.com/polarion/#/project/CEPH/testrun?id=12_2_5-35_el7_sanity_fs_rhel-7_5-server-x86_64-released_Automated_Smoke_Runs

Comment 24 errata-xmlrpc 2018-09-26 18:16:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2819

