Bug 1593322 - load might become zero on a MDS in multi-MDS setup
Summary: load might become zero on a MDS in multi-MDS setup
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: CephFS
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: z5
Target Release: 3.0
Assignee: Patrick Donnelly
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-06-20 14:39 UTC by Ram Raja
Modified: 2018-08-09 18:27 UTC (History)
4 users

Fixed In Version: RHEL: ceph-12.2.4-32 Ubuntu: ceph_12.2.4-36redhat1xenial
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-08-09 18:27:11 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Ceph Project Bug Tracker 24538 None None None 2018-06-20 14:39:38 UTC
Red Hat Product Errata RHBA-2018:2375 None None None 2018-08-09 18:27:42 UTC

Description Ram Raja 2018-06-20 14:39:39 UTC
Description of problem:

Reported by a community developer,
"
Recently we found that an MDS's load might become zero on another MDS in a multi-MDS scenario. The Ceph version is Luminous.

In the log below, MDS.1 computed its load and sent it to MDS.0.
```
2018-05-13 17:20:31.252804 7ffa471d7700  0 mds.1.bal mds.1 epoch 17 load mdsload<[0,6245.74 12491.5]/[0,33581.5 67163], req 13440, hr 0, qlen 18, cpu 0.04>
``` 

When MDS.0 handled the heartbeat and read MDS.1's load from the message, the load had become zero.
```
2018-05-13 17:20:30.988828 7f65d8b45700  0 mds.0.bal mds.0 epoch 17 load mdsload<[5580.31,18907.2 43394.8]/[33213.8,109221 251656], req 42543, hr 0, qlen 36, cpu 0.75>
2018-05-13 17:20:31.280096 7f65db34a700  0 mds.0.bal   mds.0 mdsload<[5580.31,18907.2 43394.8]/[33213.8,109221 251656], req 42543, hr 0, qlen 36, cpu 0.75> = 127950 ~ 43394.8
2018-05-13 17:20:31.280113 7f65db34a700  0 mds.0.bal   mds.1 mdsload<[0,0 0]/[0,0 0], req 13440, hr 0, qlen 18, cpu 0.04> = 37045.8 ~ 12564.2
```

We found that last_decay in this message is 0 (utime_t()), so the elapsed time is very large and the original value is decayed to zero. I think we should not decay any value if last_decay is utime_t()."

Comment 14 errata-xmlrpc 2018-08-09 18:27:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2375

