Description of problem:
Sometimes OSDs are wrongly getting marked down, triggering a cluster data rebalance.

Version-Release number of selected component (if applicable):
10.1.1.1

How reproducible:
I have seen it a couple of times in my setup.

Steps to Reproduce:
There are no defined steps. Sometimes while writing I/O, one of the OSDs goes down and a cluster rebalance starts among the remaining OSDs.

Actual results:
OSD is wrongly marked down.

Expected results:
OSD should not be marked down.

Additional info:
Logs attached (OSD and MON logs).
I will try to reproduce with more debug logging enabled.

Packages:
python-cephfs-10.1.1-1.el7cp.x86_64
ceph-selinux-10.1.1-1.el7cp.x86_64
ceph-mon-10.1.1-1.el7cp.x86_64
ceph-base-10.1.1-1.el7cp.x86_64
ceph-10.1.1-1.el7cp.x86_64
ceph-release-1-1.el7.noarch
libcephfs1-10.1.1-1.el7cp.x86_64
ceph-osd-10.1.1-1.el7cp.x86_64
ceph-common-10.1.1-1.el7cp.x86_64
ceph-mds-10.1.1-1.el7cp.x86_64

SELinux is permissive on all the nodes.
There are several things that can cause this:

1) If a thread on an OSD hangs for long enough (on a filesystem call, for example), the OSD can be reported as down. If it recovers quickly enough, you'll see a "wrongly marked down" message.

2) If networking is flaky between the nodes, nodes might mark each other down while heartbeats are failing.

1. is usually the culprit. On what hardware is this happening?
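A quick way to tell the two causes apart is to grep the OSD logs for their distinct log messages, and to bump the debug levels the reporter mentions. This is a sketch; the log path and OSD id are assumptions to adjust for your cluster.

```shell
# Assumption: OSD 0's log lives at the default path on this node.
LOG=/var/log/ceph/ceph-osd.0.log

# Cause 1: the OSD itself reports it was wrongly marked down after a stall.
grep -c "wrongly marked me down" "$LOG"

# Cause 2: peers stopped getting heartbeat replies (flaky network).
grep -c "heartbeat_check: no reply" "$LOG"

# Raise OSD and messenger debug levels while trying to reproduce:
ceph tell osd.\* injectargs '--debug-osd 20 --debug-ms 1'
```

If the first grep matches but the second does not, a hung thread (cause 1) is the more likely explanation.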
You should confirm that the OSD is indeed getting marked out, because by default an OSD has to stay down for 5 minutes (mon_osd_down_out_interval) before a rebalance even starts. So my guess is that there is a severe hardware (disk/network) problem, or an out-of-memory condition causing thrashing to swap.
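To verify the down-to-out timing before blaming rebalance, you can read the setting from the monitor's admin socket, and temporarily prevent down OSDs from being marked out at all while investigating. A sketch, assuming a monitor named mon.a:

```shell
# Read the effective down->out interval (default 300 seconds).
ceph daemon mon.a config get mon_osd_down_out_interval

# While debugging, keep down OSDs from being marked out, so a flapping
# OSD cannot trigger a rebalance. Remember to unset it afterwards.
ceph osd set noout
# ... investigate the flapping OSD ...
ceph osd unset noout
```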
(In reply to Samuel Just from comment #3)
> There are several things that can cause this:
> 1) If a thread on an OSD hangs for long enough (on a filesystem call, for
> example), the OSD can be reported as down. If it recovers quickly enough,
> you'll see a "wrongly marked down" message.

OK, I am trying to reproduce it with more debug logs.

> 2) If networking is flaky between the nodes, nodes might mark each other
> down while the heartbeats are failing.

I don't think so; the network is good, as this is hosted in the Bangalore lab, and on the same machine I tested 1.3.2.

> 1. is usually the culprit. On what hardware is this happening?

Actually, I have seen it on both local hardware (Bangalore lab) and Magna servers.

Bangalore machine configuration:
- 128 GB RAM
- CPU: 12 cores, Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
- Disks: 1.2 TB drives, 10k rpm SAS

I doubt hardware is an issue.
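Even on healthy-looking hardware, the disk/memory theory from comment #4 can be checked with generic host-side tools on the OSD node. These are ordinary Linux commands, not Ceph-specific:

```shell
# Swap thrashing: sustained nonzero si/so columns point at memory pressure.
vmstat 1 5

# Per-disk latency: await spikes on the OSD data disk suggest a hung
# filesystem call (cause 1 in comment #3). Requires the sysstat package.
iostat -x 1 5

# Kernel-level disk errors or hung-task warnings around the incident time:
dmesg | grep -iE "hung task|blocked for more than|I/O error" | tail
```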
Marking as Need More Info until we get logs.
Sam, I am unable to reproduce this bug again. If I am able to reproduce it, I will file a new one. Until then, I am closing this bug.