Bug 1327141 - Sometimes OSDs are wrongly marked down
Summary: Sometimes OSDs are wrongly marked down
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 2.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: 2.0
Assignee: Samuel Just
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-04-14 10:32 UTC by Tanay Ganguly
Modified: 2017-07-30 15:16 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-04 07:49:29 UTC
Embargoed:


Attachments

Description Tanay Ganguly 2016-04-14 10:32:43 UTC
Description of problem:
Sometimes OSDs are wrongly marked down, which triggers a cluster data rebalance.

Version-Release number of selected component (if applicable):
10.1.1-1

How reproducible:
I have seen it a couple of times in my setup.

Steps to Reproduce:
There are no defined steps. Sometimes, while running write I/O, I see one of the OSDs get marked down, and a cluster rebalance starts among the remaining OSDs.


Actual results:
The OSD is wrongly marked down.

Expected results:
The OSD should not be marked down.

Additional info:
Logs attached (OSD and MON logs).
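
For reference, a rough sketch of the commands I used to confirm the symptom (osd ids and output are from my setup):

  # overall cluster state; shows degraded/rebalancing status
  ceph -s
  ceph health detail

  # which OSDs are currently reported down
  ceph osd tree | grep -w down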

Comment 2 Tanay Ganguly 2016-04-14 10:36:31 UTC
I will try to reproduce with more debug logging enabled.
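
Roughly what I plan to use to raise the log levels at runtime (the exact levels are a guess at what will be useful):

  # bump OSD and messenger debugging on all OSDs without a restart
  ceph tell osd.* injectargs '--debug-osd 20 --debug-ms 1'

  # equivalent persistent settings under [osd] in ceph.conf:
  #   debug osd = 20
  #   debug ms = 1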

Packages:
python-cephfs-10.1.1-1.el7cp.x86_64
ceph-selinux-10.1.1-1.el7cp.x86_64
ceph-mon-10.1.1-1.el7cp.x86_64
ceph-base-10.1.1-1.el7cp.x86_64
ceph-10.1.1-1.el7cp.x86_64
ceph-release-1-1.el7.noarch
libcephfs1-10.1.1-1.el7cp.x86_64
ceph-osd-10.1.1-1.el7cp.x86_64
ceph-common-10.1.1-1.el7cp.x86_64
ceph-mds-10.1.1-1.el7cp.x86_64

Selinux is permissive on all the Nodes.

Comment 3 Samuel Just 2016-04-14 16:19:56 UTC
There are several things that can cause this:
1) If a thread on an OSD hangs for long enough (on a filesystem call, for example), the OSD can be reported as down.  If it recovers quickly enough, you'll see a "wrongly marked down" message.
2) If networking is flaky between the nodes, nodes might mark each other down while the heartbeats are failing.

1. is usually the culprit.  On what hardware is this happening?
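
In the meantime, something like the following can help narrow it down (osd.0 is just a placeholder id; run the daemon command on the host where that OSD lives, and the log path assumes the default /var/log/ceph layout):

  # did the OSD itself notice it was wrongly failed after it recovered?
  grep "wrongly marked me down" /var/log/ceph/ceph-osd.*.log

  # how many seconds of missed heartbeats peers tolerate before reporting the OSD down
  ceph daemon osd.0 config get osd_heartbeat_grace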

Comment 4 David Zafman 2016-04-14 18:23:25 UTC
You should confirm that the OSD is indeed getting marked out, because by default an OSD would have to be down for 5 minutes (mon_osd_down_out_interval) before a rebalance would even start.  So my guess is that there is a severe hardware (disk/network) problem, or an out-of-memory condition causing thrashing to swap.
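
A quick way to check what is actually in effect (the mon id below is a placeholder; run it on a monitor host, and osd.3 is likewise a placeholder id):

  # seconds an OSD must stay down before it is marked out and rebalancing starts (5 minutes by default, per the above)
  ceph daemon mon.$(hostname -s) config get mon_osd_down_out_interval

  # confirm the OSD's up/down and in/out state
  ceph osd dump | grep '^osd.3 '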

Comment 5 Tanay Ganguly 2016-04-15 06:05:07 UTC
(In reply to Samuel Just from comment #3)
> There are several things that can cause this:
> 1) If a thread on an OSD hangs for long enough (on a filesystem call, for
> example), the OSD can be reported as down.  If it recovers quickly enough,
> you'll see a "wrongly marked down" message.

OK, I am trying to reproduce it with more debug logging enabled.

> 2) If networking is flaky between the nodes, nodes might mark each other
> down while the heartbeats are failing.

I don't think so. The network is good, as this is hosted in the Bangalore lab, and I tested 1.3.2 on the same machines.

> 
> 1. is usually the culprit.  On what hardware is this happening?

Actually, I have seen it on both local hardware (Bangalore lab) and Magna servers.

Bangalore Machine Configuration:
128 GB RAM
12 CPU cores
Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Disks are 1.2 TB, 10k RPM SAS drives

I doubt that hardware is the issue.
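
For completeness, basic host-level checks along these lines would rule out the usual suspects (standard Linux tools; <peer-osd-host> is a placeholder):

  # disk or controller errors
  dmesg -T | grep -iE 'error|fail'

  # memory pressure / swapping
  free -m; vmstat 1 5

  # connectivity between OSD hosts on the cluster network
  ping -c 5 <peer-osd-host>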

Comment 6 Samuel Just 2016-04-21 14:49:15 UTC
Marking as Need More Info until we get logs.

Comment 7 Tanay Ganguly 2016-05-04 07:49:29 UTC
Sam,
I am unable to reproduce this bug again.

If I am able to reproduce it, I will file a new one.
Until then, I am closing this bug.

