Description of problem:
Sometimes OSDs are wrongly getting marked down, triggering a cluster data rebalance.

Version-Release number of selected component (if applicable):
10.1.1.1

How reproducible:
I have seen it a couple of times in my setup.

Steps to Reproduce:
There are no defined steps. Sometimes while writing I/O, one of the OSDs goes down and a cluster rebalance starts among the remaining OSDs.

Actual results:
OSD is wrongly marked down.

Expected results:
OSD should not be marked down.

Additional info:
Logs attached (OSD and MON logs).
I will try to reproduce with more debug logging enabled.

Packages:
python-cephfs-10.1.1-1.el7cp.x86_64
ceph-selinux-10.1.1-1.el7cp.x86_64
ceph-mon-10.1.1-1.el7cp.x86_64
ceph-base-10.1.1-1.el7cp.x86_64
ceph-10.1.1-1.el7cp.x86_64
ceph-release-1-1.el7.noarch
libcephfs1-10.1.1-1.el7cp.x86_64
ceph-osd-10.1.1-1.el7cp.x86_64
ceph-common-10.1.1-1.el7cp.x86_64
ceph-mds-10.1.1-1.el7cp.x86_64

SELinux is permissive on all the nodes.
There are several things that can cause this:

1) If a thread on an OSD hangs for long enough (on a filesystem call, for example), the OSD can be reported as down. If it recovers quickly enough, you'll see a "wrongly marked down" message.

2) If networking is flaky between the nodes, nodes might mark each other down while heartbeats are failing.

1. is usually the culprit. On what hardware is this happening?
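A quick way to tell the two causes apart is to grep the OSD logs for their distinct log messages, and to bump the debug levels the reporter mentions. This is a sketch; the log path and OSD id are assumptions to adjust for your cluster.

```shell
# Assumption: OSD 0's log lives at the default path on this node.
LOG=/var/log/ceph/ceph-osd.0.log

# Cause 1: the OSD itself reports it was wrongly marked down after a stall.
grep -c "wrongly marked me down" "$LOG"

# Cause 2: peers stopped getting heartbeat replies (flaky network).
grep -c "heartbeat_check: no reply" "$LOG"

# Raise OSD and messenger debug levels while trying to reproduce:
ceph tell osd.\* injectargs '--debug-osd 20 --debug-ms 1'
```

If the first grep matches but the second does not, a hung thread (cause 1) is the more likely explanation.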
You should confirm that the OSD is indeed getting marked out, because by default an OSD has to stay down for 5 minutes (mon_osd_down_out_interval) before a rebalance even starts. So my guess is that there is a severe hardware (disk/network) problem, or an out-of-memory condition causing thrashing to swap.
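To verify the down-to-out timing before blaming rebalance, you can read the setting from the monitor's admin socket, and temporarily prevent down OSDs from being marked out at all while investigating. A sketch, assuming a monitor named mon.a:

```shell
# Read the effective down->out interval (default 300 seconds).
ceph daemon mon.a config get mon_osd_down_out_interval

# While debugging, keep down OSDs from being marked out, so a flapping
# OSD cannot trigger a rebalance. Remember to unset it afterwards.
ceph osd set noout
# ... investigate the flapping OSD ...
ceph osd unset noout
```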
(In reply to Samuel Just from comment #3)
> There are several things that can cause this:
> 1) If a thread on an OSD hangs for long enough (on a filesystem call, for
> example), the OSD can be reported as down. If it recovers quickly enough,
> you'll see a "wrongly marked down" message.

OK, I am trying to reproduce it with more debug logs.

> 2) If networking is flaky between the nodes, nodes might mark each other
> down while the heartbeats are failing.

I don't think so; the network is good, as this is hosted in the Bangalore lab, and on the same machine I tested 1.3.2.

> 1. is usually the culprit. On what hardware is this happening?

Actually, I have seen it on both local hardware (Bangalore lab) and Magna servers.

Bangalore machine configuration:
- 128 GB RAM
- CPU: 12 cores, Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
- Disks: 1.2 TB drives, 10k rpm SAS

I doubt hardware is an issue.
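Even on healthy-looking hardware, the disk/memory theory from comment #4 can be checked with generic host-side tools on the OSD node. These are ordinary Linux commands, not Ceph-specific:

```shell
# Swap thrashing: sustained nonzero si/so columns point at memory pressure.
vmstat 1 5

# Per-disk latency: await spikes on the OSD data disk suggest a hung
# filesystem call (cause 1 in comment #3). Requires the sysstat package.
iostat -x 1 5

# Kernel-level disk errors or hung-task warnings around the incident time:
dmesg | grep -iE "hung task|blocked for more than|I/O error" | tail
```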
Marking as Need More Info until we get logs.
Sam, I am unable to reproduce this bug again. If I am able to reproduce it, I will file a new one. Until then, I am closing this bug.