| Summary: | Sometimes OSDs are getting wrongly marked Down | | |
|---|---|---|---|
| Product: | Red Hat Ceph Storage | Reporter: | Tanay Ganguly <tganguly> |
| Component: | RADOS | Assignee: | Samuel Just <sjust> |
| Status: | CLOSED WORKSFORME | QA Contact: | ceph-qe-bugs <ceph-qe-bugs> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 2.0 | CC: | ceph-eng-bugs, dzafman, hnallurv, kchai, kdreyer, kurs, tganguly |
| Target Milestone: | rc | | |
| Target Release: | 2.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-05-04 07:49:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
|
Description (Tanay Ganguly, 2016-04-14 10:32:43 UTC):
I will try to reproduce with more debug logging enabled.

Packages:
- python-cephfs-10.1.1-1.el7cp.x86_64
- ceph-selinux-10.1.1-1.el7cp.x86_64
- ceph-mon-10.1.1-1.el7cp.x86_64
- ceph-base-10.1.1-1.el7cp.x86_64
- ceph-10.1.1-1.el7cp.x86_64
- ceph-release-1-1.el7.noarch
- libcephfs1-10.1.1-1.el7cp.x86_64
- ceph-osd-10.1.1-1.el7cp.x86_64
- ceph-common-10.1.1-1.el7cp.x86_64
- ceph-mds-10.1.1-1.el7cp.x86_64

SELinux is permissive on all the nodes.

---

Samuel Just (comment #3):

There are several things that can cause this:

1) If a thread on an OSD hangs for long enough (on a filesystem call, for example), the OSD can be reported as down. If it recovers quickly enough, you'll see a "wrongly marked down" message.
2) If networking is flaky between the nodes, nodes might mark each other down while the heartbeats are failing.

1) is usually the culprit. On what hardware is this happening?

---

You should confirm that the OSD is indeed getting marked out, because by default an OSD would have to be down for 5 minutes (mon_osd_down_out_interval) before a rebalance would even start. So my guess is that there is a severe hardware (disk/network) problem, or the machine is out of memory and thrashing to swap.

---

Tanay Ganguly:

(In reply to Samuel Just from comment #3)
> There are several things that can cause this:
> 1) If a thread on an OSD hangs for long enough (on a filesystem call, for
> example), the OSD can be reported as down. If it recovers quickly enough,
> you'll see a "wrongly marked down" message.

Ok, I am trying to reproduce it with more debug logs.

> 2) If networking is flaky between the nodes, nodes might mark each other
> down while the heartbeats are failing.

I don't think so; the network is good, as this is hosted in the Bangalore lab, and I tested 1.3.2 on the same machines.

> 1) is usually the culprit. On what hardware is this happening?

Actually, I have seen it on both local hardware (Bangalore lab) and Magna servers.
Bangalore machine configuration:
- RAM: 128 GB
- CPU: 12 cores, Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
- Disks: 1.2 TB drives, 10k RPM SAS

I doubt hardware is an issue.

---

Marking as Need More Info until we get logs.

---

Sam, I am unable to reproduce this bug again. If I am able to reproduce it, I will file a new one. Until then I am closing this bug.
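As a footnote to the discussion above: when cause (1) is suspected, the OSD logs can be scanned for the "wrongly marked me down" message that an OSD emits after rejoining, to see which OSDs are flapping and how often. This is a minimal illustrative sketch, not part of the original bug report — the sample log lines and the `count_flapping_osds` helper are assumptions about typical Ceph OSD log output, not a verified format.

```python
import re
from collections import Counter

# Match lines where an OSD reports it was wrongly failed; the surrounding
# line format here is an assumption based on typical Ceph OSD logging.
WRONGLY_DOWN = re.compile(r"osd\.(\d+).*wrongly marked me down")

def count_flapping_osds(log_lines):
    """Return a Counter mapping OSD id -> number of 'wrongly marked down' events."""
    hits = Counter()
    for line in log_lines:
        m = WRONGLY_DOWN.search(line)
        if m:
            hits[int(m.group(1))] += 1
    return hits

# Hypothetical sample lines for illustration only:
sample = [
    "2016-04-14 10:31:02 osd.12 ... map e4520 wrongly marked me down",
    "2016-04-14 10:35:40 osd.12 ... map e4531 wrongly marked me down",
    "2016-04-14 10:36:01 osd.3 ... heartbeat_check: no reply from osd.12",
]
print(count_flapping_osds(sample))  # osd.12 flapped twice; the heartbeat line is not counted
```

An OSD that shows up repeatedly in such a count, without corresponding rebalancing, would fit the thread-hang scenario Samuel describes rather than a real failure.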