Bug 1367797 - [DOCS][RFE] Request to add section into hardware guide addressing proper failure domain configuration for Ceph monitors
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Documentation
Version: 1.3.2
Hardware: x86_64
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Target Release: 1.3.3
Assignee: Bara Ancincova
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-17 13:53 UTC by Mike Hackett
Modified: 2019-11-14 08:58 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-12-02 11:04:36 UTC
Target Upstream Version:



Description Mike Hackett 2016-08-17 13:53:03 UTC
Description of problem:
We have had several support cases where a customer's site took a power hit and the Ceph cluster was unable to recover because of LevelDB corruption on the monitors. In these cases the failure domains for the monitor nodes were not configured properly, or no battery backup was available while write-back cache was in use.

We could update section "2.3.7. Additional Considerations" in the Hardware Guide to recommend separate power feeds to customers' racks to prevent such issues, or to note battery backup as an alternative.

Version-Release number of selected component (if applicable):
1.3.x

Comment 2 Mike Hackett 2016-08-17 13:55:05 UTC
KCS https://access.redhat.com/solutions/2518281 is a WIP for recovering from this issue.

Comment 4 Federico Lucifredi 2016-09-15 10:18:08 UTC
This bug requires no QE verification. 

We want the new hardware guidelines to be reviewed by Kyle Bader if he is not the originator already.

Comment 9 Kyle Bader 2016-10-05 18:13:57 UTC
The problem isn't that customers failed to put monitors in different failure domains. The problem is customers putting the monitor stores on volatile media. Monitor stores should be on SSDs that ensure data integrity during power loss (hint: use Intel DC series). If they are putting them on spinning media, then the disk write cache should be disabled, and any RAID controller should either be battery backed or have a supercap.
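
To illustrate the write-cache point, here is a minimal Python sketch that lists block devices the kernel reports as running in write-back mode. It rests on several assumptions: a Linux kernel that exposes the `queue/write_cache` sysfs attribute (present in newer kernels, possibly absent on older ones), and a hypothetical helper name `volatile_write_cache_devices`. It only reflects the kernel's view of the device and says nothing about a RAID controller's own cache or whether that cache is battery or supercap protected.

```python
from pathlib import Path


def volatile_write_cache_devices(sysfs_block="/sys/block"):
    """Return names of block devices whose reported cache mode is 'write back'.

    Assumes each device directory may contain queue/write_cache holding
    either 'write back' or 'write through'. Devices without the attribute
    (for example on older kernels) are skipped, not flagged.
    """
    flagged = []
    for dev in sorted(Path(sysfs_block).iterdir()):
        attr = dev / "queue" / "write_cache"
        try:
            mode = attr.read_text().strip()
        except OSError:
            # Attribute missing or unreadable: nothing to report for this device.
            continue
        if mode == "write back":
            flagged.append(dev.name)
    return flagged


if __name__ == "__main__":
    for name in volatile_write_cache_devices():
        print(f"{name}: write-back cache enabled, check power-loss protection")
```

A device flagged here is only a candidate for review; whether it is safe for a monitor store still depends on the hardware's power-loss protection.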

Comment 12 Kyle Bader 2016-12-01 17:55:45 UTC
Use SSDs for monitor stores

The monitor store can generate a significant amount of I/O, making SSDs an ideal choice of storage media. To ensure data integrity during power loss, all caches in the data path need to either be disabled or safeguarded by hardware mechanisms like battery backup units or super capacitors coupled with non-volatile stores.

