Bug 1367797

Summary: [DOCS][RFE] Request to add section into hardware guide addressing proper failure domain configuration for Ceph monitors
Product: Red Hat Ceph Storage Reporter: Mike Hackett <mhackett>
Component: Documentation Assignee: Bara Ancincova <bancinco>
Status: CLOSED CURRENTRELEASE QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: low Docs Contact:
Priority: low    
Version: 1.3.2 CC: asriram, flucifre, hnallurv, kbader, kdreyer, mhackett
Target Milestone: rc   
Target Release: 1.3.3   
Hardware: x86_64   
OS: Linux   
Last Closed: 2016-12-02 11:04:36 UTC Type: Bug

Description Mike Hackett 2016-08-17 13:53:03 UTC
Description of problem:
We have had several support cases where a customer's site lost power and the Ceph cluster was unable to recover because of LevelDB corruption on the monitors. In these cases, either the failure domains for the monitor nodes were not configured properly, or no battery backup was available while write-back cache was in use.

We could update section "2.3.7. Additional Considerations" in the Hardware Guide to recommend separate power feeds to customers' racks to prevent such issues, or to note battery backup as an alternative.

Version-Release number of selected component (if applicable):
1.3.x

Comment 2 Mike Hackett 2016-08-17 13:55:05 UTC
KCS https://access.redhat.com/solutions/2518281 is a WIP for recovering from this issue.

Comment 4 Federico Lucifredi 2016-09-15 10:18:08 UTC
This bug requires no QE verification. 

We want the new hardware guidelines to be reviewed by Kyle Bader if he is not the originator already.

Comment 9 Kyle Bader 2016-10-05 18:13:57 UTC
The problem isn't that customers failed to put monitors in different failure domains. The problem is that customers put the monitor stores on volatile media. Monitor stores should be on SSDs that ensure data integrity during power loss (hint: use Intel DC series). If the stores are on spinning media, then the disk write cache should be disabled, and any RAID controller should either be battery-backed or have a supercap.

Comment 12 Kyle Bader 2016-12-01 17:55:45 UTC
Use SSDs for monitor stores

The monitor store can generate a significant amount of IO, making SSDs an ideal choice of storage media. To ensure data integrity during power loss, all caches in the data path need to either be disabled or safeguarded by hardware mechanisms like battery backup units or super capacitors coupled with non-volatile stores.
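
As a rough illustration of the cache check described above, here is a minimal Python sketch that inspects the Linux sysfs queue attributes `rotational` and `write_cache` for one block device. The function name `audit_block_device` and the warning wording are made up for this example. It assumes a kernel that exposes `queue/write_cache` and that the operator already knows which device holds the monitor store. It only sees what the kernel reports for the block device, so it cannot detect RAID controller caches or confirm that a battery backup unit or supercapacitor is present.

```python
from pathlib import Path


def audit_block_device(name, sysfs_root="/sys/block"):
    """Return a list of warnings for one block device name, e.g. 'sda'.

    Hypothetical helper for illustration only; it reads sysfs attributes
    and never changes any device setting.
    """
    queue = Path(sysfs_root) / name / "queue"
    warnings = []

    def read(attr):
        # Return the stripped attribute value, or None if it is not exposed.
        path = queue / attr
        return path.read_text().strip() if path.is_file() else None

    rotational = read("rotational")    # "1" for spinning media, "0" otherwise
    write_cache = read("write_cache")  # "write back" or "write through"

    if rotational == "1":
        warnings.append(
            f"{name}: rotational media; an SSD is preferred for monitor stores"
        )
    if write_cache == "write back":
        warnings.append(
            f"{name}: write-back cache reported; disable it or confirm it is "
            "protected against power loss"
        )
    elif write_cache is None:
        warnings.append(
            f"{name}: write cache mode not exposed; check the device manually"
        )
    return warnings
```

Calling `audit_block_device("sda")` on a monitor node returns an empty list when the kernel reports a non-rotational device in write-through mode. Any warning is only a prompt to inspect the hardware, not proof of a misconfiguration.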