Bug 1332083

Summary: log [ERR] : OSD full dropping all updates 99% full followed by FAILED assert(0 == "unexpected error") : ENOSPC handling not implemented
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Vikhyat Umrao <vumrao>
Component: RADOS
Assignee: David Zafman <dzafman>
Status: CLOSED ERRATA
QA Contact: David Zafman <dzafman>
Severity: high
Docs Contact: Bara Ancincova <bancinco>
Priority: high
Version: 1.2.3
CC: bengland, ceph-eng-bugs, dzafman, hnallurv, jbuchta, jdurgin, kchai, kdreyer, tpetr, vumrao
Target Milestone: rc   
Target Release: 3.0   
Hardware: x86_64   
OS: Linux   
Fixed In Version: RHEL: ceph-12.1.4-1.el7cp Ubuntu: ceph_12.1.4-2redhat1xenial
Doc Type: Bug Fix
Doc Text:
.Improvements in handling of full OSDs
When an OSD disk became so full that the OSD could not function, the OSD terminated unexpectedly with a confusing assert message. With this update:
* The error message has been improved.
* By default, no more than 25% of OSDs are automatically marked as `out`.
* The `statfs` calculation in the FileStore and BlueStore back ends has been improved to better reflect disk usage.
As a result, OSDs are less likely to become full, and if they do, a more informative error message is added to the log.
Last Closed: 2017-12-05 23:29:38 UTC
Type: Bug
Bug Blocks: 1420417, 1494421    

Description Vikhyat Umrao 2016-05-02 07:18:16 UTC
Description of problem:

log [ERR] : OSD full dropping all updates 99% full followed by FAILED assert(0 == "unexpected error") : ENOSPC handling not implemented

- Here we need to find the root cause: how did the OSD fill up past the default *full* ratio, which is 97% for an OSD (osd_failsafe_full_ratio)?

- What caused the OSD to go beyond 97% full and then crash with FAILED assert(0 == "unexpected error") : ENOSPC handling not implemented?

- And why did the OSD fail to start after that, given that the OSD device had no free space left? (See the commands below for confirming the no-space condition.)
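
For reference, a quick way to confirm the no-space condition; the OSD ID and data path here are placeholders for whichever OSD crashed:

    # On the OSD host, check the OSD data partition:
    df -h /var/lib/ceph/osd/ceph-0

    # Cluster-side view of full and near-full OSDs:
    ceph health detail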


1. mon_osd_full_ratio (default: 0.95):
When any OSD reaches this threshold, the monitor marks the cluster as 'full' and client writes are no longer accepted.
The reported cluster has the default value: "mon_osd_full_ratio": "0.95",

2. mon_osd_nearfull_ratio (default: 0.85):
When any OSD reaches this threshold, the cluster goes HEALTH_WARN and calls out the near-full OSDs. This can be verified in the ceph -s output.
The reported cluster has the default value: "mon_osd_nearfull_ratio": "0.85",

3. osd_backfill_full_ratio (default: 0.85):
When an OSD locally reaches this threshold, it refuses to migrate a PG to itself. This prevents rebalancing or repair from overfilling an OSD.
It should be lower than osd_failsafe_full_ratio.
The reported cluster has the default value: "osd_backfill_full_ratio": "0.85",

4. osd_failsafe_full_ratio (default: 0.97):
A final sanity check that makes the OSD drop writes when it is very close to full.
The reported cluster has the default value: "osd_failsafe_full_ratio": "0.97",

5. osd_failsafe_nearfull_ratio (default: 0.90):
When an OSD reaches this threshold, it starts logging near-full warnings for that particular OSD. In practice the cluster hits mon_osd_nearfull_ratio first, since its default is the lower value of 0.85.
The reported cluster has the default value: "osd_failsafe_nearfull_ratio": "0.9",

Options 1 and 2 are cluster-wide settings; options 3, 4, and 5 apply to individual OSDs. A sketch of how to inspect these values at runtime is shown below.
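
A minimal sketch of inspecting these thresholds on a running cluster via the admin socket, assuming a monitor named after the local short hostname and an OSD with ID 0 (adjust the daemon names to your environment; on older releases use the ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok form instead):

    # Cluster-wide ratios (options 1 and 2), from the monitor:
    ceph daemon mon.$(hostname -s) config get mon_osd_full_ratio
    ceph daemon mon.$(hostname -s) config get mon_osd_nearfull_ratio

    # Per-OSD ratios (options 3, 4, and 5), from the OSD admin socket:
    ceph daemon osd.0 config get osd_backfill_full_ratio
    ceph daemon osd.0 config get osd_failsafe_full_ratio
    ceph daemon osd.0 config get osd_failsafe_nearfull_ratio

    # To raise the failsafe temporarily on running OSDs (use with caution):
    ceph tell osd.* injectargs '--osd-failsafe-full-ratio 0.98'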



Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 1.2.3
ceph-0.80.8-5.el7cp

Comment 46 Josh Durgin 2017-09-19 05:24:03 UTC
The piece to verify here is that recovery does not overfill the OSDs.

To reproduce, you can set the 'mon osd full threshold' on the monitor to a low value, e.g. 0.1 (10%), fill the cluster up close to that point, and then mark one of the OSDs out. The other OSDs will start recovering. Once an OSD reaches 10% full, it will be marked as full in the osdmap, and subsequent recovery operations should stall in the recovery_toofull state.

Increasing the full ratio again to the default of 0.95 should let recovery complete.
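
For illustration, one possible command sequence for this verification on a Luminous-based build; the pool name 'rbd' and OSD ID 0 are placeholders:

    # Lower the cluster full threshold to 10%:
    ceph osd set-full-ratio 0.1

    # Fill the cluster close to that point, e.g. with rados bench:
    rados -p rbd bench 60 write --no-cleanup

    # Mark one OSD out so the others start recovering:
    ceph osd out 0

    # Watch for OSDs flagged full and PGs stalled in recovery_toofull:
    ceph osd dump | grep full_ratio
    ceph health detail

    # Restore the default ratio so recovery can complete:
    ceph osd set-full-ratio 0.95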

Comment 53 errata-xmlrpc 2017-12-05 23:29:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3387