Bug 1332083 - log [ERR] : OSD full dropping all updates 99% full followed by FAILED assert(0 == "unexpected error") : ENOSPC handling not implemented
Summary: log [ERR] : OSD full dropping all updates 99% full followed by FAILED assert(...
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Version: 1.2.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 3.0
Assignee: David Zafman
QA Contact: David Zafman
Docs Contact: Bara Ancincova
URL:
Whiteboard:
Keywords:
Depends On:
Blocks: 1420417 1494421
 
Reported: 2016-05-02 07:18 UTC by Vikhyat Umrao
Modified: 2017-12-05 23:29 UTC (History)
CC List: 10 users

Doc Text:
.Improvements in handling of full OSDs

When an OSD disk became so full that the OSD could not function, the OSD terminated unexpectedly with a confusing assert message. With this update:

* The error message has been improved.
* By default, no more than 25% of OSDs are automatically marked as `out`.
* The `statfs` calculation in the FileStore and BlueStore back ends has been improved to better reflect disk usage.

As a result, OSDs are less likely to become full, and if they do, a more informative error message is added to the log.
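
The 25% cap above is controlled by the monitor option mon_osd_min_in_ratio, whose default is 0.75 in this release: the monitors stop automatically marking OSDs `out` once fewer than 75% of OSDs would remain `in`. A minimal way to check the effective value, assuming a monitor named mon.a with its admin socket on the local host (substitute your own daemon ID):

# query the running monitor through its admin socket
# (mon.a is an example daemon name)
ceph daemon mon.a config get mon_osd_min_in_ratio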
Clone Of:
Clones: 1420417 (view as bug list)
Last Closed: 2017-12-05 23:29:38 UTC




External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:3387 normal SHIPPED_LIVE Red Hat Ceph Storage 3.0 bug fix and enhancement update 2017-12-06 03:03:45 UTC
Ceph Project Bug Tracker 15912 None None None 2016-05-18 16:06 UTC

Description Vikhyat Umrao 2016-05-02 07:18:16 UTC
Description of problem:

log [ERR] : OSD full dropping all updates 99% full followed by FAILED assert(0 == "unexpected error") : ENOSPC handling not implemented

- We need to determine the root cause: the default *failsafe full* ratio for an OSD is 97%, so why did the OSD fill up past it?

- What caused the OSD to go beyond 97% and then crash with FAILED assert(0 == "unexpected error") : ENOSPC handling not implemented?

- And why did the OSD then fail to start, given that the OSD device had no free space left?


1. mon_osd_full_ratio (default: 0.95):
When any OSD reaches this threshold, the monitor marks the cluster as 'full' and client writes are no longer accepted.
The reported cluster uses the default: "mon_osd_full_ratio": "0.95"

2. mon_osd_nearfull_ratio (default: 0.85):
When any OSD reaches this threshold, the cluster enters HEALTH_WARN and calls out the near-full OSDs; this is visible in the ceph -s output.
The reported cluster uses the default: "mon_osd_nearfull_ratio": "0.85"

3. osd_backfill_full_ratio (default: 0.85):
When an OSD locally reaches this threshold, it refuses to migrate a PG to itself. This prevents rebalancing or repair from overfilling an OSD. It should be lower than osd_failsafe_full_ratio.
The reported cluster uses the default: "osd_backfill_full_ratio": "0.85"

4. osd_failsafe_full_ratio (default: 0.97):
This is a final sanity check that makes the OSD reject writes when it is very close to full.
The reported cluster uses the default: "osd_failsafe_full_ratio": "0.97"

5. osd_failsafe_nearfull_ratio (default: 0.90):
When an OSD reaches this threshold, it starts logging near-full warnings for itself; by then the cluster will usually already have hit mon_osd_nearfull_ratio, whose default is 0.85.
The reported cluster uses the default: "osd_failsafe_nearfull_ratio": "0.9"

Options 1 and 2 are cluster-wide; options 3, 4, and 5 apply to individual OSDs. A quick way to inspect them on a live cluster is sketched below.
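
A minimal sketch, assuming a monitor named mon.a and an OSD osd.0 with admin sockets on the local host (substitute your own daemon IDs):

# cluster-wide thresholds (options 1 and 2) are read by the monitors
ceph daemon mon.a config get mon_osd_full_ratio
ceph daemon mon.a config get mon_osd_nearfull_ratio

# per-OSD thresholds (options 3, 4, and 5) are read by each OSD
ceph daemon osd.0 config get osd_backfill_full_ratio
ceph daemon osd.0 config get osd_failsafe_full_ratio
ceph daemon osd.0 config get osd_failsafe_nearfull_ratio

# values can be changed at runtime without restarting the daemons, e.g.:
ceph tell 'osd.*' injectargs '--osd_backfill_full_ratio 0.80'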



Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 1.2.3
ceph-0.80.8-5.el7cp

Comment 46 Josh Durgin 2017-09-19 05:24:03 UTC
The piece to verify for this is that recovery does not overfill the osds.

To reproduce, you can set the monitor's full ratio to a low value, e.g. 0.1 (10%), fill the cluster up close to that point, and then mark one of the osds out. The other osds will start recovering. Once an osd reaches 10% full, it will be marked as full in the osdmap, and subsequent recovery operations should stall in the recovery_toofull state.

Increasing the full ratio again to the default of 0.95 should let recovery complete.
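
A sketch of those steps, assuming Luminous-era commands (as shipped in RHCS 3.0) and using osd.3 as an arbitrary example ID:

# lower the cluster full threshold to 10%
ceph osd set-full-ratio 0.1

# ...write data until the cluster is close to 10% full, then take
# one OSD out so the remaining OSDs start recovering:
ceph osd out 3

# once an OSD crosses 10% it is flagged full in the osdmap; PGs
# should stall in the recovery_toofull / backfill_toofull states:
ceph health detail
ceph pg dump_stuck

# restore the default ratio; recovery should then complete
ceph osd set-full-ratio 0.95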

Comment 53 errata-xmlrpc 2017-12-05 23:29:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3387

