Bug 1332083

Summary: log [ERR] : OSD full dropping all updates 99% full followed by FAILED assert(0 == "unexpected error") : ENOSPC handling not implemented
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Vikhyat Umrao <vumrao>
Component: RADOS
Assignee: David Zafman <dzafman>
Status: CLOSED ERRATA
QA Contact: David Zafman <dzafman>
Severity: high
Docs Contact: Bara Ancincova <bancinco>
Priority: high
Version: 1.2.3
CC: bengland, ceph-eng-bugs, dzafman, hnallurv, jbuchta, jdurgin, kchai, kdreyer, tpetr, vumrao
Target Milestone: rc   
Target Release: 3.0   
Hardware: x86_64   
OS: Linux   
Fixed In Version: RHEL: ceph-12.1.4-1.el7cp Ubuntu: ceph_12.1.4-2redhat1xenial
Doc Type: Bug Fix
Doc Text:
.Improvements in handling of full OSDs
When an OSD disk became so full that the OSD could not function, the OSD terminated unexpectedly with a confusing assert message. With this update:
* The error message has been improved.
* By default, no more than 25% of OSDs are automatically marked as `out`.
* The `statfs` calculation in the FileStore and BlueStore back ends has been improved to better reflect disk usage.
As a result, OSDs are less likely to become full, and if they do, a more informative error message is added to the log.
Last Closed: 2017-12-05 23:29:38 UTC
Type: Bug
Bug Blocks: 1420417, 1494421    

Description Vikhyat Umrao 2016-05-02 07:18:16 UTC
Description of problem:

log [ERR] : OSD full dropping all updates 99% full followed by FAILED assert(0 == "unexpected error") : ENOSPC handling not implemented

- Here we need to find the root cause: how did the OSD fill up past the default *full* ratio, which is 97% for an OSD (osd_failsafe_full_ratio)?

- What caused the OSD to go beyond 97% full and then crash with FAILED assert(0 == "unexpected error") : ENOSPC handling not implemented?

- And why did the OSD fail to start after that, given that the OSD device had no free space left? (See the commands below for confirming the no-space condition.)
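
For reference, a quick way to confirm the no-space condition; the OSD ID and data path here are placeholders for whichever OSD crashed:

    # On the OSD host, check the OSD data partition:
    df -h /var/lib/ceph/osd/ceph-0

    # Cluster-side view of full and near-full OSDs:
    ceph health detail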


1. mon_osd_full_ratio (default: 0.95):
When any OSD reaches this threshold, the monitor marks the cluster as 'full' and client writes are no longer accepted.
The reported cluster has the default value: "mon_osd_full_ratio": "0.95",

2. mon_osd_nearfull_ratio (default: 0.85):
When any OSD reaches this threshold, the cluster goes HEALTH_WARN and calls out the near-full OSDs. This can be verified in the ceph -s output.
The reported cluster has the default value: "mon_osd_nearfull_ratio": "0.85",

3. osd_backfill_full_ratio (default: 0.85):
When an OSD locally reaches this threshold, it refuses to migrate a PG to itself. This prevents rebalancing or repair from overfilling an OSD.
It should be lower than osd_failsafe_full_ratio.
The reported cluster has the default value: "osd_backfill_full_ratio": "0.85",

4. osd_failsafe_full_ratio (default: 0.97):
A final sanity check that makes the OSD drop writes when it is very close to full.
The reported cluster has the default value: "osd_failsafe_full_ratio": "0.97",

5. osd_failsafe_nearfull_ratio (default: 0.90):
When an OSD reaches this threshold, it starts logging near-full warnings for that particular OSD. In practice the cluster hits mon_osd_nearfull_ratio first, since its default is the lower value of 0.85.
The reported cluster has the default value: "osd_failsafe_nearfull_ratio": "0.9",

Options 1 and 2 are cluster-wide settings; options 3, 4, and 5 apply to individual OSDs. A sketch of how to inspect these values at runtime is shown below.
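
A minimal sketch of inspecting these thresholds on a running cluster via the admin socket, assuming a monitor named after the local short hostname and an OSD with ID 0 (adjust the daemon names to your environment; on older releases use the ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok form instead):

    # Cluster-wide ratios (options 1 and 2), from the monitor:
    ceph daemon mon.$(hostname -s) config get mon_osd_full_ratio
    ceph daemon mon.$(hostname -s) config get mon_osd_nearfull_ratio

    # Per-OSD ratios (options 3, 4, and 5), from the OSD admin socket:
    ceph daemon osd.0 config get osd_backfill_full_ratio
    ceph daemon osd.0 config get osd_failsafe_full_ratio
    ceph daemon osd.0 config get osd_failsafe_nearfull_ratio

    # To raise the failsafe temporarily on running OSDs (use with caution):
    ceph tell osd.* injectargs '--osd-failsafe-full-ratio 0.98'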



Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 1.2.3
ceph-0.80.8-5.el7cp

Comment 46 Josh Durgin 2017-09-19 05:24:03 UTC
The piece to verify here is that recovery does not overfill the OSDs.

To reproduce, you can set the 'mon osd full threshold' on the monitor to a low value, e.g. 0.1 (10%), fill the cluster up close to that point, and then mark one of the OSDs out. The other OSDs will start recovering. Once an OSD reaches 10% full, it will be marked as full in the osdmap, and subsequent recovery operations should stall in the recovery_toofull state.

Increasing the full ratio again to the default of 0.95 should let recovery complete.
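
For illustration, one possible command sequence for this verification on a Luminous-based build; the pool name 'rbd' and OSD ID 0 are placeholders:

    # Lower the cluster full threshold to 10%:
    ceph osd set-full-ratio 0.1

    # Fill the cluster close to that point, e.g. with rados bench:
    rados -p rbd bench 60 write --no-cleanup

    # Mark one OSD out so the others start recovering:
    ceph osd out 0

    # Watch for OSDs flagged full and PGs stalled in recovery_toofull:
    ceph osd dump | grep full_ratio
    ceph health detail

    # Restore the default ratio so recovery can complete:
    ceph osd set-full-ratio 0.95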

Comment 53 errata-xmlrpc 2017-12-05 23:29:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3387