.Improvements in handling of full OSDs
When an OSD disk became so full that the OSD could not function, the OSD terminated unexpectedly with a confusing assert message. With this update:
* The error message has been improved.
* By default, no more than 25% of OSDs are automatically marked as `out`.
* The `statfs` calculation in the FileStore and BlueStore back ends has been improved to better reflect the actual disk usage.
As a result, OSDs are less likely to become full, and if they do, a more informative error message is written to the log.
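A sketch of how to keep an eye on this behaviour; the assumption here is that the cap on automatically marked-out OSDs (no more than 25% by default) is governed by the `mon_osd_min_in_ratio` option, where a value of 0.75 means the monitors keep at least 75% of OSDs `in`:

----
# Check the mark-out cap on a monitor (option name assumed; <id> is the mon id).
ceph daemon mon.<id> config show | grep mon_osd_min_in_ratio

# Watch cluster and per-pool usage so OSDs do not creep toward the full ratios.
ceph df
ceph health detail    # lists any near-full or full OSDs by id
----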
Description of problem:
log [ERR] : OSD full dropping all updates 99% full followed by FAILED assert(0 == "unexpected error") : ENOSPC handling not implemented
- Why did the OSD fill up beyond the default *full* ratio of 97% that was set for it?
- What caused the OSD to go past 97% and then crash with FAILED assert(0 == "unexpected error") : ENOSPC handling not implemented?
- Why did the OSD then fail to start, given that the device had no free space left?
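As a first check on the affected node, it is worth confirming that the OSD's data device really has no free space; a sketch assuming the default data path and FileStore layout:

----
# Replace <id> with the id of the crashed OSD from the assert message.
df -h /var/lib/ceph/osd/ceph-<id>             # filesystem-level usage of the OSD data device
du -sh /var/lib/ceph/osd/ceph-<id>/current    # size of the FileStore object data (default layout)
----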
1. mon_osd_full_ratio (default: 0.95):
When any OSD reaches this threshold, the monitor marks the cluster as 'full' and client writes are no longer accepted.
The reported cluster uses the default: "mon_osd_full_ratio": "0.95"
2. mon_osd_nearfull_ratio (default: 0.85):
When any OSD reaches this threshold, the cluster goes to HEALTH_WARN and calls out the near-full OSDs; you can verify this in the `ceph -s` output.
The reported cluster uses the default: "mon_osd_nearfull_ratio": "0.85"
3. osd_backfill_full_ratio (default: 0.85):
When an OSD locally reaches this threshold, it refuses to migrate a PG to itself. This prevents rebalancing or repair from overfilling an OSD.
It should be lower than osd_failsafe_full_ratio.
The reported cluster uses the default: "osd_backfill_full_ratio": "0.85"
4. osd_failsafe_full_ratio (default: 0.97):
This is a final sanity check that makes the OSD reject writes when it is very close to full.
The reported cluster uses the default: "osd_failsafe_full_ratio": "0.97"
5. osd_failsafe_nearfull_ratio (default: 0.90):
When an OSD reaches this threshold, it starts logging near-full warnings for that OSD; in practice the cluster hits mon_osd_nearfull_ratio first, because its default value is 0.85.
The reported cluster uses the default: "osd_failsafe_nearfull_ratio": "0.9"
Options 1 and 2 are cluster-wide; options 3, 4, and 5 apply to individual OSDs. A ceph.conf sketch with these values follows this list.
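For reference, a minimal ceph.conf sketch carrying the values reported above (all defaults for this release); the section placement is an assumption, and the monitor full/nearfull ratios are normally baked into the osdmap at cluster creation rather than read live from the file:

----
[global]
# Values as reported in this cluster (all defaults):
mon osd full ratio          = .95
mon osd nearfull ratio      = .85
osd backfill full ratio     = .85
osd failsafe full ratio     = .97
osd failsafe nearfull ratio = .90
----

To confirm what a running daemon actually uses, query its admin socket, for example `ceph daemon osd.0 config show | grep full_ratio`.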
Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 1.2.3
The key thing to verify here is that recovery does not overfill the OSDs.
To reproduce, set the 'mon osd full ratio' on the monitor to a low value, e.g. 0.1 (10%), fill the cluster up close to that point, and then mark one of the OSDs out. The other OSDs will start recovering. Once an OSD reaches 10% full, it will be marked as full in the osdmap, and subsequent recovery operations should stall in the recovery_toofull state.
Increasing the full ratio back to the default of 0.95 should let recovery complete.
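A hedged sketch of those steps with the stock CLI; the OSD id, the 10% ratio, and filling the cluster with `rados bench` against the `rbd` pool are illustrative assumptions:

----
ceph pg set_full_ratio 0.10                  # lower the osdmap full ratio to 10%
rados bench -p rbd 300 write --no-cleanup    # one way to fill the cluster close to 10%
ceph osd out 3                               # push recovery onto the remaining OSDs
ceph health detail                           # reports full/near-full OSDs once 10% is reached
ceph pg dump pgs_brief | grep toofull        # PGs stalled in the recovery_toofull state
ceph pg set_full_ratio 0.95                  # restore the default; recovery should complete
----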
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.