Description of problem:
log [ERR] : OSD full dropping all updates 99% full followed by FAILED assert(0 == "unexpected error") : ENOSPC handling not implemented
- Here we need to find the root cause: why did the OSD fill up even though the failsafe *full* ratio was at its default of 0.97 (97%)?
- What caused the OSD to go beyond 97% and then crash with FAILED assert(0 == "unexpected error") : ENOSPC handling not implemented?
- And why did it fail to start after that, given that the OSD device had no free space left?
1. mon_osd_full_ratio - (default : 0.95) :
When any OSD reaches this threshold the monitor marks the cluster as 'full' and client writes are not accepted.
Issue Reported in cluster has same : "mon_osd_full_ratio": "0.95",
2. mon_osd_nearfull_ratio - (default : 0.85) :
When any OSD reaches this threshold the cluster goes HEALTH_WARN and calls out near-full OSDs. You can verify in ceph -s output.
Issue Reported in cluster has same : "mon_osd_nearfull_ratio": "0.85",
3. osd_backfill_full_ratio - (default : 0.85) :
When an OSD locally reaches this threshold it will refuse to migrate a PG to itself. This prevents rebalancing or repair from overfilling an OSD.
It should be lower than the osd_failsafe_full_ratio.
Issue Reported in cluster has same : "osd_backfill_full_ratio": "0.85",
4. osd_failsafe_full_ratio - (default : 0.97) :
This is a final sanity check that makes the OSD throw out writes if it is really close to full.
Issue Reported in cluster has same : "osd_failsafe_full_ratio": "0.97",
5. osd_failsafe_nearfull_ratio - (default : 0.90) :
When any OSD reaches this threshold, it starts logging near-full warnings for that particular OSD. In practice the cluster hits mon_osd_nearfull_ratio first, since its default is lower (0.85).
Issue Reported in cluster has same : "osd_failsafe_nearfull_ratio": "0.9",
Options 1 and 2 are cluster-wide; options 3, 4 and 5 apply to individual OSDs in the cluster.
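To make the ordering of the five ratios above concrete, here is a small illustrative sketch (not Ceph code; the function and dictionary names are invented for this example) that reports which thresholds a given OSD utilization has crossed, using the default values quoted in this report:

```python
# Default ratios as quoted in this report. Names match the Ceph config
# options; the dictionary itself is illustrative, not Ceph source code.
RATIOS = {
    "mon_osd_nearfull_ratio": 0.85,     # cluster goes HEALTH_WARN
    "osd_backfill_full_ratio": 0.85,    # OSD refuses incoming backfill
    "osd_failsafe_nearfull_ratio": 0.90,  # per-OSD near-full warnings
    "mon_osd_full_ratio": 0.95,         # cluster marked full, writes blocked
    "osd_failsafe_full_ratio": 0.97,    # OSD's last-ditch write rejection
}

def thresholds_crossed(utilization):
    """Return the ratio names a given OSD utilization has crossed,
    ordered from the lowest threshold to the highest."""
    return [name
            for name, ratio in sorted(RATIOS.items(), key=lambda kv: kv[1])
            if utilization >= ratio]
```

For example, an OSD at 96% utilization has already passed every guard except osd_failsafe_full_ratio, which is why writes should have been blocked well before the device actually ran out of space.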
Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 1.2.3
The piece to verify here is that recovery does not overfill the OSDs.
To reproduce, set 'mon osd full ratio' on the monitor to a low value, e.g. 0.1 (10%), fill the cluster up close to that point, and then mark one of the OSDs out. The other OSDs will start recovering. Once an OSD reaches 10% full, it will be marked as full in the osdmap, and subsequent recovery operations should stall in the recovery_toofull state.
Increasing the full ratio again to the default of 0.95 should let recovery complete.
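The reproduction steps above might look like the following command sketch (assumes a disposable test cluster; `ceph pg set_full_ratio` is the pre-Luminous command matching this release, and OSD id 0 is an arbitrary example). No test is attached since these commands require a live cluster:

```shell
# Lower the full threshold to 10% so the cluster fills quickly.
ceph pg set_full_ratio 0.1

# ... write data until the cluster is just under 10% full ...

# Mark one OSD out (OSD 0 is an example); recovery begins on the others.
ceph osd out 0

# PGs should stall in the recovery_toofull state once an OSD hits 10%.
ceph health detail

# Restore the default full ratio; recovery should now complete.
ceph pg set_full_ratio 0.95
```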
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.