Description of problem:

log [ERR] : OSD full dropping all updates 99% full, followed by FAILED assert(0 == "unexpected error") : ENOSPC handling not implemented.

We need to find the root cause of the following:
- Why did the OSD fill up past the failsafe *full* ratio even though its default value of 97% was in effect?
- What caused the OSD to go beyond 97% and then crash with FAILED assert(0 == "unexpected error") : ENOSPC handling not implemented?
- Why did the OSD then fail to start again, given that the OSD device had no free space left?

The relevant thresholds, and their values on the reported cluster, are listed below (a hedged sketch of how they layer on each other follows this description):

1. mon_osd_full_ratio (default: 0.95): When any OSD reaches this threshold, the monitor marks the cluster as 'full' and client writes are no longer accepted.
   Reported cluster: "mon_osd_full_ratio": "0.95"
2. mon_osd_nearfull_ratio (default: 0.85): When any OSD reaches this threshold, the cluster goes HEALTH_WARN and calls out the near-full OSDs. This is visible in the ceph -s output.
   Reported cluster: "mon_osd_nearfull_ratio": "0.85"
3. osd_backfill_full_ratio (default: 0.85): When an OSD locally reaches this threshold, it refuses to accept a PG backfilled to it. This prevents rebalancing or repair from overfilling an OSD. It should be lower than osd_failsafe_full_ratio.
   Reported cluster: "osd_backfill_full_ratio": "0.85"
4. osd_failsafe_full_ratio (default: 0.97): A final sanity check that makes the OSD drop writes when it is very close to full.
   Reported cluster: "osd_failsafe_full_ratio": "0.97"
5. osd_failsafe_nearfull_ratio (default: 0.90): When an OSD reaches this threshold, it starts logging near-full warnings for that particular OSD; in practice the cluster hits mon_osd_nearfull_ratio first, since that default is 0.85.
   Reported cluster: "osd_failsafe_nearfull_ratio": "0.9"

Options 1 and 2 are cluster-wide; options 3, 4 and 5 apply to individual OSDs in the cluster.

Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 1.2.3
ceph-0.80.8-5.el7cp
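To make the interplay of the five ratios easier to follow, here is a minimal illustrative Python sketch. It is not Ceph's actual implementation; it only classifies a single OSD utilization figure against the default thresholds listed above.

# Illustrative model only -- not Ceph code. Shows how the five ratios from the
# list above layer on top of each other, using the defaults from this cluster.
NEARFULL_RATIO          = 0.85  # mon_osd_nearfull_ratio: cluster HEALTH_WARN
BACKFILL_FULL_RATIO     = 0.85  # osd_backfill_full_ratio: OSD refuses incoming backfill
FAILSAFE_NEARFULL_RATIO = 0.90  # osd_failsafe_nearfull_ratio: per-OSD near-full warnings
FULL_RATIO              = 0.95  # mon_osd_full_ratio: cluster flagged full, client writes blocked
FAILSAFE_FULL_RATIO     = 0.97  # osd_failsafe_full_ratio: OSD drops writes outright


def describe_osd_state(utilization: float) -> list[str]:
    """Return the thresholds an OSD at the given utilization has crossed."""
    states = []
    if utilization >= NEARFULL_RATIO:
        states.append("nearfull: cluster goes HEALTH_WARN")
    if utilization >= BACKFILL_FULL_RATIO:
        states.append("backfillfull: OSD refuses to accept PG backfill")
    if utilization >= FAILSAFE_NEARFULL_RATIO:
        states.append("failsafe nearfull: OSD logs near-full warnings")
    if utilization >= FULL_RATIO:
        states.append("full: monitor marks the cluster full, client writes stop")
    if utilization >= FAILSAFE_FULL_RATIO:
        states.append("failsafe full: OSD drops all updates")
    return states


if __name__ == "__main__":
    # The OSD in this report reached 99%, i.e. past every threshold, which
    # matches the 'OSD full dropping all updates 99% full' log message and the
    # subsequent ENOSPC assert once the backing filesystem ran out of space.
    for line in describe_osd_state(0.99):
        print(line)

The point of the sketch is the ordering: the warning thresholds (0.85, 0.90) are deliberately well below the full (0.95) and failsafe (0.97) thresholds, so the question in this report is why the OSD was still able to reach 99%.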
The piece to verify here is that recovery does not overfill the OSDs. To reproduce, set the 'mon osd full threshold' on the monitor to a low value, e.g. 0.1 (10%), fill the cluster up close to that point, and then mark one of the OSDs out. The other OSDs will start recovering. Once an OSD reaches 10% full, it will be marked as full in the OSD map, and subsequent recovery operations should stall in the recovery_toofull state. Increasing the full ratio back to the default of 0.95 should let recovery complete.
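A minimal Python sketch of that verification flow is below. It assumes a pre-Luminous cluster (as in this report) where the runtime full ratio is changed with 'ceph pg set_full_ratio' (newer releases use 'ceph osd set-full-ratio' instead), that the cluster has already been filled close to the lowered ratio, and that osd.0 and the polling interval are placeholders; it only greps the plain 'ceph pg dump' output rather than relying on any particular JSON schema.

#!/usr/bin/env python3
# Hedged sketch of the verification flow described above. It only shells out
# to the ceph CLI; the OSD id, ratios, and timeout are placeholders.
import subprocess
import time


def ceph(*args: str) -> str:
    """Run a ceph CLI command and return its stdout."""
    return subprocess.run(["ceph", *args], check=True,
                          capture_output=True, text=True).stdout


def wait_for_pg_state(state: str, timeout: int = 600) -> bool:
    """Poll 'ceph pg dump' until some PG reports the given state."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if state in ceph("pg", "dump"):
            return True
        time.sleep(10)
    return False


if __name__ == "__main__":
    # Lower the runtime full ratio so the already-filled cluster trips it
    # quickly (pre-Luminous syntax, matching the release in this report).
    ceph("pg", "set_full_ratio", "0.1")

    # Mark one OSD out so the remaining OSDs start recovering its data.
    ceph("osd", "out", "0")

    # Recovery should stall once an OSD is flagged full in the OSD map.
    if wait_for_pg_state("recovery_toofull"):
        print("PGs stalled in recovery_toofull as expected")

    # Restoring the default ratio should then let recovery complete without
    # overfilling any OSD.
    ceph("pg", "set_full_ratio", "0.95")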
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:3387