RFE/Bugzilla: ~~~ a) Description of problem: When an OSD hit its full_ratio, the cluster stops any I/O coming in. From an OpenStack perspective, an OSD hitting the full_ratio will pause the VMs. In order to delete objects and free space, a manual intervention is needed to set cluster flags such as norebalance, increase the full_ratio (0.95) a bit higher to allow I/O, and then delete objects from the OSP side. Since this involves quite a few steps, it is not the easiest to fix. We need a better solution which is easier to follow, perhaps from both Ceph and OpenStack. b) Version-Release number of selected component (if applicable): RHCS2.x c) How reproducible: Always d) Steps to Reproduce: 1. Use OpenStack with Ceph as backing storage. 2. Fill the OSDs until it hit the full_ratio. 3. Make sure an I/O error is hit for further writes. e) Additional info: It may be good to understand how to fix this from an OpenStack perspective as well, and such a feature would help administrators not to meddle with the Ceph cluster but rather get it fixed from the OSP side.
This is really an rbd feature - to use the librados FORCE_FULL_TRY functionality for deletes - and removing an rbd image or snapshot when the cluster is full is possible in luminous.
@Jason, any specific config settings or steps to be done before deleting rbd image or snapshot when the cluster is full?
@Harish: negative -- it *should* just allow you to run the following commands when the cluster is full: "rbd remove", "rbd snap rm", "rbd snap unprotect", and "rbd snap purge".
$ ceph health HEALTH_ERR full flag(s) set; 3 full osd(s) $ ceph df GLOBAL: SIZE AVAIL RAW USED %RAW USED 30911M 623M 30288M 97.98 POOLS: NAME ID USED %USED MAX AVAIL OBJECTS rbd 1 9000M 100.00 0 2262 $ rbd snap ls foo SNAPID NAME SIZE TIMESTAMP 4 1 10240 MB Tue Oct 24 21:08:11 2017 5 2 10240 MB Tue Oct 24 21:08:29 2017 6 3 10240 MB Tue Oct 24 21:08:51 2017 7 4 10240 MB Tue Oct 24 21:09:32 2017 8 5 10240 MB Tue Oct 24 21:10:01 2017 9 6 10240 MB Tue Oct 24 21:11:10 2017 10 7 10240 MB Tue Oct 24 21:14:01 2017 $ rbd snap unprotect foo@7 $ rbd snap unprotect foo@1 $ rbd snap rm foo@7 $ rbd snap rm foo@1 $ rbd snap purge foo Removing all snapshots: 100% complete...done. $ ceph health HEALTH_ERR full flag(s) set; 3 full osd(s) $ rbd rm foo Removing image: 100% complete...done. $ ceph health HEALTH_OK
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:3387