For example, on a three-node cluster it is rarely useful or sensible to mark an OSD out if there is any chance at all that it could come back online.
If it's a replica-count-2 cluster it would make sense to mark OSDs out, though; otherwise you'd be left with only 1 copy, perhaps until an admin manually intervenes. There are likely other nuances here...
We currently have two options that govern this:

1. mon_osd_min_in_ratio = .75 -- we stop automatically marking OSDs out once we drop to only 75% of existing OSDs IN.
2. mon_osd_down_out_subtree_limit = rack -- if an entire subtree of the configured type (rack by default) is down, we won't mark any of its devices out.

I think the question is what size cluster we are talking about, and whether we can just tweak these values to get the behavior we want. "Three nodes" doesn't tell us much without knowing how many OSDs per node, if we rely on the options above. For a "normal" 3-node cluster with 12 disks per node, that's 36 OSDs, which would require a mon_osd_min_in_ratio of .98 to prevent any OSDs from being marked out, and I'm not sure that value makes sense for larger clusters. On the other hand, even on a 3-node cluster I think we *do* want to allow a few drives to be marked out (so that the data rebuilds on other drives in the same host).

My suggestion is to set mon_osd_min_in_ratio = .9 by default and be done with it. I think that's a reasonable value for *large* clusters too: as soon as 10% of the infrastructure has failed we need operator help anyway. What do you think?
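To make the arithmetic concrete, here is roughly how those ratios play out on a 36-OSD cluster (assuming the mon stops auto-marking OSDs out once the IN fraction would drop below the ratio; the exact comparison/rounding in the mon may differ slightly):

    36 * 0.75 = 27     -> up to ~9 OSDs can be auto-marked out before we stop
    36 * 0.90 = 32.4   -> only ~3 OSDs can be auto-marked out before we stop
    36 * 0.98 = 35.28  -> marking out even a single OSD would cross the threshold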
also, mon_osd_down_out_subtree_limit = host for small clusters
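For reference, a minimal ceph.conf sketch of the settings proposed above for a small cluster (these are the values suggested in this thread, not the current shipped defaults):

    [mon]
        # proposed conservative values for small (e.g. 3-node) clusters
        mon osd min in ratio = 0.9
        mon osd down out subtree limit = host

They should also be changeable at runtime (e.g. via ceph tell mon.* injectargs) if someone wants to try the behavior before we touch the defaults.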
Any work done on this BZ upstream?
The current mon_osd_min_in_ratio and the improved recovery throttling defaults handle this well enough in Luminous.