Bug 1377875

Summary: [support] OSD recovery causes pause in IO which lasts longer than expected
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Mike Hackett <mhackett>
Component: RADOS
Assignee: Matt Benjamin (redhat) <mbenjamin>
Status: CLOSED DUPLICATE
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: high
Priority: high
Version: 1.3.2
CC: amarango, ceph-eng-bugs, dang, dzafman, hnallurv, icolle, jdurgin, kchai, kdreyer, linuxkidd, mhackett, nlevine, rwheeler, sjust, vumrao, yehuda
Target Milestone: rc
Target Release: 3.0
Hardware: x86_64
OS: Linux
Last Closed: 2017-06-28 19:46:51 UTC
Type: Bug

Description Mike Hackett 2016-09-20 21:14:29 UTC
Description of problem:

An OSD node is taken offline for 15 minutes (with the noout flag set) and is then brought back online. When recovery IO starts, RadosGW client IO halts on the cluster due to a large number of slow requests "currently waiting for degraded object". While the OSD node was offline, client IO to the cluster remained active for those 15 minutes.
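
For reference, a minimal sketch of how the blocking can be observed from the command line while recovery is running (the OSD id and log path below are placeholders for the affected primary OSD):

# Cluster-wide view of blocked/slow requests and degraded PGs.
ceph health detail
ceph -s

# On the primary OSD reporting the slow requests, the blocked ops show up in its log
# (osd.12 and the log path are placeholders).
grep 'waiting for degraded object' /var/log/ceph/ceph-osd.12.log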

This OSD node houses 2 SSDs and 14 SATA HDDs. The SSD OSDs back the radosgw index pool.

The degraded objects were present in the radosgw index pool, which must be accessed for every RGW operation, so a large range of RGW operations would be affected.
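
One way to confirm the degraded objects sit in the index pool is to map the degraded PGs back to their pool (an assumption-level sketch; pool names and ids vary per cluster):

# Degraded PGs are listed with the pool id as the prefix of the PG id (e.g. "pg 7.2a").
ceph health detail | grep degraded

# Map pool ids to names to confirm the affected pool is the RGW bucket index pool.
ceph osd lspools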

Rack-level replication is in use across 3 racks, with 6 SSDs per rack (2 per OSD node).
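
Presumably the replicated rule places one replica per rack; this can be confirmed by decompiling the CRUSH map (file names below are arbitrary):

# Dump and decompile the CRUSH map.
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# A rack-level replicated rule would contain a step like:
#   step chooseleaf firstn 0 type rack
grep -A6 'rule' crushmap.txt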

Per the upstream tracker (http://tracker.ceph.com/issues/13104), this is expected behavior: the object is degraded and the OSD blocks the write until the object has been repaired. Writes to degraded objects (even when present on the primary) are not allowed in Hammer and earlier, but this has changed in Infernalis.

Initially, recovery was throttled to 1 across the entire cluster to limit client IO impact during cluster recovery. A recommendation was then made to raise this value back to the default of 15 on the SSD OSDs, but this did not alleviate the problem; the issue was seen again on the next node move.
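
For the record, a sketch of the tuning that was applied (the exact option name is an assumption; on Hammer the usual recovery knobs are osd_recovery_max_active, whose default is 15, and osd_max_backfills, and osd.0 below stands for one of the SSD OSD ids):

# Check current values via the admin socket (run on the OSD's host).
ceph daemon osd.0 config show | grep -E 'osd_recovery_max_active|osd_max_backfills'

# Raise recovery concurrency back to the default of 15 on the SSD OSDs at runtime.
ceph tell osd.0 injectargs '--osd_recovery_max_active 15'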


Is recovery operating properly here as expected in Hammer?
Do we have any method to prevent this impact from occurring during an OSD node move? 


Version-Release number of selected component (if applicable):
ceph-0.94.5-14.el7cp.x86_64  

How reproducible:
Consistent

Steps to Reproduce:
1. Set the noout flag on the cluster.
2. Write several GB to the cluster.
3. Take one of the OSD nodes down.
4. Write several GB to the cluster.
5. Bring the OSD node back into the cluster to trigger recovery.
6. While recovery is ongoing, generate further IO to the cluster and confirm that client IO has halted (see the command sketch below).
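
A rough sketch of these steps as commands, assuming a test pool named "testpool" and rados bench as the IO generator (the actual reproduction used RadosGW client IO; the service commands depend on the init system in use):

# 1. Prevent CRUSH from marking the serviced OSDs out.
ceph osd set noout

# 2. Generate several GB of writes (stand-in for RGW client IO).
rados -p testpool bench 120 write --no-cleanup

# 3. Stop all OSDs on one node (run on that node; sysvinit shown, systemd units may apply instead).
service ceph stop osd

# 4. Write several more GB while the node is down.
rados -p testpool bench 120 write --no-cleanup

# 5. Bring the OSDs back to trigger recovery.
service ceph start osd

# 6. Generate further IO during recovery and watch for blocked/slow requests.
rados -p testpool bench 120 write --no-cleanup &
ceph -w

# Clean up: clear the flag once recovery completes.
ceph osd unset noout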

Logs from the issue are here:

https://api.access.redhat.com/rs/cases/01703018/attachments/ceb38cac-0a54-4781-9d9c-4498f37abddb

https://api.access.redhat.com/rs/cases/01703018/attachments/db1e4598-6e5e-4808-9592-a904738f408f

https://api.access.redhat.com/rs/cases/01703018/attachments/38fd0ded-07d5-4f4e-b71b-10601f9ee58b

Comment 81 Mike Hackett 2016-11-07 21:07:38 UTC
Adding needinfo back as the last update cleared it.