Description of problem:

When a cluster is in a degraded state, for example due to poorly performing OSDs, ops can end up blocked. To perform maintenance without triggering backfill and recovery, we set noout and shut down the affected OSDs. The expectation is that all ops will be redirected, but in some cases ops continue to be blocked on the down OSDs.

Version-Release number of selected component (if applicable):
0.80.7

How reproducible:
Not very

Steps to Reproduce:
1. A poorly performing OSD or HBA causes blocked ops
2. Set noout and shut down the poorly performing OSD (see the command sketch under Additional info)
3. Most blocked ops clear, but some remain

Actual results:

root@rtp1-1-csx-ceph1-001:~# ceph -s
    cluster 30a8ba6d-898a-4be2-be60-d11a8b0ac180
     health HEALTH_WARN 6993 pgs degraded; 6993 pgs stuck unclean; 243 requests are blocked > 32 sec; recovery 3622153/162583743 objects degraded (2.228%); 9/448 in osds are down; noout,noscrub,nodeep-scrub flag(s) set
     monmap e19: 5 mons at {rtp1-1-csx-ceph1-001=10.202.49.11:6789/0,rtp1-1-csx-ceph1-002=10.202.49.12:6789/0,rtp1-1-csx-ceph1-018=10.202.49.28:6789/0,rtp1-1-csx-ceph1-035=10.202.49.45:6789/0,rtp1-1-csx-ceph1-036=10.202.49.46:6789/0}, election epoch 14506, quorum 0,1,2,3,4 rtp1-1-csx-ceph1-001,rtp1-1-csx-ceph1-002,rtp1-1-csx-ceph1-018,rtp1-1-csx-ceph1-035,rtp1-1-csx-ceph1-036
     osdmap e527200: 449 osds: 439 up, 448 in
            flags noout,noscrub,nodeep-scrub
      pgmap v43942934: 102408 pgs, 17 pools, 136 TB data, 52924 kobjects
            410 TB used, 807 TB / 1217 TB avail
            3622153/162583743 objects degraded (2.228%)
               95415 active+clean
                6993 active+degraded
  client io 24241 kB/s rd, 22923 kB/s wr, 601 op/s

root@rtp1-1-csx-ceph1-001:~# ceph health detail | grep block
HEALTH_WARN 6993 pgs degraded; 6993 pgs stuck unclean; 243 requests are blocked > 32 sec; 2 osds have slow requests; recovery 3622179/162584532 objects degraded (2.228%); 9/448 in osds are down; noout,noscrub,nodeep-scrub flag(s) set
11 ops are blocked > 262.144 sec
43 ops are blocked > 131.072 sec
81 ops are blocked > 65.536 sec
108 ops are blocked > 32.768 sec
11 ops are blocked > 262.144 sec on osd.204
43 ops are blocked > 131.072 sec on osd.204
81 ops are blocked > 65.536 sec on osd.204
106 ops are blocked > 32.768 sec on osd.204
2 ops are blocked > 32.768 sec on osd.419

Expected results:
IO is redirected when PGs are degraded.

Additional info:
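For reference, a minimal sketch of the maintenance sequence from the steps above, assuming an Ubuntu/upstart deployment of this firefly-era release; the OSD id (204) is taken from the health output above, and the stop/start commands are deployment-specific assumptions (a sysvinit deployment would use "service ceph stop osd.204" instead).

# Prevent the down OSDs from being marked out, so no backfill/recovery starts
ceph osd set noout

# Stop the poorly performing OSD (id taken from the health output above);
# upstart syntax shown, which is an assumption about the deployment
stop ceph-osd id=204

# Blocked requests should clear as clients resend to the new acting sets;
# in this report some remained attributed to the down OSD
ceph health detail | grep block

# After maintenance, bring the OSD back and clear the flag
start ceph-osd id=204
ceph osd unset noout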
This is fixed in the current release.
I think this was actually fixed in commit 149a3059d462d760392f7aadd8931e8cac5b0607 after firefly -- the slow requests weren't actually happening; they were just not getting cleared on the mon after the OSD went down.
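One way to cross-check that the remaining reports were stale monitor state rather than live requests (a sketch, assuming admin-socket access on the OSD hosts and the default socket path for a cluster named "ceph"): a down OSD no longer has an admin socket, so the ops the monitor still attributes to osd.204 cannot correspond to anything actually in flight, while a running OSD such as osd.419 can be queried directly.

# List the live in-flight ops on a still-running OSD via its admin socket
# (default socket path assumed)
ceph --admin-daemon /var/run/ceph/ceph-osd.419.asok dump_ops_in_flight

# osd.204 is down, so its socket is gone and the same query fails there;
# the blocked-op counts still shown for it are stale mon health data, which
# is the behavior the commit referenced above fixes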