
Bug 1488479

Summary: RHCS upgrade 2.4 - 3.0: PGs are stuck at recovering and degraded state
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Parikshith <pbyregow>
Component: RADOS
Assignee: Josh Durgin <jdurgin>
Status: CLOSED ERRATA
QA Contact: Parikshith <pbyregow>
Severity: urgent
Priority: urgent
Docs Contact:
Version: 3.0
CC: ceph-eng-bugs, dzafman, hnallurv, icolle, jdurgin, kchai, kdreyer, sweil
Target Milestone: rc
Target Release: 3.0
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version: RHEL: ceph-12.2.1-1.el7cp; Ubuntu: ceph_12.2.1-2redhat1xenial
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1489077 (view as bug list)
Environment:
Last Closed: 2017-12-05 23:41:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1489077

Description Parikshith 2017-09-05 13:35:28 UTC
Description of problem:
A few PGs (3 PGs) are stuck in recovering and degraded state while the OSDs are being upgraded to 3.0.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Create two RHCS 2.4 clusters (encrypted, collocated OSDs). Write some data (I ran RGW I/Os).
2. While I/Os are running, do a rolling update to upgrade one of the clusters to RHCS 3.0 (see the invocation sketch after these steps).
3. ceph-ansible times out after 40 retries while waiting for clean PGs.
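For reference, the rolling update in step 2 is normally driven by ceph-ansible's rolling_update playbook. A minimal sketch of the invocation, assuming the stock playbook location and an inventory file named "hosts" (both are assumptions, not taken from this report):

    # Hedged sketch; the playbook path and inventory name are illustrative.
    cd /usr/share/ceph-ansible
    ansible-playbook infrastructure-playbooks/rolling_update.yml -i hosts -e ireallymeanit=yes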

FAILED - RETRYING: TASK: waiting for clean pgs... (1 retries left).
fatal: [magna075 -> magna043]: FAILED! => {"attempts": 40, "changed": true, "cmd": "test \"[\"\"$(ceph --cluster master -s -f json | python -c 'import sys, json; print(json.load(sys.stdin)[\"pgmap\"][\"num_pgs\"])')\"\"]\" = \"$(ceph --cluster master -s -f json | python -c 'import sys, json; print [ i[\"count\"] for i in json.load(sys.stdin)[\"pgmap\"][\"pgs_by_state\"] if i[\"state_name\"] == \"active+clean\"]')\"", "delta": "0:00:00.657863", "end": "2017-09-05 12:32:58.600534", "failed": true, "rc": 1, "start": "2017-09-05 12:32:57.942671", "stderr": "", "stdout": "", "stdout_lines": [], "warnings": []}
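For readability, the failing task above boils down to comparing the cluster's total PG count against the number of PGs reported as active+clean. An illustrative rewrite of that check (not the playbook's literal command), using the same cluster name as in the task output:

    # Illustrative rewrite of the "waiting for clean pgs" check.
    total=$(ceph --cluster master -s -f json | python -c 'import sys, json; print(json.load(sys.stdin)["pgmap"]["num_pgs"])')
    clean=$(ceph --cluster master -s -f json | python -c 'import sys, json; print(sum(i["count"] for i in json.load(sys.stdin)["pgmap"]["pgs_by_state"] if i["state_name"] == "active+clean"))')
    test "$total" = "$clean"    # here: 192 PGs total vs. 177 active+clean, so the task keeps retrying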


The cluster is still in an error state even after a few hours.

sudo ceph -w --cluster master
  cluster:
    id:     3174c679-de9c-4518-845a-e2bd6271ba8d
    health: HEALTH_ERR
            clock skew detected on mon.magna067
            15 pgs degraded
            3 pgs recovering
            12 pgs recovery_wait
            15 pgs stuck degraded
            15 pgs stuck unclean
            1 requests are blocked > 32 sec
            9 requests are blocked > 4096 sec
            recovery 1271/284010 objects degraded (0.448%)
            noout,noscrub,nodeep-scrub flag(s) set
            Monitor clock skew detected 
 
  services:
    mon: 3 daemons, quorum magna043,magna048,magna067
    mgr: no daemons active
    mds: cephfs-0/0/1 up 
    osd: 9 osds: 9 up, 9 in
         flags noout,noscrub,nodeep-scrub
 
  data:
    pools:   17 pools, 192 pgs
    objects: 94670 objects, 101 GB
    usage:   305 GB used, 8029 GB / 8334 GB avail
    pgs:     1271/284010 objects degraded (0.448%)
             177 active+clean
             12  active+recovery_wait+degraded
             3   active+recovering+degraded
 
  io:
    client:   16228 B/s rd, 0 B/s wr, 16 op/s rd, 0 op/s wr
 

2017-09-05 13:17:04.887876 mon.magna043 [INF] pgmap 192 pgs: 3 active+recovering+degraded, 12 active+recovery_wait+degraded, 177 active+clean; 101 GB data, 305 GB used, 8029 GB / 8334 GB avail; 16228 B/s rd, 0 B/s wr, 16 op/s; 1271/284010 objects degraded (0.448%)
2017-09-05 13:17:09.427720 mon.magna043 [INF] pgmap 192 pgs: 3 active+recovering+degraded, 12 active+recovery_wait+degraded, 177 active+clean; 101 GB data, 305 GB used, 8029 GB / 8334 GB avail; 15395 B/s rd, 0 B/s wr, 15 op/s; 1271/284010 objects degraded (0.448%)
2017-09-05 13:17:09.717693  [WRN] 10 slow requests, 1 included below; oldest blocked for > 3910.815950 secs
2017-09-05 13:17:09.717698  [WRN] slow request 3840.729100 seconds old, received at 2017-09-05 12:13:08.988557: osd_op(client.357436.0:1042903 14.eb4f5f61 fa719695-d75e-441c-a90d-1ba36d79ea3f.357436.2_myobject_1946 [getxattrs,stat] snapc 0=[] ack+read+known_if_redirected e179) currently waiting for missing object
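At this point the stuck PGs and the blocked request can be inspected with the usual PG/OSD introspection commands. A hedged sketch (the PG id and OSD id are placeholders, not values taken from this report):

    ceph --cluster master health detail            # lists the degraded/recovering PG ids
    ceph --cluster master pg dump_stuck unclean    # stuck PGs and the OSDs they map to
    ceph --cluster master pg <pgid> query          # recovery state of a single stuck PG
    ceph daemon osd.<id> dump_ops_in_flight        # blocked ops; run on the host of that PG's primary OSD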


Actual results:


Expected results:


Additional info:

Comment 3 Sage Weil 2017-09-05 19:16:07 UTC
2017-09-05 19:15:50.031468 7f3953da0700 -1 failed to decode message of type 118 v1: buffer::malformed_input: void hobject_t::decode(ceph::buffer::list::iterator&) no longer understand old encoding version 4 < struct_compat
2017-09-05 19:15:50.031476 7f3953da0700  0 dump: 
00000000  01 01 05 00 00 00 03 00  00 00 ff 01 01 12 00 00  |................|
00000010  00 01 0e 00 00 00 00 00  00 00 01 00 00 00 ff ff  |................|
00000020  ff ff ff b2 00 00 00 e8  03 00 00 00 00 00 00 01  |................|
00000030  00 00 00 04 03 5c 00 00  00 00 00 00 00 3b 00 00  |.....\.......;..|
00000040  00 66 61 37 31 39 36 39  35 2d 64 37 35 65 2d 34  |.fa719695-d75e-4|
00000050  34 31 63 2d 61 39 30 64  2d 31 62 61 33 36 64 37  |41c-a90d-1ba36d7|
00000060  39 65 61 33 66 2e 33 35  37 34 33 36 2e 32 5f 6d  |9ea3f.357436.2_m|
00000070  79 6f 62 6a 65 63 74 5f  31 39 34 36 fe ff ff ff  |yobject_1946....|
00000080  ff ff ff ff 61 5f 4f eb  00 00 00 00 00 0e 00 00  |....a_O.........|
00000090  00 00 00 00 00 36 35 00  00 00 00 00 00 b0 00 00  |.....65.........|
000000a0  00                                                |.|
000000a1

Comment 6 Josh Durgin 2017-09-27 05:44:48 UTC
This is fixed in 12.2.1.
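For verification after the upgrade, the running daemon versions can be checked cluster-wide. A hedged sketch (the "ceph versions" command is available on Luminous-based 12.x clusters; the expected build comes from the Fixed In Version field above):

    ceph --cluster master versions    # all mon/mgr/osd daemons should report 12.2.1
    rpm -q ceph-osd                   # on RHEL, expect a 12.2.1-1.el7cp build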

Comment 13 errata-xmlrpc 2017-12-05 23:41:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3387