Bug 1488479
| Summary: | RHCS upgrade 2.4 - 3.0: PGs are stuck at recovering and degraded state | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Parikshith <pbyregow> |
| Component: | RADOS | Assignee: | Josh Durgin <jdurgin> |
| Status: | CLOSED ERRATA | QA Contact: | Parikshith <pbyregow> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 3.0 | CC: | ceph-eng-bugs, dzafman, hnallurv, icolle, jdurgin, kchai, kdreyer, sweil |
| Target Milestone: | rc | | |
| Target Release: | 3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | RHEL: ceph-12.2.1-1.el7cp; Ubuntu: ceph_12.2.1-2redhat1xenial | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| Cloned To: | 1489077 (view as bug list) | Environment: | |
| Last Closed: | 2017-12-05 23:41:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1489077 | | |
```
2017-09-05 19:15:50.031468 7f3953da0700 -1 failed to decode message of type 118 v1: buffer::malformed_input: void hobject_t::decode(ceph::buffer::list::iterator&) no longer understand old encoding version 4 < struct_compat
2017-09-05 19:15:50.031476 7f3953da0700  0 dump:
00000000 01 01 05 00 00 00 03 00 00 00 ff 01 01 12 00 00 |................|
00000010 00 01 0e 00 00 00 00 00 00 00 01 00 00 00 ff ff |................|
00000020 ff ff ff b2 00 00 00 e8 03 00 00 00 00 00 00 01 |................|
00000030 00 00 00 04 03 5c 00 00 00 00 00 00 00 3b 00 00 |.....\.......;..|
00000040 00 66 61 37 31 39 36 39 35 2d 64 37 35 65 2d 34 |.fa719695-d75e-4|
00000050 34 31 63 2d 61 39 30 64 2d 31 62 61 33 36 64 37 |41c-a90d-1ba36d7|
00000060 39 65 61 33 66 2e 33 35 37 34 33 36 2e 32 5f 6d |9ea3f.357436.2_m|
00000070 79 6f 62 6a 65 63 74 5f 31 39 34 36 fe ff ff ff |yobject_1946....|
00000080 ff ff ff ff 61 5f 4f eb 00 00 00 00 00 0e 00 00 |....a_O.........|
00000090 00 00 00 00 00 36 35 00 00 00 00 00 00 b0 00 00 |.....65.........|
000000a0 00 |.|
000000a1
```

This is fixed in 12.2.1.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3387
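For context on the decode failure above: Ceph's versioned encoding prefixes each struct with the version it was written with (struct_v) and the oldest decoder version that can still parse it (struct_compat). A decoder that only understands up to version 4 must reject any payload whose struct_compat is higher, which is exactly the `version 4 < struct_compat` complaint in the log. A minimal sketch of that guard (illustrative only; the two-byte header layout and all names here are simplifying assumptions, not the real Ceph wire format):

```python
class MalformedInput(Exception):
    """Stand-in for ceph::buffer::malformed_input (assumed name)."""

SUPPORTED_VERSION = 4  # newest encoding version this decoder understands

def decode_header(buf: bytes):
    # Assumed layout for illustration: byte 0 is struct_v (version the
    # encoder wrote), byte 1 is struct_compat (oldest decoder version
    # that can still parse this payload).
    struct_v, struct_compat = buf[0], buf[1]
    if SUPPORTED_VERSION < struct_compat:
        # Mirrors the log message: this decoder is too old for the payload.
        raise MalformedInput(
            f"no longer understand old encoding version "
            f"{SUPPORTED_VERSION} < struct_compat {struct_compat}")
    return struct_v, struct_compat, buf[2:]
```

That exception is what surfaces as `failed to decode message of type 118`, which lines up with the fix landing in 12.2.1: recovery of the affected objects can proceed once every daemon agrees on the hobject_t encoding.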
Description of problem:
A few PGs (3) are stuck in recovering and degraded state while OSDs are being updated to 3.0.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Create two 2.4 clusters (encrypted, collocated OSDs). Write some data (I ran RGW IOs).
2. While IOs are running, do a rolling update to upgrade one of the clusters to RHCS 3.0.
3. ceph-ansible times out after 40 retries while waiting for clean PGs (the check it retries is sketched under Additional info below):

```
FAILED - RETRYING: TASK: waiting for clean pgs... (1 retries left).
fatal: [magna075 -> magna043]: FAILED! => {"attempts": 40, "changed": true, "cmd": "test \"[\"\"$(ceph --cluster master -s -f json | python -c 'import sys, json; print(json.load(sys.stdin)[\"pgmap\"][\"num_pgs\"])')\"\"]\" = \"$(ceph --cluster master -s -f json | python -c 'import sys, json; print [ i[\"count\"] for i in json.load(sys.stdin)[\"pgmap\"][\"pgs_by_state\"] if i[\"state_name\"] == \"active+clean\"]')\"", "delta": "0:00:00.657863", "end": "2017-09-05 12:32:58.600534", "failed": true, "rc": 1, "start": "2017-09-05 12:32:57.942671", "stderr": "", "stdout": "", "stdout_lines": [], "warnings": []}
```

The cluster is still in an error state even after a few hours:

```
sudo ceph -w --cluster master
  cluster:
    id:     3174c679-de9c-4518-845a-e2bd6271ba8d
    health: HEALTH_ERR
            clock skew detected on mon.magna067
            15 pgs degraded
            3 pgs recovering
            12 pgs recovery_wait
            15 pgs stuck degraded
            15 pgs stuck unclean
            1 requests are blocked > 32 sec
            9 requests are blocked > 4096 sec
            recovery 1271/284010 objects degraded (0.448%)
            noout,noscrub,nodeep-scrub flag(s) set
            Monitor clock skew detected

  services:
    mon: 3 daemons, quorum magna043,magna048,magna067
    mgr: no daemons active
    mds: cephfs-0/0/1 up
    osd: 9 osds: 9 up, 9 in
         flags noout,noscrub,nodeep-scrub

  data:
    pools:   17 pools, 192 pgs
    objects: 94670 objects, 101 GB
    usage:   305 GB used, 8029 GB / 8334 GB avail
    pgs:     1271/284010 objects degraded (0.448%)
             177 active+clean
             12  active+recovery_wait+degraded
             3   active+recovering+degraded

  io:
    client: 16228 B/s rd, 0 B/s wr, 16 op/s rd, 0 op/s wr

2017-09-05 13:17:04.887876 mon.magna043 [INF] pgmap 192 pgs: 3 active+recovering+degraded, 12 active+recovery_wait+degraded, 177 active+clean; 101 GB data, 305 GB used, 8029 GB / 8334 GB avail; 16228 B/s rd, 0 B/s wr, 16 op/s; 1271/284010 objects degraded (0.448%)
2017-09-05 13:17:09.427720 mon.magna043 [INF] pgmap 192 pgs: 3 active+recovering+degraded, 12 active+recovery_wait+degraded, 177 active+clean; 101 GB data, 305 GB used, 8029 GB / 8334 GB avail; 15395 B/s rd, 0 B/s wr, 15 op/s; 1271/284010 objects degraded (0.448%)
2017-09-05 13:17:09.717693 [WRN] 10 slow requests, 1 included below; oldest blocked for > 3910.815950 secs
2017-09-05 13:17:09.717698 [WRN] slow request 3840.729100 seconds old, received at 2017-09-05 12:13:08.988557: osd_op(client.357436.0:1042903 14.eb4f5f61 fa719695-d75e-441c-a90d-1ba36d79ea3f.357436.2_myobject_1946 [getxattrs,stat] snapc 0=[] ack+read+known_if_redirected e179) currently waiting for missing object
```

Actual results:

Expected results:

Additional info:
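For reference, the one-liner that the "waiting for clean pgs" task retries compares `pgmap.num_pgs` against the `active+clean` count from `ceph -s -f json`. A hedged sketch of the same check as a standalone helper (the function name `all_pgs_active_clean` is mine, not from the playbook):

```python
import json
import subprocess

def all_pgs_active_clean(cluster: str = "master") -> bool:
    """Return True when every PG reported by `ceph -s` is active+clean."""
    out = subprocess.check_output(
        ["ceph", "--cluster", cluster, "-s", "-f", "json"])
    pgmap = json.loads(out)["pgmap"]
    clean = sum(state["count"] for state in pgmap["pgs_by_state"]
                if state["state_name"] == "active+clean")
    # Same comparison the playbook's `test` command performs.
    return clean == pgmap["num_pgs"]
```

Against the status above this returns False, since only 177 of 192 PGs are active+clean, so the task keeps retrying until its 40 attempts run out.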