Bug 1262480
| Summary: | osd: hammer: fail to start due to stray pgs after firefly->hammer upgrade | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Vasu Kulkarni <vakulkar> |
| Component: | RADOS | Assignee: | Ken Dreyer (Red Hat) <kdreyer> |
| Status: | CLOSED ERRATA | QA Contact: | ceph-qe-bugs <ceph-qe-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 1.3.0 | CC: | ceph-eng-bugs, dzafman, hnallurv, kchai, kdreyer, sjust, tganguly |
| Target Milestone: | rc | | |
| Target Release: | 1.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | ceph-0.94.1-19.el7cp (RHEL), ceph v0.94.1.8 (Ubuntu) | Doc Type: | Bug Fix |
| Doc Text: | In a scenario where a user is running a version of Ceph older than v0.94, the Ceph Object Storage Daemon (OSD) restarts before completing a placement group (PG) removal operation, and the user then upgrades to RHCS 1.3, the OSD could fail to start when it encountered remnants of the old placement group. With this update, the OSD ignores the stale PG and starts up successfully. | Story Points: | --- |
| Clone Of: | | | |
| : | 1262485 (view as bug list) | Environment: | |
| Last Closed: | 2015-10-08 18:59:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1262485 | | |
Description
Vasu Kulkarni
2015-09-11 20:36:49 UTC
Fix that went into upstream's hammer: https://github.com/ceph/ceph/pull/5892

Hi Ken, based on that comment I can think of the scenario below. Please correct me.

1. Bring up the cluster on version 1.2.3.
2. Create a pool with 128 PGs.
3. Fill it with some data.
4. Delete the pool.
5. While step 4 is in progress, shut down an OSD immediately [the OSD must be part of the acting set for a few PGs]. When that OSD is brought back, it will still hold information about PGs that have already been deleted from the other OSDs, leading to an inconsistency after the upgrade.
6. Once the OSD comes back and the cluster becomes healthy, start the upgrade from 1.2.3 to 1.3.0.

Can you please let me know what other scenarios are possible? Also let me know whether the following makes sense: e.g. unmounting the OSD partition while the pool deletion is in progress.

Thanks, Tanay

Ken, can you please also let us know which version we should start the upgrade from? We plan to use 1.2.3 on RHEL 7.1 and upgrade from there to the 1.3.0 async release. If that is not the right starting version, please let us know which one is.

The following test will be run downstream to verify on RHEL 7.1: https://github.com/ceph/ceph-qa-suite/blob/f0c925e30a1d6fc9db00a220d129f63274cdf94f/suites/rados/singleton-nomsgr/all/11429.yaml

Tanay: Looking at Sage's changes to 11429.yaml, that looks like the right idea. You probably need a lot more than 128 PGs, though. The trick is that when the 'delete pool' command completes, it actually just begins an async PG deletion process. The key is to kill the OSDs after the deletion has begun, but before it has completed, so that some of the PGs are caught in the intermediate state. You probably want to wait a bit (10s in 11429.yaml) between running the command to remove the pools and shutting down the OSDs (you probably want to stop all of them). I don't think unmounting the OSD partitions is necessary. Using 11429.yaml directly would be better, of course!

*** Bug 1262485 has been marked as a duplicate of this bug. ***

Sorry for the confusion, Federico; will close this once it gets verified in the 1.3.0 async release.

Verified on magna076/magna059 using a 1.2.3->1.3.0 upgrade; partial logs at: http://pastebin.test.redhat.com/315887

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2015:1882
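Below is a minimal sketch of the reproduction sequence described in the comments above, written as a small driver script. It is an illustration only, not the downstream 11429.yaml test: the pool name, PG count, OSD ids, sleep duration, and service-management commands are assumptions, and the package upgrade to RHCS 1.3 happens out of band between the stop and start steps.

```python
#!/usr/bin/env python
# Hypothetical reproduction sketch (not the downstream 11429.yaml test).
# Pool name, PG count, OSD ids, timings, and init commands are assumptions.
import subprocess
import time

POOL = "straypool"       # hypothetical pool name
OSD_IDS = [0, 1, 2]      # hypothetical OSD ids to stop

def sh(cmd):
    """Run a shell command on the cluster node and fail loudly on error."""
    print("+ " + cmd)
    subprocess.check_call(cmd, shell=True)

# 1. On the pre-upgrade (RHCS 1.2.3, pre-0.94) cluster, create a pool with
#    many PGs and write enough objects that PG removal takes a while.
sh("ceph osd pool create {0} 256 256".format(POOL))
sh("rados -p {0} bench 30 write --no-cleanup".format(POOL))

# 2. Deleting the pool only *starts* an asynchronous PG removal on the OSDs.
sh("ceph osd pool delete {0} {0} --yes-i-really-really-mean-it".format(POOL))

# 3. Wait briefly so the removal has begun but not finished, then stop all
#    OSDs, leaving some PGs on disk in a half-removed state.
time.sleep(10)
for osd in OSD_IDS:
    sh("service ceph stop osd.{0}".format(osd))   # init command varies by platform

# 4. Upgrade the ceph packages to RHCS 1.3 (hammer) out of band, then restart
#    the OSDs; with the fix they ignore the leftover PG remnants and start up.
for osd in OSD_IDS:
    sh("service ceph start osd.{0}".format(osd))
```

On a fast cluster the deletion may finish before the OSDs are stopped, so a larger pool (or several pools) may be needed to catch PGs in the intermediate state, as noted in the comments above.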