Bug 1363807
Summary: | Upgrade crashes OSD in void FileStore::init_temp_collections() | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | David Zafman <dzafman> |
Component: | RADOS | Assignee: | David Zafman <dzafman> |
Status: | CLOSED ERRATA | QA Contact: | ceph-qe-bugs <ceph-qe-bugs> |
Severity: | medium | Docs Contact: | |
Priority: | unspecified | ||
Version: | 2.0 | CC: | ceph-eng-bugs, ceph-qe-bugs, dzafman, hnallurv, kchai, kdreyer, rgowdege, tserlin |
Target Milestone: | rc | ||
Target Release: | 2.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Fixed In Version: | RHEL: ceph-10.2.2-35.el7cp; Ubuntu: ceph_10.2.2-27redhat1xenial | Doc Type: | If docs needed, set a value |
Last Closed: | 2016-08-23 19:45:36 UTC | Type: | Bug |
Description
David Zafman
2016-08-03 15:39:27 UTC
To test the fix:

1. Bring up a Jewel cluster using FileStore.
2. Create a pool with size 1.
3. Write a bunch of data to the pool.
4. Stop the cluster.
5. Find an OSD */current directory that does not include a pg_head directory for a valid PG.
6. Create a pg_TEMP directory for that PG (e.g. mkdir ....current/1.5_TEMP).
7. Find any other OSD current directory with a pg_head that has no pg_TEMP.
8. Rename the pg_head to pg_TEMP (e.g. mv ....current/1.0_head ...current/1.0_TEMP).
9. Start the cluster.

The 2 OSDs that were manipulated should NOT crash.

We now fully understand the bug in Hammer which triggers this bug in later releases. An OSD must meet all of the following criteria:

1. The OSD has been marked out then back in, or gets PG(s) pushed to it due to a different map change. This creates some pg_TEMP dirs.
2. The pg_num/pgp_num is increased, causing PGs to split, including one with a pg_TEMP.
3. The OSD is NOT restarted prior to upgrade.

Workarounds:

1. Restart all OSDs before installing the upgrade.
2. With the old OSD stopped and before starting the upgraded OSD, manually search for and rmdir all pg_TEMP directories without a corresponding pg_head directory (scripting this would be helpful).

I would say that this isn't a blocker. The upstream fix is ready to merge, having passed RADOS suite testing. To avoid customers running into this, I advise including it in 2.0 if it isn't too late.

Fix merged to upstream Jewel branch: https://github.com/ceph/ceph/pull/10561

Accidentally moved to verified; changing it back to ON_QA.

I followed the steps mentioned in comment 5, and after restarting the modified OSDs, they did not crash. Here are the steps that I followed:

1. I selected PG 5.0, which exists in OSDs [3,1,2], and created a pg_head in osd.8, where it does not exist (/current/5.0_head).
2. Then I selected another OSD, osd.7, and searched for a PG without a pg_TEMP. There were none; every head had an associated TEMP.
3. So I selected 6.0_head, deleted its 6.0_TEMP in that OSD, and renamed 6.0_HEAD to 6.0_TEMP. 6.0_TEMP was empty before the original dir was deleted.
4. I restarted the cluster and did some I/O, and none of the OSDs crashed.

I am moving this bug to the verified stage.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1755.html
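
Workaround 2 above (removing pg_TEMP directories that have no corresponding pg_head) could be scripted roughly as follows. This is only a minimal sketch, not something taken from the fix itself: the function names are hypothetical, the only assumption about directory naming is the `<pg>_head` / `<pg>_TEMP` suffix convention shown in the test steps, and the caller has to supply the path to the OSD's current directory. It should be run only while the OSD is stopped, and it defaults to a dry run.

```python
import os


def find_orphan_temp_dirs(current_dir):
    """Return paths of <pg>_TEMP directories that have no matching <pg>_head.

    Assumption: PG collections appear directly under current_dir as
    directories named "<pg>_head" and "<pg>_TEMP", as in the bug report.
    """
    heads, temps = set(), set()
    for name in os.listdir(current_dir):
        if not os.path.isdir(os.path.join(current_dir, name)):
            continue
        if name.endswith("_head"):
            heads.add(name[: -len("_head")])
        elif name.endswith("_TEMP"):
            temps.add(name[: -len("_TEMP")])
    return sorted(os.path.join(current_dir, pg + "_TEMP") for pg in temps - heads)


def remove_orphan_temp_dirs(current_dir, dry_run=True):
    """Report (dry run) or rmdir each orphaned <pg>_TEMP directory.

    os.rmdir only removes empty directories, mirroring the manual
    "rmdir" in the workaround; non-empty ones are reported and skipped.
    """
    orphans = find_orphan_temp_dirs(current_dir)
    for path in orphans:
        if dry_run:
            print("would rmdir", path)
            continue
        try:
            os.rmdir(path)
            print("removed", path)
        except OSError as err:
            print("skipped", path, "-", err)
    return orphans
```

A cautious way to use it would be to call `remove_orphan_temp_dirs(path)` first, review the printed list, and only then repeat the call with `dry_run=False`, once per stopped OSD.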