Bug 1363807 - Upgrade crashes OSD in void FileStore::init_temp_collections()
Status: CLOSED ERRATA
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 2.0
Assigned To: David Zafman
QA Contact: ceph-qe-bugs
Reported: 2016-08-03 11:39 EDT by David Zafman
Modified: 2017-07-30 11:20 EDT
CC List: 8 users

Fixed In Version: RHEL: ceph-10.2.2-35.el7cp Ubuntu: ceph_10.2.2-27redhat1xenial
Last Closed: 2016-08-23 15:45:36 EDT
Type: Bug

External Trackers
Tracker ID Priority Status Summary Last Updated
Ceph Project Bug Tracker 16672 None None None 2016-08-03 11:39 EDT
Red Hat Product Errata RHBA-2016:1755 normal SHIPPED_LIVE Red Hat Ceph Storage 2.0 bug fix and enhancement update 2016-08-23 19:23:52 EDT

Description David Zafman 2016-08-03 11:39:27 EDT
Description of problem:

When OSDs restart after an upgrade from 1.3.x to 2.0, they can crash almost immediately.  This can happen any number of times, but eventually the OSD will start normally.

How reproducible:

It should happen on upgrade, but I haven't determined why we aren't seeing it in our testing.
Comment 5 David Zafman 2016-08-05 11:53:04 EDT
To test the fix:

1. Bring up a Jewel cluster using filestore
2. Create a pool with size 1
3. Write a bunch of data to the pool
4. Stop the cluster
5. Find a */current OSD dir which doesn't include a pg_head for a valid PG
6. Create a pg_TEMP for that PG (e.g. mkdir ....current/1.5_TEMP)
7. Find any other OSD current directory with a pg_head that has no corresponding pg_TEMP
8. Rename the pg_head to pg_TEMP (e.g. mv ....current/1.0_head ...current/1.0_TEMP)
9. Start cluster

The two OSDs that were manipulated should NOT crash (a rough command sketch follows).
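
A rough shell sketch of steps 5-8, for illustration only; the /var/lib/ceph/osd/ceph-<id> paths, the OSD ids (2 and 3) and the example PG ids (1.5, 1.0) are assumptions carried over from the steps above and must be adapted to the PGs and OSDs actually present on the cluster:

  # Steps 5-6: on an OSD that has no 1.5_head, create an orphan pg_TEMP dir
  mkdir /var/lib/ceph/osd/ceph-2/current/1.5_TEMP

  # Steps 7-8: on another OSD that has 1.0_head but no 1.0_TEMP, rename head to TEMP
  mv /var/lib/ceph/osd/ceph-3/current/1.0_head /var/lib/ceph/osd/ceph-3/current/1.0_TEMP

  # Step 9: start the cluster again; neither manipulated OSD should crash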
Comment 6 David Zafman 2016-08-05 14:09:32 EDT
We now fully understand the bug in Hammer which triggers this bug in later releases.

An OSD must meet the following criteria (a rough CLI approximation follows the list):

1. The OSD has been marked out and then back in, or gets PG(s) pushed to it due to another map change; this creates some pg_TEMP dirs
2. pg_num/pgp_num is increased, causing PGs to split, including one that has a pg_TEMP
3. The OSD is NOT restarted prior to the upgrade
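
On a Hammer (1.3.x) cluster these conditions can be approximated with the standard ceph CLI; this is only a sketch, and the OSD id (3) and pool name (rbd) are placeholders rather than values from this report:

  ceph osd out 3                      # condition 1: mark an OSD out...
  ceph osd in 3                       # ...and back in, so remapping creates pg_TEMP dirs
  ceph osd pool set rbd pg_num 128    # condition 2: increase pg_num/pgp_num so PGs split
  ceph osd pool set rbd pgp_num 128
  # condition 3: upgrade the packages without restarting the OSD first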

Workarounds:
1. Restart all OSDs before installing the upgrade
2. With the old OSDs stopped, and before starting the upgraded OSDs, manually search for and rmdir all pg_TEMP directories without a corresponding pg_head directory (scripting this would be helpful; a sketch follows)
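
A minimal sketch of such a cleanup script, assuming the default /var/lib/ceph/osd/ceph-*/current layout (an assumption; adjust the glob for your deployment) and run only while the affected OSDs are stopped:

  # Remove any pg_TEMP directory that has no matching pg_head directory
  for tempdir in /var/lib/ceph/osd/ceph-*/current/*_TEMP; do
      [ -d "$tempdir" ] || continue
      headdir="${tempdir%_TEMP}_head"
      if [ ! -d "$headdir" ]; then
          echo "removing orphan $tempdir"
          rmdir "$tempdir"
      fi
  done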

I would say that this isn't a blocker.  The upstream fix is ready to merge, having passed Rados suite testing.  To avoid customers running into this, I advise including it in 2.0 if it isn't too late.
Comment 7 David Zafman 2016-08-05 15:52:37 EDT
Fix merged to Upstream Jewel branch: https://github.com/ceph/ceph/pull/10561
Comment 17 rakesh 2016-08-10 08:15:12 EDT
Accidentally moved to VERIFIED; changing it back to ON_QA.
Comment 18 rakesh 2016-08-11 10:17:35 EDT
I followed the steps mentioned in comment 5, and after restarting the modified OSDs they did not crash. Here are the steps that I followed.

1. I selected PG 5.0, which exists on OSDs [3,1,2], and created a pg_head on osd.8 where it does not exist (/current/5.0_head)

2. Then I selected another OSD, osd.7, and searched for a PG without a pg_TEMP. There were none; every head had a TEMP associated with it.

3. So I selected pg 6.0_head, deleted its 6.0_TEMP on that OSD, and renamed 6.0_head to 6.0_TEMP. The existing 6.0_TEMP was empty before I deleted it.

4. I restarted the cluster and did some I/O, and none of the OSDs crashed.

I am moving this bug to the VERIFIED stage.
Comment 20 errata-xmlrpc 2016-08-23 15:45:36 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1755.html
