Bug 1363807 - Upgrade crashes OSD in void FileStore::init_temp_collections()
Summary: Upgrade crashes OSD in void FileStore::init_temp_collections()
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 2.0
Assignee: David Zafman
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-08-03 15:39 UTC by David Zafman
Modified: 2017-07-30 15:20 UTC (History)

Fixed In Version: RHEL: ceph-10.2.2-35.el7cp Ubuntu: ceph_10.2.2-27redhat1xenial
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-23 19:45:36 UTC
Target Upstream Version:


Links
System ID Priority Status Summary Last Updated
Ceph Project Bug Tracker 16672 None None None 2016-08-03 15:39:27 UTC
Red Hat Product Errata RHBA-2016:1755 normal SHIPPED_LIVE Red Hat Ceph Storage 2.0 bug fix and enhancement update 2016-08-23 23:23:52 UTC

Description David Zafman 2016-08-03 15:39:27 UTC
Description of problem:

When OSDs restart after an upgrade from 1.3.x to 2.0, they can crash almost immediately. This can happen any number of times, but eventually the OSD starts normally.

How reproducible:

It should happen upon upgrade, but I haven't determined why we aren't seeing this in our testing.

Comment 5 David Zafman 2016-08-05 15:53:04 UTC
To test the fix:

1. Bring up a Jewel cluster using filestore
2. Create a pool with size 1
3. Write a bunch of data to the pool
4. Stop the cluster
5. Find an OSD */current directory that does not contain the pg_head for some valid PG
6. Create a pg_TEMP for that PG (e.g. mkdir ....current/1.5_TEMP)
7. Find any other OSD current directory that has a pg_head without a pg_TEMP
8. Rename that pg_head to pg_TEMP (e.g. mv ....current/1.0_head ...current/1.0_TEMP)
9. Start the cluster

The 2 OSDs that were manipulated should NOT crash.
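
For steps 5 and 7, a small helper can show which PGs have a pg_head, a pg_TEMP, or both in a given current directory. This is only a sketch: the function names are made up for illustration, and it assumes the directories are named <pgid>_head and <pgid>_TEMP as in the examples above. Run it only while the OSD is stopped.

```python
import os


def pg_dir_state(current_dir):
    """Map each PG id to the set of directory kinds ("head", "TEMP") found for it.

    Assumes collection directories are named <pgid>_head and <pgid>_TEMP,
    as in the mkdir/mv examples in this comment.
    """
    state = {}
    for name in os.listdir(current_dir):
        # Skip plain files and anything that is not a directory.
        if not os.path.isdir(os.path.join(current_dir, name)):
            continue
        for suffix in ("_head", "_TEMP"):
            if name.endswith(suffix):
                pgid = name[:-len(suffix)]
                state.setdefault(pgid, set()).add(suffix[1:])
    return state


def heads_without_temp(current_dir):
    """Return PG ids that have a pg_head but no pg_TEMP (candidates for step 7)."""
    return sorted(
        pgid
        for pgid, kinds in pg_dir_state(current_dir).items()
        if kinds == {"head"}
    )
```

Calling heads_without_temp() with the path to an OSD's current directory lists the PGs usable in step 8. A PG id that is valid in the cluster but missing from pg_dir_state() for that OSD is a candidate for step 6.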

Comment 6 David Zafman 2016-08-05 18:09:32 UTC
We now fully understand the bug in Hammer which triggers this bug in later releases.

For an OSD to hit this, all of the following must be true:

1. The OSD has been marked out then back in, or gets pg(s) pushed to it due to a different map change. This creates some pg_TEMP dirs
2. The pg_num/pgp_num is increased, causing pgs to split, including one with a pg_TEMP
3. The OSD is NOT restarted prior to the upgrade

Workarounds:
1. Restart all OSDs before installing the upgrade
2. With the old OSD stopped and before starting the upgraded OSD, manually search for and rmdir all pg_TEMP directories that have no corresponding pg_head directory (scripting this would be helpful)
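
Workaround 2 could be scripted roughly as follows. This is a sketch, not a tested tool: the function names are hypothetical, and it assumes the pg_TEMP and pg_head directories sit side by side in the OSD's current directory and are named <pgid>_TEMP and <pgid>_head. It defaults to a dry run and uses rmdir semantics, so a non-empty pg_TEMP raises an error instead of being deleted.

```python
import os


def orphaned_temp_dirs(current_dir):
    """Return names of pg_TEMP directories that have no matching pg_head.

    Assumes the naming <pgid>_TEMP / <pgid>_head inside current_dir.
    """
    names = set(os.listdir(current_dir))
    orphans = []
    for name in sorted(names):
        if not name.endswith("_TEMP"):
            continue
        if not os.path.isdir(os.path.join(current_dir, name)):
            continue
        pgid = name[:-len("_TEMP")]
        if pgid + "_head" not in names:
            orphans.append(name)
    return orphans


def remove_orphans(current_dir, dry_run=True):
    """List (and, unless dry_run, rmdir) every orphaned pg_TEMP directory.

    os.rmdir only removes empty directories; a non-empty one raises OSError.
    Returns the full paths that were (or would be) removed.
    """
    paths = []
    for name in orphaned_temp_dirs(current_dir):
        path = os.path.join(current_dir, name)
        if not dry_run:
            os.rmdir(path)
        paths.append(path)
    return paths
```

Run remove_orphans() once per OSD with the old OSD stopped, check the dry-run output, then repeat with dry_run=False before starting the upgraded OSD.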

I would say that this isn't a blocker. The upstream fix is ready to merge, having passed Rados suite testing. To avoid customers running into this, I advise including it in 2.0 if it isn't too late.

Comment 7 David Zafman 2016-08-05 19:52:37 UTC
Fix merged to Upstream Jewel branch: https://github.com/ceph/ceph/pull/10561

Comment 17 rakesh 2016-08-10 12:15:12 UTC
accidentally moved to verified, changing it back to ON_QA

Comment 18 rakesh 2016-08-11 14:17:35 UTC
I followed the steps in comment 5, and after restarting the modified OSDs, they did not crash. Here are the steps that I followed.

1. I selected PG 5.0, which exists on OSDs [3,1,2], so I created a pg_head in osd.8, where it does not exist ( /current/5.0_head ).

2. Then I selected another OSD, osd.7, and searched for a pg_head without a pg_TEMP. There were none; every head had an associated TEMP.

3. So I selected 6.0_head and deleted its 6.0_TEMP in that OSD, then renamed 6.0_head to 6.0_TEMP. The original 6.0_TEMP was empty before I deleted it.

4. I restarted the cluster and did some I/O, and none of the OSDs crashed.

I am moving this bug to the verified stage.

Comment 20 errata-xmlrpc 2016-08-23 19:45:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1755.html

