1363807 – Upgrade crashes OSD in void FileStore::init_temp_collections()

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	RADOS
Sub Component:
Version:	2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	rc
Target Release:	2.0
Assignee:	David Zafman
QA Contact:	ceph-qe-bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-08-03 15:39 UTC by David Zafman
Modified:	2022-02-21 18:03 UTC
CC List:	8 users
Fixed In Version:	RHEL: ceph-10.2.2-35.el7cp Ubuntu: ceph_10.2.2-27redhat1xenial
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-08-23 19:45:36 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Ceph Project Bug Tracker	16672	0	None	None	None	2016-08-03 15:39:27 UTC
Red Hat Product Errata	RHBA-2016:1755	0	normal	SHIPPED_LIVE	Red Hat Ceph Storage 2.0 bug fix and enhancement update	2016-08-23 23:23:52 UTC

Description David Zafman 2016-08-03 15:39:27 UTC

Description of problem:

When OSDs restart after upgrade from 1.3.x to 2.0 they can crash almost immediately.  This could happen any number of times, but eventually the OSD will start normally.

How reproducible:

It should happen upon upgrade, but I haven't determined why we aren't seeing this in our testing.

Comment 5 David Zafman 2016-08-05 15:53:04 UTC

To test the fix:

1. Bring up a Jewel cluster using filestore
2. Create a pool with size 1
3. Write a bunch of data to the pool
4. Stop the cluster
5. Find a */current osd dir which doesn't include a pg_head for a valid PG
6. Create pg_TEMP for that pg (e.g. mkdir ....current/1.5_TEMP)
7. Find any other OSD current directory with a pg_head without a pg_TEMP
8. Rename the pg_head to pg_TEMP (e.g. mv ....current/1.0_head ...current/1.0_TEMP)
9. Start cluster

The 2 OSDs that were manipulated should NOT crash
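
Steps 6 and 8 above could be scripted roughly as follows. This is only a sketch, not part of the fix: the function names are hypothetical, the caller must pass the real path of an OSD's current directory and a PG id chosen by hand in steps 5 and 7, and the cluster must already be stopped.

```python
import os


def plant_orphan_temp(current_dir, pgid):
    """Step 6: create an empty <pgid>_TEMP in an OSD that has no <pgid>_head.

    current_dir and pgid are supplied by the caller (hypothetical helper).
    """
    head = os.path.join(current_dir, pgid + "_head")
    temp = os.path.join(current_dir, pgid + "_TEMP")
    if os.path.exists(head):
        raise ValueError("%s exists; pick an OSD that does not hold this PG" % head)
    os.mkdir(temp)
    return temp


def rename_head_to_temp(current_dir, pgid):
    """Step 8: rename <pgid>_head to <pgid>_TEMP where no <pgid>_TEMP exists."""
    head = os.path.join(current_dir, pgid + "_head")
    temp = os.path.join(current_dir, pgid + "_TEMP")
    if not os.path.isdir(head):
        raise ValueError("%s is not a directory" % head)
    if os.path.exists(temp):
        raise ValueError("%s already exists; step 7 needs a pg_head without a pg_TEMP" % temp)
    os.rename(head, temp)
    return temp
```

Both helpers refuse to act when the precondition from step 5 or step 7 is not met, so a wrongly chosen PG raises an error instead of modifying the OSD.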

Comment 6 David Zafman 2016-08-05 18:09:32 UTC

We now fully understand the bug in Hammer which triggers this bug in later releases.

An OSD is affected only if it meets all of the following criteria:

1. The OSD has been marked out then back in, or gets pg(s) pushed to it due to a different map change.  This creates some pg_TEMP dirs
2. The pg_num/pgp_num is increased, causing pgs to split, including one with a pg_TEMP
3. The OSD is NOT restarted prior to upgrade

Workarounds:
1. Restart all OSDs before installing the upgrade
2. With the old OSD stopped and before starting the upgraded OSD, manually search for and rmdir all pg_TEMP directories without a corresponding pg_head directory (scripting this would be helpful)
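
A minimal sketch of how workaround 2 might be scripted is shown below. It is not an official tool: the function names are hypothetical, it assumes PG directories inside an OSD's current directory are named <pgid>_head and <pgid>_TEMP as in the examples in comment 5, and it must only be run while the OSD is stopped. It defaults to a dry run and uses rmdir semantics, so a non-empty pg_TEMP is reported and left in place.

```python
import os

TEMP_SUFFIX = "_TEMP"
HEAD_SUFFIX = "_head"


def find_orphan_temp_dirs(current_dir):
    """Return sorted names of pg_TEMP dirs in current_dir with no matching pg_head."""
    names = set(os.listdir(current_dir))
    orphans = []
    for name in sorted(names):
        if not name.endswith(TEMP_SUFFIX):
            continue
        if not os.path.isdir(os.path.join(current_dir, name)):
            continue
        pgid = name[:-len(TEMP_SUFFIX)]
        if pgid + HEAD_SUFFIX not in names:
            orphans.append(name)
    return orphans


def remove_orphan_temp_dirs(current_dir, dry_run=True):
    """rmdir each orphan pg_TEMP; return the names actually removed."""
    removed = []
    for name in find_orphan_temp_dirs(current_dir):
        path = os.path.join(current_dir, name)
        if dry_run:
            print("would rmdir", path)
            continue
        try:
            os.rmdir(path)  # only succeeds on an empty directory
            removed.append(name)
        except OSError as exc:
            print("skipped", path, exc)
    return removed
```

Calling remove_orphan_temp_dirs(path) with the path of one OSD's current directory prints what would be removed; passing dry_run=False performs the rmdir. Reviewing the dry-run output first seems prudent.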

I would say that this isn't a blocker.  The upstream fix is ready to merge, having passed Rados suite testing.  To avoid customers running into this, I advise including it in 2.0 if it isn't too late.

Comment 7 David Zafman 2016-08-05 19:52:37 UTC

Fix merged to Upstream Jewel branch: https://github.com/ceph/ceph/pull/10561

Comment 17 rakesh-gm 2016-08-10 12:15:12 UTC

Accidentally moved to verified; changing it back to ON_QA.

Comment 18 rakesh-gm 2016-08-11 14:17:35 UTC

I followed the steps in comment 5, and after restarting the modified OSDs, they did not crash. Here are the steps that I followed:

1. I selected PG 5.0, which exists on OSDs [3,1,2], so I created a pg_head on osd.8, where it does not exist ( /current/5.0_head ).

2. Then I selected another OSD, osd.7, and searched for a PG without a pg_TEMP. There were none; every head had an associated TEMP.

3. So I selected pg 6.0_head and deleted its 6.0_TEMP on that OSD, then renamed 6.0_head to 6.0_TEMP. 6.0_TEMP was empty before I deleted the original dir.

4. I restarted the cluster and did some I/O, and none of the OSDs crashed.

I am moving this bug to the verified stage.

Comment 20 errata-xmlrpc 2016-08-23 19:45:36 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1755.html
