Bug 1307177

Summary: After upgrading from 1.2.3 to 1.3.0, journal file symlink is missing and OSD is down
Product: Red Hat Ceph Storage
Reporter: Warren <wusui>
Component: Ceph-Disk
Assignee: Loic Dachary <ldachary>
Status: CLOSED CURRENTRELEASE
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: medium
Priority: unspecified
Version: 1.2.3
CC: adeza, ceph-eng-bugs, kdreyer, tmuthami
Target Milestone: rc
Target Release: 1.3.4
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2017-05-30 16:51:30 UTC
Type: Bug
Attachments: Error (flags: none)

Description Warren 2016-02-12 22:33:19 UTC
Created attachment 1123652: Error

Comment 1 Warren 2016-02-12 22:57:53 UTC
Ugh.  I hit CR too early.

Description:
A few Ceph journal partitions were found empty during a fairly large upgrade test (31 OSD hosts, 12 OSDs per host). The upgrade was from 1.2.3 to 1.3. The problems were not noticed until the 1.3 upgrade was in progress.

The journal partition appeared to have been cleared before the upgrade on 3 separate OSDs. Two of the errors may well be due to a combination of operator error and a known bug (tracker issues http://tracker.ceph.com/issues/9665 or http://tracker.ceph.com/issues/10375), but we are not sure what caused the third.

We noticed this problem after upgrading one OSD host: the journal file's symlink was broken, which left the Ceph cluster unhealthy. The OSD did not come up because the journal link was missing. We did not run into a problem until the upgrade, but it is unclear how long the link had been bad before that point.
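
Since it is unclear how long the link had been bad, a check like the following could be run before an upgrade to catch dangling journal links early. This is only a rough sketch, not something we ran during the test: it assumes the usual layout where each OSD has a directory under /var/lib/ceph/osd containing a "journal" symlink, and the function name is made up for illustration.

```python
import os


def find_broken_journal_links(osd_root="/var/lib/ceph/osd"):
    """Return {osd_dir_name: link_target} for journal symlinks whose target is missing.

    Assumption: every OSD directory under osd_root holds a symlink named
    "journal" that points at the journal partition.
    """
    broken = {}
    if not os.path.isdir(osd_root):
        return broken
    for name in sorted(os.listdir(osd_root)):
        journal = os.path.join(osd_root, name, "journal")
        # islink() is true even for a dangling link, while exists() follows
        # the link and is false when the target is gone.
        if os.path.islink(journal) and not os.path.exists(journal):
            broken[name] = os.readlink(journal)
    return broken


if __name__ == "__main__":
    for osd, target in find_broken_journal_links().items():
        print(f"{osd}: journal -> {target} (target missing)")
```

An empty result only means the links resolve; it does not show that the journal partition still holds valid data.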

After noticing that the partitions were unavailable, we fixed the situation by using sgdisk to copy another partition to the clobbered partition, finding the old GUID in the symlink name, and editing the partition's GUID to match the original.
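
The GUID part of that fix can be sketched roughly as below. It assumes the journal symlink target has the form /dev/disk/by-partuuid/<guid>, so the last path component is the original partition GUID. The helper names, the example device /dev/sdb and the partition number are hypothetical. The script only builds the sgdisk command (using its --partition-guid=partnum:guid option) and does not run it, and it does not cover the partition copy step.

```python
import os
import re

# Matches a GPT partition GUID such as 4fbd7e29-9d25-41b8-afd0-062c0ceff05d.
GUID_RE = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def guid_from_journal_link(link_target):
    """Extract the partition GUID from a journal symlink target.

    Assumption: the target looks like /dev/disk/by-partuuid/<guid>.
    """
    guid = os.path.basename(link_target)
    if not GUID_RE.match(guid):
        raise ValueError(f"no partition GUID found in {link_target!r}")
    return guid.lower()


def sgdisk_set_guid_command(disk, partnum, guid):
    """Build (but do not run) the sgdisk call that sets a partition's unique GUID."""
    return ["sgdisk", f"--partition-guid={partnum}:{guid}", disk]


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    target = "/dev/disk/by-partuuid/4fbd7e29-9d25-41b8-afd0-062c0ceff05d"
    guid = guid_from_journal_link(target)
    print(" ".join(sgdisk_set_guid_command("/dev/sdb", 2, guid)))
```

Printing the command instead of executing it leaves room to double-check the disk and partition number before anything is written to the partition table.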