Red Hat Bugzilla – Bug 1307177
after upgrading from 1.2.3 to 1.3.0 Journal file sym link missing and osd is down
Last modified: 2017-07-30 10:58:24 EDT
Created attachment 1123652
Description of problem:
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Ugh. I hit CR too early.
A few Ceph journal partitions turned out to be empty during a fairly large upgrade test (31 OSD hosts, 12 OSDs per host). The upgrade was from 1.2.3 to 1.3.0. The problems were not noticed until the upgrade to 1.3.0 was in progress.
The journal partition appeared to have been cleared before the upgrade on 3 separate OSDs. Two of these cases may well be due to a combination of operator error and a known bug (tracker issues http://tracker.ceph.com/issues/9665 or http://tracker.ceph.com/issues/10375), but we cannot account for the third.
We noticed the problem after upgrading one OSD host: the journal symlink was broken, so the OSD did not come up and the Ceph cluster became unhealthy. We did not run into trouble until the upgrade, but it is unclear how long the link had been bad before that point.
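For anyone wanting to check for this condition before an upgrade, here is a minimal Python sketch that lists OSD directories whose journal symlink is dangling. It assumes the usual /var/lib/ceph/osd/<cluster>-<id>/journal layout, which this report does not spell out, and the function name find_broken_journals is made up for illustration. It only detects links that exist but point at a missing target, not a journal entry that is absent altogether.

```python
import os


def find_broken_journals(osd_root="/var/lib/ceph/osd"):
    """Return (osd_dir_name, link_target) pairs whose journal symlink dangles.

    osd_root is an assumed default; pass the real path if yours differs.
    """
    broken = []
    if not os.path.isdir(osd_root):
        return broken
    for name in sorted(os.listdir(osd_root)):
        journal = os.path.join(osd_root, name, "journal")
        # islink() is true even for a dangling link, while exists()
        # follows the link and is false when the target is gone.
        if os.path.islink(journal) and not os.path.exists(journal):
            broken.append((name, os.readlink(journal)))
    return broken


if __name__ == "__main__":
    for osd, target in find_broken_journals():
        print(f"{osd}: journal -> {target} (target missing)")
```

Running this on each OSD host before starting the upgrade would show whether a link was already bad, which is the thing we could not tell after the fact.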
After noticing that the partitions were unavailable, we fixed the situation by using sgdisk to copy another partition to the clobbered partition, finding the old GUID in the symlink name, and editing the partition's GUID to match the original.
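The GUID part of that fix can be sketched roughly as follows. This is not the exact procedure we ran, only an illustration: it assumes the broken journal symlink points at /dev/disk/by-partuuid/<guid>, and the helper names, the example device and the partition number are hypothetical. The script only builds the sgdisk command line (using sgdisk's --partition-guid=partnum:guid option) and does not execute it.

```python
import os
import re

# Standard 8-4-4-4-12 hexadecimal GUID form.
GUID_RE = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def guid_from_link_target(target):
    """Extract the partition GUID from a by-partuuid style link target."""
    guid = os.path.basename(target)
    if not GUID_RE.match(guid):
        raise ValueError(f"link target does not end in a GUID: {target!r}")
    return guid.lower()


def sgdisk_set_guid_cmd(disk, partnum, guid):
    """Build (but do not run) the sgdisk call that sets a partition's GUID."""
    return ["sgdisk", f"--partition-guid={partnum}:{guid}", disk]


if __name__ == "__main__":
    # Hypothetical values: read the real target with os.readlink() on the
    # OSD's journal symlink, and double-check the disk and partition number.
    target = "/dev/disk/by-partuuid/0f1e2d3c-4b5a-6978-8796-a5b4c3d2e1f0"
    cmd = sgdisk_set_guid_cmd("/dev/sdX", 2, guid_from_link_target(target))
    print(" ".join(cmd))
```

Printing the command rather than running it leaves room to verify the device and partition number by hand, which matters here because the original damage may itself have come from operator error.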