Bug 1305812

Summary: osd down during upgrade to 1.3.2 latest build
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: rakesh-gm <rgowdege>
Component: Unclassified
Assignee: ceph-eng-bugs <ceph-eng-bugs>
Status: CLOSED NOTABUG
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 1.3.2
CC: hnallurv, kdreyer, ldachary, rgowdege
Target Milestone: rc
Target Release: 1.3.3
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-15 14:31:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments: mon and osd logs (flags: none)

Description rakesh-gm 2016-02-09 10:07:56 UTC
Created attachment 1122366
mon and osd logs

My cluster setup:

3 mons 
3 osds 
1 admin 

I initially installed the Jan 16th build of ceph 1.3.2,
and then updated today to the Feb 5th build of ceph 1.3.2.

After the update, one of the OSDs, osd.4, on one node is down.
The logs show that it was not pointing to the right journal, and hence the service was not starting.

Running ls -l in /var/lib/ceph/osd/ceph-4/ shows:

-rw-r--r--   1 root root  490 Feb  5 14:56 activate.monmap
-rw-r--r--   1 root root    3 Feb  5 14:56 active
-rw-r--r--   1 root root   37 Feb  5 14:56 ceph_fsid
drwxr-xr-x 190 root root 8192 Feb  8 09:29 current
-rw-r--r--   1 root root   37 Feb  5 14:56 fsid
lrwxrwxrwx   1 root root   58 Feb  5 14:56 journal -> /dev/disk/by-partuuid/37a4cc7f-ec14-4a14-90b6-2c8db3caae18
-rw-r--r--   1 root root   37 Feb  5 14:56 journal_uuid
-rw-------   1 root root   56 Feb  5 14:56 keyring
-rw-r--r--   1 root root   21 Feb  5 14:56 magic
-rw-r--r--   1 root root    6 Feb  5 14:56 ready
-rw-r--r--   1 root root    4 Feb  5 14:56 store_version
-rw-r--r--   1 root root   53 Feb  5 14:56 superblock
-rw-r--r--   1 root root    0 Feb  5 14:56 upstart

But /dev/disk/by-partuuid has no entry for that journal partition:

 ls -l /dev/disk/by-partuuid/
total 0
lrwxrwxrwx 1 root root 10 Feb  5 15:00 7f223239-d39b-4401-8794-78a00ee95c68 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Feb  8 11:03 c4110af4-b22a-43c1-869d-f4e01dc334aa -> ../../sdb2
lrwxrwxrwx 1 root root 10 Feb  8 11:05 e60c60ea-9076-48e2-8bf9-2683f067051d -> ../../sdd2
lrwxrwxrwx 1 root root 10 Feb  5 15:00 f588e1f6-ec3f-4dd7-a2df-1f5a7cb6b44f -> ../../sdd1
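
The manual comparison above (the journal link in the OSD directory versus the entries under /dev/disk/by-partuuid) could be automated roughly as follows. This is only a sketch, not a Ceph tool: the function name find_dangling_journals is made up for illustration, and it assumes the default /var/lib/ceph/osd layout shown in this report, where each OSD directory has a "journal" symlink.

```python
import os


def find_dangling_journals(osd_root="/var/lib/ceph/osd"):
    """Return (osd_dir_name, link_target) pairs whose journal symlink does not resolve.

    Hypothetical helper: assumes each OSD directory under osd_root may
    contain a symlink named "journal", as in the listing above.
    """
    dangling = []
    if not os.path.isdir(osd_root):
        return dangling
    for name in sorted(os.listdir(osd_root)):
        journal = os.path.join(osd_root, name, "journal")
        # islink() is true even for a broken link; exists() follows the
        # link and is false when the target is missing.
        if os.path.islink(journal) and not os.path.exists(journal):
            dangling.append((name, os.readlink(journal)))
    return dangling
```

On a node in the state described here, calling find_dangling_journals() would be expected to report ceph-4 together with the by-partuuid path that no longer exists. An empty list would mean every journal link resolves.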

This is on Ubuntu 14.04.
I started on kernel 3.16 and then updated to kernel 3.19.
The bug was hit on kernel 3.16 and still existed after the update to 3.19.

The osd.4 logs are attached.

Comment 3 Loic Dachary 2016-02-09 15:07:50 UTC
This seems to indicate that the /dev/disk/by-partuuid/37a4cc7f-ec14-4a14-90b6-2c8db3caae18 symlink was broken during the upgrade. This symlink is not created by Ceph itself but by the underlying operating system, and I don't see how a Ceph upgrade could damage it. Are you able to reproduce the problem?
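
One way to check on a repeat run whether the upgrade changes the OS-managed symlinks is to record /dev/disk/by-partuuid before and after the upgrade and compare the two. The sketch below is an illustration only: snapshot_partuuids and diff_snapshots are hypothetical helper names, not part of Ceph or udev.

```python
import os


def snapshot_partuuids(directory="/dev/disk/by-partuuid"):
    """Map each entry name to its symlink target (None if it is not a symlink)."""
    snapshot = {}
    if not os.path.isdir(directory):
        return snapshot
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        snapshot[name] = os.readlink(path) if os.path.islink(path) else None
    return snapshot


def diff_snapshots(before, after):
    """Return (removed, added, changed) entry names between two snapshots."""
    removed = sorted(set(before) - set(after))
    added = sorted(set(after) - set(before))
    changed = sorted(
        name for name in set(before) & set(after) if before[name] != after[name]
    )
    return removed, added, changed
```

Taking one snapshot before the upgrade and one after, a partition UUID appearing in the "removed" list would suggest the symlink disappeared during that window. It would not by itself show what removed it.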

Comment 4 rakesh-gm 2016-02-15 14:31:44 UTC
Hi Loic, 

I have tested this upgrade path again. I did not hit this issue and hence could not reproduce it. Closing this as not a bug for now.