Bug 1393582

Summary: 1.0.5-42.el7scon - osd's are not up
Product: [Red Hat Storage] Red Hat Storage Console Reporter: Vasu Kulkarni <vakulkar>
Component: ceph-ansible Assignee: Sébastien Han <shan>
Status: CLOSED ERRATA QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 2 CC: adeza, aschoen, ceph-eng-bugs, gmeno, hnallurv, kdreyer, nthomas, sankarshan, tchandra
Target Milestone: ---   
Target Release: 2   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ceph-ansible-1.0.5-43.el7scon Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-11-22 23:42:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1391675    

Description Vasu Kulkarni 2016-11-09 22:14:18 UTC
Description of problem:

With the latest ceph-ansible, the OSDs are not coming up:

2016-11-09T16:17:16.086 INFO:teuthology.orchestra.run.pluto007:Running: 'sudo ceph -s'
2016-11-09T16:17:16.339 INFO:teuthology.orchestra.run.pluto007.stdout:    cluster f74830b0-c6f2-4a0f-8246-4a6e0d4362e2
2016-11-09T16:17:16.373 INFO:teuthology.orchestra.run.pluto007.stdout:     health HEALTH_ERR
2016-11-09T16:17:16.375 INFO:teuthology.orchestra.run.pluto007.stdout:            320 pgs are stuck inactive for more than 300 seconds
2016-11-09T16:17:16.415 INFO:teuthology.orchestra.run.pluto007.stdout:            320 pgs stuck inactive
2016-11-09T16:17:16.417 INFO:teuthology.orchestra.run.pluto007.stdout:            no osds
2016-11-09T16:17:16.457 INFO:teuthology.orchestra.run.pluto007.stdout:     monmap e1: 2 mons at {clara005=10.8.129.5:6789/0,pluto007=10.8.129.107:6789/0}
2016-11-09T16:17:16.459 INFO:teuthology.orchestra.run.pluto007.stdout:            election epoch 4, quorum 0,1 clara005,pluto007
2016-11-09T16:17:16.499 INFO:teuthology.orchestra.run.pluto007.stdout:      fsmap e3: 0/0/1 up
2016-11-09T16:17:16.501 INFO:teuthology.orchestra.run.pluto007.stdout:     osdmap e4: 0 osds: 0 up, 0 in
2016-11-09T16:17:16.541 INFO:teuthology.orchestra.run.pluto007.stdout:            flags sortbitwise
2016-11-09T16:17:16.543 INFO:teuthology.orchestra.run.pluto007.stdout:      pgmap v5: 320 pgs, 3 pools, 0 bytes data, 0 objects
2016-11-09T16:17:16.583 INFO:teuthology.orchestra.run.pluto007.stdout:            0 kB used, 0 kB / 0 kB avail
2016-11-09T16:17:16.584 INFO:teuthology.orchestra.run.pluto007.stdout:                 320 creating
2016-11-09T16:17:16.625 INFO:teuthology.task.ceph_ansible:Waiting for Ceph health to reach HEALTH_OK   

Full logs:
http://magna002.ceph.redhat.com/vasu-2016-11-09_16:04:17-smoke-jewel---basic-multi/259550/teuthology.log
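The teuthology task in the log above polls `ceph -s` until the cluster reports HEALTH_OK (which never happens here, since there are no OSDs). A minimal sketch of that kind of wait loop; the timeout and interval values are illustrative, and this is not teuthology's actual implementation:

```shell
# Poll cluster health until HEALTH_OK or timeout (values are illustrative).
wait_health_ok() {
    timeout=${1:-300}
    interval=${2:-5}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # Pull the status word from the "health HEALTH_*" line of `ceph -s`.
        status=$(ceph -s 2>/dev/null | awk '/health/ {print $2; exit}')
        [ "$status" = "HEALTH_OK" ] && return 0
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1  # timed out; cluster never reached HEALTH_OK
}
```

With "no osds" as in this bug, the loop would simply time out, which matches the stuck "Waiting for Ceph health to reach HEALTH_OK" task above.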

Comment 3 Tejas 2016-11-10 04:26:29 UTC
Hi,

   My observation is that this issue was not consistent in the rolling_update tests I ran yesterday.
Build used:
1.0.5-42

Thanks,
Tejas

Comment 4 Tejas 2016-11-10 04:30:29 UTC
Workaround that worked:

The cause is that, in some cases, one of the OSD daemons on a node fails to come back up after the OSD restart.
I manually rebooted the OSD node; the OSD daemon then came back up, and only after that did the cluster return to a HEALTH_OK state.
However, this is a very disruptive workaround.
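A less disruptive alternative than rebooting the whole node (a sketch, not what was done here; the `ceph-osd@<id>` unit name assumes a systemd-based jewel deployment, and the OSD id is hypothetical) would be to restart only the failed OSD daemon:

```shell
# Restart a single failed OSD daemon rather than rebooting the node.
# Assumes a systemd-managed OSD (ceph-osd@<id>); run with sufficient privileges.
restart_osd() {
    id=$1
    systemctl restart "ceph-osd@${id}" || return 1
    # Show where the OSD now stands (up/down, in/out) in the CRUSH tree.
    ceph osd tree
}
```

For example, `restart_osd 3` would restart the hypothetical osd.3 and print the OSD tree so its up/in state can be confirmed.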

Thanks,
Tejas

Comment 5 Ken Dreyer (Red Hat) 2016-11-10 15:01:57 UTC
With ceph-1.0.5-42.el7scon in the raw_multi_journal scenario (using ceph-installer), my OSDs do not come up because they cannot find a keyring.

Comment 6 Ken Dreyer (Red Hat) 2016-11-10 19:58:44 UTC
(In reply to Ken Dreyer (Red Hat) from comment #5)
> With ceph-1.0.5-42.el7scon in the raw_multi_journal scenario (using
> ceph-installer), my OSDs do not come up because they cannot find a keyring.

This was a red-herring problem with my own test environment (I didn't realize ceph-installer does not actually put the client.admin keyring on the OSDs). My bad.
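As a diagnostic aid for this kind of confusion, here is a sketch of checking which keyrings are actually present on an OSD host. The default paths are the jewel-era ones (`/etc/ceph` for the client.admin keyring, `/var/lib/ceph/osd/ceph-<id>/keyring` per OSD); this is not part of the original report:

```shell
# List keyring files under the given ceph state dirs (jewel-era defaults).
# On a host deployed by ceph-installer, the client.admin keyring may be
# absent from /etc/ceph, as noted above.
list_keyrings() {
    etc_dir=${1:-/etc/ceph}
    osd_dir=${2:-/var/lib/ceph/osd}
    find "$etc_dir" -maxdepth 1 -name '*.keyring' 2>/dev/null
    find "$osd_dir" -maxdepth 2 -name keyring 2>/dev/null
}
```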

The real error I'm seeing with ceph-ansible-1.0.5-42 is the same one reported in Vasu's bug description above, and in bug 1393684 as well: "HEALTH_ERR" and "no osds".

ceph-ansible-1.0.5-43 reverts the latest dmcrypt changes, so this regression should be fixed in that build. That build is in today's compose, RHSCON-2.0-RHEL-7-20161110.t.0.

Comment 9 errata-xmlrpc 2016-11-22 23:42:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:2817