Bug 1393582

Summary: 1.0.5-42.el7scon - osd's are not up
Product: [Red Hat Storage] Red Hat Storage Console Reporter: Vasu Kulkarni <vakulkar>
Component: ceph-ansible Assignee: Sébastien Han <shan>
Status: CLOSED ERRATA QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 2 CC: adeza, aschoen, ceph-eng-bugs, gmeno, hnallurv, kdreyer, nthomas, sankarshan, tchandra
Target Milestone: ---   
Target Release: 2   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ceph-ansible-1.0.5-43.el7scon Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-11-22 23:42:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1391675    

Description Vasu Kulkarni 2016-11-09 22:14:18 UTC
Description of problem:

With the latest ceph-ansible, the OSDs are not coming up:

2016-11-09T16:17:16.086 INFO:teuthology.orchestra.run.pluto007:Running: 'sudo ceph -s'
2016-11-09T16:17:16.339 INFO:teuthology.orchestra.run.pluto007.stdout:    cluster f74830b0-c6f2-4a0f-8246-4a6e0d4362e2
2016-11-09T16:17:16.373 INFO:teuthology.orchestra.run.pluto007.stdout:     health HEALTH_ERR
2016-11-09T16:17:16.375 INFO:teuthology.orchestra.run.pluto007.stdout:            320 pgs are stuck inactive for more than 300 seconds
2016-11-09T16:17:16.415 INFO:teuthology.orchestra.run.pluto007.stdout:            320 pgs stuck inactive
2016-11-09T16:17:16.417 INFO:teuthology.orchestra.run.pluto007.stdout:            no osds
2016-11-09T16:17:16.457 INFO:teuthology.orchestra.run.pluto007.stdout:     monmap e1: 2 mons at {clara005=10.8.129.5:6789/0,pluto007=10.8.129.107:6789/0}
2016-11-09T16:17:16.459 INFO:teuthology.orchestra.run.pluto007.stdout:            election epoch 4, quorum 0,1 clara005,pluto007
2016-11-09T16:17:16.499 INFO:teuthology.orchestra.run.pluto007.stdout:      fsmap e3: 0/0/1 up
2016-11-09T16:17:16.501 INFO:teuthology.orchestra.run.pluto007.stdout:     osdmap e4: 0 osds: 0 up, 0 in
2016-11-09T16:17:16.541 INFO:teuthology.orchestra.run.pluto007.stdout:            flags sortbitwise
2016-11-09T16:17:16.543 INFO:teuthology.orchestra.run.pluto007.stdout:      pgmap v5: 320 pgs, 3 pools, 0 bytes data, 0 objects
2016-11-09T16:17:16.583 INFO:teuthology.orchestra.run.pluto007.stdout:            0 kB used, 0 kB / 0 kB avail
2016-11-09T16:17:16.584 INFO:teuthology.orchestra.run.pluto007.stdout:                 320 creating
2016-11-09T16:17:16.625 INFO:teuthology.task.ceph_ansible:Waiting for Ceph health to reach HEALTH_OK   

Full logs:
http://magna002.ceph.redhat.com/vasu-2016-11-09_16:04:17-smoke-jewel---basic-multi/259550/teuthology.log
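The teuthology task in the log above polls `ceph -s` until the cluster reports HEALTH_OK (which never happens here, since there are no OSDs). A minimal sketch of that kind of wait loop; the timeout and interval values are illustrative, and this is not teuthology's actual implementation:

```shell
# Poll cluster health until HEALTH_OK or timeout (values are illustrative).
wait_health_ok() {
    timeout=${1:-300}
    interval=${2:-5}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # Pull the status word from the "health HEALTH_*" line of `ceph -s`.
        status=$(ceph -s 2>/dev/null | awk '/health/ {print $2; exit}')
        [ "$status" = "HEALTH_OK" ] && return 0
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1  # timed out; cluster never reached HEALTH_OK
}
```

With "no osds" as in this bug, the loop would simply time out, which matches the stuck "Waiting for Ceph health to reach HEALTH_OK" task above.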

Comment 3 Tejas 2016-11-10 04:26:29 UTC
Hi,

   My observation is that this issue was not consistent in the rolling_update tests I ran yesterday.
Build used:
1.0.5-42

Thanks,
Tejas

Comment 4 Tejas 2016-11-10 04:30:29 UTC
Workaround that worked:

The cause is that, in some cases, one of the OSD daemons on a node fails to come back up after the OSD restart.
I manually rebooted the OSD node; the OSD daemon then came back up, and only after that did the cluster return to a HEALTH_OK state.
However, this is a very disruptive workaround.
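A less disruptive alternative than rebooting the whole node (a sketch, not what was done here; the `ceph-osd@<id>` unit name assumes a systemd-based jewel deployment, and the OSD id is hypothetical) would be to restart only the failed OSD daemon:

```shell
# Restart a single failed OSD daemon rather than rebooting the node.
# Assumes a systemd-managed OSD (ceph-osd@<id>); run with sufficient privileges.
restart_osd() {
    id=$1
    systemctl restart "ceph-osd@${id}" || return 1
    # Show where the OSD now stands (up/down, in/out) in the CRUSH tree.
    ceph osd tree
}
```

For example, `restart_osd 3` would restart the hypothetical osd.3 and print the OSD tree so its up/in state can be confirmed.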

Thanks,
Tejas

Comment 5 Ken Dreyer (Red Hat) 2016-11-10 15:01:57 UTC
With ceph-1.0.5-42.el7scon in the raw_multi_journal scenario (using ceph-installer), my OSDs do not come up because they cannot find a keyring.

Comment 6 Ken Dreyer (Red Hat) 2016-11-10 19:58:44 UTC
(In reply to Ken Dreyer (Red Hat) from comment #5)
> With ceph-1.0.5-42.el7scon in the raw_multi_journal scenario (using
> ceph-installer), my OSDs do not come up because they cannot find a keyring.

This was a red-herring problem with my own test environment (I didn't realize ceph-installer does not actually put the client.admin keyring on the OSDs). My bad.
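As a diagnostic aid for this kind of confusion, here is a sketch of checking which keyrings are actually present on an OSD host. The default paths are the jewel-era ones (`/etc/ceph` for the client.admin keyring, `/var/lib/ceph/osd/ceph-<id>/keyring` per OSD); this is not part of the original report:

```shell
# List keyring files under the given ceph state dirs (jewel-era defaults).
# On a host deployed by ceph-installer, the client.admin keyring may be
# absent from /etc/ceph, as noted above.
list_keyrings() {
    etc_dir=${1:-/etc/ceph}
    osd_dir=${2:-/var/lib/ceph/osd}
    find "$etc_dir" -maxdepth 1 -name '*.keyring' 2>/dev/null
    find "$osd_dir" -maxdepth 2 -name keyring 2>/dev/null
}
```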

The real error I'm seeing with ceph-ansible-1.0.5-42 is the same one reported in Vasu's bug description above, and in bug 1393684 as well: "HEALTH_ERR" and "no osds".

ceph-ansible-1.0.5-43 reverts the latest dmcrypt changes, so this regression should be fixed in that build. That build is in today's compose, RHSCON-2.0-RHEL-7-20161110.t.0.

Comment 9 errata-xmlrpc 2016-11-22 23:42:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:2817