Description of problem:

With the latest ceph-ansible, the OSDs are not coming up:

2016-11-09T16:17:16.086 INFO:teuthology.orchestra.run.pluto007:Running: 'sudo ceph -s'
2016-11-09T16:17:16.339 INFO:teuthology.orchestra.run.pluto007.stdout:     cluster f74830b0-c6f2-4a0f-8246-4a6e0d4362e2
2016-11-09T16:17:16.373 INFO:teuthology.orchestra.run.pluto007.stdout:      health HEALTH_ERR
2016-11-09T16:17:16.375 INFO:teuthology.orchestra.run.pluto007.stdout:             320 pgs are stuck inactive for more than 300 seconds
2016-11-09T16:17:16.415 INFO:teuthology.orchestra.run.pluto007.stdout:             320 pgs stuck inactive
2016-11-09T16:17:16.417 INFO:teuthology.orchestra.run.pluto007.stdout:             no osds
2016-11-09T16:17:16.457 INFO:teuthology.orchestra.run.pluto007.stdout:      monmap e1: 2 mons at {clara005=10.8.129.5:6789/0,pluto007=10.8.129.107:6789/0}
2016-11-09T16:17:16.459 INFO:teuthology.orchestra.run.pluto007.stdout:             election epoch 4, quorum 0,1 clara005,pluto007
2016-11-09T16:17:16.499 INFO:teuthology.orchestra.run.pluto007.stdout:       fsmap e3: 0/0/1 up
2016-11-09T16:17:16.501 INFO:teuthology.orchestra.run.pluto007.stdout:      osdmap e4: 0 osds: 0 up, 0 in
2016-11-09T16:17:16.541 INFO:teuthology.orchestra.run.pluto007.stdout:             flags sortbitwise
2016-11-09T16:17:16.543 INFO:teuthology.orchestra.run.pluto007.stdout:       pgmap v5: 320 pgs, 3 pools, 0 bytes data, 0 objects
2016-11-09T16:17:16.583 INFO:teuthology.orchestra.run.pluto007.stdout:             0 kB used, 0 kB / 0 kB avail
2016-11-09T16:17:16.584 INFO:teuthology.orchestra.run.pluto007.stdout:                  320 creating
2016-11-09T16:17:16.625 INFO:teuthology.task.ceph_ansible:Waiting for Ceph health to reach HEALTH_OK

Full logs: http://magna002.ceph.redhat.com/vasu-2016-11-09_16:04:17-smoke-jewel---basic-multi/259550/teuthology.log
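For anyone reproducing this outside teuthology, the check the ceph_ansible task performs can be approximated by polling the cluster health from a mon node. A minimal shell sketch (the retry count and sleep interval here are illustrative, not taken from the test):

    # Poll cluster health for up to ~5 minutes; in this failure it never
    # leaves HEALTH_ERR because no OSD daemon ever registers.
    for i in $(seq 1 60); do
        status=$(sudo ceph health)
        echo "attempt $i: $status"
        [ "$status" = "HEALTH_OK" ] && break
        sleep 5
    done
    # Should list the OSDs; empty output confirms none came up.
    sudo ceph osd tree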
Hi,

My observation is that this issue was not consistent across the rolling_update tests I ran yesterday.

Build used: 1.0.5-42

Thanks,
Tejas
Workaround that worked:

The cause here is that, in some cases, one of the OSD daemons on a node fails to come back up after the OSD restart. I manually rebooted the OSD node, the OSD daemon came back up, and only then did the cluster return to HEALTH_OK. It is a very destructive workaround, though.

Thanks,
Tejas
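Before falling back to a full node reboot, it may be enough to restart just the stuck OSD unit. A hedged sketch, assuming systemd-managed OSDs on RHEL 7 (OSD id 0 is only an example):

    # On the affected OSD node; OSD id 0 is only an example.
    sudo systemctl status ceph-osd@0                    # failed / inactive?
    sudo journalctl -u ceph-osd@0 --since "1 hour ago"  # why it did not start
    sudo systemctl restart ceph-osd@0                   # restart just this daemon
    # From a node that has the admin keyring, confirm the OSD is back up:
    sudo ceph osd tree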
With ceph-1.0.5-42.el7scon in the raw_multi_journal scenario (using ceph-installer), my OSDs do not come up because they cannot find a keyring.
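For what it's worth, a quick way to tell whether a failure like this is really a missing keyring is to check the expected keyring paths on the OSD node. A rough sketch, assuming the default cluster name "ceph" and stock jewel paths:

    # On the OSD node: do the expected keyrings exist?
    ls -l /var/lib/ceph/bootstrap-osd/ceph.keyring   # used to register new OSDs
    ls -l /etc/ceph/ceph.client.admin.keyring        # admin key (not pushed to OSD nodes by ceph-installer)
    # From a node that does have the admin key, check the bootstrap-osd entity:
    sudo ceph auth get client.bootstrap-osd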
(In reply to Ken Dreyer (Red Hat) from comment #5)
> With ceph-1.0.5-42.el7scon in the raw_multi_journal scenario (using
> ceph-installer), my OSDs do not come up because they cannot find a keyring.

This was a red-herring problem with my own test environment (I didn't realize ceph-installer does not actually put the client.admin keyring on the OSDs). My bad.

The real error I'm seeing with ceph-ansible-1.0.5-42 is the same one reported in Vasu's bug description above, and in bug 1393684 as well: "HEALTH_ERR" and "no osds".

ceph-ansible-1.0.5-43 reverts the latest dmcrypt changes, so this regression bug should now be fixed in that build. This is in today's compose, RHSCON-2.0-RHEL-7-20161110.t.0.
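To be sure a setup has picked up the fixed build before redeploying, checking the installed package is enough. A rough sketch; the playbook path and inventory below are the ceph-ansible RPM defaults and may differ per environment:

    # On the node driving the deployment:
    rpm -q ceph-ansible                       # expect 1.0.5-43 or later
    cd /usr/share/ceph-ansible
    ansible-playbook -i /etc/ansible/hosts site.yml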
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:2817