| Summary: | OSD fails to start when custom cluster name contains numbers | | |
|---|---|---|---|
| Product: | Red Hat Storage Console | Reporter: | Nishanth Thomas <nthomas> |
| Component: | ceph-installer | Assignee: | Andrew Schoen <aschoen> |
| Status: | CLOSED ERRATA | QA Contact: | Daniel Horák <dahorak> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 2 | CC: | adeza, aschoen, ceph-eng-bugs, dahorak, kdreyer, mkudlej, nthomas, sankarshan, sds-qe-bugs |
| Target Milestone: | --- | | |
| Target Release: | 2 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | ceph-ansible-1.0.5-32.el7scon | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-08-23 19:49:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
This is caused by using the custom cluster name 'mine123'. The method ceph-ansible uses to get the OSD IDs fails to parse a custom cluster name that includes numbers: it returns the ID '123' twice, which is the failure you are seeing. Here is what ceph-ansible does to determine the OSD IDs:

```
[root@dhcp47-41 ~]# ls /var/lib/ceph/osd
mine123-0  mine123-1
[root@dhcp47-41 ~]# ls /var/lib/ceph/osd/ | grep -oh '[0-9]*'
123
0
123
1
```

Nishanth, can you try again and either use the default cluster name or one without numbers in it? Thanks.

I made a PR upstream to address the issue of retrieving OSD IDs when the cluster name includes numbers: https://github.com/ceph/ceph-ansible/pull/750

I have tried with the default cluster name and this issue is not seen.

The fix is not correct, because it has problems with cluster names containing '-' (dash), for example:

- MyCluster-01
- my-cluster

A dash should be supported in a Ceph cluster name, because it is mentioned in the documentation[1]:

~~~~~~~~~~~~~~~~~~~~~~~~~~
For example, when you run multiple clusters in a federated architecture, the cluster name (e.g., us-west, us-east) identifies the cluster for the current CLI session.
~~~~~~~~~~~~~~~~~~~~~~~~~~

(A prefix-stripping approach that avoids both failure modes is sketched after this comment thread.)

Tested on:

USM Server/ceph-installer server (RHEL 7.2):
```
ceph-ansible-1.0.5-31.el7scon.noarch
ceph-installer-1.0.14-1.el7scon.noarch
rhscon-ceph-0.0.39-1.el7scon.x86_64
rhscon-core-0.0.39-1.el7scon.x86_64
rhscon-core-selinux-0.0.39-1.el7scon.noarch
rhscon-ui-0.0.51-1.el7scon.noarch
salt-2015.5.5-1.el7.noarch
salt-master-2015.5.5-1.el7.noarch
salt-selinux-0.0.39-1.el7scon.noarch
```

Ceph MON (RHEL 7.2):
```
calamari-server-1.4.8-1.el7cp.x86_64
ceph-base-10.2.2-32.el7cp.x86_64
ceph-common-10.2.2-32.el7cp.x86_64
ceph-mon-10.2.2-32.el7cp.x86_64
ceph-selinux-10.2.2-32.el7cp.x86_64
libcephfs1-10.2.2-32.el7cp.x86_64
python-cephfs-10.2.2-32.el7cp.x86_64
rhscon-agent-0.0.16-1.el7scon.noarch
rhscon-core-selinux-0.0.39-1.el7scon.noarch
salt-2015.5.5-1.el7.noarch
salt-minion-2015.5.5-1.el7.noarch
salt-selinux-0.0.39-1.el7scon.noarch
```

Ceph OSD (RHEL 7.2):
```
ceph-base-10.2.2-32.el7cp.x86_64
ceph-common-10.2.2-32.el7cp.x86_64
ceph-osd-10.2.2-32.el7cp.x86_64
ceph-selinux-10.2.2-32.el7cp.x86_64
libcephfs1-10.2.2-32.el7cp.x86_64
python-cephfs-10.2.2-32.el7cp.x86_64
rhscon-agent-0.0.16-1.el7scon.noarch
rhscon-core-selinux-0.0.39-1.el7scon.noarch
salt-2015.5.5-1.el7.noarch
salt-minion-2015.5.5-1.el7.noarch
salt-selinux-0.0.39-1.el7scon.noarch
```

>> moving back to ASSIGNED

[1] http://docs.ceph.com/docs/master/install/manual-deployment/

This upstream PR fixed the dashes-in-a-cluster-name issue: https://github.com/ceph/ceph-ansible/pull/816

Retested with the names mentioned in comment 8 and it works as expected.
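Both failure modes above (digits and dashes in the cluster name) come from extracting OSD IDs with a character-class pattern instead of stripping the known cluster-name prefix. The following is a minimal sketch of the prefix-stripping idea, assuming the cluster name is available in a `CLUSTER` shell variable; it is an illustration only, not the actual change merged in the PRs above:

```bash
#!/usr/bin/env bash
# Derive OSD IDs by stripping the known "<cluster>-" prefix from the
# directory names under /var/lib/ceph/osd, rather than grepping for
# digits. This works for cluster names containing digits and/or dashes.
CLUSTER="my-cluster-01"   # assumed to be known to the caller

for dir in /var/lib/ceph/osd/"${CLUSTER}"-*; do
    [ -d "$dir" ] || continue
    base="${dir##*/}"               # e.g. "my-cluster-01-0"
    osd_id="${base#"${CLUSTER}"-}"  # e.g. "0"
    echo "$osd_id"
done
```

With `CLUSTER=mine123` and the directories from the listing above (`mine123-0`, `mine123-1`), this yields `0` and `1` rather than the spurious `123` entries, and a name like `my-cluster` no longer confuses the extraction because the dash is part of the stripped prefix rather than part of a matched pattern.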
Tested on:

USM Server/ceph-installer server (RHEL 7.2):
```
ceph-ansible-1.0.5-32.el7scon.noarch
ceph-installer-1.0.14-1.el7scon.noarch
rhscon-ceph-0.0.39-1.el7scon.x86_64
rhscon-core-0.0.39-1.el7scon.x86_64
rhscon-core-selinux-0.0.39-1.el7scon.noarch
rhscon-ui-0.0.51-1.el7scon.noarch
salt-2015.5.5-1.el7.noarch
salt-master-2015.5.5-1.el7.noarch
salt-selinux-0.0.39-1.el7scon.noarch
```

Ceph MON (RHEL 7.2):
```
calamari-server-1.4.8-1.el7cp.x86_64
ceph-base-10.2.2-33.el7cp.x86_64
ceph-common-10.2.2-33.el7cp.x86_64
ceph-mon-10.2.2-33.el7cp.x86_64
ceph-selinux-10.2.2-33.el7cp.x86_64
libcephfs1-10.2.2-33.el7cp.x86_64
python-cephfs-10.2.2-33.el7cp.x86_64
rhscon-agent-0.0.16-1.el7scon.noarch
rhscon-core-selinux-0.0.39-1.el7scon.noarch
salt-2015.5.5-1.el7.noarch
salt-minion-2015.5.5-1.el7.noarch
salt-selinux-0.0.39-1.el7scon.noarch
```

Ceph OSD (RHEL 7.2):
```
ceph-base-10.2.2-33.el7cp.x86_64
ceph-common-10.2.2-33.el7cp.x86_64
ceph-osd-10.2.2-33.el7cp.x86_64
ceph-selinux-10.2.2-33.el7cp.x86_64
libcephfs1-10.2.2-33.el7cp.x86_64
python-cephfs-10.2.2-33.el7cp.x86_64
rhscon-agent-0.0.16-1.el7scon.noarch
rhscon-core-selinux-0.0.39-1.el7scon.noarch
salt-2015.5.5-1.el7.noarch
salt-minion-2015.5.5-1.el7.noarch
salt-selinux-0.0.39-1.el7scon.noarch
```

>> VERIFIED

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2016:1754
Hitting this issue while creating the OSDs. It is particularly seen when more than one OSD is created per host. Can you please have a look? It looks like the OSDs are getting created, but the task returns a failure.

Error:

```
TASK: [ceph-osd | start and add that the osd service(s) to the init sequence (for or after infernalis)] ***
ok: [dhcp47-41.lab.eng.blr.redhat.com] => (item=123) => {"changed": false, "enabled": true, "item": "123", "name": "ceph-osd@123", "state": "started"}
ok: [dhcp47-41.lab.eng.blr.redhat.com] => (item=0) => {"changed": false, "enabled": true, "item": "0", "name": "ceph-osd@0", "state": "started"}
failed: [dhcp47-41.lab.eng.blr.redhat.com] => (item=123) => {"changed": false, "failed": true, "item": "123"}
msg: Job for ceph-osd failed because start of the service was attempted too often. See "systemctl status ceph-osd" and "journalctl -xe" for details. To force a start use "systemctl reset-failed ceph-osd" followed by "systemctl start ceph-osd" again.
```
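For manual recovery on an affected host, the log's own suggestion can be applied to the systemd template units. A minimal sketch, assuming the unit names shown in the task output above (`ceph-osd@123` being the bogus unit produced by the mis-parsed ID); this is a workaround for the stuck units, not the fix for the parsing bug itself:

```bash
# Clear the failed state of the bogus unit created from the mis-parsed
# ID, then make sure the real OSD units are running. Unit names are
# taken from the task output above; adjust the IDs for your host.
systemctl reset-failed ceph-osd@123.service
systemctl start ceph-osd@0.service
systemctl start ceph-osd@1.service
```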