Bug 1332676 - OSD fails to start when custom cluster name contains numbers
Summary: OSD fails to start when custom cluster name contains numbers
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Storage Console
Classification: Red Hat Storage
Component: ceph-installer
Version: 2
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 2
Assignee: Andrew Schoen
QA Contact: Daniel Horák
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-05-03 18:19 UTC by Nishanth Thomas
Modified: 2016-08-23 19:49 UTC
CC: 9 users

Fixed In Version: ceph-ansible-1.0.5-32.el7scon
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-23 19:49:43 UTC
Embargoed:




Links
  System ID: Red Hat Product Errata RHEA-2016:1754
  Priority: normal
  Status: SHIPPED_LIVE
  Summary: New packages: Red Hat Storage Console 2.0
  Last Updated: 2017-04-18 19:09:06 UTC

Description Nishanth Thomas 2016-05-03 18:19:02 UTC
Hitting this issue while creating the OSDs. It is particularly seen when more than one OSD is created per host. Can you please have a look? The OSDs appear to get created, but the task returns a failure.

Error:

TASK: [ceph-osd | start and add that the osd service(s) to the init sequence (for or after infernalis)] ***
ok: [dhcp47-41.lab.eng.blr.redhat.com] => (item=123) => {"changed": false, "enabled": true, "item": "123", "name": "ceph-osd@123", "state": "started"}
ok: [dhcp47-41.lab.eng.blr.redhat.com] => (item=0) => {"changed": false, "enabled": true, "item": "0", "name": "ceph-osd@0", "state": "started"}
failed: [dhcp47-41.lab.eng.blr.redhat.com] => (item=123) => {"changed": false, "failed": true, "item": "123"}
msg: Job for ceph-osd failed because start of the service was attempted too often. See "systemctl status ceph-osd" and "journalctl -xe" for details.
To force a start use "systemctl reset-failed ceph-osd" followed by "systemctl start ceph-osd" again.

Comment 2 Andrew Schoen 2016-05-03 18:29:12 UTC
This is caused by using the custom cluster name 'mine123'. The method ceph-ansible uses to get the OSD IDs fails to parse a custom cluster name that includes numbers. It returns the ID '123' twice, which is the failure you're seeing.

Here's what ceph-ansible is doing to determine the OSD IDs:

[root@dhcp47-41 ~]# ls /var/lib/ceph/osd
mine123-0  mine123-1
[root@dhcp47-41 ~]# ls /var/lib/ceph/osd/ |grep -oh '[0-9]*'
123
0
123
1
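For illustration, anchoring the match to the end of each directory name avoids picking digits out of the cluster name itself. This is only a sketch of the idea, not necessarily the change made in the upstream PR:

$ ls /var/lib/ceph/osd/ | grep -o '[0-9]\+$'
0
1

With the mine123-0 and mine123-1 directories above, this yields only the trailing OSD IDs.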

Nishanth, can you try again and either use the default cluster name or one without numbers in it? Thanks.

Comment 3 Andrew Schoen 2016-05-03 20:16:14 UTC
I made a PR upstream to address the issue of retrieving OSD IDs when the cluster name includes numbers: https://github.com/ceph/ceph-ansible/pull/750

Comment 4 Nishanth Thomas 2016-05-04 14:46:47 UTC
I tried with the default cluster name and the issue is not seen.

Comment 8 Daniel Horák 2016-08-02 14:08:44 UTC
The fix is not correct, because it has problems with cluster names containing '-' (dash), for example:
  MyCluster-01 
  my-cluster

Dashes should be supported in Ceph cluster names, because they are mentioned in the documentation[1]:
  ~~~~~~~~~~~~~~~~~~~~~~~~~~
  For example, when you run multiple clusters in a federated architecture, the cluster name (e.g., us-west, us-east) identifies the cluster for the current CLI session.
  ~~~~~~~~~~~~~~~~~~~~~~~~~~

Tested on:
  USM Server/ceph-installer server (RHEL 7.2):
  ceph-ansible-1.0.5-31.el7scon.noarch
  ceph-installer-1.0.14-1.el7scon.noarch
  rhscon-ceph-0.0.39-1.el7scon.x86_64
  rhscon-core-0.0.39-1.el7scon.x86_64
  rhscon-core-selinux-0.0.39-1.el7scon.noarch
  rhscon-ui-0.0.51-1.el7scon.noarch
  salt-2015.5.5-1.el7.noarch
  salt-master-2015.5.5-1.el7.noarch
  salt-selinux-0.0.39-1.el7scon.noarch

  Ceph MON (RHEL 7.2):
  calamari-server-1.4.8-1.el7cp.x86_64
  ceph-base-10.2.2-32.el7cp.x86_64
  ceph-common-10.2.2-32.el7cp.x86_64
  ceph-mon-10.2.2-32.el7cp.x86_64
  ceph-selinux-10.2.2-32.el7cp.x86_64
  libcephfs1-10.2.2-32.el7cp.x86_64
  python-cephfs-10.2.2-32.el7cp.x86_64
  rhscon-agent-0.0.16-1.el7scon.noarch
  rhscon-core-selinux-0.0.39-1.el7scon.noarch
  salt-2015.5.5-1.el7.noarch
  salt-minion-2015.5.5-1.el7.noarch
  salt-selinux-0.0.39-1.el7scon.noarch

  Ceph OSD (RHEL 7.2):
  ceph-base-10.2.2-32.el7cp.x86_64
  ceph-common-10.2.2-32.el7cp.x86_64
  ceph-osd-10.2.2-32.el7cp.x86_64
  ceph-selinux-10.2.2-32.el7cp.x86_64
  libcephfs1-10.2.2-32.el7cp.x86_64
  python-cephfs-10.2.2-32.el7cp.x86_64
  rhscon-agent-0.0.16-1.el7scon.noarch
  rhscon-core-selinux-0.0.39-1.el7scon.noarch
  salt-2015.5.5-1.el7.noarch
  salt-minion-2015.5.5-1.el7.noarch
  salt-selinux-0.0.39-1.el7scon.noarch

>> moving back to ASSIGNED

[1] http://docs.ceph.com/docs/master/install/manual-deployment/

Comment 9 Andrew Schoen 2016-08-02 14:37:25 UTC
This upstream PR fixed the issue with dashes in a cluster name:

https://github.com/ceph/ceph-ansible/pull/816
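
For reference, one cluster-name-agnostic way to collect the IDs is to strip everything up to and including the last '-' in each directory name, which tolerates both digits and dashes in the name. This is only a sketch of the approach, not necessarily the exact change in the PR above:

$ ls /var/lib/ceph/osd/ | sed 's/.*-//'
0
1

Because sed's '.*' is greedy, a directory such as MyCluster-01-0 still resolves to the OSD ID 0.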

Comment 13 Daniel Horák 2016-08-03 13:25:39 UTC
Retested with the names mentioned in comment 8 and it works as expected.

Tested on:
  USM Server/ceph-installer server (RHEL 7.2):
  ceph-ansible-1.0.5-32.el7scon.noarch
  ceph-installer-1.0.14-1.el7scon.noarch
  rhscon-ceph-0.0.39-1.el7scon.x86_64
  rhscon-core-0.0.39-1.el7scon.x86_64
  rhscon-core-selinux-0.0.39-1.el7scon.noarch
  rhscon-ui-0.0.51-1.el7scon.noarch
  salt-2015.5.5-1.el7.noarch
  salt-master-2015.5.5-1.el7.noarch
  salt-selinux-0.0.39-1.el7scon.noarch

  Ceph MON (RHEL 7.2):
  calamari-server-1.4.8-1.el7cp.x86_64
  ceph-base-10.2.2-33.el7cp.x86_64
  ceph-common-10.2.2-33.el7cp.x86_64
  ceph-mon-10.2.2-33.el7cp.x86_64
  ceph-selinux-10.2.2-33.el7cp.x86_64
  libcephfs1-10.2.2-33.el7cp.x86_64
  python-cephfs-10.2.2-33.el7cp.x86_64
  rhscon-agent-0.0.16-1.el7scon.noarch
  rhscon-core-selinux-0.0.39-1.el7scon.noarch
  salt-2015.5.5-1.el7.noarch
  salt-minion-2015.5.5-1.el7.noarch
  salt-selinux-0.0.39-1.el7scon.noarch

  Ceph OSD (RHEL 7.2):
  ceph-base-10.2.2-33.el7cp.x86_64
  ceph-common-10.2.2-33.el7cp.x86_64
  ceph-osd-10.2.2-33.el7cp.x86_64
  ceph-selinux-10.2.2-33.el7cp.x86_64
  libcephfs1-10.2.2-33.el7cp.x86_64
  python-cephfs-10.2.2-33.el7cp.x86_64
  rhscon-agent-0.0.16-1.el7scon.noarch
  rhscon-core-selinux-0.0.39-1.el7scon.noarch
  salt-2015.5.5-1.el7.noarch
  salt-minion-2015.5.5-1.el7.noarch
  salt-selinux-0.0.39-1.el7scon.noarch

>> VERIFIED

Comment 15 errata-xmlrpc 2016-08-23 19:49:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2016:1754

