Hitting this issue while creating OSDs, particularly when more than one OSD is created per host. Can you please have a look? The OSDs appear to be created, but the task returns a failure:

TASK: [ceph-osd | start and add that the osd service(s) to the init sequence (for or after infernalis)] ***
ok: [dhcp47-41.lab.eng.blr.redhat.com] => (item=123) => {"changed": false, "enabled": true, "item": "123", "name": "ceph-osd@123", "state": "started"}
ok: [dhcp47-41.lab.eng.blr.redhat.com] => (item=0) => {"changed": false, "enabled": true, "item": "0", "name": "ceph-osd@0", "state": "started"}
failed: [dhcp47-41.lab.eng.blr.redhat.com] => (item=123) => {"changed": false, "failed": true, "item": "123"}
msg: Job for ceph-osd failed because start of the service was attempted too often. See "systemctl status ceph-osd" and "journalctl -xe" for details. To force a start use "systemctl reset-failed ceph-osd" followed by "systemctl start ceph-osd" again.
This cluster was created with the custom cluster name 'mine123'. The method ceph-ansible uses to get the OSD IDs fails to parse a custom cluster name that includes numbers: it returns the ID '123' twice, which is the failure you're seeing. Here is what ceph-ansible does to determine the OSD IDs:

[root@dhcp47-41 ~]# ls /var/lib/ceph/osd
mine123-0  mine123-1
[root@dhcp47-41 ~]# ls /var/lib/ceph/osd/ | grep -oh '[0-9]*'
123
0
123
1

Nishanth, can you try again with either the default cluster name or one without numbers in it? Thanks.
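The failure mode can be reproduced without a cluster, since the ID extraction is just a grep over the directory names. A minimal sketch follows; the anchored-grep alternative at the end is an illustration of one possible fix, not the actual upstream patch:

```shell
# OSD data directories are named "<cluster>-<id>". With the default cluster
# name ("ceph") the naive grep returns only the IDs, but a cluster name
# containing digits (e.g. "mine123") makes every digit run match.
printf '%s\n' mine123-0 mine123-1 | grep -oh '[0-9]*'
# prints 123, 0, 123, 1 (one per line) -- the bogus "123" entries

# A more robust sketch (assumption, not the upstream change): anchor the
# match to the digits at the end of the directory name.
printf '%s\n' mine123-0 mine123-1 | grep -oE '[0-9]+$'
# prints 0, 1
```

The duplicated "123" entries are exactly what makes Ansible try to start ceph-osd@123 twice, triggering the systemd start-limit error above.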
I made a PR upstream to address the issue of retrieving OSD IDs when the cluster name includes numbers: https://github.com/ceph/ceph-ansible/pull/750
I have tried with the default cluster name and this issue is not seen.
The fix is not correct, because it has problems with cluster names containing '-' (dash), for example:

MyCluster-01
my-cluster

Dashes should be supported in Ceph cluster names, because they are mentioned in the documentation [1]:

~~~~~~~~~~~~~~~~~~~~~~~~~~
For example, when you run multiple clusters in a federated architecture, the cluster name (e.g., us-west, us-east) identifies the cluster for the current CLI session.
~~~~~~~~~~~~~~~~~~~~~~~~~~

Tested on:

USM Server/ceph-installer server (RHEL 7.2):
ceph-ansible-1.0.5-31.el7scon.noarch
ceph-installer-1.0.14-1.el7scon.noarch
rhscon-ceph-0.0.39-1.el7scon.x86_64
rhscon-core-0.0.39-1.el7scon.x86_64
rhscon-core-selinux-0.0.39-1.el7scon.noarch
rhscon-ui-0.0.51-1.el7scon.noarch
salt-2015.5.5-1.el7.noarch
salt-master-2015.5.5-1.el7.noarch
salt-selinux-0.0.39-1.el7scon.noarch

Ceph MON (RHEL 7.2):
calamari-server-1.4.8-1.el7cp.x86_64
ceph-base-10.2.2-32.el7cp.x86_64
ceph-common-10.2.2-32.el7cp.x86_64
ceph-mon-10.2.2-32.el7cp.x86_64
ceph-selinux-10.2.2-32.el7cp.x86_64
libcephfs1-10.2.2-32.el7cp.x86_64
python-cephfs-10.2.2-32.el7cp.x86_64
rhscon-agent-0.0.16-1.el7scon.noarch
rhscon-core-selinux-0.0.39-1.el7scon.noarch
salt-2015.5.5-1.el7.noarch
salt-minion-2015.5.5-1.el7.noarch
salt-selinux-0.0.39-1.el7scon.noarch

Ceph OSD (RHEL 7.2):
ceph-base-10.2.2-32.el7cp.x86_64
ceph-common-10.2.2-32.el7cp.x86_64
ceph-osd-10.2.2-32.el7cp.x86_64
ceph-selinux-10.2.2-32.el7cp.x86_64
libcephfs1-10.2.2-32.el7cp.x86_64
python-cephfs-10.2.2-32.el7cp.x86_64
rhscon-agent-0.0.16-1.el7scon.noarch
rhscon-core-selinux-0.0.39-1.el7scon.noarch
salt-2015.5.5-1.el7.noarch
salt-minion-2015.5.5-1.el7.noarch
salt-selinux-0.0.39-1.el7scon.noarch

>> moving back to ASSIGNED

[1] http://docs.ceph.com/docs/master/install/manual-deployment/
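Since the OSD directory name always ends in "-<id>", stripping everything up to the last dash extracts the ID regardless of digits or dashes in the cluster name. The sed one-liner below is a sketch of that idea under this naming assumption, not necessarily the exact change that landed upstream:

```shell
# The OSD directory is named "<cluster>-<id>", so deleting everything up to
# the last dash leaves only the ID. This tolerates both digits and dashes in
# the cluster name (a sketch, not necessarily the exact upstream change).
printf '%s\n' MyCluster-01-0 my-cluster-1 mine123-2 | sed 's/.*-//'
# prints 0, 1, 2 (one per line)
```

Because sed's `.*` is greedy, it consumes through the last dash, so "MyCluster-01-0" yields "0" rather than "01-0".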
This upstream PR fixed the issue with dashes in a cluster name: https://github.com/ceph/ceph-ansible/pull/816
Retested with the names mentioned in comment 8 and it works as expected.

Tested on:

USM Server/ceph-installer server (RHEL 7.2):
ceph-ansible-1.0.5-32.el7scon.noarch
ceph-installer-1.0.14-1.el7scon.noarch
rhscon-ceph-0.0.39-1.el7scon.x86_64
rhscon-core-0.0.39-1.el7scon.x86_64
rhscon-core-selinux-0.0.39-1.el7scon.noarch
rhscon-ui-0.0.51-1.el7scon.noarch
salt-2015.5.5-1.el7.noarch
salt-master-2015.5.5-1.el7.noarch
salt-selinux-0.0.39-1.el7scon.noarch

Ceph MON (RHEL 7.2):
calamari-server-1.4.8-1.el7cp.x86_64
ceph-base-10.2.2-33.el7cp.x86_64
ceph-common-10.2.2-33.el7cp.x86_64
ceph-mon-10.2.2-33.el7cp.x86_64
ceph-selinux-10.2.2-33.el7cp.x86_64
libcephfs1-10.2.2-33.el7cp.x86_64
python-cephfs-10.2.2-33.el7cp.x86_64
rhscon-agent-0.0.16-1.el7scon.noarch
rhscon-core-selinux-0.0.39-1.el7scon.noarch
salt-2015.5.5-1.el7.noarch
salt-minion-2015.5.5-1.el7.noarch
salt-selinux-0.0.39-1.el7scon.noarch

Ceph OSD (RHEL 7.2):
ceph-base-10.2.2-33.el7cp.x86_64
ceph-common-10.2.2-33.el7cp.x86_64
ceph-osd-10.2.2-33.el7cp.x86_64
ceph-selinux-10.2.2-33.el7cp.x86_64
libcephfs1-10.2.2-33.el7cp.x86_64
python-cephfs-10.2.2-33.el7cp.x86_64
rhscon-agent-0.0.16-1.el7scon.noarch
rhscon-core-selinux-0.0.39-1.el7scon.noarch
salt-2015.5.5-1.el7.noarch
salt-minion-2015.5.5-1.el7.noarch
salt-selinux-0.0.39-1.el7scon.noarch

>> VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2016:1754