Bug 1320574 - [ceph-ansible] : unable to create cluster using ceph-ansible - osds are not activated
Summary: [ceph-ansible] : unable to create cluster using ceph-ansible - osds are not activated
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Storage Console
Classification: Red Hat Storage
Component: ceph-installer
Version: 2
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 2
Assignee: Alfredo Deza
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Duplicates: 1320547
Depends On:
Blocks:
 
Reported: 2016-03-23 14:14 UTC by Rachana Patel
Modified: 2016-05-10 06:21 UTC (History)
16 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-05 18:36:31 UTC
Embargoed:


Attachments
output of ceph-ansible installation (164.33 KB, text/plain), 2016-03-23 14:16 UTC, Rachana Patel
Command_Log (278.07 KB, text/plain), 2016-03-31 10:34 UTC, Tanay Ganguly
osd add error [shubhendu] (102.04 KB, text/plain), 2016-04-01 12:56 UTC, Shubhendu Tripathi


Links
Red Hat Bugzilla 1322551 (CLOSED): OSDs on 7.1 are not starting after ceph-deploy installation. Last updated 2022-02-21 18:03:29 UTC

Internal Links: 1322551

Description Rachana Patel 2016-03-23 14:14:51 UTC
Description of problem:
=======================
Used ceph-ansible to create a Ceph cluster. The installation completed without any errors, but the OSDs are not activated on any node.

$ sudo ceph osd tree
ID WEIGHT TYPE NAME    UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1      0 root default  

Version-Release number of selected component (if applicable):
=============================================================
ceph-ansible-1.0.2-1.el7.noarch



How reproducible:
================
always


Steps to Reproduce:
===================
1. Install ceph-ansible 'ceph-ansible-1.0.2-1.el7.noarch' on the installer node.

2. Change the parameters below in /usr/share/ceph-ansible/group_vars/all:
ceph_stable_rh_storage: true
ceph_stable_rh_storage_cdn_install: true
monitor_interface: eno1
monitor_secret: AQA7P8dWAAAAABAAH/tbiZQn/40Z8pr959UmEA==
journal_size: 10240
public_network: 

3. Change the parameters below in /usr/share/ceph-ansible/group_vars/osds:
crush_location: false
osd_crush_location: "'root={{ ceph_crush_root }} rack={{ ceph_crush_rack }} host={{ ansible_hostname }}'"

devices:
  - /dev/sdb
  - /dev/sdc
 
journal_collocation: true
 
4. Add the hosts to the 'hosts' file (1 MON and 3 OSD nodes).

5. Execute ansible-playbook site.yml. Once it completes successfully, check the cluster status (a command sketch follows below).
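
For step 5, a minimal command sketch (assuming the default /usr/share/ceph-ansible checkout and an inventory file named 'hosts' in that directory; adjust paths to your setup):

cd /usr/share/ceph-ansible
ansible-playbook site.yml -i hosts
# then, on the monitor node:
sudo ceph -s
sudo ceph osd tree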


Actual results:
===============
[root@magna074 ceph-ansible]# sudo ceph -s
    cluster ea686d5b-9724-400c-9d17-dc135d0ee648
     health HEALTH_ERR
            64 pgs are stuck inactive for more than 300 seconds
            64 pgs stuck inactive
     monmap e1: 1 mons at {magna074=10.8.128.74:6789/0}
            election epoch 4, quorum 0 magna074
     osdmap e4: 6 osds: 0 up, 0 in
            flags sortbitwise
      pgmap v5: 64 pgs, 1 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                  64 creating

[root@magna074 ceph-ansible]# ceph osd tree

ID WEIGHT TYPE NAME    UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1      0 root default  



Expected results:
=================
The Ceph cluster should be up and running and all 6 OSDs should be up and in, with PGs active+clean.


Additional info:
================

On one of the OSD nodes:

During installation, the OSD data partition could be seen mounted:

[ubuntu@magna067 ~]$ sudo df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       917G  3.0G  868G   1% /
devtmpfs         16G     0   16G   0% /dev
tmpfs            16G     0   16G   0% /dev/shm
tmpfs            16G   25M   16G   1% /run
tmpfs            16G     0   16G   0% /sys/fs/cgroup
tmpfs           3.2G     0  3.2G   0% /run/user/0
tmpfs           3.2G     0  3.2G   0% /run/user/1000
/dev/sdb1       922G   33M  922G   1% /var/lib/ceph/tmp/mnt.fRDt8U

but once the installation completed, the mount was gone:

[ubuntu@magna067 ~]$ sudo df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       917G  3.0G  868G   1% /
devtmpfs         16G     0   16G   0% /dev
tmpfs            16G     0   16G   0% /dev/shm
tmpfs            16G   25M   16G   1% /run
tmpfs            16G     0   16G   0% /sys/fs/cgroup
tmpfs           3.2G     0  3.2G   0% /run/user/1000

Comment 2 Rachana Patel 2016-03-23 14:16:50 UTC
Created attachment 1139601 [details]
output of ceph-ansible installation

Comment 3 Andrew Schoen 2016-03-24 19:44:20 UTC
What are you setting ``public_network`` to? I've had OSDs fail to start because of an incorrect public network. Trying to start the OSD manually will sometimes tell you it cannot find the cluster at <public_network>.
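
A minimal sketch of that check, run on an OSD node (assuming systemd units and OSD id 0; adjust the id and log file name to your node):

# confirm the node has an address inside the public_network you configured
ip -4 addr show
# start one OSD by hand and check its log for cluster/network errors
sudo systemctl start ceph-osd@0
sudo tail -n 50 /var/log/ceph/ceph-osd.0.log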

Comment 7 Warren 2016-03-30 17:29:22 UTC
I have had problems on 7.1 where OSDs do not start after installing ceph with just straight ceph-deploy commands.  This problem may not be with ceph-ansible but somewhere else.

Comment 10 Tanay Ganguly 2016-03-31 10:33:43 UTC
I am hitting the original Bug.
No OSD is showing up after the installation.
After checking all the nodes, I see the packages are all installed but none of the ceph-osd processes started; starting them manually also fails.

sudo ceph -s
    cluster ac811885-ead4-43a3-a509-ff0724718399
     health HEALTH_ERR
            64 pgs are stuck inactive for more than 300 seconds
            64 pgs stuck inactive
            no osds
     monmap e1: 1 mons at {cephqe3=10.70.44.40:6789/0}
            election epoch 4, quorum 0 cephqe3
     osdmap e1: 0 osds: 0 up, 0 in
            flags sortbitwise
      pgmap v2: 64 pgs, 1 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                  64 creating

ceph osd tree
ID WEIGHT TYPE NAME    UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1      0 root default    

I followed the document:
https://docs.google.com/document/d/1GzcpiciMLdNzZ46BLjVivKBVIhb1VDvzwZXAqX4Or9s/edit

I couldn't capture the log the first time, so I re-ran the playbook and collected the log.

Comment 11 Tanay Ganguly 2016-03-31 10:34:08 UTC
Created attachment 1142143 [details]
Command_Log

Comment 12 Alfredo Deza 2016-03-31 11:25:15 UTC
It looks like you are all hitting different issues.

@Tanay, I see IPs in your output, not hosts. Ceph monitors usually talk to each other via short hostnames. Maybe your ansible hosts file has IPs only? The nodes need to be able to resolve each other by hostname too.
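
A quick way to check this from an OSD node (a sketch only; assuming the monitor short name is cephqe3 as in comment 10 and that /etc/hosts is used rather than DNS):

getent hosts cephqe3
ping -c1 cephqe3
# if resolution fails, add the monitor to /etc/hosts on every node, e.g.:
# 10.70.44.40  cephqe3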

Comment 13 Tejas 2016-04-01 06:25:04 UTC
*** Bug 1320547 has been marked as a duplicate of this bug. ***

Comment 15 Shubhendu Tripathi 2016-04-01 12:55:06 UTC
I have hit the same issue twice since yesterday, with both the ceph-10.1.0-1.el7cp.x86_64 and ceph-10.0.4-2.el7cp.x86_64 builds.


The task status shows completed successfully as below

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--> started: 2016-04-01 17:50:11.012934
--> exit_code: 0
--> ended: 2016-04-01 17:57:44.150559
--> command: /bin/ansible-playbook -v -u ceph-installer /usr/share/ceph-ansible/osd-configure.yml -i /tmp/913ad7e2-2f3d-40c8-a613-cff4fe9c1fe9_KFsmce --extra-vars {"raw_journal_devices": ["/dev/vde"], "ceph_stable": true, "devices": ["/dev/vdd"], "public_network": "10.70.44.0/22", "fetch_directory": "/var/lib/ceph-installer/fetch", "cluster_network": "10.70.44.0/22", "raw_multi_journal": true, "fsid": "deedcb4c-a67a-4997-93a6-92149ad2622a", "journal_size": 1024} --skip-tags package-install
--> stderr: 
--> identifier: 913ad7e2-2f3d-40c8-a613-cff4fe9c1fe9
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

But nothing is listed under "ceph -s":

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ceph -s
    cluster deedcb4c-a67a-4997-93a6-92149ad2622a
     health HEALTH_ERR
            64 pgs are stuck inactive for more than 300 seconds
            64 pgs stuck inactive
            no osds
     monmap e1: 1 mons at {dhcp46-204=10.70.46.204:6789/0}
            election epoch 3, quorum 0 dhcp46-204
     osdmap e1: 0 osds: 0 up, 0 in
            flags sortbitwise
      pgmap v2: 64 pgs, 1 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                  64 creating
~~~~~~~~~~~~~~~~~~~~~~~~

Attached the task output.

Comment 16 Shubhendu Tripathi 2016-04-01 12:56:20 UTC
Created attachment 1142564 [details]
osd add error [shubhendu]

Comment 17 Alfredo Deza 2016-04-01 14:11:56 UTC
@Shubhendu, it looks like you might be hitting firewall issues. 

It looks like the OSD node is not able to communicate with the MON.
Ensure that the OSD can reach the monitor on the standard port and that
the monitor's hostname can be resolved from the OSD. This can be inferred
from this line in the output you attached:

2016-04-01 17:54:29.087330 7fe65410a700  0 -- :/3770174610 >> 0.0.0.0:6789/0 pipe(0x7fe644004300 sd=8 :0 s=1 pgs=0 cs=0 l=1 c=0x7fe644001990).fault

This section might be helpful:

http://docs.ceph.com/docs/master/start/quick-start-preflight/#open-required-ports

The quick-start preflight is a good source of information to ensure
things are working properly and to understand any caveats.

The troubleshooting guide is good as well:

http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/

To understand where/why ceph-installer is failing, we need not only
the log output but also some troubleshooting on the nodes that failed.
The troubleshooting guide is a start.
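
A hedged sketch of the firewalld commands from that preflight section, using the monitor address from your attached output (10.70.46.204); adjust the zone if you are not using 'public':

# on the monitor node
sudo firewall-cmd --zone=public --add-port=6789/tcp --permanent
# on the OSD nodes
sudo firewall-cmd --zone=public --add-port=6800-7300/tcp --permanent
sudo firewall-cmd --reload
# then, from an OSD node, confirm the monitor port is reachable
timeout 5 bash -c '</dev/tcp/10.70.46.204/6789' && echo reachable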

Comment 21 Vasu Kulkarni 2016-04-05 18:13:25 UTC
Hemant,

Do you have magna nodes to try? Let's sync up and try this once on the magna nodes.

Comment 22 Vasu Kulkarni 2016-04-05 21:33:59 UTC
I just tried this on 2 nodes using the ceph-ansible packages and it all works for me.

I also see you missed some config in group_vars/osds:

cat group_vars/osds
crush_location: false
osd_crush_location: "'root={{ ceph_crush_root }} rack={{ ceph_crush_rack }} host={{ ansible_hostname }}'"

zap_devices: true
devices:
  - /dev/sdb
  - /dev/sdc
  - /dev/sdd
journal_collocation: true

Comment 23 Christina Meno 2016-04-06 17:02:44 UTC
It is my understanding that the storage console team also faces this issue.
Nishanth, would you please provide additional details about the failure?

Comment 24 Nishanth Thomas 2016-04-07 17:24:33 UTC
As discussed, here is the setup where this issue can be reproduced

Here are the nodes:

dhcp46-146.lab.eng.blr.redhat.com(10.70.46.146) - Ceph Installer
dhcp46-153.lab.eng.blr.redhat.com(10.70.46.153) - OSD
dhcp46-156.lab.eng.blr.redhat.com(10.70.46.156) - MON

All the nodes are bootstrapped and installation is done (with ceph-installer).
Please have a look and let me know if something is missing here.

Comment 26 Nishanth Thomas 2016-04-07 20:10:51 UTC
Andrew,

I haven't run the mon/configure commands on these nodes. Gregory asked me to provide a setup where this can be reproduced. If you want me to run the mon/osd commands, I can do that. Otherwise, you are free to run those commands to reproduce the issue. Please let me know.

Comment 27 Nishanth Thomas 2016-04-07 20:26:05 UTC
Below are the commands to reproduce the issue:

curl -d "{\"calamari\": true, \"host\": \"dhcp46-156.lab.eng.blr.redhat.com\", \"fsid\": \"deedcb4c-a67a-4997-93a6-92149ad2622a\", \"interface\": \"eth0\", \"monitor_secret\": \"AQA7P8dWAAAAABAAH/tbiZQn/40Z8pr959UmEA==\", \"cluster_network\": \"10.70.44.0/22\", \"public_network\": \"10.70.44.0/22\", \"redhat_storage\": false}" http://dhcp46-146.lab.eng.blr.redhat.com:8181/api/mon/configure/


curl -d "{\"devices\": {\"/dev/vdb\" : \"/dev/vdc\"}, \"journal_size\": 1024, \"fsid\": \"deedcb4c-a67a-4997-93a6-92149ad2622a\", \"host\": \"dhcp46-153.lab.eng.blr.redhat.com\", \"cluster_network\": \"10.70.44.0/22\", \"public_network\": \"10.70.44.0/22\", \"monitors\": [{\"host\": \"dhcp46-156.lab.eng.blr.redhat.com\", \"interface\": \"eth0\"}]}" http://dhcp46-146.lab.eng.blr.redhat.com:8181/api/osd/configure/

Comment 28 Andrew Schoen 2016-04-07 21:42:02 UTC
Nishanth,

I was able to reproduce the issue. When looking at ceph.conf on the OSD node I see the following:

[mon.dhcp46-156]
host = dhcp46-156
mon addr = 0.0.0.0

The mon addr is incorrect here, which is why the OSD is not joining the cluster. The ceph.conf on the MON node is correct, however.

I tried using ``address`` instead of ``interface`` for the ``monitors`` parameter, but received back this validation error, which should not have happened.

{"message": "-> monitors -> [{u'host': u'dhcp46-156.lab.eng.blr.redhat.com', u'address': u'10.70.46.156'}] failed validation, requires format: [{'host': 'mon1.host', 'interface': 'eth1'},{'host': 'mon2.host', 'interface': 'enp0s8'}]"}

Comment 29 Andrew Schoen 2016-04-08 19:47:16 UTC
There was a bug in ceph-ansible that was causing the incorrect mon addr setting on the OSD node.

This PR addresses that: https://github.com/ceph/ceph-ansible/pull/701

Comment 30 Nishanth Thomas 2016-04-09 13:59:17 UTC
Is this fix available in the latest build?

Comment 31 Christina Meno 2016-04-11 22:02:47 UTC
Nishanth, this fix is not downstream yet.

Comment 33 Tamil 2016-04-13 16:15:00 UTC
Nishanth,

Could you let me know what the value of 'mon addr' is in ceph.conf on your OSD node?

Thanks,
Andrew

Comment 34 Nishanth Thomas 2016-04-13 17:01:32 UTC
[mon.dhcp46-139]
host = dhcp46-139
mon addr = 10.70.46.139

Comment 35 Andrew Schoen 2016-04-13 18:41:06 UTC
I've fixed another bug related to the use of 'address' instead of 'interface' in ceph-ansible.

https://github.com/ceph/ceph-ansible/pull/712

Comment 41 Alfredo Deza 2016-05-03 19:57:25 UTC
It seems like there is a pretty severe clock skew. Again, I am following the troubleshooting guide for mons: http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/#most-common-monitor-issues

    $ ceph health detail
    HEALTH_ERR clock skew detected on mon.cephqe8, mon.cephqe9; 64 pgs are stuck inactive for more than 300 seconds; 64 pgs stuck inactive; no osds; Monitor clock skew detected
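
A hedged sketch for checking and correcting the skew on each monitor node (assuming ntpd; use the chrony equivalents if that is what the nodes run, and substitute a real NTP server for the placeholder):

ceph health detail | grep skew
ntpq -p
sudo systemctl stop ntpd
sudo ntpdate <your-ntp-server>    # placeholder, use your site's NTP server
sudo systemctl start ntpd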

Comment 42 Tanay Ganguly 2016-05-04 07:16:09 UTC
Thanks Rachana, the information you provided is correct.

Alfredo,
cephqe3 is the installer node, which I already mentioned in comment 38.

Ansible Installer: cephqe3


I just ran "ansible-playbook site.yml" from this directory.


I am following this document for installation
https://access.qa.redhat.com/documentation/en/red-hat-ceph-storage/version-2/installation-guide-for-red-hat-enterprise-linux/


I am re-trying again after purging my cluster.
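
For reference, a sketch of the purge step (the purge playbook's name and location vary between ceph-ansible versions; adjust to whatever your build ships):

cd /usr/share/ceph-ansible
ansible-playbook purge-cluster.yml -i hosts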

Comment 45 Tanay Ganguly 2016-05-05 17:39:35 UTC
Thanks Alfredo for your input.
I was able to bring up my cluster cleanly.

As discussed, the fsid and clock synchronization were the problem.
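
For reference, a quick sketch to double-check both culprits across all nodes before re-running the playbook (assuming the ansible inventory file is named 'hosts'):

# every node should report the same fsid and roughly the same time
ansible all -i hosts -m shell -a 'grep fsid /etc/ceph/ceph.conf; date'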

