Description of problem:
=======================
Used ceph-ansible to create a ceph cluster. Installation completed without any
error, but OSDs are not activated on any node.

$ sudo ceph osd tree
ID WEIGHT TYPE NAME    UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1      0 root default

Version-Release number of selected component (if applicable):
=============================================================
ceph-ansible-1.0.2-1.el7.noarch

How reproducible:
=================
always

Steps to Reproduce:
===================
1. Install ceph-ansible 'ceph-ansible-1.0.2-1.el7.noarch' on the installer node.
2. Change the parameters below in /usr/share/ceph-ansible/group_vars/all:

   ceph_stable_rh_storage: true
   ceph_stable_rh_storage_cdn_install: true
   monitor_interface: eno1
   monitor_secret: AQA7P8dWAAAAABAAH/tbiZQn/40Z8pr959UmEA==
   journal_size: 10240
   public_network:

3. Change the parameters below in /usr/share/ceph-ansible/group_vars/osds:

   crush_location: false
   osd_crush_location: "'root={{ ceph_crush_root }} rack={{ ceph_crush_rack }} host={{ ansible_hostname }}'"
   devices:
     - /dev/sdb
     - /dev/sdc
   journal_collocation: true

4. Add hosts in the 'hosts' file (1 MON and 3 OSD nodes).
5. Execute 'ansible-playbook site.yml'; once it completes successfully, check
   the cluster status.

Actual results:
===============
[root@magna074 ceph-ansible]# sudo ceph -s
    cluster ea686d5b-9724-400c-9d17-dc135d0ee648
     health HEALTH_ERR
            64 pgs are stuck inactive for more than 300 seconds
            64 pgs stuck inactive
     monmap e1: 1 mons at {magna074=10.8.128.74:6789/0}
            election epoch 4, quorum 0 magna074
     osdmap e4: 6 osds: 0 up, 0 in
            flags sortbitwise
      pgmap v5: 64 pgs, 1 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                  64 creating

[root@magna074 ceph-ansible]# ceph osd tree
ID WEIGHT TYPE NAME    UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1      0 root default

Expected results:
=================
Ceph cluster should be up and running and all 6 OSDs should be in
(active+clean).

Additional info:
================
On one of the OSD nodes, the OSD was mounted during installation:

[ubuntu@magna067 ~]$ sudo df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       917G  3.0G  868G   1% /
devtmpfs         16G     0   16G   0% /dev
tmpfs            16G     0   16G   0% /dev/shm
tmpfs            16G   25M   16G   1% /run
tmpfs            16G     0   16G   0% /sys/fs/cgroup
tmpfs           3.2G     0  3.2G   0% /run/user/0
tmpfs           3.2G     0  3.2G   0% /run/user/1000
/dev/sdb1       922G   33M  922G   1% /var/lib/ceph/tmp/mnt.fRDt8U

but once installation is complete:

[ubuntu@magna067 ~]$ sudo df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       917G  3.0G  868G   1% /
devtmpfs         16G     0   16G   0% /dev
tmpfs            16G     0   16G   0% /dev/shm
tmpfs            16G   25M   16G   1% /run
tmpfs            16G     0   16G   0% /sys/fs/cgroup
tmpfs           3.2G     0  3.2G   0% /run/user/1000
Created attachment 1139601 [details]
output of ceph-ansible installation
What are you setting ``public_network`` to? I've had OSDs not start because of an incorrect public network. Trying to start one manually will sometimes tell you it cannot find the cluster at <public_network>.
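One way to verify this symptom is to check whether the node actually holds an address inside the configured public_network. The sketch below is an assumption, not part of the report: the `10.8.128.` prefix is a placeholder for whatever your group_vars public_network value is, and the helper name is hypothetical.

```shell
# Check whether any of a node's IPv4 addresses falls within the public
# network prefix. Feed it the output of `ip -o -4 addr show`.
# The prefix argument is a placeholder for your public_network value.
has_public_addr() {
    awk '{print $4}' | grep -q "^$1" \
        && echo "address in public network" \
        || echo "no address in public network"
}

# Usage on a node (adjust the prefix to your public_network):
#   ip -o -4 addr show | has_public_addr "10.8.128."
```

If this prints "no address in public network", the OSD daemons will fail to bind and never join the cluster.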
I have had problems on 7.1 where OSDs do not start after installing ceph with just straight ceph-deploy commands. This problem may not be with ceph-ansible but somewhere else.
I am hitting the original bug. No OSD is showing up after the installation. Checking all the nodes, I see the packages are all installed, but none of the ceph-osd processes started; starting them manually also fails.

sudo ceph -s
    cluster ac811885-ead4-43a3-a509-ff0724718399
     health HEALTH_ERR
            64 pgs are stuck inactive for more than 300 seconds
            64 pgs stuck inactive
            no osds
     monmap e1: 1 mons at {cephqe3=10.70.44.40:6789/0}
            election epoch 4, quorum 0 cephqe3
     osdmap e1: 0 osds: 0 up, 0 in
            flags sortbitwise
      pgmap v2: 64 pgs, 1 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                  64 creating

ceph osd tree
ID WEIGHT TYPE NAME    UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1      0 root default

I followed the document:
https://docs.google.com/document/d/1GzcpiciMLdNzZ46BLjVivKBVIhb1VDvzwZXAqX4Or9s/edit

I couldn't capture the log file the first time, so I re-ran and collected the log.
Created attachment 1142143 [details]
Command_Log
It looks like you are all hitting different issues. @Tanay, I see IPs in your output, not hostnames. Ceph monitors usually talk to each other via short hostnames. Maybe your ansible hosts file has IPs only? The nodes need to be able to resolve each other's hostnames as well.
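A quick way to verify resolution on each node is a loop over the cluster's short hostnames. This is a sketch: the hostnames below are placeholders taken from this report, and the helper name is hypothetical; substitute your own node names.

```shell
# Check that each cluster node's short hostname resolves locally.
# The hostnames are placeholders from this report -- use your own.
check_resolves() {
    getent hosts "$1" > /dev/null \
        && echo "$1: resolves" \
        || echo "$1: DOES NOT resolve"
}

for h in magna074 magna067; do
    check_resolves "$h"
done
```

Any node that prints "DOES NOT resolve" needs an /etc/hosts entry or a DNS record before the monitors and OSDs can find each other by name.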
*** Bug 1320547 has been marked as a duplicate of this bug. ***
I hit the same issue twice since yesterday, with both builds ceph-10.1.0-1.el7cp.x86_64 and ceph-10.0.4-2.el7cp.x86_64. The task status shows it completed successfully, as below:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--> started: 2016-04-01 17:50:11.012934
--> exit_code: 0
--> ended: 2016-04-01 17:57:44.150559
--> command: /bin/ansible-playbook -v -u ceph-installer /usr/share/ceph-ansible/osd-configure.yml -i /tmp/913ad7e2-2f3d-40c8-a613-cff4fe9c1fe9_KFsmce --extra-vars {"raw_journal_devices": ["/dev/vde"], "ceph_stable": true, "devices": ["/dev/vdd"], "public_network": "10.70.44.0/22", "fetch_directory": "/var/lib/ceph-installer/fetch", "cluster_network": "10.70.44.0/22", "raw_multi_journal": true, "fsid": "deedcb4c-a67a-4997-93a6-92149ad2622a", "journal_size": 1024} --skip-tags package-install
--> stderr:
--> identifier: 913ad7e2-2f3d-40c8-a613-cff4fe9c1fe9
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
But nothing is listed under "ceph -s":
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ceph -s
    cluster deedcb4c-a67a-4997-93a6-92149ad2622a
     health HEALTH_ERR
            64 pgs are stuck inactive for more than 300 seconds
            64 pgs stuck inactive
            no osds
     monmap e1: 1 mons at {dhcp46-204=10.70.46.204:6789/0}
            election epoch 3, quorum 0 dhcp46-204
     osdmap e1: 0 osds: 0 up, 0 in
            flags sortbitwise
      pgmap v2: 64 pgs, 1 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                  64 creating
~~~~~~~~~~~~~~~~~~~~~~~~
Attached the task output.
Created attachment 1142564 [details]
osd add error [shubhendu]
@Shubhendu, it looks like you might be hitting firewall issues: the OSD node is not able to communicate with the MON. Ensure that the OSD can reach the monitor on the standard port (6789) and that the monitor's hostname resolves from the OSD node. This can be inferred from this line in the output you attached:

2016-04-01 17:54:29.087330 7fe65410a700 0 -- :/3770174610 >> 0.0.0.0:6789/0 pipe(0x7fe644004300 sd=8 :0 s=1 pgs=0 cs=0 l=1 c=0x7fe644001990).fault

This section might be helpful:
http://docs.ceph.com/docs/master/start/quick-start-preflight/#open-required-ports

The quick-start preflight is a good source of information for ensuring things are working properly and understanding any caveats. The troubleshooting guide is good as well:
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/

To understand where and why ceph-installer is failing, we need not only the log output but also some troubleshooting on the nodes that failed. The troubleshooting guide is a start.
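A minimal reachability check from the OSD node can be sketched with bash's /dev/tcp redirection, assuming the default monitor port 6789. The monitor IP below is a placeholder from this thread, and the helper name is hypothetical.

```shell
# Test TCP reachability of the monitor from the OSD node using bash's
# built-in /dev/tcp pseudo-device. Arguments: <host-or-ip> <port>.
mon_reachable() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null \
        && echo "reachable" \
        || echo "unreachable"
}

# Placeholder monitor IP from this thread -- substitute your MON's address.
mon_reachable 10.70.46.204 6789
```

"unreachable" here usually means a firewalld/iptables rule on the MON node is blocking port 6789, which matches the `.fault` line in the attached log.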
Hemant, do you have magna nodes to try? Let's sync up and try this once on magna nodes.
I just tried this on 2 nodes using the ceph-ansible packages and it all works for me. I also see you missed some config in osds:

cat group_vars/osds
crush_location: false
osd_crush_location: "'root={{ ceph_crush_root }} rack={{ ceph_crush_rack }} host={{ ansible_hostname }}'"
zap_devices: true
devices:
  - /dev/sdb
  - /dev/sdc
  - /dev/sdd
journal_collocation: true
It is my understanding that the storage console team also faces this issue. Nishanth, would you please provide additional details about the failure?
As discussed, here is the setup where this issue can be reproduced.

Here are the nodes:
dhcp46-146.lab.eng.blr.redhat.com (10.70.46.146) - Ceph Installer
dhcp46-153.lab.eng.blr.redhat.com (10.70.46.153) - OSD
dhcp46-156.lab.eng.blr.redhat.com (10.70.46.156) - MON

All the nodes are bootstrapped and installation is done (with ceph-installer). Please have a look and let me know if something is missing here.
Andrew, I haven't run the mon/configure commands on these nodes. Gregory asked me to provide a setup where this can be reproduced. If you want me to run the mon/osd commands, I can do that; otherwise you are free to run those commands to reproduce the issue. Please let me know.
Below are the commands to reproduce the issue:

curl -d "{\"calamari\": true, \"host\": \"dhcp46-156.lab.eng.blr.redhat.com\", \"fsid\": \"deedcb4c-a67a-4997-93a6-92149ad2622a\", \"interface\": \"eth0\", \"monitor_secret\": \"AQA7P8dWAAAAABAAH/tbiZQn/40Z8pr959UmEA==\", \"cluster_network\": \"10.70.44.0/22\", \"public_network\": \"10.70.44.0/22\", \"redhat_storage\": false}" http://dhcp46-146.lab.eng.blr.redhat.com:8181/api/mon/configure/

curl -d "{\"devices\": {\"/dev/vdb\" : \"/dev/vdc\"}, \"journal_size\": 1024, \"fsid\": \"deedcb4c-a67a-4997-93a6-92149ad2622a\", \"host\": \"dhcp46-153.lab.eng.blr.redhat.com\", \"cluster_network\": \"10.70.44.0/22\", \"public_network\": \"10.70.44.0/22\", \"monitors\": [{\"host\": \"dhcp46-156.lab.eng.blr.redhat.com\", \"interface\": \"eth0\"}]}" http://dhcp46-146.lab.eng.blr.redhat.com:8181/api/osd/configure/
Nishanth, I was able to reproduce the issue. Looking at ceph.conf on the OSD node, I see the following:

[mon.dhcp46-156]
host = dhcp46-156
mon addr = 0.0.0.0

The mon addr is incorrect here, which is why the OSD is not joining the cluster. The ceph.conf is correct on the MON node, however.

I tried using ``address`` instead of ``interface`` for the ``monitors`` parameter, but received back this validation error, which should not have happened:

{"message": "-> monitors -> [{u'host': u'dhcp46-156.lab.eng.blr.redhat.com', u'address': u'10.70.46.156'}] failed validation, requires format: [{'host': 'mon1.host', 'interface': 'eth1'},{'host': 'mon2.host', 'interface': 'enp0s8'}]"}
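A quick sanity check for this symptom is to grep the OSD node's ceph.conf for a zeroed-out mon addr. This is a sketch: the helper name is hypothetical, and /etc/ceph/ceph.conf is assumed to be the default conf path.

```shell
# Count 'mon addr = 0.0.0.0' entries in a ceph.conf-style file; any
# non-zero count means the OSD has no usable monitor address and will
# never join the cluster.
bad_mon_addr() {
    grep -c 'mon addr *= *0\.0\.0\.0' "$1"
}

# Usage on the OSD node (default conf path; adjust if needed):
#   bad_mon_addr /etc/ceph/ceph.conf
```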
There was a bug in ceph-ansible that was causing the incorrect mon addr setting on the OSD node. This PR addresses that: https://github.com/ceph/ceph-ansible/pull/701
Is this fix available in the latest build?
Nishanth, this fix is not downstream yet.
Nishanth,

Could you let me know the value of 'mon addr' in ceph.conf on your OSD node?

Thanks,
Andrew
[mon.dhcp46-139]
host = dhcp46-139
mon addr = 10.70.46.139
I've fixed another bug related to the use of 'address' instead of 'interface' in ceph-ansible. https://github.com/ceph/ceph-ansible/pull/712
It seems like there is a pretty severe clock skew. Again, I am following the troubleshooting guide for mons:
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/#most-common-monitor-issues

$ ceph health detail
HEALTH_ERR clock skew detected on mon.cephqe8, mon.cephqe9; 64 pgs are stuck inactive for more than 300 seconds; 64 pgs stuck inactive; no osds; Monitor clock skew detected
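For context, the default mon_clock_drift_allowed is 0.05s, so even sub-second drift between monitors triggers this warning. The comparison can be sketched as below; the timestamps are hard-coded illustrations standing in for the output of `ssh <mon-host> date +%s%N`, and the helper name is hypothetical.

```shell
# Compare two monitor clock readings (in nanoseconds) against Ceph's
# default mon_clock_drift_allowed of 0.05s (50,000,000 ns).
# In practice, collect timestamps via `ssh <mon-host> date +%s%N`.
drift_check() {
    local skew=$(( $1 - $2 ))
    skew=${skew#-}                      # absolute value
    if [ "$skew" -gt 50000000 ]; then
        echo "clock skew detected: ${skew} ns"
    else
        echo "clocks within tolerance"
    fi
}

# Two hypothetical readings 0.9s apart -- well beyond the tolerance:
drift_check 1459531811900000000 1459531811000000000
# -> clock skew detected: 900000000 ns
```

Running chronyd or ntpd on all monitor nodes, pointed at the same time source, is the usual fix.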
Thanks Rachana, the information you provided is correct.

Alfredo, cephqe3 is the installer node, which I already mentioned in comment 38:

Ansible Installer: cephqe3

I just ran "ansible-playbook site.yml" from this directory. I am following this document for installation:
https://access.qa.redhat.com/documentation/en/red-hat-ceph-storage/version-2/installation-guide-for-red-hat-enterprise-linux/

I will re-try after purging my cluster.
Thanks Alfredo for your input. I am able to bring up my cluster cleanly. As discussed, the fsid and clock synchronization were the problem.