Created attachment 1080440 [details]
Sample output from ceph-deploy osd create magna045:sdb

Description of problem:
ceph-deploy osd create appears to finish cleanly, but the results contain errors.

Version-Release number of selected component (if applicable):
RHEL 7.2 -- Ceph 1.2.3

How reproducible:
100% of the time (3 for 3 for me)

Steps to Reproduce:
On two RHEL 7.2 machines:
1. Run ice_setup on one machine.
2. Use ceph-deploy to install Ceph on the other machine.
3. Run the OSD commands:
   a. ceph-deploy disk zap magnaXXX:sdb
   b. ceph-deploy osd create magnaXXX:sdb

The commands appear to succeed, but the output (see attached file) shows some problems. The behavior is the same when creating OSDs on sdc and sdd as well.

Actual results:
$ sudo ceph osd stat
osdmap e1: 0 osds: 0 up, 0 in
$ sudo ceph health
HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; no osds

Expected results:
A healthy Ceph cluster should be running.

Additional info:
The problems I refer to in the attachment are lines like:

[WARNIN] partx: /dev/sdb: error adding partitions 1-2
Created attachment 1080512 [details]
ceph-deploy mon create-initial output after ceph-deploy install --release firefly was performed.
Created attachment 1080513 [details]
Result after ceph-deploy mon create-initial
This problem (the one with mon create-initial failing) happened because ceph-mon, for some reason, was not installed on the monitor host. After I ran yum install ceph-mon there, the command now works.
Hmmm. ceph-disk is not installed on the Ceph cluster host either. I need to run yum install ceph-osd there too.
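For the record, the missing-package fixes from the last two comments amount to the following (a sketch only; assumes yum-based RHEL hosts with the Ceph repos already configured, as in this setup):

```shell
# On the monitor host: install the mon daemon (was missing, per the comment above)
yum install -y ceph-mon

# On the OSD/cluster host: install the OSD package, which provides ceph-disk
yum install -y ceph-osd
```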
After making the changes in the previous comments, ceph-deploy osd create magnaXXX:sdX now finishes. The output (including the "error adding partitions" messages) looked similar to the output in the first attached file. I also got the messages:

[magna045][WARNIN] there is 1 OSD down
[magna045][WARNIN] there is 1 OSD out

after creating the first OSD,

[magna045][WARNIN] there are 2 OSDs down
[magna045][WARNIN] there are 2 OSDs out

after creating the second, and

[magna045][WARNIN] there are 3 OSDs down
[magna045][WARNIN] there are 3 OSDs out

after the third. This looked ominous to me. Sure enough, ceph health shows:

HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean
Raising a new BZ against 1.3
(In reply to Warren from comment #3)
> Created attachment 1080512 [details]
> ceph-deploy mon create-initial output after ceph-deploy install --release
> firefly was performed.

Is "--release firefly" in the RHCS install docs anywhere? I suspect that will grab packages from ceph.com.
Note: this was unclear from reading the comments. The first problem I encountered (OSDs appeared to come up, but ceph osd stat proved otherwise) happened when I did ceph-deploy install magnaXXX. The later comments (comment #3 through comment #7) describe what happened when I tried the --release firefly option to see whether it was a workaround.
Created attachment 1080831 [details]
ceph-deploy install output without any subscriptions.
The attachment in Comment 12 shows the output of ceph-deploy install when no subscriptions are changed and the ISO is installed from ice_setup. It includes:

[magna038][WARNIN] http://magna035.ceph.redhat.com/static/ceph/0.80.8/repodata/repomd.xml: [Errno 14] curl#7 - "Failed connect to magna035.ceph.redhat.com:80; No route to host"

ls /opt/calamari/webapp/content/ceph/0.80.8/repodata shows:

403badb7fb44b02aa6b46b0719b345cde3ecca45a93f1a188c7b173a0049b7bf-primary.xml.gz
5f84e6808d08a334da52ddfb24206b7353fefe9363c4d3ef371729b70a6bd687-filelists.xml.gz
a2b491c13826ca6f9a64d0cdf058c7dc076ef7e39696db67522fe9f613990e99-other.sqlite.bz2
aea9774252295df69cf674eb008a794c696610a9a0be4abc5bc77db3f7e1a690-filelists.sqlite.bz2
d71bddc322fc500bfa263430117301c8586cc52ee284c0d144163698c20d84c8-primary.sqlite.bz2
f4a9c4f95e797f6298ff78a6e89e18fd022b2303e874b8d873b4802d5c428fa3-other.xml.gz
repomd.xml
TRANS.TBL

So I am guessing that on magna038, http://magna035.ceph.redhat.com/static should point to /opt/calamari/webapp/content but does not.
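One quick way to check that guess (hypothetical diagnostic commands, using the hosts and paths named above) is to compare what the HTTP path serves with what is on disk:

```shell
# On magna038 (the client): is the repo metadata reachable over HTTP?
curl -sI http://magna035.ceph.redhat.com/static/ceph/0.80.8/repodata/repomd.xml

# On magna035 (the calamari host): does the file exist where calamari keeps it?
ls -l /opt/calamari/webapp/content/ceph/0.80.8/repodata/repomd.xml
```

If the file exists but curl cannot reach it, the problem is in the web server mapping or, as it turned out below, the firewall.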
Ugh. Regarding the previous message -- I had messed up the firewall...
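The "No route to host" symptom above is typical of a firewall blocking the port. A minimal firewalld sketch of the kind of fix involved (RHEL 7 assumed; port 80 for the calamari-served repo, 6789 for ceph-mon, and 6800-7300 for OSDs are the stock Ceph defaults):

```shell
# On the calamari/install host: allow HTTP so clients can fetch the repo
firewall-cmd --permanent --add-port=80/tcp

# On the cluster hosts: allow the Ceph monitor and OSD ports
firewall-cmd --permanent --add-port=6789/tcp
firewall-cmd --permanent --add-port=6800-7300/tcp

# Apply the permanent rules to the running firewall
firewall-cmd --reload
```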
Ceph is now up and healthy. The network has two sites: one runs calamari and served as the installation site; the other is a Ceph cluster consisting of one mon and three OSDs. For some reason, the cluster site needed to be rebooted before Ceph became healthy.
Closing. The problems here were firewall issues, improper use of the --release parameter, and confusion caused by bug 1269684, which I have now opened.