Description of problem:
On a Ceph cluster upgraded from 1.3.3 to 2.2, the ceph-installer APIs are unable to fetch the cluster details from the calamari server on the MON node. I am trying to import the cluster into USM. The calamari service is running successfully on the MONs.

GET /api/v2/cluster
HTTP 200 OK
Vary: Accept
Content-Type: text/html; charset=utf-8
Allow: GET, HEAD, OPTIONS

[]

The cluster API is returning an empty list.

Version-Release number of selected component (if applicable):
ceph version 10.2.5-22.el7cp (5cec6848b914e87dd6178e559dedae8a37cc08a3)
calamari-server-1.5.0-1.el7cp.x86_64
salt-minion-2015.5.5-1.el7.noarch
salt-selinux-0.0.45-1.el7scon.noarch
salt-2015.5.5-1.el7.noarch

How reproducible:
Not sure

Steps to Reproduce:
1. Create a 1.3.3 Ceph cluster with 3 MONs and 3 OSDs.
2. Upgrade the cluster to Ceph 2.2 using the documented procedure.
3. Run take_over_existing_cluster.yml on the upgraded cluster from a different Ansible node.
4. Add a MON to this cluster using site.yml.
5. Try to import the cluster into USM.

Additional info:
The system is still in the same state:
MONs: magna031 magna046 magna052
OSDs: magna058 magna061 magna063
Console node: magna028
2017-02-09 06:03:56,930 - ERROR - calamari Uncaught exception
Traceback (most recent call last):
  File "/opt/calamari/venv/bin/calamari-lite", line 9, in <module>
    load_entry_point('calamari-lite==0.1', 'console_scripts', 'calamari-lite')()
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_lite-0.1-py2.7.egg/calamari_lite/server.py", line 140, in main
    cthulhu = Manager()
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/manager.py", line 193, in __init__
    self.eventer = Eventer(self)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/eventer.py", line 77, in __init__
    self.caller = salt.client.Caller(mopts=__opts__)
TypeError: __init__() got an unexpected keyword argument 'mopts'
I'm not sure that the traceback in the log is the cause.
The root cause is probably that there is no admin socket in /var/run/ceph.
Rebooting magna046 caused a socket to be created; something strange is going on.
And calamari works fine on magna046 after opening the firewall.
The shell history on one monitor suggests that it likely wasn't rebooted after the upgrade. Tejas, is this the case?

[root@magna052 ubuntu]# history
    1  subscription-manager repos --enable=rhel-7-server-rhceph-1.3-mon-rpms
    2  systemctl start firewalld
    3  systemctl enable firewalld
    4  systemctl status firewalld.service
    5  firewall-cmd --zone=public --add-port=6789/tcp
    6  firewall-cmd --zone=public --add-port=6789/tcp --permanent
    7  systemctl enable ntpd.service
    8  systemctl start ntpd
    9  ntpq -p
   10  yum-config-manager --disable epel
   11  setenforce 1
   12  uname -a
   13  systemctl status salt-minion.service
   14  subscription-manager repos --disable=rhel-7-server-rhceph-1.3-mon-rpms --disable=rhel-7-server-rhceph-1.3-installer-rpms --disable=rhel-7-server-rhceph-1.3-calamari-rpms
   15  systemctl status ceph-mon.magna052.1486542313.319673641.service
   16  systemctl stop ceph-mon.magna052.1486542313.319673641.service
   17  yum update ceph-mon
   18  chown -R ceph:ceph /var/lib/ceph/mon
   19  chown -R ceph:ceph /var/log/ceph
   20  chown -R ceph:ceph /var/run/ceph
   21  chown -R ceph:ceph /etc/ceph
   22  touch /.autorelabel
   23  udevadm trigger
   24  systemctl enable ceph-mon.target
   25  systemctl enable ceph-mon@magna052
   26  systemctl status ceph-mon
   27  systemctl start ceph-mon
   28  systemctl status ceph-mon
   29  ps -ef | grep ceph
   30  systemctl status ntpd.service
   31  systemctl start ntpd.service
   32  ntpq -p
   33  free -h
   34  salt-call --local pillar.items | grep ceph.heartbeat
   35  ceph -s
   36  free -h
   37  curl magna028.ceph.redhat.com:8181/setup/agent/ | bash
   38  rpm -qa | grep salt
   39  rpm -qa | grep agent
   40  systemctl restart salt-minion.service
   41  history
[root@magna052 ubuntu]#
OK, I found the cause of this. We use this regexp (calamari_common/remote/mon_remote.py:service_status):

match = re.match("^(.*)-(.*)\.(.*).asok$", os.path.basename(socket_path))

to get the cluster_name, service_type and service_id, but the greedy second group (service_type) runs past the dots inside the hostname, so we get a weird type and id when the socket filename contains an FQDN. I'm currently looking at ways to fix this regexp so the service_type group stops at the first dot.
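To make the failure concrete, here is the problematic regexp run against two socket filenames built from the hostnames in this report (illustrative examples, not taken from the cluster's actual /var/run/ceph):

```python
import re

# The greedy regexp from calamari_common/remote/mon_remote.py:service_status
PATTERN = r"^(.*)-(.*)\.(.*).asok$"

# Short hostname: the groups come out as intended.
print(re.match(PATTERN, "ceph-mon.magna046.asok").groups())
# -> ('ceph', 'mon', 'magna046')

# FQDN: the greedy second group swallows the hostname's dots,
# yielding a bogus service_type and service_id.
print(re.match(PATTERN, "ceph-mon.magna046.ceph.redhat.com.asok").groups())
# -> ('ceph', 'mon.magna046.ceph.redhat', 'com')
```

This also explains why the bug only shows up on hosts whose socket filenames embed an FQDN: with a single-label hostname the backtracking happens to land on the right dot.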
It turns out we just need to change the line to read:

match = re.match("^(.*)-([^\.]*)\.(.*).asok$", os.path.basename(socket_path))

and everything looks to be working OK (the cluster was discovered, the OSDs were there, ...).

Upstream PR: https://github.com/ceph/calamari/pull/502
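For comparison, the one-character-class change makes the service_type group unable to cross a dot, so the same FQDN socket filename from the example above now parses correctly:

```python
import re

# Fixed pattern: service_type ([^\.]*) can no longer match across a dot,
# so the remainder of the FQDN falls into the service_id group.
FIXED = r"^(.*)-([^\.]*)\.(.*).asok$"

print(re.match(FIXED, "ceph-mon.magna046.ceph.redhat.com.asok").groups())
# -> ('ceph', 'mon', 'magna046.ceph.redhat.com')
```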
https://github.com/ceph/calamari/releases/tag/v1.5.1
Verified on calamari build: calamari-server-1.5.2-1.el7cp.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0514.html