Description of problem:
=======================
cthulhu crashes when osd_metadata is missing from the OSD_map

Version-Release number of selected component (if applicable):
==============================================================
ceph-deploy-1.5.27.4-1.el7cp.noarch
ceph-common-0.94.5-4.el7cp.x86_64
calamari-server-1.3.2-2.el7cp.x86_64
calamari-clients-1.3-2.el7cp.x86_64
salt-2014.1.5-3.el7cp.noarch
salt-minion-2014.1.5-3.el7cp.noarch
salt-master-2014.1.5-3.el7cp.noarch

How reproducible:
=================
always

Steps to Reproduce:
==================
1. Cluster had ceph-1.3.1 on RHEL-7.1.
2. Started upgrading it to the 1.3.2 z build.
3. After upgrading the calamari server, the web browser gives an error: server error - 500.

cthulhu log:

[c1@magna048 ~]$ sudo tail -f /var/log/calamari/cthulhu.log
    cluster_monitor.inject_sync_object(None, sync_type, version, msgpack.unpackb(latest_record.data))
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 351, in inject_sync_object
    new_object = self._sync_objects.on_fetch_complete(minion_id, sync_type, version, data)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 138, in on_fetch_complete
    new_object = self.set_map(sync_type, version, data)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 56, in set_map
    so = self._objects[typ] = typ(version, map_data)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_common-0.1-py2.7.egg/calamari_common/types.py", line 60, in __init__
    self.metadata_by_id = dict([(m['osd'], m) for m in data['osd_metadata']])
KeyError: 'osd_metadata'

Actual results:
===============
cthulhu crashed because osd_metadata is missing

Additional info:
===================
[c1@magna048 ~]$ sudo tail -f /var/log/calamari/calamari.log
    reply_event = bufchan.recv(timeout)
  File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/channel.py", line 267, in recv
    event = self._input_queue.get(timeout=timeout)
  File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/queue.py", line 200, in get
    result = waiter.get()
  File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/hub.py", line 568, in get
    return self.hub.switch()
  File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/hub.py", line 331, in switch
    return greenlet.switch(self)
LostRemote: Lost remote after 10s heartbeat
Gregory's work in progress is @ https://github.com/ceph/calamari/tree/wip-fix-osd-metadata
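For readers following the traceback: the KeyError comes from indexing data['osd_metadata'] directly in calamari_common/types.py. The snippet below is only a minimal sketch of one defensive way to build the same lookup when the key is absent. It is not taken from the branch above and may differ from the actual fix; the function name index_osd_metadata and the sample map contents are hypothetical.

```python
def index_osd_metadata(data):
    """Return {osd_id: metadata} for an OSD map dict.

    Hypothetical helper, not part of Calamari. It treats a missing
    (or None) 'osd_metadata' entry as an empty list instead of
    raising KeyError, as data['osd_metadata'] would.
    """
    entries = data.get('osd_metadata') or []
    return dict((m['osd'], m) for m in entries)


# An OSD map stored before the upgrade, without the 'osd_metadata' key.
old_map = {'osds': [{'osd': 0}]}
# A map that carries metadata for one OSD (sample values are made up).
new_map = {'osds': [{'osd': 0}],
           'osd_metadata': [{'osd': 0, 'hostname': 'example-host'}]}

print(index_osd_metadata(old_map))
print(index_osd_metadata(new_map))
```

With this approach an older stored map simply yields an empty metadata index, so the sync object can still be constructed; whether an empty index is acceptable to the rest of cthulhu is an assumption that would need checking.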
Fixed in v1.3.3 upstream
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:0313