Description of problem:
Using the Calamari API to create a crush node or crush rule leaves Calamari in an inconsistent state. After that, POST requests are no longer honored. I have to run calamari-initialize to bring it back to a stable state.

Version-Release number of selected component (if applicable):
calamari-server-1.4.0-rc3_12_g87e0928.el7.centos.x86_64

How reproducible:
Consistent

Steps to Reproduce:
Create a crush node and a crush rule using the Calamari API, as below:

root general {
    id -3    # do not change unnecessarily
    # weight 0.000
    alg straw
    hash 0    # rjenkins1
    item osd.0 weight 0.000
}

rule general {
    ruleset 19
    type replicated
    min_size 1
    max_size 10
    step take general
    step chooseleaf firstn 0 type osd
    step emit
}
Nishanth,

Would you please give me access to an environment where this is happening, or attach /var/log/calamari/*.log?

Thank you,
G
Nishanth,

There are a few problems here:
- The steps to reproduce are incomplete.
- That is not valid JSON.
- Which endpoint is causing the failure: is it /crush_rule or /crush_node?

Further, you're running an out-of-date calamari:

[root@dhcp47-48 ~]# rpm -qa | grep calamari-server
calamari-server-1.4.0-rc3_12_g87e0928.el7.centos.x86_64

The latest is calamari-server-1.4.0-0.1.rc5.el7cp.x86_64.rpm, available here:
http://puddle.ceph.redhat.com/puddles/ceph/2/2016-03-24.1/CEPH-2.repo
I have updated to the latest package and restarted. I am now seeing this error in the logs:

2016-03-31 12:47:54,883 - WARNING - calamari.request_collection on_completion: unknown jid f532d318-f72b-4032-b316-b4c302321d6a, return: Traceback (most recent call last):
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_common-0.1-py2.7.egg/calamari_common/remote/mon_remote.py", line 735, in run_job_thread
    result = run_job(cmd, args)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_common-0.1-py2.7.egg/calamari_common/remote/mon_remote.py", line 715, in run_job
    args['since'])
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_common-0.1-py2.7.egg/calamari_common/remote/mon_remote.py", line 418, in get_cluster_object
    with ClusterHandle(ClusterHandle(cluster_name)) as cluster_handle:
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_common-0.1-py2.7.egg/calamari_common/remote/mon_remote.py", line 82, in __enter__
    conf_file = os.path.join(SRC_DIR, self.cluster_name + ".conf")
TypeError: unsupported operand type(s) for +: 'instance' and 'str'
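The traceback shows the cluster name being wrapped twice, as ClusterHandle(ClusterHandle(cluster_name)), so the inner handle rather than a string reaches the "+" in __enter__. A minimal sketch of that failure mode follows. The ClusterHandle class here is a hypothetical stand-in written for illustration (not the real mon_remote.py class), and the SRC_DIR value is an assumption. On Python 3 the message names the class instead of 'instance'.

```python
import os

SRC_DIR = "/etc/ceph"  # assumed value; the real SRC_DIR is not shown in the log


class ClusterHandle(object):
    """Hypothetical stand-in for the class named in the traceback."""

    def __init__(self, cluster_name):
        self.cluster_name = cluster_name

    def conf_path(self):
        # Mirrors the failing expression: cluster_name must be a str here.
        return os.path.join(SRC_DIR, self.cluster_name + ".conf")


def try_conf_path(handle):
    """Return the conf path, or the TypeError text if the name is not a str."""
    try:
        return handle.conf_path()
    except TypeError as exc:
        return "TypeError: %s" % exc


good = try_conf_path(ClusterHandle("ceph"))                # single wrap works
bad = try_conf_path(ClusterHandle(ClusterHandle("ceph")))  # double wrap fails
print(good)
print(bad)
```

Wrapping the name once is enough to avoid the error in this sketch; whether the actual fix does exactly that is not stated in this thread.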
This one is easy. I can provide a fix and a new package by end of day today
*** Bug 1322907 has been marked as a duplicate of this bug. ***
https://github.com/ceph/calamari/pull/411
need pm_ack
need qa_ack
One more observation: once this crush map/rule issue happens, the API /api/v2/cluster/{fsid}/server lists the OSD node name as "general". I suspect something is messed up between the conf and the crush bucket/map. Below is a sample output:

--------------------------------
[
  {
    "fqdn": "dhcp47-98.lab.eng.blr.redhat.com",
    "hostname": "dhcp47-98.lab.eng.blr.redhat.com",
    "services": [
      { "fsid": "b95dbe5d-b880-4cd7-bcaf-d97a4f82b185", "type": "mon", "id": "c", "running": true }
    ],
    "frontend_addr": "10.70.47.98",
    "backend_addr": null,
    "frontend_iface": null,
    "backend_iface": null,
    "managed": true,
    "last_contact": "2016-04-04T05:43:31.851396+00:00",
    "boot_time": "2016-03-29T18:22:11+00:00",
    "ceph_version": "0.94.5-9.el7cp"
  },
  {
    "fqdn": "general",
    "hostname": "general",
    "services": [
      { "fsid": "b95dbe5d-b880-4cd7-bcaf-d97a4f82b185", "type": "osd", "id": "2", "running": true },
      { "fsid": "b95dbe5d-b880-4cd7-bcaf-d97a4f82b185", "type": "osd", "id": "1", "running": true },
      { "fsid": "b95dbe5d-b880-4cd7-bcaf-d97a4f82b185", "type": "osd", "id": "0", "running": true }
    ],
    "frontend_addr": "10.70.47.95",
    "backend_addr": "10.70.47.95",
    "frontend_iface": null,
    "backend_iface": null,
    "managed": false,
    "last_contact": null,
    "boot_time": null,
    "ceph_version": null
  }
]
--------------------------------
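One rough way to spot this symptom in the server list is to look for unmanaged entries whose hostname matches the name of a non-host crush bucket. This is only a heuristic sketch: the function name and the trimmed sample data are made up for illustration, and the set of bucket names is assumed to come from a separate crush_node listing.

```python
def suspect_servers(servers, bucket_names):
    """Return hostnames of unmanaged server entries named after a crush bucket."""
    return [
        s["hostname"]
        for s in servers
        if not s.get("managed") and s["hostname"] in bucket_names
    ]


# Trimmed-down version of the /server output above (only the fields used here).
servers = [
    {"hostname": "dhcp47-98.lab.eng.blr.redhat.com", "managed": True},
    {"hostname": "general", "managed": False},
]

# Assumed names of non-host buckets (for example, roots) in the crush map.
bucket_names = {"default", "general"}

suspects = suspect_servers(servers, bucket_names)
print(suspects)
```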
crush_node create is messing up the crush map. The first POST creates the crush node, but it also tampers with other crush node entries by emptying their items. Please see the result of the GET after the POST (https://bugzilla.redhat.com/show_bug.cgi?id=1322905):

[
  {"bucket_type": "root", "name": "default", "id": -1, "weight": 0.0, "alg": "straw", "hash": "rjenkins1", "items": [{"id": -2, "weight": 0.0, "pos": 0}]},
  {"bucket_type": "host", "name": "dhcp47-44", "id": -2, "weight": 0.0, "alg": "straw", "hash": "rjenkins1", "items": []},
  {"bucket_type": "root", "name": "general", "id": -3, "weight": 0.0, "alg": "straw", "hash": "rjenkins1", "items": []},
  {"bucket_type": "root", "name": "general121", "id": -4, "weight": 0.0, "alg": "straw", "hash": "rjenkins1", "items": []},
  {"bucket_type": "root", "name": "generalzzzzzzz", "id": -5, "weight": 0.0, "alg": "straw", "hash": "rjenkins1", "items": [{"id": 0, "weight": 0.0, "pos": 0}]}
]

I could see the below error in the logs:

2016-04-01 09:55:24,536 - ERROR - calamari RpcInterface !! create
Traceback (most recent call last):
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/rpc.py", line 35, in wrap
    rc = attr(*args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/rpc.py", line 210, in create
    return cluster.request_create(CRUSH_NODE, attributes)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 391, in request_create
    return self._request('create', obj_type, attributes)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 375, in _request
    request = getattr(request_factory, method)(*args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/crush_node_request_factory.py", line 50, in create
    self._add_items(name, bucket_type, items)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/crush_node_request_factory.py", line 84, in _add_items
    hostname = self._get_hostname_where_osd_runs(id)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/crush_node_request_factory.py", line 96, in _get_hostname_where_osd_runs
    str(osd_id))).hostname
AttributeError: 'NoneType' object has no attribute 'hostname'

==> calamari.log <==
2016-04-01 09:55:24,541 - ERROR - django.request Internal Server Error: /api/v2/cluster/58eaf578-1e8c-4359-9896-f9ae21e1ed7b/crush_node
Traceback (most recent call last):
  File "/opt/calamari/venv/lib/python2.7/site-packages/django/core/handlers/base.py", line 115, in get_response
    response = callback(request, *callback_args, **callback_kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/rest_framework/viewsets.py", line 78, in view
    return self.dispatch(request, *args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/rpc_view.py", line 91, in dispatch
    return super(RPCViewSet, self).dispatch(request, *args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/django/views/decorators/csrf.py", line 77, in wrapped_view
    return view_func(*args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/rest_framework/views.py", line 399, in dispatch
    response = self.handle_exception(exc)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/rpc_view.py", line 108, in handle_exception
    return super(RPCViewSet, self).handle_exception(exc)
  File "/opt/calamari/venv/lib/python2.7/site-packages/rest_framework/views.py", line 396, in dispatch
    response = handler(request, *args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/v2.py", line 131, in create
    create_response = self.client.create(fsid, CRUSH_NODE, serializer.get_data())
  File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/core.py", line 260, in <lambda>
    return lambda *args, **kargs: self(method, *args, **kargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/core.py", line 245, in __call__
    return self._process_response(request_event, bufchan, timeout)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/rpc_view.py", line 55, in _process_response
    result = super(ProfiledRpcClient, self)._process_response(request_event, bufchan, timeout)
  File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/core.py", line 220, in _process_response
    reply_event, self._handle_remote_error)
  File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/patterns.py", line 44, in process_answer
    raise exception
RemoteError: Traceback (most recent call last):
  File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/core.py", line 148, in _async_task
    functor.pattern.process_call(self._context, bufchan, event, functor)
  File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/patterns.py", line 30, in process_call
    result = functor(*req_event.args)
  File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/decorators.py", line 44, in __call__
    return self._functor(*args, **kargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/rpc.py", line 35, in wrap
    rc = attr(*args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/rpc.py", line 210, in create
    return cluster.request_create(CRUSH_NODE, attributes)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 391, in request_create
    return self._request('create', obj_type, attributes)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 375, in _request
    request = getattr(request_factory, method)(*args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/crush_node_request_factory.py", line 50, in create
    self._add_items(name, bucket_type, items)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/crush_node_request_factory.py", line 84, in _add_items
    hostname = self._get_hostname_where_osd_runs(id)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/crush_node_request_factory.py", line 96, in _get_hostname_where_osd_runs
    str(osd_id))).hostname
AttributeError: 'NoneType' object has no attribute 'hostname'
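The AttributeError at the bottom of both tracebacks means the lookup of the server hosting the OSD returned None, and .hostname was then read from it. A minimal sketch of a guarded lookup is below. It is not the Calamari code: the function name, the dict-based servers_by_osd mapping, and the sample hostname are all assumptions made for illustration.

```python
def hostname_where_osd_runs(servers_by_osd, osd_id):
    """Return the hostname for an OSD, failing loudly when none is known.

    servers_by_osd is a hypothetical mapping of OSD id (as str) to a
    server record; it stands in for whatever lookup returned None.
    """
    server = servers_by_osd.get(str(osd_id))
    if server is None:
        # Raise a descriptive error instead of hitting AttributeError later.
        raise LookupError("no server known for osd.%s" % osd_id)
    return server["hostname"]


# Made-up sample data: only osd.1 has a known server.
servers_by_osd = {"1": {"hostname": "dhcp47-44"}}

print(hostname_where_osd_runs(servers_by_osd, 1))
try:
    hostname_where_osd_runs(servers_by_osd, 0)
except LookupError as exc:
    print("rejected:", exc)
```

Rejecting the request up front like this would also avoid leaving the crush map half-modified, assuming the check runs before any change is applied.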
I believe this is the issue we're looking for:

2016-04-14 06:09:10,461 - WARNING - calamari Abandoning fetch for osd_map started at 2016-04-14 11:09:00.150400+00:00
2016-04-14 06:09:10,591 - WARNING - calamari.request_collection on_completion: unknown jid c2dc3136-5639-4e10-b615-6c3a8bbdb632, return: Traceback (most recent call last):
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_common-0.1-py2.7.egg/calamari_common/remote/mon_remote.py", line 733, in run_job_thread
    result = run_job(cmd, args)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_common-0.1-py2.7.egg/calamari_common/remote/mon_remote.py", line 713, in run_job
    args['since'])
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_common-0.1-py2.7.egg/calamari_common/remote/mon_remote.py", line 482, in get_cluster_object
    assert ret == 0
AssertionError
I'm working on a test that reproduces this error.
I am still seeing the issue in calamari-server-1.4.0-0.5.rc8.el7cp.

Crush map after cluster creation:
===========================
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

# devices
device 0 osd.0

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host dhcp46-58 {
    id -2    # do not change unnecessarily
    # weight 0.015
    alg straw
    hash 0    # rjenkins1
    item osd.0 weight 0.015
}
root default {
    id -1    # do not change unnecessarily
    # weight 0.015
    alg straw
    hash 0    # rjenkins1
    item dhcp46-58 weight 0.015
}

# rules
rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
=================

Now I send a POST request to create a crush node:

POST http://10.70.46.139:8002/api/v2/cluster/deedcb4c-a67a-4997-93a6-92149ad2622a/crush_node

Crush map after POST:
================================
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

# devices
device 0 osd.0

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host dhcp46-58 {
    id -2    # do not change unnecessarily
    # weight 0.000
    alg straw
    hash 0    # rjenkins1
}
root default {
    id -1    # do not change unnecessarily
    # weight 0.000
    alg straw
    hash 0    # rjenkins1
    item dhcp46-58 weight 0.000
}
root general {
    id -3    # do not change unnecessarily
    # weight 0.000
    alg straw
    hash 0    # rjenkins1
    item osd.0 weight 0.000
}

# rules
rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
============================

Note that the "item osd.0 weight 0.000" entry in "host dhcp46-58" has been removed.
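The before/after comparison above can be automated with a small script that extracts the items of each bucket from the decompiled crush map text and reports what disappeared. This is a sketch under assumptions: the regular expressions only handle the simple layout shown in this comment, the function names are made up, and the sample maps are trimmed to the bucket sections.

```python
import re

# Matches "<type> <name> { ... }" blocks; assumes no nested braces.
BUCKET_RE = re.compile(r"^(\w+)\s+(\S+)\s*\{(.*?)\}", re.S | re.M)
ITEM_RE = re.compile(r"^\s*item\s+(\S+)", re.M)


def bucket_items(crush_text):
    """Map bucket name -> list of item names from decompiled crush map text."""
    buckets = {}
    for kind, name, body in BUCKET_RE.findall(crush_text):
        if kind == "rule":  # rules share the brace syntax but hold no items
            continue
        buckets[name] = ITEM_RE.findall(body)
    return buckets


def removed_items(before_text, after_text):
    """Return {bucket: [items present before but missing after]}."""
    before, after = bucket_items(before_text), bucket_items(after_text)
    lost = {}
    for name, items in before.items():
        missing = [i for i in items if i not in after.get(name, [])]
        if missing:
            lost[name] = missing
    return lost


BEFORE = """
host dhcp46-58 {
    id -2
    item osd.0 weight 0.015
}
root default {
    id -1
    item dhcp46-58 weight 0.015
}
"""

AFTER = """
host dhcp46-58 {
    id -2
}
root default {
    id -1
    item dhcp46-58 weight 0.000
}
root general {
    id -3
    item osd.0 weight 0.000
}
"""

print(removed_items(BEFORE, AFTER))  # {'dhcp46-58': ['osd.0']}
```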
Nishanth,

Would you please provide the data that you are posting, and help me look for errors in the log during this event? OSD.0 is getting parented to the general node. Is that what you are asking for in the create?
I expect a configuration as below after my POST request:
====================
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

# devices
device 0 osd.0

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host dhcp46-58 {
    id -2    # do not change unnecessarily
    # weight 0.000
    alg straw
    hash 0    # rjenkins1
    item dhcp46-58 weight 0.000 <========
}
root default {
    id -1    # do not change unnecessarily
    # weight 0.000
    alg straw
    hash 0    # rjenkins1
    item dhcp46-58 weight 0.000 <===========
}
root general {
    id -3    # do not change unnecessarily
    # weight 0.000
    alg straw
    hash 0    # rjenkins1
    item osd.0 weight 0.000
}

# rules
rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
================================================

If osd.0 is removed from default, it affects any pools/RBDs created with replicated_ruleset (which uses 'default'), and CLI commands like 'rbd ls' hang forever. However, when I edited the crush map manually (as above) and applied it to the cluster, it worked without any issues, so I expect this to be a valid configuration.
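The concern about pools that use replicated_ruleset can be checked by walking the bucket tree from the root the rule takes and collecting the OSDs reachable from it. The sketch below assumes the bucket contents are already available as a plain dict of bucket name to item names (a made-up representation, as is the function name). It tracks visited buckets so a bucket that lists itself does not cause endless recursion.

```python
def reachable_osds(buckets, root):
    """Return the set of osd.N devices reachable from the given bucket."""
    found, seen, stack = set(), set(), [root]
    while stack:
        name = stack.pop()
        if name in seen:
            continue  # guard against self-references or cycles
        seen.add(name)
        for item in buckets.get(name, []):
            if item.startswith("osd."):
                found.add(item)
            else:
                stack.append(item)
    return found


# Bucket contents as they appear in the crush map after the POST.
after_post = {
    "dhcp46-58": [],
    "default": ["dhcp46-58"],
    "general": ["osd.0"],
}

print(reachable_osds(after_post, "default"))  # set()
print(reachable_osds(after_post, "general"))  # {'osd.0'}
```

With no OSD reachable from default, a rule that starts with "step take default" has nothing to choose from, which would be consistent with the hang described above, although the thread does not confirm that as the cause.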
I understand that, Nishanth. What I need in order to make progress is the actual POST data for creating the crush node that causes the issue. Would you please provide that?
{"bucket_type": "root", "name": "general", "items": [{"id": 0, "weight": 0.0, "pos": 0}]}
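For anyone trying to reproduce this, the payload above can be turned into a request against the crush_node endpoint mentioned earlier in this bug. The sketch below only builds the request object and does not send it. The helper name is made up, and authentication and CSRF handling are deliberately left out, so a real call would need more than this.

```python
import json
import urllib.request


def build_crush_node_request(api_base, fsid, node):
    """Build (but do not send) a POST request that creates a crush node."""
    body = json.dumps(node).encode("utf-8")  # guarantees the body is valid JSON
    url = "%s/api/v2/cluster/%s/crush_node" % (api_base.rstrip("/"), fsid)
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


node = {
    "bucket_type": "root",
    "name": "general",
    "items": [{"id": 0, "weight": 0.0, "pos": 0}],
}

req = build_crush_node_request(
    "http://10.70.46.139:8002",
    "deedcb4c-a67a-4997-93a6-92149ad2622a",
    node,
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```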
Still need an Ubuntu build with this fix (rc9?).