Bug 1322905
| Summary: | Crush rule update leave the calamari in a inconsistent state | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Nishanth Thomas <nthomas> |
| Component: | Calamari | Assignee: | Christina Meno <gmeno> |
| Calamari sub component: | Back-end | QA Contact: | ceph-qe-bugs <ceph-qe-bugs> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | high | ||
| Priority: | urgent | CC: | ceph-eng-bugs, federico, hnallurv, kdreyer, nlevine, nthomas, shtripat, vsarmila |
| Version: | 2.0 | ||
| Target Milestone: | rc | ||
| Target Release: | 2.0 | ||
| Hardware: | Unspecified | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | calamari-server-1.4.0-0.5.rc8.el7cp | Doc Type: | Bug Fix |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-04-28 14:37:02 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1291304 | ||
Nishanth,
Would you please give me access to an environment where this is happening? Or would you please provide /var/log/calamari/*.log as attachments?
Thank you,
G

Nishanth,
There are a few problems here:
The steps to reproduce are incomplete.
That is not valid JSON.
Which endpoint is causing the failure? Is it /crush_rule or /crush_node?
Further, you're running an out-of-date calamari:
[root@dhcp47-48 ~]# rpm -qa | grep calamari-server
calamari-server-1.4.0-rc3_12_g87e0928.el7.centos.x86_64
The latest is calamari-server-1.4.0-0.1.rc5.el7cp.x86_64.rpm, available here: http://puddle.ceph.redhat.com/puddles/ceph/2/2016-03-24.1/CEPH-2.repo

I have updated to the latest package and restarted.
I am now seeing this error in the logs
2016-03-31 12:47:54,883 - WARNING - calamari.request_collection on_completion: unknown jid f532d318-f72b-4032-b316-b4c302321d6a, return: Traceback (most recent call last):
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_common-0.1-py2.7.egg/calamari_common/remote/mon_remote.py", line 735, in run_job_thread
result = run_job(cmd, args)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_common-0.1-py2.7.egg/calamari_common/remote/mon_remote.py", line 715, in run_job
args['since'])
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_common-0.1-py2.7.egg/calamari_common/remote/mon_remote.py", line 418, in get_cluster_object
with ClusterHandle(ClusterHandle(cluster_name)) as cluster_handle:
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_common-0.1-py2.7.egg/calamari_common/remote/mon_remote.py", line 82, in __enter__
conf_file = os.path.join(SRC_DIR, self.cluster_name + ".conf")
TypeError: unsupported operand type(s) for +: 'instance' and 'str'
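The traceback above shows `get_cluster_object` wrapping one `ClusterHandle` in another, so `cluster_name` ends up holding an instance instead of a string and the `+ ".conf"` concatenation fails. A minimal sketch of that failure mode follows. The `ClusterHandle` class here is a simplified stand-in, and the `SRC_DIR` value and the `conf_path` / `double_wrap_fails` helpers are assumptions made for illustration, not Calamari's actual code.

```python
import os

SRC_DIR = "/etc/ceph"  # assumed value; the real SRC_DIR is not shown in this thread


class ClusterHandle(object):
    """Simplified stand-in for the context manager named in the traceback."""

    def __init__(self, cluster_name):
        self.cluster_name = cluster_name
        self.conf_file = None

    def __enter__(self):
        # Same expression as the failing line in the traceback.
        self.conf_file = os.path.join(SRC_DIR, self.cluster_name + ".conf")
        return self

    def __exit__(self, exc_type, exc_value, tb):
        return False  # do not swallow exceptions


def conf_path(handle):
    """Enter the handle and return the conf path it computed (hypothetical helper)."""
    with handle as h:
        return h.conf_file


def double_wrap_fails(cluster_name):
    """Return True if wrapping a handle in a handle raises TypeError."""
    try:
        conf_path(ClusterHandle(ClusterHandle(cluster_name)))
    except TypeError:
        return True
    return False
```

Passing the name once, as in `ClusterHandle(cluster_name)`, avoids the error in this sketch.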
This one is easy. I can provide a fix and a new package by end of day today.

*** Bug 1322907 has been marked as a duplicate of this bug. ***

need pm_ack

need qa_ack

One more observation: once this crush map/rule issue happens, the API /api/v2/cluster/{fsid}/server lists the OSD node names as "general". I feel something is messed up with the conf and the crush bucket/map.
Below is a sample output
--------------------------------
[
{
"fqdn": "dhcp47-98.lab.eng.blr.redhat.com",
"hostname": "dhcp47-98.lab.eng.blr.redhat.com",
"services": [
{
"fsid": "b95dbe5d-b880-4cd7-bcaf-d97a4f82b185",
"type": "mon",
"id": "c",
"running": true
}
],
"frontend_addr": "10.70.47.98",
"backend_addr": null,
"frontend_iface": null,
"backend_iface": null,
"managed": true,
"last_contact": "2016-04-04T05:43:31.851396+00:00",
"boot_time": "2016-03-29T18:22:11+00:00",
"ceph_version": "0.94.5-9.el7cp"
},
{
"fqdn": "general",
"hostname": "general",
"services": [
{
"fsid": "b95dbe5d-b880-4cd7-bcaf-d97a4f82b185",
"type": "osd",
"id": "2",
"running": true
},
{
"fsid": "b95dbe5d-b880-4cd7-bcaf-d97a4f82b185",
"type": "osd",
"id": "1",
"running": true
},
{
"fsid": "b95dbe5d-b880-4cd7-bcaf-d97a4f82b185",
"type": "osd",
"id": "0",
"running": true
}
],
"frontend_addr": "10.70.47.95",
"backend_addr": "10.70.47.95",
"frontend_iface": null,
"backend_iface": null,
"managed": false,
"last_contact": null,
"boot_time": null,
"ceph_version": null
}
]
--------------------------------
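As a rough way to spot this symptom in the /server response, the sketch below flags entries that carry OSD services but are unmanaged and have never reported in, which is the shape of the "general" entry in the sample output. The function name and the choice of these three conditions are assumptions made for illustration; they are not part of the Calamari API.

```python
import json


def suspicious_servers(servers):
    """Return fqdn values of unmanaged, never-seen entries that list OSD services."""
    flagged = []
    for server in servers:
        has_osd = any(s.get("type") == "osd" for s in server.get("services", []))
        never_seen = server.get("last_contact") is None
        if has_osd and never_seen and not server.get("managed", False):
            flagged.append(server.get("fqdn"))
    return flagged


# Trimmed copy of the sample output above.
sample = json.loads("""
[
  {"fqdn": "dhcp47-98.lab.eng.blr.redhat.com", "managed": true,
   "last_contact": "2016-04-04T05:43:31.851396+00:00",
   "services": [{"type": "mon", "id": "c", "running": true}]},
  {"fqdn": "general", "managed": false, "last_contact": null,
   "services": [{"type": "osd", "id": "2", "running": true}]}
]
""")

print(suspicious_servers(sample))
```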
crush_node create is messing up the crush map. When you post for the first time it creates the crush node, but I could see that it tampers with other crush node entries by making the items empty. Please see the result of the GET after the POST (https://bugzilla.redhat.com/show_bug.cgi?id=1322905):
[{"bucket_type": "root", "name": "default", "id": -1, "weight": 0.0, "alg": "straw", "hash": "rjenkins1", "items": [{"id": -2, "weight": 0.0, "pos": 0}]}, {"bucket_type": "host", "name": "dhcp47-44", "id": -2, "weight": 0.0, "alg": "straw", "hash": "rjenkins1", "items": []}, {"bucket_type": "root", "name": "general", "id": -3, "weight": 0.0, "alg": "straw", "hash": "rjenkins1", "items": []}, {"bucket_type": "root", "name": "general121", "id": -4, "weight": 0.0, "alg": "straw", "hash": "rjenkins1", "items": []}, {"bucket_type": "root", "name": "generalzzzzzzz", "id": -5, "weight": 0.0, "alg": "straw", "hash": "rjenkins1", "items": [{"id": 0, "weight": 0.0, "pos": 0}]}]
I could see the below error in the logs:
2016-04-01 09:55:24,536 - ERROR - calamari RpcInterface !! create
Traceback (most recent call last):
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/rpc.py", line 35, in wrap
rc = attr(*args, **kwargs)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/rpc.py", line 210, in create
return cluster.request_create(CRUSH_NODE, attributes)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 391, in request_create
return self._request('create', obj_type, attributes)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 375, in _request
request = getattr(request_factory, method)(*args, **kwargs)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/crush_node_request_factory.py", line 50, in create
self._add_items(name, bucket_type, items)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/crush_node_request_factory.py", line 84, in _add_items
hostname = self._get_hostname_where_osd_runs(id)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/crush_node_request_factory.py", line 96, in _get_hostname_where_osd_runs
str(osd_id))).hostname
AttributeError: 'NoneType' object has no attribute 'hostname'
==> calamari.log <==
2016-04-01 09:55:24,541 - ERROR - django.request Internal Server Error: /api/v2/cluster/58eaf578-1e8c-4359-9896-f9ae21e1ed7b/crush_node
Traceback (most recent call last):
File "/opt/calamari/venv/lib/python2.7/site-packages/django/core/handlers/base.py", line 115, in get_response
response = callback(request, *callback_args, **callback_kwargs)
File "/opt/calamari/venv/lib/python2.7/site-packages/rest_framework/viewsets.py", line 78, in view
return self.dispatch(request, *args, **kwargs)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/rpc_view.py", line 91, in dispatch
return super(RPCViewSet, self).dispatch(request, *args, **kwargs)
File "/opt/calamari/venv/lib/python2.7/site-packages/django/views/decorators/csrf.py", line 77, in wrapped_view
return view_func(*args, **kwargs)
File "/opt/calamari/venv/lib/python2.7/site-packages/rest_framework/views.py", line 399, in dispatch
response = self.handle_exception(exc)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/rpc_view.py", line 108, in handle_exception
return super(RPCViewSet, self).handle_exception(exc)
File "/opt/calamari/venv/lib/python2.7/site-packages/rest_framework/views.py", line 396, in dispatch
response = handler(request, *args, **kwargs)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/v2.py", line 131, in create
create_response = self.client.create(fsid, CRUSH_NODE, serializer.get_data())
File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/core.py", line 260, in <lambda>
return lambda *args, **kargs: self(method, *args, **kargs)
File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/core.py", line 245, in __call__
return self._process_response(request_event, bufchan, timeout)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/rpc_view.py", line 55, in _process_response
result = super(ProfiledRpcClient, self)._process_response(request_event, bufchan, timeout)
File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/core.py", line 220, in _process_response
reply_event, self._handle_remote_error)
File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/patterns.py", line 44, in process_answer
raise exception
RemoteError: Traceback (most recent call last):
File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/core.py", line 148, in _async_task
functor.pattern.process_call(self._context, bufchan, event, functor)
File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/patterns.py", line 30, in process_call
result = functor(*req_event.args)
File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/decorators.py", line 44, in __call__
return self._functor(*args, **kargs)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/rpc.py", line 35, in wrap
rc = attr(*args, **kwargs)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/rpc.py", line 210, in create
return cluster.request_create(CRUSH_NODE, attributes)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 391, in request_create
return self._request('create', obj_type, attributes)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 375, in _request
request = getattr(request_factory, method)(*args, **kwargs)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/crush_node_request_factory.py", line 50, in create
self._add_items(name, bucket_type, items)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/crush_node_request_factory.py", line 84, in _add_items
hostname = self._get_hostname_where_osd_runs(id)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/crush_node_request_factory.py", line 96, in _get_hostname_where_osd_runs
str(osd_id))).hostname
AttributeError: 'NoneType' object has no attribute 'hostname'

I believe this is the issue we're looking for:
2016-04-14 06:09:10,461 - WARNING - calamari Abandoning fetch for osd_map started at 2016-04-14 11:09:00.150400+00:00
2016-04-14 06:09:10,591 - WARNING - calamari.request_collection on_completion: unknown jid c2dc3136-5639-4e10-b615-6c3a8bbdb632, return: Traceback (most recent call last):
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_common-0.1-py2.7.egg/calamari_common/remote/mon_remote.py", line 733, in run_job_thread
result = run_job(cmd, args)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_common-0.1-py2.7.egg/calamari_common/remote/mon_remote.py", line 713, in run_job
args['since'])
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_common-0.1-py2.7.egg/calamari_common/remote/mon_remote.py", line 482, in get_cluster_object
assert ret == 0
AssertionError
I'm working on a test that reproduces this error.

I am still seeing the issue in:
calamari-server-1.4.0-0.5.rc8.el7cp
Crush map after cluster creation:
===========================
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1
# devices
device 0 osd.0
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host dhcp46-58 {
id -2 # do not change unnecessarily
# weight 0.015
alg straw
hash 0 # rjenkins1
item osd.0 weight 0.015
}
root default {
id -1 # do not change unnecessarily
# weight 0.015
alg straw
hash 0 # rjenkins1
item dhcp46-58 weight 0.015
}
# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
# end crush map
=================
Now I send a POST request to create a crush node
POST http://10.70.46.139:8002/api/v2/cluster/deedcb4c-a67a-4997-93a6-92149ad2622a/crush_node
Crush map after POST:
================================
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1
# devices
device 0 osd.0
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host dhcp46-58 {
id -2 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
}
root default {
id -1 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item dhcp46-58 weight 0.000
}
root general {
id -3 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.0 weight 0.000
}
# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
# end crush map
============================
Note that the "item osd.0 weight 0.015" entry in "host dhcp46-58" has been removed.
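The before/after comparison can also be done mechanically. The sketch below parses the bucket blocks of a decompiled crush map and reports items that were present before but are missing afterwards. It assumes the text layout shown in the maps above and the bucket types listed in their "# types" section; the function names are hypothetical.

```python
import re

# Bucket types from the "# types" section of the maps above (osd excluded).
BUCKET_TYPES = ("host", "chassis", "rack", "row", "pdu", "pod", "room",
                "datacenter", "region", "root")

BUCKET_HEADER = re.compile(r"^(%s)\s+(\S+)\s*\{" % "|".join(BUCKET_TYPES))
ITEM_LINE = re.compile(r"^item\s+(\S+)")


def bucket_items(crush_text):
    """Map each bucket name to the list of item names it contains."""
    buckets = {}
    current = None
    for raw in crush_text.splitlines():
        line = raw.strip()
        header = BUCKET_HEADER.match(line)
        if header:
            current = header.group(2)
            buckets[current] = []
        elif line.startswith("}"):
            current = None
        elif current is not None:
            item = ITEM_LINE.match(line)
            if item:
                buckets[current].append(item.group(1))
    return buckets


def lost_items(before_text, after_text):
    """Return {bucket: [items]} for items present before but missing after."""
    before, after = bucket_items(before_text), bucket_items(after_text)
    lost = {}
    for name, items in before.items():
        missing = [i for i in items if i not in after.get(name, [])]
        if missing:
            lost[name] = missing
    return lost
```

Fed the two maps above, `lost_items` would be expected to report osd.0 as missing from dhcp46-58, matching the observation in this comment.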
Nishanth,
Would you please provide the data that you are posting and help me look for errors in the log during this event? OSD.0 is getting parented to the general node. Is that what you are asking for in the create?

I expect a configuration as below after my POST request:
====================
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1
# devices
device 0 osd.0
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host dhcp46-58 {
id -2 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item dhcp46-58 weight 0.000 <========
}
root default {
id -1 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item dhcp46-58 weight 0.000 <===========
}
root general {
id -3 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.0 weight 0.000
}
# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
# end crush map
================================================
If osd.0 is removed from default, it will affect any pools/RBDs created with replicated_ruleset (which uses 'default'), and CLI commands like 'rbd ls' hang forever. But when I edited the crush map manually (as above) and updated it on the cluster, it worked without any issues, so I expect that this is a valid configuration.
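The concern here is whether the root that a rule starts from still reaches any OSDs. A small sketch of that check follows, working on a list shaped like the crush_node GET output quoted earlier in this thread. It relies on the CRUSH convention that buckets have negative ids and devices have non-negative ids; the function name is hypothetical.

```python
def osds_under(nodes, root_name):
    """Return sorted OSD ids reachable from the bucket named root_name.

    `nodes` is a list of bucket dicts with "name", "id" and "items" keys;
    items with id >= 0 are treated as OSDs, negative ids as child buckets.
    """
    by_id = {n["id"]: n for n in nodes}
    start = next((n for n in nodes if n["name"] == root_name), None)
    if start is None:
        return []
    found, stack, seen = set(), [start["id"]], set()
    while stack:
        node_id = stack.pop()
        if node_id in seen:
            continue  # guard against cycles in a malformed map
        seen.add(node_id)
        for item in by_id.get(node_id, {}).get("items", []):
            if item["id"] >= 0:
                found.add(item["id"])
            else:
                stack.append(item["id"])
    return sorted(found)


# Trimmed from the GET output quoted earlier in this thread.
nodes = [
    {"name": "default", "id": -1, "items": [{"id": -2}]},
    {"name": "dhcp47-44", "id": -2, "items": []},
    {"name": "generalzzzzzzz", "id": -5, "items": [{"id": 0}]},
]
print(osds_under(nodes, "default"))
```

An empty result for the root named in `step take` would suggest that pools using that rule have nowhere to place data.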
I understand that, Nishanth. What I need in order to progress is the actual POST data for creating the crush node that causes the issue. Would you please provide that?

{"bucket_type": "root", "name": "general", "items": [{"id": 0, "weight": 0.0, "pos": 0}]}
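For anyone trying to reproduce this, the request can be assembled roughly as below, using the URL and fsid quoted earlier in this thread and the body given here. This is only a sketch: the `Content-Type` header is an assumption, authentication is not covered anywhere in the thread, and the request is built but deliberately not sent.

```python
import json
import urllib.request

# Base URL and fsid as quoted earlier in this thread.
API_BASE = "http://10.70.46.139:8002/api/v2"
FSID = "deedcb4c-a67a-4997-93a6-92149ad2622a"


def build_crush_node_request(name, bucket_type, items):
    """Build (but do not send) a POST to the crush_node endpoint."""
    body = {"bucket_type": bucket_type, "name": name, "items": items}
    url = "%s/cluster/%s/crush_node" % (API_BASE, FSID)
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


request = build_crush_node_request(
    "general", "root", [{"id": 0, "weight": 0.0, "pos": 0}]
)
# urllib.request.urlopen(request) would submit it; left out on purpose.
```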
Still need an Ubuntu build with this fix (rc9?).
Description of problem:
The Calamari API to create a crush node/crush rule leaves Calamari in an inconsistent state. After that, POST requests are not honored. I need to do a calamari-initialize to bring it back to a stable state.

Version-Release number of selected component (if applicable):
calamari-server-1.4.0-rc3_12_g87e0928.el7.centos.x86_64

How reproducible:
Consistent

Steps to Reproduce:
Create a crush node and a crush rule using the Calamari API, as below:
root general {
id -3 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.0 weight 0.000
}
rule general {
ruleset 19
type replicated
min_size 1
max_size 10
step take general
step chooseleaf firstn 0 type osd
step emit
}