Bug 1322905 - Crush rule update leaves Calamari in an inconsistent state
Summary: Crush rule update leaves Calamari in an inconsistent state
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Calamari
Version: 2.0
Hardware: Unspecified
OS: Linux
Priority: urgent
Severity: high
Target Milestone: rc
Target Release: 2.0
Assignee: Christina Meno
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks: 1291304
 
Reported: 2016-03-31 15:28 UTC by Nishanth Thomas
Modified: 2016-05-09 18:40 UTC
CC List: 8 users

Fixed In Version: calamari-server-1.4.0-0.5.rc8.el7cp
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-04-28 14:37:02 UTC
Embargoed:



Description Nishanth Thomas 2016-03-31 15:28:35 UTC
Description of problem:
The Calamari API calls to create a crush node/crush rule leave Calamari in an inconsistent state. Beyond that, further POST requests are not honored. I need to run calamari-initialize to bring it back to a stable state.

Version-Release number of selected component (if applicable):
calamari-server-1.4.0-rc3_12_g87e0928.el7.centos.x86_64

How reproducible:
Consistent

Steps to Reproduce:

Create a crush node and a crush rule using the Calamari API, as below (a reproduction sketch follows the rule block):
root general {
        id -3           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 0.000
}
rule general {
        ruleset 19
        type replicated
        min_size 1
        max_size 10
        step take general
        step chooseleaf firstn 0 type osd
        step emit
}
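
A hedged sketch of driving these creates through the API with Python requests follows; the host, port, credentials, and login endpoint are placeholders/assumptions, and the crush_node payload format is the one Nishanth posts in comment 20:

import requests  # sketch only; host/fsid/credentials are placeholders

API = "http://calamari.example.com:8002/api/v2"
FSID = "00000000-0000-0000-0000-000000000000"  # placeholder cluster fsid

s = requests.Session()
# Calamari's v2 API is session-based; this login endpoint and these
# credentials are assumptions for the sketch.
s.post(API + "/auth/login", data={"username": "admin", "password": "admin"})

crush_node = {"bucket_type": "root", "name": "general",
              "items": [{"id": 0, "weight": 0.0, "pos": 0}]}  # osd.0
r = s.post("%s/cluster/%s/crush_node" % (API, FSID), json=crush_node)
print(r.status_code, r.text)
# The crush_rule payload format is not shown anywhere in this report,
# so it is not guessed at here.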

Comment 2 Christina Meno 2016-03-31 16:25:42 UTC
Nishanth,

Would you please provide me access to an environment where this is happening?
or
would you please provide /var/log/calamari/*.log as attachments?

thank you,
G

Comment 3 Christina Meno 2016-03-31 17:50:41 UTC
Nishanth,

There are a few problems here:
the steps to reproduce are incomplete
that is not valid JSON

Which endpoint is causing the failure:
/crush_rule or /crush_node?

Further, you're running an out-of-date Calamari:

[root@dhcp47-48 ~]# rpm -qa | grep calamari-server
calamari-server-1.4.0-rc3_12_g87e0928.el7.centos.x86_64

latest is:
calamari-server-1.4.0-0.1.rc5.el7cp.x86_64.rpm 

available here:
http://puddle.ceph.redhat.com/puddles/ceph/2/2016-03-24.1/CEPH-2.repo

Comment 4 Christina Meno 2016-03-31 17:51:19 UTC
I have updated to the latest package and restarted.

I am now seeing this error in the logs:
2016-03-31 12:47:54,883 - WARNING - calamari.request_collection on_completion: unknown jid f532d318-f72b-4032-b316-b4c302321d6a, return: Traceback (most recent call last):
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_common-0.1-py2.7.egg/calamari_common/remote/mon_remote.py", line 735, in run_job_thread
    result = run_job(cmd, args)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_common-0.1-py2.7.egg/calamari_common/remote/mon_remote.py", line 715, in run_job
    args['since'])
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_common-0.1-py2.7.egg/calamari_common/remote/mon_remote.py", line 418, in get_cluster_object
    with ClusterHandle(ClusterHandle(cluster_name)) as cluster_handle:
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_common-0.1-py2.7.egg/calamari_common/remote/mon_remote.py", line 82, in __enter__
    conf_file = os.path.join(SRC_DIR, self.cluster_name + ".conf")
TypeError: unsupported operand type(s) for +: 'instance' and 'str'
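
The double wrap visible in the with-statement explains the TypeError: cluster_name ends up being a ClusterHandle instance rather than a string. A minimal sketch of the failure mode (simplified names, not the actual calamari_common source):

import os

SRC_DIR = "/etc/ceph"  # stand-in for calamari_common's SRC_DIR

class ClusterHandle:
    def __init__(self, cluster_name):
        self.cluster_name = cluster_name

    def __enter__(self):
        # When cluster_name is itself a ClusterHandle, this concatenation
        # raises TypeError (the old-style class in the log reports 'instance').
        conf_file = os.path.join(SRC_DIR, self.cluster_name + ".conf")
        return self

    def __exit__(self, *exc):
        return False

with ClusterHandle(ClusterHandle("ceph")):  # double wrap -> TypeError
    pass

# Presumably the fix is to pass the name straight through:
#   with ClusterHandle("ceph") as cluster_handle: ...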

Comment 5 Christina Meno 2016-03-31 19:13:34 UTC
This one is easy. I can provide a fix and a new package by end of day today

Comment 6 Christina Meno 2016-03-31 19:34:15 UTC
*** Bug 1322907 has been marked as a duplicate of this bug. ***

Comment 7 Christina Meno 2016-03-31 20:46:11 UTC
https://github.com/ceph/calamari/pull/411

Comment 8 Christina Meno 2016-04-01 21:18:35 UTC
need pm_ack

Comment 9 Christina Meno 2016-04-01 21:20:20 UTC
need qa_ack

Comment 10 Shubhendu Tripathi 2016-04-04 09:54:30 UTC
One more observation: once this crush map/rule issue happens, the API /api/v2/cluster/{fsid}/server lists the OSD node names as "general". I feel something is messed up between the conf and the crush bucket/map.

Below is a sample output:


--------------------------------
[
    {
        "fqdn": "dhcp47-98.lab.eng.blr.redhat.com", 
        "hostname": "dhcp47-98.lab.eng.blr.redhat.com", 
        "services": [
            {
                "fsid": "b95dbe5d-b880-4cd7-bcaf-d97a4f82b185", 
                "type": "mon", 
                "id": "c", 
                "running": true
            }
        ], 
        "frontend_addr": "10.70.47.98", 
        "backend_addr": null, 
        "frontend_iface": null, 
        "backend_iface": null, 
        "managed": true, 
        "last_contact": "2016-04-04T05:43:31.851396+00:00", 
        "boot_time": "2016-03-29T18:22:11+00:00", 
        "ceph_version": "0.94.5-9.el7cp"
    }, 
    {
        "fqdn": "general", 
        "hostname": "general", 
        "services": [
            {
                "fsid": "b95dbe5d-b880-4cd7-bcaf-d97a4f82b185", 
                "type": "osd", 
                "id": "2", 
                "running": true
            }, 
            {
                "fsid": "b95dbe5d-b880-4cd7-bcaf-d97a4f82b185", 
                "type": "osd", 
                "id": "1", 
                "running": true
            }, 
            {
                "fsid": "b95dbe5d-b880-4cd7-bcaf-d97a4f82b185", 
                "type": "osd", 
                "id": "0", 
                "running": true
            }
        ], 
        "frontend_addr": "10.70.47.95", 
        "backend_addr": "10.70.47.95", 
        "frontend_iface": null, 
        "backend_iface": null, 
        "managed": false, 
        "last_contact": null, 
        "boot_time": null, 
        "ceph_version": null
    }
]
--------------------------------

Comment 12 Nishanth Thomas 2016-04-12 02:45:04 UTC
crush_node create is messing up the crush map. When you post for the
first time it creates the crush node but I could see that it tampers with
other crush node entries by making the items empty. Please see the result of
the GET after the POST.(https://bugzilla.redhat.com/show_bug.cgi?id=1322905
)

[{"bucket_type": "root", "name": "default", "id": -1, "weight": 0.0, "alg":
"straw", "hash": "rjenkins1", "items": [{"id": -2, "weight": 0.0, "pos":
0}]},
{"bucket_type": "host", "name": "dhcp47-44", "id": -2, "weight": 0.0, "alg":
"straw", "hash": "rjenkins1", "items": []},
{"bucket_type": "root", "name": "general", "id": -3, "weight": 0.0, "alg":
"straw", "hash": "rjenkins1", "items": []},
{"bucket_type": "root", "name": "general121", "id": -4, "weight": 0.0,
"alg": "straw", "hash": "rjenkins1", "items": []},
{"bucket_type": "root", "name": "generalzzzzzzz", "id": -5, "weight": 0.0,
"alg": "straw", "hash": "rjenkins1", "items": [{"id": 0, "weight": 0.0,
"pos": 0}]}]

I could see the below error in the logs:

2016-04-01 09:55:24,536 - ERROR - calamari RpcInterface !! create
Traceback (most recent call last):
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/rpc.py", line 35, in wrap
    rc = attr(*args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/rpc.py", line 210, in create
    return cluster.request_create(CRUSH_NODE, attributes)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 391, in request_create
    return self._request('create', obj_type, attributes)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 375, in _request
    request = getattr(request_factory, method)(*args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/crush_node_request_factory.py", line 50, in create
    self._add_items(name, bucket_type, items)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/crush_node_request_factory.py", line 84, in _add_items
    hostname = self._get_hostname_where_osd_runs(id)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/crush_node_request_factory.py", line 96, in _get_hostname_where_osd_runs
    str(osd_id))).hostname
AttributeError: 'NoneType' object has no attribute 'hostname'

==> calamari.log <==
2016-04-01 09:55:24,541 - ERROR - django.request Internal Server Error: /api/v2/cluster/58eaf578-1e8c-4359-9896-f9ae21e1ed7b/crush_node
Traceback (most recent call last):
  File "/opt/calamari/venv/lib/python2.7/site-packages/django/core/handlers/base.py", line 115, in get_response
    response = callback(request, *callback_args, **callback_kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/rest_framework/viewsets.py", line 78, in view
    return self.dispatch(request, *args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/rpc_view.py", line 91, in dispatch
    return super(RPCViewSet, self).dispatch(request, *args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/django/views/decorators/csrf.py", line 77, in wrapped_view
    return view_func(*args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/rest_framework/views.py", line 399, in dispatch
    response = self.handle_exception(exc)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/rpc_view.py", line 108, in handle_exception
    return super(RPCViewSet, self).handle_exception(exc)
  File "/opt/calamari/venv/lib/python2.7/site-packages/rest_framework/views.py", line 396, in dispatch
    response = handler(request, *args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/v2.py", line 131, in create
    create_response = self.client.create(fsid, CRUSH_NODE, serializer.get_data())
  File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/core.py", line 260, in <lambda>
    return lambda *args, **kargs: self(method, *args, **kargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/core.py", line 245, in __call__
    return self._process_response(request_event, bufchan, timeout)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/rpc_view.py", line 55, in _process_response
    result = super(ProfiledRpcClient, self)._process_response(request_event, bufchan, timeout)
  File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/core.py", line 220, in _process_response
    reply_event, self._handle_remote_error)
  File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/patterns.py", line 44, in process_answer
    raise exception
RemoteError: Traceback (most recent call last):
  File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/core.py", line 148, in _async_task
    functor.pattern.process_call(self._context, bufchan, event, functor)
  File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/patterns.py", line 30, in process_call
    result = functor(*req_event.args)
  File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/decorators.py", line 44, in __call__
    return self._functor(*args, **kargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/rpc.py", line 35, in wrap
    rc = attr(*args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/rpc.py", line 210, in create
    return cluster.request_create(CRUSH_NODE, attributes)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 391, in request_create
    return self._request('create', obj_type, attributes)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 375, in _request
    request = getattr(request_factory, method)(*args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/crush_node_request_factory.py", line 50, in create
    self._add_items(name, bucket_type, items)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/crush_node_request_factory.py", line 84, in _add_items
    hostname = self._get_hostname_where_osd_runs(id)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/crush_node_request_factory.py", line 96, in _get_hostname_where_osd_runs
    str(osd_id))).hostname
AttributeError: 'NoneType' object has no attribute 'hostname'
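
Both tracebacks bottom out in _get_hostname_where_osd_runs dereferencing .hostname on a lookup that came back None. An illustrative sketch of that failure mode and a defensive variant (names simplified, not the actual Calamari source):

def get_hostname_where_osd_runs(servers_by_service, osd_id):
    # servers_by_service maps a service id to a server record;
    # a miss returns None, and None.hostname is the AttributeError above.
    server = servers_by_service.get(str(osd_id))
    if server is None:
        # defensive variant: surface a clear error instead of AttributeError
        raise ValueError("no server found running osd.%s" % osd_id)
    return server.hostname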

Comment 13 Christina Meno 2016-04-14 12:31:21 UTC
I believe this is the issue we're looking for:

2016-04-14 06:09:10,461 - WARNING - calamari Abandoning fetch for osd_map started at 2016-04-14 11:09:00.150400+00:00
2016-04-14 06:09:10,591 - WARNING - calamari.request_collection on_completion: unknown jid c2dc3136-5639-4e10-b615-6c3a8bbdb632, return: Traceback (most recent call last):
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_common-0.1-py2.7.egg/calamari_common/remote/mon_remote.py", line 733, in run_job_thread
    result = run_job(cmd, args)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_common-0.1-py2.7.egg/calamari_common/remote/mon_remote.py", line 713, in run_job
    args['since'])
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_common-0.1-py2.7.egg/calamari_common/remote/mon_remote.py", line 482, in get_cluster_object
    assert ret == 0
AssertionError
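
The assert is mon_remote.py checking the return code of the cluster-map fetch. An illustrative sketch of the pattern (run_mon_command is a hypothetical stand-in for the actual mon call):

def get_cluster_object_sketch(run_mon_command):
    # run_mon_command returns (ret, outbuf, outs), mirroring librados-style
    # command helpers; the name and signature are assumptions.
    ret, outbuf, outs = run_mon_command("osd dump")
    # mon_remote.py line 482 does the equivalent of this: any nonzero
    # return code aborts the fetch with a bare AssertionError.
    assert ret == 0
    return outbuf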

Comment 14 Christina Meno 2016-04-14 12:49:11 UTC
I'm working on a test that reproduces this error.

Comment 16 Nishanth Thomas 2016-04-18 11:54:14 UTC
I am still seeing the issue in:

calamari-server-1.4.0-0.5.rc8.el7cp

Crush map after cluster creation:
===========================
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

# devices
device 0 osd.0

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host dhcp46-58 {
        id -2           # do not change unnecessarily
        # weight 0.015
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 0.015
}
root default {
        id -1           # do not change unnecessarily
        # weight 0.015
        alg straw
        hash 0  # rjenkins1
        item dhcp46-58 weight 0.015
}

# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

# end crush map
=================

Now I send a POST request to create a crush node

POST http://10.70.46.139:8002/api/v2/cluster/deedcb4c-a67a-4997-93a6-92149ad2622a/crush_node

Crush map after POST:

================================

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

# devices
device 0 osd.0

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host dhcp46-58 {
        id -2           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
}
root default {
        id -1           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item dhcp46-58 weight 0.000
}
root general {
        id -3           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 0.000
}

# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

# end crush map
============================

note that "item osd.0 weight 0.000" entry in "host dhcp46-58" is removed.

Comment 17 Christina Meno 2016-04-20 13:09:45 UTC
Nishanth,

Would you please provide the data that you are posting and help me look for errors in the log during this event?

osd.0 is getting parented to the general node. Is that what you are asking for in the create?

Comment 18 Nishanth Thomas 2016-04-20 13:20:24 UTC
I expect a configuration as below after my POST request:

====================
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

# devices
device 0 osd.0

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host dhcp46-58 {
        id -2           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 0.000      <========
}
root default {
        id -1           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item dhcp46-58 weight 0.000     <===========    
}
root general {
        id -3           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 0.000
}

# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

# end crush map
================================================

If osd.0 is removed from default, it will affect any pools/RBDs created with replicated_ruleset (which uses 'default'), and CLI commands like 'rbd ls' hang forever. But when I edited the crush map manually (as above) and updated it to the cluster, it worked without any issues, so I expect this to be a valid configuration.

Comment 19 Christina Meno 2016-04-20 16:27:25 UTC
I understand that, Nishanth.
What I need in order to make progress is the actual POST data for creating the crush node that causes the issue.

Would you please provide that?

Comment 20 Nishanth Thomas 2016-04-21 14:28:28 UTC
{"bucket_type": "root", "name": "general", "items": [{"id": 0, "weight": 0.0, "pos": 0}]}

Comment 21 Ken Dreyer (Red Hat) 2016-04-25 16:50:52 UTC
Still need an Ubuntu build with this fix (rc9?).

