Bug 1638092

Summary: Default crush rule is not enforced
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Gregory Charot <gcharot>
Component: Ceph-Ansible
Assignee: Dimitri Savineau <dsavinea>
Status: CLOSED ERRATA
QA Contact: Vasishta <vashastr>
Severity: medium
Priority: urgent
Version: 3.0
CC: anharris, aschoen, ceph-eng-bugs, dsavinea, gabrioux, gcharot, gfidente, gmeno, nthomas, pasik, sankarshan, tchandra, tserlin, yrabl
Target Milestone: z2
Target Release: 3.2
Hardware: x86_64
OS: Linux
Fixed In Version: RHEL: ceph-ansible-3.2.10-1.el7cp Ubuntu: ceph-ansible_3.2.10-2redhat1
Doc Type: No Doc Update
Last Closed: 2019-04-30 15:56:43 UTC
Type: Bug
Bug Blocks: 1578730    

Description Gregory Charot 2018-10-10 16:03:40 UTC
Description of problem:

When using ceph-ansible to define a CRUSH rule, it is possible to mark it as the default rule, but the default rule ID is not enforced.

Version-Release number of selected component (if applicable):
ceph-ansible-3.1.5-1.el7cp.noarch

How reproducible:

Always

Steps to Reproduce:
1. Configure the following crush rules:
    crush_rule_config: true
    crush_rules:
      - name: standard
        root: standard_root
        type: rack
        default: true
      - name: fast
        root: fast_root
        type: rack
        default: false
2. Create a pool
3. Check the pool crush_rule value

Actual results:

Rules are created, but the default rule is not enforced.

Expected results:

The rule configured as default should become the default crush rule.

Additional info:

# ceph osd pool create get_schwifty 64

# ceph osd pool get get_schwifty crush_rule
crush_rule: replicated_rule

Rules are created:
# ceph osd crush rule ls
replicated_rule
standard
fast

Looking at the running mon config, the default rule is set to -1, which means "pick the rule with the lowest numerical ID and use that":

# ceph --admin-daemon  /var/run/ceph/ceph-mon.lab-controller01.asok config get osd_pool_default_crush_rule
{
    "osd_pool_default_crush_rule": "-1"
}

Looking at the ceph.conf file, it should contain the osd_pool_default_crush_rule = 1 parameter [1], but it does not:
# grep osd_pool_default_crush_rule /etc/ceph/ceph.conf
-> empty

[1] https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-mon/tasks/crush_rules.yml#L54
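
For reference, task [1] is expected to leave the option in the [global] section of the generated ceph.conf, roughly like this (a sketch; rule ID 1 corresponds to the "standard" rule created above):

    [global]
    # ... other options rendered from the template ...
    osd_pool_default_crush_rule = 1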

Looking at the ceph-ansible logs, we can see that the mons' running config seems to be correctly updated:

2018-10-10 08:56:44,348 p=32237 u=mistral |  TASK [ceph-mon : insert new default crush rule into daemon to prevent restart] ***
2018-10-10 08:56:44,348 p=32237 u=mistral |  Wednesday 10 October 2018  08:56:44 -0400 (0:00:00.083)       0:03:41.021 *****
2018-10-10 08:56:44,828 p=32237 u=mistral |  ok: [172.16.0.25 -> 172.16.0.22] => (item=172.16.0.22) => {"changed": false, "cmd": ["docker", "exec", "ceph-mon-lab-controller01", "ceph", "--cluster", "ceph", "daemon", "mon.lab-controller01", "config", "set", "osd_pool_default_crush_rule", "1"], "delta": "0:00:00.210402", "end": "2018-10-10 12:56:46.953167", "failed": false, "item": "172.16.0.22", "rc": 0, "start": "2018-10-10 12:56:46.742765", "stderr": "", "stderr_lines": [], "stdout": "{\n    \"success\": \"osd_pool_default_crush_rule = '1' (not observed, change may require restart) \"\n}", "stdout_lines": ["{", "    \"success\": \"osd_pool_default_crush_rule = '1' (not observed, change may require restart) \"", "}"]}
2018-10-10 08:56:45,287 p=32237 u=mistral |  ok: [172.16.0.25 -> 172.16.0.24] => (item=172.16.0.24) => {"changed": false, "cmd": ["docker", "exec", "ceph-mon-lab-controller03", "ceph", "--cluster", "ceph", "daemon", "mon.lab-controller03", "config", "set", "osd_pool_default_crush_rule", "1"], "delta": "0:00:00.211802", "end": "2018-10-10 12:56:47.413902", "failed": false, "item": "172.16.0.24", "rc": 0, "start": "2018-10-10 12:56:47.202100", "stderr": "", "stderr_lines": [], "stdout": "{\n    \"success\": \"osd_pool_default_crush_rule = '1' (not observed, change may require restart) \"\n}", "stdout_lines": ["{", "    \"success\": \"osd_pool_default_crush_rule = '1' (not observed, change may require restart) \"", "}"]}
2018-10-10 08:56:45,752 p=32237 u=mistral |  ok: [172.16.0.25 -> 172.16.0.25] => (item=172.16.0.25) => {"changed": false, "cmd": ["docker", "exec", "ceph-mon-lab-controller02", "ceph", "--cluster", "ceph", "daemon", "mon.lab-controller02", "config", "set", "osd_pool_default_crush_rule", "1"], "delta": "0:00:00.227841", "end": "2018-10-10 12:56:47.878438", "failed": false, "item": "172.16.0.25", "rc": 0, "start": "2018-10-10 12:56:47.650597", "stderr": "", "stderr_lines": [], "stdout": "{\n    \"success\": \"osd_pool_default_crush_rule = '1' (not observed, change may require restart) \"\n}", "stdout_lines": ["{", "    \"success\": \"osd_pool_default_crush_rule = '1' (not observed, change may require restart) \"", "}"]}

The "add new default crush rule to ceph.conf" task appears to run fine but the option is not present in ceph.conf ...

2018-10-10 08:56:45,793 p=32237 u=mistral |  TASK [ceph-mon : add new default crush rule to ceph.conf] **********************
2018-10-10 08:56:45,793 p=32237 u=mistral |  Wednesday 10 October 2018  08:56:45 -0400 (0:00:01.444)       0:03:42.466 *****
2018-10-10 08:56:46,335 p=32237 u=mistral |  changed: [172.16.0.25 -> 172.16.0.22] => (item=172.16.0.22) => {"changed": true, "failed": false, "gid": 0, "group": "root", "item": "172.16.0.22", "mode": "0644", "msg": "option added", "owner": "root", "path": "/etc/ceph/ceph.conf", "secontext": "system_u:object_r:etc_t:s0", "size": 1973, "state": "file", "uid": 0}
2018-10-10 08:56:46,564 p=32237 u=mistral |  changed: [172.16.0.25 -> 172.16.0.24] => (item=172.16.0.24) => {"changed": true, "failed": false, "gid": 0, "group": "root", "item": "172.16.0.24", "mode": "0644", "msg": "option added", "owner": "root", "path": "/etc/ceph/ceph.conf", "secontext": "system_u:object_r:etc_t:s0", "size": 1973, "state": "file", "uid": 0}
2018-10-10 08:56:46,811 p=32237 u=mistral |  changed: [172.16.0.25 -> 172.16.0.25] => (item=172.16.0.25) => {"changed": true, "failed": false, "gid": 0, "group": "root", "item": "172.16.0.25", "mode": "0644", "msg": "option added", "owner": "root", "path": "/etc/ceph/ceph.conf", "secontext": "system_u:object_r:etc_t:s0", "size": 1973, "state": "file", "uid": 0}

So my guess is that the default rule ID is indeed added to the mons' running config, but adding it to ceph.conf somehow fails. The mons are then restarted, so they fall back to osd_pool_default_crush_rule = -1.

Comment 3 Dimitri Savineau 2019-03-05 21:39:10 UTC
So the issue is still present in ceph-ansible stable-3.2, but it depends on the Ceph cluster architecture.

As Greg reported, the osd_pool_default_crush_rule parameter appears to be applied correctly to the ceph.conf file (according to the Ansible logs) but ends up missing from it.

In fact, the parameter is correctly set during the mons configuration (crush_rules.yml tasks [1]), but the mgrs configuration (which runs after the mons) erases it by reapplying the default ceph.conf template plus overrides [2] (via the ceph-config role). A simplified sketch of the interaction is below.
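
To illustrate (a simplified paraphrase of the linked tasks, not verbatim code): the mon play edits ceph.conf in place, while the later ceph-config run regenerates the whole file from its template and therefore drops that edit.

    # roles/ceph-mon/tasks/crush_rules.yml (paraphrased): inserts the option in place
    - name: add new default crush rule to ceph.conf
      ini_file:
        dest: /etc/ceph/ceph.conf
        section: global
        option: osd_pool_default_crush_rule
        value: "1"                      # the rule ID looked up earlier in the play

    # roles/ceph-config (paraphrased, task name illustrative): later regenerates the
    # file from ceph.conf.j2 + ceph_conf_overrides, unaware of the in-place edit above
    - name: generate ceph configuration file
      template:
        src: ceph.conf.j2
        dest: /etc/ceph/ceph.conf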

So when I try with mon and mgr collocated on the same host, I hit the same issue as Greg.
I suppose this comes from a TripleO environment on his side. Can you confirm this point?

When I try with mon and mgr on dedicated hosts, this works as expected:

# ceph osd pool create get_schwifty 64
pool 'get_schwifty' created
# ceph osd pool get get_schwifty crush_rule
crush_rule: standard
# ceph osd crush rule ls
replicated_rule
standard
# ceph --admin-daemon /var/run/ceph/ceph-mon.mon0.asok config get osd_pool_default_crush_rule
{
    "osd_pool_default_crush_rule": "1"
}
# grep osd_pool_default_crush_rule /etc/ceph/ceph.conf 
osd_pool_default_crush_rule = 1


[1] https://github.com/ceph/ceph-ansible/blob/stable-3.2/roles/ceph-mon/tasks/crush_rules.yml
[2] https://github.com/ceph/ceph-ansible/blob/stable-3.2/roles/ceph-config/tasks/main.yml#L173-L189

Comment 4 Gregory Charot 2019-03-06 10:16:55 UTC
Yes, it comes from a TripleO environment. We talked about it with Seb some time ago, and it indeed seems the config gets erased when the manager config gets written. IIRC the outcome was to avoid using an "ini_file" task.
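
Note: until the ini_file approach is reworked, one possible workaround (a sketch, not the fix that eventually shipped) is to pin the option through ceph_conf_overrides, which the ceph-config template always merges back into the generated ceph.conf. The value must match the rule ID reported by "ceph osd crush rule dump standard":

    ceph_conf_overrides:
      global:
        osd_pool_default_crush_rule: 1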

Comment 8 Vasishta 2019-04-24 09:15:43 UTC
Working fine with ceph-ansible-3.2.13-1.el7cp.noarch
Moving to VERIFIED state.

Comment 10 errata-xmlrpc 2019-04-30 15:56:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:0911