Bug 1714227
| Summary: | [RFE][Backport] Add Support for a Second Ceph Storage Tier deployment capability through director | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Gregory Charot <gcharot> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Giulio Fidente <gfidente> |
| Status: | CLOSED ERRATA | QA Contact: | Yogev Rabl <yrabl> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 13.0 (Queens) | CC: | dcadzow, johfulto, mburns, mgeary, sputhenp, yrabl |
| Target Milestone: | z7 | Keywords: | FeatureBackport, TestOnly, Triaged, ZStream |
| Target Release: | 13.0 (Queens) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-tripleo-heat-templates-8.0.0-0.20180103192340.el7ost puppet-tripleo-8.1.1-0.20180102165828.el7ost | Doc Type: | Enhancement |
| Doc Text: | This update adds support for a second Ceph storage tier deployment capability through director. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-07-10 13:05:31 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1420861 | | |
| Bug Blocks: | 1671061 | | |
Description
Gregory Charot
2019-05-27 12:57:34 UTC
The necessary code changes for puppet-tripleo and tripleo-heat-templates already landed in OSP13, as per BZ #1309550.

Tiering of the Ceph pools can also be configured after the overcloud deployment using device classes. For example, assuming operators use the tripleo parameter "CinderRbdExtraPools", as per [1], to create an additional "tier2" pool, it can later be assigned to a specific (ssd) device class with:

# ceph osd crush rule create-replicated faster default host ssd
# ceph osd pool set tier2 crush_rule faster

I think the only piece missing for OSP13 is to backport the docs; what was added in OSP14 via BZ #1654792 should be added to the OSP13 docs as well. The docs change is tracked by BZ #1671061.

1. https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14/html-single/deploying_an_overcloud_with_containerized_red_hat_ceph/index#proc_ceph-configuring-block-storage-use-new-pool-assembly_ceph-second-tier-storage

According to our records, this should be resolved by openstack-tripleo-heat-templates-8.3.1-18.el7ost. This build is available now.

According to our records, this should be resolved by puppet-tripleo-8.4.1-5.el7ost. This build is available now.

The deployment failed with the error:
2019-06-20 15:27:16,798 p=25230 u=mistral | failed: [192.168.24.6 -> 192.168.24.23] (item=[{u'application': u'rbd', u'pg_num': 64, u'name': u'vms', u'rule_na
me': u'standard'}, {'_ansible_parsed': True, 'stderr_lines': [u"Error ENOENT: unrecognized pool 'vms'"], u'cmd': [u'docker', u'exec', u'ceph-mon-controller-0'
, u'ceph', u'--cluster', u'ceph', u'osd', u'pool', u'get', u'vms', u'size'], u'end': u'2019-06-20 19:27:12.114155', '_ansible_no_log': False, '_ansible_delega
ted_vars': {'ansible_delegated_host': u'192.168.24.23', 'ansible_host': u'192.168.24.23'}, '_ansible_item_result': True, u'changed': True, u'invocation': {u'm
odule_args': {u'creates': None, u'executable': None, u'_uses_shell': False, u'_raw_params': u'docker exec ceph-mon-controller-0 ceph --cluster ceph osd pool g
et vms size', u'removes': None, u'argv': None, u'warn': True, u'chdir': None, u'stdin': None}}, u'stdout': u'', 'item': {u'application': u'rbd', u'pg_num': 64
, u'name': u'vms', u'rule_name': u'standard'}, u'delta': u'0:00:00.442574', '_ansible_item_label': {u'application': u'rbd', u'pg_num': 64, u'name': u'vms', u'
rule_name': u'standard'}, u'stderr': u"Error ENOENT: unrecognized pool 'vms'", u'rc': 2, u'msg': u'non-zero return code', 'stdout_lines': [], 'failed_when_res
ult': False, u'start': u'2019-06-20 19:27:11.671581', '_ansible_ignore_errors': None, u'failed': False}]) => {"changed": false, "cmd": ["docker", "exec", "cep
h-mon-controller-0", "ceph", "--cluster", "ceph", "osd", "pool", "create", "vms", "64", "64", "standard", "1"], "delta": "0:00:00.423014", "end": "2019-06-20
19:27:16.752713", "item": [{"application": "rbd", "name": "vms", "pg_num": 64, "rule_name": "standard"}, {"_ansible_delegated_vars": {"ansible_delegated_host"
: "192.168.24.23", "ansible_host": "192.168.24.23"}, "_ansible_ignore_errors": null, "_ansible_item_label": {"application": "rbd", "name": "vms", "pg_num": 64
, "rule_name": "standard"}, "_ansible_item_result": true, "_ansible_no_log": false, "_ansible_parsed": true, "changed": true, "cmd": ["docker", "exec", "ceph-
mon-controller-0", "ceph", "--cluster", "ceph", "osd", "pool", "get", "vms", "size"], "delta": "0:00:00.442574", "end": "2019-06-20 19:27:12.114155", "failed"
: false, "failed_when_result": false, "invocation": {"module_args": {"_raw_params": "docker exec ceph-mon-controller-0 ceph --cluster ceph osd pool get vms si
ze", "_uses_shell": false, "argv": null, "chdir": null, "creates": null, "executable": null, "removes": null, "stdin": null, "warn": true}}, "item": {"applica
tion": "rbd", "name": "vms", "pg_num": 64, "rule_name": "standard"}, "msg": "non-zero return code", "rc": 2, "start": "2019-06-20 19:27:11.671581", "stderr":
"Error ENOENT: unrecognized pool 'vms'", "stderr_lines": ["Error ENOENT: unrecognized pool 'vms'"], "stdout": "", "stdout_lines": []}], "msg": "non-zero retur
n code", "rc": 2, "start": "2019-06-20 19:27:16.329699", "stderr": "Error ENOENT: specified rule standard doesn't exist", "stderr_lines": ["Error ENOENT: spec
ified rule standard doesn't exist"], "stdout": "", "stdout_lines": []}
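For readability, here is the failing command and its stderr extracted from the wrapped JSON log above (no new information, just the relevant fields): the task was creating the "vms" pool against a crush rule named "standard" that did not exist on the cluster at that point, and the nested item shows the preceding `osd pool get vms size` check had already returned "unrecognized pool 'vms'".

    cmd:    docker exec ceph-mon-controller-0 ceph --cluster ceph osd pool create vms 64 64 standard 1
    stderr: Error ENOENT: specified rule standard doesn't exist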
The crush map configuration is:
CephAnsibleExtraConfig:
  create_crush_tree: true
  crush_rules:
    - name: standard
      root: standard_root
      type: rack
      default: true
    - name: fast
      root: fast_root
      type: rack
      default: false
CephPools:
  - name: tier2
    pg_num: 64
    rule_name: fast
    application: rbd
  - name: volumes
    pg_num: 64
    rule_name: standard
    application: rbd
  - name: vms
    pg_num: 64
    rule_name: standard
    application: rbd
  - name: backups
    pg_num: 64
    rule_name: standard
    application: rbd
  - name: images
    pg_num: 64
    rule_name: standard
    application: rbd
  - name: metrics
    pg_num: 64
    rule_name: standard
    application: openstack_gnocchi
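The rule_name values above must match crush rules that ceph-ansible actually creates. As a quick sanity check (an editorial sketch, not commands captured in this report; the pool and rule names are taken from the configuration above), the existing rules and the pool-to-rule mappings can be listed from a controller:

[root@controller-0 ~]# ceph osd crush rule ls                 # rules that actually exist; 'standard' and 'fast' should be listed
[root@controller-0 ~]# ceph osd crush rule dump fast          # inspect the 'fast' rule, including its root and failure domain
[root@controller-0 ~]# ceph osd pool get tier2 crush_rule     # confirm which rule an existing pool is mapped to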
Can you run `ceph osd dump`? Were the pools created? Can you paste your NodeDataLookup param?

1. Heat environment input: http://ix.io/1Mqz
2. TripleO generated inventory: http://ix.io/1Mr0

The NodeDataLookup param was quoted instead of being passed as JSON, as per #1. It looks like it was translated into the inventory as per #2. You can also see the crush_rules in the inventory in #1.

No pools were created. The deployment failed because it tried to create the pool with a rule that didn't yet exist. For example, here I'm re-running the tasks from the ansible log which failed:

[root@controller-0 ~]# docker exec ceph-mon-controller-0 ceph --cluster ceph osd pool create tier2 64 64 fast 1
Error ENOENT: specified rule fast doesn't exist
[root@controller-0 ~]# docker exec ceph-mon-controller-0 ceph --cluster ceph osd pool get tier2 size
Error ENOENT: unrecognized pool 'tier2'
[root@controller-0 ~]#

No fast rule:

[root@controller-0 ~]# ceph osd crush rule ls
replicated_rule
[root@controller-0 ~]#

So why wasn't the rule created? The deployment was run using ceph-ansible 3.2.15 (with the 3-18 ceph container, not the latest), and these are the tasks that would create the crush rule: https://github.com/ceph/ceph-ansible/blob/v3.2.15/roles/ceph-mon/tasks/crush_rules.yml

None of these tasks ran according to the ceph-ansible logs:

[root@undercloud-0 mistral]# cat ceph-install-workflow.log | grep "configure crush hierarchy"
[root@undercloud-0 mistral]# cat ceph-install-workflow.log | grep "create configured crush rules"
[root@undercloud-0 mistral]# cat ceph-install-workflow.log | grep "get id for new default crush rule"
[root@undercloud-0 mistral]# cat ceph-install-workflow.log | grep "set_fact info_ceph_default_crush_rule_yaml"
[root@undercloud-0 mistral]# cat ceph-install-workflow.log | grep "insert new default crush rule into daemon to prevent restart"
[root@undercloud-0 mistral]#

We do see that crush_rules.yml was included, but its tasks were skipped: http://ix.io/1Mrh

So we need to see which condition failed and ask whether it's because we didn't pass something we should have (user error) or because it's a bug in ceph-ansible.

(In reply to Gregory Charot from comment #11)
> Can you run `ceph osd dump`? Were the pools created?

[root@controller-0 ~]# ceph osd dump | curl -F 'f:1=<-' ix.io
http://ix.io/1MrN
[root@controller-0 ~]#

No pools were created:

[root@controller-0 ~]# ceph df | curl -F 'f:1=<-' ix.io
http://ix.io/1MrP
[root@controller-0 ~]#

> Can you paste your NodeDataLookup param?

Heat environment input: http://ix.io/1Mqz

I think this input should be pure JSON, not single-quoted JSON.

(In reply to John Fulton from comment #12)
> So why wasn't the rule created?
> ...
> We do see that crush_rules.yml was included, but its tasks were skipped:
> http://ix.io/1Mrh
>
> So we need to see which condition failed and ask whether it's because we didn't
> pass something we should have (user error) or because it's a bug in ceph-ansible.

https://github.com/ceph/ceph-ansible/blob/v3.2.15/roles/ceph-mon/tasks/main.yml#L35

So the THT from comment #10 should have had:

  crush_rule_config: true

Very true, thanks for spotting that!

Yogev,
I restored your original environment files but added the following under CephAnsibleExtraConfig:
  crush_rule_config: true
I then deleted your overcloud and redeployed and it finished deploying Ceph [1].
I'm setting this bug back to ON_QA so that you may test with the additional 'crush_rule_config: true' parameter.
Note that I used your original string for NodeDataLookup. Thanks to Giulio for spotting the missing crush_rule_config.
[1]
[root@controller-0 ~]# ceph df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED
    330GiB     300GiB     30.3GiB      9.20
POOLS:
    NAME        ID     USED     %USED     MAX AVAIL     OBJECTS
    tier2       1      0B       0         62.9GiB       0
    metrics     2      0B       0         31.5GiB       0
    volumes     3      0B       0         31.5GiB       0
    images      4      0B       0         31.5GiB       0
    backups     5      0B       0         31.5GiB       0
    vms         6      0B       0         31.5GiB       0
[root@controller-0 ~]# ceph osd crush rule ls
replicated_rule
standard
fast
[root@controller-0 ~]#
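Since the crush hierarchy itself is built from the osd_crush_location entries in NodeDataLookup, a complementary check (an editorial sketch; this output was not captured in the report) is to confirm that both roots and their racks were created and that tier2 landed on the fast rule:

[root@controller-0 ~]# ceph osd crush tree                    # should show standard_root and fast_root with their racks
[root@controller-0 ~]# ceph osd pool get tier2 crush_rule     # expected: crush_rule: fast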
Verified that it's working with the following configuration:
CephAnsibleExtraConfig:
  create_crush_tree: true
  crush_rule_config: true
  crush_rules:
    - name: standard
      root: standard_root
      type: rack
      default: true
    - name: fast
      root: fast_root
      type: rack
      default: false
CephPools:
  - name: tier2
    pg_num: 64
    rule_name: fast
    application: rbd
  - name: volumes
    pg_num: 64
    rule_name: standard
    application: rbd
  - name: vms
    pg_num: 64
    rule_name: standard
    application: rbd
  - name: backups
    pg_num: 64
    rule_name: standard
    application: rbd
  - name: images
    pg_num: 64
    rule_name: standard
    application: rbd
  - name: metrics
    pg_num: 64
    rule_name: standard
    application: openstack_gnocchi
CephAnsibleDisksConfig:
  devices:
    - '/dev/vdb'
    - '/dev/vdc'
    - '/dev/vdd'
    - '/dev/vde'
    - '/dev/vdf'
  osd_scenario: lvm
  osd_objectstore: bluestore
  journal_size: 512
NodeDataLookup: '{"d336f6d2-60b7-4a50-82d0-2e43c30e47e8": {"osd_crush_location": {"root": "standard_root", "rack": "rack1_std", "host": "ceph-0"}},"6b17e3d9-f3d1-4888-8687-ad98d77cb44f": {"osd_crush_location": {"root": "standard_root", "rack": "rack2_std", "host": "ceph-1"}},"c9c3dd3e-0980-4994-95fa-6478e87f5752": {"osd_crush_location": {"root": "fast_root", "rack": "rack3_std", "host": "ceph-2"}},"58f926d8-5d97-4051-9f31-e76c6b435255": {"osd_crush_location": {"root": "fast_root", "rack": "rack1_fast", "host": "ceph-3"}},"fa18be32-e9e5-4bb2-ac83-4c83c497b9b2": {"osd_crush_location": {"root": "fast_root", "rack": "rack2_fast", "host": "ceph-4"}},"0dee4a82-64cb-41cd-939e-39b784317cab": {"osd_crush_location": {"root": "fast_root", "rack": "rack3_fast", "host": "ceph-5"}}}'
CinderRbdExtraPools:
  - tier2
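As noted in the description, a pool created through CinderRbdExtraPools can also be re-pinned to a device class after deployment. A minimal sketch reusing the commands from the description (the 'faster' rule name and the presence of ssd-class OSDs are assumptions, not part of this verified setup):

# ceph osd crush rule create-replicated faster default host ssd   # replicated rule restricted to the ssd device class
# ceph osd pool set tier2 crush_rule faster                       # move the tier2 pool onto that rule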
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1738