Bug 1539852
| Field | Value |
|---|---|
| Summary | [OSP13][Deployment] Overcloud deployment fails during ControllerDeployment_Step4; ceph fails "ObjectNotFound: error opening pool 'metrics'" |
| Product | Red Hat OpenStack |
| Component | openstack-tripleo-heat-templates |
| Version | 13.0 (Queens) |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Reporter | Omri Hochman <ohochman> |
| Assignee | John Fulton <johfulto> |
| QA Contact | Yogev Rabl <yrabl> |
| CC | flucifre, gfidente, johfulto, jomurphy, mburns, rhel-osp-director-maint, sasha, yrabl |
| Target Milestone | beta |
| Target Release | 13.0 (Queens) |
| Keywords | Triaged |
| Fixed In Version | openstack-tripleo-heat-templates-8.0.0-0.20180215092254 |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| Last Closed | 2018-06-27 13:43:23 UTC |
| Bug Depends On | 1545383 |
Description
Omri Hochman
2018-01-29 18:05:57 UTC
The problem is how TripleO sets up the ceph-ansible deployment:

- the ceph-ansible call has no input via extra vars [1]
- the inventory has no input [2]
- thus no arguments were passed to ceph-ansible
- thus ceph-ansible skipped all of its tasks (no input provided) and the playbook run returned no error [3]

[1]

```
2018-01-28 11:27:28.996 31461 DEBUG oslo_concurrency.processutils [req-1b57e863-20e4-414b-8b0c-62d37514f64b f8716113cb2d44259eeebf97c3570146 d2ad266cecf9419f9fd906d2c916d998 - default default] CMD "ansible-playbook /usr/share/ceph-ansible/site-docker.yml.sample --user tripleo-admin --become --become-user root --inventory-file /tmp/ansible-mistral-action60aVA0/inventory.yaml --private-key /tmp/ansible-mistral-action60aVA0/ssh_private_key --skip-tags package-install,with_pkg" returned: 0 in 873.287s execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:409
```

[2] As per https://github.com/fultonj/tripleo-ceph-ansible/blob/master/get-inventory.sh, the inventory from 2018-01-28 16:27:30:

```json
{
  "mgr_ips": ["192.168.0.8", "192.168.0.19", "192.168.0.17"],
  "mon_ips": ["192.168.0.8", "192.168.0.19", "192.168.0.17"],
  "mds_ips": [],
  "osd_ips": ["192.168.0.16", "192.168.0.12", "192.168.0.13"],
  "rbdmirror_ips": [],
  "rgw_ips": [],
  "client_ips": ["192.168.0.15"],
  "nfs_ips": []
}
```

[3]

```
2018-01-28 11:16:15,456 p=3697 u=mistral | TASK [ceph-mon : create openstack pool(s)] *************************************
2018-01-28 11:16:15,513 p=3697 u=mistral | skipping: [192.168.0.8] => (item={u'rule_name': u'', u'pg_num': 128, u'name': u'images'})
2018-01-28 11:16:15,536 p=3697 u=mistral | skipping: [192.168.0.8] => (item={u'rule_name': u'', u'pg_num': 128, u'name': u'metrics'})
2018-01-28 11:16:15,559 p=3697 u=mistral | skipping: [192.168.0.8] => (item={u'rule_name': u'', u'pg_num': 128, u'name': u'backups'})
2018-01-28 11:16:15,581 p=3697 u=mistral | skipping: [192.168.0.8] => (item={u'rule_name': u'', u'pg_num': 128, u'name': u'vms'})
2018-01-28 11:16:15,600 p=3697 u=mistral | skipping: [192.168.0.8] => (item={u'rule_name': u'', u'pg_num': 128, u'name': u'volumes'})
```

How parameters are passed from Mistral to ceph-ansible was changed recently [1]. I suspect a patch is missing in the puddle, so I want to get to the bottom of that next. In short, if params are not passed via extra-vars (A), then they need to be passed via the inventory (B). It seems we have A but not B, and that's the problem. We need to get B into the puddle.

[1] https://review.openstack.org/#/c/528755/1/workbooks/ceph-ansible.yaml

Reproducing environment:

```
ansible-tripleo-ipsec-0.0.1-0.20180119094817.5e80d4f.el7ost.noarch
openstack-tripleo-common-8.3.1-0.20180123050218.el7ost.noarch
instack-undercloud-8.1.1-0.20180117134321.el7ost.noarch
openstack-tripleo-puppet-elements-8.0.0-0.20180117092204.120eca8.el7ost.noarch
openstack-tripleo-common-containers-8.3.1-0.20180123050218.el7ost.noarch
openstack-tripleo-image-elements-8.0.0-0.20180117094122.02d0985.el7ost.noarch
puppet-tripleo-8.2.0-0.20180122224519.9fd3379.el7ost.noarch
openstack-tripleo-heat-templates-8.0.0-0.20180122224016.el7ost.noarch
openstack-tripleo-validations-8.1.1-0.20180119231917.2ff3c79.el7ost.noarch
openstack-tripleo-ui-8.1.1-0.20180122135122.aef02d8.el7ost.noarch
python-tripleoclient-9.0.1-0.20180119233147.el7ost.noarch
```

(In reply to John Fulton from comment #1)
> The problem is how tripleo set up the ceph-ansible deployment.
>
> - ceph-ansible call has no input via extra vars [1]
> - inventory has no input [2]

I was wrong. My get-inventory script was pulling the IP group, BUT that wasn't necessarily the inventory that was passed. I observed the same evidence on my own system but found, by examining each Mistral task, that the correct parameters were passed:

```
(undercloud) [stack@hci-director ~]$ mistral task-get-result $TASK_ID | jq . | sed -e 's/\\n/\n/g' -e 's/\\"/"/g' | head | curl -F 'f:1=<-' ix.io
http://ix.io/EXi
(undercloud) [stack@hci-director ~]$
```

> [2] As per
> https://github.com/fultonj/tripleo-ceph-ansible/blob/master/get-inventory.sh:
> [...]

Created attachment 1388719 [details]
output of journalctl CONTAINER_NAME=ceph-mon-overcloud-controller-i (for i in 0,1,2)
It seems that the TripleO-to-ceph-ansible integration is working for this issue: the correct commands were run, and Ansible indicates that they succeeded. The problem seems internal to the Ceph cluster; it received the requests to create the pools but didn't create all of them.
The attached ceph monitor logs show that the monitors did receive the request to create the pools, e.g. there is a line like this for every pool on one of the 3 mons:
```
Jan 30 20:40:15 overcloud-controller-0 dockerd-current[20108]: 2018-01-30 20:40:15.618318 7fa8cc16c700 0 log_channel(audit) log [INF] : from='client.? 10.19.95.14:0/299906719' entity='client.admin' cmd=[{"prefix": "osd pool create", "pg_num": 128, "pool": "volumes"}]: dispatch
```
However, only one of these lines ends with "completed" in place of "dispatch", and that is the one for the images pool.
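A quick way to compare the two outcomes is to count the pool-create audit entries by their final word. The sample lines below stand in for the real monitor log (on a controller, something like `journalctl CONTAINER_NAME=ceph-mon-overcloud-controller-0` would supply them; the container name is an assumption for this environment):

```shell
# Count how many "osd pool create" audit entries were merely dispatched
# versus actually completed (sample data in place of journalctl output).
log='cmd=[{"prefix": "osd pool create", "pool": "images"}]: dispatch
cmd=[{"prefix": "osd pool create", "pool": "images"}]: completed
cmd=[{"prefix": "osd pool create", "pool": "metrics"}]: dispatch'

printf '%s\n' "$log" | grep -c 'dispatch$'    # lines ending in "dispatch"  -> 2
printf '%s\n' "$log" | grep -c 'completed$'   # lines ending in "completed" -> 1
```

A healthy cluster should show one "completed" per "dispatch"; here only the images pool completed.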
ceph-ansible logs an "ok" for each of these requests since, from Ansible's point of view, the command was run on the Ceph cluster, which implied the pool would be created. It just seems the cluster hasn't actually done it:
```
2018-01-24 15:23:05,690 p=20315 u=mistral | ok: [192.168.0.18] => (item={u'rule_name': u'', u'pg_num': 128, u'name': u'images'}) => {
2018-01-24 15:23:08,625 p=20315 u=mistral | ok: [192.168.0.18] => (item={u'rule_name': u'', u'pg_num': 128, u'name': u'metrics'}) => {
```
If you log in to the cluster itself, the pool isn't there, but you can still create it manually:
```
[root@overcloud-controller-0 /]# ceph df
GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    11145G    11144G        646M             0
POOLS:
    NAME       ID     USED     %USED     MAX AVAIL     OBJECTS
    images     1         0         0         3529G           0
[root@overcloud-controller-0 /]# ceph osd pool create metrics 128
pool 'metrics' created
[root@overcloud-controller-0 /]# ceph df
GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    11145G    11144G        648M             0
POOLS:
    NAME        ID     USED     %USED     MAX AVAIL     OBJECTS
    images      1         0         0         3529G           0
    metrics     2         0         0         3529G           0
[root@overcloud-controller-0 /]#
```
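As the next comment explains, the automated pool creations were rejected by a guard new in Ceph Luminous (RHCS 3): the monitor refuses a pool whose placement groups would push the per-OSD PG count over `mon_max_pg_per_osd`. A rough sketch of the arithmetic with this bug's numbers (loosely modeled on the OSDMonitor check; Ceph's real accounting differs in detail):

```shell
# Rough model of the mon_max_pg_per_osd guard, using this bug's numbers.
existing_pgs=128        # the 'images' pool already has 128 PGs
new_pg_num=128          # 'metrics' asks for 128 more
size=3                  # replica count
mon_max_pg_per_osd=200  # RHCS 3 default
num_in_osds=3

projected=$(( (existing_pgs + new_pg_num) * size ))
max_pgs=$(( mon_max_pg_per_osd * num_in_osds ))
if [ "$projected" -gt "$max_pgs" ]; then
  # Mirrors the real message: 768 total pgs exceeds max 600
  echo "Error ERANGE: would mean $projected total pgs, which exceeds max $max_pgs"
fi
```

By the same arithmetic, the workaround below that raises `mon_max_pg_per_osd` to 3072 gives a budget of 3072 × 3 = 9216 PG replicas, comfortably above the 128 × 3 × 7 = 2688 needed for the seven default OpenStack pools.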
The pools were not created, and ansible [1] returned the following message from ceph:

"Error ERANGE: pg_num 128 size 3 would mean 768 total pgs, which exceeds max 600 (mon_max_pg_per_osd 200 * num_in_osds 3)"

The workaround is to change any of the above three variables so that the check in the following function is satisfied when we create, for OpenStack by default, seven pools:

https://github.com/ceph/ceph/blob/e59258943bcfe3e52d40a59ff30df55e1e6a3865/src/mon/OSDMonitor.cc#L5670-L5698

This is new to OSP13 because it uses RHCS3, which has the above feature. The problem is that EVERY OSP13 deployment that doesn't override the defaults will hit this. Here's one workaround which satisfies the function above:

```yaml
parameter_defaults:
  CephPoolDefaultSize: 3
  CephPoolDefaultPgNum: 128
  CephConfigOverrides:
    mon_max_pg_per_osd: 3072
```

In the above case I increased mon_max_pg_per_osd based on rounding up from 128 × 3 × 7 = 2688.

Next steps:

1. Can ceph-ansible catch this earlier with some form of validation? (open RFE)
2. Do we need to change more defaults in OSP's THT (remember [3])?

[1] grep Error /var/log/mistral/ceph-install-workflow.log | grep 128
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1502878#c3
[3] https://review.openstack.org/#/c/506330/

(In reply to John Fulton from comment #7)
> Next steps:
>
> 1. Can ceph-ansible catch this earlier with some form of validation (open
> RFE)

https://bugzilla.redhat.com/show_bug.cgi?id=1541152

> 2. Do we need to change more defaults in OSP's THT (remember [3])?

On the agenda at the next DFG:Ceph stand-up call.

(In reply to John Fulton from comment #8)
> (In reply to John Fulton from comment #7)
> > 2. Do we need to change more defaults in OSP's THT (remember [3])?
>
> On the agenda at the next DFG:Ceph stand-up call.

We discussed this today and have the following plan:

1. THT's low-memory-usage.yaml [1] fits this pattern, and we will put something like the workaround from comment #7 there.
2. If, as per yrabl, our docs require a minimum of 3 OSDs for OSP, then the OSP13 version of those docs needs an update to be consistent with RHCS3, which requires 5. (needinfo to Federico so he can check the reasoning behind 3.)

The above is based on the following reasoning: the defaults should fit the minimum supported production deployment, and those testing with less than that should have an easy way to override them, provided they understand it's not for production.

Next steps:

- upstream code change to THT low-memory-usage.yaml
- upstream code change to the ceph-ansible sanity check (bz 1541152)
- based on confirmation from Federico, a docbug should be opened so OSP13 gets the new Ceph defaults

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/environments/low-memory-usage.yaml

(In reply to John Fulton from comment #9)
> 2. If as per yrabl, there is a minimum of 3 OSDs required for OSP in our
> docs, then the osp13 version of those docs need an update to be consistent
> w/ RHCS3 which is 5. (needinfo to Federico so he can check on reasoning
> behind 3).

As per a conversation with yrabl on IRC, the docs do not require a minimum of 3 OSDs; they require a minimum of 3 Ceph storage servers.

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/11/html/director_installation_and_usage/chap-requirements#sect-Environment_Requirements

> Next steps:
> - upstream code change to THT low-memory-usage.yaml
> - upstream code change to ceph-ansible on sanity check (bz 1541152)
> - based on confirmation from Federico, docbug should be opened to have osp13
> have new ceph defaults

There is no need for confirmation from Federico (clearing needinfo). I will open a docbug for OSP13 to have the new Ceph defaults.

To not run into this issue, either:

1. Use hardware that complies with Ceph recommended practices, or
2.
Override the defaults if you are using a development or test-only environment.

To make #2 easier, simply use '-e environments/low-memory-usage.yaml' with your deployment; after the proposed change to this file merges, the issue should go away.

https://review.openstack.org/#/c/544588/

(In reply to John Fulton from comment #9)
> - [...] docbug should be opened to have osp13 have new ceph defaults

https://bugzilla.redhat.com/show_bug.cgi?id=1545383

(In reply to John Fulton from comment #8)
> (In reply to John Fulton from comment #7)
> > Next steps:
> >
> > 1. Can ceph-ansible catch this earlier with some form of validation (open
> > RFE)
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1541152

Validations are still desirable, but if ceph-ansible had failed when it failed to create the pool, that would also have addressed this issue, as per the BZ below.

https://bugzilla.redhat.com/show_bug.cgi?id=1546185

Environment: openstack-tripleo-heat-templates-8.0.0-0.20180304031148.el7ost.noarch

Encountered:

```
["Error ERANGE: pg_num 64 size 3 would mean 768 total pgs, which exceeds max 600 (mon_max_pg_per_osd 200 * num_in_osds 3)"], "stdout": "", "stdout_lines": []}
```

Was able to work around it by including /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml with the deployment.

Unable to reproduce with: openstack-tripleo-heat-templates-8.0.2-0.20180327213843.f25e2d8.el7ost.noarch

*** Bug 1562172 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086
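For reference, the workaround used in the verification above, including the low-memory-usage environment file at deploy time, looks roughly like the sketch below. The flags and other environment files of a real deployment are elided; only the `-e` argument is the point here:

```shell
# Sketch: deploy the overcloud with the low-memory-usage overrides
# (standard OSP13 template path; add your own -e files as needed).
openstack overcloud deploy --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml
```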