Bug 1539852

Summary: [OSP13][Deployment] Overcloud deployment fails during ControllerDeployment_Step4; ceph fails with "ObjectNotFound: error opening pool 'metrics'"
Product: Red Hat OpenStack
Reporter: Omri Hochman <ohochman>
Component: openstack-tripleo-heat-templates
Assignee: John Fulton <johfulto>
Status: CLOSED ERRATA
QA Contact: Yogev Rabl <yrabl>
Severity: high
Priority: high
Docs Contact:
Version: 13.0 (Queens)
CC: flucifre, gfidente, johfulto, jomurphy, mburns, rhel-osp-director-maint, sasha, yrabl
Target Milestone: beta
Keywords: Triaged
Target Release: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-8.0.0-0.20180215092254
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-06-27 13:43:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1545383
Bug Blocks:

Attachments:
output of journalctl CONTAINER_NAME=ceph-mon-overcloud-controller-i (for i in 0,1,2) [flags: none]

Description Omri Hochman 2018-01-29 18:05:57 UTC
[OSP13][Deployment] Overcloud deployment fails during ControllerDeployment_Step4 with ceph: "ObjectNotFound: error opening pool 'metrics'"

Environment:
-------------
openstack-tripleo-heat-templates-8.0.0-0.20180103192341.el7ost.noarch
openstack-tripleo-ui-7.4.3-4.el7ost.noarch
openstack-tripleo-common-containers-8.3.1-0.20180103233643.el7ost.noarch
openstack-tripleo-puppet-elements-8.0.0-0.20171228195253.002c4ca.el7ost.noarch
openstack-tripleo-common-8.3.1-0.20180103233643.el7ost.noarch
python-tripleoclient-8.1.1-0.20171231084755.el7ost.noarch
openstack-tripleo-validations-8.1.1-0.20171221173840.ac39a91.el7ost.noarch
puppet-tripleo-8.1.1-0.20180102165828.el7ost.noarch
openstack-tripleo-image-elements-8.0.0-0.20180103224254.aad6322.el7ost.noarch


Completed upload for docker image docker-registry.engineering.redhat.com/rhosp13/openstack-gnocchi-api:2018-01-22.1
imagename: docker-registry.engineering.redhat.com/rhosp13/openstack-gnocchi-metricd:2018-01-22.1

Steps:
--------
- Attempt to deploy OSP13 (on bare metal):
  3 controllers, 1 compute, 3 ceph nodes

(undercloud) [stack@undercloud74 ~]$ echo -e `heat deployment-show 37bc65f5-6986-4f27-b583-986333b648a4`|grep -i error
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:344: SubjectAltNameWarning: Certificate for 192.168.0.2 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:344: SubjectAltNameWarning: Certificate for 192.168.0.2 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
 \"Error running ['docker', 'run', '--name', 'gnocchi_db_sync', '--label', 'config_id=tripleo_step4', '--label', 'container_name=gnocchi_db_sync', '--label', 'managed_by=paunch', '--label', 'config_data={\\"environment\\": [\\"KOLLA_CONFIG_STRATEGY=COPY_ALWAYS\\", \\"TRIPLEO_CONFIG_HASH=1a569d012dc804939398b671bf257703\\"], \\"user\\": \\"root\\", \\"volumes\\": [\\"/etc/hosts:/etc/hosts:ro\\", \\"/etc/localtime:/etc/localtime:ro\\", \\"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\\", \\"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\\", \\"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\\", \\"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\\", \\"/dev/log:/dev/log\\", \\"/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro\\", \\"/etc/puppet:/etc/puppet:ro\\", \\"/var/lib/kolla/config_files/gnocchi_db_sync.json:/var/lib/kolla/config_files/config.json:ro\\", \\"/var/lib/config-data/puppet-generated/gnocchi/:/var/lib/kolla/config_files/src:ro\\", \\"/var/log/containers/gnocchi:/var/log/gnocchi\\", \\"/var/log/containers/httpd/gnocchi-api:/var/log/httpd\\", \\"/etc/ceph:/var/lib/kolla/config_files/src-ceph:ro\\"], \\"image\\": \\"192.168.0.1:8787/rhosp13/openstack-gnocchi-api:13.0-20180112.1\\", \\"detach\\": false, \\"net\\": \\"host\\", \\"privileged\\": false}', '--env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS', '--env=TRIPLEO_CONFIG_HASH=1a569d012dc804939398b671bf257703', '--net=host', '--privileged=false', '--user=root', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', '--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume=/dev/log:/dev/log', '--volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro', '--volume=/etc/puppet:/etc/puppet:ro', '--volume=/var/lib/kolla/config_files/gnocchi_db_sync.json:/var/lib/kolla/config_files/config.json:ro', '--volume=/var/lib/config-data/puppet-generated/gnocchi/:/var/lib/kolla/config_files/src:ro', '--volume=/var/log/containers/gnocchi:/var/log/gnocchi', '--volume=/var/log/containers/httpd/gnocchi-api:/var/log/httpd', '--volume=/etc/ceph:/var/lib/kolla/config_files/src-ceph:ro', '192.168.0.1:8787/rhosp13/openstack-gnocchi-api:13.0-20180112.1']. [1]\",
 \"ObjectNotFound: error opening pool 'metrics'\",
(undercloud) [stack@undercloud74 ~]$
(undercloud) [stack@undercloud74 ~]$
(undercloud) [stack@undercloud74 ~]$ openstack stack list
+--------------------------------------+------------+----------------------------------+---------------+----------------------+--------------+
| ID                                   | Stack Name | Project                          | Stack Status  | Creation Time        | Updated Time |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+--------------+
| 3b94d14f-b2cf-4fbc-9cff-e4533293c1a3 | overcloud  | d2ad266cecf9419f9fd906d2c916d998 | CREATE_FAILED | 2018-01-28T15:25:13Z | None         |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+--------------+


(undercloud) [stack@undercloud74 ~]$ openstack stack failures list overcloud
overcloud.AllNodesDeploySteps.ControllerDeployment_Step4.1:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: 414f185b-cacc-410e-9132-fcc96665df26
  status: CREATE_FAILED
  status_reason: |
    Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
            "b2d7a40a9667: Download complete",
            "b2d7a40a9667: Pull complete",
            "Digest: sha256:041c65774210c6eba133bc5b87ad90cf1654d40e1a04d58fde9bd8ccb0950040"
        ]
    }
        to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/ce929c1c-7d55-4da5-9358-55509bd82043_playbook.retry

    PLAY RECAP *********************************************************************
    localhost                  : ok=6    changed=2    unreachable=0    failed=1

    (truncated, view all with --long)
  deploy_stderr: |

overcloud.AllNodesDeploySteps.ControllerDeployment_Step4.0:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: 37bc65f5-6986-4f27-b583-986333b648a4
  status: CREATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
            "b2d7a40a9667: Download complete",
            "b2d7a40a9667: Pull complete",
            "Digest: sha256:041c65774210c6eba133bc5b87ad90cf1654d40e1a04d58fde9bd8ccb0950040"
        ]
    }
        to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/734e8770-f29b-4e1a-a607-91ff888f1aa0_playbook.retry

    PLAY RECAP *********************************************************************
    localhost                  : ok=6    changed=2    unreachable=0    failed=1

    (truncated, view all with --long)
  deploy_stderr: |

overcloud.AllNodesDeploySteps.ControllerDeployment_Step4.2:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: dbd3a923-0b1b-45c0-98f4-8d8cb1cc77dd
  status: CREATE_FAILED
  status_reason: |
    Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
            "b2d7a40a9667: Download complete",
            "b2d7a40a9667: Pull complete",
            "Digest: sha256:041c65774210c6eba133bc5b87ad90cf1654d40e1a04d58fde9bd8ccb0950040"
        ]
    }
        to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/57e17b78-eb77-4b5a-aac7-b84123fc0495_playbook.retry

    PLAY RECAP *********************************************************************
    localhost                  : ok=6    changed=2    unreachable=0    failed=1

    (truncated, view all with --long)
  deploy_stderr: |

(undercloud) [stack@undercloud74 ~]$


(undercloud) [stack@undercloud74 ~]$ heat resource-list overcloud -n 5 | grep -v COMPLETE
WARNING (shell) "heat resource-list" is deprecated, please use "openstack stack resource list" instead
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:344: SubjectAltNameWarning: Certificate for 192.168.0.2 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:344: SubjectAltNameWarning: Certificate for 192.168.0.2 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
+-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+
| resource_name                             | physical_resource_id                                                                                                                                                                 | resource_type                                                                                                                   | resource_status | updated_time         | stack_name                                                                                                                                               |
+-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+
| AllNodesDeploySteps                       | cb312b95-a3be-42cb-af50-23f029096d6e                                                                                                                                                 | OS::TripleO::PostDeploySteps                                                                                                    | CREATE_FAILED   | 2018-01-28T15:25:17Z | overcloud                                                                                                                                                |
| ControllerDeployment_Step4                | c472797c-461f-4e6a-b969-5e52943b31bd                                                                                                                                                 | OS::TripleO::DeploymentSteps                                                                                                    | CREATE_FAILED   | 2018-01-28T15:52:40Z | overcloud-AllNodesDeploySteps-a3so2x6cjanc                                                                                                               |
| 0                                         | 37bc65f5-6986-4f27-b583-986333b648a4                                                                                                                                                 | OS::Heat::StructuredDeployment                                                                                                  | CREATE_FAILED   | 2018-01-28T16:56:01Z | overcloud-AllNodesDeploySteps-a3so2x6cjanc-ControllerDeployment_Step4-ut5m7bq57jzc                                                                       |
| 1                                         | 414f185b-cacc-410e-9132-fcc96665df26                                                                                                                                                 | OS::Heat::StructuredDeployment                                                                                                  | CREATE_FAILED   | 2018-01-28T16:56:01Z | overcloud-AllNodesDeploySteps-a3so2x6cjanc-ControllerDeployment_Step4-ut5m7bq57jzc                                                                       |
| 2                                         | dbd3a923-0b1b-45c0-98f4-8d8cb1cc77dd                                                                                                                                                 | OS::Heat::StructuredDeployment                                                                                                  | CREATE_FAILED   | 2018-01-28T16:56:01Z | overcloud-AllNodesDeploySteps-a3so2x6cjanc-ControllerDeployment_Step4-ut5m7bq57jzc                                                                       |
+-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+

Comment 1 John Fulton 2018-01-29 18:09:59 UTC
The problem is how tripleo set up the ceph-ansible deployment. 

- the ceph-ansible call had no input via extra vars [1]
- the inventory had no input [2]
- thus no arguments were passed to ceph-ansible
- thus ceph-ansible, given no input, skipped all of its tasks and the playbook run returned no error [3]

[1]
2018-01-28 11:27:28.996 31461 DEBUG oslo_concurrency.processutils [req-1b57e863-20e4-414b-8b0c-62d37514f64b f8716113cb2d44259eeebf97c3570146 d2ad266cecf9419f9fd906d2c916d998 - default default] CMD "ansible-playbook /usr/share/ceph-ansible/site-docker.yml.sample --user tripleo-admin --become --become-user root --inventory-file /tmp/ansible-mistral-action60aVA0/inventory.yaml --private-key /tmp/ansible-mistral-action60aVA0/ssh_private_key --skip-tags package-install,with_pkg" returned: 0 in 873.287s execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:409


[2]
As per https://github.com/fultonj/tripleo-ceph-ansible/blob/master/get-inventory.sh:

Inventory from 2018-01-28 16:27:30

{
    "mgr_ips": [
        "192.168.0.8", 
        "192.168.0.19", 
        "192.168.0.17"
    ], 
    "mon_ips": [
        "192.168.0.8", 
        "192.168.0.19", 
        "192.168.0.17"
    ], 
    "mds_ips": [], 
    "osd_ips": [
        "192.168.0.16", 
        "192.168.0.12", 
        "192.168.0.13"
    ], 
    "rbdmirror_ips": [], 
    "rgw_ips": [], 
    "client_ips": [
        "192.168.0.15"
    ], 
    "nfs_ips": []
}

[3]

2018-01-28 11:16:15,456 p=3697 u=mistral |  TASK [ceph-mon : create openstack pool(s)] *************************************
2018-01-28 11:16:15,513 p=3697 u=mistral |  skipping: [192.168.0.8] => (item={u'rule_name': u'', u'pg_num': 128, u'name': u'images'}) 
2018-01-28 11:16:15,536 p=3697 u=mistral |  skipping: [192.168.0.8] => (item={u'rule_name': u'', u'pg_num': 128, u'name': u'metrics'}) 
2018-01-28 11:16:15,559 p=3697 u=mistral |  skipping: [192.168.0.8] => (item={u'rule_name': u'', u'pg_num': 128, u'name': u'backups'}) 
2018-01-28 11:16:15,581 p=3697 u=mistral |  skipping: [192.168.0.8] => (item={u'rule_name': u'', u'pg_num': 128, u'name': u'vms'}) 
2018-01-28 11:16:15,600 p=3697 u=mistral |  skipping: [192.168.0.8] => (item={u'rule_name': u'', u'pg_num': 128, u'name': u'volumes'})
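
For the record, a quick way to retrace the above on the undercloud (a sketch; <execution-id> is a placeholder for an ID from your own environment, and reading the mistral log may require sudo):

 (undercloud) $ sudo grep 'ansible-playbook' /var/log/mistral/ceph-install-workflow.log
 (undercloud) $ mistral execution-list | grep -i ceph
 (undercloud) $ mistral task-list <execution-id>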

Comment 2 John Fulton 2018-01-29 18:14:42 UTC
How parameters are passed from Mistral to ceph-ansible was changed recently [1]. I suspect a patch is missing from the puddle, so I want to get to the bottom of that next. In short, if params are not getting passed via extra-vars (A), then they need to be passed via the inventory (B). It seems we have A but not B, and that's the problem. We need to get B into the puddle.

[1] https://review.openstack.org/#/c/528755/1/workbooks/ceph-ansible.yaml

Comment 4 Alexander Chuzhoy 2018-01-30 14:29:56 UTC
Reproducing:

Environment:
ansible-tripleo-ipsec-0.0.1-0.20180119094817.5e80d4f.el7ost.noarch
openstack-tripleo-common-8.3.1-0.20180123050218.el7ost.noarch
instack-undercloud-8.1.1-0.20180117134321.el7ost.noarch
openstack-tripleo-puppet-elements-8.0.0-0.20180117092204.120eca8.el7ost.noarch
openstack-tripleo-common-containers-8.3.1-0.20180123050218.el7ost.noarch
openstack-tripleo-image-elements-8.0.0-0.20180117094122.02d0985.el7ost.noarch
puppet-tripleo-8.2.0-0.20180122224519.9fd3379.el7ost.noarch
openstack-tripleo-heat-templates-8.0.0-0.20180122224016.el7ost.noarch
openstack-tripleo-validations-8.1.1-0.20180119231917.2ff3c79.el7ost.noarch
openstack-tripleo-ui-8.1.1-0.20180122135122.aef02d8.el7ost.noarch
python-tripleoclient-9.0.1-0.20180119233147.el7ost.noarch

Comment 5 John Fulton 2018-01-31 03:15:16 UTC
(In reply to John Fulton from comment #1)
> The problem is how tripleo set up the ceph-ansible deployment. 
> 
> - ceph-ansible call has no input via extra vars [1]
> - inventory has no input [2]

I was wrong. My get-inventory script was pulling the IP group, BUT that wasn't necessarily the inventory that was passed. I observed the same evidence on my own system but found from examining each mistral task that the correct parameters were passed: 

(undercloud) [stack@hci-director ~]$ mistral task-get-result $TASK_ID | jq . | sed -e 's/\\n/\n/g' -e 's/\\"/"/g' | head | curl -F 'f:1=<-' ix.io 
http://ix.io/EXi
(undercloud) [stack@hci-director ~]$ 

> [2]
> As per
> https://github.com/fultonj/tripleo-ceph-ansible/blob/master/get-inventory.sh:
> 
> Inventory from 2018-01-28 16:27:30
> 
> {
>     "mgr_ips": [
>         "192.168.0.8", 
>         "192.168.0.19", 
>         "192.168.0.17"
>     ], 
>     "mon_ips": [
>         "192.168.0.8", 
>         "192.168.0.19", 
>         "192.168.0.17"
>     ], 
>     "mds_ips": [], 
>     "osd_ips": [
>         "192.168.0.16", 
>         "192.168.0.12", 
>         "192.168.0.13"
>     ], 
>     "rbdmirror_ips": [], 
>     "rgw_ips": [], 
>     "client_ips": [
>         "192.168.0.15"
>     ], 
>     "nfs_ips": []
> }

Comment 6 John Fulton 2018-01-31 03:34:02 UTC
Created attachment 1388719 [details]
output of journalctl CONTAINER_NAME=ceph-mon-overcloud-controller-i (for i in 0,1,2)

It seems that the TripleO-to-ceph-ansible integration relevant to this issue is working, in that the correct commands were run and ansible indicates that they ran successfully. The problem seems internal to the Ceph cluster; it indicated it received the request to create the pools but didn't create all of them.

The attached ceph monitor logs show that the monitors did receive the request to create the pools, e.g. there is a line like this for every pool on one of the 3 mons:

Jan 30 20:40:15 overcloud-controller-0 dockerd-current[20108]: 2018-01-30 20:40:15.618318 7fa8cc16c700  0 log_channel(audit) log [INF] : from='client.? 10.19.95.14:0/299906719' entity='client.admin' cmd=[{"prefix": "osd pool create", "pg_num": 128, "pool": "volumes"}]: dispatch

However, only one of those lines ends with "completed" in place of "dispatch", and that is for the images pool.
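
A hedged way to count those from a controller node (container name pattern as in the attachment above; adjust the node index per host):

 [root@overcloud-controller-0 ~]# journalctl CONTAINER_NAME=ceph-mon-overcloud-controller-0 \
       | grep 'osd pool create' | grep -c completed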

The ceph-ansible log reports each of these requests as OK since, from ansible's point of view, the command was run on the ceph cluster, which implied the pool would be created. It just seems the cluster hasn't done it.

2018-01-24 15:23:05,690 p=20315 u=mistral |  ok: [192.168.0.18] => (item={u'rule_name': u'', u'pg_num': 128, u'name': u'images'}) => {
2018-01-24 15:23:08,625 p=20315 u=mistral |  ok: [192.168.0.18] => (item={u'rule_name': u'', u'pg_num': 128, u'name': u'metrics'}) => {

If you log in to the cluster itself the pool isn't there, but you are still able to create it manually.

[root@overcloud-controller-0 /]# ceph df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED 
    11145G     11144G         646M             0 
POOLS:
    NAME       ID     USED     %USED     MAX AVAIL     OBJECTS 
    images     1         0         0         3529G           0 
[root@overcloud-controller-0 /]# ceph osd pool create metrics 128
pool 'metrics' created
[root@overcloud-controller-0 /]# ceph df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED 
    11145G     11144G         648M             0 
POOLS:
    NAME        ID     USED     %USED     MAX AVAIL     OBJECTS 
    images      1         0         0         3529G           0 
    metrics     2         0         0         3529G           0 
[root@overcloud-controller-0 /]#

Comment 7 John Fulton 2018-02-01 15:33:50 UTC
The pools were not created and ansible [1] returned the following message from ceph:

"Error ERANGE:  pg_num 128 size 3 would mean 768 total pgs, which exceeds max 600 (mon_max_pg_per_osd 200 * num_in_osds 3)"

The workaround is to change any of the above three variables so that the check in the following function is satisfied for the seven pools we create for OpenStack by default:

 https://github.com/ceph/ceph/blob/e59258943bcfe3e52d40a59ff30df55e1e6a3865/src/mon/OSDMonitor.cc#L5670-L5698
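
Working through the numbers in the error above (a simplified sketch of that check; the authoritative logic is in the OSDMonitor.cc lines linked):

 existing=$((128 * 3))   # 'images' was created first: 128 PGs x size 3
 new=$((128 * 3))        # creating 'metrics' would add another 128 PGs x size 3
 max=$((200 * 3))        # mon_max_pg_per_osd (200) * num_in_osds (3)
 echo "projected=$((existing + new)) max=$max"   # projected=768 max=600 -> ERANGE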

This is new to OSP13 because it uses RHCS3, which has the above feature. The consequence is that EVERY OSP13 deployment that doesn't override the defaults will hit this problem.

Here's one workaround which satisfies the function above:

parameter_defaults:
  CephPoolDefaultSize: 3
  CephPoolDefaultPgNum: 128
  CephConfigOverrides:
    mon_max_pg_per_osd: 3072
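
To apply it, put the parameter_defaults above into an environment file and pass it at deploy time (the file name below is just an example):

 (undercloud) $ openstack overcloud deploy --templates \
     -e ~/ceph-pg-workaround.yaml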

In the above case I increased mon_max_pg_per_osd to a round value above (* 128 3 7) = 2688. Next steps:

1. Can ceph-ansible catch this earlier with some form of validation (open RFE)
2. Do we need to change more defaults in OSP's THT (remember [3])?

[1] grep Error /var/log/mistral/ceph-install-workflow.log | grep 128
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1502878#c3
[3] https://review.openstack.org/#/c/506330/

Comment 8 John Fulton 2018-02-01 21:36:30 UTC
(In reply to John Fulton from comment #7)
> Next steps:
> 
> 1. Can ceph-ansible catch this earlier with some form of validation (open
> RFE)

https://bugzilla.redhat.com/show_bug.cgi?id=1541152 

> 2. Do we need to change more defaults in OSP's THT (remember [3])?

On the agenda at the next DFG:Ceph stand-up call.

Comment 9 John Fulton 2018-02-06 15:09:27 UTC
(In reply to John Fulton from comment #8)
> (In reply to John Fulton from comment #7)
> > 2. Do we need to change more defaults in OSP's THT (remember [3])?
> 
> On the agenda a the next DFG:Ceph stand up call.

We discussed this today and have the following plan:

1. THT's low-memory-usage.yaml [1] fits this pattern and we will put something like the workaround from comment #7 there. 

2. If, as per yrabl, our docs require a minimum of 3 OSDs for OSP, then the OSP13 version of those docs needs an update to be consistent with RHCS3, which requires 5. (needinfo to Federico so he can check the reasoning behind 3.)

The above are based on the following reasoning:

- The defaults should fit the minimum supported production deployment, and those testing with less than that should have an easy way to override them, provided they understand it's not for production.

Next steps:
- upstream code change to THT low-memory-usage.yaml
- upstream code change to ceph-ansible on sanity check (bz 1541152)
- based on confirmation from Federico, open a docbug so that OSP13 gets the new ceph defaults


[1] https://github.com/openstack/tripleo-heat-templates/blob/master/environments/low-memory-usage.yaml

Comment 12 John Fulton 2018-02-14 14:23:40 UTC
(In reply to John Fulton from comment #9)
> 2. If as per yrabl, there is a minimum of 3 OSDs required for OSP in our
> docs, then the osp13 version of those docs need an update to be consistent
> w/ RHCS3 which is 5. (needinfo to Federico so he can check on reasoning
> behind 3). 

As per a conversation with yrabl in IRC, the docs do not require a minimum of 3 OSDs; they require a minimum of 3 ceph storage servers.

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/11/html/director_installation_and_usage/chap-requirements#sect-Environment_Requirements

> Next steps:
> - upstream code change to THT low-memory-usage.yaml
> - upstream code change to ceph-ansible on sanity check (bz 1541152)
> - based on confirmation from Federico, docbug should be opened to have osp13
> have new ceph defaults

There is no need for confirmation from Federico (clearing needinfo). I will open a docbug for OSP13 to have the new ceph3 defaults.

Comment 13 John Fulton 2018-02-14 18:30:56 UTC
To avoid running into this issue, either:

1. Use hardware that complies with ceph recommended practices
2. Override the defaults if using a development or test-only environment

To make #2 easier, simply use '-e environments/low-memory-usage.yaml' with your deployment; after the proposed change to this file merges, the issue should go away.

 https://review.openstack.org/#/c/544588/
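
Concretely (a sketch, assuming the default template location on the undercloud):

 (undercloud) $ openstack overcloud deploy --templates \
     -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml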

Comment 14 John Fulton 2018-02-14 19:10:13 UTC
(In reply to John Fulton from comment #9)
> - [...] docbug should be opened to have osp13 have new ceph defaults

https://bugzilla.redhat.com/show_bug.cgi?id=1545383

Comment 15 John Fulton 2018-02-16 14:51:24 UTC
(In reply to John Fulton from comment #8)
> (In reply to John Fulton from comment #7)
> > Next steps:
> > 
> > 1. Can ceph-ansible catch this earlier with some form of validation (open
> > RFE)
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1541152 

Validations are still desirable, but if ceph-ansible had failed when the pool creation failed, that would also have addressed this issue, as per the BZ below.

 https://bugzilla.redhat.com/show_bug.cgi?id=1546185

Comment 17 Alexander Chuzhoy 2018-03-22 14:24:27 UTC
Environment:
openstack-tripleo-heat-templates-8.0.0-0.20180304031148.el7ost.noarch

Encountered:

["Error ERANGE:  pg_num 64 size 3 would mean 768 total pgs, which exceeds max 600 (mon_max_pg_per_osd 200 * num_in_osds 3)"], "stdout": "", "stdout_lines": []}


Was able to work around it by including /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml with the deployment.

Comment 18 Omri Hochman 2018-04-04 02:15:26 UTC
Unable to reproduce with: openstack-tripleo-heat-templates-8.0.2-0.20180327213843.f25e2d8.el7ost.noarch

Comment 19 John Fulton 2018-04-11 12:29:23 UTC
*** Bug 1562172 has been marked as a duplicate of this bug. ***

Comment 21 errata-xmlrpc 2018-06-27 13:43:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086