Bug 1664112
Summary: | Cache size is not created correctly in a hyperconverged installation when using the is_hci flag | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Eliad Cohen <elicohen> |
Component: | Ceph-Ansible | Assignee: | Dimitri Savineau <dsavinea> |
Status: | CLOSED ERRATA | QA Contact: | Vasishta <vashastr> |
Severity: | medium | Docs Contact: | Bara Ancincova <bancinco> |
Priority: | medium | ||
Version: | 3.2 | CC: | anharris, aschoen, ceph-eng-bugs, ceph-qe-bugs, dsavinea, edonnell, elicohen, gabrioux, gfidente, gmeno, jbrier, johfulto, mburrows, nojha, nthomas, nweinber, pasik, pnguyen, tchandra, tserlin |
Target Milestone: | z2 | ||
Target Release: | 3.3 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | RHEL: ceph-ansible-3.2.29-1.el7cp Ubuntu: ceph-ansible_3.2.29-2redhat1 | Doc Type: | Bug Fix |
Doc Text: |
.The value of `osd_memory_target` for HCI deployment is calculated properly
Previously, the calculation of the number of OSDs was not implemented for containerized deployment; the default value was `0`. Consequently, the calculation of the value of the BlueStore `osd_memory_target` option for Hyper-converged infrastructure (HCI) deployment was not correct. With this update, the number of OSDs is reported correctly for containerized deployment, and the value of `osd_memory_target` for the HCI configuration is calculated properly.
|
Story Points: | --- |
Clone Of: | | Environment: |
Last Closed: | 2019-12-19 17:59:09 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1578730, 1726135 |
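
The Doc Text above notes that the OSD count was not implemented for containerized deployments. As a rough illustration of what counting OSDs through the container runtime involves, here is a minimal sketch; the inventory group, container binary, image name, and volume mounts are assumptions, not the actual ceph-ansible tasks:

```yaml
# Illustration only: on a containerized OSD node there is typically no
# ceph-volume binary on the host, so the count has to come from inside a
# ceph container.
- name: count existing OSDs on a containerized node (sketch)
  hosts: osds                                   # assumed inventory group
  gather_facts: false
  vars:
    container_binary: docker                    # assumption; podman on newer hosts
    ceph_docker_image: rhceph/rhceph-3-rhel7    # assumption
  tasks:
    - name: run ceph-volume lvm list inside a ceph container
      command: >
        {{ container_binary }} run --rm --privileged
        -v /dev:/dev -v /etc/ceph:/etc/ceph -v /var/lib/ceph:/var/lib/ceph
        --entrypoint=ceph-volume {{ ceph_docker_image }} lvm list --format json
      register: ceph_volume_list
      changed_when: false

    - name: derive num_osds from the JSON report
      set_fact:
        num_osds: "{{ ceph_volume_list.stdout | from_json | length }}"

    - name: show the value the ceph.conf template would consume
      debug:
        var: num_osds
```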
Description
Eliad Cohen
2019-01-07 17:56:49 UTC
See also: ceph.conf result at: https://pastebin.com/raw/d4ePRFHG

Assigning to Neha since she worked on the initial implementation. Can we have someone from QE reproduce this?

Hi Neha, I think the issue here is that `num_osds` never got defined in the ceph-config role:

2019-01-04 15:30:51,087 p=7977 u=mistral | TASK [ceph-config : count number of osds for ceph-disk scenarios] **************
2019-01-04 15:30:51,088 p=7977 u=mistral | task path: /usr/share/ceph-ansible/roles/ceph-config/tasks/main.yml:16
2019-01-04 15:30:51,088 p=7977 u=mistral | Friday 04 January 2019 15:30:51 -0500 (0:00:00.035) 0:00:58.029 ********
2019-01-04 15:30:51,111 p=7977 u=mistral | skipping: [192.168.24.11] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}
2019-01-04 15:30:51,122 p=7977 u=mistral | TASK [ceph-config : count number of osds for lvm scenario] *********************
2019-01-04 15:30:51,122 p=7977 u=mistral | task path: /usr/share/ceph-ansible/roles/ceph-config/tasks/main.yml:23
2019-01-04 15:30:51,122 p=7977 u=mistral | Friday 04 January 2019 15:30:51 -0500 (0:00:00.034) 0:00:58.064 ********
2019-01-04 15:30:51,145 p=7977 u=mistral | skipping: [192.168.24.11] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}
2019-01-04 15:30:51,156 p=7977 u=mistral | TASK [ceph-config : run 'ceph-volume lvm batch --report' to see how many osds are to be created] ***
2019-01-04 15:30:51,156 p=7977 u=mistral | task path: /usr/share/ceph-ansible/roles/ceph-config/tasks/main.yml:30
2019-01-04 15:30:51,156 p=7977 u=mistral | Friday 04 January 2019 15:30:51 -0500 (0:00:00.033) 0:00:58.098 ********
2019-01-04 15:30:51,177 p=7977 u=mistral | skipping: [192.168.24.11] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}
2019-01-04 15:30:51,188 p=7977 u=mistral | TASK [ceph-config : set_fact num_osds from the output of 'ceph-volume lvm batch --report'] ***
2019-01-04 15:30:51,188 p=7977 u=mistral | task path: /usr/share/ceph-ansible/roles/ceph-config/tasks/main.yml:47
2019-01-04 15:30:51,188 p=7977 u=mistral | Friday 04 January 2019 15:30:51 -0500 (0:00:00.032) 0:00:58.130 ********
2019-01-04 15:30:51,211 p=7977 u=mistral | skipping: [192.168.24.11] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}
2019-01-04 15:30:51,222 p=7977 u=mistral | TASK [ceph-config : run 'ceph-volume lvm list' to see how many osds have already been created] ***
2019-01-04 15:30:51,222 p=7977 u=mistral | task path: /usr/share/ceph-ansible/roles/ceph-config/tasks/main.yml:55
2019-01-04 15:30:51,222 p=7977 u=mistral | Friday 04 January 2019 15:30:51 -0500 (0:00:00.034) 0:00:58.164 ********
2019-01-04 15:30:51,245 p=7977 u=mistral | skipping: [192.168.24.11] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}
2019-01-04 15:30:51,256 p=7977 u=mistral | TASK [ceph-config : set_fact num_osds from the output of 'ceph-volume lvm list'] ***
2019-01-04 15:30:51,256 p=7977 u=mistral | task path: /usr/share/ceph-ansible/roles/ceph-config/tasks/main.yml:66
2019-01-04 15:30:51,256 p=7977 u=mistral | Friday 04 January 2019 15:30:51 -0500 (0:00:00.033) 0:00:58.198 ********
2019-01-04 15:30:51,281 p=7977 u=mistral | skipping: [192.168.24.11] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}

So `num_osds` is set to its default value of 0 [1] (which is not a good default value, I think). That means the template never enters the conditions [2][3], and `osd_memory_target` keeps its default value [4].

[1] https://github.com/ceph/ceph-ansible/blob/v3.2.0/roles/ceph-config/templates/ceph.conf.j2#L155
[2] https://github.com/ceph/ceph-ansible/blob/v3.2.0/roles/ceph-config/templates/ceph.conf.j2#L157
[3] https://github.com/ceph/ceph-ansible/blob/v3.2.0/roles/ceph-config/templates/ceph.conf.j2#L162
[4] https://github.com/ceph/ceph-ansible/blob/v3.2.0/roles/ceph-config/templates/ceph.conf.j2#L168
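
To make the failure mode concrete, here is a minimal sketch of the kind of math the ceph.conf.j2 template performs for HCI nodes. It is not the verbatim template; the inventory group, safety-factor value, and task layout are assumptions. With `num_osds` stuck at 0, the condition is never met and the 4 GiB BlueStore default is written out unchanged:

```yaml
# Sketch only: mimics the osd_memory_target calculation described above.
# Because num_osds is 0, the `when` condition fails and the default survives.
- name: illustrate the HCI osd_memory_target calculation
  hosts: osds                       # assumed inventory group
  gather_facts: true                # provides ansible_memtotal_mb
  vars:
    is_hci: true
    num_osds: 0                     # what the bug produced on containerized nodes
    hci_safety_factor: 0.2          # assumed fraction of host RAM left to Ceph
    osd_memory_target: 4294967296   # BlueStore default (4 GiB)
  tasks:
    - name: split a share of the host memory across the OSDs on this node
      set_fact:
        osd_memory_target: "{{ (ansible_memtotal_mb * 1048576 * hci_safety_factor / num_osds) | int }}"
      when:
        - is_hci | bool
        - num_osds | int > 0        # false here, so the default is kept

    - name: show the value that ends up in ceph.conf
      debug:
        msg: "osd memory target = {{ osd_memory_target }}"
```

With a correct `num_osds` (for example 5 OSDs on a 128 GB node), the same expression yields a value above the 4 GiB default, which is what the HCI tuning is meant to achieve.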
I think Guillaume's comment here https://bugzilla.redhat.com/show_bug.cgi?id=1664112#c4 makes sense. So, we have two aspects:

1. Currently, if the number of OSDs is not determined correctly by the code, we default to 0. This does not permit any further automation to calculate the value of osd_memory_target (meaning: none of the math is done). To prevent this, we can default to 1 instead and at least allow the calculation to happen, though it might not be perfect. This is easy to get into 3.x.

2. We need to ensure that "num_osds" is populated correctly under all circumstances (and understand why it was not in this case). A solution to this will eradicate the problem. I am not sure about the timeline for this.

Guillaume, what are your thoughts?

Sticking to a default of 1 for num_osds isn't enough; it means there's an issue with the current implementation. We must understand how we can end up in a case where no OSDs are detected, and fix it. I can assist you to reproduce and figure out how to fix it, let me know.

Guillaume, sure, let's reproduce this. Do you already have an environment for it, or can Eliad help with that?

(In reply to Neha Ojha from comment #8)
> Guillaume, sure, let's reproduce this. Do you already have an environment for it, or can Eliad help with that?

Eliad, can you contact Neha to provide an env?

Updating the QA Contact to Hemant. Hemant will be rerouting the bug to the appropriate QE Associate.

Regards,
Giri

Tested this today with an OSP13 deployment with 3 controller and 3 hci-ceph-all nodes. Specified override.hcicephall.memory=131072 in infrared for a deployment with 5 OSDs per node, such that the memory target value on each node should have been larger than 4 GB. However, upon inspection of the ceph.conf file on the hci-ceph-all nodes, the value of osd memory target was 4294967296, i.e. the default value.

The Jordan (the infrared plugin we use for automated testing of ceph integration) patch with the test I used today can be found here: https://review.gerrithub.io/c/rhos-infra/jordan/+/468500

Some additional details:
ansible_memtotal_mb (found on hci-ceph-all nodes after deployment): 128773
Core Puddle: 2019-09-05.1
Ceph Image: 3-31
ceph-ansible version: ceph-ansible-3.2.24-1.el7cp.noarch
puppet-ceph: puppet-ceph-2.5.1-2.git372379b.el7ost.noarch

Should this be changed back to ASSIGNED?

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:4353
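
For reference, a quick way to repeat the check described in the test comment above, once the fixed ceph-ansible build is in place, could look like the following sketch; the inventory group name and config path are assumptions:

```yaml
# Verification sketch: read the rendered ceph.conf on each HCI OSD node and
# fail if "osd memory target" still carries the 4 GiB BlueStore default,
# which would mean the HCI calculation did not run.
- name: verify osd memory target was recalculated on HCI nodes
  hosts: osds                      # assumed group holding the hci-ceph-all nodes
  gather_facts: false
  tasks:
    - name: read the osd memory target line from ceph.conf
      command: grep 'osd memory target' /etc/ceph/ceph.conf
      register: memory_target_line
      changed_when: false

    - name: fail if the default value is still in place
      assert:
        that:
          - memory_target_line.stdout.split('=')[1] | trim | int != 4294967296
        fail_msg: "osd memory target is still the default; the HCI calculation did not run"
        success_msg: "osd memory target was recalculated: {{ memory_target_line.stdout }}"
```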