Description of problem:
Upon deploying a stack using infrared-generated playbooks in a hyperconverged configuration, and with is_hci set to true (or false), the value set for the cache falls back to the default of 4294967296 bytes (4 GB). Upon further scrutiny two things arise:

1. When running the playbooks, the number of OSDs is not calculated (in roles/ceph-config/tasks/main.yml the counting tasks are always skipped, so the count defaults to zero). This breaks the cache-size calculation; in this deployment scenario a count of 1 should have been derived from the device list.

2. The code compares a value in megabytes against one in bytes: ansible_memtotal_mb is compared to osd_memory_target in roles/ceph-config/templates/ceph.conf.j2. For a machine with 20 GB of RAM, 1 OSD, and a safety factor of 0.2, the conservative allotment per OSD is 4,096 MB, yet when compared to osd_memory_target (which is in bytes, roughly four billion) the conditional is false. This was the scenario run here, and the cache was not calculated correctly.

Version-Release number of selected component (if applicable):
ceph-ansible-3.2.0-1.el7cp.noarch

How reproducible:
Deploy hyperconverged nodes with ample RAM for the osdcompute machines (scenario was demonstrated on OSP 13). Consistent on installations and updates.

Steps to Reproduce:
1. Deploy OSP13 using infrared, hyperconverged.
2. Use the is_hci flag set to true, provide 20 GB+ memory to the hyperconverged nodes, single OSD. (To clarify: tried once with 20 GB, then again with 32 GB as an update.)
3. Look at ceph.conf for the results.

Actual results:
The theoretical cache should be 4.2 GB but defaults to 4 GB; when using 32 GB the theoretical cache should be 6.5 GB.

Expected results:
Cache should be calculated as per the formula outlined in the code: MAX[(RAM * SafetyFactor) / #OSDs | 4GB]

Additional info:
[1] ceph_osd_tree.json https://pastebin.com/raw/VqhAWQAq
[2] internal.yaml https://pastebin.com/raw/vcWbBDjB
[3] inventory.yaml https://pastebin.com/raw/KZgvD0ek
[4] ceph.conf
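To make the intended math concrete, here is a minimal plain-Python sketch of the formula and of the MB-vs-bytes mismatch described above. This is not the actual ceph.conf.j2 logic; the function name is illustrative, and the 0.2 safety factor and 4294967296-byte default are the values mentioned in this report.

def intended_osd_memory_target(memtotal_mb, num_osds, safety_factor=0.2,
                               default_target=4294967296):
    """MAX[(RAM * SafetyFactor) / #OSDs | 4GB], with RAM converted from MB to bytes."""
    per_osd_bytes = memtotal_mb * 1048576 * safety_factor / num_osds
    return int(max(per_osd_bytes, default_target))

print(intended_osd_memory_target(32768, 1))   # ~6.9e9 bytes for a 32 GB node with 1 OSD

# The unit mismatch: a per-OSD allotment expressed in MB (20480 * 0.2 / 1 = 4096)
# compared against osd_memory_target in bytes can never exceed it, so the default is kept.
print(4096 > 4294967296)   # False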
See also: ceph.conf result at: https://pastebin.com/raw/d4ePRFHG
Assigning to Neha since she worked on the initial implementation.
Can we have someone from QE reproduce this?
Hi Neha, I think the issue here is that `num_osds` never got defined in the ceph-config role:

2019-01-04 15:30:51,087 p=7977 u=mistral | TASK [ceph-config : count number of osds for ceph-disk scenarios] **************
2019-01-04 15:30:51,088 p=7977 u=mistral | task path: /usr/share/ceph-ansible/roles/ceph-config/tasks/main.yml:16
2019-01-04 15:30:51,088 p=7977 u=mistral | Friday 04 January 2019 15:30:51 -0500 (0:00:00.035) 0:00:58.029 ********
2019-01-04 15:30:51,111 p=7977 u=mistral | skipping: [192.168.24.11] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}
2019-01-04 15:30:51,122 p=7977 u=mistral | TASK [ceph-config : count number of osds for lvm scenario] *********************
2019-01-04 15:30:51,122 p=7977 u=mistral | task path: /usr/share/ceph-ansible/roles/ceph-config/tasks/main.yml:23
2019-01-04 15:30:51,122 p=7977 u=mistral | Friday 04 January 2019 15:30:51 -0500 (0:00:00.034) 0:00:58.064 ********
2019-01-04 15:30:51,145 p=7977 u=mistral | skipping: [192.168.24.11] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}
2019-01-04 15:30:51,156 p=7977 u=mistral | TASK [ceph-config : run 'ceph-volume lvm batch --report' to see how many osds are to be created] ***
2019-01-04 15:30:51,156 p=7977 u=mistral | task path: /usr/share/ceph-ansible/roles/ceph-config/tasks/main.yml:30
2019-01-04 15:30:51,156 p=7977 u=mistral | Friday 04 January 2019 15:30:51 -0500 (0:00:00.033) 0:00:58.098 ********
2019-01-04 15:30:51,177 p=7977 u=mistral | skipping: [192.168.24.11] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}
2019-01-04 15:30:51,188 p=7977 u=mistral | TASK [ceph-config : set_fact num_osds from the output of 'ceph-volume lvm batch --report'] ***
2019-01-04 15:30:51,188 p=7977 u=mistral | task path: /usr/share/ceph-ansible/roles/ceph-config/tasks/main.yml:47
2019-01-04 15:30:51,188 p=7977 u=mistral | Friday 04 January 2019 15:30:51 -0500 (0:00:00.032) 0:00:58.130 ********
2019-01-04 15:30:51,211 p=7977 u=mistral | skipping: [192.168.24.11] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}
2019-01-04 15:30:51,222 p=7977 u=mistral | TASK [ceph-config : run 'ceph-volume lvm list' to see how many osds have already been created] ***
2019-01-04 15:30:51,222 p=7977 u=mistral | task path: /usr/share/ceph-ansible/roles/ceph-config/tasks/main.yml:55
2019-01-04 15:30:51,222 p=7977 u=mistral | Friday 04 January 2019 15:30:51 -0500 (0:00:00.034) 0:00:58.164 ********
2019-01-04 15:30:51,245 p=7977 u=mistral | skipping: [192.168.24.11] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}
2019-01-04 15:30:51,256 p=7977 u=mistral | TASK [ceph-config : set_fact num_osds from the output of 'ceph-volume lvm list'] ***
2019-01-04 15:30:51,256 p=7977 u=mistral | task path: /usr/share/ceph-ansible/roles/ceph-config/tasks/main.yml:66
2019-01-04 15:30:51,256 p=7977 u=mistral | Friday 04 January 2019 15:30:51 -0500 (0:00:00.033) 0:00:58.198 ********
2019-01-04 15:30:51,281 p=7977 u=mistral | skipping: [192.168.24.11] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}

So it's set to the default value of 0 [1] (which is not a good default value, I think); that means it doesn't enter the conditions [2][3] and it takes the default value [4] of `osd_memory_target`.

[1] https://github.com/ceph/ceph-ansible/blob/v3.2.0/roles/ceph-config/templates/ceph.conf.j2#L155
[2] https://github.com/ceph/ceph-ansible/blob/v3.2.0/roles/ceph-config/templates/ceph.conf.j2#L157
[3] https://github.com/ceph/ceph-ansible/blob/v3.2.0/roles/ceph-config/templates/ceph.conf.j2#L162
[4] https://github.com/ceph/ceph-ansible/blob/v3.2.0/roles/ceph-config/templates/ceph.conf.j2#L168
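For illustration, a simplified Python rendering of the fall-through described above (an assumed sketch, not the Jinja2 template verbatim; variable names only approximate the ones linked): when num_osds stays at its default of 0, the calculation branch is never entered and osd_memory_target keeps its default.

def rendered_osd_memory_target(is_hci, num_osds, memtotal_mb,
                               hci_safety_factor=0.2,
                               osd_memory_target=4294967296):
    # Only compute a per-OSD target when at least one OSD was detected.
    if is_hci and num_osds > 0:
        per_osd = memtotal_mb * 1048576 * hci_safety_factor / num_osds
        return int(max(per_osd, osd_memory_target))
    # num_osds == 0 (all counting tasks skipped) falls through to the default.
    return osd_memory_target

print(rendered_osd_memory_target(True, 0, 131072))   # 4294967296 -- num_osds never set
print(rendered_osd_memory_target(True, 5, 131072))   # 5497558138 -- what detection should allow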
I think Guillaume's comment here https://bugzilla.redhat.com/show_bug.cgi?id=1664112#c4 makes sense. So, we have two aspects:

1. Currently, if the number of OSDs is not determined correctly by the code, we default to 0. This does not permit any further automation to calculate the value of osd_memory_target (meaning: none of the math is done). To prevent this, we can default to 1 and at least allow the calculation to happen, though it might not be perfect (a sketch of this fallback follows below). This is easy to get into 3.x.

2. We need to ensure that "num_osds" is populated correctly under all circumstances (understand why it was not done correctly in this case). A solution to this will eradicate the problem. I am not sure about the timeline for this.

Guillaume, what are your thoughts?
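A minimal sketch of the fallback proposed in point 1 (assumed names, not a patch to the actual role): treat a zero detection result as a single OSD so the osd_memory_target math can still run.

def effective_num_osds(detected_num_osds):
    """Fall back to 1 when detection yields 0, so the calculation still happens."""
    return detected_num_osds if detected_num_osds > 0 else 1

assert effective_num_osds(0) == 1   # detection failed: calculation can still proceed
assert effective_num_osds(5) == 5   # detection worked: value is used as-is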
Sticking to a default of 1 for num_osds isn't enough; it means there's an issue with the current implementation. We must understand how we can end up in a case where it doesn't detect any OSDs, and fix it. I can assist you in reproducing and figuring out how to fix it, let me know.
Guillaume, sure, let's reproduce this. Do you already have an environment for it? Otherwise Eliad can help with that.
(In reply to Neha Ojha from comment #8)
> Guillaume, sure, let's reproduce this. Do you already have an environment
> for it? Otherwise Eliad can help with that.

Eliad, can you contact Neha to provide an env?
Updating the QA Contact to Hemant. Hemant will be rerouting this to the appropriate QE Associate.

Regards,
Giri
Tested this today with an OSP13 deployment with 3 controller and 3 hci-ceph-all nodes. Specified override.hcicephall.memory=131072 in infrared for a deployment with 5 OSDs per node, such that the memory target allocation for each node should have been larger than 4 GB. However, upon inspection of the ceph.conf file on the hci-ceph-all nodes, the value of osd memory target was 4294967296, i.e. the default value.

The Jordan (the infrared plugin we use for automated testing of ceph integration) patch with the test I used today can be found here:
https://review.gerrithub.io/c/rhos-infra/jordan/+/468500

Some additional details:
ansible_memtotal_mb (found on hci-ceph-all nodes after deployment): 128773
Core Puddle: 2019-09-05.1
Ceph Image: 3-31
ceph-ansible version: ceph-ansible-3.2.24-1.el7cp.noarch
puppet-ceph: puppet-ceph-2.5.1-2.git372379b.el7ost.noarch
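As a rough cross-check of the expectation above (a sketch assuming the 0.2 HCI safety factor mentioned in the bug description, not output from the deployment):

ansible_memtotal_mb = 128773        # reported on the hci-ceph-all nodes after deployment
num_osds = 5                        # OSDs per node in this test
safety_factor = 0.2                 # assumed HCI safety factor from the bug description
default_target = 4294967296         # 4 GiB default

per_osd = ansible_memtotal_mb * 1048576 * safety_factor / num_osds
print(int(max(per_osd, default_target)))   # ~5401131089 (~5.4 GB), not the 4294967296 observed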
Should this be changed back to ASSIGNED?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:4353