Description of problem:

Followed Red Hat's documentation for creating templates for resource isolation on a hyperconverged node, redeployed OpenStack, and restarted all nodes. The Ceph OSDs alone were using resources that were out of band from the resource pool that was defined.

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/hyperconverged_infrastructure_guide/index#proc_configuring-resource-isolation-on-hyperconverged-nodes_hci

Version-Release number of selected component (if applicable):
OpenStack 16.2.4

Steps to Reproduce:
1. Create or modify templates for resource isolation per the documentation above:

[stack@undercloud ~]$ cat templates/hci-resource.yaml
parameter_defaults:
  ComputeHCIParameters:
    NovaReservedHostMemory: 26624

[stack@undercloud ~]$ cat templates/storage-container-config.yaml
parameter_defaults:
  CephAnsibleExtraConfig:
    is_hci: true

2. Include the template files in your OpenStack deployment and redeploy.
3. Reboot the nodes.

Actual results:
The Ceph OSDs were taking up 32 GB of RAM; the Ceph OSDs and Nova overhead should have been given only 26 GB of RAM for resource isolation.

Expected results:
The Ceph OSDs and Nova overhead being capped at 26 GB of RAM.

Additional info:

Had to use a workaround of setting the Ceph osd_memory_target directly to resolve this issue:

[stack@undercloud ~]$ cat templates/ceph-resource.yaml
parameter_defaults:
  CephConfigOverrides:
    osd:
      osd_memory_target: 2147483648
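For reference, one way to check both the actual OSD memory use and the running osd_memory_target on an HCI node is sketched below. It assumes containerized OSDs managed by podman with the ceph-osd-<id> container naming that ceph-ansible uses, and it uses osd.0 as an example; substitute the OSD ids and container names that actually exist on your node.

# Per-container memory use of the running OSD containers (names assumed to
# follow the ceph-osd-<id> pattern used by ceph-ansible).
sudo podman stats --no-stream | grep ceph-osd

# Running osd_memory_target as reported by one OSD daemon (osd.0 assumed).
sudo podman exec ceph-osd-0 ceph daemon osd.0 config get osd_memory_target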
(In reply to daniel.jameson.1 from comment #0)

> Description of problem:
> Followed Red Hat's documentation for creating templates for resource
> isolation on a hyperconverged node, redeployed OpenStack, and restarted all
> nodes. The Ceph OSDs alone were using resources that were out of band from
> the resource pool that was defined.

How did you determine they were out of band? E.g. did you look at the value of osd_memory_target on the HCI node or run some other command?

> Version-Release number of selected component (if applicable):
> OpenStack 16.2.4

What version of ceph-ansible were you using? Can you share the output of running `rpm -q ceph-ansible` on the undercloud?

> Steps to Reproduce:
> 1. Create or modify templates for resource isolation per the documentation above:
>
> [stack@undercloud ~]$ cat templates/hci-resource.yaml
> parameter_defaults:
>   ComputeHCIParameters:
>     NovaReservedHostMemory: 26624
>
> [stack@undercloud ~]$ cat templates/storage-container-config.yaml
> parameter_defaults:
>   CephAnsibleExtraConfig:
>     is_hci: true

I see you're using "is_hci: true". That's good. More on that below regarding osd_memory_target.

> Actual results:
> The Ceph OSDs were taking up 32 GB of RAM; the Ceph OSDs and Nova overhead
> should have been given only 26 GB of RAM for resource isolation.

What command did you use to determine that they were using 32 GB of RAM?

Why should it be 26 GB? Do you think it's because you set `NovaReservedHostMemory: 26624`? If so, that's not exactly how it works. NovaReservedHostMemory tells the Nova scheduler not to schedule VMs on an HCI node that would require the last 26624 MB. I.e. if the Nova scheduler sees a host with X GB of RAM and wants to determine whether it can run a VM there, it must count the available resources on that host not as X GB of RAM but as (X - 26) GB of RAM.

So just because the Nova scheduler will reserve that memory doesn't mean the Nova scheduler puts an upper bound on the Ceph OSDs. We want to tell the Nova scheduler not to use memory the OSDs would use (so NovaReservedHostMemory is the right thing to do), but to limit the OSD memory we set osd_memory_target. More on that below.

> Expected results:
> The Ceph OSDs and Nova overhead being capped at 26 GB of RAM.
>
> Additional info:
>
> Had to use a workaround of setting the Ceph osd_memory_target directly to
> resolve this issue:
>
> [stack@undercloud ~]$ cat templates/ceph-resource.yaml
> parameter_defaults:
>   CephConfigOverrides:
>     osd:
>       osd_memory_target: 2147483648

This is along the right lines. Let's look at how osd_memory_target is set by ceph-ansible when is_hci is true.

https://github.com/ceph/ceph-ansible/pull/3113/files

As you can probably see in the PR above, the total host memory is multiplied by a safety factor (which differs depending on is_hci). That value is then divided by the number of OSDs, and the result is used as the osd_memory_target. If you don't agree with that calculation, you don't have to use it; e.g. you can directly override osd_memory_target in the ceph.conf, as you have done.

So I'm curious what value for osd_memory_target ceph-ansible computed in your environment. Would you please provide it? Is it in line with the above calculation?
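To make the calculation above concrete, here is a rough sketch of the is_hci arithmetic as described in that PR. The 0.2 HCI safety factor and the 256 GB / 12 OSD example figures are assumptions for illustration only, not values taken from this environment; check the ceph-ansible version in use for its actual defaults.

# Sketch of the is_hci osd_memory_target calculation (illustrative values only;
# the 0.2 safety factor is an assumption based on the PR referenced above).
awk -v total_mb=262144 -v safety_factor=0.2 -v num_osds=12 \
    'BEGIN { printf "%.0f\n", total_mb * 1048576 * safety_factor / num_osds }'
# -> 4581298449 bytes, i.e. roughly 4.3 GiB per OSD for a 256 GB node with 12 OSDs

Keep in mind that osd_memory_target is a per-OSD value and a target rather than a hard cap, so the memory the OSDs collectively aim for is roughly that figure multiplied by the number of OSDs on the node.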
Hi Daniel,

It's been 6 days and I haven't heard from you. I'll keep this bug report open for another week.

John
Since it's been another week and I haven't heard back, I'm going to close this bug. If you want to re-open it please provide the requested info.