Bug 2203429 - Configuring resource Isolation on Hyperconverged Nodes does not work
Summary: Configuring resource Isolation on Hyperconverged Nodes does not work
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: ceph-ansible
Version: 16.2 (Train)
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Guillaume Abrioux
QA Contact: Yogev Rabl
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2023-05-12 16:13 UTC by daniel.jameson.1
Modified: 2023-07-18 16:00 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-12 11:58:19 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-25049 0 None None None 2023-05-12 16:14:32 UTC

Description daniel.jameson.1 2023-05-12 16:13:26 UTC
Description of problem:
Followed Red Hat's documentation for creating templates for resource isolation on a hyperconverged node, redeployed OpenStack, and restarted all nodes; the Ceph OSDs alone were still using resources outside the resource pool that was defined.

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/hyperconverged_infrastructure_guide/index#proc_configuring-resource-isolation-on-hyperconverged-nodes_hci

Version-Release number of selected component (if applicable):
OpenStack 16.2.4


Steps to Reproduce:
1. Create or modify templates for resource isolation per documentation above:

[stack@undercloud ~]$ cat templates/hci-resource.yaml
parameter_defaults:
  ComputeHCIParameters:
    NovaReservedHostMemory: 26624


[stack@undercloud ~]$ cat templates/storage-container-config.yaml
parameter_defaults:
  CephAnsibleExtraConfig:
    is_hci: true


2. Include the template files in your OpenStack deployment and redeploy
3. Reboot the nodes

Actual results: Ceph OSDs were taking up 32 GB of RAM; the Ceph OSDs and Nova overhead should have been given only 26 GB of RAM for resource isolation.


Expected results:
Ceph OSDs and Nova overhead capped at 26 GB of RAM.

Additional info:

Had to work around the issue by setting the Ceph OSD memory target directly:

[stack@undercloud ~]$ cat templates/ceph-resource.yaml
parameter_defaults:
  CephConfigOverrides:
    osd:
      osd_memory_target: 2147483648
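
For reference, that override is expressed in bytes; a quick sanity check (Python, illustrative only) confirms it pins each OSD daemon at 2 GiB:

```python
# The osd_memory_target override from the template above, in bytes.
osd_memory_target = 2147483648

# Convert to GiB to make the per-OSD cap readable.
print(osd_memory_target / 2**30)  # prints 2.0 (GiB per OSD daemon)
```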

Comment 1 John Fulton 2023-05-30 19:09:38 UTC
(In reply to daniel.jameson.1 from comment #0)
> Description of problem:
> Followed Redhat's documentation for creating templates for resource
> isolation on a hyper converged node, redeployed Openstack, restarted all
> nodes, Ceph-OSDs alone were using resources that were out of band from the
> resource pool that was defined.

How did you determine they were out of band? E.g. did you look at the value of osd_memory_target on the HCI node or run some other command?

> Version-Release number of selected component (if applicable):
> Openstack 16.2.4

What version of ceph-ansible were you using? 
Can you share the output of running `rpm -q ceph-ansible` on the undercloud?

> Steps to Reproduce:
> 1. Create or modify templates for resource isolation per documentation above:
> 
> [stack@undercloud ~]$ cat templates/hci-resource.yaml
> parameter_defaults:
>   ComputeHCIParameters:
>     NovaReservedHostMemory: 26624
> 
> 
> [stack@undercloud ~]$ cat templates/storage-container-config.yaml
> parameter_defaults:
>   CephAnsibleExtraConfig:
>     is_hci: true

I see you're using "is_hci: true". That's good. More on that below regarding osd_memory_target.

> Actual results: Ceph-OSDs were taking up 32GB of RAM, the Ceph-OSDS and Nova
> Overhead should have only been given 26GBs of RAM for resource isolation

What command did you use to determine that they were using 32GB of RAM?

Why should it be 26 GB? Do you think it's because you set `NovaReservedHostMemory: 26624`? If so, that's not exactly how it works.

NovaReservedHostMemory tells the Nova scheduler not to schedule VMs on an HCI node which would require the last 26624 MB. I.e. if the Nova scheduler sees a host with X GB of RAM and wants to determine whether it can run a VM there, it must count the available resources on that host not as X GB of RAM but as X-26 GB of RAM.

So just because the Nova scheduler reserves that memory doesn't mean it puts an upper bound on the Ceph OSDs. We want to tell the Nova scheduler not to use memory the OSDs will use (so NovaReservedHostMemory is the right thing to set), but to limit OSD memory we set osd_memory_target. More on that below.
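
As a rough sketch of that accounting (the host size is an assumed example; 26624 MB matches the template above):

```python
# Sketch of the Nova scheduler's view of an HCI host when
# NovaReservedHostMemory is set. Not Nova code; illustrative only.
def schedulable_memory_mb(host_total_mb: int, reserved_mb: int) -> int:
    """Memory the scheduler counts as available for VMs."""
    return host_total_mb - reserved_mb

host_total_mb = 131072   # assumed 128 GB HCI node
reserved_mb = 26624      # NovaReservedHostMemory from the template
print(schedulable_memory_mb(host_total_mb, reserved_mb))  # prints 104448
```

The reservation only changes this scheduling arithmetic; nothing in it constrains how much memory the OSD processes themselves consume.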
 
> Expected results:
> Ceph-OSDs and Nova Overhead being capped at 26GB RAM.
> 
> Additional info:
> 
> Had to do a workaround of setting the ceph osd memory target directly to
> resolve this issue:
> 
> [stack@undercloud ~]$ cat templates/ceph-resource.yaml
> parameter_defaults:
>   CephConfigOverrides:
>     osd:
>       osd_memory_target: 2147483648

This is along the right lines.

Let's look at how osd_memory_target is set by ceph-ansible when is_hci is true.

  https://github.com/ceph/ceph-ansible/pull/3113/files

As you can probably see in the PR above, the total host memory is multiplied by a different safety_factor (depending on is_hci). That value is then divided by the number of OSDs, and the result is used as the osd_memory_target.
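
A minimal sketch of that logic (the safety factors here are assumptions for illustration; see the PR above for the exact values ceph-ansible uses):

```python
# Illustrative re-implementation of the osd_memory_target derivation
# described above. The factors are assumed, not copied from ceph-ansible.
def estimated_osd_memory_target(total_mem_bytes: int, num_osds: int,
                                is_hci: bool) -> int:
    # HCI nodes get a smaller share of host memory per OSD, since the
    # colocated Nova workload needs the rest.
    safety_factor = 0.2 if is_hci else 0.7  # assumed values
    return int(total_mem_bytes * safety_factor / num_osds)

# e.g. an assumed 128 GiB host running 8 OSDs colocated with Nova:
target = estimated_osd_memory_target(128 * 2**30, 8, is_hci=True)
```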

If you don't agree with that calculation, you don't have to use it, e.g. you can directly override the osd_memory_target in the ceph.conf as you have done.

So I'm curious what value for osd_memory_target ceph-ansible computed in your environment. Would you please provide it? Is it in line with the above calculation?

Comment 2 John Fulton 2023-06-05 12:37:31 UTC
Hi Daniel,

It's been 6 days and I haven't heard from you. I'll keep this bug report open for another week.

  John

Comment 3 John Fulton 2023-06-12 11:58:19 UTC
Since it's been another week and I haven't heard back, I'm going to close this bug. If you want to re-open it please provide the requested info.

