Description of problem:

[RFE][HCI] Set osd max memory based on the OSD container memory.

This is an extension of (or a fix for) this bug:
[RFE] set osd max memory based on host memory
https://bugzilla.redhat.com/show_bug.cgi?id=1595003
https://github.com/ceph/ceph-ansible/pull/3113

Version-Release number of selected component (if applicable):
RHCS 3.1

BZ1595003 handles this well for the non-containerized (non-HCI) environment, but in the containerized (HCI) environment we should not base the calculation on the full memory of the OSD host/node, because a containerized OSD only gets 5G from the ceph_osd_docker_memory_limit option.
https://bugzilla.redhat.com/show_bug.cgi?id=1591876
https://github.com/ceph/ceph-ansible/pull/2775

- I discussed this with Neha, and we think that in the HCI case the check should be:

- {% set _osd_memory_target = (ansible_memtotal_mb * hci_safety_factor / _num_osds) %}
+ {% set _osd_memory_target = (ceph_osd_docker_memory_limit * hci_safety_factor) %}

- ceph_osd_docker_memory_limit defaults to 5G, and we give 4G to the BlueStore cache by default, because there is no code path that allows a BlueStore cache smaller than 4G.
- So either we need to change the default osd_memory_target for the containerized (HCI) case to less than 4G, or we need to bump the default ceph_osd_docker_memory_limit to 8G or so. Maybe we need to talk to the performance team about these defaults. We also need to think about the default value of hci_safety_factor. The reason for all of this is that a containerized OSD will never have access to more memory than the ceph_osd_docker_memory_limit setting allows (see the arithmetic sketch below).
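To make the trade-off concrete, here is a back-of-the-envelope sketch in plain Python (not ceph-ansible code) of the arithmetic the proposed template change implies. The variable names mirror the ceph-ansible options; the safety-factor value used here is only an illustrative assumption, not necessarily the real default.

```python
# Illustrative arithmetic only; the real logic lives in the ceph-ansible
# ceph.conf Jinja2 template, not in Python.

GIB = 1024 ** 3

ceph_osd_docker_memory_limit = 5 * GIB   # ceph-ansible default cited in this bug
bluestore_cache_floor = 4 * GIB          # minimum BlueStore cache cited in this bug
hci_safety_factor = 0.8                  # example value, assumed for illustration

# Proposed: derive osd_memory_target from the container limit, not host memory.
osd_memory_target = int(ceph_osd_docker_memory_limit * hci_safety_factor)

print(f"osd_memory_target = {osd_memory_target / GIB:.1f} GiB")
if osd_memory_target < bluestore_cache_floor:
    # With a 5G container limit, any safety factor below 0.8 lands here, which
    # is why the bug suggests either lowering the default cache/target or
    # raising ceph_osd_docker_memory_limit to 8G.
    print("osd_memory_target is below the 4G BlueStore cache floor")
```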
Shouldn't you set the container cgroup memory limit based on the size of the OSD cache, rather than what is proposed above? That is, add a fixed amount of headroom to the OSD cache size and make that the cgroup limit, right?
During the Bluestore discussion we came up with this idea to prevent the OSD (or any other daemon) from getting into a situation where the cgroup memory limit for its container is lower than the amount of memory it needs. During daemon startup, if the daemon can determine its container ID (available from "docker inspect container-name", for example), it can read the memory limit from /sys/fs/cgroup/memory/system.slice/docker-$containerid.scope/memory.limit_in_bytes and exit with an error message if the limit is not sufficient (for example, BlueStore OSDs do their caching in userspace and need a lot of memory). This doesn't fix the orchestration software or prevent the daemon from consuming too much memory; it just ensures that we never get into a situation where the daemon is likely to die simply because the cgroup limit was accidentally set too low, and it informs the sysadmin of the problem right away.
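A minimal sketch of such a startup check, assuming the daemon already knows its own container ID and the cgroup v1 path layout described above. This is illustrative only; the function names and the required-memory value are hypothetical, not actual Ceph code.

```python
#!/usr/bin/env python3
"""Sketch of the startup check described above: read the container's cgroup
memory limit and refuse to start if it is lower than what the daemon needs."""

import sys


def cgroup_memory_limit(container_id: str) -> int:
    """Return the cgroup v1 memory limit (bytes) for a docker container."""
    path = (f"/sys/fs/cgroup/memory/system.slice/"
            f"docker-{container_id}.scope/memory.limit_in_bytes")
    with open(path) as f:
        return int(f.read().strip())


def check_memory_limit(container_id: str, required_bytes: int) -> None:
    """Exit with an error message if the cgroup limit is below required_bytes."""
    limit = cgroup_memory_limit(container_id)
    if limit < required_bytes:
        # Fail fast with a clear message instead of being OOM-killed later.
        sys.exit(f"cgroup memory limit {limit} bytes is below the "
                 f"{required_bytes} bytes this daemon needs (e.g. a BlueStore "
                 f"OSD with a userspace cache); refusing to start")


if __name__ == "__main__":
    # Example: container ID from `docker inspect`, 5 GiB required (both assumed).
    check_memory_limit(sys.argv[1], 5 * 1024 ** 3)
```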
Present in https://github.com/ceph/ceph-ansible/releases/tag/v3.2.0rc1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0020