Bug 2025811
Summary: | Upgrading to Satellite 6.9.6 and above introduces an increase in system memory consumption causing Pulp activities to fail with OOM at certain times | | |
---|---|---|---|
Product: | Red Hat Satellite | Reporter: | Sayan Das <saydas> |
Component: | Installation | Assignee: | satellite6-bugs <satellite6-bugs> |
Status: | CLOSED ERRATA | QA Contact: | isinghal |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 6.9.6 | CC: | ahumbe, cdonnell, dalley, dhjoshi, ehelms, hakon.gislason, isinghal, jhutar, lvrtelov, mmccune, peter.vreman, pmendezh, sadas, tasander |
Target Milestone: | 6.11.0 | Keywords: | Performance, PrioBumpGSS, Triaged |
Target Release: | Unused | ||
Hardware: | All | ||
OS: | All | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2022-07-05 14:30:29 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Sayan Das
2021-11-23 05:54:17 UTC
I built a small script to log, every 10 seconds, the status text from the foreman systemd unit, which reports the actual Puma usage:

while true; do echo "$(date +%Y%m%d-%H%M%S) $(systemctl show foreman --prop=StatusText)"; sleep 10; done

This reveals that in my situation, with 200-250 clients attached, at most 30 threads are in use. Another Sat6 server with more content changes shows only a negligible increase from API calls, because most API calls must run sequentially due to functional dependencies between the content-update steps, so parallelism in the content-related API calls is rare. Even assuming 10 concurrent API calls, there is still plenty of headroom. There are also almost no long-running API calls; they are all async and handled by Dynflow.

The highest parallelism I was able to trigger came from a daily cron job without splay that ran 'insights-client --compliance' on ~50 servers. This produced a peak of ~20 Puma threads at the start of the insights-client run, where it queries cloud.redhat.com for each host's assigned compliance policies. After that, the runtimes of the compliance clients start to diverge, and Puma sees almost no concurrent requests anymore.

The Sat6 default with 16 CPUs was 24 workers x 5 threads = 120 threads in total. Extrapolating from my real-world Foreman request load, 120 Puma threads might be appropriate for a setup with 10000+ clients, but not for one with fewer than 500 clients. What is missing here is reporting on how many Puma threads are actually used; with that input, a more realistic number of Puma workers could be chosen. It is known that Puma needs 1.0-1.5 GB per worker. With a default of 1.5 workers per CPU, that means 1.5 x 1.5 GB = 2.25 GB of memory per CPU. This ratio does not match the currently documented resource requirements.
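The sizing arithmetic above can be sketched as a quick shell calculation (a minimal sketch: the 1.5-workers-per-CPU default, 5 threads per worker, and the 1.5 GB/worker upper bound are taken from the observation above; the variable names are illustrative):

```shell
# Estimate Puma thread count and worst-case memory from the CPU count.
# Assumes 1.5 workers per CPU, 5 threads per worker, ~1.5 GB per worker.
CPUS=16
WORKERS=$(( CPUS * 3 / 2 ))                        # 1.5 x CPUs -> 24 workers
THREADS_PER_WORKER=5
TOTAL_THREADS=$(( WORKERS * THREADS_PER_WORKER ))  # 24 x 5 -> 120 threads
MEM_GB=$(( WORKERS * 3 / 2 ))                      # 1.5 GB/worker -> 36 GB, i.e. 2.25 GB/CPU
echo "workers=${WORKERS} threads=${TOTAL_THREADS} mem_gb=${MEM_GB}"
```

With 16 CPUs this reproduces the 24 x 5 = 120 thread default and the 2.25 GB/CPU figure (36 GB total) quoted above.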
On top of that, the other services (the database, Candlepin, Pulp, Puppet) all need to run on the same server and compete for the same memory.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Satellite 6.11 Release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5498

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days