Bug 2064583
Summary: | High memory usage of foreman-proxy during a scaled bulk REX job | | |
---|---|---|---|
Product: | Red Hat Satellite | Reporter: | Pavel Moravec <pmoravec> |
Component: | Remote Execution | Assignee: | Adam Ruzicka <aruzicka> |
Status: | CLOSED ERRATA | QA Contact: | Peter Ondrejka <pondrejk> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 6.10.3 | CC: | ahumbe, aruzicka, ehelms, lstejska, pmendezh |
Target Milestone: | 6.11.0 | Keywords: | Triaged |
Target Release: | Unused | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | tfm-rubygem-smart_proxy_dynflow-0.6.3 | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2022-07-05 14:34:27 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | |
Description
Pavel Moravec
2022-03-16 08:17:53 UTC
I have no idea what the recommended / default value of :execution_plan_cleaner_age: should be. Maybe 3600 seconds? Or just 600, say? The lower the value, the lower the memory usage, but also a slightly higher probability that dynflow/sidekiq won't fetch the data from foreman-proxy memory before it is purged (in some extreme situation when the dynflow workers are lagging behind?). I used very low values (10 seconds and 60 seconds, where 10s was a bit better but really no big deal) because I wanted to complete the comparison tests in a reasonable time, but I guess these values are too low for defaults. Having it in the tuning guide: good point, it makes sense for (REX-)scaled environments - until this BZ is fixed and the tunable is applied automatically.

> which means the component should be installer
Partially. I think the unit file for the foreman-proxy service is not generated but comes straight from its package. Let's just leave it like this and I'll take care of it.
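Since the unit file ships with the package rather than being generated by the installer, a local workaround in the meantime would typically go through a systemd drop-in instead of editing the packaged unit. A minimal sketch, using the MALLOC_ARENA_MAX=2 value from the testing below (standard systemd mechanics, not necessarily what the eventual fix does):

```
# Create a drop-in so the packaged unit file stays untouched
# (equivalently: systemctl edit foreman-proxy)
mkdir -p /etc/systemd/system/foreman-proxy.service.d
cat > /etc/systemd/system/foreman-proxy.service.d/malloc.conf <<'EOF'
[Service]
# Limit glibc malloc arenas to curb foreman-proxy memory growth
Environment=MALLOC_ARENA_MAX=2
EOF
systemctl daemon-reload && systemctl restart foreman-proxy
```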
If everything works as expected, the outputs and results are sent from the capsule to Satellite without needing to be kept around for long. However, if this upload fails, Satellite checks back on the capsule every 15 minutes. I'd say we shouldn't go below 2*15 minutes, to stay on the safe side, especially since there's no real benefit in going too low.
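Translating that floor into the tunable itself: a hedged sketch of setting the cleaner age to 2*15 minutes = 1800 seconds on a capsule, assuming smart_proxy_dynflow reads its settings from /etc/foreman-proxy/settings.d/dynflow.yml (the exact file location may differ between versions):

```
# Set the cleaner age in the assumed smart_proxy_dynflow settings file
# (check the file for an existing key before appending)
cat >> /etc/foreman-proxy/settings.d/dynflow.yml <<'EOF'
# Purge finished execution plans after 30 minutes: twice the 15-minute
# interval at which Satellite re-checks the capsule for failed uploads
:execution_plan_cleaner_age: 1800
EOF
systemctl restart foreman-proxy
```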
On a side note, this would be an ideal use case for ephemeral execution plans (plans which destroy themselves if they finish successfully).
Upstream bug assigned to aruzicka

Moving this bug to POST for triage into Satellite since the upstream issue https://projects.theforeman.org/issues/34624 has been resolved.

I can confirm that - apart from one specific test case - memory usage remains stable when repeatedly running many REX jobs of the three types, with MALLOC_ARENA_MAX=2 and :execution_plan_cleaner_age: 1800.

The one specific test case is the simplest one: run "date" on 300-ish systems concurrently, in a loop. Memory usage exhibits a few sudden increases over time, followed by flat memory usage. I am running a much longer test to see whether these steps in memory usage were hiccups we can ignore or something to be concerned about. Until I comment here again within the next week, we can ignore it.

(In reply to Pavel Moravec from comment #9)
> I can confirm that - apart one specific test case - memory usage remains
> stable when repeatedly running many REX jobs of the three types, when having
> MALLOC_ARENA_MAX=2 and :execution_plan_cleaner_age: 1800 .
>
> The one specific test case is the most simple one: run "date" on 300-ish
> systems concurrently, in a loop. Memory usage exhibit few sudden increases
> over the time, followed by flat memory usage.
>
> I am running much longer test to see if these steps in memory increase were
> some hiccups we can ignore, or something to concern. Until I comment here
> more within the next week, we can ignore it.

The test stabilised after one day at pretty low values, so we can ignore the hiccup from the previous test.

Checked on Satellite 6.11 snap 16: I confirm that both the MALLOC_ARENA_MAX setting has been added and the execution plan cleaner interval has been increased.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Satellite 6.11 Release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5498
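As a side note for anyone verifying a fixed system: a hedged sketch of the checks, using standard tooling (file locations and the unit name are assumptions based on this report):

```
# Confirm the arena cap is injected into the foreman-proxy service
# environment (via the packaged unit or a drop-in)
systemctl show foreman-proxy --property=Environment | grep MALLOC_ARENA_MAX

# Confirm the cleaner age tunable landed in the proxy settings
# (assumed settings directory; adjust for your layout)
grep -r execution_plan_cleaner_age /etc/foreman-proxy/settings.d/
```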