Bug 2064583 - High memory usage of foreman-proxy during a scaled bulk REX job
Summary: High memory usage of foreman-proxy during a scaled bulk REX job
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Remote Execution
Version: 6.10.3
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: 6.11.0
Assignee: Adam Ruzicka
QA Contact: Peter Ondrejka
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-03-16 08:17 UTC by Pavel Moravec
Modified: 2022-07-05 14:34 UTC (History)
5 users

Fixed In Version: tfm-rubygem-smart_proxy_dynflow-0.6.3
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-07-05 14:34:27 UTC
Target Upstream Version:
Embargoed:




Links
- Foreman Issue Tracker 34624 (Normal, Closed): Set MALLOC_ARENA_MAX to counter memory bloat in production (last updated 2022-03-17 14:18:12 UTC)
- Foreman Issue Tracker 34625 (Normal, Closed): Make execution plan cleaner more aggressive (last updated 2022-03-24 15:54:56 UTC)
- Red Hat Knowledge Base (Solution) 6816591 (last updated 2022-03-16 08:17:52 UTC)
- Red Hat Product Errata RHSA-2022:5498 (last updated 2022-07-05 14:34:38 UTC)

Description Pavel Moravec 2022-03-16 08:17:53 UTC
Description of problem:
When executing a scaled REX job, the memory usage of the foreman-proxy in charge keeps growing over time. The growth can amount to a few GBs of RSS and depends heavily on the number of hosts targeted and, above all, on the task executed (the longer the task and the bigger its output, the higher the memory usage).

The reason is that foreman-proxy keeps the outcome of all the tasks in an in-memory database. That data is purged every 24 hours, which is configurable via:

:execution_plan_cleaner_age: SECONDS

in /etc/foreman-proxy/settings.d/dynflow.yml .
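
For illustration, the current 24-hour default expressed through this setting would be (the value is in seconds, as the placeholder above suggests):

:execution_plan_cleaner_age: 86400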

However, just decreasing this value on its own has almost no effect - Ruby does not return the already allocated memory to the operating system.

To force Ruby to release the memory, the service must run with MALLOC_ARENA_MAX=2, i.e. have this in /usr/lib/systemd/system/foreman-proxy.service:

[Service]
..
Environment=MALLOC_ARENA_MAX=2

This parameter puts a hard limit on the number of malloc memory arenas Ruby can allocate/use, which prevents memory fragmentation and the resulting inability to deallocate it afterwards (please correct me if I misunderstood the parameter).
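
For reference, instead of editing the packaged unit file directly (a package update may overwrite it), the same environment can also be set via a systemd drop-in; a minimal sketch, with an arbitrary drop-in file:

# /etc/systemd/system/foreman-proxy.service.d/override.conf
# (drop-in file; can be created with "systemctl edit foreman-proxy")
[Service]
Environment=MALLOC_ARENA_MAX=2

# then reload systemd and restart the service
systemctl daemon-reload
systemctl restart foreman-proxy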

(kudos to aruzicka++ who found both tunables)


Some testing I did:
- having just cleaner_age set (to e.g. 60 or even 10 seconds, which is extremely low) has very little impact
- having just MALLOC_ARENA_MAX prevents about 2/3 of the excessive memory increase, but it does not scale sufficiently (running more tests => higher memory usage)
- combining both options makes memory usage flat over time

Therefore, I am requesting that those two tunables be added to a default installation (which means the component should be installer, pending blessing from aruzicka?)


Version-Release number of selected component (if applicable):
Sat 6.10


How reproducible:
100%


Steps to Reproduce:
1. Have Sat (ideally with no external Caps running) and many hosts
2. Invoke various REX jobs repeatedly (I used "run command 'date'" and "run command 'sleep 60'" and "Apply Ansible roles")
3. Monitor foreman-proxy RSS usage over time
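
For step 3, a minimal monitoring sketch (the log path is arbitrary):

# log the RSS of the foreman-proxy main process once a minute
PID=$(systemctl show -p MainPID foreman-proxy | cut -d= -f2)
while sleep 60; do
    echo "$(date -Is) $(ps -o rss= -p "$PID") kB"
done >> /var/tmp/foreman-proxy-rss.log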


Actual results:
3. Memory usage grows over time (at least for one day), linearly with the number of REX jobs executed and with the size of the jobs' output (the Ansible jobs have big output, but they are also much slower to execute compared to the dummy 'date' one)


Expected results:
After an initial memory increase (natural for any process under load), memory usage should be stable.


Additional info:

Comment 1 Pavel Moravec 2022-03-16 08:33:23 UTC
I have no idea what the recommended / default value of :execution_plan_cleaner_age: should be. Maybe 3600 seconds? Or just 600, e.g.? The lower the value, the lower the memory usage, but also a slightly higher probability that dynflow/sidekiq won't fetch the data from foreman-proxy memory before it is purged (in some extreme situation when the dynflow workers are lagging behind?).

I used very low values (10 seconds and 60 seconds, where 10 s was a bit better but really no big deal) because I wanted to complete the comparison tests in a reasonable time. But I guess these values are too low for defaults.

Comment 3 Pavel Moravec 2022-03-16 08:44:01 UTC
Having it in the tuning guide: good point, it makes sense for (REX-)scaled environments, at least until this BZ is fixed and the tunables are applied automatically.

Comment 4 Adam Ruzicka 2022-03-16 08:57:51 UTC
> which means the component should be installer

Partially. I think the unit file for the foreman-proxy service is not generated but comes straight from its package. Let's just leave it like this and I'll take care of it.

If everything works as expected, the outputs and results are sent from the Capsule to the Satellite without needing to be kept around for too long. However, if this upload fails, Satellite checks on the Capsule every 15 minutes. I'd say we shouldn't go below 2*15 minutes to stay on the safe side, especially since there's no real benefit in going too low.
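
For illustration, a value honoring that floor (2*15 minutes = 1800 seconds) would look like this in /etc/foreman-proxy/settings.d/dynflow.yml:

:execution_plan_cleaner_age: 1800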

On a side note, this would be an ideal use case for ephemeral execution plans (plans which destroy themselves if they finish successfully).

Comment 5 Bryan Kearney 2022-03-16 12:05:43 UTC
Upstream bug assigned to aruzicka

Comment 8 Bryan Kearney 2022-03-22 00:05:18 UTC
Moving this bug to POST for triage into Satellite since the upstream issue https://projects.theforeman.org/issues/34624 has been resolved.

Comment 9 Pavel Moravec 2022-03-22 08:54:09 UTC
I can confirm that, apart from one specific test case, memory usage remains stable when repeatedly running many REX jobs of the three types with MALLOC_ARENA_MAX=2 and :execution_plan_cleaner_age: 1800.

The one specific test case is the simplest one: run "date" on 300-ish systems concurrently, in a loop. Memory usage exhibits a few sudden increases over time, followed by flat memory usage.

I am running a much longer test to see whether these steps in memory increase were hiccups we can ignore or something to be concerned about. Unless I comment here again within the next week, we can ignore it.

Comment 10 Pavel Moravec 2022-03-23 10:42:56 UTC
(In reply to Pavel Moravec from comment #9)
> I can confirm that, apart from one specific test case, memory usage remains
> stable when repeatedly running many REX jobs of the three types with
> MALLOC_ARENA_MAX=2 and :execution_plan_cleaner_age: 1800.
> 
> The one specific test case is the simplest one: run "date" on 300-ish
> systems concurrently, in a loop. Memory usage exhibits a few sudden
> increases over time, followed by flat memory usage.
> 
> I am running a much longer test to see whether these steps in memory
> increase were hiccups we can ignore or something to be concerned about.
> Unless I comment here again within the next week, we can ignore it.

The test stabilised after one day at pretty low values, so we can ignore the hiccup from the previous test.

Comment 12 Peter Ondrejka 2022-04-19 12:47:43 UTC
Checked on Satellite 6.11 snap 16; I confirm that both the MALLOC_ARENA_MAX setting has been added and the execution plan cleaner has been made more aggressive.

Comment 18 errata-xmlrpc 2022-07-05 14:34:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Satellite 6.11 Release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5498

