Description of problem:
When executing a scaled REX job, memory usage of the foreman-proxy in charge grows over time. The growth can reach a few GBs of RSS and depends heavily on the number of hosts targeted and *especially* on the task executed (the longer the task and the bigger its output, the higher the memory usage).

The reason is that foreman-proxy keeps the outcome of all tasks in a database held in memory. That data is purged every 24 hours, which is configurable via:

:execution_plan_cleaner_age: SECONDS

in /etc/foreman-proxy/settings.d/dynflow.yml.

However, decreasing this value alone has almost no effect - ruby does not return the allocated memory to the OS. To force deallocation, the service must run with MALLOC_ARENA_MAX=2, i.e. have in /usr/lib/systemd/system/foreman-proxy.service:

[Service]
..
Environment=MALLOC_ARENA_MAX=2

This parameter puts a hard limit on the number of memory arenas ruby (via glibc malloc) will allocate and use, which prevents memory fragmentation and the resulting inability to deallocate afterwards (please correct me if I misunderstood the parameter).

(kudos to aruzicka++ who found both tunables)

Some testing I did:
- having just cleaner_age set (to e.g. 60 or even 10 seconds, extremely low) has very minimal impact
- having just MALLOC_ARENA_MAX prevents 2/3 of the excessive memory increase, but it does not scale sufficiently (running more tests => higher memory usage)
- combining both options makes memory usage flat over time

Therefore, I am requesting adding those two tunables to the default installation (which means the component should be installer, after blessing from aruzicka?).

Version-Release number of selected component (if applicable):
Sat 6.10

How reproducible:
100%

Steps to Reproduce:
1. Have Sat (ideally with no external Caps running) and many hosts
2. Invoke various REX jobs repeatedly (I used "run command 'date'", "run command 'sleep 60'" and "Apply Ansible roles")
3. Monitor foreman-proxy RSS usage over time

Actual results:
3. Memory usage grows over time (at least for one day), linearly with the number of REX jobs executed and with the size of the jobs' output (the Ansible ones are big, though they are also much slower to execute compared to the dummy 'date' one).

Expected results:
After the initial memory increase (natural to any process under load), memory usage should be stable.

Additional info:
I have no idea what the recommended / default value of :execution_plan_cleaner_age: should be. Maybe 3600 seconds? Or just 600, e.g.? The lower the value, the lower the memory usage, but also a slightly higher probability that dynflow/sidekiq won't fetch the data from foreman-proxy memory before it is purged (in some extreme situation when the dynflow workers fall behind?). I used very low values (10 seconds and 60 seconds, where 10s was a bit better but really no big deal) because I wanted to complete the comparison tests in a reasonable time. But I guess these values are too low for defaults.
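For reference, the knob is a top-level key in the proxy's dynflow settings file. A sketch of the fragment, using the 1800-second value from later testing in this bug (not a blessed default; per the discussion below, staying above 2*15 minutes avoids racing the capsule upload retry):

```yaml
# /etc/foreman-proxy/settings.d/dynflow.yml (fragment)
# Purge finished execution plans from the in-memory DB after this many
# seconds. 1800 is the value exercised in testing here, not a verified
# default.
:execution_plan_cleaner_age: 1800
```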
Having it in the tuning guide: good point, it makes sense for (REX-)scaled environments - until this BZ is fixed and the tunable is applied automatically.
> which means the component should be installer Partially. I think the unit file for foreman-proxy service is not generated but comes straight from its package. Let's just leave it like this and I'll take care of it. If everything works as expected then the outputs and results are sent from the capsule to sat, without needing to keep them around for too long. However, if this upload fails, then Satellite checks on the capsule every 15 minutes. I'd say we shouldn't go below 2*15 minutes to stay on the safe side, especially if there's no real benefit in going too low. On a side note, this would be an ideal use case for ephemeral execution plans (plans which destroy themselves if they finish successfully).
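Since the unit file ships straight from the package (editing it in place would be lost on update), the environment variable can instead live in a systemd drop-in. A sketch - the drop-in file name is my own choice, not anything the packaging mandates:

```ini
# /etc/systemd/system/foreman-proxy.service.d/malloc.conf
# After creating this file, apply it with:
#   systemctl daemon-reload && systemctl restart foreman-proxy
[Service]
Environment=MALLOC_ARENA_MAX=2
```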
Upstream bug assigned to aruzicka
Moving this bug to POST for triage into Satellite since the upstream issue https://projects.theforeman.org/issues/34624 has been resolved.
I can confirm that - apart from one specific test case - memory usage remains stable when repeatedly running many REX jobs of the three types, with MALLOC_ARENA_MAX=2 and :execution_plan_cleaner_age: 1800.

The one specific test case is the simplest one: run "date" on 300-ish systems concurrently, in a loop. Memory usage exhibited a few sudden increases over time, followed by flat memory usage. I am running a much longer test to see whether these steps in memory were hiccups we can ignore, or something to be concerned about. Until I comment here again within the next week, we can ignore it.
(In reply to Pavel Moravec from comment #9)
> I can confirm that - apart from one specific test case - memory usage remains
> stable when repeatedly running many REX jobs of the three types, with
> MALLOC_ARENA_MAX=2 and :execution_plan_cleaner_age: 1800.
>
> The one specific test case is the simplest one: run "date" on 300-ish
> systems concurrently, in a loop. Memory usage exhibited a few sudden
> increases over time, followed by flat memory usage.
>
> I am running a much longer test to see whether these steps in memory were
> hiccups we can ignore, or something to be concerned about. Until I comment
> here again within the next week, we can ignore it.

The test stabilised after one day at fairly low values, so we can ignore the hiccup from the previous test.
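As a note on methodology, the RSS sampling used in these tests can be done with a small shell helper. A sketch - the service name comes from this report, and the 60-second interval is an arbitrary choice:

```shell
# Print the resident set size (in kB) of a given PID.
# `ps -o rss=` emits just the RSS column with no header.
rss_kb() {
  ps -o rss= -p "$1" | tr -d ' '
}

# Sampling loop (commented out; requires a running foreman-proxy):
# while true; do
#   pid=$(systemctl show -p MainPID --value foreman-proxy)
#   printf '%s %s\n' "$(date -Is)" "$(rss_kb "$pid")"
#   sleep 60
# done
```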
Checked on Satellite 6.11 snap 16. I confirm that the MALLOC_ARENA_MAX setting has been added and the execution plan cleaner interval has been increased.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Satellite 6.11 Release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5498