Bug 2064583
Summary: | High memory usage of foreman-proxy during a scaled bulk REX job | | |
---|---|---|---|
Product: | Red Hat Satellite | Reporter: | Pavel Moravec <pmoravec> |
Component: | Remote Execution | Assignee: | Adam Ruzicka <aruzicka> |
Status: | CLOSED ERRATA | QA Contact: | Peter Ondrejka <pondrejk> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 6.10.3 | CC: | ahumbe, aruzicka, ehelms, lstejska, pmendezh |
Target Milestone: | 6.11.0 | Keywords: | Triaged |
Target Release: | Unused | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | tfm-rubygem-smart_proxy_dynflow-0.6.3 | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2022-07-05 14:34:27 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | |
Description
Pavel Moravec
2022-03-16 08:17:53 UTC
I have no idea what the recommended / default value of :execution_plan_cleaner_age: should be. Maybe 3600 seconds? Or just 600, say? The lower the value, the lower the memory usage, but also a slightly higher probability that dynflow/sidekiq won't fetch the data from foreman-proxy memory before it is purged (in some extreme situation when the dynflow workers are lagging behind?). I used very low values (10 seconds and 60 seconds, where 10s was a bit better but really no big deal) because I wanted to complete the comparison tests in a reasonable time, but I guess these values are too low for defaults. Having it in the tuning guide: good point, it makes sense for (REX-)scaled environments - until this BZ is fixed and the tunable is applied automatically.

> which means the component should be installer
Partially. I think the unit file for the foreman-proxy service is not generated but comes straight from its package. Let's just leave it like this and I'll take care of it.
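Since the unit file ships with the package rather than being generated by the installer, a local workaround in the meantime would typically go through a systemd drop-in instead of editing the packaged unit. A minimal sketch, using the MALLOC_ARENA_MAX=2 value from the testing below (standard systemd mechanics, not necessarily what the eventual fix does):

```
# Create a drop-in so the packaged unit file stays untouched
# (equivalently: systemctl edit foreman-proxy)
mkdir -p /etc/systemd/system/foreman-proxy.service.d
cat > /etc/systemd/system/foreman-proxy.service.d/malloc.conf <<'EOF'
[Service]
# Limit glibc malloc arenas to curb foreman-proxy memory growth
Environment=MALLOC_ARENA_MAX=2
EOF
systemctl daemon-reload && systemctl restart foreman-proxy
```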
If everything works as expected, the outputs and results are sent from the capsule to Satellite without needing to be kept around for long. However, if this upload fails, Satellite checks back on the capsule every 15 minutes. I'd say we shouldn't go below 2*15 minutes, to stay on the safe side, especially since there's no real benefit in going too low.
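Translating that floor into the tunable itself: a hedged sketch of setting the cleaner age to 2*15 minutes = 1800 seconds on a capsule, assuming smart_proxy_dynflow reads its settings from /etc/foreman-proxy/settings.d/dynflow.yml (the exact file location may differ between versions):

```
# Set the cleaner age in the assumed smart_proxy_dynflow settings file
# (check the file for an existing key before appending)
cat >> /etc/foreman-proxy/settings.d/dynflow.yml <<'EOF'
# Purge finished execution plans after 30 minutes: twice the 15-minute
# interval at which Satellite re-checks the capsule for failed uploads
:execution_plan_cleaner_age: 1800
EOF
systemctl restart foreman-proxy
```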
On a side note, this would be an ideal use case for ephemeral execution plans (plans which destroy themselves if they finish successfully).
Upstream bug assigned to aruzicka

Moving this bug to POST for triage into Satellite since the upstream issue https://projects.theforeman.org/issues/34624 has been resolved.

I can confirm that - apart from one specific test case - memory usage remains stable when repeatedly running many REX jobs of the three types, with MALLOC_ARENA_MAX=2 and :execution_plan_cleaner_age: 1800.

The one specific test case is the simplest one: run "date" on 300-ish systems concurrently, in a loop. Memory usage exhibits a few sudden increases over time, followed by flat memory usage. I am running a much longer test to see whether these steps in memory usage were hiccups we can ignore or something to be concerned about. Until I comment here again within the next week, we can ignore it.

(In reply to Pavel Moravec from comment #9)
> I can confirm that - apart one specific test case - memory usage remains
> stable when repeatedly running many REX jobs of the three types, when having
> MALLOC_ARENA_MAX=2 and :execution_plan_cleaner_age: 1800 .
>
> The one specific test case is the most simple one: run "date" on 300-ish
> systems concurrently, in a loop. Memory usage exhibit few sudden increases
> over the time, followed by flat memory usage.
>
> I am running much longer test to see if these steps in memory increase were
> some hiccups we can ignore, or something to concern. Until I comment here
> more within the next week, we can ignore it.

The test stabilised after one day at pretty low values, so we can ignore the hiccup from the previous test.

Checked on Satellite 6.11 snap 16: I confirm that both the MALLOC_ARENA_MAX setting has been added and the execution plan cleaner interval has been increased.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Satellite 6.11 Release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5498
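As a side note for anyone verifying a fixed system: a hedged sketch of the checks, using standard tooling (file locations and the unit name are assumptions based on this report):

```
# Confirm the arena cap is injected into the foreman-proxy service
# environment (via the packaged unit or a drop-in)
systemctl show foreman-proxy --property=Environment | grep MALLOC_ARENA_MAX

# Confirm the cleaner age tunable landed in the proxy settings
# (assumed settings directory; adjust for your layout)
grep -r execution_plan_cleaner_age /etc/foreman-proxy/settings.d/
```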