Bug 2122872
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | pulpcore-api gunicorn process consumes high memory | | |
| Product: | Red Hat Satellite | Reporter: | Hao Chang Yu <hyu> |
| Component: | Pulp | Assignee: | satellite6-bugs <satellite6-bugs> |
| Status: | CLOSED ERRATA | QA Contact: | Shweta Singh <shwsingh> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.11.2 | CC: | ahumbe, casl, dalley, david.deaderick, dkliban, ehelms, fperalta, ggainey, gubben, hakon.gislason, iballou, jalviso, Ken.Fowler, momran, parathi, pcreech, pdwyer, peter.vreman, pmoravec, rchan, risantam, rlavi, sadas, satellite6-bugs, saydas, tasander, wpinheir, zhunting |
| Target Milestone: | 6.14.0 | Keywords: | Performance, PrioBumpGSS |
| Target Release: | Unused | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | pulpcore-3.18.22-1, pulpcore-3.21.12-1, pulpcore-3.22.9-1 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| Cloned To: | 2211957 (view as bug list) | Environment: | |
| Last Closed: | 2023-11-08 14:18:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Hao Chang Yu
2022-08-31 05:22:51 UTC
CC'ing dalley as he was interested in the OOM issue I reported in https://bugzilla.redhat.com/show_bug.cgi?id=2025811#c16, which also covers high memory usage by gunicorn workers. There is some discussion on https://community.theforeman.org/t/katello-4-5-foreman-3-3-memory-leak-in-gunicorn/29658 where we were able to confirm that process recycling does help keep the memory usage down, although we're still trying to find out why the processes use so much in the first place.

> I think I had already tried amending the API call to exclude the "changelogs" field completely in Katello but didn't see any significant change in memory usage. I will try again just to make sure I didn't make any mistake.

> I think we should be able to exclude such large fields from the call. AFAIK, Katello doesn't use the changelogs field.

It might be excluded from the API result, but it would still be queried from the database, unfortunately. That is very difficult to get around. I did a bit of work on this a while back but hit a roadblock in implementation. See the description of [0] and the PR alongside it.

[0] https://github.com/pulp/pulpcore/issues/2138

(In reply to Daniel Alley from comment #8)
> It might be excluded from the API result but it would still be queried from
> the database, unfortunately. That is very difficult to get around.

Yeah, excluding all large fields didn't make any huge difference in memory usage.

~~~
# Exclude fields
out = rpm_api.list({ "offset" => 0, repository_version: repo.version_href, limit: 1000, exclude_fields: "changelogs,requires,files,provides" })
out = rpm_api.list({ "offset" => 1000, repository_version: repo.version_href, limit: 1000, exclude_fields: "changelogs,requires,files,provides" })
out = rpm_api.list({ "offset" => 2000, repository_version: repo.version_href, limit: 1000, exclude_fields: "changelogs,requires,files,provides" })
~~~
~~~
pulp 9508 18.2 7.2 2879080 2318280 ? S 15:09 0:37 /usr/bin/python3.8 /usr/bin/gunicorn pulpcore.app.wsgi:application --timeout 90 -w 1 --access-logfile - --access-logformat pulp [%({correlation-id}o)s]: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
~~~

vs.

~~~
# No fields excluded
irb(main):032:0> out = rpm_api.list({ "offset" => 0, repository_version: repo.version_href, limit: 1000})
irb(main):033:0> out = rpm_api.list({ "offset" => 1000, repository_version: repo.version_href, limit: 1000})
irb(main):034:0> out = rpm_api.list({ "offset" => 2000, repository_version: repo.version_href, limit: 1000})
~~~
~~~
pulp 10067 28.3 9.2 3503272 2942032 ? S 15:12 1:13 /usr/bin/python3.8 /usr/bin/gunicorn pulpcore.app.wsgi:application --timeout 90 -w 1 --access-logfile - --access-logformat pulp [%({correlation-id}o)s]: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
~~~

Confirmed Daniel's theory about the changelogs causing huge memory usage. For testing purposes, I did a dirty hack on the pulp_rpm "package.py" file by commenting out the "changelogs" field declaration and then performed the Katello content indexing. The memory usage went down to around 1 GB, staying at around 450 MB most of the time.

~~~
pulp 12893 15.8 3.4 1678564 1117504 ? S 15:41 2:53 /usr/bin/python3.8 /usr/bin/gunicorn pulpcore.app.wsgi:application --timeout 90 -w 1 --access-logfile - --access-logformat pulp [%({correlation-id}o)s]: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
~~~
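To make the point above concrete: excluding a field at the API/serializer level only drops it from the JSON output, while the large columns are still selected from the database and materialized in the gunicorn worker, which is what pulpcore issue #2138 tried to address and found hard to wire up. Below is a minimal, hypothetical Django sketch of that distinction only; it is not pulpcore's implementation, and the import path and loop bodies are assumed for illustration.

~~~
# Hypothetical illustration of serializer-level exclusion vs. queryset-level
# deferral. Not pulpcore code; the import path below is an assumption.
from pulp_rpm.app.models import Package  # assumed location of the RPM Package model

# "exclude_fields"-style behaviour: the changelogs column is still part of the
# SQL SELECT and is materialized in the worker's memory; it just never reaches
# the serialized output.
for pkg in Package.objects.all().iterator():
    row = {"name": pkg.name, "version": pkg.version}  # changelogs already loaded

# Queryset-level deferral: the column never appears in the SELECT, so the big
# changelog blobs are never pulled into the gunicorn process at all (accessing
# pkg.changelogs afterwards would trigger an extra per-object query).
for pkg in Package.objects.defer("changelogs").iterator():
    row = {"name": pkg.name, "version": pkg.version}
~~~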
Sayan, you can try the instructions from https://bugzilla.redhat.com/show_bug.cgi?id=2122344

Customers that can't upgrade to 6.11 can try the following workaround. Based on my test, the gunicorn memory usage fluctuates between 850 MB and 1.2 GB after amending the value.

1) Reduce bulk_load_size to 250: hammer settings set --name "bulk_load_size" --value 250
2) Then restart Satellite to free all memory: foreman-maintain service restart

We now have 2 independent confirmations (one from Sayan, one from an upstream user) that these instructions significantly mitigate / resolve the issue: https://bugzilla.redhat.com/show_bug.cgi?id=2122344

Yup, the changelog trimming indeed makes a difference when it comes to keeping the memory consumption in check. So a big yes to having the command documented somewhere in the product docs and suggested for execution [if the environment has been running since 6.10.2 or before].

https://github.com/pulp/pulp_rpm/pull/2784 should also be able to resolve it slowly over time in the event that they neglect to run the command.

Marking this as POST on the basis that the 6.12 upgrade steps ought to resolve the issue - and even if they are skipped, newer versions of the RPM plugin will slowly address the issue as repos are re-synced.

(In reply to Daniel Alley from comment #22)
> It's probably worth filing an issue for Puma, that has come up over and over
> again as being a bit of a memory hog.

I think that if Puma and Postgres resources are properly tuned for the customer's use case, we can easily get past this challenge. By tuning, I mean not just adjusting a few parameter values but also ensuring that the Postgres DB itself does not have big tables for facts/tasks/sessions/audits/logs etc. There is one situation that is tricky to fix, i.e. when a huge amount of network facts is being processed by Satellite from one or more hosts: the related Apache request takes far too long to process, and individual Puma workers can easily eat up to 8 or 12 GB of memory each. That is a really tricky situation to troubleshoot, but the investigation in https://bugzilla.redhat.com/show_bug.cgi?id=2020961 and https://access.redhat.com/solutions/6466501 covers that part, I guess. Long story short, as long as nothing unusual is causing pulpcore-workers or gunicorn processes to overconsume system resources, Puma and Postgres should be manageable and tunable as needed.

(In reply to Daniel Alley from comment #26)
> First, are you certain that the issue is gunicorn, specifically? Memory is
> a shared resource, and even if gunicorn is using too much of it, other
> things could be contributing. It is worth getting a snapshot of the
> system's memory consumption and which processes are responsible even if it
> seems like the OOM killer is going after one in particular.

In that case, the gunicorn processes used >5 GB altogether during the OOM killer event, which happened just after many repos were synced. A sync triggers content indexing, which matches the trigger of this bug. But yes, the earlier statement "gunicorn processes consumed a lot of memory" does not by itself imply this bug is hit.

Could this be incorporated somehow with future 6.11.z updates?

I don't think the tasks are relevant, except maybe indirectly. I think the issue is purely in the API instance / webserver. Pulp does a sync task -> Katello performs an index by querying tons of metadata about the RPMs -> that memory leaks, somehow. The task memory usage ought to be OK, but the completion of the tasks triggers other operations which load up the webserver, where the problem seems to be.

Leaving this note for myself - try using this method for profiling: https://github.com/plasma-umass/scalene/issues/35#issuecomment-1200057714 I've tried several approaches so far, but most memory profilers I've tried seem to cause gunicorn to not work at all.

Daniel, have you noticed if anything that Katello is doing is unnecessary? Just want to make sure that you're not seeing duplicate calls or anything like that.

We may - may - have a fix for this now. It's quite difficult to tell where the memory allocation comes from, but some improvements that are being tested currently could speculatively help with this and a few other issues.

Hey Ian, does Katello use any of the "files", "changelogs", "requires", "provides" fields - does it index them or display them at all?

Sayan, did you ever get around to checking on comment #31?

I have a reproducer and I have verified that the workaround (setting --max-requests to some value like 10 (ten)) works for me. I still don't know what the leak is caused by (it might be deep inside Django), but I know what is leaking - it's all these changelogs strings, files strings and so forth that Satellite probably does not even use from the requests.

(In reply to Daniel Alley from comment #39)
> > So, for some reason, they had to run the changelog trimming a second time, which helped a bit. https://github.com/pulp/pulpcore/issues/2250#issuecomment-1069455371 was also implemented for them, and hence I am not sure if it was this config or the second trimming that helped here.
>
> Both definitely help. I have filed
> https://github.com/theforeman/puppet-pulpcore/issues/285 for adjusting the
> systemd service files in satellite so that perhaps this can get shipped.
> Unfortunately every attempt I've made at figuring out what precisely is
> leaking has failed as tools that would normally help with this seem to break

Any idea why the changelog trimming needs to be run multiple times? I have also noticed a couple of cases where customers had already upgraded their Satellite (so the trim-changelogs script should have been run by the installer), yet some time later they still had the memory issue, and running the script again helped.

I don't think it would help a second time, unless it's a false impression (from restarting or something) or the first time it didn't actually run.
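For context on the --max-requests workaround and the systemd service file change referenced above: worker recycling is a stock gunicorn feature, available as the --max-requests / --max-requests-jitter command-line flags or as settings in a Python config file passed with -c. The snippet below is only a generic illustration with placeholder values; it is not the configuration Satellite ships, where the equivalent flags are applied through the pulpcore-api systemd service files.

~~~
# Illustrative gunicorn config file (e.g. gunicorn -c gunicorn.conf.py ...).
# Placeholder values; the reproducer above used --max-requests 10.
# Recycling a worker after it has served N requests caps how much leaked
# memory any single gunicorn worker can accumulate before it is replaced.
max_requests = 50          # restart each worker after 50 requests
max_requests_jitter = 10   # stagger restarts so all workers don't recycle at once
~~~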
Making some comments public.

I can confirm this is causing the issue: caching without a maxsize specified. https://github.com/pulp/pulpcore/pull/2826/files I still haven't checked why the cache for getting the access policy can be that large. It feels like the content list results are cached too.

(In reply to Hao Chang Yu from comment #49)
> I can confirm this is causing the issue: caching without a maxsize specified.
>
> https://github.com/pulp/pulpcore/pull/2826/files
>
> I still haven't checked why the cache for getting the access policy can be
> that large. It feels like the content list results are cached too.

The issue persists even after setting maxsize=128. This cache shouldn't be large, because my request doesn't really return any access policy (None). I guess the issue could be that the caching is preventing GC of the whole request.
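The cache behaviour described above can be reproduced in isolation. Below is a minimal, hypothetical sketch (not pulpcore's code; FakeView and the payload are invented for illustration) of how functools.lru_cache without a maxsize, keyed on per-request view objects, both grows without bound and keeps every view (and whatever it references) alive:

~~~
# Minimal sketch: an unbounded lru_cache keyed on per-request objects never
# evicts entries and holds strong references to its keys, so the objects can
# never be garbage collected. FakeView is a stand-in invented for illustration.
from functools import lru_cache


class FakeView:
    """Stands in for a DRF view instance, created fresh for every request."""

    def __init__(self, payload):
        self.payload = payload  # imagine request/response data hanging off the view


@lru_cache  # no maxsize -> the cache grows without bound
def get_access_policy(view):
    return None  # even a None result pins `view` in memory via the cache key


for _ in range(1000):
    view = FakeView(payload="x" * 1_000_000)  # ~1 MB per simulated request
    get_access_policy(view)

# All 1000 FakeView objects (and their ~1 MB payloads) are still reachable from
# the cache, so roughly 1 GB stays resident even though every "request" finished.
print(get_access_policy.cache_info())  # CacheInfo(hits=0, misses=1000, maxsize=None, currsize=1000)
~~~

Bounding the cache (e.g. maxsize=128) caps the growth but still pins the 128 most recent views, which is consistent with the observation above that setting maxsize=128 alone did not resolve the issue.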
Based on feedback in my case on this topic, I applied a one-liner fix to disable the LRU cache, and with this the memory leak is gone for me. Even after the weekly refresh of multiple CVs with Red Hat release content from the CDN and updating the CCVs with the updated CVs, the memory usage is really stable again, like I am used to on 6.11 and 6.12.

~~~
--- /usr/lib/python3.9/site-packages/pulpcore/app/access_policy.py
+++ /usr/lib/python3.9/site-packages/pulpcore/app/access_policy.py
@@ -12,7 +12,7 @@
     """
 
     @staticmethod
-    @lru_cache
+#   @lru_cache
     def get_access_policy(view):
         """
         Retrieves the AccessPolicy from the DB or None if it doesn't exist.
~~~

The second confirmation I can make is that the preload option is also a nice improvement that works well on 6.11, reducing the memory footprint in my case by a sustained 1.5-2 GB, which on a 32 GB VM is a nice improvement. Especially in today's cloud, keeping memory under control has a direct positive impact on costs: if I needed 36 GB sustained, it would in the end mean a 64 GB VM, and that is double the cost.

I'm moving this back to POST because there is a new and far superior patch, thanks to Peter and Hao.

I removed https://github.com/pulp/pulpcore/pull/4090 from this BZ and created https://bugzilla.redhat.com/show_bug.cgi?id=2229788 to track it, because this BZ was originally created against Satellite 6.11.2 (the caching mechanism for the Pulpcore API Access Policy wasn't introduced until Satellite 6.13.0) and this BZ already got cloned (2211957) with a fix released for 6.13.2 (which contained some installer changes but did not contain the newer Pulp PR). Please let me know though if this is not correct for any reason.

This BZ is already tracking delivery of the satellite-installer (via puppet-pulpcore) changes for the Satellite 6.14 release, so maybe we could re-attach the newer Pulpcore patch to it, but that feels like overloading this BZ and I don't want to create confusion for QE, support, or end users (and we'd still need a new BZ for the 6.13.z backport of the additional patch, separate from the already released installer changes on that version).

I think that might just make it more confusing, tbh. You're correct that the original issue was reported in 6.11 whereas a separate regression is now being fixed in 6.13. But the original issue seems to have been fixed in 6.12 [0] as far as I can tell, and 6.12 hasn't had any complaints about memory usage either upstream or downstream, whereas 6.10, 6.11, and 6.13 (before this regression fix) have, and several of the users (both upstream and downstream) that previously had issues with memory reported that they were able to use 6.12 without issue. Given the current stage in the lifecycle of 6.11 and 6.12, I feel like just calling it "fixed in 6.14 and 6.13.z" is fine. We've already backported as much of [0] as we can to 6.11, so there isn't much else we can do apart from suggest an upgrade, which users will need to do anyway as EOL is approaching. So, IMO, keeping this BZ as a "meta-issue" is fine.

[0] Probably because of "changelog trimming" and a couple of other memory usage improvements.

Just for reference, the same root cause (searching for packages with big changelogs triggers high memory usage) can also affect incremental CV updates (or, I assume, also publishing a CV with filters).
Here Pulp is requested to *generate* metadata and starts to consume gigabytes of memory with this backtrace:

~~~
File "/opt/theforeman/tfm-pulpcore/root/usr/lib/python3.8/site-packages/pulp_rpm/app/tasks/publishing.py", line 550, in generate_repo_metadata
  for package in packages.order_by("name", "evr").iterator():
File "/opt/theforeman/tfm-pulpcore/root/usr/lib/python3.8/site-packages/django/db/models/query.py", line 353, in _iterator
  yield from self._iterable_class(self, chunked_fetch=use_chunked_fetch, chunk_size=chunk_size)
File "/opt/theforeman/tfm-pulpcore/root/usr/lib/python3.8/site-packages/django/db/models/query.py", line 68, in __iter__
  for row in compiler.results_iter(results):
File "/opt/theforeman/tfm-pulpcore/root/usr/lib/python3.8/site-packages/django/db/models/sql/compiler.py", line 1118, in apply_converters
  for row in map(list, rows):
File "/opt/theforeman/tfm-pulpcore/root/usr/lib/python3.8/site-packages/django/db/models/sql/compiler.py", line 1646, in cursor_iter
  for rows in iter((lambda: cursor.fetchmany(itersize)), sentinel):
File "/opt/theforeman/tfm-pulpcore/root/usr/lib/python3.8/site-packages/django/db/models/sql/compiler.py", line 1646, in <lambda>
  for rows in iter((lambda: cursor.fetchmany(itersize)), sentinel):
File "/opt/theforeman/tfm-pulpcore/root/usr/lib/python3.8/site-packages/django/db/utils.py", line 97, in inner
  return func(*args, **kwargs)
~~~

Also here, rpm-trim-changelogs is a workaround.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Satellite 6.14 security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6818