Bug 2122872
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | pulpcore-api gunicorn process consumes high memory | | |
| Product: | Red Hat Satellite | Reporter: | Hao Chang Yu <hyu> |
| Component: | Pulp | Assignee: | satellite6-bugs <satellite6-bugs> |
| Status: | CLOSED ERRATA | QA Contact: | Shweta Singh <shwsingh> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.11.2 | CC: | ahumbe, casl, dalley, david.deaderick, dkliban, ehelms, fperalta, ggainey, gubben, hakon.gislason, iballou, jalviso, Ken.Fowler, momran, parathi, pcreech, pdwyer, peter.vreman, pmoravec, rchan, risantam, rlavi, sadas, satellite6-bugs, saydas, tasander, wpinheir, zhunting |
| Target Milestone: | 6.14.0 | Keywords: | Performance, PrioBumpGSS |
| Target Release: | Unused | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | pulpcore-3.18.22-1, pulpcore-3.21.12-1, pulpcore-3.22.9-1 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| Cloned To: | 2211957 (view as bug list) | Environment: | |
| Last Closed: | 2023-11-08 14:18:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Hao Chang Yu
2022-08-31 05:22:51 UTC
CC'ing dalley as he was interested in the OOM issue I reported in https://bugzilla.redhat.com/show_bug.cgi?id=2025811#c16, which also covers high memory usage by gunicorn workers. There is some discussion on https://community.theforeman.org/t/katello-4-5-foreman-3-3-memory-leak-in-gunicorn/29658 where we were able to confirm that process recycling does help keep the memory usage down, although we're still trying to find out why the processes use so much in the first place.

> I think I had already tried amending the API call to exclude the "changelogs" field completely in Katello but didn't see any significant change in memory usage. I will try again just to make sure I didn't make any mistake.

> I think we should be able to exclude such large fields from the call. AFAIK, Katello doesn't use the changelogs field.

It might be excluded from the API result, but it would still be queried from the database, unfortunately. That is very difficult to get around. I did a bit of work on this a while back but hit a roadblock in implementation. See the description of [0] and the PR alongside it.

[0] https://github.com/pulp/pulpcore/issues/2138

(In reply to Daniel Alley from comment #8)
> It might be excluded from the API result but it would still be queried from
> the database, unfortunately. That is very difficult to get around.

Yeah, excluding all large fields didn't make any huge difference in memory usage.

~~~
# Exclude fields
out = rpm_api.list({ "offset" => 0, repository_version: repo.version_href, limit: 1000, exclude_fields: "changelogs,requires,files,provides" })
out = rpm_api.list({ "offset" => 1000, repository_version: repo.version_href, limit: 1000, exclude_fields: "changelogs,requires,files,provides" })
out = rpm_api.list({ "offset" => 2000, repository_version: repo.version_href, limit: 1000, exclude_fields: "changelogs,requires,files,provides" })
~~~
~~~
pulp 9508 18.2 7.2 2879080 2318280 ? S 15:09 0:37 /usr/bin/python3.8 /usr/bin/gunicorn pulpcore.app.wsgi:application --timeout 90 -w 1 --access-logfile - --access-logformat pulp [%({correlation-id}o)s]: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
~~~

vs.

~~~
# No fields excluded
irb(main):032:0> out = rpm_api.list({ "offset" => 0, repository_version: repo.version_href, limit: 1000})
irb(main):033:0> out = rpm_api.list({ "offset" => 1000, repository_version: repo.version_href, limit: 1000})
irb(main):034:0> out = rpm_api.list({ "offset" => 2000, repository_version: repo.version_href, limit: 1000})
~~~
~~~
pulp 10067 28.3 9.2 3503272 2942032 ? S 15:12 1:13 /usr/bin/python3.8 /usr/bin/gunicorn pulpcore.app.wsgi:application --timeout 90 -w 1 --access-logfile - --access-logformat pulp [%({correlation-id}o)s]: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
~~~

Confirmed Daniel's theory about the changelogs causing huge memory usage. For testing purposes, I did a dirty hack on the pulp_rpm "package.py" file by commenting out the "changelogs" field declaration and then performed the Katello content indexing. The memory usage went down to around 1 GB, staying at around 450 MB most of the time.

~~~
pulp 12893 15.8 3.4 1678564 1117504 ? S 15:41 2:53 /usr/bin/python3.8 /usr/bin/gunicorn pulpcore.app.wsgi:application --timeout 90 -w 1 --access-logfile - --access-logformat pulp [%({correlation-id}o)s]: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
~~~
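To make the point above concrete: excluding a field at the API/serializer level only drops it from the JSON output, while the large columns are still selected from the database and materialized in the gunicorn worker, which is what pulpcore issue #2138 tried to address and found hard to wire up. Below is a minimal, hypothetical Django sketch of that distinction only; it is not pulpcore's implementation, and the import path and loop bodies are assumed for illustration.

~~~
# Hypothetical illustration of serializer-level exclusion vs. queryset-level
# deferral. Not pulpcore code; the import path below is an assumption.
from pulp_rpm.app.models import Package  # assumed location of the RPM Package model

# "exclude_fields"-style behaviour: the changelogs column is still part of the
# SQL SELECT and is materialized in the worker's memory; it just never reaches
# the serialized output.
for pkg in Package.objects.all().iterator():
    row = {"name": pkg.name, "version": pkg.version}  # changelogs already loaded

# Queryset-level deferral: the column never appears in the SELECT, so the big
# changelog blobs are never pulled into the gunicorn process at all (accessing
# pkg.changelogs afterwards would trigger an extra per-object query).
for pkg in Package.objects.defer("changelogs").iterator():
    row = {"name": pkg.name, "version": pkg.version}
~~~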
Sayan, you can try the instructions from https://bugzilla.redhat.com/show_bug.cgi?id=2122344

Customers that can't upgrade to 6.11 can try the following workaround. Based on my test, the gunicorn memory usage fluctuates between 850 MB and 1.2 GB after amending the value.

1) Reduce bulk_load_size to 250: hammer settings set --name "bulk_load_size" --value 250
2) Then restart Satellite to free all memory: foreman-maintain service restart

We now have 2 independent confirmations (one from Sayan, one from an upstream user) that these instructions significantly mitigate / resolve the issue: https://bugzilla.redhat.com/show_bug.cgi?id=2122344

Yup, the changelog trimming indeed makes a difference when it comes to keeping the memory consumption in check. So a big yes to having the command documented somewhere in the product docs and suggested for execution [if the environment has been running since 6.10.2 or before].

https://github.com/pulp/pulp_rpm/pull/2784 should also be able to resolve it slowly over time in the event that they neglect to run the command.

Marking this as POST on the basis that the 6.12 upgrade steps ought to resolve the issue - and even if they are skipped, newer versions of the RPM plugin will slowly address the issue as repos are re-synced.

(In reply to Daniel Alley from comment #22)
> It's probably worth filing an issue for Puma, that has come up over and over
> again as being a bit of a memory hog.

I think that if Puma and Postgres resources are properly tuned for the customer's use case, we can easily get past this challenge. By tuning, I mean not just adjusting a few parameter values but also ensuring that the Postgres DB itself does not have big tables for facts/tasks/sessions/audits/logs etc. There is one situation that is tricky to fix, i.e. when a huge amount of network facts is being processed by Satellite from one or more hosts: the related Apache request takes far too long to process, and individual Puma workers can easily eat up to 8 or 12 GB of memory each. That is a really tricky situation to troubleshoot, but the investigation in https://bugzilla.redhat.com/show_bug.cgi?id=2020961 and https://access.redhat.com/solutions/6466501 covers that part, I guess. Long story short, as long as nothing unusual is causing pulpcore-workers or gunicorn processes to overconsume system resources, Puma and Postgres should be manageable and tunable as needed.

(In reply to Daniel Alley from comment #26)
> First, are you certain that the issue is gunicorn, specifically? Memory is
> a shared resource, and even if gunicorn is using too much of it, other
> things could be contributing. It is worth getting a snapshot of the
> system's memory consumption and which processes are responsible even if it
> seems like the OOM killer is going after one in particular.

In that case, the gunicorn processes used >5 GB altogether during the OOM killer event, which happened just after many repos were synced. A sync triggers content indexing, which matches the trigger of this bug. But yes, the earlier statement "gunicorn processes consumed a lot of memory" does not by itself imply this bug is hit.

Could this be incorporated somehow with future 6.11.z updates?

I don't think the tasks are relevant, except maybe indirectly. I think the issue is purely in the API instance / webserver. Pulp does a sync task -> Katello performs an index by querying tons of metadata about the RPMs -> that memory leaks, somehow. The task memory usage ought to be OK, but the completion of the tasks triggers other operations which load up the webserver, where the problem seems to be.

Leaving this note for myself - try using this method for profiling: https://github.com/plasma-umass/scalene/issues/35#issuecomment-1200057714 I've tried several approaches so far, but most memory profilers I've tried seem to cause gunicorn to not work at all.

Daniel, have you noticed if anything that Katello is doing is unnecessary? Just want to make sure that you're not seeing duplicate calls or anything like that.

We may - may - have a fix for this now. It's quite difficult to tell where the memory allocation comes from, but some improvements that are being tested currently could speculatively help with this and a few other issues.

Hey Ian, does Katello use any of the "files", "changelogs", "requires", "provides" fields - does it index them or display them at all?

Sayan, did you ever get around to checking on comment #31?

I have a reproducer and I have verified that the workaround (setting --max-requests to some value like 10 (ten)) works for me. I still don't know what the leak is caused by (it might be deep inside Django), but I know what is leaking - it's all these changelogs strings, files strings and so forth that Satellite probably does not even use from the requests.

(In reply to Daniel Alley from comment #39)
> > So, for some reason, they had to run the changelog trimming a second time, which helped a bit. https://github.com/pulp/pulpcore/issues/2250#issuecomment-1069455371 was also implemented for them, and hence I am not sure if it was this config or the second trimming that helped here.
>
> Both definitely help. I have filed
> https://github.com/theforeman/puppet-pulpcore/issues/285 for adjusting the
> systemd service files in satellite so that perhaps this can get shipped.
> Unfortunately every attempt I've made at figuring out what precisely is
> leaking has failed as tools that would normally help with this seem to break

Any idea why the changelog trimming needs to be run multiple times? I have also noticed a couple of cases where customers had already upgraded their Satellite (so the trim-changelogs script should have been run by the installer), yet some time later they still had the memory issue, and running the script again helped.

I don't think it would help a second time, unless it's a false impression (from restarting or something) or the first time it didn't actually run.
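For context on the --max-requests workaround and the systemd service file change referenced above: worker recycling is a stock gunicorn feature, available as the --max-requests / --max-requests-jitter command-line flags or as settings in a Python config file passed with -c. The snippet below is only a generic illustration with placeholder values; it is not the configuration Satellite ships, where the equivalent flags are applied through the pulpcore-api systemd service files.

~~~
# Illustrative gunicorn config file (e.g. gunicorn -c gunicorn.conf.py ...).
# Placeholder values; the reproducer above used --max-requests 10.
# Recycling a worker after it has served N requests caps how much leaked
# memory any single gunicorn worker can accumulate before it is replaced.
max_requests = 50          # restart each worker after 50 requests
max_requests_jitter = 10   # stagger restarts so all workers don't recycle at once
~~~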
Making some comments public.

I can confirm this is causing the issue: caching without a maxsize specified. https://github.com/pulp/pulpcore/pull/2826/files I still haven't checked why the cache for getting the access policy can be that large. It feels like the content list results are cached too.

(In reply to Hao Chang Yu from comment #49)
> I can confirm this is causing the issue: caching without a maxsize specified.
>
> https://github.com/pulp/pulpcore/pull/2826/files
>
> I still haven't checked why the cache for getting the access policy can be
> that large. It feels like the content list results are cached too.

The issue persists even after setting maxsize=128. This cache shouldn't be large, because my request doesn't really return any access policy (None). I guess the issue could be that the caching is preventing GC of the whole request.
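The cache behaviour described above can be reproduced in isolation. Below is a minimal, hypothetical sketch (not pulpcore's code; FakeView and the payload are invented for illustration) of how functools.lru_cache without a maxsize, keyed on per-request view objects, both grows without bound and keeps every view (and whatever it references) alive:

~~~
# Minimal sketch: an unbounded lru_cache keyed on per-request objects never
# evicts entries and holds strong references to its keys, so the objects can
# never be garbage collected. FakeView is a stand-in invented for illustration.
from functools import lru_cache


class FakeView:
    """Stands in for a DRF view instance, created fresh for every request."""

    def __init__(self, payload):
        self.payload = payload  # imagine request/response data hanging off the view


@lru_cache  # no maxsize -> the cache grows without bound
def get_access_policy(view):
    return None  # even a None result pins `view` in memory via the cache key


for _ in range(1000):
    view = FakeView(payload="x" * 1_000_000)  # ~1 MB per simulated request
    get_access_policy(view)

# All 1000 FakeView objects (and their ~1 MB payloads) are still reachable from
# the cache, so roughly 1 GB stays resident even though every "request" finished.
print(get_access_policy.cache_info())  # CacheInfo(hits=0, misses=1000, maxsize=None, currsize=1000)
~~~

Bounding the cache (e.g. maxsize=128) caps the growth but still pins the 128 most recent views, which is consistent with the observation above that setting maxsize=128 alone did not resolve the issue.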
Based on feedback in my case on this topic, I applied a one-liner fix to disable the LRU cache, and with this the memory leak is gone for me. Even after the weekly refresh of multiple CVs with Red Hat release content from the CDN and updating the CCVs with the updated CVs, the memory usage is really stable again, like I am used to on 6.11 and 6.12.

~~~
--- /usr/lib/python3.9/site-packages/pulpcore/app/access_policy.py
+++ /usr/lib/python3.9/site-packages/pulpcore/app/access_policy.py
@@ -12,7 +12,7 @@
     """
 
     @staticmethod
-    @lru_cache
+#   @lru_cache
     def get_access_policy(view):
         """
         Retrieves the AccessPolicy from the DB or None if it doesn't exist.
~~~

The second confirmation I can make is that the preload option is also a nice improvement that works well on 6.11, reducing the memory footprint in my case by a sustained 1.5-2 GB, which on a 32 GB VM is a nice improvement. Especially in today's cloud, keeping memory under control has a direct positive impact on costs: if I needed 36 GB sustained, it would in the end mean a 64 GB VM, and that is double the cost.

I'm moving this back to POST because there is a new and far superior patch, thanks to Peter and Hao.

I removed https://github.com/pulp/pulpcore/pull/4090 from this BZ and created https://bugzilla.redhat.com/show_bug.cgi?id=2229788 to track it, because this BZ was originally created against Satellite 6.11.2 (the caching mechanism for the Pulpcore API Access Policy wasn't introduced until Satellite 6.13.0) and this BZ already got cloned (2211957) with a fix released for 6.13.2 (which contained some installer changes but did not contain the newer Pulp PR). Please let me know though if this is not correct for any reason.

This BZ is already tracking delivery of the satellite-installer (via puppet-pulpcore) changes for the Satellite 6.14 release, so maybe we could re-attach the newer Pulpcore patch to it, but that feels like overloading this BZ and I don't want to create confusion for QE, support, or end users (and we'd still need a new BZ for the 6.13.z backport of the additional patch, separate from the already released installer changes on that version).

I think that might just make it more confusing, tbh. You're correct that the original issue was reported in 6.11 whereas a separate regression is now being fixed in 6.13. But the original issue seems to have been fixed in 6.12 [0] as far as I can tell, and 6.12 hasn't had any complaints about memory usage either upstream or downstream, whereas 6.10, 6.11, and 6.13 (before this regression fix) have, and several of the users (both upstream and downstream) that previously had issues with memory reported that they were able to use 6.12 without issue. Given the current stage in the lifecycle of 6.11 and 6.12, I feel like just calling it "fixed in 6.14 and 6.13.z" is fine. We've already backported as much of [0] as we can to 6.11, so there isn't much else we can do apart from suggest an upgrade, which users will need to do anyway as EOL is approaching. So, IMO, keeping this BZ as a "meta-issue" is fine.

[0] Probably because of "changelog trimming" and a couple of other memory usage improvements.

Just for reference, the same root cause (searching for packages with big changelogs triggers high memory usage) can also affect incremental CV updates (or, I assume, also publishing a CV with filters).
Here Pulp is requested to *generate* metadata and starts to consume gigabytes of memory with this backtrace:

~~~
File "/opt/theforeman/tfm-pulpcore/root/usr/lib/python3.8/site-packages/pulp_rpm/app/tasks/publishing.py", line 550, in generate_repo_metadata
  for package in packages.order_by("name", "evr").iterator():
File "/opt/theforeman/tfm-pulpcore/root/usr/lib/python3.8/site-packages/django/db/models/query.py", line 353, in _iterator
  yield from self._iterable_class(self, chunked_fetch=use_chunked_fetch, chunk_size=chunk_size)
File "/opt/theforeman/tfm-pulpcore/root/usr/lib/python3.8/site-packages/django/db/models/query.py", line 68, in __iter__
  for row in compiler.results_iter(results):
File "/opt/theforeman/tfm-pulpcore/root/usr/lib/python3.8/site-packages/django/db/models/sql/compiler.py", line 1118, in apply_converters
  for row in map(list, rows):
File "/opt/theforeman/tfm-pulpcore/root/usr/lib/python3.8/site-packages/django/db/models/sql/compiler.py", line 1646, in cursor_iter
  for rows in iter((lambda: cursor.fetchmany(itersize)), sentinel):
File "/opt/theforeman/tfm-pulpcore/root/usr/lib/python3.8/site-packages/django/db/models/sql/compiler.py", line 1646, in <lambda>
  for rows in iter((lambda: cursor.fetchmany(itersize)), sentinel):
File "/opt/theforeman/tfm-pulpcore/root/usr/lib/python3.8/site-packages/django/db/utils.py", line 97, in inner
  return func(*args, **kwargs)
~~~

Also here, rpm-trim-changelogs is a workaround.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Satellite 6.14 security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6818