Bug 1994397
Summary: | Increased memory usage of pulp-3 workers during repo sync | |||
---|---|---|---|---|
Product: | Red Hat Satellite | Reporter: | Pavel Moravec <pmoravec> | |
Component: | Pulp | Assignee: | satellite6-bugs <satellite6-bugs> | |
Status: | CLOSED ERRATA | QA Contact: | Lai <ltran> | |
Severity: | high | Docs Contact: | sabuchan | |
Priority: | high | |||
Version: | 6.10.0 | CC: | ahumbe, amarirom, ankulkar, bbuckingham, dalley, dkliban, faguiard, fgarciad, ggainey, jangerrit.kootstra, jbhatia, jentrena, jjansky, jkrajice, jmcdonald, juwatts, jyejare, ktordeur, martin.schlossarek, mkalyat, myllynen, peter.vreman, pmendezh, rchan, redhat-bugzilla, sadas, saydas, shwsingh, smajumda, swachira, ttereshc, wdr, youssef.ghorbal | |
Target Milestone: | 6.11.0 | Keywords: | Performance, Triaged, UserExperience | |
Target Release: | Unused | |||
Hardware: | x86_64 | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | pulp_rpm-3.17, pulpcore-3.16.6, createrepo_c-0.17.6 | Doc Type: | Known Issue | |
Doc Text: |
Cause: During repository sync, pulp-3 workers use evidently more memory than pulp-2 workers.
Consequence: Satellite hits OOM or heavy swapping when syncing several large repos concurrently.
Workaround (if any):
Result:
|
Story Points: | --- | |
Clone Of: | ||||
: | 2082211 (view as bug list) | Environment: | ||
Last Closed: | 2022-07-05 14:29:34 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 2000769 |
Description
Pavel Moravec
2021-08-17 08:40:56 UTC
+1 ... Even though I work on a low-memory system, it used to work fine with pulp2, but with pulp3 there is a big increase in system memory consumption. Without > 25 GB RAM, I guess I cannot sync several big repos at once. I never got into an OOM condition thanks to https://bugzilla.redhat.com/show_bug.cgi?id=1948258 but I have run into https://bugzilla.redhat.com/show_bug.cgi?id=1993773 while syncing EUS 7.7 and RHEL ELS 6Server at the same time.

Comparison of (time and) memory required to sync 12 repos on Sat6.9 and Sat6.10 snap 13. Both Satellites were 32GB / 4 cores, tuned the same (--profile=medium, puma min/max threads 16, puma workers 16). All repos were set to the on-demand download policy. In either case, a single sync was running at a time; memory was taken as the max of RSS (in KB) from "ps aux | grep <pulp-worker>".

sync_repo | 6.9 (time) | 6.9 (memory, start=78196) | 6.10 (time) | 6.10 (memory, start=103992) | 6.10mem/6.9mem |
---|---|---|---|---|---|
1-EPEL8_baseos | 4m | 216120 | 5m | 480108 | 222.15 % |
2-EPEL8_appstream | 16m | 193424 | 4m | 428748 | 221.66 % |
3-EPEL7 | 13m | 184616 | 5m | 368996 | 199.87 % |
4-RHSCL | 25m | 192620 | 10m | 692188 | 359.35 % |
5-Caps6.9 | 47s | 151336 | 43s | 105396 | 69.64 % |
6-RHEL8_baseos | 37m | 270780 | 13m | 2774208 | 1024.52 % |
7-RHEL8_appstream | 2h | 261092 | 21m | 1931632 | 739.83 % |
8-RHEL7 | 2h | 365652 | 36m | 2748472 | 751.66 % |
9-RHEL7_optional | 40m | 263984 | 19m | 1021184 | 386.84 % |
10-RHEL7_extras | 3m | 233148 | 2m | 199020 | 85.36 % |
11-RHEL6_eus | 55m | 248192 | 24m | 2022944 | 815.07 % |
12-Ansible | 58s | 179732 | 42s | 105388 | 58.64 % |

Results summary:
- pulp-3 is evidently faster, esp. on bigger repos; hurray!
- a just-restarted pulp-3 worker consumes 103MB compared to 78MB for pulp-2
- a pulp-3 worker consumes several times more memory than a pulp-2 worker; basically, the bigger the repo synced, the worse pulp-3 is

Some colleagues complained about several times higher RSS usage per worker..?
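
The measurement described above (maximum RSS seen in `ps aux` output for the worker process, then the 6.10/6.9 ratio) could be reproduced with something like the following. This is only a sketch: the function names, the sampling interval and the sample count are illustrative assumptions, not what was actually used to produce the table.

```python
import re
import subprocess
import time


def max_rss_kb(ps_output: str, pattern: str) -> int:
    """Return the largest RSS (KB) among `ps aux` lines whose command matches pattern."""
    peak = 0
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)  # COMMAND is the 11th column and may contain spaces
        if len(fields) < 11 or not re.search(pattern, fields[10]):
            continue
        peak = max(peak, int(fields[5]))  # RSS is the 6th column of `ps aux`
    return peak


def sample_peak_rss(pattern: str, interval_s: float = 5.0, samples: int = 12) -> int:
    """Poll `ps aux` a fixed number of times and keep the highest RSS observed.

    interval_s and samples are arbitrary illustrative defaults.
    """
    peak = 0
    for _ in range(samples):
        out = subprocess.run(
            ["ps", "aux"], capture_output=True, text=True, check=True
        ).stdout
        peak = max(peak, max_rss_kb(out, pattern))
        time.sleep(interval_s)
    return peak


def ratio_percent(old_kb: int, new_kb: int) -> float:
    """Memory of the new version as a percentage of the old one (last table column)."""
    return round(100.0 * new_kb / old_kb, 2)
```

For example, `ratio_percent(216120, 480108)` reproduces the EPEL8_baseos row of the table. A sampling approach like this can miss short peaks that fall between two polls.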
Meanwhile, I noticed that various CV publishes (with filters) suffer from the huge memory consumption even much worse ("repos for Caps6.9 with content frozen in 1.1.2021" CV took >20G for a single worker and the 64G Satellite was swapping..) - I will do a comparison as well, though different pulp task types are in charge (but maybe the root cause is the same?). Let me know if the CV memory consumption problem should be logged as a separate BZ.

(In reply to Pavel Moravec from comment #2)
> Meanwhile, I noticed various CV publishes (with filters) suffer by the huge
> memory consumption even much worse ("repos for Caps6.9 with content frozen
> in 1.1.2021" CV took >20G for a single worker and 64G Satellite was
> swapping..) - I will do a comparison as well, though a different pulp task
> types are in charge (but maybe the root cause is the same?).

So CVs _with_ filters suffer from https://bugzilla.redhat.com/show_bug.cgi?id=1995232 rather than from memory. CVs _without_ filters behave very well in 6.10:

CVcontent_depsolve | 6.9 (time) | 6.9 (memory) | 6.10 (time) | 6.10 (memory) |
---|---|---|---|---|
RHEL8_yes | 10m | 189800 | 3m | 103968 |
RHEL8_no | 10m | 190576 | 3m | 103968 |
RHEL7only_yes | 17m | 295136 | 5m | 103964 |
RHEL7only_no | 17m | 296896 | 6m | 103964 |
RHEL7all_yes | 17m | 294616 | 7m | 103980 |
RHEL7all_no | 17m | 296880 | 7m | 103976 |
EPEL8_yes | 2m | 110088 | 31s | 103968 |
EPEL8_no | 2m | 110652 | 1m | 103964 |
EPEL7_yes | 2m | 143540 | 33s | 103964 |
EPEL7_no | 1m | 143288 | 33s | 103960 |
Caps_yes | 5m | 294108 | 6m | 103980 |
Caps_no | 4m | 295128 | 6m | 103972 |
All_yes | 12m | 293272 | 16m | 104032 |
All_no | 12m | 292740 | 16m | 104004 |

So 6.10 usually behaves better in time and also in space, great! CV promote of the same CVs shows results very similar to publish.

TL;DR: CV publish doesn't seem to suffer from excessive memory usage (for now).

@Pavel CV Publish memory consumption is already covered by https://bugzilla.redhat.com/show_bug.cgi?id=1965936, let us try to keep this issue focused on the memory usage of syncs. It would be good to post your data in the relevant issue though.

Some increase in memory usage is expected.
Pulp 2 did its own custom parsing implementation that streamed over the metadata. Pulp 3 currently partially relies on createrepo_c, which does not currently do so, and so it holds the entirety of primary.xml in memory (which is around 400mb for RHEL7, for example). I say "partially" because the fix for the earlier OOM issues was to use the aforementioned custom streaming approach for other.xml and filelists.xml, which are the largest but simplest files to parse, while using createrepo_c for primary.xml, which is the smallest but most complex to parse.

createrepo_c has been working on adding a full streaming parser for us upstream but it is not yet ready. That would allow us to drop the memory consumption much closer to what you would expect from Pulp 2; however, it will not be ready for the beta and may need to go into a Z-stream. There are a few other things we can look into that may help in the meantime but ^^ sets a ceiling on the improvement we can achieve in the short term, just to set expectations.

I didn't mean to remove the tags, I think those were added while I was writing my post ^^

(In reply to Daniel Alley from comment #4)
> createrepo_c has been working on adding a full streaming parser for us
> upstream but it is not yet ready. That would allow us to drop the memory
> consumption much closer to what you would expect from Pulp 2, however, it
> will not be ready for the beta and may need to go into a Z-stream. There are
> a few other things we can look into that may help in the meantime but ^^
> sets a ceiling on the improvement we can achieve in the short term, just to
> set expectations.

We're still waiting on this; however, the PR to get it into Pulp is ready, so once it's released it won't require too much additional effort to turn it on.
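
To illustrate the difference between loading all of primary.xml and streaming over it, here is a minimal sketch using Python's standard `xml.etree.ElementTree.iterparse`. It is not the createrepo_c API and not Pulp's actual parser; the sample document is a heavily simplified stand-in for a real primary.xml, and `iter_package_names` is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET

# Simplified stand-in for primary.xml; a real file carries far more per package.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns="http://linux.duke.edu/metadata/common" packages="2">
  <package type="rpm"><name>bash</name><arch>x86_64</arch></package>
  <package type="rpm"><name>zlib</name><arch>x86_64</arch></package>
</metadata>
"""

NS = "{http://linux.duke.edu/metadata/common}"


def iter_package_names(stream):
    """Yield package names one at a time without keeping the whole tree in memory."""
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == NS + "package":
            yield elem.findtext(NS + "name")
            root.clear()  # drop already-processed packages so memory stays flat


if __name__ == "__main__":
    for name in iter_package_names(io.BytesIO(SAMPLE)):
        print(name)
```

The point of the pattern is that memory use is bounded by the size of one package record rather than by the size of the whole file, which is the property the streaming parser discussed above is meant to provide.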
I did some testing without Pulp just to see what the parsing cost is, and it seems that this new API only requires 19mb peak to parse the RHEL 6 repo, compared to 392mb for the current parsing method, and 1.9gb for the original one.

https://pulp.plan.io/issues/9309#note-2

The memory consumption when Pulp is involved will definitely be higher than that, but it should still be an improvement. Also, it only required 13 seconds to finish parsing the metadata, whereas the current method takes about 1m10s.

The Pulp upstream bug status is at NEW. Updating the external tracker on this bug.

The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.

@dalley is investigating trying to fix this by GA. We can list it as a known issue if it isn't fixed by the GA - good suggestion.

According to the author, they are hoping to merge the upstream createrepo_c patch within the next 2 weeks or so. If our risk tolerance allows such an upgrade, it will very likely reduce the overhead of syncs. See the two graphs here [0] (but note that these are not graphs of Pulp memory usage, just createrepo_c).

[0] https://pulp.plan.io/issues/9309#note-2

The Pulp upstream bug status is at POST. Updating the external tracker on this bug.

The Pulp upstream bug priority is at High. Updating the external tracker on this bug.

The Pulp upstream bug status is at CLOSED - DUPLICATE. Updating the external tracker on this bug.

The Pulp upstream bug status is at CLOSED - DUPLICATE. Updating the external tracker on this bug.

All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.

(In reply to Daniel Alley from comment #34)
> Which makes 7.0 at least a dozen bugfixes behind even 6.10.3, FWIW... Can I
> request that the next snap pull in pulp_rpm 3.17.3?

Thanks for the info, I didn't know that. The backports definitely must land in the GA candidate :) Having it sooner can speed up e.g.
scale testing (but also ramp-up playing with Sat7 for almost anybody, as my 32GB Satellites were both swapping after importing a manifest and syncing just a few repos..). So having it in the next snap soon would be welcome; in GA it is critical.

*** Bug 2035873 has been marked as a duplicate of this bug. ***

Hello,

In the meantime, we've managed to get the workers' memory consumption under control by leveraging gunicorn worker recycling. The procedure goes like this:

1. Create a gunicorn.conf.py under /var/lib/pulp containing max_requests and max_requests_jitter values:

    [root@rosetta ~]# cat /var/lib/pulp/gunicorn.conf.py
    max_requests = 1000
    max_requests_jitter = 50

2. Restart the specific service:

    systemctl restart pulpcore-api

   (or all services: satellite-maintain service stop && satellite-maintain service start)

The gunicorn workers will then be automatically recycled every 1000 requests (with a jitter to avoid all workers restarting at the same time). One can reduce the max_requests value to get more aggressive recycling.

Youssef Ghorbal
Institut Pasteur

With an additional two changes (still pending release) we've managed to get the memory consumption below 1gb for a RHEL 7 sync, which is the high-water mark for any sync we're currently aware of. So at that point it should represent a significant improvement over Pulp 2 as opposed to a regression.

Compared to 6.10.2, the sync memory consumption should be improved by 70-90% and runtime improved by ~50%. Compared to 6.10.3, it would be more like 50% memory and 20% runtime (since 6.10.3 already had some improvements).

Attaching the relevant issues - we will try to get all of this backported in time for 6.10.4.

Really fantastic numbers @

Since 6.10.4 has already been released: is there a chance that this fix will be backported in a future 6.10.x release? I see this bug as a showstopper for the migration of our production environment from 6.9 to 6.10, as we frequently sync many large repos.
Our test environment, which is already on 6.10, runs into OOMs daily just by syncing some repos.

At least in my environment with Sat6.10, it was not only Pulp causing high memory usage. I also had to tweak the number of Puma workers (each also taking 1.5GB) to get the memory usage down on my 16-CPU, 64GB VM. I have tweaked Puma workers from the default 1.5 x CPUs to 0.75 x CPUs; this reduces Puma memory from 1.5 x 16 x 1.5 = 36GB to 0.75 x 16 x 1.5 = 18GB, saving a whopping 18GB.

Hello,

Jaroslav here, the support engineer who originally identified the issue and worked with Pavel on bringing it forward here.

There are multiple tuning parameters that change performance, memory consumption, CPU and I/O usage on Satellite. The issue filed as this bug is already partially remedied in Satellite 6.10.4. We (as support) hope the latest development will land in a future 6.10.z release as soon as possible. However, there should already be minimal issues on Satellite 6.10.4.

For the most common and reference configuration, the recommendation is Puma workers 4-6, Puma threads 16, Pulp workers 4-8. Given the current auto-tuning, in certain situations the parameter selection, especially for Puma workers/threads, is not ideal.

For a relatively standard memory-centric tuning, matching a lot of use cases, you may start with:
(Standard Satellite, 4-8 CPU cores, 25GB RAM + 10GB swap, ~ 2000 Hosts)

~~~
# satellite-maintain service restart && sleep 120 && free -ht
# satellite-installer --foreman-foreman-service-puma-workers=4 --foreman-foreman-service-puma-threads-min=16 --foreman-foreman-service-puma-threads-max=16 --foreman-proxy-content-pulpcore-worker-count=4
# satellite-maintain service restart
# sleep 120 ## This is to give better figures on the `free` output
# free -ht
~~~

This may slow the bulk synchronization and publishing speed. If your I/O is on HDDs or other slower storage, it may even get faster though.

In either situation, it seems you have either insufficient memory, incorrect configuration or other issues.
Please file a case with Red Hat Support (attaching a Sosreport of the Satellite, and a statement of the current & planned number of registered hosts) in case of any doubt, so we can provide better and more granular advice.

For performance tuning of Satellite 6.10, you may also visit https://access.redhat.com/solutions/6523471

Best Regards,
Jaroslav Krajicek
Technical Support Engineer

Our test environment (6.10.4) is perfectly aligned with the medium profile mentioned in the official Satellite tuning guide (https://access.redhat.com/sites/default/files/attachments/performance_tuning_for_red_hat_satellite_6.10.pdf):

- 8 cores, 32gb RAM, relatively fast SAN storage
- 8 puma workers, 16 puma threads
- 8 pulp workers (default value for 8 cores)

As I understand from your comment, 8 pulp workers are too many for this configuration (at least until this issue is completely fixed). Maybe this should be considered in the official tuning guide?

@martin.schlossarek @peter.vreman There's a bunch of issues being slightly mixed together in this thread, let me try to separate them.

Sync memory consumption
=======================

- One mitigation was released in 6.10.3; it cuts memory usage for syncing RH repos by 30-60%
- One mitigation will be released in 6.10.5 (missed 6.10.4 unfortunately); it cuts memory usage for syncing repos by a further ~30%
- One mitigation is slated for 6.11 (and might be backported to 6.10.z at a later date); it cuts memory usage for syncing repos by a further ~40%

For 6.10.5, a sync of the RHEL7 repo should take at maximum 1.6gb, and any other (smaller) repo should max out at less than that. For 6.11, no individual repo should take more than 1gb to sync.

For gunicorn and Puma memory consumption, please track those in a separate BZ, not this one. This BZ is about memory consumption of Pulp workers during sync.

> This BZ is about memory consumption of Pulp workers during sync.
That is exactly the problem we are facing with 6.10.4 and default tuning parameters (medium) in our test environment. We don't even have all the repos activated that we use in the productive environment (EUS, E4S, Third Party, ...).
We halved the (default) pulp worker count to 4 and we were now able to successfully sync all repos for the first time without OOMs.
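
The trade-off behind halving the worker count can be sketched as simple arithmetic. The figures come from earlier comments in this thread (about 1.5GB per Puma worker, and a 1.6gb peak for a RHEL7 sync expected in 6.10.5); the function names are hypothetical, and the sync figure is a worst-case upper bound that assumes every Pulp worker hits its peak at the same moment.

```python
def puma_memory_gb(cpus: int, workers_per_cpu: float, gb_per_worker: float = 1.5) -> float:
    """Estimated Puma memory: worker count (cpus * workers_per_cpu) times per-worker size."""
    return cpus * workers_per_cpu * gb_per_worker


def worst_case_sync_gb(pulp_workers: int, peak_gb_per_sync: float) -> float:
    """Upper bound on sync memory if all workers peak simultaneously (an assumption)."""
    return pulp_workers * peak_gb_per_sync


if __name__ == "__main__":
    # Puma on a 16-CPU host: default ratio vs. the reduced ratio from the thread
    print(puma_memory_gb(16, 1.5), puma_memory_gb(16, 0.75))
    # 8 vs. 4 Pulp workers, each syncing a RHEL7-sized repo at the 6.10.5 peak
    print(worst_case_sync_gb(8, 1.6), worst_case_sync_gb(4, 1.6))
```

Real usage will normally sit below this bound because syncs rarely peak together, but it gives a rough way to check whether a given worker count fits into the available RAM.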
Agreed, I also just wanted to make it clear what improvements are coming and when, especially since I already mentioned 6.10.4 a while back and the patches didn't make it in time for that release.

(In reply to Martin Schlossarek from comment #55)
> > This BZ is about memory consumption of Pulp workers during sync.
>
> That is exactly the problem we are facing with 6.10.4 and default tuning
> parameters (medium) in our test environment. We don't even have all the
> repos activated that we use in the productive environment (EUS, E4S, Third
> Party, ...).
>
> We halved the (default) pulp worker count to 4 and we were now able to
> successfully sync all repos for the first time without OOMs.

Hello,

indeed, the default tuning relies mainly on the number of CPUs, but it mostly impacts consumed RAM. That imbalance is common to the Puma and pulpcore worker counts (in their default tuning). As I have heard this story several times, I just raised an internal discussion on the topic.

Meanwhile, I would recommend decreasing pulp workers to 4. It might delay some actions like publishing bigger CVs, but it should prevent the OOM killer events.

Further, I would like to point to the Customer Portal discussions at https://access.redhat.com/discussions . That is a better place in case we would like to discuss a more general topic than the very specific "pulp-3 worker is/was consuming too much memory" bugzilla subject.

Tested on Satellite 6.11 snap 18 on RHEL 7 and RHEL 8

When syncing larger repos (RHEL 7 server), pulp uses around 1G of RAM.

Steps to Reproduce:
1. Start large repo sync
2. ps aux | grep pulpcore-worker

Expected Results:
No OOM errors and pulp stays around 1G of memory.

Actual Results:
No OOM errors and pulp stays around 1G of memory.

Notes:
Results were identical on RHEL 7 and RHEL 8. The memory slowly climbs during the sync from 1G to 1.4G and then peaks for half a minute at the end of the sync and hits ~2G.
As long as users don't have multiple syncs peaking at the same time, you should no longer be hitting OOM.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Satellite 6.11 Release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5498

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days |