Bug 1994397

Summary:	Increased memory usage of pulp-3 workers during repo sync
Product:	Red Hat Satellite	Reporter:	Pavel Moravec <pmoravec>
Component:	Pulp	Assignee:	satellite6-bugs <satellite6-bugs>
Status:	CLOSED ERRATA	QA Contact:	Lai <ltran>
Severity:	high	Docs Contact:	sabuchan
Priority:	high
Version:	6.10.0	CC:	ahumbe, amarirom, ankulkar, bbuckingham, dalley, dkliban, faguiard, fgarciad, ggainey, jangerrit.kootstra, jbhatia, jentrena, jjansky, jkrajice, jmcdonald, juwatts, jyejare, ktordeur, martin.schlossarek, mkalyat, myllynen, peter.vreman, pmendezh, rchan, redhat-bugzilla, sadas, saydas, shwsingh, smajumda, swachira, ttereshc, wdr, youssef.ghorbal
Target Milestone:	6.11.0	Keywords:	Performance, Triaged, UserExperience
Target Release:	Unused
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	pulp_rpm-3.17, pulpcore-3.16.6, createrepo_c-0.17.6	Doc Type:	Known Issue
Doc Text:	Cause: During repository sync, pulp-3 workers exhibit evidently higher memory usage when compared to pulp-2 workers. Consequence: Satellite hits OOM or heavy swapping when syncing several large repos concurrently. Workaround (if any): Result:	Story Points:	---
Clone Of:
Clones:	2082211		Environment:
Last Closed:	2022-07-05 14:29:34 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2000769

Description Pavel Moravec 2021-08-17 08:40:56 UTC

Description of problem:
During a repository sync, pulp-3 workers show noticeably higher memory usage than pulp-2 workers. As a result, a Satellite with 20GB RAM (the minimum requirement per the docs) hits OOM or heavy swapping when syncing several bigger repos concurrently.

I will provide particular figures from a pulp-2/pulp-3 memory usage comparison within a few days.

We should either improve the memory consumption of pulp-3 workers, or increase the memory requirements in the Install Guide.


Version-Release number of selected component (if applicable):
Sat 6.10 snap 13


How reproducible:
100%

Steps to Reproduce:
1. Sync several bigger repos and check memory usage of pulp worker processes


Actual results:
1. Memory usage is significantly higher than with pulp-2 (particular figures to follow)


Expected results:
Similar memory usage, OR the Install Guide increases the memory requirements for a 6.10 deployment.


Additional info:

Comment 1 Sayan Das 2021-08-17 12:50:43 UTC

+1 ... 

I work on a low-memory system in particular. It used to work fine with pulp2, but with pulp3 there is a big increase in system memory consumption.


Without > 25 GB of RAM, I guess I cannot sync several big repos at once.


I never got into an OOM condition thanks to https://bugzilla.redhat.com/show_bug.cgi?id=1948258 but I did run into https://bugzilla.redhat.com/show_bug.cgi?id=1993773 while syncing EUS 7.7 and RHEL ELS 6Server at the same time.

Comment 2 Pavel Moravec 2021-08-17 20:23:38 UTC

Comparison of (time and) memory required to sync 12 repos on Sat6.9 and Sat6.10 snap 13.

Both Satellites were 32GB / 4cores, tuned the same (--profile=medium, puma min/max threads 16, puma workers 16).

All repos set to on demand download policy.

In either case, a single sync was running at a time; memory was taken as the maximum RSS (in KB) reported by "ps aux | grep <pulp-worker>".
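
The measurement described above (maximum RSS seen in "ps aux" for a worker process) could be automated roughly as follows. This is only a sketch, not the script used for the figures in this comment: the function names, the default match pattern "pulpcore-worker" and the polling interval are assumptions made for illustration.

```python
import subprocess
import time


def worker_rss_kib(ps_output, pattern="pulpcore-worker"):
    """Map PID -> RSS (KiB) for every `ps aux` line whose command mentions `pattern`."""
    rss_by_pid = {}
    for line in ps_output.splitlines()[1:]:  # skip the header row
        # ps aux columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) == 11 and pattern in fields[10]:
            rss_by_pid[int(fields[1])] = int(fields[5])
    return rss_by_pid


def track_peak_rss(samples, interval_s=5.0, pattern="pulpcore-worker"):
    """Poll `ps aux` `samples` times and keep the highest RSS seen per worker PID."""
    peak = {}
    for _ in range(samples):
        out = subprocess.run(
            ["ps", "aux"], capture_output=True, text=True, check=True
        ).stdout
        for pid, rss in worker_rss_kib(out, pattern).items():
            peak[pid] = max(rss, peak.get(pid, 0))
        time.sleep(interval_s)
    return peak
```

Calling track_peak_rss(samples=720, interval_s=5.0) while a sync runs would sample for about an hour; a short-lived spike between two samples can still be missed.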


sync_repo           6.9(time)   6.9(memory,start=78196)   6.10(time)   6.10(memory,start=103992)   6.10mem/6.9mem
1-EPEL8_baseos      4m          216120                    5m           480108                      222.15 %
2-EPEL8_appstream   16m         193424                    4m           428748                      221.66 %
3-EPEL7             13m         184616                    5m           368996                      199.87 %
4-RHSCL             25m         192620                    10m          692188                      359.35 %
5-Caps6.9           47s         151336                    43s          105396                      69.64 %
6-RHEL8_baseos      37m         270780                    13m          2774208                     1024.52 %
7-RHEL8_appstream   2h          261092                    21m          1931632                     739.83 %
8-RHEL7             2h          365652                    36m          2748472                     751.66 %
9-RHEL7_optional    40m         263984                    19m          1021184                     386.84 %
10-RHEL7_extras     3m          233148                    2m           199020                      85.36 %
11-RHEL6_eus        55m         248192                    24m          2022944                     815.07 %
12-Ansible          58s         179732                    42s          105388                      58.64 %
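
For reference, the last column is simply the 6.10 peak RSS divided by the 6.9 peak RSS, expressed as a percentage. A minimal sketch that recomputes it for three of the rows (the dictionary and function names are made up for this illustration; the numbers are copied from the table):

```python
# Peak RSS values copied from three rows of the table:
# (Satellite 6.9 peak, Satellite 6.10 peak)
peak_rss = {
    "1-EPEL8_baseos": (216120, 480108),
    "6-RHEL8_baseos": (270780, 2774208),
    "12-Ansible": (179732, 105388),
}


def ratio_percent(old_rss, new_rss):
    """6.10 peak as a percentage of the 6.9 peak; above 100 means 6.10 used more."""
    return 100.0 * new_rss / old_rss


for repo, (old, new) in peak_rss.items():
    print(f"{repo}: {ratio_percent(old, new):.2f} %")
```

Values below 100 % (as for the small Ansible repo) mean the pulp-3 worker peaked lower than the pulp-2 worker did.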


Results summary:
- pulp-3 is evidently faster, esp. on bigger repos; hurray!
- a freshly restarted pulp-3 worker consumes 103MB, compared to 78MB for pulp-2
- a pulp-3 worker consumes several times more memory than a pulp-2 worker; basically, the bigger the synced repo, the worse pulp-3 fares


Some colleagues complained about several times higher RSS usage per worker..?


Meanwhile, I noticed that various CV publishes (with filters) suffer from huge memory consumption even more (the "repos for Caps6.9 with content frozen in 1.1.2021" CV took >20G for a single worker, and a 64G Satellite was swapping). I will do a comparison as well, though different pulp task types are in charge (but maybe the root cause is the same?).

Let me know if the CV memory consumption problem should be logged as a separate BZ.

Comment 3 Pavel Moravec 2021-08-19 14:30:05 UTC

(In reply to Pavel Moravec from comment #2)
> Meanwhile, I noticed various CV publishes (with filters) suffer by the huge
> memory consumption even much worse ("repos for Caps6.9 with content frozen
> in 1.1.2021" CV took >20G for a single worker and 64G Satellite was
> swapping..) - I will do a comparison as well, though a different pulp task
> types are in charge (but maybe the root cause is the same?).

So CVs _with_ filters suffer from https://bugzilla.redhat.com/show_bug.cgi?id=1995232 rather than from memory. CVs _without_ filters behave very well in 6.10:

CVcontent_depsolve      6.9(time)       6.9(memory)     6.10(time)      6.10(memory)
RHEL8_yes               10m             189800          3m              103968
RHEL8_no                10m             190576          3m              103968
RHEL7only_yes           17m             295136          5m              103964
RHEL7only_no            17m             296896          6m              103964
RHEL7all_yes            17m             294616          7m              103980
RHEL7all_no             17m             296880          7m              103976
EPEL8_yes               2m              110088          31s             103968
EPEL8_no                2m              110652          1m              103964
EPEL7_yes               2m              143540          33s             103964
EPEL7_no                1m              143288          33s             103960
Caps_yes                5m              294108          6m              103980
Caps_no                 4m              295128          6m              103972
All_yes                 12m             293272          16m             104032
All_no                  12m             292740          16m             104004


So 6.10 usually behaves better in both time and space, great!

CV promote of the same CVs shows results very similar to publish.

TL;DR: CV publish doesn't seem to suffer from excessive memory usage (for now).

Comment 4 Daniel Alley 2021-08-19 16:45:13 UTC

@Pavel CV Publish memory consumption is already covered by https://bugzilla.redhat.com/show_bug.cgi?id=1965936, let us try to keep this issue focused on the memory usage of syncs. It would be good to post your data in the relevant issue though.

Some increase in memory usage is expected. Pulp 2 used its own custom parsing implementation that streamed over the metadata. Pulp 3 currently relies partially on createrepo_c, which does not currently do so and therefore holds the entirety of primary.xml in memory (which is around 400mb for RHEL7, for example). I say "partially" because the fix for the earlier OOM issues was to use the aforementioned custom streaming approach for other.xml and filelists.xml, which are the largest but simplest files to parse, while using createrepo_c for primary.xml, which is the smallest but most complex to parse.
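
To illustrate the difference between the two parsing styles in general terms, here is a minimal sketch using only the Python standard library. It is not Pulp's or createrepo_c's actual code; the tiny SAMPLE document and both function names are invented for the example, and real primary.xml files use XML namespaces and far richer package entries.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a primary.xml file (hypothetical, heavily simplified).
SAMPLE = b"""<metadata>
  <package><name>bash</name></package>
  <package><name>glibc</name></package>
  <package><name>zlib</name></package>
</metadata>"""


def names_whole_tree(data):
    """Build the complete tree first: memory grows with the size of the document."""
    root = ET.fromstring(data)
    return [pkg.findtext("name") for pkg in root.iter("package")]


def names_streaming(stream):
    """Yield package names one at a time, discarding each element's content after use."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "package":
            yield elem.findtext("name")
            # Drop the package's children and text once consumed. The emptied
            # element itself stays attached to the root, so memory still grows
            # slightly, but far less than keeping every package fully parsed.
            elem.clear()


print(names_whole_tree(SAMPLE))
print(list(names_streaming(io.BytesIO(SAMPLE))))
```

Both functions return the same names; the streaming variant only ever holds one fully populated package element at a time, which is the property that keeps peak memory low on large metadata files.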

createrepo_c has been working on adding a full streaming parser for us upstream but it is not yet ready. That would allow us to drop the memory consumption much closer to what you would expect from Pulp 2, however, it will not be ready for the beta and may need to go into a Z-stream. There are a few other things we can look into that may help in the meantime but ^^ sets a ceiling on the improvement we can achieve in the short term, just to set expectations.

Comment 5 Daniel Alley 2021-08-19 17:08:35 UTC

I didn't mean to remove the tags, I think those were added while I was writing my post ^^

Comment 6 Daniel Alley 2021-08-31 20:16:14 UTC

(In reply to Daniel Alley from comment #4)
> createrepo_c has been working on adding a full streaming parser for us
> upstream but it is not yet ready. That would allow us to drop the memory
> consumption much closer to what you would expect from Pulp 2, however, it
> will not be ready for the beta and may need to go into a Z-stream. There are
> a few other things we can look into that may help in the meantime but ^^
> sets a ceiling on the improvement we can achieve in the short term, just to
> set expectations.


We're still waiting on this; however, the PR to get it into Pulp is ready, so once it's released it won't require too much additional effort to turn it on.  I did some testing without Pulp just to see what the parsing cost is, and it seems that this new API only requires 19mb peak to parse the RHEL 6 repo, compared to 392mb for the current parsing method, and 1.9gb for the original one.

https://pulp.plan.io/issues/9309#note-2

The memory consumption when Pulp is involved also will definitely be higher than that, but it should still be an improvement.  Also it only required 13 seconds to finish parsing the metadata whereas the current method takes about 1m10s.

Comment 7 pulp-infra@redhat.com 2021-08-31 21:07:19 UTC

The Pulp upstream bug status is at NEW. Updating the external tracker on this bug.

Comment 8 pulp-infra@redhat.com 2021-08-31 21:07:21 UTC

The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.

Comment 11 Robin Chan 2021-09-13 13:59:41 UTC

@dalley is investigating trying to fix by GA. We can list as a known issue if it isn't fixed by the GA - good suggestion.

Comment 12 Daniel Alley 2021-09-14 13:48:38 UTC

According to the author, they are hoping to merge the upstream createrepo_c patch within the next 2 weeks or so.  If our risk tolerance allows such an upgrade, it will very likely reduce the overhead of syncs. See the two graphs here [0] (but note that these are not graphs of Pulp memory usage, just createrepo_c)

[0] https://pulp.plan.io/issues/9309#note-2

Comment 16 pulp-infra@redhat.com 2021-10-11 20:09:17 UTC

The Pulp upstream bug status is at POST. Updating the external tracker on this bug.

Comment 17 pulp-infra@redhat.com 2021-10-11 20:09:19 UTC

The Pulp upstream bug priority is at High. Updating the external tracker on this bug.

Comment 23 pulp-infra@redhat.com 2021-12-22 16:13:22 UTC

The Pulp upstream bug status is at CLOSED - DUPLICATE. Updating the external tracker on this bug.

Comment 24 pulp-infra@redhat.com 2021-12-22 16:13:24 UTC

The Pulp upstream bug status is at CLOSED - DUPLICATE. Updating the external tracker on this bug.

Comment 25 pulp-infra@redhat.com 2021-12-22 17:19:09 UTC

All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.

Comment 35 Pavel Moravec 2022-02-12 21:00:33 UTC

(In reply to Daniel Alley from comment #34)
> Which makes 7.0 at least a dozen bugfixes behind even 6.10.3, FWIW...  Can I
> request that the next snap pull in pulp_rpm 3.17.3?

Thanks for the info, I didn't know that.

The backports definitely must land in the GA candidate :)

Having it sooner can speed up e.g. scale testing (and also the ramp-up of playing with Sat7 for almost anybody, as my 32GB Satellites were both swapping after importing a manifest and syncing just a few repos). So having it in the next snap soon would be welcome; in GA it is critical.

Comment 41 Grant Gainey 2022-02-22 17:07:21 UTC

*** Bug 2035873 has been marked as a duplicate of this bug. ***

Comment 43 Youssef Ghorbal 2022-03-16 11:20:45 UTC

Hello,

 In the meantime, we've managed to get the workers' memory consumption under control by leveraging gunicorn worker recycling.
 The procedure goes like this:
 Create a gunicorn.conf.py under /var/lib/pulp containing max_requests and max_requests_jitter values:

[root@rosetta ~]# cat /var/lib/pulp/gunicorn.conf.py
max_requests = 1000
max_requests_jitter = 50

 Restart the specific service with systemctl restart pulpcore-api (or all services: satellite-maintain service stop && satellite-maintain service start).
 
 gunicorn workers will then be automatically recycled every 1000 requests (with a jitter to avoid all workers restarting at the same time). One can reduce the max_requests value to get more aggressive recycling.
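
 As a rough illustration of how the jitter spreads out restarts: per the gunicorn documentation, each worker's restart point is randomized by randint(0, max_requests_jitter) on top of max_requests. The sketch below only models that rule; it is not gunicorn's code, and the function name, the fixed seed and the worker count of 4 are assumptions for the example.

```python
import random

MAX_REQUESTS = 1000        # value from the gunicorn.conf.py above
MAX_REQUESTS_JITTER = 50   # value from the gunicorn.conf.py above


def restart_threshold(max_requests, jitter, rng=random):
    """Requests one worker serves before it is recycled (0 means recycling is off)."""
    if max_requests <= 0:
        return 0
    return max_requests + rng.randint(0, jitter)


rng = random.Random(0)  # fixed seed only to make the illustration repeatable
thresholds = [
    restart_threshold(MAX_REQUESTS, MAX_REQUESTS_JITTER, rng) for _ in range(4)
]
print(thresholds)
```

 Every threshold falls between 1000 and 1050, so workers that started together are unlikely to be recycled on the same request count.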

Youssef Ghorbal
Institut Pasteur

Comment 44 Daniel Alley 2022-03-16 16:05:29 UTC

With an additional two changes (still pending release) we've managed to get the memory consumption below 1gb for a RHEL 7 sync, which is the high-water-mark for any sync we're currently aware of.  So at that point it should represent a significant improvement over Pulp 2 as opposed to a regression.

Compared to 6.10.2 the sync memory consumption should be improved by 70-90% and runtime improved by ~50%. 

Compared to 6.10.3 it would be more like 50% memory and 20% runtime (since 6.10.3 already had some improvements)

Attaching the relevant issues - we will try to get all of this backported in time for 6.10.4

Comment 45 Robin Chan 2022-03-16 17:13:13 UTC

Really fantastic numbers @

Comment 47 Martin Schlossarek 2022-04-21 06:12:27 UTC

Since 6.10.4 has already been released: is there a chance that this fix will be backported in a future 6.10.x release?

I see this bug as a showstopper for the migration of our production environment from 6.9 to 6.10, as we frequently sync many large repos. Our test environment, which is already on 6.10, runs into OOMs daily just by syncing some repos.

Comment 48 Peter Vreman 2022-04-21 10:52:18 UTC

At least in my environment with Sat6.10, Pulp was not the only cause of high memory usage.
I also had to tweak the number of Puma workers (each also taking 1.5GB) to get the memory usage down on my 16-CPU, 64GB VM.
I reduced the Puma workers from the default 1.5 x CPUs to 0.75 x CPUs, which reduces Puma memory from 1.5 x 16 x 1.5GB = 36GB to 0.75 x 16 x 1.5GB = 18GB, saving a whopping 18GB.
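
The same arithmetic as a small sketch, in case someone wants to plug in their own CPU count. The function name is made up for the example, and the 1.5GB per worker is the figure observed in this environment, not a general constant.

```python
def puma_memory_gb(cpus, workers_per_cpu, gb_per_worker=1.5):
    """Estimated total Puma memory: (cpus x workers_per_cpu) workers at gb_per_worker each."""
    workers = cpus * workers_per_cpu
    return workers * gb_per_worker


before = puma_memory_gb(cpus=16, workers_per_cpu=1.5)    # default worker count
after = puma_memory_gb(cpus=16, workers_per_cpu=0.75)    # reduced worker count
print(before, after, before - after)
```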

Comment 49 Jaroslav Krajicek 2022-04-21 11:53:00 UTC

Hello,

Jaroslav here, the support engineer who originally identified the issue and worked with Pavel on bringing it forward here.

There are multiple tuning parameters that change performance, memory consumption, CPU and I/O usage on Satellite.
The issue filed as this bug is already partially remedied in Satellite 6.10.4.

We (as support) hope the latest development will land in a future 6.10.z release as soon as possible.
However, there should already be only minimal issues on Satellite 6.10.4.

For the most common, reference configuration, the recommendation is Puma workers 4-6, Puma threads 16, Pulp workers 4-8.
Given the current auto-tuning, the parameter selection is not ideal in certain situations, especially for Puma workers/threads.

For a relatively standard memory-centric tuning, matching a lot of use cases, you may start with:
(Standard Satellite, 4-8 CPU cores, 25GB RAM + 10GB swap, ~ 2000 Hosts)
~~~
# satellite-maintain service restart && sleep 120 && free -ht

# satellite-installer --foreman-foreman-service-puma-workers=4 --foreman-foreman-service-puma-threads-min=16 --foreman-foreman-service-puma-threads-max=16 --foreman-proxy-content-pulpcore-worker-count=4
# satellite-maintain service restart

# sleep 120 ## This is to give better figures on the `free` output
# free -ht
~~~

This may slow down bulk synchronization and publishing. If your I/O is on HDDs or other slower storage, it may even get faster, though.

In either situation, it seems you have either insufficient memory, incorrect configuration or other issues.
Please file a case with Red Hat Support (attaching a Sosreport of the Satellite, and a statement of current & planned number of registered hosts) in case of any doubt, so we can provide better and more granular advice.

For performance tuning of Satellite 6.10, you may also visit
https://access.redhat.com/solutions/6523471

Best Regards,
Jaroslav Krajicek
Technical Support Engineer

Comment 50 Martin Schlossarek 2022-04-21 12:09:49 UTC

Our test environment (6.10.4) is perfectly aligned to the medium profile mentioned in the official satellite tuning guide (https://access.redhat.com/sites/default/files/attachments/performance_tuning_for_red_hat_satellite_6.10.pdf)

8 cores, 32gb ram, relatively fast SAN storage
8 puma workers, 16 puma threads
8 pulp workers (default value for 8 cores)

As I understand from your comment, 8 pulp workers are too many for this configuration (at least until this issue is completely fixed). Maybe this should be considered in the official tuning guide?

Comment 54 Daniel Alley 2022-04-21 13:45:43 UTC

@martin.schlossarek @peter.vreman There's a bunch of issues being slightly mixed together in this thread, let me try to separate them.

Sync memory consumption
=======================

One mitigation was released in 6.10.3; it cuts memory usage for syncing RH repos by 30-60%.

One mitigation will be released in 6.10.5 (missed 6.10.4 unfortunately); it cuts memory usage for syncing repos by a further ~30%.

One mitigation is slated for 6.11 (and might be backported to 6.10.z at a later date); it cuts memory usage for syncing repos by a further ~40%.



For 6.10.5, a sync of the RHEL7 repo should take at maximum 1.6gb, and any other (smaller) repo should max out at less than that.  For 6.11 no individual repo should take more than 1gb to sync.  
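
As a quick consistency check of the two figures above: applying the ~40% reduction slated for 6.11 to the 1.6gb RHEL7 maximum expected for 6.10.5 lands just under 1gb. The sketch assumes the percentage applies multiplicatively to the previous peak, which is an interpretation rather than something measured here; the function and variable names are invented for the example.

```python
def after_reduction(peak_gb, reduction):
    """Peak memory after cutting it by `reduction` (0.40 means 40% less)."""
    return peak_gb * (1.0 - reduction)


peak_6_10_5 = 1.6                                  # RHEL7 maximum quoted for 6.10.5
peak_6_11 = after_reduction(peak_6_10_5, 0.40)     # ~40% further cut slated for 6.11
print(f"{peak_6_11:.2f} gb")
```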


For gunicorn and Puma memory consumption, please track those in a separate BZ, not this one.  This BZ is about memory consumption of Pulp workers during sync.

Comment 55 Martin Schlossarek 2022-04-21 13:57:21 UTC

> This BZ is about memory consumption of Pulp workers during sync.

That is exactly the problem we are facing with 6.10.4 and default tuning parameters (medium) in our test environment. We don't even have all the repos activated that we use in the productive environment (EUS, E4S, Third Party, ...).

We halved the (default) pulp worker count to 4 and we were now able to successfully sync all repos for the first time without OOMs.

Comment 56 Daniel Alley 2022-04-21 14:07:33 UTC

Agreed, I also just wanted to make it clear what improvements are coming and when, especially since I already mentioned 6.10.4 a while back and the patches didn't make it in time for that release.

Comment 57 Pavel Moravec 2022-04-22 08:01:43 UTC

(In reply to Martin Schlossarek from comment #55)
> > This BZ is about memory consumption of Pulp workers during sync.
> 
> That is exactly the problem we are facing with 6.10.4 and default tuning
> parameters (medium) in our test environment. We don't even have all the
> repos activated that we use in the productive environment (EUS, E4S, Third
> Party, ...).
> 
> We halved the (default) pulp worker count to 4 and we were now able to
> successfully sync all repos for the first time without OOMs.

Hello,
indeed, the default tuning relies mainly on the number of CPUs, but what it really impacts is consumed RAM. That imbalance is common to the Puma and pulpcore worker counts (in their default tuning). As I have heard this story several times, I just raised an internal discussion on the topic.

Meanwhile, I would recommend decreasing pulp workers to 4. It might delay some actions like publishing bigger CVs, but it should prevent the OOM killer events.

Further, I would like to point to the Customer Portal discussions at https://access.redhat.com/discussions . That is a better place in case we would like to discuss a more general topic than the very specific "pulp-3 worker is/was consuming too much memory" bugzilla subject.

Comment 59 Griffin Sullivan 2022-05-03 15:34:19 UTC

Tested on Satellite 6.11 snap 18 on RHEL 7 and RHEL 8

When syncing larger repos (RHEL 7 server) pulp uses around 1G of RAM.

Steps to Reproduce:
1. Start large repo sync
2. ps aux | grep pulpcore-worker

Expected Results:
No OOM errors and pulp stays around 1G of memory.

Actual Results:
No OOM errors and pulp stays around 1G of memory.


Notes:
Results were identical on RHEL 7 and RHEL 8. The memory slowly climbs during the sync from 1G to 1.4G and then peaks for half a minute at the end of the sync and hits ~2G. As long as users don't have multiple syncs peaking at the same time, you should no longer be hitting OOM.
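
A back-of-the-envelope way to reason about coinciding peaks, using the ~2G end-of-sync peak noted above. This is only a sketch: the function name is made up, and the 32GB total and 20GB reserved for everything other than syncing workers are hypothetical example values, not measurements from this test.

```python
import math


def max_concurrent_syncs(total_ram_gb, reserved_gb, peak_per_sync_gb=2.0):
    """Largest number of syncs whose end-of-sync peaks could coincide within RAM."""
    available = total_ram_gb - reserved_gb
    if available <= 0:
        return 0
    return math.floor(available / peak_per_sync_gb)


# Hypothetical host: 32GB RAM, 20GB assumed to be used by everything else.
print(max_concurrent_syncs(total_ram_gb=32, reserved_gb=20))
```

Since the peak lasts only about half a minute per sync, this is a pessimistic bound; in practice the peaks rarely all line up.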

Comment 66 errata-xmlrpc 2022-07-05 14:29:34 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Satellite 6.11 Release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5498

Comment 67 Red Hat Bugzilla 2023-09-18 04:25:07 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days