Bug 1388631 - Enable Process Recycling for Pulp Worker Processes
Summary: Enable Process Recycling for Pulp Worker Processes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Pulp
Version: Unspecified
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: Unspecified
Assignee: Martin Bacovsky
QA Contact: Lukas Pramuk
URL:
Whiteboard:
Depends On:
Blocks: 1393409 1405513
 
Reported: 2016-10-25 19:25 UTC by Brian Bouterse
Modified: 2021-12-10 14:46 UTC
CC: 28 users

Fixed In Version: pulp-2.8.7.4-1
Doc Type: Enhancement
Doc Text:
Feature: Satellite allows configuring the maximum number of tasks after which Pulp workers are recycled and release their allocated memory back to the system. This is set to '2' by default. To disable this completely, users can set '--katello-max-tasks-per-pulp-worker' to 'undef'. Reason: Python does not release allocated memory back to the system even after it has been freed, so Pulp workers keep reserving large amounts of memory after processing certain memory-hungry tasks. Result: With satellite-installer --katello-max-tasks-per-pulp-worker 2, each Pulp worker is restarted after every second task it processes and the allocated memory is returned to the system.
Clone Of:
: 1393409 1405513 (view as bug list)
Environment:
Last Closed: 2017-01-26 10:43:27 UTC
Target Upstream Version:
Embargoed:


Attachments
Sample memory usage of the workers at a customer's Satellite 6.2 (71.37 KB, image/png)
2016-10-31 13:41 UTC, Martin Bacovsky


Links
System ID Private Priority Status Summary Last Updated
Foreman Issue Tracker 17298 0 High Closed Enable Process Recycling for Pulp Worker Processes 2020-09-16 20:33:33 UTC
Pulp Redmine 2172 0 Normal CLOSED - CURRENTRELEASE Memory Improvements with Process Recycling 2016-12-19 16:01:50 UTC
Red Hat Product Errata RHBA-2017:0197 0 normal SHIPPED_LIVE Satellite 6.2.7 Async Bug Release 2017-01-26 15:38:38 UTC

Description Brian Bouterse 2016-10-25 19:25:00 UTC
Upstream Pulp added a new feature that should reduce the memory used by Pulp workers. It does this using process recycling. See the upstream bug for more details. To include this in the downstream product you should:

1. cherry pick the 3 commits attached to the upstream bug
2. Enable the feature (see the upstream docs on how to do this; a minimal sketch follows below)

I recommend a value of < 10 for process recycling. A value of 2 would probably be good.

Katello should likely enable this as well when they rebase onto upstream 2.11+.
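
For reference, on the Satellite/Pulp layout the setting behind this feature is PULP_MAX_TASKS_PER_CHILD in /etc/default/pulp_workers (the same file shown in comment 11 below). A minimal sketch of enabling it by hand, assuming the workers are managed by the pulp_workers systemd unit:

# uncomment/set the value in /etc/default/pulp_workers (must be > 0)
sed -i 's/^#\? *PULP_MAX_TASKS_PER_CHILD=.*/PULP_MAX_TASKS_PER_CHILD=2/' /etc/default/pulp_workers

# restart the workers so they pick up --maxtasksperchild
# (unit name assumed; on Satellite the installer manages this for you)
systemctl restart pulp_workers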

Comment 1 pulp-infra@redhat.com 2016-10-25 19:31:22 UTC
The Pulp upstream bug status is at POST. Updating the external tracker on this bug.

Comment 2 pulp-infra@redhat.com 2016-10-25 19:31:24 UTC
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.

Comment 3 pulp-infra@redhat.com 2016-10-25 20:30:59 UTC
The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.

Comment 4 pulp-infra@redhat.com 2016-10-26 05:01:45 UTC
The Pulp upstream bug status is at ON_QA. Updating the external tracker on this bug.

Comment 5 Martin Bacovsky 2016-10-31 13:41:03 UTC
Created attachment 1215814 [details]
Sample memory usage of the workers at a customer's Satellite 6.2

Comment 11 Martin Bacovsky 2016-11-09 15:33:50 UTC
How to test:
The maximum number of tasks per worker should be unlimited by default, to keep the default behaviour from previous Satellite versions.
[root@dell-pe1950-06 ~]# satellite-installer
Installing             Done                                               [100%] [...........................................................................................................................................................................................]
  Success!
  * Satellite is running at https://hostname
  * To install additional capsule on separate machine continue by running:

      capsule-certs-generate --capsule-fqdn "$CAPSULE" --certs-tar "~/$CAPSULE-certs.tar"

  The full log is at /var/log/foreman-installer/satellite.log
[root@dell-pe1950-06 ~]# vim /etc/default/pulp_workers

# Configuration file for Pulp's Celery workers

# Define the number of worker nodes you wish to have here. This defaults to the number of processors
# that are detected on the system if left commented here.
PULP_CONCURRENCY=4

# Configure Python's encoding for writing all logs, stdout and stderr
PYTHONIOENCODING="UTF-8"

# To avoid memory leaks, Pulp can terminate and replace a worker after processing X tasks. If
# left commented, process recycling is disabled. PULP_MAX_TASKS_PER_CHILD must be > 0.

# PULP_MAX_TASKS_PER_CHILD=2 %>

Check that the config was extended and the entry is commented out. Also check that the workers run without --maxtasksperchild (a quicker grep-based check is sketched after the listing below):

[root@dell-pe1950-06 ~]# ps -fax|grep worker
 5839 ?        Ssl    0:02 /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18 --pidfile=/var/run/pulp/resource_manager.pid --heartbeat-interval=30
 5990 ?        S      0:00  \_ /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18 --pidfile=/var/run/pulp/resource_manager.pid --heartbeat-interval=30
 5876 ?        Ssl    0:03 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-0.pid --heartbeat-interval=30
 5996 ?        Sl     0:01  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-0.pid --heartbeat-interval=30
 5879 ?        Ssl    0:03 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-1@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-1.pid --heartbeat-interval=30
 5994 ?        Sl     0:01  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-1@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-1.pid --heartbeat-interval=30
 5882 ?        Ssl    0:03 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-2@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-2.pid --heartbeat-interval=30
 5995 ?        Sl     0:01  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-2@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-2.pid --heartbeat-interval=30
 5888 ?        Ssl    0:03 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-3@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-3.pid --heartbeat-interval=30
 5992 ?        Sl     0:01  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-3@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-3.pid --heartbeat-interval=30
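
A quicker check than reading the whole listing (a sketch; the same grep patterns are used for verification in comment 20 below):

# the config entry should still be commented out
grep PULP_MAX_TASKS_PER_CHILD /etc/default/pulp_workers

# should print 0 while process recycling is disabled
ps -ef | grep -c '[m]axtasksperchild'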


When the maximum is set, the config should be modified accordingly and the workers should be started with the proper --maxtasksperchild parameter:

  [root@dell-pe1950-06 ~]# satellite-installer --katello-max-tasks-per-pulp-worker 2
Installing             Done                                               [100%] [...........................................................................................................................................................................................]
  Success!
  * Satellite is running at https://hostname
  * To install additional capsule on separate machine continue by running:

      capsule-certs-generate --capsule-fqdn "$CAPSULE" --certs-tar "~/$CAPSULE-certs.tar"

  The full log is at /var/log/foreman-installer/satellite.log
[root@dell-pe1950-06 ~]# ps -fax|grep worker
 9610 ?        Ssl    0:02 /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18 --pidfile=/var/run/pulp/resource_manager.pid --heartbeat-interval=30
 9783 ?        S      0:00  \_ /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18 --pidfile=/var/run/pulp/resource_manager.pid --heartbeat-interval=30
 9671 ?        Ssl    0:01 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-0.pid --heartbeat-interval=30 --maxtasksperchild=2
 9785 ?        S      0:00  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-0.pid --heartbeat-interval=30 --maxtasksperchild=2
 9674 ?        Ssl    0:02 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-1@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-1.pid --heartbeat-interval=30 --maxtasksperchild=2
 9787 ?        S      0:00  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-1@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-1.pid --heartbeat-interval=30 --maxtasksperchild=2
 9677 ?        Ssl    0:01 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-2@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-2.pid --heartbeat-interval=30 --maxtasksperchild=2
 9789 ?        S      0:00  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-2@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-2.pid --heartbeat-interval=30 --maxtasksperchild=2
 9680 ?        Ssl    0:02 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-3@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-3.pid --heartbeat-interval=30 --maxtasksperchild=2
 9791 ?        S      0:00  \_ /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-3@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-3.pid --heartbeat-interval=30 --maxtasksperchild=2

[root@dell-pe1950-06 ~]# cat /etc/default/pulp_workers 
# Configuration file for Pulp's Celery workers

# Define the number of worker nodes you wish to have here. This defaults to the number of processors
# that are detected on the system if left commented here.
PULP_CONCURRENCY=4

# Configure Python's encoding for writing all logs, stdout and stderr
PYTHONIOENCODING="UTF-8"

# To avoid memory leaks, Pulp can terminate and replace a worker after processing X tasks. If
# left commented, process recycling is disabled. PULP_MAX_TASKS_PER_CHILD must be > 0.

PULP_MAX_TASKS_PER_CHILD=2

Comment 12 Martin Bacovsky 2016-11-09 18:26:56 UTC
Created redmine issue http://projects.theforeman.org/issues/17298 from this bug

Comment 15 pulp-infra@redhat.com 2016-12-09 17:31:31 UTC
The Pulp upstream bug status is at VERIFIED. Updating the external tracker on this bug.

Comment 16 Pavel Moravec 2016-12-19 11:46:08 UTC
FYI, a quite probable reproducer for the Pulp celery worker memory leak: sync and publish a bigger repo, repeatedly.

In the Satellite world, create a content view, add the RHEL7 base repo (feel free to use a bigger one), publish the CV, and delete it. Do it in a loop. The commands (a memory-sampling sketch follows the loop):

hmr="hammer -u admin -p redhat"

while true; do
	echo "$(date): creating&publishing&deleting a content view with RHEL7 repo"
	$hmr content-view create --name cv_rhel7_test --organization-id=1
	$hmr content-view add-repository --organization-id=1 --repository="Red Hat Enterprise Linux 7 Server RPMs x86_64 7Server" --name=cv_rhel7_test --product="Red Hat Enterprise Linux Server"
	$hmr content-view publish --name=cv_rhel7_test --organization-id=1
	$hmr content-view remove-from-environment --name=cv_rhel7_test --organization-id=1 --lifecycle-environment=Library
	$hmr content-view delete --name=cv_rhel7_test --organization-id=1
	echo "$(date): sleeping"
	sleep 10
done
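
To confirm the workers really keep growing, their resident memory can be sampled from a second terminal while the loop runs; a minimal sketch using plain ps (nothing Satellite-specific assumed):

while true; do
	date
	# largest celery processes first; RSS is reported in KiB
	ps -o pid,rss,cmd -C celery --sort=-rss | head -n 10
	sleep 60
done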

Comment 17 Pavel Moravec 2016-12-19 15:30:19 UTC
(In reply to Pavel Moravec from comment #16)
> FYI, a quite probable reproducer for the Pulp celery worker memory leak: sync and
> publish a bigger repo, repeatedly.
> [...]

I apologize; that increases memory to some extent but stabilizes after a while - no leak.

Comment 18 pulp-infra@redhat.com 2016-12-19 16:01:51 UTC
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.

Comment 19 Chris Duryee 2016-12-21 23:31:03 UTC
Small docs update to account for the change to set the default value to '2'.

Comment 20 Lukas Pramuk 2017-01-25 12:05:07 UTC
VERIFIED.

@Sat6.2.7-Snap2
pulp-server-2.8.7.4-1.el7sat.noarch

# grep ^PULP_MAX_TASKS_PER_CHILD /etc/default/pulp_workers 
PULP_MAX_TASKS_PER_CHILD=2

# ps -efH|grep -c maxtasksperchild=[2]
16

>>> All workers-{0..7} are using the new option, repetitive syncs of big RHEL repositories work fine, and the celery workers consume ~ 1G of memory


# satellite-installer --katello-max-tasks-per-pulp-worker 3
Installing             Done                                               [100%] [..........................................................................................................]
  Success!
  * Satellite is running at https://<SATFQDN>
  * To install additional capsule on separate machine continue by running:

      capsule-certs-generate --capsule-fqdn "$CAPSULE" --certs-tar "~/$CAPSULE-certs.tar"

  The full log is at /var/log/foreman-installer/satellite.log

# grep ^PULP_MAX_TASKS_PER_CHILD /etc/default/pulp_workers 
PULP_MAX_TASKS_PER_CHILD=3

# ps -efH|grep -c maxtasksperchild=[3]
16

>>> Using the installer option one can customize the PULP_MAX_TASKS_PER_CHILD value



# satellite-installer --katello-max-tasks-per-pulp-worker undef
Installing             Done                                               [100%] [..........................................................................................................]
  Success!
  * Satellite is running at https://<SATFQDN>
  * To install additional capsule on separate machine continue by running:

      capsule-certs-generate --capsule-fqdn "$CAPSULE" --certs-tar "~/$CAPSULE-certs.tar"

  The full log is at /var/log/foreman-installer/satellite.log

# grep PULP_MAX_TASKS_PER_CHILD /etc/default/pulp_workers 
# left commented, process recycling is disabled. PULP_MAX_TASKS_PER_CHILD must be > 0.
# PULP_MAX_TASKS_PER_CHILD=2

# ps -efH|grep -c [m]axtasksperchild
0

>>> Using the installer option one can restore the behavior to how it was before the fix

Comment 22 errata-xmlrpc 2017-01-26 10:43:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0197

