Bug 1420823 - pipe leak in any parent pulp celery process triggered by qpidd restart
Summary: pipe leak in any parent pulp celery process triggered by qpidd restart
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Pulp
Version: 6.2.8
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: Unspecified
Assignee: satellite6-bugs
QA Contact: Katello QA List
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-02-09 15:09 UTC by Pavel Moravec
Modified: 2020-08-13 08:51 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-04 18:00:15 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Pulp Redmine 2582 High CLOSED - CURRENTRELEASE Pipe leak in any parent Pulp celery process triggered by qpidd restart 2019-03-12 18:03:37 UTC

Description Pavel Moravec 2017-02-09 15:09:15 UTC
Description of problem:
Whenever I restart qpidd, every parent pulp celery process (i.e. the resource manager parent and every worker parent) allocates four extra file descriptors: two pipes, each with a read end and a write end.

After more than 250 restarts, the affected process runs out of FDs and shuts down.

So, if one restarts the qpidd broker 250 times (admittedly a strong/improbable requirement), pulp ends up with no worker and no resource manager process, unable to perform any task.
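One way to watch the leak as it happens is to count pipe FDs straight from /proc rather than diffing full lsof output. A minimal Python 3 sketch (the helper name is ours, not part of pulp; Linux only):

```python
import os

def count_pipe_fds(pid):
    """Count file descriptors of `pid` that point at pipes
    (the FIFO entries visible in lsof output)."""
    fd_dir = "/proc/%d/fd" % pid
    count = 0
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # fd was closed while we were scanning
        if target.startswith("pipe:"):
            count += 1
    return count

if __name__ == "__main__":
    print(count_pipe_fds(os.getpid()))
```

Running this against the parent celery PIDs before and after each `service qpidd restart` should show the count growing by four each time if the leak is present.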


Version-Release number of selected component (if applicable):
pulp-server-2.8.7.5-1.el7sat.noarch
Sat 6.2.7


How reproducible:
100%


Steps to Reproduce:
(lazy reproducer):

for i in $(seq 1 250); do service qpidd restart; sleep 15; done

then check whether any pulp worker or resource manager process is still up
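That liveness check can be scripted against the --pidfile paths visible in the process listing of the detailed reproducer. A minimal Python 3 sketch (the helper name is ours):

```python
import errno
import glob
import os

def alive_pulp_workers(pidfile_glob="/var/run/pulp/*.pid"):
    """Return the PIDs from pulp pidfiles whose processes still exist."""
    alive = []
    for path in glob.glob(pidfile_glob):
        try:
            with open(path) as f:
                pid = int(f.read().strip())
        except (IOError, ValueError):
            continue  # unreadable or malformed pidfile
        try:
            os.kill(pid, 0)  # signal 0: existence check, sends nothing
        except OSError as e:
            if e.errno == errno.ESRCH:
                continue  # process is gone
            # EPERM: process exists but belongs to another user
        alive.append(pid)
    return alive
```

An empty list after the restart loop confirms the "no worker and no manager process" state.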

(more detailed reproducer)
1. Check PIDs of parent resource_manager and parent worker-* processes:
 
# ps aux | grep celery
apache   24202  0.1  0.5 669132 63500 ?        Ssl  14:59   0:02 /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18 --pidfile=/var/run/pulp/resource_manager.pid --heartbeat-interval=30
apache   24235  0.1  0.5 669792 63712 ?        Ssl  14:59   0:02 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-0.pid --heartbeat-interval=30 --maxtasksperchild=2
apache   24238  0.1  0.5 669116 63612 ?        Ssl  14:59   0:02 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-1@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-1.pid --heartbeat-interval=30 --maxtasksperchild=2
apache   24273  0.1  0.2 661948 33012 ?        Ssl  14:59   0:03 /usr/bin/python /usr/bin/celery beat --app=pulp.server.async.celery_instance.celery --scheduler=pulp.server.async.scheduler.Scheduler
apache   24306  0.0  0.4 668396 55976 ?        Sl   14:59   0:00 /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18 --pidfile=/var/run/pulp/resource_manager.pid --heartbeat-interval=30
apache   25315  0.0  0.4 669792 56860 ?        Sl   15:19   0:00 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-0.pid --heartbeat-interval=30 --maxtasksperchild=2
apache   25811  0.0  0.4 669116 54472 ?        S    15:29   0:00 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-1@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-1.pid --heartbeat-interval=30 --maxtasksperchild=2
root     26213  0.0  0.0 112652   960 pts/0    S+   15:36   0:00 grep --color=auto celery
#


2. take lsof of these processes:
for i in 24202 24235 24238; do lsof -p $i | sort > lsof.${i}.1; done

3. restart qpidd
service qpidd restart

4. take lsof again (to new files):
for i in 24202 24235 24238; do lsof -p $i | sort > lsof.${i}.2; done

5. Compare the lsof outputs:
# wc lsof*
   169   1525  18058 lsof.24202.1
   173   1561  18378 lsof.24202.2
   169   1525  18058 lsof.24235.1
   173   1561  18378 lsof.24235.2
   169   1525  18058 lsof.24238.1
   173   1561  18378 lsof.24238.2
#

Diff shows extra:
> celery  24202 apache   33r     FIFO                0,8       0t0  15901973 pipe
> celery  24202 apache   34w     FIFO                0,8       0t0  15901973 pipe
> celery  24202 apache   35r     FIFO                0,8       0t0  15901975 pipe
> celery  24202 apache   36w     FIFO                0,8       0t0  15901975 pipe

6. Check /proc:
# file /proc/24202/fd/33 /proc/24202/fd/34 /proc/24202/fd/35 /proc/24202/fd/36
/proc/24202/fd/33: broken symbolic link to `pipe:[15901973]'
/proc/24202/fd/34: broken symbolic link to `pipe:[15901973]'
/proc/24202/fd/35: broken symbolic link to `pipe:[15901975]'
/proc/24202/fd/36: broken symbolic link to `pipe:[15901975]'
#

7. Go back to step 3 and repeat.
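The lsof bookkeeping in steps 2-5 can also be scripted directly from /proc. A minimal Python 3 sketch (function names are ours; Linux only):

```python
import os

def fd_snapshot(pid):
    """Map each open fd of `pid` to its /proc symlink target,
    mirroring one `lsof -p` pass."""
    fd_dir = "/proc/%d/fd" % pid
    snap = {}
    for fd in os.listdir(fd_dir):
        try:
            snap[int(fd)] = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # fd closed mid-scan
    return snap

def leaked_fds(before, after):
    """FDs that are open after the restart but were not open
    (to the same object) before it."""
    return {fd: tgt for fd, tgt in after.items() if before.get(fd) != tgt}
```

Per this report, taking a snapshot of each parent celery PID before and after a qpidd restart should show two new pipe objects (four FIFO fds) in the result of `leaked_fds`.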


Actual results:
(lazy reproducer): after the qpidd restarts, pulp processes are gone, no pulp task can be executed

(more detailed reproducer): see above


Expected results:
(lazy reproducer): after qpidd restarts, pulp processes are alive, pulp tasks can be run

(more detailed reproducer): lsof shows the same number of FDs before and after a restart, and no extra pipe FDs (the "broken symbolic link" entries) appear


Additional info:

Comment 2 pulp-infra@redhat.com 2017-02-09 17:31:20 UTC
The Pulp upstream bug status is at NEW. Updating the external tracker on this bug.

Comment 3 pulp-infra@redhat.com 2017-02-09 17:31:23 UTC
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.

Comment 4 pulp-infra@redhat.com 2017-02-10 17:01:37 UTC
The Pulp upstream bug priority is at High. Updating the external tracker on this bug.

Comment 5 pulp-infra@redhat.com 2017-02-22 16:02:05 UTC
The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.

Comment 6 Dennis Kliban 2017-03-14 02:25:26 UTC
The Pulp upstream bug status is at CLOSED - WORKSFORME. Updating the external tracker on this bug.

Comment 7 pulp-infra@redhat.com 2017-03-21 19:18:07 UTC
The Pulp upstream bug status is at NEW. Updating the external tracker on this bug.

Comment 9 Pavel Moravec 2017-06-13 10:23:26 UTC
FYI it *seems* that another reproducer is stopping qpidd for a longer time - the pulp processes will repeatedly attempt to connect to qpidd and fail, and *this* probably triggers the pipe leak - at least that is my deduction from what I see at a customer right now (easy to reproduce, but I have no free time / spare Satellite at the moment).

Comment 10 Brian Bouterse 2017-06-13 13:15:59 UTC
Sat6 is using Celery 3.1.z, but upstream Celery is now at 4.0.z. The current plan (as I understand it) is to focus on verifying/fixing any Celery-based bugs against the 4.0.z stack. We are tracking that work more broadly with this tracker bug: https://pulp.plan.io/issues/2632

The audit done as part of bug 2632 will also cover this issue. I'm not sure when Satellite will switch to 4.0.z, but that is the plan FWIW.

Comment 13 Bryan Kearney 2018-09-04 18:00:15 UTC
Thank you for your interest in Satellite 6. We have evaluated this request, and we do not expect this to be implemented in the product in the foreseeable future. We are therefore closing this out as WONTFIX. If you have any concerns about this, please feel free to contact Rich Jerrido or Bryan Kearney. Thank you.

Comment 14 pulp-infra@redhat.com 2019-03-12 18:03:38 UTC
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.

