Description of problem: Whenever I restart qpidd, every parent pulp celery process (i.e. the resource manager and every worker) allocates/consumes four extra file descriptors for two pipes. After more than 250 restarts, the affected process runs out of FDs and shuts down. So if one restarts the qpidd broker 250 times (admittedly an unlikely scenario), pulp ends up with no worker and no resource manager process, unable to perform any task.

Version-Release number of selected component (if applicable):
pulp-server-2.8.7.5-1.el7sat.noarch
Sat 6.2.7

How reproducible:
100%

Steps to Reproduce:

(lazy reproducer):
for i in $(seq 1 250); do service qpidd restart; sleep 15; done
Then check whether any pulp worker or resource manager process is still up.

(more detailed reproducer):
1. Check the PIDs of the parent resource_manager and parent reserved_resource_worker-* processes:

# ps aux | grep celery
apache 24202 0.1 0.5 669132 63500 ? Ssl 14:59 0:02 /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18 --pidfile=/var/run/pulp/resource_manager.pid --heartbeat-interval=30
apache 24235 0.1 0.5 669792 63712 ? Ssl 14:59 0:02 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-0.pid --heartbeat-interval=30 --maxtasksperchild=2
apache 24238 0.1 0.5 669116 63612 ? Ssl 14:59 0:02 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-1@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-1.pid --heartbeat-interval=30 --maxtasksperchild=2
apache 24273 0.1 0.2 661948 33012 ? Ssl 14:59 0:03 /usr/bin/python /usr/bin/celery beat --app=pulp.server.async.celery_instance.celery --scheduler=pulp.server.async.scheduler.Scheduler
apache 24306 0.0 0.4 668396 55976 ? Sl 14:59 0:00 /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18 --pidfile=/var/run/pulp/resource_manager.pid --heartbeat-interval=30
apache 25315 0.0 0.4 669792 56860 ? Sl 15:19 0:00 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-0.pid --heartbeat-interval=30 --maxtasksperchild=2
apache 25811 0.0 0.4 669116 54472 ? S 15:29 0:00 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-1@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-1.pid --heartbeat-interval=30 --maxtasksperchild=2
root 26213 0.0 0.0 112652 960 pts/0 S+ 15:36 0:00 grep --color=auto celery
#

2. Take an lsof snapshot of these processes:
for i in 24202 24235 24238; do lsof -p $i | sort > lsof.${i}.1; done

3. Restart qpidd:
service qpidd restart

4. Take lsof again (to new files):
for i in 24202 24235 24238; do lsof -p $i | sort > lsof.${i}.2; done

5. Compare the lsof outputs:
# wc lsof*
 169 1525 18058 lsof.24202.1
 173 1561 18378 lsof.24202.2
 169 1525 18058 lsof.24235.1
 173 1561 18378 lsof.24235.2
 169 1525 18058 lsof.24238.1
 173 1561 18378 lsof.24238.2
#

A diff shows the extra FDs:
> celery 24202 apache 33r FIFO 0,8 0t0 15901973 pipe
> celery 24202 apache 34w FIFO 0,8 0t0 15901973 pipe
> celery 24202 apache 35r FIFO 0,8 0t0 15901975 pipe
> celery 24202 apache 36w FIFO 0,8 0t0 15901975 pipe

6. Check /proc:
# file /proc/24202/fd/33 /proc/24202/fd/34 /proc/24202/fd/35 /proc/24202/fd/36
/proc/24202/fd/33: broken symbolic link to `pipe:[15901973]'
/proc/24202/fd/34: broken symbolic link to `pipe:[15901973]'
/proc/24202/fd/35: broken symbolic link to `pipe:[15901975]'
/proc/24202/fd/36: broken symbolic link to `pipe:[15901975]'
#

7. Go to step 3 and repeat.

Actual results:
(lazy reproducer): after the qpidd restarts, the pulp processes are gone and no pulp task can be executed
(more detailed reproducer): see above

Expected results:
(lazy reproducer): after the qpidd restarts, the pulp processes are alive and pulp tasks can be run
(more detailed reproducer): lsof shows the same number of FDs before and after a restart, and no "broken symbolic link" FDs

Additional info:
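The failure mode above can be illustrated with a minimal Python sketch. This is purely illustrative and is not the actual Celery/kombu reconnect code: the idea is that each broker reconnect creates a fresh pipe pair while the stale pair is never closed, so the parent's FD count grows by a fixed amount per restart until the process hits its RLIMIT_NOFILE.

```python
import os

# Illustrative only: mimic a parent process that sets up a new pipe pair
# on every broker reconnect but never closes the previous one.
leaked = []
for _ in range(10):           # pretend qpidd was restarted 10 times
    leaked.append(os.pipe())  # each pair consumes 2 FDs, never released

print(len(leaked) * 2)        # 20 FDs now tied up in stale pipes

# The non-leaking pattern closes the stale pair before opening a new one:
held = []
for _ in range(10):
    if held:
        r, w = held.pop()
        os.close(r)
        os.close(w)
    held.append(os.pipe())

print(len(held) * 2)          # only 2 FDs held, regardless of restart count

# release everything before exiting
for r, w in leaked + held:
    os.close(r)
    os.close(w)
```

With a default nofile limit of 1024 and ~4 FDs leaked per restart, roughly 250 restarts exhaust the table, matching the numbers observed above.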
The Pulp upstream bug status is at NEW. Updating the external tracker on this bug.
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.
The Pulp upstream bug priority is at High. Updating the external tracker on this bug.
The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.
The Pulp upstream bug status is at CLOSED - WORKSFORME. Updating the external tracker on this bug.
FYI, it *seems* another reproducer is stopping qpidd for a longer time: the pulp processes repeatedly attempt to connect to qpidd and fail, and *this* probably triggers the pipe leak. At least, that is my deduction from what I see at a customer site now (easy to reproduce, but no free time / Satellite at the moment).
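To watch the suspected leak live while qpidd is stopped, counting pipe FDs under /proc is enough. The helper below is a hypothetical monitoring sketch (Linux-specific, not part of Pulp or Celery); point it at a celery parent PID and poll it every few seconds, and a steadily growing number confirms the leak.

```python
import os

def pipe_fd_count(pid="self"):
    """Count open pipe FDs of a process via /proc (Linux-specific)."""
    fd_dir = "/proc/%s/fd" % pid
    count = 0
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # the fd was closed while we were iterating
        if target.startswith("pipe:"):
            count += 1
    return count

# e.g. replace "self" with a celery parent PID such as 24202 and poll
# while qpidd is down; on this script itself it just counts its own pipes
print(pipe_fd_count())
```

The same check can be done from a shell with something like `ls -l /proc/<pid>/fd | grep -c pipe`, which matches the /proc inspection already used in the reproducer above.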
Sat6 is using Celery 3.1.z, but upstream Celery is now at 4.0.z. The current plan (as I understand it) is to focus on verifying/fixing any Celery-based bugs against the 4.0.z stack. We are tracking that work more broadly with this tracker bug: https://pulp.plan.io/issues/2632. The audit done as part of bug 2632 will cover this issue as well. I'm not sure when Satellite will switch to 4.0.z, but that is the plan, FWIW.
Thank you for your interest in Satellite 6. We have evaluated this request, and we do not expect this to be implemented in the product in the foreseeable future. We are therefore closing this out as WONTFIX. If you have any concerns about this, please feel free to contact Rich Jerrido or Bryan Kearney. Thank you.
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.