Description of problem:
The stack dump from "Additional info" can be seen in the logs when starting/stopping pulp workers.

Version-Release number of selected component (if applicable):
pulp-server-2.4.0-0.14.beta.fc20.noarch
python-kombu-3.0.15-5.pulp.fc20.noarch

How reproducible:
~1/2

Steps to Reproduce:
1. systemctl stop `systemctl list-unit-files | egrep 'pulp|httpd|goferd' | cut -d\ -f1`
2. check /var/log/messages for the stack dump
3. systemctl start `systemctl list-unit-files | egrep 'pulp|httpd|goferd' | cut -d\ -f1`
4. goto 1.

Actual results:
After repeating this a few times, the stack dump from "Additional info" appears.

Expected results:
No stack dump appears while stopping the workers when the connection_errors property is accessed.

Additional info:
# the stack dump
May 16 13:37:10 ec2-54-220-31-64 systemd: Stopping The Apache HTTP Server...
May 16 13:37:10 ec2-54-220-31-64 systemd: Stopping Pulp's Celerybeat...
May 16 13:37:10 ec2-54-220-31-64 systemd: Stopping Pulp Resource Manager...
May 16 13:37:10 ec2-54-220-31-64 systemd: Stopping Pulp Worker #0...
May 16 13:37:10 ec2-54-220-31-64 systemd: Stopping Pulp Celery Workers...
May 16 13:37:10 ec2-54-220-31-64 celery: celery beat v3.1.11 (Cipater) is starting.
May 16 13:37:10 ec2-54-220-31-64 celery: __ - ... __ - _
May 16 13:37:10 ec2-54-220-31-64 celery: Configuration ->
May 16 13:37:10 ec2-54-220-31-64 celery: . broker -> qpid://guest.compute.amazonaws.com:5672//
May 16 13:37:10 ec2-54-220-31-64 celery: . loader -> celery.loaders.app.AppLoader
May 16 13:37:10 ec2-54-220-31-64 celery: . scheduler -> pulp.server.async.scheduler.Scheduler
May 16 13:37:10 ec2-54-220-31-64 celery: . logfile -> [stderr]@%INFO
May 16 13:37:10 ec2-54-220-31-64 celery: . maxinterval -> now (0s)
May 16 13:37:10 ec2-54-220-31-64 celery: worker: Warm shutdown (MainProcess)
May 16 13:37:10 ec2-54-220-31-64 celery: worker: Warm shutdown (MainProcess)
May 16 13:37:10 ec2-54-220-31-64 celery: Exception in thread Thread-2 (most likely raised during interpreter shutdown):
May 16 13:37:10 ec2-54-220-31-64 celery: Traceback (most recent call last):
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib64/python2.7/threading.py", line 764, in run
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/pulp/server/async/scheduler.py", line 92, in monitor_events
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 732, in __exit__
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 334, in release
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 302, in _close
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 292, in _do_close_self
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 285, in maybe_close_channel
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/kombu/utils/__init__.py", line 320, in __get__
May 16 13:37:10 ec2-54-220-31-64 celery: <type 'exceptions.KeyError'>: 'connection_errors'
May 16 13:37:10 ec2-54-220-31-64 systemd: Stopped Pulp's Celerybeat.
May 16 13:37:11 ec2-54-220-31-64 celery: reserved_resource_worker-0.compute.amazonaws.com ready.
May 16 13:37:11 ec2-54-220-31-64 celery: resource_manager.compute.amazonaws.com ready.
May 16 13:37:11 ec2-54-220-31-64 systemd: Stopped The Apache HTTP Server.
May 16 13:37:12 ec2-54-220-31-64 systemd: Stopped Pulp Worker #0.
*** Bug 1098540 has been marked as a duplicate of this bug. ***
I believe I understand why this occurs, but I'm not sure of the best way to fix it.

In the stack trace you'll see this is related to the monitor_events()[0] method, which runs inside a thread spawned as part of celerybeat. That method uses a connection to the broker, which in this case is qpid; the connection uses kombu and qpid.messaging underneath. I believe the teardown order of the threads is the cause of this exception. If the threads used by qpid.messaging exit first, a connection error is raised on the Receiver object used by the monitor_events thread, which may not yet have exited.

There is already a try/except around the monitor_events() call[1], and that code is designed to log the exception as an error, sleep for 10 seconds, and then re-enter monitor_events(). Of course the thread will have exited long before that. The only downside here is the stack trace, but I want real connection errors captured too, so I'm cautious about silencing this exception.

[0]: https://github.com/pulp/pulp/blob/0acab9c455d9976f904c489553345e2c0ed57444/server/pulp/server/async/scheduler.py#L152
[1]: https://github.com/pulp/pulp/blob/0acab9c455d9976f904c489553345e2c0ed57444/server/pulp/server/async/scheduler.py#L146
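For reference, a minimal sketch of the loop shape described above. The names event_monitor_run and scheduler are simplified stand-ins; only monitor_events() matches the linked scheduler.py:

    import logging
    import time

    _logger = logging.getLogger(__name__)

    def event_monitor_run(scheduler):
        while True:
            try:
                # Opens a broker connection (kombu over qpid.messaging)
                # and blocks, consuming event messages.
                scheduler.monitor_events()
            except Exception as e:
                # Intended to capture real connection errors. During
                # interpreter shutdown, though, this daemon thread is
                # killed before the handler can usefully run, so only
                # the stack trace from the dying thread is printed.
                _logger.error(e)
                time.sleep(10)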
One thing we could do is reorder the sleep(10) and the logging statement here[0], causing the thread to delay logging until it has almost certainly exited. Delayed logging isn't great, so maybe we could delay by 2 seconds, log, and then sleep for 8 seconds; 2 seconds should be enough time for all daemon threads to terminate.

[0]: https://github.com/pulp/pulp/blob/0acab9c455d9976f904c489553345e2c0ed57444/server/pulp/server/async/scheduler.py#L149
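A sketch of that reordering, using the same simplified names as the sketch in the previous comment; the 2s/8s split is the suggestion above, not tested code:

    import logging
    import time

    _logger = logging.getLogger(__name__)

    def event_monitor_run(scheduler):
        while True:
            try:
                scheduler.monitor_events()
            except Exception as e:
                # On interpreter shutdown the daemon thread is reaped
                # during this first sleep, so the spurious error is never
                # logged; a real connection error still gets logged, just
                # 2 seconds late.
                time.sleep(2)
                _logger.error(e)
                # Keep the total back-off before reconnecting at 10s.
                time.sleep(8)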
There is a better way to fix this. The EventMonitor thread should be converted to a non-daemon thread that waits for a signal to shut down properly. During thread exit, it should properly de-allocate the resources used by the thread. See the note in the threading library documentation[0] that recommends this very fix for this type of problem.

[0]: https://docs.python.org/2/library/threading.html#thread-objects
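A hedged sketch of what that could look like, assuming the EventMonitor thread can be restructured around a threading.Event used as the shutdown signal (the pattern from the linked docs). The shutdown() hook is hypothetical; celerybeat teardown would need to call it, and monitor_events() would also have to return once its connection closes rather than block forever:

    import logging
    import threading

    _logger = logging.getLogger(__name__)

    class EventMonitor(threading.Thread):

        def __init__(self, scheduler):
            super(EventMonitor, self).__init__()
            # Non-daemon: the interpreter waits for run() to return, so
            # qpid.messaging resources are released in a defined order.
            self.daemon = False
            self._scheduler = scheduler
            self._stop = threading.Event()

        def run(self):
            while not self._stop.is_set():
                try:
                    self._scheduler.monitor_events()
                except Exception as e:
                    _logger.error(e)
                # Interruptible back-off instead of time.sleep(10).
                self._stop.wait(10)

        def shutdown(self):
            # Called from the main thread during teardown; run() exits at
            # the next loop check and the broker connection is closed
            # before interpreter shutdown begins.
            self._stop.set()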
Moved to https://pulp.plan.io/issues/431