Bug 1098536 - A key-error for connection_errors Stack dump appears when stopping pulp workers from time to time
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Pulp
Classification: Retired
Component: async/tasks
Version: 2.4 Beta
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Assignee: pulp-bugs
QA Contact: pulp-qe-list
URL:
Whiteboard:
Duplicates: 1098540
Depends On:
Blocks:
 
Reported: 2014-05-16 13:45 UTC by mkovacik
Modified: 2015-02-28 22:06 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-02-28 22:06:30 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Pulp Redmine 431 0 None None None Never

Description mkovacik 2014-05-16 13:45:51 UTC
Description of problem:
The stack dump from "Additional Info" can be seen in the logs when starting/stopping the Pulp workers.

Version-Release number of selected component (if applicable):
pulp-server-2.4.0-0.14.beta.fc20.noarch
python-kombu-3.0.15-5.pulp.fc20.noarch

How reproducible:
~1/2

Steps to Reproduce:
1. systemctl stop `systemctl list-unit-files | egrep 'pulp|httpd|goferd' | cut -d\  -f1`
2. check /var/log/messages for the stack dump
3. systemctl start `systemctl list-unit-files | egrep 'pulp|httpd|goferd' | cut -d\  -f1`
4. goto 1.

Actual results:
Repeating the steps a few times, the stack dump from Additional Info appears.

Expected results:
No stack dump appears while stopping the workers; accessing the connection_errors property does not raise a KeyError.


Additional info:
# the stack dump
May 16 13:37:10 ec2-54-220-31-64 systemd: Stopping The Apache HTTP Server...
May 16 13:37:10 ec2-54-220-31-64 systemd: Stopping Pulp's Celerybeat...
May 16 13:37:10 ec2-54-220-31-64 systemd: Stopping Pulp Resource Manager...
May 16 13:37:10 ec2-54-220-31-64 systemd: Stopping Pulp Worker #0...
May 16 13:37:10 ec2-54-220-31-64 systemd: Stopping Pulp Celery Workers...
May 16 13:37:10 ec2-54-220-31-64 celery: celery beat v3.1.11 (Cipater) is starting.
May 16 13:37:10 ec2-54-220-31-64 celery: __    -    ... __   -        _
May 16 13:37:10 ec2-54-220-31-64 celery: Configuration ->
May 16 13:37:10 ec2-54-220-31-64 celery: . broker -> qpid://guest.compute.amazonaws.com:5672//
May 16 13:37:10 ec2-54-220-31-64 celery: . loader -> celery.loaders.app.AppLoader
May 16 13:37:10 ec2-54-220-31-64 celery: . scheduler -> pulp.server.async.scheduler.Scheduler
May 16 13:37:10 ec2-54-220-31-64 celery: . logfile -> [stderr]@%INFO
May 16 13:37:10 ec2-54-220-31-64 celery: . maxinterval -> now (0s)
May 16 13:37:10 ec2-54-220-31-64 celery: worker: Warm shutdown (MainProcess)
May 16 13:37:10 ec2-54-220-31-64 celery: worker: Warm shutdown (MainProcess)
May 16 13:37:10 ec2-54-220-31-64 celery: Exception in thread Thread-2 (most likely raised during interpreter shutdown):
May 16 13:37:10 ec2-54-220-31-64 celery: Traceback (most recent call last):
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib64/python2.7/threading.py", line 764, in run
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/pulp/server/async/scheduler.py", line 92, in monitor_events
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 732, in __exit__
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 334, in release
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 302, in _close
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 292, in _do_close_self
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 285, in maybe_close_channel
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/kombu/utils/__init__.py", line 320, in __get__
May 16 13:37:10 ec2-54-220-31-64 celery: <type 'exceptions.KeyError'>: 'connection_errors'
May 16 13:37:10 ec2-54-220-31-64 systemd: Stopped Pulp's Celerybeat.
May 16 13:37:11 ec2-54-220-31-64 celery: reserved_resource_worker-0.compute.amazonaws.com ready.
May 16 13:37:11 ec2-54-220-31-64 celery: resource_manager.compute.amazonaws.com ready.
May 16 13:37:11 ec2-54-220-31-64 systemd: Stopped The Apache HTTP Server.
May 16 13:37:12 ec2-54-220-31-64 systemd: Stopped Pulp Worker #0.

Comment 1 Sayli Karmarkar 2014-05-16 16:25:58 UTC
*** Bug 1098540 has been marked as a duplicate of this bug. ***

Comment 2 Brian Bouterse 2014-05-17 19:11:25 UTC
I believe I understand why this occurs, but I'm not sure of the best way to fix it.

In the stack trace you'll see this is related to the monitor_events[0] method, which is run inside a thread spawned as part of celerybeat. This method uses a connection to the broker, which in this case is Qpid. That connection uses kombu and qpid.messaging underneath.

I believe the teardown order of the threads is the cause of this exception. If the threads used by qpid.messaging exit first, a connection error is raised on the Receiver object used by the monitor_events thread, which may not yet have exited.

There is already a try/except around the monitor_events() call[1], and that code is designed to log the exception as an error, sleep for 10 seconds, and then re-enter monitor_events(). Of course the thread will have exited long before that. The only downside here is the stack trace, but I want real connection errors captured as well, so I'm cautious about silencing this exception.

[0]:  https://github.com/pulp/pulp/blob/0acab9c455d9976f904c489553345e2c0ed57444/server/pulp/server/async/scheduler.py#L152

[1]:  https://github.com/pulp/pulp/blob/0acab9c455d9976f904c489553345e2c0ed57444/server/pulp/server/async/scheduler.py#L146
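
For reference, a minimal sketch of the retry loop described above. This is not the actual Pulp code from [1]; the function name, logger, and scheduler argument are illustrative only.

import logging
import time

_logger = logging.getLogger(__name__)


def monitor_events_forever(scheduler):
    # Keep re-entering monitor_events(); on failure, log the traceback as an
    # error and retry after 10 seconds. During daemon shutdown the interpreter
    # tears this thread down long before the sleep finishes, so the only
    # visible effect is the logged traceback.
    while True:
        try:
            scheduler.monitor_events()
        except Exception:
            _logger.exception('error while monitoring events; retrying in 10 seconds')
            time.sleep(10)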

Comment 3 Brian Bouterse 2014-05-17 19:15:20 UTC
One thing we could do is reorder the sleep(10) and the logging statement here[0], so that the thread delays logging until it has almost certainly exited. Delayed logging isn't great, so maybe we could delay by 2 seconds, log, and then sleep for 8 seconds. Two seconds should be enough time for all daemon threads to terminate.

[0]:  https://github.com/pulp/pulp/blob/0acab9c455d9976f904c489553345e2c0ed57444/server/pulp/server/async/scheduler.py#L149
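
A sketch of that reordering, using the same illustrative names as the sketch above (timings as suggested in this comment; not the actual Pulp code):

import logging
import time

_logger = logging.getLogger(__name__)


def monitor_events_forever(scheduler):
    while True:
        try:
            scheduler.monitor_events()
        except Exception:
            # Wait briefly first: during shutdown this daemon thread terminates
            # within ~2 seconds, so the spurious shutdown KeyError is never
            # logged, while real connection errors still get logged afterwards.
            time.sleep(2)
            _logger.exception('error while monitoring events; retrying in 8 seconds')
            time.sleep(8)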

Comment 4 Brian Bouterse 2014-05-23 16:19:27 UTC
There is a better way to fix this. The EventMonitor thread should be converted to a non-daemon thread that waits for a signal to shut down properly. On thread exit, it should properly de-allocate the resources used by the thread.

See the note in the threading library documentation[0] that recommends this very fix for this type of problem.

[0]:  https://docs.python.org/2/library/threading.html#thread-objects
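
A sketch of that approach, following the pattern from the threading docs. Class and method names here are hypothetical, not Pulp's actual implementation.

import threading


class EventMonitor(threading.Thread):
    """Non-daemon monitor thread that exits cleanly when asked to stop."""

    def __init__(self, scheduler):
        super(EventMonitor, self).__init__()
        self.daemon = False                  # interpreter waits for this thread at exit
        self._scheduler = scheduler
        self._stop_event = threading.Event()

    def run(self):
        while not self._stop_event.is_set():
            try:
                self._scheduler.monitor_events()
            except Exception:
                # Wait before retrying, but wake up immediately if stop() is called.
                self._stop_event.wait(10)
        # Thread is exiting on request: release broker resources here (e.g. close
        # the kombu connection) instead of relying on interpreter teardown.

    def stop(self):
        self._stop_event.set()


# At celerybeat shutdown:
#     monitor.stop()
#     monitor.join()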

Comment 5 Brian Bouterse 2015-02-28 22:06:30 UTC
Moved to https://pulp.plan.io/issues/431

