Bug 1339288
| Summary: | pulp celery worker unhandled exception duiring startup when cancelling a task | ||
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Pavel Moravec <pmoravec> |
| Component: | Pulp | Assignee: | satellite6-bugs <satellite6-bugs> |
| Status: | CLOSED WONTFIX | QA Contact: | jcallaha |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 6.1.8 | CC: | bkearney, bmbouter, cwelton, daviddavis, dkliban, ggainey, ipanova, ktordeur, mhrivnak, pcreech, rchan, ttereshc |
| Target Milestone: | Unspecified | Keywords: | Reopened, Triaged |
| Target Release: | Unused | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2018-09-04 18:04:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Per 6.3 planning, moving out non acked bugs to the backlog I tried 4 times to reproduce this, but I couldn't. Note this was reported against Pulp 2.6 and I tested against 2.9. Also note that I'm on fedora 24 and satellite's Qpid packages are different based on MRG when upstream is based on Qpid directly. In my dev environment I ran: # sync the zoo repo and kill -9 all workers just after it starts pulp-admin rpm repo sync run --repo-id zoo sudo pkill -9 -f celery # verify the task is at running and get the task id as <task_id> pulp-admin task list # stop pulp and qpidd pstop sudo systemctl stop qpidd # start all pulp service and immediately after issue the cancel pstart pulp-admin tasks cancel --task-id <task_id> # Observe that the cancel halts waiting on the message bus # Observe the expected wait-and-retry with backoff behavior from all pulp processes # Bring qpidd back online sudo systemctl start qpidd # Observe all pulp services connect as expected and the task cancellation unblocks and returns # Observe that no unexpected tracebacks are in the logs In the past month several people have observed unexpected tracebacks raised by qpid.messaging related to connection issues and errors. Also there have been some deadlocking issues [0]. I think this issue needs to be reproduced or re-observed against Sat 6.2.z which is based on Pulp 2.8.z If it does reproduce we need to get assistance and support from Qpid team to troubleshoot it. I don't think there is likely an adjustment to Pulp which will improve this issue. [0]: https://issues.apache.org/jira/browse/QPID-7317 Brian, thank you for the investigation and feedback. Based on the comments above, I am going to recommend that we move this one to ON_QA for QE verification. qa -> jcalla I am unable to reproduce this issue in Satellite 6.2.2 Async Snap 2. Kicked off a repo sync, then stopped pulp services for i in "pulp_celerybeat pulp_resource_manager pulp_workers"; do systemctl stop $i; done stopped qpidd systemctl stop qpidd Then re-enabled the pulp services for i in "pulp_celerybeat pulp_resource_manager pulp_workers"; do systemctl start $i; done and cancelled the job in UI. The task stopped successfully without needing qpidd to be restarted. restarted qpidd and performed a half dozen variations of syn'ing a repo while restarting services and cancelling the task. No issues encountered with pulp services starting back up. Reopening since hit by a customer on
pulp-server-2.8.7.3-1.el7sat.noarch
I expect no change fixed this in the very latest pulp-server-2.8.7.5-1.el7sat.noarch either.
Current backtrace:
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) Unrecoverable error: ConnectError('[Errno 111] Connection refused',)
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) Traceback (most recent call last):
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) File "/usr/lib/python2.7/site-packages/celery/worker/__init__.py", line 206, in start
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) self.blueprint.start(self)
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) File "/usr/lib/python2.7/site-packages/celery/bootsteps.py", line 119, in start
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) self.on_start()
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) File "/usr/lib/python2.7/site-packages/celery/apps/worker.py", line 157, in on_start
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) sender=self.hostname, instance=self, conf=self.app.conf,
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) File "/usr/lib/python2.7/site-packages/celery/utils/dispatch/signal.py", line 166, in send
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) response = receiver(signal=self, sender=sender, **named)
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) File "/usr/lib/python2.7/site-packages/pulp/server/async/app.py", line 52, in initialize_worker
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) tasks._delete_worker(sender, normal_shutdown=True)
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 258, in _delete_worker
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) cancel(task_status['task_id'])
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 591, in cancel
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) controller.revoke(task_id, terminate=True)
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) File "/usr/lib/python2.7/site-packages/celery/app/control.py", line 171, in revoke
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) 'signal': signal}, **kwargs)
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) File "/usr/lib/python2.7/site-packages/celery/app/control.py", line 306, in broadcast
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) limit, callback, channel=channel,
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) File "/usr/lib/python2.7/site-packages/kombu/pidbox.py", line 283, in _broadcast
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) chan = channel or self.connection.default_channel
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 756, in default_channel
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) self.connection
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 741, in connection
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) self._connection = self._establish_connection()
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 696, in _establish_connection
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) conn = self.transport.establish_connection()
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) File "/usr/lib/python2.7/site-packages/kombu/transport/qpid.py", line 1600, in establish_connection
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) conn = self.Connection(**opts)
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) File "/usr/lib/python2.7/site-packages/kombu/transport/qpid.py", line 1261, in __init__
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) self._qpid_conn = establish(**self.connection_options)
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 112, in establish
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) conn.open(timeout=timeout)
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) File "<string>", line 6, in open
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 323, in open
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) self.attach(timeout=timeout)
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) File "<string>", line 6, in attach
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 341, in attach
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) if not self._ewait(lambda: self._transport_connected and not self._unlinked(), timeout=timeout):
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 274, in _ewait
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) self.check_error()
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 267, in check_error
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) raise e
Feb 05 00:02:12 satellite pulp[1337]: celery.worker:ERROR: (1337-35232) ConnectError: [Errno 111] Connection refused
Anyway, qpidd broker was running since:
Feb 05 00:02:08 satellite systemd[1]: Started An AMQP message broker daemon..
It might appear after machine reboot when pulp tasks cancellation has not been processed within pulp properly, yet. That differs from the way QE tried to reproduce.
Need to find out a reliable reproducer, though..
Hi, Was able to reproduce this as follows: - Before doing anything the current workers on the Satellite: [root@ktordeur-sat61 ~]# mongo MongoDB shell version: 2.6.11 connecting to: test Server has startup warnings: 2017-01-20T11:32:35.350+0100 [initandlisten] 2017-01-20T11:32:35.350+0100 [initandlisten] ** WARNING: Readahead for /var/lib/mongodb is set to 4096KB 2017-01-20T11:32:35.350+0100 [initandlisten] ** We suggest setting it to 256KB (512 sectors) or less 2017-01-20T11:32:35.350+0100 [initandlisten] ** http://dochub.mongodb.org/core/readahead > use pulp_database switched to db pulp_database > db.workers.find() { "_id" : "scheduler.redhat.com", "last_heartbeat" : ISODate("2017-02-09T09:22:48.320Z") } { "_id" : "resource_manager.redhat.com", "last_heartbeat" : ISODate("2017-02-09T09:23:18.102Z") } { "_id" : "reserved_resource_worker-2.redhat.com", "last_heartbeat" : ISODate("2017-02-09T09:23:18.414Z") } { "_id" : "reserved_resource_worker-0.redhat.com", "last_heartbeat" : ISODate("2017-02-09T09:23:24.513Z") } { "_id" : "reserved_resource_worker-3.redhat.com", "last_heartbeat" : ISODate("2017-02-09T09:23:17.931Z") } { "_id" : "reserved_resource_worker-1.redhat.com", "last_heartbeat" : ISODate("2017-02-09T09:23:18.695Z") } > - Start sync of ~ 8 repositories - Reboot the server after starting the sync - Check workers after reboot Satellite: Last login: Thu Feb 9 09:29:19 2017 from ovpn-204-128.brq.redhat.com [root@ktordeur-sat61 ~]# mongo MongoDB shell version: 2.6.11 connecting to: test > use pulp_database switched to db pulp_database > db.workers.find() { "_id" : "scheduler.redhat.com", "last_heartbeat" : ISODate("2017-02-09T09:33:45.014Z") } { "_id" : "resource_manager.redhat.com", "last_heartbeat" : ISODate("2017-02-09T09:34:49.869Z") } > Checking logs: ~~~ Feb 9 10:24:24 ktordeur-sat61 pulp: celery.worker.job:INFO: Task pulp.server.async.tasks._queue_reserved_task[6d640909-13e3-458f-931e-69694bd08bfe] succeeded in 0.0527766929008s: None Feb 9 11:53:57 ktordeur-sat61 systemd: Started Pulp's Celerybeat. Feb 9 11:53:57 ktordeur-sat61 systemd: Starting Pulp's Celerybeat... Feb 9 11:53:57 ktordeur-sat61 systemd: Starting Pulp Celery Workers... Feb 9 11:53:58 ktordeur-sat61 systemd: Started Pulp Celery Workers. Feb 9 10:33:42 ktordeur-sat61 pulp: celery.beat:INFO: beat: Starting... Feb 9 10:33:45 ktordeur-sat61 pulp: celery.beat:ERROR: beat: Connection error: [Errno 111] Connection refused. Trying again in 2.0 seconds... Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) Unrecoverable error: ConnectError('[Errno 111] Connection refused',) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) Traceback (most recent call last): Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) Unrecoverable error: ConnectError('[Errno 111] Connection refused',) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) File "/usr/lib/python2.7/site-packages/celery/worker/__init__.py", line 206, in start Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) Traceback (most recent call last): Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) self.blueprint.start(self) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) File "/usr/lib/python2.7/site-packages/celery/worker/__init__.py", line 206, in start Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) self.blueprint.start(self) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) File "/usr/lib/python2.7/site-packages/celery/bootsteps.py", line 119, in start Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) File "/usr/lib/python2.7/site-packages/celery/bootsteps.py", line 119, in start Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) self.on_start() Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) self.on_start() Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) File "/usr/lib/python2.7/site-packages/celery/apps/worker.py", line 157, in on_start Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) File "/usr/lib/python2.7/site-packages/celery/apps/worker.py", line 157, in on_start Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) sender=self.hostname, instance=self, conf=self.app.conf, Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) File "/usr/lib/python2.7/site-packages/celery/utils/dispatch/signal.py", line 166, in send Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) response = receiver(signal=self, sender=sender, **named) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) sender=self.hostname, instance=self, conf=self.app.conf, Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) File "/usr/lib/python2.7/site-packages/celery/utils/dispatch/signal.py", line 166, in send Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) File "/usr/lib/python2.7/site-packages/pulp/server/async/app.py", line 52, in initialize_worker Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) response = receiver(signal=self, sender=sender, **named) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) tasks._delete_worker(sender, normal_shutdown=True) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) File "/usr/lib/python2.7/site-packages/pulp/server/async/app.py", line 52, in initialize_worker Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 258, in _delete_worker Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) tasks._delete_worker(sender, normal_shutdown=True) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) cancel(task_status['task_id']) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 258, in _delete_worker Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 591, in cancel Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) cancel(task_status['task_id']) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) controller.revoke(task_id, terminate=True) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) File "/usr/lib/python2.7/site-packages/celery/app/control.py", line 171, in revoke Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) 'signal': signal}, **kwargs) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 591, in cancel Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) File "/usr/lib/python2.7/site-packages/celery/app/control.py", line 306, in broadcast Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) controller.revoke(task_id, terminate=True) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) File "/usr/lib/python2.7/site-packages/celery/app/control.py", line 171, in revoke Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) limit, callback, channel=channel, Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) 'signal': signal}, **kwargs) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) File "/usr/lib/python2.7/site-packages/kombu/pidbox.py", line 283, in _broadcast Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) File "/usr/lib/python2.7/site-packages/celery/app/control.py", line 306, in broadcast Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) limit, callback, channel=channel, Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) chan = channel or self.connection.default_channel Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) File "/usr/lib/python2.7/site-packages/kombu/pidbox.py", line 283, in _broadcast Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) chan = channel or self.connection.default_channel Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 756, in default_channel Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 756, in default_channel Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) self.connection Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) self.connection Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 741, in connection Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) self._connection = self._establish_connection() Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 741, in connection Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) self._connection = self._establish_connection() Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 696, in _establish_connection Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 696, in _establish_connection Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) conn = self.transport.establish_connection() Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) conn = self.transport.establish_connection() Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) File "/usr/lib/python2.7/site-packages/kombu/transport/qpid.py", line 1600, in establish_connection Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) File "/usr/lib/python2.7/site-packages/kombu/transport/qpid.py", line 1600, in establish_connection Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) conn = self.Connection(**opts) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) conn = self.Connection(**opts) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) File "/usr/lib/python2.7/site-packages/kombu/transport/qpid.py", line 1261, in __init__ Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) self._qpid_conn = establish(**self.connection_options) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) File "/usr/lib/python2.7/site-packages/kombu/transport/qpid.py", line 1261, in __init__ Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 112, in establish Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) self._qpid_conn = establish(**self.connection_options) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) conn.open(timeout=timeout) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 112, in establish Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) File "<string>", line 6, in open Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) conn.open(timeout=timeout) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 323, in open Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) File "<string>", line 6, in open Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) self.attach(timeout=timeout) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 323, in open Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) File "<string>", line 6, in attach Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 341, in attach Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) self.attach(timeout=timeout) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) if not self._ewait(lambda: self._transport_connected and not self._unlinked(), timeout=timeout): Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) File "<string>", line 6, in attach Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 274, in _ewait Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 341, in attach Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) self.check_error() Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) if not self._ewait(lambda: self._transport_connected and not self._unlinked(), timeout=timeout): Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 274, in _ewait Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 267, in check_error Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) raise e Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) self.check_error() Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1115-43808) ConnectError: [Errno 111] Connection refused Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 267, in check_error Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) raise e Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1113-05888) ConnectError: [Errno 111] Connection refused Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) Unrecoverable error: ConnectError('[Errno 111] Connection refused',) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) Traceback (most recent call last): Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) File "/usr/lib/python2.7/site-packages/celery/worker/__init__.py", line 206, in start Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) self.blueprint.start(self) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) File "/usr/lib/python2.7/site-packages/celery/bootsteps.py", line 119, in start Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) self.on_start() Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) File "/usr/lib/python2.7/site-packages/celery/apps/worker.py", line 157, in on_start Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) sender=self.hostname, instance=self, conf=self.app.conf, Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) File "/usr/lib/python2.7/site-packages/celery/utils/dispatch/signal.py", line 166, in send Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) response = receiver(signal=self, sender=sender, **named) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) File "/usr/lib/python2.7/site-packages/pulp/server/async/app.py", line 52, in initialize_worker Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) tasks._delete_worker(sender, normal_shutdown=True) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 258, in _delete_worker Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) cancel(task_status['task_id']) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 591, in cancel Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) controller.revoke(task_id, terminate=True) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) File "/usr/lib/python2.7/site-packages/celery/app/control.py", line 171, in revoke Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) 'signal': signal}, **kwargs) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) File "/usr/lib/python2.7/site-packages/celery/app/control.py", line 306, in broadcast Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) limit, callback, channel=channel, Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) File "/usr/lib/python2.7/site-packages/kombu/pidbox.py", line 283, in _broadcast Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) chan = channel or self.connection.default_channel Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 756, in default_channel Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) self.connection Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 741, in connection Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) self._connection = self._establish_connection() Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 696, in _establish_connection Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) conn = self.transport.establish_connection() Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) File "/usr/lib/python2.7/site-packages/kombu/transport/qpid.py", line 1600, in establish_connection Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) conn = self.Connection(**opts) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) File "/usr/lib/python2.7/site-packages/kombu/transport/qpid.py", line 1261, in __init__ Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) self._qpid_conn = establish(**self.connection_options) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 112, in establish Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) conn.open(timeout=timeout) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) File "<string>", line 6, in open Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 323, in open Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) self.attach(timeout=timeout) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) File "<string>", line 6, in attach Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 341, in attach Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) if not self._ewait(lambda: self._transport_connected and not self._unlinked(), timeout=timeout): Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 274, in _ewait Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) self.check_error() Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 267, in check_error Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) raise e Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1119-28768) ConnectError: [Errno 111] Connection refused Feb 9 10:33:47 ktordeur-sat61 pulp: celery.beat:ERROR: beat: Connection error: [Errno 111] Connection refused. Trying again in 4.0 seconds... Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) Unrecoverable error: ConnectError('[Errno 111] Connection refused',) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) Traceback (most recent call last): Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) File "/usr/lib/python2.7/site-packages/celery/worker/__init__.py", line 206, in start Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) self.blueprint.start(self) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) File "/usr/lib/python2.7/site-packages/celery/bootsteps.py", line 119, in start Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) self.on_start() Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) File "/usr/lib/python2.7/site-packages/celery/apps/worker.py", line 157, in on_start Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) sender=self.hostname, instance=self, conf=self.app.conf, Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) File "/usr/lib/python2.7/site-packages/celery/utils/dispatch/signal.py", line 166, in send Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) response = receiver(signal=self, sender=sender, **named) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) File "/usr/lib/python2.7/site-packages/pulp/server/async/app.py", line 52, in initialize_worker Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) tasks._delete_worker(sender, normal_shutdown=True) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 258, in _delete_worker Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) cancel(task_status['task_id']) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 591, in cancel Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) controller.revoke(task_id, terminate=True) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) File "/usr/lib/python2.7/site-packages/celery/app/control.py", line 171, in revoke Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) 'signal': signal}, **kwargs) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) File "/usr/lib/python2.7/site-packages/celery/app/control.py", line 306, in broadcast Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) limit, callback, channel=channel, Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) File "/usr/lib/python2.7/site-packages/kombu/pidbox.py", line 283, in _broadcast Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) chan = channel or self.connection.default_channel Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 756, in default_channel Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) self.connection Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 741, in connection Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) self._connection = self._establish_connection() Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 696, in _establish_connection Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) conn = self.transport.establish_connection() Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) File "/usr/lib/python2.7/site-packages/kombu/transport/qpid.py", line 1600, in establish_connection Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) conn = self.Connection(**opts) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) File "/usr/lib/python2.7/site-packages/kombu/transport/qpid.py", line 1261, in __init__ Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) self._qpid_conn = establish(**self.connection_options) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 112, in establish Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) conn.open(timeout=timeout) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) File "<string>", line 6, in open Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 323, in open Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) self.attach(timeout=timeout) Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) File "<string>", line 6, in attach Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 341, in attach Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) if not self._ewait(lambda: self._transport_connected and not self._unlinked(), timeout=timeout): Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 274, in _ewait Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) self.check_error() Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 267, in check_error Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) raise e Feb 9 10:33:47 ktordeur-sat61 pulp: celery.worker:ERROR: (1117-18848) ConnectError: [Errno 111] Connection refused Feb 9 10:33:49 ktordeur-sat61 celery: -------------- resource_manager.redhat.com v3.1.11 (Cipater) Feb 9 10:33:49 ktordeur-sat61 celery: ---- **** ----- Feb 9 10:33:49 ktordeur-sat61 celery: --- * *** * -- Linux-3.10.0-514.6.1.el7.x86_64-x86_64-with-redhat-7.3-Maipo Feb 9 10:33:49 ktordeur-sat61 celery: -- * - **** --- Feb 9 10:33:49 ktordeur-sat61 celery: - ** ---------- [config] Feb 9 10:33:49 ktordeur-sat61 celery: - ** ---------- .> app: tasks:0x36e6490 Feb 9 10:33:49 ktordeur-sat61 celery: - ** ---------- .> transport: qpid://ktordeur-sat61.usersys.redhat.com:5671// Feb 9 10:33:49 ktordeur-sat61 celery: - ** ---------- .> results: disabled Feb 9 10:33:49 ktordeur-sat61 celery: - *** --- * --- .> concurrency: 1 (prefork) Feb 9 10:33:49 ktordeur-sat61 celery: -- ******* ---- Feb 9 10:33:49 ktordeur-sat61 celery: --- ***** ----- [queues] Feb 9 10:33:49 ktordeur-sat61 celery: -------------- .> resource_manager exchange=resource_manager(direct) key=resource_manager Feb 9 10:33:49 ktordeur-sat61 celery: .> resource_manager.redhat.com.dq exchange=C.dq(direct) key=resource_manager.redhat.com Feb 9 10:33:49 ktordeur-sat61 pulp: celery.worker.consumer:INFO: Connected to qpid://ktordeur-sat61.usersys.redhat.com:5671// ~~~ - Restart pulp services: # for i in pulp_resource_manager pulp_workers pulp_celerybeat; do service $i restart; done Workers are back: ~~~ [root@ktordeur-sat61 ~]# mongo MongoDB shell version: 2.6.11 connecting to: test > use pulp_database switched to db pulp_database > db.workers.find() { "_id" : "scheduler.redhat.com", "last_heartbeat" : ISODate("2017-02-09T09:37:46.517Z") } { "_id" : "resource_manager.redhat.com", "last_heartbeat" : ISODate("2017-02-09T09:38:16.363Z") } { "_id" : "reserved_resource_worker-3.redhat.com", "last_heartbeat" : ISODate("2017-02-09T09:38:16.527Z") } { "_id" : "reserved_resource_worker-1.redhat.com", "last_heartbeat" : ISODate("2017-02-09T09:38:16.535Z") } { "_id" : "reserved_resource_worker-2.redhat.com", "last_heartbeat" : ISODate("2017-02-09T09:38:16.666Z") } { "_id" : "reserved_resource_worker-0.redhat.com", "last_heartbeat" : ISODate("2017-02-09T09:38:16.683Z") } > ~~~ On first glance it appears that the worker is starting before qpid is available, and when it tries to do cleanup, that fails. Brian, what do you think? Presumably the worker should wait during startup for a broker to become available before trying to use it, or be able to retry. The traceback from Comment 12 suggests the following is happening: 1. Pulp is healthy 2. Pulp is given work to do (8 syncs) 3. The server is restarted. 4. Upon restart, the workers start before qpidd 5. The workers try to cancel the tasks that were "lost" due to upstream bug 489 [0]. 6. The workers experience an 'Unrecoverable Error' and exit 7. Only the resource manager and celerybeat correctly wait for the broker to become available. What is strange is that Comment 5 is very similar to this process which I could not reproduce with. I will try once again to reproduce this on upstream Pulp. [0]: https://pulp.plan.io/issues/489 Brian, do you have an update on this? Maybe it's worth trying to reproduce on a Satellite box. Thanks! I tried to reproduce it on upstream but I could not. Where can I get a satellite system that I can stop/start in such an intrusive way? I'm going to try to reproduce on a Sat 6.2.8 latest EL7 vagrant box. I was able to reproduce this on a satellite 6.2.8 rhel7 box. To reproduce I: Created a 'zoo' product that sync from the zoo fixture repo https://repos.fedorapeople.org/repos/pulp/pulp/fixtures/rpm/ Ensure that pulp was using 1 worker in /etc/default/pulp_workers. Sync 'zoo' manually once to see it working. Sync it again, and just when the task begins run: sudo pkill -9 -f reserved_resource_worker Then stop qpidd with `sudo systemctl stop qpidd` Then restart the pulp_workers `sudo systemctl restart pulp_workers` # Observe the traceback in question Then start qpidd again `sudo systemctl start qpidd` Trigger a new manual sync and observe that it does not complete. I will try once again to reproduce on upstream. Using Comment 19 I was able to reproduce the issue in upstream also. I will file an upstream bug. The Pulp upstream bug status is at NEW. Updating the external tracker on this bug. The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug. The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug. After some upstream investigation, I think this bug should have a priority of med and a severity of low. A traceback emitted on startup isn't great, but since the broker is not available and we require that connection to cancel the only thing we can do is wait. This bug is easily worked around by calling restarting pulp_workers after the message bus is online. Also note that if one worker experiences this issue, other workers can start successfully and allow Satellite to function. After investigating this in upstream Pulp, we have filed a Celery bug [0]. I want to hear back from them regarding how to properly fix this bug. [0]: https://github.com/celery/celery/issues/4104 The Pulp upstream bug status is at NEW. Updating the external tracker on this bug. The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug. The Pulp upstream bug status is at NEW. Updating the external tracker on this bug. Thank you for your interest in Satellite 6. We have evaluated this request, and we do not expect this to be implemented in the product in the foreseeable future. We are therefore closing this out as WONTFIX. If you have any concerns about this, please feel free to contact Rich Jerrido or Bryan Kearney. Thank you. The Pulp upstream bug status is at CLOSED - WONTFIX. Updating the external tracker on this bug. |
Description of problem: Under an unknown circumstances, restarting katello-service ended up with _all_ pulp celery workers shut down due to unhandled exception. Backtrace (might not be 100% correct since it is selected from interleaving 3 backtraces): May 10 09:57:59 mysatellite celery: Traceback (most recent call last): May 10 09:57:59 mysatellite celery: File "/usr/bin/celery", line 9, in <module> May 10 09:57:59 mysatellite celery: load_entry_point('celery==3.1.11', 'console_scripts', 'celery')() May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/celery/__main__.py", line 30, in main May 10 09:57:59 mysatellite celery: main() May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/celery/bin/celery.py", line 81, in main May 10 09:57:59 mysatellite celery: cmd.execute_from_commandline(argv) May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/celery/bin/celery.py", line 769, in execute_from_commandline May 10 09:57:59 mysatellite celery: super(CeleryCommand, self).execute_from_commandline(argv))) May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/celery/bin/base.py", line 306, in execute_from_commandline May 10 09:57:59 mysatellite celery: return self.handle_argv(self.prog_name, argv[1:]) May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/celery/bin/celery.py", line 761, in handle_argv May 10 09:57:59 mysatellite celery: return self.execute(command, argv) May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/celery/bin/celery.py", line 693, in execute May 10 09:57:59 mysatellite celery: ).run_from_argv(self.prog_name, argv[1:], command=argv[0]) May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/celery/bin/worker.py", line 179, in run_from_argv May 10 09:57:59 mysatellite celery: return self(*args, **options) May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/celery/bin/base.py", line 269, in __call__ May 10 09:57:59 mysatellite celery: ret = self.run(*args, **kwargs) May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/celery/bin/worker.py", line 212, in run May 10 09:57:59 mysatellite celery: state_db=self.node_format(state_db, hostname), **kwargs May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/celery/worker/__init__.py", line 100, in __init__ May 10 09:57:59 mysatellite celery: self.setup_instance(**self.prepare_args(**kwargs)) May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/celery/worker/__init__.py", line 131, in setup_instance May 10 09:57:59 mysatellite celery: signals.worker_init.send(sender=self) May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/celery/utils/dispatch/signal.py", line 166, in send May 10 09:57:59 mysatellite celery: response = receiver(signal=self, sender=sender, **named) May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 521, in cleanup_old_worker May 10 09:57:59 mysatellite celery: _delete_worker(name, normal_shutdown=True) May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 124, in _delete_worker May 10 09:57:59 mysatellite celery: cancel(task['task_id']) May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 428, in cancel May 10 09:57:59 mysatellite celery: controller.revoke(task_id, terminate=True) May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/celery/app/control.py", line 171, in revoke May 10 09:57:59 mysatellite celery: 'signal': signal}, **kwargs) May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/celery/app/control.py", line 306, in broadcast May 10 09:57:59 mysatellite celery: limit, callback, channel=channel, May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/kombu/pidbox.py", line 283, in _broadcast May 10 09:57:59 mysatellite celery: chan = channel or self.connection.default_channel May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 756, in default_channel May 10 09:57:59 mysatellite celery: self.connection May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 741, in connection May 10 09:57:59 mysatellite celery: self._connection = self._establish_connection() May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 696, in _establish_connection May 10 09:57:59 mysatellite celery: conn = self.transport.establish_connection() May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/kombu/transport/qpid.py", line 1701, in establish_connection May 10 09:57:59 mysatellite celery: conn = self.Connection(**opts) May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/kombu/transport/qpid.py", line 1309, in __init__ May 10 09:57:59 mysatellite celery: self._qpid_conn = establish(**self.connection_options) May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 68, in establish May 10 09:57:59 mysatellite celery: conn.open(timeout=timeout) May 10 09:57:59 mysatellite celery: File "<string>", line 6, in open May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 279, in open May 10 09:57:59 mysatellite celery: self.attach(timeout=timeout) May 10 09:57:59 mysatellite celery: File "<string>", line 6, in attach May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 297, in attach May 10 09:57:59 mysatellite celery: if not self._ewait(lambda: self._transport_connected and not self._unlinked(), timeout=timeout): May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 230, in _ewait May 10 09:57:59 mysatellite celery: self.check_error() May 10 09:57:59 mysatellite celery: File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 223, in check_error May 10 09:57:59 mysatellite celery: raise e May 10 09:57:59 mysatellite celery: qpid.messaging.exceptions.ConnectError: [Errno 111] Connection refused Usual connection attempt failures to the qpid broker are caught properly: May 10 09:57:56 mysatellite pulp: celery.beat:ERROR: beat: Connection error: [Errno 111] Connection refused. Trying again in 4.0 seconds... but the above one not. Version-Release number of selected component (if applicable): pulp-server-2.6.0.20-1.el7sat.noarch How reproducible: ??? I guess 100% when reproducer known Steps to Reproduce: 1. Enforce pulp to cancel a task just after startup 2. Shut down qpidd and pulp 3. Start pulp Actual results: Above backtrace causes all pulp_worker-*.service to fail, i.e. no worker after all. Expected results: Workers should catch the exception and retry after a while. (or they should try cancelling a task even after a connection to qpid is made) Additional info: