Description of problem:
rhel6.5 repo sync failed with "list index out of range". I haven't tested it on any other repos yet.

Version-Release number of selected component (if applicable):
[root@ibm-x3550m3-08 ~]# rpm -qa pulp-server
pulp-server-2.4.0-0.17.beta.el6.noarch

How reproducible:

Steps to Reproduce:
1. Create & sync a rhel6.5 repo from the CDN

Actual results:
[root@ibm-x3550m3-08 ~]# pulp-admin rpm repo create --repo-id rhel6 --feed https://cdn.redhat.com/content/dist/rhel/rhui/server/6/6.5/x86_64/os/ --feed-ca-cert CDN/cdn.redhat.com-chain.crt --feed-cert CDN/1391032266_4771.crt --feed-key CDN/1391032266_4771.key
Successfully created repository [rhel6]

[root@ibm-x3550m3-08 ~]# pulp-admin rpm repo sync run --repo-id rhel6
+----------------------------------------------------------------------+
                   Synchronizing Repository [rhel6]
+----------------------------------------------------------------------+
This command may be exited via ctrl+c without affecting the request.

Downloading metadata...
[-]
... completed

Downloading repository content...
[==================================================] 100%
RPMs:       12535/12535 items
Delta RPMs: 0/0 items
... completed

Downloading distribution files...
[==================================================] 100%
Distributions: 0/0 items
... completed

Importing errata...
[|]
... completed

Importing package groups/categories...
[|]
... completed

Task Failed

list index out of range

Expected results:

Additional info:
from the log:

May 21 19:51:56 ibm-x3550m3-08 pulp: celery.worker.strategy:INFO: Received task: pulp.server.async.tasks._reserve_resource[34af15d8-5a25-409b-8441-b76cbb50a8bf]
May 21 19:51:56 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: Task pulp.server.async.tasks._reserve_resource[34af15d8-5a25-409b-8441-b76cbb50a8bf] raised unexpected: IndexError('list index out of range',)
May 21 19:51:56 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: Traceback (most recent call last):
May 21 19:51:56 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: File "/usr/lib/python2.6/site-packages/celery/app/trace.py", line 240, in trace_task
May 21 19:51:56 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: R = retval = fun(*args, **kwargs)
May 21 19:51:56 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: File "/usr/lib/python2.6/site-packages/celery/app/trace.py", line 437, in __protected_call__
May 21 19:51:56 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: return self.run(*args, **kwargs)
May 21 19:51:56 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: File "/usr/lib/python2.6/site-packages/pulp/server/async/tasks.py", line 148, in _reserve_resource
May 21 19:51:56 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: aq_list[0].increment_num_reservations()
May 21 19:51:56 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: IndexError: list index out of range
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: Task pulp.server.tasks.repository.sync_with_auto_publish[26d59754-9228-4bb9-a82a-364e841544c9] raised unexpected: IndexError('list index out of range',)
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: Traceback (most recent call last):
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: File "/usr/lib/python2.6/site-packages/celery/app/trace.py", line 240, in trace_task
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: R = retval = fun(*args, **kwargs)
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: File "/usr/lib/python2.6/site-packages/pulp/server/async/tasks.py", line 322, in __call__
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: return super(Task, self).__call__(*args, **kwargs)
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: File "/usr/lib/python2.6/site-packages/celery/app/trace.py", line 437, in __protected_call__
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: return self.run(*args, **kwargs)
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: File "/usr/lib/python2.6/site-packages/pulp/server/tasks/repository.py", line 220, in sync_with_auto_publish
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: spawned_tasks.append(publish(repo_id, distributor_id))
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: File "/usr/lib/python2.6/site-packages/celery/local.py", line 167, in <lambda>
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: __call__ = lambda x, *a, **kw: x._get_current_object()(*a, **kw)
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: File "/usr/lib/python2.6/site-packages/pulp/server/async/tasks.py", line 322, in __call__
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: return super(Task, self).__call__(*args, **kwargs)
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: File "/usr/lib/python2.6/site-packages/celery/app/trace.py", line 438, in __protected_call__
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: return orig(self, *args, **kwargs)
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: File "/usr/lib/python2.6/site-packages/celery/app/task.py", line 420, in __call__
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: return self.run(*args, **kwargs)
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: File "/usr/lib/python2.6/site-packages/pulp/server/tasks/repository.py", line 194, in publish
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: RESOURCE_REPOSITORY_TYPE, repo_id, tags=tags, kwargs=kwargs)
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: File "/usr/lib/python2.6/site-packages/pulp/server/async/tasks.py", line 250, in apply_async_with_reservation
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: queue = _reserve_resource.apply_async((resource_id,), queue=RESOURCE_MANAGER_QUEUE).get()
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: File "/usr/lib/python2.6/site-packages/celery/result.py", line 169, in get
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: no_ack=no_ack,
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: File "/usr/lib/python2.6/site-packages/celery/backends/base.py", line 215, in wait_for
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: raise result
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: IndexError: list index out of range
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.strategy:INFO: Received task: pulp.server.async.tasks._release_resource[fedf5c6e-33e2-41d3-9e97-3acb195ed8e1]
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:INFO: Task pulp.server.async.tasks._queue_release_resource[96866280-05c7-433f-a3f6-e57e49a196a2] succeeded in 0.030695073001s: None
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: Task pulp.server.async.tasks._release_resource[fedf5c6e-33e2-41d3-9e97-3acb195ed8e1] raised unexpected: IndexError('list index out of range',)
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: Traceback (most recent call last):
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: File "/usr/lib/python2.6/site-packages/celery/app/trace.py", line 240, in trace_task
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: R = retval = fun(*args, **kwargs)
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: File "/usr/lib/python2.6/site-packages/celery/app/trace.py", line 437, in __protected_call__
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: return self.run(*args, **kwargs)
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: File "/usr/lib/python2.6/site-packages/pulp/server/async/tasks.py", line 106, in _release_resource
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: available_queue = aq_list[0]
May 21 19:51:57 ibm-x3550m3-08 pulp: celery.worker.job:ERROR: IndexError: list index out of range
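The crash in both tracebacks comes from indexing aq_list[0] on an empty list (all worker queues had been removed). As a purely illustrative sketch, not Pulp's actual code (the function and exception names here just mirror ones mentioned in this bug), a defensive guard would turn the bare IndexError into a descriptive error:

```python
class NoAvailableQueues(Exception):
    """Raised when no worker queues are available for a reservation."""


def reserve_resource(aq_list):
    # The failing pattern: aq_list[0] raises a bare IndexError when
    # the worker list is empty (all workers timed out by the monitor).
    if not aq_list:
        # Fail with a descriptive error instead of IndexError.
        raise NoAvailableQueues("no worker queues are available")
    return aq_list[0]
```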
I'm investigating this issue and am troubleshooting on the box where it was experienced. Here is what I know so far:

1. The workers were removed by scheduler.WorkerTimeoutMonitor because their last_heartbeat times were older than 5 minutes.
2. Worker heartbeats have continued to flow the entire time.
3. All workers were deleted at the same moment, which means the issue is not with the workers themselves; it's with the event monitor.
4. Restarting only celerybeat causes all workers to be discovered and everything works again. This reinforces the idea that the root cause is in the event monitor.
5. The monitor_events method never exits! If it did, we would see the log statement from the outer runloop in run() that logs at the error level. This means some kind of unrecoverable exception is occurring inside the Celery capture() method and being silenced rather than logged, leaving the loop hung without actually making progress.

Unfortunately, the useful logging is done at the DEBUG level, so the existing logs don't provide any useful detail. I've enabled DEBUG logging and am trying to reproduce the issue with DEBUG on. Perhaps this is a bug in the Celery event capture functionality that we rely on here.
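The failure mode described above, capture() dying silently inside a runloop that never reports it, can be sketched as a more defensive monitor. This is a hypothetical illustration, not Pulp's actual monitor code; only the names monitor_events/run/capture come from this report:

```python
import logging
import time

log = logging.getLogger(__name__)


class EventMonitor(object):
    """Hypothetical sketch of a defensive event-monitor runloop.

    Wraps each capture() attempt, logs failures at ERROR (not DEBUG)
    so they are visible in default logs, and restarts capture rather
    than silently hanging.
    """

    def __init__(self, capture, retry_delay=1.0, max_retries=None):
        self._capture = capture          # e.g. a Celery event-capture callable
        self._retry_delay = retry_delay
        self._max_retries = max_retries  # None means retry forever
        self.errors = []                 # record of observed failures

    def run(self):
        retries = 0
        while self._max_retries is None or retries < self._max_retries:
            try:
                self._capture()          # blocks while events flow
            except Exception as e:
                # Log loudly instead of silencing the failure.
                log.error("event capture died: %r; restarting", e)
                self.errors.append(e)
                retries += 1
                time.sleep(self._retry_delay)
```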
Several individuals have run into this. Their environments are:

Fedora 20 with Qpid 0.26
RHEL 6.5 with Qpid 0.18
RHEL 6.5 with Qpid 0.22

This also shows itself as the exception NoAvailableQueues().
I believe the root cause is a thread deadlock occurring during a call to get() here [0]. The deadlock is in a thread on the critical path of message processing, which explains why events stop being processed even though heartbeat messages are being emitted and properly placed into the 'celeryev.xxxxxxxx' queue. Normally, events in the 'celeryev.xxxxxxxx' queue are supposed to be drained by threads in the celerybeat process. Interestingly, the FDShimThread threads do drain the events from 'celeryev.xxxxxxxx' and continue to read event messages using qpid.messaging even after the Celery event callback handlers have stopped being called (because of the deadlock). This means that worker heartbeat messages are being delivered to qpid.messaging and to a thread inside the consumer, but they are not making their way to their final destination in the qpid transport. It's a strange failure mode, but that's what I've observed. It is especially strange because the get() is on a Queue.Queue object, which is supposed to be thread safe. There are some posts on the internet where users describe thread deadlock when using Queue objects, but I'm not sure how this is possible. I've reworked that blocking call to use a timeout, and I'm investigating whether that fixes the issue by allowing the blocking call to wake up and then re-enter to process more events.

[0]: https://github.com/pulp/kombu/blob/e9155d9f8ba7e4f72844e32d3dc6a005ae341b4c/kombu/transport/qpid.py#L1452
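The timeout rework can be sketched as follows. This is a minimal illustration of the technique, not the actual kombu transport code; the function names, handler, and timeout value are made up:

```python
import queue  # Queue.Queue in the Python 2.6 code under discussion

POLL_TIMEOUT = 10  # seconds; an assumption, not the value in the real fix


def drain_events(message_queue, on_message, should_stop):
    """Drain messages with a bounded get() so the thread can always wake up.

    A bare message_queue.get() can block forever if the producer side
    wedges; get(timeout=...) guarantees the loop periodically returns,
    re-checks its stop condition, and re-enters to process more events.
    """
    while not should_stop():
        try:
            message = message_queue.get(timeout=POLL_TIMEOUT)
        except queue.Empty:
            continue  # nothing arrived in this window; loop and retry
        on_message(message)
```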
After talking with Ted Ross, I determined a way to simplify the Qpid transport by adjusting how we interact with the qpid.messaging client. Here are the changes necessary:

1. Delete the FDShimThread object altogether.
2. Have basic_consume create a qpid.messaging receiver() and maintain the reference to it directly in MainThread.
3. Have FDShim monitor the whole set of receivers at once using a blocking call with a timeout to the undocumented next_receiver() method.
4. Fix the associated tests.

This would eliminate the pipe that is heavily shared across all of those threads, which is the root cause of this NoAvailableQueues issue. We would be left with one pipe with exactly one producer thread writing to it and exactly one consumer thread reading from it. It would also reduce the thread count from N+1 additional threads for N consumers to exactly one additional thread for N consumers. Obviously, this would be a significant improvement in simplicity, and it will likely resolve the reliability issues at the same time.
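The proposed threading model can be sketched with stdlib stand-ins: queue.Queue objects in place of qpid receivers, and a polling scan as a crude substitute for the blocking next_receiver() call the real design would use. All names below are hypothetical; the point is the shape: one monitor thread multiplexing N receivers into one pipe, with exactly one producer and one consumer on that pipe.

```python
import queue
import threading


def monitor(receivers, pipe, stop_event, poll_timeout=0.05):
    """Single monitor thread: multiplex N receivers into one pipe.

    Stands in for FDShim watching all receivers at once; with the
    real qpid.messaging client this loop would instead block in
    session.next_receiver(timeout=...) across every receiver.
    """
    while not stop_event.is_set():
        delivered = False
        for rx in receivers:            # scan each receiver in turn
            try:
                msg = rx.get_nowait()
            except queue.Empty:
                continue
            pipe.put(msg)               # the single, unshared pipe
            delivered = True
        if not delivered:
            stop_event.wait(poll_timeout)  # back off briefly when idle
```

Compared to the FDShimThread design, only one extra thread exists regardless of how many consumers are registered, and no pipe is shared between more than one writer.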
*** Bug 1100797 has been marked as a duplicate of this bug. ***
*** Bug 1100700 has been marked as a duplicate of this bug. ***
*** Bug 1098734 has been marked as a duplicate of this bug. ***
*** Bug 1101984 has been marked as a duplicate of this bug. ***
PR available here: https://github.com/pulp/pulp/pull/988
Merged
This was fixed in pulp-2.4.0-0.19.beta.
Verified. I have had this new build for over a week and I haven't seen this error. Moving this to verified.

[root@pulp-24-server ~]# rpm -qa pulp-server
pulp-server-2.4.0-0.19.beta.el6.noarch
[root@pulp-24-server ~]#
This has been fixed in Pulp 2.4.0-1.