Bug 1100005
| Summary: | repo sync failure list index out of range (No available Queue exception) | ||
|---|---|---|---|
| Product: | [Retired] Pulp | Reporter: | Preethi Thomas <pthomas> |
| Component: | rpm-support | Assignee: | Brian Bouterse <bmbouter> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Preethi Thomas <pthomas> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | high | ||
| Version: | 2.4 Beta | CC: | bmbouter, cperry, ipanova, omaciel, paul.urwin, pgustafs, skarmark |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | 2.4.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-08-09 06:54:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1099902 | ||
|
Description
Preethi Thomas
2014-05-21 18:59:31 UTC
I'm investigating this issue, and am troubleshooting on the box where the issue was experienced. Here is what I know so far:

1. The workers were removed by scheduler.WorkerTimeoutMonitor because their last_heartbeat times were older than 5 minutes.
2. Worker heartbeats have continued to flow the entire time.
3. All workers were deleted at the same moment, which means it's not the workers that had the issue, it's the event monitor.
4. Restarting only celerybeat causes all workers to be discovered and everything works again. This reinforces the idea that the root cause is in the event monitor.
5. The monitor_events method never exits! If it did, we would see the log statement from the outer run loop in run() that logs at the error level. This means some kind of unrecoverable exception is occurring inside of the Celery capture() method and being silenced, not logged, and "continuing" without actually making progress.

Unfortunately, the useful logging is done at the DEBUG level, so the existing logs don't provide any useful detail. I've enabled DEBUG logging, and I am trying to reproduce the issue with DEBUG on. Perhaps this is a bug in the Celery event capture functionality that we rely on for this feature.

Several individuals have run into this. Their environments are:

- Fedora 20 with Qpid 0.26
- RHEL 6.5 with Qpid 0.18
- RHEL 6.5 with Qpid 0.22

This also shows itself as the exception NoAvailableQueues().

I believe the root cause is a thread deadlock occurring during a call to get() here [0]. The deadlock is in a thread on the critical path of message processing, which explains why events stop being processed even though heartbeat messages are being emitted and properly placed into the 'celeryev.xxxxxxxx' queue. Normally, events in the 'celeryev.xxxxxxxxx' queue are supposed to be drained by threads in the celerybeat process.
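To illustrate the failure mode described above: a bare blocking get() on a thread that sits on the critical path of message processing can park that thread forever, while a get() with a timeout lets it wake up, re-check its state, and re-enter. This is a minimal stdlib sketch, not the actual kombu transport code; the function and parameter names are illustrative.

```python
import queue  # stdlib thread-safe queue, standing in for the transport's internal queue


def drain_events(message_queue, handle_event, should_stop):
    """Process events from a thread-safe queue without blocking forever.

    A bare message_queue.get() can leave this thread parked indefinitely if
    a wakeup is missed; get(timeout=...) guarantees the thread periodically
    returns, checks for shutdown, and re-enters to process more events.
    """
    while not should_stop():
        try:
            # Wake up at least once a second even if no event arrives.
            event = message_queue.get(timeout=1.0)
        except queue.Empty:
            continue  # nothing arrived; loop back and re-check should_stop()
        handle_event(event)
```

Usage: feed events into the queue from a producer thread and pass a stop predicate; the consumer can now always make progress toward shutdown even when the queue goes quiet.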
Interestingly, the FDShimThread threads do drain the events from 'celeryev.xxxxxxx' and continue to read event messages using qpid.messaging even after the Celery event callback handlers have stopped being called (because of the deadlock). This means that worker heartbeat messages are being delivered to qpid.messaging and to a thread inside the consumer, but they are not making their way to the final destination in the qpid transport. It's a strange root cause, but that's what I've observed. It is especially strange because the get() is on a Queue.Queue object, which is supposed to be thread safe. There are some posts on the internet where users describe thread deadlock when using Queue objects, but I'm not sure how this is possible.

I've reworked that blocking call to use a timeout, and I'm investigating whether that fixes the issue by allowing the blocking call to wake up and then re-enter to process more events.

[0]: https://github.com/pulp/kombu/blob/e9155d9f8ba7e4f72844e32d3dc6a005ae341b4c/kombu/transport/qpid.py#L1452

After talking with Ted Ross, I determined a way to simplify the Qpid transport by adjusting how we interact with the qpid.messaging client. Here are the changes that would be necessary:

1. Delete the FDShimThread object altogether.
2. Have basic_consume create a qpid.messaging receiver() and maintain the reference directly to it in MainThread.
3. Have FDShim monitor the set of receivers at once using a blocking call with a timeout to the undocumented next_receiver() method.
4. Fix the associated tests.

This would eliminate the pipe that is heavily shared across all of those threads, which is the root cause of this NoAvailableQueues issue. We would be left with one pipe with exactly one producer thread writing to it and exactly one consumer thread reading from it. It would also reduce the thread count from N+1 additional threads for N consumers to exactly 1 additional thread for N consumers.
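The proposed single-monitor design can be sketched with stdlib primitives. The Receiver class and the polling loop below are stand-ins of my own (the real qpid.messaging client would instead block across all receivers in next_receiver(timeout=...)); the point is the topology: exactly one producer thread multiplexing N receivers into one pipe, instead of one FDShimThread per consumer.

```python
import queue
import threading


class Receiver:
    """Stand-in for a qpid.messaging receiver (illustrative only)."""

    def __init__(self, name, messages):
        self.name = name
        self._messages = list(messages)

    def fetch(self):
        """Return the next buffered message, or None if nothing is pending."""
        return self._messages.pop(0) if self._messages else None


def monitor(receivers, pipe, stop_event):
    """Single monitor thread: watches every receiver and writes to one pipe.

    Mirrors the proposed FDShim change: one producer thread for N consumers.
    The real transport would block in next_receiver(timeout=...) rather than
    poll each receiver, but the single-producer/single-pipe shape is the same.
    """
    while not stop_event.is_set():
        idle = True
        for receiver in receivers:
            message = receiver.fetch()
            if message is not None:
                pipe.put((receiver.name, message))
                idle = False
        if idle:
            stop_event.wait(0.01)  # nothing ready; back off briefly
```

With this shape, the shared pipe has exactly one writer (the monitor thread) and one reader (MainThread draining `pipe`), so the multi-writer contention that produced NoAvailableQueues cannot arise.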
Obviously, this would be a significant improvement in simplicity, and it will likely resolve the reliability issues at the same time.

*** Bug 1100797 has been marked as a duplicate of this bug. ***

*** Bug 1100700 has been marked as a duplicate of this bug. ***

*** Bug 1098734 has been marked as a duplicate of this bug. ***

*** Bug 1101984 has been marked as a duplicate of this bug. ***

PR available here: https://github.com/pulp/pulp/pull/988

Merged.

This was fixed in pulp-2.4.0-0.19.beta.

Verified. I have had this new build for over a week and I haven't seen this error. Moving this to verified.

    [root@pulp-24-server ~]# rpm -qa pulp-server
    pulp-server-2.4.0-0.19.beta.el6.noarch
    [root@pulp-24-server ~]#

This has been fixed in Pulp 2.4.0-1.