Bug 1100005
| Summary: | repo sync failure list index out of range (No available Queue exception) | ||
|---|---|---|---|
| Product: | [Retired] Pulp | Reporter: | Preethi Thomas <pthomas> |
| Component: | rpm-support | Assignee: | Brian Bouterse <bmbouter> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Preethi Thomas <pthomas> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | high | ||
| Version: | 2.4 Beta | CC: | bmbouter, cperry, ipanova, omaciel, paul.urwin, pgustafs, skarmark |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | 2.4.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-08-09 06:54:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1099902 | ||
|
Description
Preethi Thomas
2014-05-21 18:59:31 UTC
I'm investigating this issue, and am troubleshooting on the box where the issue was experienced. Here is what I know so far:

1. The workers were removed by scheduler.WorkerTimeoutMonitor because their last_heartbeat times were older than 5 minutes.
2. Worker heartbeats have continued to flow the entire time.
3. All workers were deleted at the same moment, which means it's not the workers that had the issue, it's the event monitor.
4. Restarting only celerybeat causes all workers to be discovered and everything works again. This reinforces the idea that the root cause is in the event monitor.
5. The monitor_events method never exits! If it did, we would see the log statement from the outer run loop in run() that logs at the error level. This means some kind of unrecoverable exception is occurring inside of the Celery capture() method and being silenced, not logged, and "continuing" without actually making progress.

Unfortunately, the useful logging is done at the DEBUG level, so the existing logs don't provide any useful detail. I've enabled DEBUG logging, and I am trying to reproduce the issue with DEBUG on. Perhaps this is a bug in the Celery event capture functionality that we rely on for this feature.

Several individuals have run into this. Their environments are:

- Fedora 20 with Qpid 0.26
- RHEL 6.5 with Qpid 0.18
- RHEL 6.5 with Qpid 0.22

This also shows itself as the exception NoAvailableQueues().

I believe the root cause is a thread deadlock occurring during a call to get() here [0]. The deadlock is in a thread on the critical path of message processing, which explains why events stop being processed even though heartbeat messages are being emitted and properly placed into the 'celeryev.xxxxxxxx' queue. Normally, events in the 'celeryev.xxxxxxxxx' queue are supposed to be drained by threads in the celerybeat process.
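To illustrate the failure mode described above: a bare blocking get() on a thread that sits on the critical path of message processing can park that thread forever, while a get() with a timeout lets it wake up, re-check its state, and re-enter. This is a minimal stdlib sketch, not the actual kombu transport code; the function and parameter names are illustrative.

```python
import queue  # stdlib thread-safe queue, standing in for the transport's internal queue


def drain_events(message_queue, handle_event, should_stop):
    """Process events from a thread-safe queue without blocking forever.

    A bare message_queue.get() can leave this thread parked indefinitely if
    a wakeup is missed; get(timeout=...) guarantees the thread periodically
    returns, checks for shutdown, and re-enters to process more events.
    """
    while not should_stop():
        try:
            # Wake up at least once a second even if no event arrives.
            event = message_queue.get(timeout=1.0)
        except queue.Empty:
            continue  # nothing arrived; loop back and re-check should_stop()
        handle_event(event)
```

Usage: feed events into the queue from a producer thread and pass a stop predicate; the consumer can now always make progress toward shutdown even when the queue goes quiet.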
Interestingly, the FDShimThread threads do drain the events from 'celeryev.xxxxxxx' and continue to read event messages using qpid.messaging even after the Celery event callback handlers have stopped being called (because of the deadlock). This means that worker heartbeat messages are being delivered to qpid.messaging and to a thread inside the consumer, but they are not making their way to the final destination in the qpid transport. It's a strange root cause, but that's what I've observed. It is especially strange because the get() is on a Queue.Queue object, which is supposed to be thread safe. There are some posts on the internet where users describe thread deadlock when using Queue objects, but I'm not sure how this is possible.

I've reworked that blocking call to use a timeout, and I'm investigating whether that fixes the issue by allowing the blocking call to wake up and then re-enter to process more events.

[0]: https://github.com/pulp/kombu/blob/e9155d9f8ba7e4f72844e32d3dc6a005ae341b4c/kombu/transport/qpid.py#L1452

After talking with Ted Ross, I determined a way to simplify the Qpid transport by adjusting how we interact with the qpid.messaging client. Here are the changes that would be necessary:

1. Delete the FDShimThread object altogether.
2. Have basic_consume create a qpid.messaging receiver() and maintain the reference directly to it in MainThread.
3. Have FDShim monitor the set of receivers at once using a blocking call with a timeout to the undocumented next_receiver() method.
4. Fix the associated tests.

This would eliminate the pipe that is heavily shared across all of those threads, which is the root cause of this NoAvailableQueues issue. We would be left with one pipe with exactly one producer thread writing to it and exactly one consumer thread reading from it. It would also reduce the thread count from N+1 additional threads for N consumers to exactly 1 additional thread for N consumers.
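The proposed single-monitor design can be sketched with stdlib primitives. The Receiver class and the polling loop below are stand-ins of my own (the real qpid.messaging client would instead block across all receivers in next_receiver(timeout=...)); the point is the topology: exactly one producer thread multiplexing N receivers into one pipe, instead of one FDShimThread per consumer.

```python
import queue
import threading


class Receiver:
    """Stand-in for a qpid.messaging receiver (illustrative only)."""

    def __init__(self, name, messages):
        self.name = name
        self._messages = list(messages)

    def fetch(self):
        """Return the next buffered message, or None if nothing is pending."""
        return self._messages.pop(0) if self._messages else None


def monitor(receivers, pipe, stop_event):
    """Single monitor thread: watches every receiver and writes to one pipe.

    Mirrors the proposed FDShim change: one producer thread for N consumers.
    The real transport would block in next_receiver(timeout=...) rather than
    poll each receiver, but the single-producer/single-pipe shape is the same.
    """
    while not stop_event.is_set():
        idle = True
        for receiver in receivers:
            message = receiver.fetch()
            if message is not None:
                pipe.put((receiver.name, message))
                idle = False
        if idle:
            stop_event.wait(0.01)  # nothing ready; back off briefly
```

With this shape, the shared pipe has exactly one writer (the monitor thread) and one reader (MainThread draining `pipe`), so the multi-writer contention that produced NoAvailableQueues cannot arise.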
Obviously, this would be a significant improvement in simplicity, and it will likely resolve the reliability issues at the same time.

*** Bug 1100797 has been marked as a duplicate of this bug. ***

*** Bug 1100700 has been marked as a duplicate of this bug. ***

*** Bug 1098734 has been marked as a duplicate of this bug. ***

*** Bug 1101984 has been marked as a duplicate of this bug. ***

PR available here: https://github.com/pulp/pulp/pull/988

Merged.

This was fixed in pulp-2.4.0-0.19.beta.

Verified. I have had this new build for over a week and I haven't seen this error. Moving this to verified.

    [root@pulp-24-server ~]# rpm -qa pulp-server
    pulp-server-2.4.0-0.19.beta.el6.noarch
    [root@pulp-24-server ~]#

This has been fixed in Pulp 2.4.0-1.