1118404 – Scheduler not able to recover from very long mongo and/or qpidd outage

Summary: Scheduler not able to recover from very long mongo and/or qpidd outage

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	Pulp
Classification:	Retired
Component:	async/tasks
Sub Component:
Version:	2.4 Beta
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	low
Target Milestone:	---
Target Release:	---
Assignee:	pulp-bugs
QA Contact:	pulp-qe-list
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2014-07-10 15:47 UTC by Brian Bouterse
Modified:	2015-02-28 22:12 UTC
CC List:	1 user
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2015-02-28 22:12:43 UTC
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Pulp Redmine	470	0	None	None	None	Never

Description Brian Bouterse 2014-07-10 15:47:21 UTC

Celerybeat uses pulp.server.async.scheduler, which provides correct reconnect support if either Mongo or Qpid goes down and later comes back. For normal outages lasting minutes or hours, the reconnect support works fine. For outages lasting on the order of days, the user will eventually receive the following message:

pulp.server.async.scheduler:ERROR: [Errno 24] Too many open files

Once the user sees that message, reconnect support no longer works and the celerybeat service must be restarted. Something in the reconnect support appears to consume a file descriptor on each reconnect attempt without releasing it.

I'm not sure if it is Qpid or Mongo that causes this, so I'm identifying them both as possible causes.
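
For illustration only, here is a minimal sketch of how unreleased descriptors eventually produce Errno 24 (EMFILE). It is not Pulp, Qpid, or Mongo code. The function name exhaust_descriptors and the lowered soft limit of 64 are assumptions made for the demonstration, and the plain sockets merely stand in for whatever resource a reconnect attempt might hold on to.

```python
import errno
import resource
import socket


def exhaust_descriptors(soft_limit=64):
    """Open sockets without closing them until the OS refuses; return the errno.

    Hypothetical helper: the sockets stand in for a reconnect attempt that
    never releases its descriptor.
    """
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Lower the soft limit temporarily so the demonstration finishes quickly.
    if old_soft == resource.RLIM_INFINITY:
        new_soft = soft_limit
    else:
        new_soft = min(soft_limit, old_soft)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    leaked = []
    try:
        while True:
            # Each "attempt" allocates a descriptor and keeps it open.
            leaked.append(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    except OSError as exc:
        return exc.errno
    finally:
        # Release everything and restore the original limit.
        for sock in leaked:
            sock.close()
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))


if __name__ == "__main__":
    code = exhaust_descriptors()
    print(code, errno.errorcode.get(code))
```

With the default soft limit on a typical system, a process leaking one descriptor per reconnect attempt would take far longer to hit the limit, which would be consistent with the error only showing up after very long outages.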

To reproduce:
1. Stop all Pulp services
2. Start Mongo
3. Start Qpid
4. Start celerybeat
5. Stop Mongo
6. Stop Qpid
7. Observe the reconnects trying over and over
8. Wait a long time (like overnight)
9. Observe the error message above in the logs
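
While waiting in steps 7 and 8, the open descriptor count of the celerybeat process could be sampled to check the leak hypothesis without waiting for the error itself. The following is a rough sketch that assumes a Linux host with /proc available. The helper names count_open_fds and sample_fd_counts are hypothetical, and the PID of the celerybeat process has to be looked up separately.

```python
import os
import time


def count_open_fds(pid="self"):
    """Return the number of open file descriptors of a process (Linux /proc only)."""
    return len(os.listdir("/proc/{}/fd".format(pid)))


def sample_fd_counts(pid="self", samples=3, interval=1.0):
    """Collect `samples` descriptor counts, sleeping `interval` seconds between them."""
    counts = []
    for i in range(samples):
        counts.append(count_open_fds(pid))
        if i < samples - 1:
            time.sleep(interval)
    return counts


if __name__ == "__main__":
    # Samples this script's own process; pass the celerybeat PID instead.
    print(sample_fd_counts("self", samples=3, interval=1.0))
```

A count that grows steadily while the reconnects are being retried would support the descriptor leak explanation; a flat count would point elsewhere. Reading another user's /proc entry usually requires root or the same user as the target process.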

Comment 1 Brian Bouterse 2014-08-12 21:32:18 UTC

The failing component is likely the long-term reconnect to qpidd, because a separate pulp_celerybeat error occurs when mongod is not available [0]. That other error appears on a much shorter timeline (minutes) and has a different traceback. By elimination, I believe this BZ is likely caused by the reconnect support for qpidd.

[0]: https://bugzilla.redhat.com/show_bug.cgi?id=1129488

Comment 2 Brian Bouterse 2015-02-28 22:12:43 UTC

Moved to https://pulp.plan.io/issues/470
