Bug 1129488

Summary: pulp_celerybeat Mongodb Reconnect Support Fails After Three Tries
Product: [Retired] Pulp Reporter: Brian Bouterse <bmbouter>
Component: async/tasksAssignee: Brian Bouterse <bmbouter>
Status: CLOSED CURRENTRELEASE QA Contact: Preethi Thomas <pthomas>
Severity: unspecified Docs Contact:
Priority: medium    
Version: MasterCC: austin, bmbouter, pthomas, rbarlow, skarmark
Target Milestone: ---Keywords: Reopened, Triaged
Target Release: 2.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1129768 (view as bug list) Environment:
Last Closed: 2014-11-24 21:33:22 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1129768, 1131719    

Description Brian Bouterse 2014-08-12 21:30:08 UTC
Description of problem: If pulp_celerybeat is running and correctly connected to mongod, and then mongod becomes unreachable for some reason, pulp_celerybeat will try to reconnect for 3 times, but will fail on the 4th time. Once the failure occurs it will die, and will never recover.

NOTE: Reconnect support does work properly on pulp_celerybeat if mongod is not available while starting. This BZ only affects reconnect support after an initial successful connection to mongod is made.


Version-Release number of selected component (if applicable): Pulp 2.4.0-1


How reproducible:
Always


Steps to Reproduce:
1. Stop pulp_celerybeat, pulp_resource_manager, pulp_workers, and httpd
2. Start mongod and qpidd
3. Start pulp_celerybeat and verify that the logs show that it starts correctly.
4. Stop mongod
5. Wait 2 minutes (ish)
6. Observe the following traceback:

Aug 12 16:42:41 dhcp129-138.rdu.redhat.com pulp[25595]: pulp.server.async.scheduler:ERROR: [Errno 111] Connection refused
Aug 12 16:42:51 dhcp129-138.rdu.redhat.com pulp[25595]: pulp.server.async.scheduler:ERROR: [Errno 111] Connection refused
Aug 12 16:42:56 dhcp129-138.rdu.redhat.com pulp[25595]: pulp.server.db.connection:WARNING: create_index operation failed on pulp_database.workers: tries remaini
ng: 2
Aug 12 16:42:57 dhcp129-138.rdu.redhat.com pulp[25595]: pulp.server.db.connection:WARNING: create_index operation failed on pulp_database.workers: tries remaini
ng: 1
Aug 12 16:42:57 dhcp129-138.rdu.redhat.com pulp[25595]: pulp.server.db.connection:WARNING: create_index operation failed on pulp_database.workers: tries remaini
ng: 0
Aug 12 16:42:57 dhcp129-138.rdu.redhat.com pulp[25595]: pulp.server.async.scheduler:ERROR: create_index operation failed on pulp_database.workers: database conn
ection still down after 3 tries
Aug 12 16:43:01 dhcp129-138.rdu.redhat.com pulp[25595]: pulp.server.async.scheduler:ERROR: [Errno 111] Connection refused
Aug 12 16:43:11 dhcp129-138.rdu.redhat.com pulp[25595]: pulp.server.async.scheduler:ERROR: [Errno 111] Connection refused
Aug 12 16:43:21 dhcp129-138.rdu.redhat.com pulp[25595]: pulp.server.async.scheduler:ERROR: [Errno 111] Connection refused
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL: beat raised exception <class 'pymongo.errors.AutoReconnect'>: AutoReconnect('could
 not connect to localhost:27017: [Errno 111] Connection refused',)
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL: Traceback (most recent call last):
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:   File "/home/bmbouter/celery/celery/apps/beat.py", line 112, in start_scheduler
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:     beat.start()
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:   File "/home/bmbouter/celery/celery/beat.py", line 462, in start
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:     interval = self.scheduler.tick()
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:   File "/home/bmbouter/Documents/pulp/server/pulp/server/async/scheduler.py", line
 317, in tick
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:     ret = super(Scheduler, self).tick()
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:   File "/home/bmbouter/celery/celery/beat.py", line 219, in tick
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:     for entry in values(self.schedule):
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:   File "/home/bmbouter/Documents/pulp/server/pulp/server/async/scheduler.py", line 387, in schedule
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:     if self.schedule_changed:
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:   File "/home/bmbouter/Documents/pulp/server/pulp/server/async/scheduler.py", line 366, in schedule_changed
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:     if utils.get_enabled().count() != self._loaded_from_db_count:
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:   File "/usr/lib64/python2.7/site-packages/pymongo/cursor.py", line 566, in count
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:     **command)
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:   File "/usr/lib64/python2.7/site-packages/pymongo/database.py", line 388, in command
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:     result = self["$cmd"].find_one(command, **extra_opts)
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:   File "/usr/lib64/python2.7/site-packages/pymongo/collection.py", line 596, in find_one
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:     for result in self.find(spec_or_id, *args, **kwargs).limit(-1):
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:   File "/usr/lib64/python2.7/site-packages/pymongo/cursor.py", line 814, in next
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:     if len(self.__data) or self._refresh():
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:   File "/usr/lib64/python2.7/site-packages/pymongo/cursor.py", line 763, in _refresh
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:     self.__uuid_subtype))
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:   File "/usr/lib64/python2.7/site-packages/pymongo/cursor.py", line 700, in __send_message
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:     **kwargs)
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:   File "/home/bmbouter/Documents/pulp/server/pulp/server/db/connection.py", line 171, in _with_end_request
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:     return method(*args, **kwargs)
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:   File "/usr/lib64/python2.7/site-packages/pymongo/mongo_client.py", line 985, in _send_message_with_response
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:     sock_info = self.__socket()
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:   File "/usr/lib64/python2.7/site-packages/pymongo/mongo_client.py", line 720, in __socket
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:     host, port = self.__find_node()
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:   File "/usr/lib64/python2.7/site-packages/pymongo/mongo_client.py", line 713, in __find_node
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL:     raise AutoReconnect(', '.join(errors))
Aug 12 16:43:27 dhcp129-138.rdu.redhat.com pulp[25595]: celery.beat:CRITICAL: AutoReconnect: could not connect to localhost:27017: [Errno 111] Connection refused


Expected results:
pulp_celerybeat to wait indefinitely for mongod to become available again.

Comment 1 Brian Bouterse 2014-08-14 15:32:42 UTC
PR available at:  https://github.com/pulp/pulp/pull/1109

Comment 2 Brian Bouterse 2014-08-14 15:34:10 UTC
Steps to reproduce:

1. Start all the pulp services
2. Perform a sync to ensure everything is running normally
3. Stop mongod
4. Wait a two minutes (or as long as you like)
5. Start mongod
6. Perform a sync to ensure everything is running normally

Comment 3 Brian Bouterse 2014-08-14 20:21:39 UTC
Actually the PR (against the correct branch this time) is here: https://github.com/pulp/pulp/pull/1111

Comment 4 Brian Bouterse 2014-08-20 14:26:02 UTC
https://github.com/pulp/pulp/pull/1118

PR #1111 was closed and reopened against the new 2.4.1-dev branch.

Comment 5 Brian Bouterse 2014-08-20 14:26:44 UTC
PR 1118 was merged into 2.4.1-dev branch

Comment 6 Randy Barlow 2014-08-20 18:58:46 UTC
This PR is merged to the 2.4.1-dev branch, but the commits were not included in the 2.4-dev branch or the master branch. Please merge these commits to 2.4-dev and master, and then move this bug back to MODIFIED.

Comment 7 Brian Bouterse 2014-08-20 21:15:51 UTC
I've merged this PR into 2.4-dev and master. I also merged it into pulp-2.4 by mistake, but that will not cause any issues.

Comment 8 Randy Barlow 2014-08-20 22:16:44 UTC
This was fixed in 2.4.1-0.2.alpha, available in Pulp's testing repository.

Comment 9 Preethi Thomas 2014-08-21 02:50:35 UTC
verified on rhel6.5 & 7
[root@qe-blade-08 module]# rpm -qa pulp-server
pulp-server-2.4.1-0.2.alpha.el6.noarch
[root@qe-blade-08 module]# 


[root@qe-blade-13 ~]# rpm -qa pulp-server
pulp-server-2.4.1-0.2.alpha.el7.noarch
[root@qe-blade-13 ~]# 

[root@qe-blade-13 ~]#  pulp-admin rpm repo create --repo-id zoo --feed https://repos.fedorapeople.org/repos/pulp/pulp/demo_repos/zoo/
Successfully created repository [zoo]

[root@qe-blade-13 ~]# pulp-admin rpm repo sync run --repo-id zoo
+----------------------------------------------------------------------+
                     Synchronizing Repository [zoo]
+----------------------------------------------------------------------+

This command may be exited via ctrl+c without affecting the request.


Downloading metadata...
[|]
... completed

Downloading repository content...
[==================================================] 100%
RPMs:       32/32 items
Delta RPMs: 0/0 items

... completed

Downloading distribution files...
[==================================================] 100%
Distributions: 0/0 items
... completed

Importing errata...
[-]
... completed

Importing package groups/categories...
[-]
... completed


Task Succeeded



Initializing repo metadata
[-]
... completed

Publishing Distribution files
[-]
... completed

Publishing RPMs
[==================================================] 100%
32 of 32 items
... completed

Publishing Delta RPMs
... skipped

Publishing Errata
[==================================================] 100%
4 of 4 items
... completed

Publishing Comps file
[==================================================] 100%
3 of 3 items
... completed

Publishing Metadata.
[-]
... completed

Closing repo metadata
[-]
... completed

Generating sqlite files
... skipped

Publishing files to web
[-]
... completed

Writing Listings File
[-]
... completed


Task Succeeded


[root@qe-blade-13 ~]# 
[root@qe-blade-13 ~]# 
[root@qe-blade-13 ~]# 
[root@qe-blade-13 ~]# systemctl stop mongo
Failed to issue method call: Unit mongo.service not loaded.
[root@qe-blade-13 ~]# systemctl stop mongod
[root@qe-blade-13 ~]# systemctl start mongod
[root@qe-blade-13 ~]# 
[root@qe-blade-13 ~]# 
[root@qe-blade-13 ~]# pulp-admin rpm repo sync run --repo-id zoo
+----------------------------------------------------------------------+
                     Synchronizing Repository [zoo]
+----------------------------------------------------------------------+

This command may be exited via ctrl+c without affecting the request.


Downloading metadata...
[|]
... completed

Downloading repository content...
[==================================================] 100%
RPMs:       0/0 items
Delta RPMs: 0/0 items

... completed

Downloading distribution files...
[==================================================] 100%
Distributions: 0/0 items
... completed

Importing errata...
[-]
... completed

Importing package groups/categories...
[-]
... completed


Task Succeeded



Initializing repo metadata
[-]
... completed

Publishing Distribution files
[-]
... completed

Publishing RPMs
[==================================================] 100%
32 of 32 items
... completed

Publishing Delta RPMs
... skipped

Publishing Errata
[==================================================] 100%
4 of 4 items
... completed

Publishing Comps file
[==================================================] 100%
3 of 3 items
... completed

Publishing Metadata.
[-]
... completed

Closing repo metadata
[-]
... completed

Generating sqlite files
... skipped

Publishing files to web
[-]
... completed

Writing Listings File
[-]
... completed


Task Succeeded


[root@qe-blade-13 ~]#

Comment 10 Randy Barlow 2014-09-23 17:54:17 UTC
This is fixed in Pulp-2.4.1-1.

Comment 11 Brian Bouterse 2014-11-14 21:04:16 UTC
This was regressed in 2.5.0 so I'm reopening this again.

Comment 12 Brian Bouterse 2014-11-14 22:01:07 UTC
Another PR to fix the regression:  https://github.com/pulp/pulp/pull/1302

Comment 13 Brian Bouterse 2014-11-14 22:08:00 UTC
Merged to 2.5-testing -> 2.5-dev -> master

Comment 14 Preethi Thomas 2014-11-18 13:55:25 UTC
verified

[root@hp-dl380pgen8-02-vm-2 ~]# rpm -qa pulp-server
pulp-server-2.5.0-0.19.rc.el7.noarch
[root@hp-dl380pgen8-02-vm-2 ~]# 

[root@hp-dl380pgen8-02-vm-2 ~]# pulp-admin login -u admin -p admin
Successfully logged in. Session certificate will expire at Nov 25 13:38:36 2014
GMT.

[root@hp-dl380pgen8-02-vm-2 ~]# pulp-admin rpm repo create --repo-id zoo --feed https://repos.fedorapeople.org/repos/pulp/pulp/demo_repos/zoo/
Successfully created repository [zoo]

[root@hp-dl380pgen8-02-vm-2 ~]# pulp-admin rpm repo sync run --repo-id zoo
+----------------------------------------------------------------------+
                     Synchronizing Repository [zoo]
+----------------------------------------------------------------------+

This command may be exited via ctrl+c without affecting the request.


Downloading metadata...
[\]
... completed

Downloading repository content...
[==================================================] 100%
RPMs:       32/32 items
Delta RPMs: 0/0 items

... completed

Downloading distribution files...
[==================================================] 100%
Distributions: 0/0 items
... completed

Importing errata...
[-]
... completed

Importing package groups/categories...
[-]
... completed


Task Succeeded



Initializing repo metadata
[-]
... completed

Publishing Distribution files
[-]
... completed

Publishing RPMs
[==================================================] 100%
32 of 32 items
... completed

Publishing Delta RPMs
... skipped

Publishing Errata
[==================================================] 100%
4 of 4 items
... completed

Publishing Comps file
[==================================================] 100%
3 of 3 items
... completed

Publishing Metadata.
[-]
... completed

Closing repo metadata
[-]
... completed

Generating sqlite files
... skipped

Publishing files to web
[-]
... completed

Writing Listings File
[-]
... completed


Task Succeeded


[root@hp-dl380pgen8-02-vm-2 ~]# systemctl stop mongod
[root@hp-dl380pgen8-02-vm-2 ~]# systemctl start mongod[root@hp-dl380pgen8-02-vm-2 ~]# pulp-admin rpm repo sync run --repo-id zoo
+----------------------------------------------------------------------+
                     Synchronizing Repository [zoo]
+----------------------------------------------------------------------+

This command may be exited via ctrl+c without affecting the request.


Downloading metadata...
[\]
... completed

Downloading repository content...
[==================================================] 100%
RPMs:       0/0 items
Delta RPMs: 0/0 items

... completed

Downloading distribution files...
[==================================================] 100%
Distributions: 0/0 items
... completed

Importing errata...
[-]
... completed

Importing package groups/categories...
[-]
... completed


Task Succeeded



Copying files
[-]
... completed

Initializing repo metadata
[-]
... completed

Publishing Distribution files
[-]
... completed

Publishing RPMs
[-]
... completed

Publishing Delta RPMs
... skipped

Publishing Errata
[==================================================] 100%
4 of 4 items
... completed

Publishing Comps file
[==================================================] 100%
3 of 3 items
... completed

Publishing Metadata.
[-]
... completed

Closing repo metadata
[-]
... completed

Generating sqlite files
... skipped

Publishing files to web
[-]
... completed

Writing Listings File
[-]
... completed


Task Succeeded


[root@hp-dl380pgen8-02-vm-2 ~]#