1019909 – mongo seeds are not reconnecting to new PRIMARY of a replica set

Summary: mongo seeds are not reconnecting to new PRIMARY of a replica set

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Pulp
Classification:	Retired
Component:	z_other
Sub Component:
Version:	2.2 Beta
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	high
Target Milestone:	---
Target Release:	2.2.1
Assignee:	Jay Dobies
QA Contact:	Preethi Thomas
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1021011

Reported:	2013-10-16 15:31 UTC by Vincent Batts
Modified:	2013-12-09 14:37 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Clones:	1021011
Environment:
Last Closed:	2013-12-09 14:37:06 UTC
Embargoed:

Description Vincent Batts 2013-10-16 15:31:20 UTC

Description of problem:
If pulp is seeded from a mongodb replica set and a new mongo PRIMARY is elected, pulp fails to reconnect to the new PRIMARY.

Version-Release number of selected component (if applicable):
pulp-server-2.2.0-0.20.beta.git.0.d54a854.el6eng.cdn.1.noarch
mongo server buildinfo - 2.4.6 (EPEL)
pymongo-2.1.1-1.el6.x86_64

How reproducible:
very

Steps to Reproduce:
1. Have a mongo replica set where mongodb01.web.stage.our.domain.com is PRIMARY.
2. Set up pulp to seed from a host in the mongodb replica set:
   /etc/pulp/server.conf [database] seeds: mongodb01.web.stage.our.domain.com:27017
3. Start the pulp server (and run pulp-manage-db); ensure it is functioning correctly.
4. On the mongodb replica set, have mongodb01 step down from PRIMARY ( rs.stepDown() ).
5. Make calls to the pulp server ( `pulp-admin login -u admin -p S3krit` ).

Actual results:
   An internal error occurred on the Pulp server. More information can be found in the client log file ~/.pulp/admin.log.
=== START ~/.pulp/admin.log =====
2013-10-16 10:22:45,466 - ERROR - Client-side exception occurred
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/pulp/client/extensions/core.py", line 478, in run
    exit_code = Cli.run(self, args)
  File "/usr/lib/python2.6/site-packages/okaara/cli.py", line 974, in run
    exit_code = command_or_section.execute(self.prompt, remaining_args)
  File "/usr/lib/python2.6/site-packages/pulp/client/extensions/extensions.py", line 224, in execute
    return self.method(*arg_list, **clean_kwargs)
  File "/usr/lib/pulp/admin/extensions/pulp_server_info/pulp_cli.py", line 35, in types
    all_types = self.context.server.server_info.get_types()
  File "/usr/lib/python2.6/site-packages/pulp/bindings/server_info.py", line 33, in get_types
    return self.server.GET(path)
  File "/usr/lib/python2.6/site-packages/pulp/bindings/server.py", line 84, in GET
    return self._request('GET', path, queries)
  File "/usr/lib/python2.6/site-packages/pulp/bindings/server.py", line 142, in _request
    self._handle_exceptions(response_code, response_body)
  File "/usr/lib/python2.6/site-packages/pulp/bindings/server.py", line 183, in _handle_exceptions
    raise code_class_mappings[response_code](response_body)
PermissionsException: RequestException: GET request on /pulp/api/v2/plugins/types/ failed with 401 - Pulp exception occurred: AuthenticationFailed
2013-10-16 10:23:05,446 - ERROR - Exception occurred:
        href:      /pulp/api/v2/actions/login/
        method:    POST
        status:    500
        error:     create_index operation failed on pulp2_database.users: database connection still down after 3 tries
        traceback: None
        data:      {u'args': [u'create_index operation failed on pulp2_database.users: database connection still down after 3 tries']}
=== END ~/.pulp/admin.log =====

Expected results:
   Successfully logged in. Session certificate will expire at Oct 23 14:24:33 2013 GMT.

Additional info:
If I can get the original PRIMARY node elected back as PRIMARY, then everything on pulp begins working again.

This is enough of an issue to block us from promoting pulp v2 to production.

Comment 1 Michael Hrivnak 2013-10-18 18:45:48 UTC

This is likely a regression. Unless it's particularly inconvenient, I think it makes sense to fix this in 2.2 and get it into our 2.2.1 release.

Comment 2 Jay Dobies 2013-10-23 14:55:42 UTC

https://github.com/pulp/pulp/pull/672

Comment 3 Jay Dobies 2013-10-23 15:06:26 UTC

For QE:

Check out http://docs.mongodb.org/manual/tutorial/deploy-replica-set-for-testing/ for information on setting up a replica set. Here's what I did:

- Set up a replica set with three mongod processes. I connected a mongo shell to each so that I could see which one was the primary. It's pretty simple: the prompt in the console indicates whether it's a primary or a secondary.

Unconfigured Test:

- Leave the Pulp configuration at the default (i.e. no replica set configured but one in use).
- Point Pulp at the replica set primary DB.
- Run a watch on `pulp-admin rpm repo list` (or some other cheap command that hits the DB) and kill the primary database.
- /var/log/pulp/pulp.log will spam messages about not being able to connect, even though another database is named the primary.

Environment Reset:

Stop the watch and Apache. Restart the killed Mongo DB process. At this point, it actually doesn't matter which is the primary for the purposes of Pulp server configuration; it can continue to point at the port used in the previous run even though it's very likely to be a secondary (when it comes back up it doesn't replace the newly elected primary).

Configured Test:

- Edit /etc/pulp/server.conf to configure it for your replica set. The comments in there should be enough to guide you, so I won't mention any more.
- Restart Apache.
- Restart the watch.
- Kill the primary (remember to check the mongo shells to see which is the primary). The server logs will complain for a bit about not being able to connect (the sleep between retries is very short, and mongo typically takes a bit longer than that to reorient itself). The CLI command on the watch should show errors too.
- After a very short amount of time (~2 seconds), the pulp log should stop showing connection errors and the CLI should show the results of the command correctly.

You can restart the killed instance, but again, it won't be renamed primary unless there's a need, so don't expect it to start fielding requests again immediately.
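
The retry behaviour described above can be illustrated with a minimal sketch. This is not Pulp's actual code: `ConnectionDown`, `retry_on_failover`, and the `delay` value are hypothetical names chosen for illustration, and the mapping of a retry count of 2 to "3 tries" is an assumption based only on the error text in the description. In a real deployment the exception would be a driver error such as `pymongo.errors.AutoReconnect`.

```python
import time


class ConnectionDown(Exception):
    """Hypothetical stand-in for a driver error such as pymongo.errors.AutoReconnect."""


def retry_on_failover(operation, retries=2, delay=0.5, sleep=time.sleep):
    """Call operation(); on ConnectionDown, wait briefly and try again.

    Makes one initial attempt plus `retries` further attempts (an assumed
    interpretation of a retry setting), then gives up with an error.
    """
    attempts = retries + 1
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionDown:
            if attempt == attempts:
                raise RuntimeError(
                    "database connection still down after %d tries" % attempts
                )
            # Give the replica set time to elect a new primary.
            sleep(delay)


if __name__ == "__main__":
    # Simulate an election: the first two calls fail, the third succeeds.
    state = {"calls": 0}

    def flaky_query():
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionDown("no primary available")
        return ["repo-1"]

    print(retry_on_failover(flaky_query, retries=2, delay=0.01))
```

If the delay between attempts is much shorter than the election time, all attempts can be exhausted before a new primary exists, which would match the brief burst of errors seen in the log before the connection recovers.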

Comment 4 Jeff Ortel 2013-11-01 18:50:33 UTC

build: 2.2.1-0.1.beta

Comment 5 Preethi Thomas 2013-11-13 16:25:37 UTC

verified
[root@pulp-v2-server ~]# rpm -qa pulp-server
pulp-server-2.2.1-0.2.beta.el6.noarch
[root@pulp-v2-server ~]# 

Set up a replica set as described above and confirmed it is working well: pulp reconnects to the new primary.

[database]
name: pulp_database
seeds: localhost:27017,localhost:27018,localhost:27019
operation_retries: 2
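
As a rough illustration of how a seed list like the one above breaks down into host/port pairs, here is a small sketch using only the Python standard library. The helper names `parse_seeds` and `load_database_settings` are hypothetical and are not taken from Pulp's source; the default port of 27017 for entries without a port is an assumption, and IPv6 addresses are not handled.

```python
import configparser

# Sample text mirroring the [database] section shown in this comment.
SAMPLE_CONF = """\
[database]
name: pulp_database
seeds: localhost:27017,localhost:27018,localhost:27019
operation_retries: 2
"""


def parse_seeds(seeds, default_port=27017):
    """Split a comma-separated seed string into (host, port) tuples."""
    pairs = []
    for entry in seeds.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if sep:
            pairs.append((host, int(port)))
        else:
            # No port given: fall back to the assumed default.
            pairs.append((entry, default_port))
    return pairs


def load_database_settings(text):
    """Read the [database] section from config text into a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["database"]
    return {
        "name": section["name"],
        "seeds": parse_seeds(section["seeds"]),
        "operation_retries": section.getint("operation_retries"),
    }


if __name__ == "__main__":
    print(load_database_settings(SAMPLE_CONF))
```

Listing every replica set member as a seed, as in the configuration above, means a client can still discover the set if any one listed host is unreachable at startup; the single-seed configuration in the original report has no such fallback.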

Comment 6 Preethi Thomas 2013-12-09 14:37:06 UTC

Released pulp 2.2.1
