Bug 1041084

Summary: [RFE][nova]: Automatic recovery from transient db connection failures
Product: Red Hat OpenStack Reporter: RHOS Integration <rhos-integ>
Component: openstack-nova Assignee: RHOS Maint <rhos-maint>
Status: CLOSED ERRATA QA Contact: Ami Jeain <ajeain>
Severity: low Docs Contact:
Priority: medium    
Version: unspecified CC: markmc, ndipanov, sgordon, slong, vpopovic, yeylon
Target Milestone: rc Keywords: FutureFeature
Target Release: 5.0 (RHEL 7)   
Hardware: Unspecified   
OS: Unspecified   
URL: https://blueprints.launchpad.net/nova/+spec/db-reconnect
Whiteboard: upstream_milestone_icehouse-rc1 upstream_status_implemented upstream_definition_approved
Fixed In Version: openstack-nova-2014.1-3.el7ost Doc Type: Enhancement
Doc Text:
Transient database connection failures are now recovered automatically. A variety of circumstances can cause a transient failure in the database connection (for example, a restart or upgrade of the database, migration of a VIP between an HA pair, or a network failure). Compute now catches these "db-has-gone-away" errors, automatically reconnects, and retries the last operation so that the caller can continue whatever operation was in progress. The user no longer has to abort long-running operations (such as 'nova boot' or 'glance image-create') because of a momentary interruption in database connectivity.
Last Closed: 2014-07-08 15:27:16 UTC

Description RHOS Integration 2013-12-12 13:35:56 UTC
Cloned from launchpad blueprint https://blueprints.launchpad.net/nova/+spec/db-reconnect.

Description:

A variety of circumstances can cause a transient failure in database connections, for example: a restart or upgrade of the database, migration of a VIP between an HA pair, or simply a network failure. Nova (and all projects connecting to a database) would benefit from the db/api catching these "db-has-gone-away" errors and automatically reconnecting and retrying the last operation, in such a way that the caller is able to continue whatever operation was in progress. It should not be necessary to abort long-running operations (such as nova boot or glance image-create) just because of a momentary interruption in db connectivity.

A (slightly brute-force) patch was previously proposed: https://review.openstack.org/#/c/10797/. To enable retries safely, more work is probably going to be required.
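For illustration only, the catch / reconnect / retry behaviour requested above could be sketched as a decorator around a db/api call. This is not the proposed patch or the code that eventually merged. DBConnectionError, retry_on_disconnect, and the reconnect callback are hypothetical names standing in for whatever "db-has-gone-away" error and reconnect hook the real database layer exposes.

```python
import functools
import time


class DBConnectionError(Exception):
    """Hypothetical stand-in for a driver's "db-has-gone-away" error."""


def retry_on_disconnect(reconnect, max_retries=3, retry_interval=1.0,
                        sleep=time.sleep):
    """Retry the wrapped call after a transient connection failure.

    reconnect is an assumed zero-argument callable that re-establishes
    the database connection. sleep is injectable so tests need not wait.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # One initial attempt plus up to max_retries retries.
            for attempt in range(max_retries + 1):
                try:
                    if attempt:
                        # Reconnect before every retry; a failed
                        # reconnect counts as another failed attempt.
                        reconnect()
                    return func(*args, **kwargs)
                except DBConnectionError:
                    if attempt == max_retries:
                        raise  # no longer looks transient; give up
                    sleep(retry_interval)
        return wrapper
    return decorator
```

A db/api function decorated with @retry_on_disconnect(reconnect_fn) would then survive a brief outage without its caller noticing. The sketch does not address the safety concern raised above: blindly repeating an operation is only safe when that operation is idempotent or when the failed attempt is known not to have committed anything.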

Specification URL (additional information):

None

Comment 8 errata-xmlrpc 2014-07-08 15:27:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0853.html