Bug 1652287

Summary: Deployment may fail because nova_wait_for_db_sync runs simultaneously on all three controllers
Product: Red Hat OpenStack Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templatesAssignee: Martin Schuppert <mschuppe>
Status: CLOSED ERRATA QA Contact: Gurenko Alex <agurenko>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 14.0 (Rocky)CC: asimonel, dbecker, egallen, jhakimra, lyarwood, mbooth, mburns, michele, morazi, mschuppe
Target Milestone: rcKeywords: Triaged
Target Release: 14.0 (Rocky)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-9.0.1-0.20181013060895.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-01-11 11:54:47 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1625732    

Description Marius Cornea 2018-11-21 19:30:55 UTC
Description of problem:
Deployment may fail because nova_wait_for_db_sync runs simultaneously on all three controllers, one of the controllers failing with:

        "ERROR:oslo_db.sqlalchemy.exc_filters:DBAPIError exception wrapped from (pymysql.err.InternalError) (1050, u\"Table 'migrate_version' already exists\") [SQL: u'\\nCREATE TABLE migrate_version (\\n\\trepository_id VARCHAR(250) NOT NULL, \\n\\trepository_path TEXT, \\n\\tversion INTEGER, \\n\\tPRIMARY KEY (repository_id)\\n)\\n\\n'] (Background on this error at: http://sqlalche.me/e/2j85)", 
        "Traceback (most recent call last):", 
        "  File \"/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py\", line 1193, in _execute_context", 
        "    context)", 
        "  File \"/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py\", line 507, in do_execute", 
        "    cursor.execute(statement, parameters)", 
        "  File \"/usr/lib/python2.7/site-packages/pymysql/cursors.py\", line 166, in execute", 
        "    result = self._query(query)", 
        "  File \"/usr/lib/python2.7/site-packages/pymysql/cursors.py\", line 322, in _query", 
        "    conn.query(q)", 
        "  File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 856, in query", 
        "    self._affected_rows = self._read_query_result(unbuffered=unbuffered)", 
        "  File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 1057, in _read_query_result", 
        "    result.read()", 
        "  File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 1340, in read", 
        "    first_packet = self.connection._read_packet()", 
        "  File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 1014, in _read_packet", 
        "    packet.check_error()", 
        "  File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 393, in check_error", 
        "    err.raise_mysql_exception(self._data)", 
        "  File \"/usr/lib/python2.7/site-packages/pymysql/err.py\", line 107, in raise_mysql_exception", 
        "    raise errorclass(errno, errval)", 
        "InternalError: (1050, u\"Table 'migrate_version' already exists\")", 
        "ERROR:nova_wait_for_db_sync:uuups something went wrong: %s (pymysql.err.InternalError) (1050, u\"Table 'migrate_version' already exists\") [SQL: u'\\nCREATE TABLE migrate_version (\\n\\trepository_id VARCHAR(250) NOT NULL, \\n\\trepository_path TEXT, \\n\\tversion INTEGER, \\n\\tPRIMARY KEY (repository_id)\\n)\\n\\n'] (Background on this error at: http://sqlalche.me/e/2j85)", 

Version-Release number of selected component (if applicable):
2018-11-21.2 puddle

How reproducible:
not always

Steps to Reproduce:
1. Deploy OSP14 with 3 controllers

Actual results:
It looks like nova_wait_for_db_sync runs db migrations simultaneously on all 3 controllers.

Expected results:
It should run the db migrations on 1/3 controllers.

Additional info:
Attaching ansible.log

Comment 4 Matthew Booth 2018-11-22 13:12:42 UTC
I think these 2 errors represent 2 slightly different manifestations of the same race in nova's db_version(). We do:

try:
  get db version
except:
  count how many tables we've got
  if no tables:
    create a version table
  else:
    complain about an Essex-era db

So if 2 threads execute this concurrently and both fail to get db version, we've then got a different behaviour based on whether the race winner has started creating the schema yet or not. If not, we get a 1050 as in the description. If yes, we get the Essex warning as in comment 3. The cause is the same, though.

Comment 5 Martin Schuppert 2018-11-26 12:31:18 UTC
*** Bug 1636399 has been marked as a duplicate of this bug. ***

Comment 6 Martin Schuppert 2018-11-27 07:18:08 UTC
*** Bug 1651475 has been marked as a duplicate of this bug. ***

Comment 7 Lee Yarwood 2018-11-28 12:55:36 UTC
*** Bug 1653816 has been marked as a duplicate of this bug. ***

Comment 18 Joe H. Rahme 2018-12-03 14:02:13 UTC
Marking this as verified, after running multiple deployments with 3 controllers and didn't observe the faulty behavior.

If this bug starts reappearing do not hesitate to reopen the bug.

Comment 21 errata-xmlrpc 2019-01-11 11:54:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045