Bug 1434955

Summary: OSP10 -> OSP11 upgrade: after running major-upgrade-composable-steps.yaml nova-api cannot connect to Galera on 2/3 controllers
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates
Assignee: Sofer Athlan-Guyot <sathlang>
Status: CLOSED ERRATA
QA Contact: Marius Cornea <mcornea>
Severity: urgent
Priority: unspecified
Version: 11.0 (Ocata)
CC: dbecker, jcoufal, jschluet, mburns, morazi, rhel-osp-director-maint, sasha, sathlang
Target Milestone: rc
Keywords: Triaged
Target Release: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-heat-templates-6.0.0-0.11.el7ost
Last Closed: 2017-05-17 20:11:55 UTC
Type: Bug
Bug Blocks: 1394019

Description Marius Cornea 2017-03-22 17:51:09 UTC
Description of problem:
OSP10 -> OSP11 upgrade: after running major-upgrade-composable-steps.yaml, nova-api cannot connect to MySQL on 2 of 3 controllers. This results in timeouts and HTTP 500/504 errors (ERROR (ClientException): Unknown Error (HTTP 504)) when calling the nova API, making it very difficult to manage nova instances prior to upgrading the compute nodes.

Even after running the upgrade converge step, 2 of 3 controllers cannot reach MySQL, leaving the upgraded environment in a semi-working state.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-6.0.0-0.3.0rc2.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP10 with 3 controllers, 2 computes, 3 ceph nodes
2. Upgrade to OSP11

Actual results:
controller-1 and controller-2 report in /var/log/nova/nova-api.log:
2017-03-22 17:25:43.189 377885 WARNING oslo_db.sqlalchemy.engines [req-41cd5904-8031-48d1-9e12-56e73c62f5cd - - - - -] SQL connection failed. -123 attempts left.

Expected results:
nova api on all controllers is in a functional state.

Additional info:
Attaching the sosreports.

Comment 2 Sofer Athlan-Guyot 2017-03-22 19:09:57 UTC
Hi,

I cannot find anything obvious.

VIP is on ctl-2.

On ctl-2 and ctl-1 we have this kind of error:

    /var/log/nova/nova-api.log:2017-03-22 18:49:12.394 195971 DEBUG nova.api.openstack.wsgi [req-7681616d-f43d-490f-9333-b86a9bbf3dff - - - - -] Calling method '<bound method ServersController.detail of <nova.api.openstack.compute.servers.ServersController object at 0x79097d0>>' _process_stack /usr/lib/python2.7/site-packages/nova/api/openstack/wsgi.py:626
    /var/log/nova/nova-api.log:2017-03-22 18:49:12.401 195971 DEBUG nova.compute.api [req-7681616d-f43d-490f-9333-b86a9bbf3dff - - - - -] Searching by: {'deleted': False} get_all /usr/lib/python2.7/site-packages/nova/compute/api.py:2344
    /var/log/nova/nova-api.log:2017-03-22 18:49:12.414 195971 DEBUG oslo_db.sqlalchemy.engines [req-7681616d-f43d-490f-9333-b86a9bbf3dff - - - - -] MySQL server mode set to STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,TRADITIONAL,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION _check_effective_sql_mode /usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/engines.py:261
    
    ....
    
    /var/log/nova/nova-api.log:2017-03-22 18:55:12.787 195971 WARNING oslo_db.sqlalchemy.engines [req-7681616d-f43d-490f-9333-b86a9bbf3dff - - - - -] SQL connection failed. -37 attempts left.
    /var/log/nova/nova-api.log:2017-03-22 18:55:22.794 195971 WARNING oslo_db.sqlalchemy.engines [req-7681616d-f43d-490f-9333-b86a9bbf3dff - - - - -] SQL connection failed. -38 attempts left.
    /var/log/nova/nova-api.log:2017-03-22 18:55:32.913 195971 WARNING oslo_db.sqlalchemy.engines [req-7681616d-f43d-490f-9333-b86a9bbf3dff - - - - -] SQL connection failed. -39 attempts left.
    /var/log/nova/nova-api.log:2017-03-22 18:55:42.924 195971 WARNING oslo_db.sqlalchemy.engines [req-7681616d-f43d-490f-9333-b86a9bbf3dff - - - - -] SQL connection failed. -40 attempts left.

and:

    /var/log/nova/nova-api.log:2017-03-22 18:59:12.404 195967 DEBUG nova.api.openstack.wsgi [req-dbc1f1d7-a700-409a-b0ea-6364636ac8a6 - - - - -] Calling method '<bound method ServersController.detail of <nova.api.openstack.compute.servers.ServersController object at 0x79097d0>>' _process_stack /usr/lib/python2.7/site-packages/nova/api/openstack/wsgi.py:626
    /var/log/nova/nova-api.log:2017-03-22 18:59:12.411 195967 DEBUG nova.compute.api [req-dbc1f1d7-a700-409a-b0ea-6364636ac8a6 - - - - -] Searching by: {'deleted': False} get_all /usr/lib/python2.7/site-packages/nova/compute/api.py:2344
    /var/log/nova/nova-api.log:2017-03-22 18:59:12.424 195967 DEBUG oslo_db.sqlalchemy.engines [req-dbc1f1d7-a700-409a-b0ea-6364636ac8a6 - - - - -] MySQL server mode set to STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,TRADITIONAL,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION _check_effective_sql_mode /usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/engines.py:261
    /var/log/nova/nova-api.log:2017-03-22 18:59:12.437 195967 WARNING oslo_db.sqlalchemy.engines [req-dbc1f1d7-a700-409a-b0ea-6364636ac8a6 - - - - -] SQL connection failed. -1 attempts left.
    /var/log/nova/nova-api.log:2017-03-22 18:59:22.446 195967 WARNING oslo_db.sqlalchemy.engines [req-dbc1f1d7-a700-409a-b0ea-6364636ac8a6 - - - - -] SQL connection failed. -2 attempts left.

Using the mysql client we can connect from all the nodes to the db just
fine, and the grants look fine.

What is strange is the negative attempt counter.  Will ask the nova
folks if they can provide assistance.

Comment 3 Sofer Athlan-Guyot 2017-03-23 08:28:27 UTC
So just adding that running

    mysql -u nova_api -p -h <galera_vip> -e 'show grants;' 

works fine from every node in the cluster, so there is no network connectivity issue.

Adding compute nodes for visibility.

Comment 4 Sofer Athlan-Guyot 2017-03-23 10:23:10 UTC
Hi,

So after a discussion with Damien and Michele, the problem was found.

It appears that when the nova cell is created, it hardcodes into the database the connection parameters present in /etc/nova/nova.conf of the node where the command is run.  Running it on controller-0, for instance, will give you this in the database:

    mysql+pymysql://nova:c2cdagE8PyAbnpers3AD88Hge.0.19nova?bind_address=10.0.0.20'

This is later used to create a connection to the database for nova cell information.  This obviously fails on the 2 other nodes, as they don't have the 10.0.0.20 address.

To prevent this issue, this workaround has been applied: https://review.openstack.org/#/c/436192/ which removes the bind_address parameter from the connection string.
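The effect of the workaround is simply to drop the node-specific bind_address query parameter from the stored connection URL while keeping everything else. As an illustration only (this is not the actual tripleo/puppet code, and the credentials in the example URL are made up), the transformation can be sketched in Python:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_bind_address(conn):
    """Remove the node-local bind_address query parameter from a
    SQLAlchemy-style connection URL, leaving the rest untouched."""
    parts = urlsplit(conn)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "bind_address"]
    return urlunsplit(parts._replace(query=urlencode(query)))

url = "mysql+pymysql://nova:secret@192.0.2.10/nova?bind_address=192.0.2.20"
print(strip_bind_address(url))
# mysql+pymysql://nova:secret@192.0.2.10/nova
```

With bind_address gone, the stored URL is valid from any controller, not just the one that ran the cell creation.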

The sequence of events seems correct:
 1. update hiera data;
 2. create the nova cell with the database option.


From the journalctl logs:

    Mar 22 19:23:44 overcloud-controller-0.localdomain os-collect-config[4197]: [2017-03-22 19:23:44,806] (heat-config) [DEBUG] Running /usr/libexec/heat-config/hooks/hiera < /var/lib/heat-config/deployed/e4e9bd8e-4b7b-41da-b040-d4f563f2fd48.json
    Mar 22 19:23:44 overcloud-controller-0.localdomain os-collect-config[4197]: [2017-03-22 19:23:44,852] (heat-config) [DEBUG] Running heat-config-notify /var/lib/heat-config/deployed/e4e9bd8e-4b7b-41da-b040-d4f563f2fd48.json < /var/lib/heat-config/deployed/e4e9bd8e-4b7b-41da-b040-d4f563f2fd48.notify.json

    $ grep nova::database_connection /var/lib/heat-config/deployed/e4e9bd8e-4b7b-41da-b040-d4f563f2fd48.json
            "nova::database_connection": "mysql+pymysql://nova:c2cdagE8PyAbnpers3AD88Hge.0.19/nova?read_default_file=/etc/my.cnf.d/tripleo.cnf&read_default_group=tripleo",

    [root@overcloud-controller-0 e]# journalctl | grep 'nova-manage cell_v2'
    Mar 22 19:39:52 overcloud-controller-0.localdomain ansible-command[440226]: Invoked with warn=True executable=None _uses_shell=False _raw_params=nova-manage cell_v2 map_cell0 removes=None creates=None chdir=None
    Mar 22 19:39:55 overcloud-controller-0.localdomain ansible-command[440632]: Invoked with warn=True executable=None _uses_shell=True _raw_params=nova-manage cell_v2 create_cell --name='default' --database_connection=$(hiera nova::database_connection) removes=None creates=None chdir=None
    Mar 22 19:40:09 overcloud-controller-0.localdomain ansible-command[443480]: Invoked with warn=True executable=None _uses_shell=False _raw_params=nova-manage cell_v2 map_cell_and_hosts removes=None creates=None chdir=None
    Mar 22 19:40:12 overcloud-controller-0.localdomain ansible-command[443950]: Invoked with warn=True executable=None _uses_shell=True _raw_params=nova-manage cell_v2 list_cells | sed -e '1,3d' -e '$d' | awk -F ' *| *' '$2 == "default" {print $4}' removes=None creates=None chdir=None
    Mar 22 19:40:15 overcloud-controller-0.localdomain ansible-command[444382]: Invoked with warn=True executable=None _uses_shell=False _raw_params=nova-manage cell_v2 map_instances --cell_uuid 7f04f00d-4b9d-478d-941f-d93a7be145e7 removes=None creates=None chdir=None

    INSERT INTO `cell_mappings` VALUES
    ('2017-03-22 19:39:54',NULL,2,'00000000-0000-0000-0000-000000000000','cell0','none:///','mysql+pymysql://nova:c2cdagE8PyAbnpers3AD88Hge.0.19/nova_cell0?bind_address=10.0.0.20'),
    ('2017-03-22 19:39:58',NULL,5,'7f04f00d-4b9d-478d-941f-d93a7be145e7','default','rabbit://guest:ZB6fp92XTbxGXzhDYnWrkN4Qq.localdomain:5672,guest:ZB6fp92XTbxGXzhDYnWrkN4Qq.localdomain:5672,guest:ZB6fp92XTbxGXzhDYnWrkN4Qq.localdomain:5672/?ssl=0','mysql+pymysql://nova:c2cdagE8PyAbnpers3AD88Hge.0.19/nova?read_default_file=/etc/my.cnf.d/tripleo.cnf&read_default_group=tripleo'),
    ('2017-03-22 19:40:12',NULL,8,'785eaa63-2ac1-4f2f-83e0-cb92e9045b9f',NULL,'rabbit://guest:ZB6fp92XTbxGXzhDYnWrkN4Qq.localdomain:5672,guest:ZB6fp92XTbxGXzhDYnWrkN4Qq.localdomain:5672,guest:ZB6fp92XTbxGXzhDYnWrkN4Qq.localdomain:5672/?ssl=0','mysql+pymysql://nova:c2cdagE8PyAbnpers3AD88Hge.0.19/nova?bind_address=10.0.0.20');


But in the end we can see that the database is not in the correct state:

    [root@overcloud-controller-0 e]# mysql -e 'select name,database_connection from nova_api.cell_mappings'
    +---------+--------------------------------------------------------------------------------------------------------------------------------------+
    | name    | database_connection                                                                                                                  |
    +---------+--------------------------------------------------------------------------------------------------------------------------------------+
    | cell0   | mysql+pymysql://nova:c2cdagE8PyAbnpers3AD88Hge.0.19/nova_cell0?bind_address=10.0.0.20                                           |
    | default | mysql+pymysql://nova:c2cdagE8PyAbnpers3AD88Hge.0.19/nova?read_default_file=/etc/my.cnf.d/tripleo.cnf&read_default_group=tripleo |
    | NULL    | mysql+pymysql://nova:c2cdagE8PyAbnpers3AD88Hge.0.19/nova?bind_address=10.0.0.20                                                 |
    +---------+--------------------------------------------------------------------------------------------------------------------------------------+


 1. cell0 is pointing to the wrong connection.
 2. cell NULL ??? what is that?

For 1. we're missing --database_connection=$(hiera nova::database_connection) in this command:

   nova-manage cell_v2 map_cell0

For the NULL cell creation I don't know yet.
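For reference, when map_cell0 is given a database connection, nova derives the cell0 database by suffixing the main database name with _cell0, which is why passing the hiera-provided URL avoids the bind_address-tainted default from nova.conf. A minimal Python sketch of that derivation (an illustration under that assumption, not nova's actual code; the example URL is made up):

```python
from urllib.parse import urlsplit, urlunsplit

def cell0_connection(conn):
    """Derive the cell0 connection from the main nova connection by
    appending _cell0 to the database name, keeping the scheme, host,
    and credentials untouched."""
    parts = urlsplit(conn)
    return urlunsplit(parts._replace(path=parts.path + "_cell0"))

main = "mysql+pymysql://nova:secret@192.0.2.10/nova"
print(cell0_connection(main))
# mysql+pymysql://nova:secret@192.0.2.10/nova_cell0
```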

Filing an upstream bz and a review.

Comment 5 Sofer Athlan-Guyot 2017-03-23 21:33:55 UTC
Pointed the gerrit patch at the ocata branch.

Comment 9 errata-xmlrpc 2017-05-17 20:11:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1245