Bug 1434748 - OSP10 -> OSP11 upgrades do not wait for galera to be fully up.
Summary: OSP10 -> OSP11 upgrades do not wait for galera to be fully up.
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Target Milestone: rc
Target Release: 11.0 (Ocata)
Assignee: Sofer Athlan-Guyot
QA Contact: Marius Cornea
Depends On:
Reported: 2017-03-22 10:00 UTC by Sofer Athlan-Guyot
Modified: 2017-05-17 20:11 UTC
CC: 4 users

Fixed In Version: openstack-tripleo-heat-templates-6.0.0-0.5.el7ost ansible-pacemaker-1.0.1-0.20170308085805.381d6c8.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2017-05-17 20:11:21 UTC
Target Upstream Version:

Attachments

System ID Private Priority Status Summary Last Updated
Gerrithub.io 351387 0 None None None 2017-03-22 10:06:32 UTC
OpenStack gerrit 438947 0 None None None 2017-03-22 10:02:55 UTC
Red Hat Product Errata RHEA-2017:1245 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 11.0 Bug Fix and Enhancement Advisory 2017-05-17 23:01:50 UTC

Description Sofer Athlan-Guyot 2017-03-22 10:00:37 UTC
Originally reported upstream by Michele: https://bugs.launchpad.net/tripleo/+bug/1668372

Description of problem:

I have observed the following during one of my HA N->O upgrade runs (note the lost connection):
1. Feb 27 18:35:17 overcloud-controller-0 os-collect-config[2078]: TASK [Setup cell_v2 (migrate hosts)] *******************************************
Feb 27 18:35:17 overcloud-controller-0 os-collect-config[2078]: fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["nova-manage", "cell_v2", "map_cell_and_hosts"], "delta": "0:00:04.102342", "end": "2017-02-27 18:35:17.504998", "failed": true, "rc": 1, "start": "2017-02-27 18:35:13.402656", "stderr": "", "stdout": "An error has occurred:\nTraceback (most recent call last):\n File \"/usr/lib/python2.7/site-packages/nova/cmd/manage.py\", line 1594, in main\n ret = fn(*fn_args, **fn_kwargs)\n File \"/usr/lib/python2.7/site-packages/nova/cmd/man
Feb 27 18:35:17 overcloud-controller-0 os-collect-config[2078]: _query_result\n result.read()\n File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 1312, in read\n    first_packet = self.connection._read_packet()\n File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 971, in _read_packet\n packet_header = self._read_bytes(4)\n File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 1008, in _read_bytes\n 2013, \"Lost connection to MySQL server during query\")\nDBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'INSERT INTO cell_mappings (created_at, updated_at, uuid, name, transport_url, database_connection) VALUES (%(created_at)s, %(updated_at)s, %(uuid)s, %(name)s, %(transport_url)s, %(database_connection)s)'] [parameters: {'database_connection': u'mysql+pymysql://nova:N8AUkJGgVewYzdCdC6rPTfr8B@', 'name': None, 'transport_url': u'rabbit://guest:9bakaEc7Zr7GqUYkwsWYuJDQm@overcloud-controller-0.internalapi.localdomain:5672,guest:9bakaEc7Zr7GqUYkwsWYuJDQm@overcloud-controller-1.internalapi.localdomain:5672,guest:9bakaEc7Zr7GqUYkwsWYuJDQm@overcloud-controller-2.internalapi.localdomain:5672/?ssl=0', 'created_at': datetime.datetime(2017, 2, 27, 18, 35, 16, 812201), 'updated_at': None, 'uuid': 'ce8d3b0d-a969-4de4-82af-67e3bd9d11e5'}]", "stdout_lines": ["An error has occurred:", "Traceback (most recent call

2. While Step4 on controller-0 started at:
2017-02-27 18:33:38Z [overcloud-AllNodesDeploySteps-ozxvts356czy.ControllerUpgrade_Step4]: CREATE_IN_PROGRESS state changed
and finished at:
2017-02-27 18:34:31Z [overcloud-AllNodesDeploySteps-ozxvts356czy.ControllerUpgrade_Step4]: CREATE_COMPLETE state changed

3. Yet galera was not ready at that time; it first became ready only afterwards:
galera(galera)[341034]: 2017/02/27_18:35:33 INFO: Galera started

So we need to double-check that the code in puppet/services/pacemaker.yaml that is supposed to wait for all services is working correctly at Step4. It clearly did not wait for galera to be promoted to master on every node, and that is likely what caused this issue.

The cause might be either a) ansible-pacemaker needs to make sure that the resource is master on all nodes, *or* b) the resource list at https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/services/pacemaker.yaml#L93 is missing all the other pacemaker-managed resources.
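For illustration, the kind of check the upgrade step needs can be sketched as below. This is a hypothetical helper, not the actual ansible-pacemaker code, and the `pcs status` output format shown for the master/slave set is an assumption:

```python
import re

def galera_master_nodes(pcs_status_output):
    """Return the node names on which the galera resource is reported
    as Master in `pcs status` output (format assumed, pacemaker 1.x style)."""
    # Typical line: "     Masters: [ overcloud-controller-0 overcloud-controller-1 ]"
    match = re.search(r"Masters:\s*\[\s*([^\]]*?)\s*\]", pcs_status_output)
    if not match:
        return []
    return match.group(1).split()

def galera_ready(pcs_status_output, expected_nodes):
    """True only when galera is Master on every expected controller."""
    return set(galera_master_nodes(pcs_status_output)) >= set(expected_nodes)

# Example: galera promoted on only two of three controllers -> not ready yet,
# which is exactly the window in which nova-manage hit the lost connection.
sample = """
 Master/Slave Set: galera-master [galera]
     Masters: [ overcloud-controller-0 overcloud-controller-1 ]
     Slaves: [ overcloud-controller-2 ]
"""
controllers = ["overcloud-controller-0",
               "overcloud-controller-1",
               "overcloud-controller-2"]
print(galera_ready(sample, controllers))  # -> False
```

A wait loop at Step4 would poll such a check (for galera and the other pacemaker-managed resources) until it returns True before letting the cell_v2 tasks run.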

How reproducible: often

Steps to Reproduce:
1. Deploy an overcloud with multiple controllers.
2. Run the OSP10 -> OSP11 upgrade.

Comment 1 Red Hat Bugzilla Rules Engine 2017-03-22 10:00:43 UTC
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.

Comment 2 Sofer Athlan-Guyot 2017-03-22 10:06:33 UTC

The gerrithub review in ansible-pacemaker needs to land before, or at the same time as, the review in tripleo-heat-templates.


Comment 5 errata-xmlrpc 2017-05-17 20:11:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

