Bug 1434748 - OSP10 -> OSP11 upgrades do not wait for galera to be fully up.
Summary: OSP10 -> OSP11 upgrades do not wait for galera to be fully up.
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Target Milestone: rc
Target Release: 11.0 (Ocata)
Assignee: Sofer Athlan-Guyot
QA Contact: Marius Cornea
Depends On:
Reported: 2017-03-22 10:00 UTC by Sofer Athlan-Guyot
Modified: 2017-05-17 20:11 UTC
CC: 4 users

Fixed In Version: openstack-tripleo-heat-templates-6.0.0-0.5.el7ost ansible-pacemaker-1.0.1-0.20170308085805.381d6c8.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2017-05-17 20:11:21 UTC
Target Upstream Version:

Attachments

System ID Private Priority Status Summary Last Updated
Gerrithub.io 351387 0 None None None 2017-03-22 10:06:32 UTC
OpenStack gerrit 438947 0 None None None 2017-03-22 10:02:55 UTC
Red Hat Product Errata RHEA-2017:1245 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 11.0 Bug Fix and Enhancement Advisory 2017-05-17 23:01:50 UTC

Description Sofer Athlan-Guyot 2017-03-22 10:00:37 UTC
Originally reported upstream by Michele: https://bugs.launchpad.net/tripleo/+bug/1668372

Description of problem:

I have observed the following during one of my HA N->O upgrade runs (note the lost connection):
1. Feb 27 18:35:17 overcloud-controller-0 os-collect-config[2078]: TASK [Setup cell_v2 (migrate hosts)] *******************************************
Feb 27 18:35:17 overcloud-controller-0 os-collect-config[2078]: fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["nova-manage", "cell_v2", "map_cell_and_hosts"], "delta": "0:00:04.102342", "end": "2017-02-27 18:35:17.504998", "failed": true, "rc": 1, "start": "2017-02-27 18:35:13.402656", "stderr": "", "stdout": "An error has occurred:\nTraceback (most recent call last):\n File \"/usr/lib/python2.7/site-packages/nova/cmd/manage.py\", line 1594, in main\n ret = fn(*fn_args, **fn_kwargs)\n File \"/usr/lib/python2.7/site-packages/nova/cmd/man
Feb 27 18:35:17 overcloud-controller-0 os-collect-config[2078]: _query_result\n result.read()\n File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 1312, in read\n    first_packet = self.connection._read_packet()\n File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 971, in _read_packet\n packet_header = self._read_bytes(4)\n File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 1008, in _read_bytes\n 2013, \"Lost connection to MySQL server during query\")\nDBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'INSERT INTO cell_mappings (created_at, updated_at, uuid, name, transport_url, database_connection) VALUES (%(created_at)s, %(updated_at)s, %(uuid)s, %(name)s, %(transport_url)s, %(database_connection)s)'] [parameters: {'database_connection': u'mysql+pymysql://nova:N8AUkJGgVewYzdCdC6rPTfr8B@', 'name': None, 'transport_url': u'rabbit://guest:9bakaEc7Zr7GqUYkwsWYuJDQm@overcloud-controller-0.internalapi.localdomain:5672,guest:9bakaEc7Zr7GqUYkwsWYuJDQm@overcloud-controller-1.internalapi.localdomain:5672,guest:9bakaEc7Zr7GqUYkwsWYuJDQm@overcloud-controller-2.internalapi.localdomain:5672/?ssl=0', 'created_at': datetime.datetime(2017, 2, 27, 18, 35, 16, 812201), 'updated_at': None, 'uuid': 'ce8d3b0d-a969-4de4-82af-67e3bd9d11e5'}]", "stdout_lines": ["An error has occurred:", "Traceback (most recent call

2. While Step4 on controller-0 started at:
2017-02-27 18:33:38Z [overcloud-AllNodesDeploySteps-ozxvts356czy.ControllerUpgrade_Step4]: CREATE_IN_PROGRESS state changed
and finished at:
2017-02-27 18:34:31Z [overcloud-AllNodesDeploySteps-ozxvts356czy.ControllerUpgrade_Step4]: CREATE_COMPLETE state changed

3. Yet galera was not ready at that time; it first became ready only afterwards:
galera(galera)[341034]: 2017/02/27_18:35:33 INFO: Galera started

So we need to double-check that the code in puppet/services/pacemaker.yaml that is supposed to wait for all services is working correctly at Step4. It clearly did not wait for galera to be promoted to master on every node, and that is likely what caused this issue.

The cause might be either a) ansible-pacemaker needs to make sure that the resource is master on all nodes, *or* b) the resource list at https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/services/pacemaker.yaml#L93 is missing all the other pacemaker-managed resources.
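For illustration, the kind of check the upgrade step needs can be sketched as below. This is a hypothetical helper, not the actual ansible-pacemaker code, and the `pcs status` output format shown for the master/slave set is an assumption:

```python
import re

def galera_master_nodes(pcs_status_output):
    """Return the node names on which the galera resource is reported
    as Master in `pcs status` output (format assumed, pacemaker 1.x style)."""
    # Typical line: "     Masters: [ overcloud-controller-0 overcloud-controller-1 ]"
    match = re.search(r"Masters:\s*\[\s*([^\]]*?)\s*\]", pcs_status_output)
    if not match:
        return []
    return match.group(1).split()

def galera_ready(pcs_status_output, expected_nodes):
    """True only when galera is Master on every expected controller."""
    return set(galera_master_nodes(pcs_status_output)) >= set(expected_nodes)

# Example: galera promoted on only two of three controllers -> not ready yet,
# which is exactly the window in which nova-manage hit the lost connection.
sample = """
 Master/Slave Set: galera-master [galera]
     Masters: [ overcloud-controller-0 overcloud-controller-1 ]
     Slaves: [ overcloud-controller-2 ]
"""
controllers = ["overcloud-controller-0",
               "overcloud-controller-1",
               "overcloud-controller-2"]
print(galera_ready(sample, controllers))  # -> False
```

A wait loop at Step4 would poll such a check (for galera and the other pacemaker-managed resources) until it returns True before letting the cell_v2 tasks run.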

How reproducible: often

Steps to Reproduce:
1. Deploy an overcloud with multiple controllers.
2. Run the OSP10 -> OSP11 upgrade.

Comment 1 Red Hat Bugzilla Rules Engine 2017-03-22 10:00:43 UTC
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.

Comment 2 Sofer Athlan-Guyot 2017-03-22 10:06:33 UTC

The gerrithub review in ansible-pacemaker needs to land before, or at the same time as, the review in tripleo-heat-templates.


Comment 5 errata-xmlrpc 2017-05-17 20:11:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

