Bug 1434748

Summary: OSP10 -> OSP11 upgrades do not wait for galera to be fully up.
Product: Red Hat OpenStack Reporter: Sofer Athlan-Guyot <sathlang>
Component: openstack-tripleo-heat-templates    Assignee: Sofer Athlan-Guyot <sathlang>
Status: CLOSED ERRATA QA Contact: Marius Cornea <mcornea>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 11.0 (Ocata)    CC: jcoufal, mburns, rhel-osp-director-maint, sclewis
Target Milestone: rc    Keywords: Triaged
Target Release: 11.0 (Ocata)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-6.0.0-0.5.el7ost ansible-pacemaker-1.0.1-0.20170308085805.381d6c8.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-05-17 20:11:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Sofer Athlan-Guyot 2017-03-22 10:00:37 UTC
Originally reported there: https://bugs.launchpad.net/tripleo/+bug/1668372 by Michele.

Description of problem:

I have observed the following during one of my HA N->O upgrade runs (note the lost connection):
1. Feb 27 18:35:17 overcloud-controller-0 os-collect-config[2078]: TASK [Setup cell_v2 (migrate hosts)] *******************************************
Feb 27 18:35:17 overcloud-controller-0 os-collect-config[2078]: fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["nova-manage", "cell_v2", "map_cell_and_hosts"], "delta": "0:00:04.102342", "end": "2017-02-27 18:35:17.504998", "failed": true, "rc": 1, "start": "2017-02-27 18:35:13.402656", "stderr": "", "stdout": "An error has occurred:\nTraceback (most recent call last):\n File \"/usr/lib/python2.7/site-packages/nova/cmd/manage.py\", line 1594, in main\n ret = fn(*fn_args, **fn_kwargs)\n File \"/usr/lib/python2.7/site-packages/nova/cmd/man
....
Feb 27 18:35:17 overcloud-controller-0 os-collect-config[2078]: _query_result\n result.read()\n File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 1312, in read\n first_packet = self.connection._read_packet()\n File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 971, in _read_packet\n packet_header = self._read_bytes(4)\n File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 1008, in _read_bytes\n 2013, \"Lost connection to MySQL server during query\")\nDBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'INSERT INTO cell_mappings (created_at, updated_at, uuid, name, transport_url, database_connection) VALUES (%(created_at)s, %(updated_at)s, %(uuid)s, %(name)s, %(transport_url)s, %(database_connection)s)'] [parameters: {'database_connection': u'mysql+pymysql://nova:N8AUkJGgVewYzdCdC6rPTfr8B.2.11/nova?bind_address=172.16.2.13', 'name': None, 'transport_url': u'rabbit://guest:9bakaEc7Zr7GqUYkwsWYuJDQm.localdomain:5672,guest:9bakaEc7Zr7GqUYkwsWYuJDQm.localdomain:5672,guest:9bakaEc7Zr7GqUYkwsWYuJDQm.localdomain:5672/?ssl=0', 'created_at': datetime.datetime(2017, 2, 27, 18, 35, 16, 812201), 'updated_at': None, 'uuid': 'ce8d3b0d-a969-4de4-82af-67e3bd9d11e5'}]", "stdout_lines": ["An error has occurred:", "Traceback (most recent call
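
The task above fails because nova-manage hits the database while galera is still coming up. As a minimal sketch of the kind of readiness gate that is missing (illustrative only: it assumes root's MySQL credentials are available, e.g. via /root/.my.cnf, and the polling interval is arbitrary; this is not the actual upgrade code):

  # Wait until the local galera node reports Synced, then run the
  # cell_v2 mapping that failed in the log above.
  until mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'" | grep -q Synced; do
      echo "galera not Synced yet, waiting..."
      sleep 5
  done
  nova-manage cell_v2 map_cell_and_hosts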

2. While Step4 on controller-0 started at:
2017-02-27 18:33:38Z [overcloud-AllNodesDeploySteps-ozxvts356czy.ControllerUpgrade_Step4]: CREATE_IN_PROGRESS state changed
and finished at:
2017-02-27 18:34:31Z [overcloud-AllNodesDeploySteps-ozxvts356czy.ControllerUpgrade_Step4]: CREATE_COMPLETE state changed

3. Yet galera was not ready at that time; it first became ready only afterwards:
galera(galera)[341034]: 2017/02/27_18:35:33 INFO: Galera started

So we need to double-check that the code in puppet/services/pacemaker.yaml that is supposed to wait for all services is actually working at Step4. It clearly did not wait for galera to be master everywhere, and that is likely what caused this issue.

It might be either a) ansible-pacemaker, which needs to make sure that the resource is master on all nodes, *or* b) the fact that at https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/services/pacemaker.yaml#L93 we are missing all the other pacemaker-managed resources.
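
To illustrate the kind of gate Step4 appears to be missing, here is a rough shell sketch that waits until pacemaker reports the galera resource promoted (Master) on every controller. The controller count, the parsing of `pcs status` output and the timeout are assumptions for a typical 3-controller HA overcloud with the node names seen in the log above; the real fix belongs in ansible-pacemaker / tripleo-heat-templates rather than in an ad-hoc script, and the same kind of wait would be needed for the other pacemaker-managed resources mentioned in b).

  #!/bin/bash
  # Illustrative wait-for-galera-master loop (not the actual THT/ansible-pacemaker code).
  EXPECTED_MASTERS=3   # assumption: 3 controller nodes
  TIMEOUT=600
  elapsed=0
  while true; do
      # Count the controllers listed on the "Masters:" line of the galera clone set.
      masters=$(pcs status | grep -A1 'galera' | grep 'Masters:' \
                | grep -o 'overcloud-controller-[0-9]*' | wc -l)
      if [ "$masters" -ge "$EXPECTED_MASTERS" ]; then
          echo "galera is Master on all ${EXPECTED_MASTERS} controllers"
          break
      fi
      if [ "$elapsed" -ge "$TIMEOUT" ]; then
          echo "timed out waiting for galera to be promoted on all nodes" >&2
          exit 1
      fi
      sleep 10
      elapsed=$((elapsed + 10))
  done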


How reproducible: often


Steps to Reproduce:
1. Deploy an overcloud with multiple controllers.
2. Run the OSP10 -> OSP11 upgrade.

Comment 1 Red Hat Bugzilla Rules Engine 2017-03-22 10:00:43 UTC
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.

Comment 2 Sofer Athlan-Guyot 2017-03-22 10:06:33 UTC
Hi,

the gerrithub review in ansible-pacemaker needs to land before, or at the same time as, the review in tht.

Thanks,

Comment 5 errata-xmlrpc 2017-05-17 20:11:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1245