Originally reported there: https://bugs.launchpad.net/tripleo/+bug/1668372 by Michele.

Description of problem:

I have observed the following during one of my HA N->O upgrade runs (note the lost connection):

1.
Feb 27 18:35:17 overcloud-controller-0 os-collect-config[2078]: TASK [Setup cell_v2 (migrate hosts)] *******************************************
Feb 27 18:35:17 overcloud-controller-0 os-collect-config[2078]: fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["nova-manage", "cell_v2", "map_cell_and_hosts"], "delta": "0:00:04.102342", "end": "2017-02-27 18:35:17.504998", "failed": true, "rc": 1, "start": "2017-02-27 18:35:13.402656", "stderr": "", "stdout": "An error has occurred:\nTraceback (most recent call last):\n File \"/usr/lib/python2.7/site-packages/nova/cmd/manage.py\", line 1594, in main\n ret = fn(*fn_args, **fn_kwargs)\n File \"/usr/lib/python2.7/site-packages/nova/cmd/man ....
Feb 27 18:35:17 overcloud-controller-0 os-collect-config[2078]: _query_result\n result.read()\n File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 1312, in read\n first_packet = self.connection._read_packet()\n File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 971, in _read_packet\n packet_header = self._read_bytes(4)\n File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 1008, in _read_bytes\n 2013, \"Lost connection to MySQL server during query\")\nDBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'INSERT INTO cell_mappings (created_at, updated_at, uuid, name, transport_url, database_connection) VALUES (%(created_at)s, %(updated_at)s, %(uuid)s, %(name)s, %(transport_url)s, %(database_connection)s)'] [parameters: {'database_connection': u'mysql+pymysql://nova:N8AUkJGgVewYzdCdC6rPTfr8B.2.11/nova?bind_address=172.16.2.13', 'name': None, 'transport_url': u'rabbit://guest:9bakaEc7Zr7GqUYkwsWYuJDQm.localdomain:5672,guest:9bakaEc7Zr7GqUYkwsWYuJDQm.localdomain:5672,guest:9bakaEc7Zr7GqUYkwsWYuJDQm.localdomain:5672/?ssl=0', 'created_at': datetime.datetime(2017, 2, 27, 18, 35, 16, 812201), 'updated_at': None, 'uuid': 'ce8d3b0d-a969-4de4-82af-67e3bd9d11e5'}]", "stdout_lines": ["An error has occurred:", "Traceback (most recent call

2. While Step4 on controller-0 started at:
2017-02-27 18:33:38Z [overcloud-AllNodesDeploySteps-ozxvts356czy.ControllerUpgrade_Step4]: CREATE_IN_PROGRESS state changed
and finished at:
2017-02-27 18:34:31Z [overcloud-AllNodesDeploySteps-ozxvts356czy.ControllerUpgrade_Step4]: CREATE_COMPLETE state changed

3. Yet galera was not yet ready at that time, and the first time it was ready was actually afterwards:
galera(galera)[341034]: 2017/02/27_18:35:33 INFO: Galera started

So we need to double check that the code that is supposed to wait for all services in puppet/services/pacemaker.yaml is working correctly at Step4. It clearly did not wait for galera to be master everywhere, and that is likely what caused this issue. It might be either a) ansible-pacemaker that needs to make sure that the resource is master on all nodes, *or* b) it is due to the fact that here https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/services/pacemaker.yaml#L93 we are missing all the other pacemaker managed resources.

How reproducible:
often

Steps to Reproduce:
1. deploy overcloud with multiple controllers
2. upgrade
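
To illustrate option a) above, here is a minimal sketch of what "wait until galera is promoted on every controller before running the cell_v2 migration" could look like as an Ansible task. This is illustrative only, not the actual ansible-pacemaker or tripleo-heat-templates change: the grep pattern, the expected count of 3 masters and the retry timings are assumptions for a 3-controller environment, and pcs status output formatting varies between versions.

  # Sketch only: poll pcs status until the galera master/slave set reports
  # the expected number of Masters (3 assumed here) before proceeding.
  - name: Wait for galera to be Master on all controllers (sketch)
    shell: >
      pcs status | grep -A2 'Master/Slave Set: galera-master' |
      grep 'Masters:' | grep -o 'overcloud-controller-[0-9]*' | wc -l
    register: galera_masters
    until: (galera_masters.stdout | int) >= 3
    retries: 30
    delay: 10
    become: true

Option b) would instead extend the resource list used by the existing wait in puppet/services/pacemaker.yaml so that galera (and the other pacemaker managed resources) are covered by the same check.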
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.
Hi, the gerrithub review in ansible-pacemaker needs to land before, or at the same time as, the review in tht. Thanks,
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1245