Bug 1365884 - MySQL Galera fails to start
Summary: MySQL Galera fails to start
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: RDO
Classification: Community
Component: openstack-tripleo
Version: Mitaka
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: trunk
Assignee: James Slagle
QA Contact: Shai Revivo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-10 12:05 UTC by Christopher Brown
Modified: 2017-02-15 17:32 UTC
CC List: 2 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2017-02-15 17:32:48 UTC
Embargoed:


Attachments
sosreports from controller nodes (19.48 MB, application/x-gzip)
2016-08-11 01:36 UTC, Graeme Gillies

Description Christopher Brown 2016-08-10 12:05:32 UTC
Description of problem:

Deployment currently fails with:

2016-08-10 11:28:30 [0]: CREATE_FAILED Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
2016-08-10 11:28:31 [0]: SIGNAL_COMPLETE Unknown
2016-08-10 11:28:31 [overcloud-ControllerNodesPostDeployment-aa2zt557lizs-ControllerServicesBaseDeployment_Step2-ualpuau5qu3e]: CREATE_FAILED Resource CREATE failed: Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6

pcs status shows:

 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-ctrl-0 overcloud-ctrl-1 overcloud-ctrl-2 ]
 ip-10.122.4.8  (ocf::heartbeat:IPaddr2):       Started overcloud-ctrl-1
 ip-10.122.4.9  (ocf::heartbeat:IPaddr2):       Started overcloud-ctrl-1
 Clone Set: openstack-core-clone [openstack-core]
     Started: [ overcloud-ctrl-0 overcloud-ctrl-1 overcloud-ctrl-2 ]
 Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-ctrl-2 ]
     Slaves: [ overcloud-ctrl-0 overcloud-ctrl-1 ]
 Master/Slave Set: galera-master [galera]
     galera     (ocf::heartbeat:galera):        FAILED Master overcloud-ctrl-2 (unmanaged)
     galera     (ocf::heartbeat:galera):        FAILED Master overcloud-ctrl-0 (unmanaged)
     Masters: [ overcloud-ctrl-1 ]
 Clone Set: mongod-clone [mongod]
     Started: [ overcloud-ctrl-0 overcloud-ctrl-1 overcloud-ctrl-2 ]
 Clone Set: memcached-clone [memcached]
     Started: [ overcloud-ctrl-0 overcloud-ctrl-1 overcloud-ctrl-2 ]

Failed Actions:
* galera_promote_0 on overcloud-ctrl-2 'unknown error' (1): call=68, status=complete, exitreason='MySQL server failed to start (pid=8941) (rc=0), please check your installation',
    last-rc-change='Wed Aug 10 12:40:13 2016', queued=0ms, exec=38480ms
* ip-10.122.4.9_monitor_10000 on overcloud-ctrl-1 'unknown' (189): call=76, status=Error, exitreason='none',
    last-rc-change='Wed Aug 10 12:38:33 2016', queued=0ms, exec=0ms
* ip-10.122.4.8_monitor_10000 on overcloud-ctrl-1 'unknown error' (1): call=-1, status=Timed Out, exitreason='none',
    last-rc-change='Wed Aug 10 12:39:53 2016', queued=0ms, exec=0ms
* openstack-core_monitor_10000 on overcloud-ctrl-1 'unknown' (189): call=81, status=Error, exitreason='none',
    last-rc-change='Wed Aug 10 12:38:33 2016', queued=0ms, exec=0ms
* galera_promote_0 on overcloud-ctrl-0 'unknown error' (1): call=70, status=complete, exitreason='MySQL server failed to start (pid=26882) (rc=0), please check your installation',
    last-rc-change='Wed Aug 10 12:40:13 2016', queued=0ms, exec=38507ms


PCSD Status:
  overcloud-ctrl-0: Online
  overcloud-ctrl-1: Online
  overcloud-ctrl-2: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


mysql logs show:

2016-08-10 12:40:49 139724652165248 [Note] WSREP: view((empty))
2016-08-10 12:40:49 139724652165248 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
         at gcomm/src/pc.cpp:connect():162
2016-08-10 12:40:49 139724652165248 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():206: Failed to open backend connection: -110 (Connection timed out)
2016-08-10 12:40:49 139724652165248 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1379: Failed to open channel 'galera_cluster' at 'gcomm://overcloud-ctrl-exeter-0,overcloud-ctrl-exeter-1,overcloud-ctrl-exeter-2': -110 (Connection timed out)
2016-08-10 12:40:49 139724652165248 [ERROR] WSREP: gcs connect failed: Connection timed out
2016-08-10 12:40:49 139724652165248 [ERROR] WSREP: wsrep::connect(gcomm://overcloud-ctrl-exeter-0,overcloud-ctrl-exeter-1,overcloud-ctrl-exeter-2) failed: 7
2016-08-10 12:40:49 139724652165248 [ERROR] Aborting
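
For first-pass diagnosis: the "failed to reach primary view ... Connection timed out" errors mean each node timed out contacting its peers on the Galera group communication port. A minimal connectivity check from one controller, using the hostnames from the gcomm:// URL above and the standard Galera ports (4567 group communication, 4568 IST, 4444 SST), would look roughly like this:

# ss -tlnp | grep -E ':(3306|4567)'
# timeout 5 bash -c '</dev/tcp/overcloud-ctrl-exeter-1/4567' && echo reachable
# timeout 5 bash -c '</dev/tcp/overcloud-ctrl-exeter-2/4567' && echo reachable
# iptables -nL | grep -E '4567|4568|4444'

The first command shows whether anything is listening locally on the MySQL and replication ports, the /dev/tcp checks show whether the peers' group communication port can be opened at all, and the last shows whether the firewall allows the Galera/SST ports.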


Version-Release number of selected component (if applicable):

This is using the current Mitaka stable images. Nightly Delorean images do not exhibit this problem.

How reproducible:

Always

Steps to Reproduce:
1. Build images as follows:

Edit /usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_image.py and add rdo-release to build from stable

# mkdir ~/images
# cd ~/images
# export RDO_RELEASE=mitaka
# openstack overcloud image build --all
# openstack overcloud image upload --update-existing

2. Deploy overcloud
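
To see what actually ended up in the built image before uploading and deploying it, the database packages can be listed straight from overcloud-full.qcow2 (a minimal sketch, assuming libguestfs-tools is installed on the build host and the default image name):

# guestmount -a overcloud-full.qcow2 -i --ro /mnt
# rpm -qa --root /mnt | grep -iE 'galera|mariadb'
# guestunmount /mnt

If more than one galera source was available during the build, the version reported here is the one the overcloud will run with.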

Comment 1 Graeme Gillies 2016-08-10 23:31:00 UTC
I've experienced the same problems as well, and it looks like different versions of the resource-agents package produce different results.

Technically this should be moved under the resource-agents component, which might mean moving it out of the RDO product to CentOS, but I'll get someone to confirm.

Comment 2 Graeme Gillies 2016-08-11 01:04:18 UTC
I have an environment that can 100% reproduce this

Looking closer, this might not be a resource-agents issue; it may instead be an issue with the mariadb or galera packages themselves.

Comment 4 Graeme Gillies 2016-08-11 01:36:37 UTC
Created attachment 1189856 [details]
sosreports from controller nodes

sosreports from controller nodes experiencing problem are attached

Comment 5 Graeme Gillies 2016-08-12 00:15:00 UTC
Ok I have narrowed down the issue.

It looks like the version of galera we should be using is

galera-25.3.5-6.el7.x86_64

This is provided by the openstack-mitaka repo.

If you have EPEL enabled on the machine, you will instead get

galera-25.3.12-2.el7.x86_64

This version obviously has a compatibility issue with the version of mysql-server-galera we are using.

Basically, you need to make absolutely sure you don't have EPEL enabled on your overcloud images, and make sure the version of galera you are using is the one we ship as part of RDO.

This is a problem because python-openstackclient forces EPEL to be enabled, even though it shouldn't be (apparently due to something in diskimage-builder).
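
As a sanity check on the image build host, something like the following shows where galera is coming from and keeps the EPEL build out of the picture (a sketch, assuming yum-utils is installed; repo IDs may differ):

# yum repolist enabled | grep -i epel
# yum --showduplicates list galera
# yum-config-manager --disable epel
# yum-config-manager --save --setopt='epel.exclude=galera*'

The first two commands show whether EPEL is enabled and which repos offer which galera versions; the last two are alternative workarounds, either disabling EPEL entirely or just excluding its galera packages.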

Comment 6 Haïkel Guémar 2016-08-17 15:09:47 UTC
No, the issue is that you're using galera instead of mariadb 10.1 directly.

The standalone Galera package has been deprecated by MariaDB 10.1, which now bundles it. We've been shipping MariaDB 10.1 since the Mitaka release.
http://cbs.centos.org/koji/buildinfo?buildID=10246

In short, do not use standalone galera for Mitaka and newer releases.
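
If it helps to check this on a deployed controller, the following shows which galera/mariadb packages are installed and which package owns the wsrep provider library the cluster actually loads (a sketch; the config and library paths below are the usual EL7 locations but may differ between builds):

# rpm -qa | grep -iE 'galera|mariadb'
# grep wsrep_provider /etc/my.cnf.d/galera.cnf
# rpm -qf /usr/lib64/galera/libgalera_smm.so

The last command reports which package owns the provider library that wsrep_provider points at.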

Comment 7 Christopher Brown 2016-08-17 19:12:01 UTC
(In reply to Haïkel Guémar from comment #6)
> No, the issue is that you're using galera instead of mariadb 10.1 directly.
> 
> Galera (standalone version) has been deprecated by MariaDB 10.1 which now
> includes the former. We've been shipping it since Mitaka release.
> http://cbs.centos.org/koji/buildinfo?buildID=10246
> 
> In short, do not use standalone galera for Mitaka and newer releases.

But that is what gets rolled into the image when we build it? "We" are not requesting Galera at any point.

Comment 8 Graeme Gillies 2016-08-17 22:56:55 UTC
(In reply to Haïkel Guémar from comment #6)
> No, the issue is that you're using galera instead of mariadb 10.1 directly.
> 
> Galera (standalone version) has been deprecated by MariaDB 10.1 which now
> includes the former. We've been shipping it since Mitaka release.
> http://cbs.centos.org/koji/buildinfo?buildID=10246
> 
> In short, do not use standalone galera for Mitaka and newer releases.

Yes, we understand that; the problem is that the TripleO image building process

openstack overcloud image build --all

is pulling in EPEL and the bad galera package. We have no way of avoiding that; the process itself is broken. So "we" aren't doing anything, TripleO is.

The reason this has slipped through is that the way images are built as part of CI and in CBS is different from how users are expected to build these images.
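
One way to confirm this on a locally built image, without deploying it, is to check whether the EPEL repo file was baked in and left enabled (a read-only sketch with libguestfs-tools; the default image name is assumed):

# virt-ls -a overcloud-full.qcow2 /etc/yum.repos.d/
# virt-cat -a overcloud-full.qcow2 /etc/yum.repos.d/epel.repo | grep -E '^\[|enabled'

The first command lists the repo files inside the image (an epel.repo there means the build pulled EPEL in); the second shows whether its repos are still enabled, and simply fails if the file does not exist.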

Comment 9 Christopher Brown 2017-02-15 17:32:48 UTC
Hello,

I'm closing this. I'm not sure if the general QA issue with image building has been addressed, but I just ended up using the CBS images.

Will re-open on the next RDO deployment if the issue persists.

