Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1158829

Summary: MariaDB failed to start after server boot
Product: Red Hat OpenStack
Reporter: Asaf Hirshberg <ahirshbe>
Component: openstack-foreman-installer
Assignee: Jason Guiditta <jguiditt>
Status: CLOSED ERRATA
QA Contact: Asaf Hirshberg <ahirshbe>
Severity: high
Docs Contact:
Priority: high
Version: 5.0 (RHEL 6)
CC: aberezin, adahms, ahirshbe, ddomingo, dmacpher, fdinitto, jdexter, jguiditt, lnatapov, lyarwood, mburns, morazi, oblaut, racedoro, rhos-maint, rohara, yeylon
Target Milestone: ga
Keywords: ZStream
Target Release: Installer
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-foreman-installer-3.0.3-1.el7ost
Doc Type: Bug Fix
Doc Text:
MariaDB failed to start on controller nodes in high-availability environments due to issues with how systemd managed the MariaDB Galera Cluster. Red Hat Enterprise Linux OpenStack Platform 6.0 now uses Galera's resource-agent to manage MariaDB clusters, which resolves this issue.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-02-09 15:17:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
- puma44-46 are the controllers, var/log/messages and mariadb.log
- pcs config output

Description Asaf Hirshberg 2014-10-30 10:27:09 UTC
Created attachment 952051 [details]
puma44-46 are the controllers, var/log/messages and mariadb.log

Description of problem:
I deployed HA-neutron on bare metal with fencing and, after some tests, rebooted all three servers at the same time to check the recovery of the cluster and the database.
As they finished the boot process, I ran "pcs status" to check pacemaker and saw lots of failed resources (rabbitmq-server, neutron-server, neutron-ovs-cleanup, mysqld, etc.).
I checked the MariaDB status on the hosts: it was active on only one, and down on the others with "Failed to start MariaDB database server". So it looks like another problem in galera.

Adding the logs of the three servers.

Comment 3 Fabio Massimo Di Nitto 2014-10-30 12:18:42 UTC
Could you please attach pcs config output?

Comment 4 Jason Guiditta 2014-10-30 12:51:03 UTC
I don't think galera is expected to recover by itself when you reboot all three machines. I remember trying this months ago and being told "don't do that". Ryan, am I remembering wrong?

Comment 5 Fabio Massimo Di Nitto 2014-10-30 12:53:00 UTC
(In reply to Jason Guiditta from comment #4)
> I don't think galera is expected to recover by itself when you reboot all
> three machines.  I remember trying this months ago and being told 'dont do
> that'.  Ryan, am I remembering wrong?

That used to be the case with the old systemd way of managing galera. Did we move to using the resource agent instead, which solves this exact problem?

Comment 6 Asaf Hirshberg 2014-10-30 13:00:12 UTC
Created attachment 952108 [details]
pcs config output

Comment 7 Fabio Massimo Di Nitto 2014-10-30 13:13:24 UTC
 Clone: mysqld-clone
  Resource: mysqld (class=systemd type=mysqld)
   Attributes: timeout=500s 
   Operations: monitor interval=30s (mysqld-monitor-interval-30s)
               start interval=0s timeout=120s (mysqld-start-timeout-120s)

The installation is still using the systemd resource, which cannot handle a simultaneous reboot of all three nodes.

This will be fixed once deployments use the galera resource-agent.
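For reference, a resource-agent-based configuration would look roughly like the following. This is a sketch only: the node names are placeholders, and the option values are illustrative rather than taken from this deployment.

```shell
# Sketch: replace the plain systemd mysqld resource with the
# ocf:heartbeat:galera resource agent, run as a multi-state clone so
# pacemaker can bootstrap the cluster after a full shutdown.
# Node names ctrl1..ctrl3 are placeholders.
pcs resource create galera galera \
    enable_creation=true \
    wsrep_cluster_address="gcomm://ctrl1,ctrl2,ctrl3" \
    meta master-max=3 ordered=true \
    op promote timeout=300s on-fail=block \
    --master
```

The key difference from the systemd resource above is that the galera agent itself decides which node to bootstrap from after a simultaneous reboot, instead of every node trying to join a cluster that no longer exists.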

Comment 8 Jason Guiditta 2014-10-30 13:31:47 UTC
(In reply to Fabio Massimo Di Nitto from comment #5)
> (In reply to Jason Guiditta from comment #4)
> > I don't think galera is expected to recover by itself when you reboot all
> > three machines.  I remember trying this months ago and being told 'dont do
> > that'.  Ryan, am I remembering wrong?
> 
> That used to be case with the old systemd way to manage galera. Did we move
> to use the resource agent instead that solves this exact problem?

No, I believe this was planned for OSP6/Juno

Comment 9 Ryan O'Hara 2014-10-30 13:43:31 UTC
(In reply to Asaf Hirshberg from comment #0)
> Created attachment 952051 [details]
> puma44-46 are the controllers, var/log/messages and mariadb.log
> 
> Description of problem:
> I deployed HA-neutron on baremetal with fencing and after some tests I
> booted all three server in the same time to check the recovery of cluster
> and the db.

You rebooted all three nodes? You can't do that.

(In reply to Fabio Massimo Di Nitto from comment #7)
> The installation is still using the systemd resource that cannot handle a 3
> node simultaneous reboot.

This is correct.

Comment 12 Leonid Natapov 2014-12-01 14:54:03 UTC
The workaround is:

Add

wsrep_cluster_address=gcomm://

in /etc/my.cnf.d/galera.cnf on one of the cluster nodes.

Start the mariadb service on that node. After it starts, start mariadb on all cluster nodes.

Comment 13 Leonid Natapov 2014-12-01 19:38:03 UTC
Once again:

Add

wsrep_cluster_address=gcomm://

in /etc/my.cnf.d/galera.cnf on one of the cluster nodes.
After adding this line, start mariadb on that node (systemctl start mariadb).

After mariadb successfully starts on that node, start mariadb on all other cluster nodes.

Comment 14 Ryan O'Hara 2014-12-01 19:40:46 UTC
(In reply to Leonid Natapov from comment #13)
> Once again.
> 
> Add 
> 
> wsrep_cluster_address=gcomm:// 
> 
> in /etc/my.cnf.d/galera.cnf on one of the cluster nodes.
> After adding this line, start mariadb on that node (systemctl start mariadb)
> 
> After mariadb successfully starts on that node ,start mariadb on all other
> cluster nodes.

It is not that simple. You need to bootstrap the node that has the most recent view of the database. Randomly selecting a node to bootstrap can potentially cause data loss.
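One common way to determine which node has the most recent view (a sketch, assuming the default MariaDB datadir and a clean shutdown on each node) is to compare the seqno each node recorded in its grastate.dat:

```shell
# Helper: print the Galera sequence number from a grastate.dat file
# (typically /var/lib/mysql/grastate.dat). The node with the highest
# seqno across the cluster has the most recent view of the database;
# a seqno of -1 means an unclean shutdown, and the position must be
# recovered another way (e.g. mysqld --wsrep-recover).
galera_seqno() {
    awk -F': *' '$1 == "seqno" { print $2 }' "$1"
}
```

Running `galera_seqno /var/lib/mysql/grastate.dat` on each node and bootstrapping the one with the highest value avoids the data-loss risk described above.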

Comment 15 Jason Guiditta 2014-12-03 14:20:01 UTC
Ryan, who would best be able to describe the steps for some doc text here? You have the general steps above; is that enough? Something like:
* Determine which node has the most recent view of the database (not sure how), then edit /etc/my.cnf.d/galera.cnf on that node to have a blank list of cluster addresses so the line looks like:
  wsrep_cluster_address=gcomm://
* Start mariadb on that node (systemctl start mariadb)
* Once started, start the other two nodes
* From any node, run 'pcs resource cleanup mariadb' to get the resource back to a non-error state from pacemaker's view.

Note that these steps are for OSP 5 only, OSP 6 will use the galera resource-agent.
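The steps above can be sketched as the following command sequence. This is a sketch only: it assumes the bootstrap node has already been identified, and the resource name in the cleanup step is taken from this deployment's pcs config (mysqld), so substitute whatever 'pcs status' shows.

```shell
# On the node with the most recent view of the database ONLY:
# point wsrep at an empty cluster address so it bootstraps a new cluster.
sed -i 's|^wsrep_cluster_address=.*|wsrep_cluster_address=gcomm://|' \
    /etc/my.cnf.d/galera.cnf
systemctl start mariadb

# Then, on each of the other two nodes, join the running cluster:
systemctl start mariadb

# Finally, from any node, clear the failed-resource state in pacemaker.
# Use the resource name shown by 'pcs status' (mysqld in this setup):
pcs resource cleanup mysqld
```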

Comment 16 Ryan O'Hara 2014-12-03 14:46:56 UTC
(In reply to Jason Guiditta from comment #15)
> Ryan, who would be able to best describe the steps for some doc text here? 
> You have the general steps above, is that enough?  Something like:
> * Determine which node has the most recent view of the database (not sure
> how), then edit /etc/my.cnf.d/galera.cnf on that node to have a blank list
> of cluster addresses so the line looks like:
>   wsrep_cluster_address=gcomm://
> * Start mariadb on that node (systemctl start mariadb)
> * Once started, start the other 2 nodes
> * From any node, run 'pcs resource cleanup mariadb' to get the resource back
> to non-error state from pacemaker's view.

This is exactly right. I sent some links to Leonid that describe how to do this, and I think he sent those to the internal mailing list. Can you check those?

> Note that these steps are for OSP 5 only, OSP 6 will use the galera
> resource-agent.

Right.

Comment 25 Asaf Hirshberg 2015-01-13 08:28:36 UTC
Verified.

foreman-installer-1.6.0-0.2.RC1.el7ost.noarch
rhel-osp-installer-client-0.5.4-1.el7ost.noarch
openstack-foreman-installer-3.0.8-1.el7ost.noarch
rhel-osp-installer-0.5.4-1.el7ost.noarch

Comment 27 errata-xmlrpc 2015-02-09 15:17:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0156.html