Bug 1158829
| Summary: | MariaDB failed to start after server boot | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Asaf Hirshberg <ahirshbe> |
| Component: | openstack-foreman-installer | Assignee: | Jason Guiditta <jguiditt> |
| Status: | CLOSED ERRATA | QA Contact: | Asaf Hirshberg <ahirshbe> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 5.0 (RHEL 6) | CC: | aberezin, adahms, ahirshbe, ddomingo, dmacpher, fdinitto, jdexter, jguiditt, lnatapov, lyarwood, mburns, morazi, oblaut, racedoro, rhos-maint, rohara, yeylon |
| Target Milestone: | ga | Keywords: | ZStream |
| Target Release: | Installer | Hardware: | Unspecified |
| OS: | Unspecified | Type: | Bug |
| Fixed In Version: | openstack-foreman-installer-3.0.3-1.el7ost | Doc Type: | Bug Fix |
| Last Closed: | 2015-02-09 15:17:44 UTC | | |

Doc Text:

> MariaDB failed to start on controller nodes in high-availability environments due to issues with how systemd managed the MariaDB Galera Cluster. Red Hat Enterprise Linux OpenStack Platform 6.0 now uses Galera's resource agent to manage MariaDB clusters, which resolves this issue.
Could you please attach pcs config output?

I don't think galera is expected to recover by itself when you reboot all three machines. I remember trying this months ago and being told "don't do that". Ryan, am I remembering wrong?

(In reply to Jason Guiditta from comment #4)
> I don't think galera is expected to recover by itself when you reboot all
> three machines. I remember trying this months ago and being told 'dont do
> that'. Ryan, am I remembering wrong?

That used to be the case with the old systemd way of managing galera. Did we move to using the resource agent instead, which solves this exact problem?

Created attachment 952108 [details]
pcs config output:
    Clone: mysqld-clone
     Resource: mysqld (class=systemd type=mysqld)
      Attributes: timeout=500s
      Operations: monitor interval=30s (mysqld-monitor-interval-30s)
                  start interval=0s timeout=120s (mysqld-start-timeout-120s)
The installation is still using the systemd resource, which cannot handle a simultaneous reboot of all three nodes. This will be fixed once deployment uses the galera resource agent.
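For reference, a Pacemaker-managed Galera cluster using the `galera` resource agent (the approach planned for OSP 6, per the comments below) is typically configured along these lines. This is a sketch of the cluster configuration, not the exact installer output: the node names are placeholders, and option values may differ by release.

```shell
# Sketch: manage MariaDB with the Galera resource agent instead of the
# plain systemd "mysqld" resource. Unlike the systemd resource, the
# agent can bootstrap the cluster after all nodes go down at once.
# Node names (ctrl1..ctrl3) are placeholders.
pcs resource create galera ocf:heartbeat:galera \
    enable_creation=true \
    wsrep_cluster_address="gcomm://ctrl1,ctrl2,ctrl3" \
    meta master-max=3 ordered=true \
    op promote timeout=300s on-fail=block \
    --master
```

Because this modifies live cluster configuration, it is shown here only as a configuration fragment.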
(In reply to Fabio Massimo Di Nitto from comment #5)
> That used to be case with the old systemd way to manage galera. Did we move
> to use the resource agent instead that solves this exact problem?

No, I believe this was planned for OSP 6/Juno.

(In reply to Asaf Hirshberg from comment #0)
> I deployed HA-neutron on baremetal with fencing and after some tests I
> booted all three servers at the same time to check the recovery of the
> cluster and the db.

You rebooted all three nodes? You can't do that.

(In reply to Fabio Massimo Di Nitto from comment #7)
> The installation is still using the systemd resource that cannot handle a 3
> node simultaneous reboot.

This is correct. The workaround is:

1. Add `wsrep_cluster_address=gcomm://` to /etc/my.cnf.d/galera.cnf on one of the cluster nodes.
2. Start the mariadb service on that node.
3. After it starts, start mariadb on all other cluster nodes.

Once again:

1. Add `wsrep_cluster_address=gcomm://` to /etc/my.cnf.d/galera.cnf on one of the cluster nodes.
2. After adding this line, start mariadb on that node (`systemctl start mariadb`).
3. After mariadb successfully starts on that node, start mariadb on all other cluster nodes.

(In reply to Leonid Natapov from comment #13)
> Add wsrep_cluster_address=gcomm:// in /etc/my.cnf.d/galera.cnf on one of
> the cluster nodes. After adding this line, start mariadb on that node.
> After mariadb successfully starts on that node, start mariadb on all other
> cluster nodes.

It is not that simple. You need to bootstrap the node that has the most recent view of the database.
Randomly selecting a node to bootstrap can potentially cause data loss.

Ryan, who would be able to best describe the steps for some doc text here? You have the general steps above; is that enough? Something like:

* Determine which node has the most recent view of the database (not sure how), then edit /etc/my.cnf.d/galera.cnf on that node to have a blank list of cluster addresses, so the line looks like: `wsrep_cluster_address=gcomm://`
* Start mariadb on that node (`systemctl start mariadb`)
* Once started, start the other two nodes
* From any node, run `pcs resource cleanup mariadb` to get the resource back to a non-error state from pacemaker's view

Note that these steps are for OSP 5 only; OSP 6 will use the galera resource agent.

(In reply to Jason Guiditta from comment #15)
> You have the general steps above, is that enough?

This is exactly right. I sent some links to Leonid that describe how to do this, and I think he sent those to the internal mailing list. Can you check those?

> Note that these steps are for OSP 5 only, OSP 6 will use the galera
> resource-agent.

Right.

Verified:

* foreman-installer-1.6.0-0.2.RC1.el7ost.noarch
* rhel-osp-installer-client-0.5.4-1.el7ost.noarch
* openstack-foreman-installer-3.0.8-1.el7ost.noarch
* rhel-osp-installer-0.5.4-1.el7ost.noarch

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0156.html
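The open question in the recovery steps discussed above is how to determine which node has the most recent view of the database. One common approach (an assumption here, not something spelled out in this report) is to compare each node's last committed Galera sequence number, recorded in /var/lib/mysql/grastate.dat, and bootstrap the node with the highest value:

```shell
#!/bin/sh
# Sketch: print this node's last committed Galera transaction number.
# Run on every controller; bootstrap the node with the highest seqno.
# A seqno of -1 indicates an unclean shutdown; such a node needs
# "mysqld_safe --wsrep-recover" to report its real position instead.
grastate=${1:-/var/lib/mysql/grastate.dat}
awk -F: '/^seqno/ { gsub(/[[:space:]]/, "", $2); print $2 }' "$grastate"
```

Once the most advanced node is identified, the workaround above applies: blank its `wsrep_cluster_address` to `gcomm://`, start mariadb there first, then on the remaining nodes, then run `pcs resource cleanup`.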
Created attachment 952051 [details]
puma44-46 are the controllers; var/log/messages and mariadb.log

Description of problem:
I deployed HA-neutron on bare metal with fencing, and after some tests I rebooted all three servers at the same time to check the recovery of the cluster and the db. As they finished the boot process, I used "pcs status" to check pacemaker and saw lots of failed resources (rabbitmq-server, neutron-server, neutron-ovs-cleanup, mysqld, etc.). I checked mariadb status on the hosts: it was active on only one, and down on the others with "Failed to start MariaDB database server". So it looks like another problem in galera. Adding the logs of the 3 servers.
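The checks the reporter describes can be repeated after any recovery attempt. A minimal sketch (the expected `wsrep_cluster_size` of 3 assumes this three-controller layout):

```shell
# Sketch: post-reboot health checks, run on a controller node.
pcs status                 # look for failed resources (mysqld-clone etc.)
systemctl status mariadb   # confirm the service started on this node

# Once mariadb is up everywhere, all three nodes should be in the cluster;
# -N suppresses the header, -s -e runs the query non-interactively.
mysql -Nse "SHOW STATUS LIKE 'wsrep_cluster_size';" | awk '{print $2}'
```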