1288514 – galera fails to start

Bug 1288514 - galera fails to start

Summary: galera fails to start

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	resource-agents
Sub Component:
Version:	7.1
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Damien Ciabrini
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2015-12-04 13:16 UTC by Jaison Raju
Modified:	2019-09-12 09:28 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-12-07 10:33:52 UTC
Target Upstream Version:
Embargoed:

Description Jaison Raju 2015-12-04 13:16:11 UTC

Description of problem:
Pacemaker does not recover the galera resource, although the resource can be started manually using the systemd script.
After disabling the galera / galera-master resource, a manual start of mariadb forms the cluster successfully.
However, bringing the pacemaker galera resource up afterwards fails.

There are other failed resources, which we suspect are the result of constraints that have failed.


Version-Release number of selected component (if applicable):
resource-agents-3.9.5-40.el7_1.9.x86_64
RHOS 6

How reproducible:
Always on customer end

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
We notice the following errors in messages:


Dec  4 20:08:15 mac525400c5c048 galera(galera)[20046]: INFO: attempting to detect last commit version by reading /var/lib/mysql/grastate.dat
Dec  4 20:08:15 mac525400c5c048 galera(galera)[20046]: INFO: Last commit version found:  82281890
Dec  4 20:08:16 mac525400c5c048 galera(galera)[20046]: INFO: Waiting on node <pcmk-mac525400eac655> to report database status before Master instances can start.
Dec  4 20:08:16 mac525400c5c048 galera(galera)[20046]: INFO: Waiting on node <pcmk-mac525400cb08b6> to report database status before Master instances can start.
Dec  4 20:08:17 mac525400c5c048 crmd[22853]: error: generic_get_metadata: Failed to retrieve meta-data for ocf:heartbeat:galera
Dec  4 20:08:17 mac525400c5c048 crmd[22853]: warning: get_rsc_metadata: No metadata found for galera::ocf:heartbeat: Input/output error (-5)
Dec  4 20:08:17 mac525400c5c048 crmd[22853]: notice: process_lrm_event: Operation galera_start_0: ok (node=pcmk-mac525400c5c048, call=51647, rc=0, cib-update=4878, confirmed=true)
Dec  4 20:08:18 mac525400c5c048 galera(galera)[20335]: INFO: MySQL is not running
Dec  4 20:08:19 mac525400c5c048 crmd[22853]: notice: process_lrm_event: Operation galera_stop_0: ok (node=pcmk-mac525400c5c048, call=51648, rc=0, cib-update=4879, confirmed=true)

Comment 1 Fabio Massimo Di Nitto 2015-12-04 13:22:26 UTC

Is this a customer escalation?

Also, collect information for the whole cluster as documented here:

https://access.redhat.com/solutions/2055933

Otherwise we can't help.

Comment 4 Fabio Massimo Di Nitto 2015-12-04 13:31:13 UTC

(In reply to Jaison Raju from comment #0)
> Description of problem:
> Pacemaker does not recover resource galera . 
> Although the resource can be manually started using systemd script .
> After disabling the galera / galera-master resource , manual start of mariadb
> forms cluster successfully .

This is a very dangerous operation to do manually, especially if you forget to apply or roll back a config change before handing galera back over to pacemaker.

Please provide very detailed steps that were used to bring the system up with systemd and how it was changed back to move it under pacemaker again.


Comment 7 Fabio Massimo Di Nitto 2015-12-04 14:05:08 UTC

(In reply to Fabio Massimo Di Nitto from comment #4)
> (In reply to Jaison Raju from comment #0)
> > Description of problem:
> > Pacemaker does not recover resource galera . 
> > Although the resource can be manually started using systemd script .
> > After disabling the galera / galera-master resource , manual start of mariadb
> > forms cluster successfully .
> 
> This is a very dangerous operation to do manually, especially if you forgot
> to apply or roll back config change before handing over galera back to
> pacemaker.
> 
> Please provide very detailed steps that have been used to bring the system
> up with systemd and how it's been changed back to move it again under
> pacemaker.

Please provide this information.

Comment 8 Michael Bayer 2015-12-04 15:06:51 UTC

Hi Jaison -

We need the following steps performed:

1. attach the /var/log/mysqld.log file from all three nodes to the collab-shell case.   For subsequent runs of sosreport, please run it with the --all-logs option:

   # sosreport -o mysql --all-logs

2. While Galera is running (assuming it's still running via systemd), test that the "clustercheck" command is using the correct username/password on all nodes:

   [node1] #  clustercheck
   [node2] #  clustercheck
   [node3] #  clustercheck

We should see a 200 status code on each node.  Otherwise, if Galera is not running, please verify that the connectivity info in /etc/sysconfig/clustercheck is correct for the current database settings.

3. if Galera is running via systemd, we'd like to bring pacemaker back to managing it, via these commands:

   # pcs resource unmanage galera
   # <ensure galera is running fully via systemctl>
   # pcs resource cleanup galera
   # <wait for pcs status to show it as up>
   # pcs resource manage galera
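
The check in step 2 could be scripted roughly as follows. This is only a minimal sketch, assuming that clustercheck prints an HTTP-style response whose first line carries the status code (for example "HTTP/1.1 200 OK"); the function name clustercheck_ok is hypothetical and not part of any package.

```shell
#!/bin/sh
# Hypothetical helper: reads a clustercheck response on stdin and succeeds
# only if the first line reports HTTP status 200.
# Assumption: the first line looks like "HTTP/1.1 200 OK".
clustercheck_ok() {
    # Read only the status line; headers and body are ignored.
    # The second test accepts a final line that lacks a trailing newline.
    IFS= read -r status_line || [ -n "$status_line" ] || return 1
    case "$status_line" in
        HTTP/*" 200 "*|HTTP/*" 200") return 0 ;;
        *) return 1 ;;
    esac
}
```

On each node it would be used as "clustercheck | clustercheck_ok && echo synced"; any other status code, or no output at all, makes the function fail.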

Comment 9 Michael Bayer 2015-12-04 15:22:07 UTC

OK for the last step, we might have to tweak it a bit because pacemaker will be looking at a different pidfile.

Comment 10 Michael Bayer 2015-12-04 15:28:56 UTC

OK so the modification to #3 is, when Galera is running fully under systemctl, *copy* the /var/run/mariadb/mariadb.pid file to /var/run/mysql/mysqld.pid, then run the "pcs resource cleanup" command.  This way pacemaker sees the current pid of mariadb.

Comment 11 Michael Bayer 2015-12-04 15:30:06 UTC

   # pcs resource unmanage galera
   # <ensure galera is running fully via systemctl>
   # cp /var/run/mariadb/mariadb.pid /var/run/mysql/mysqld.pid
   # pcs resource cleanup galera
   # <wait for pcs status to show it as up>
   # pcs resource manage galera

Comment 12 Michael Bayer 2015-12-04 15:37:09 UTC

Revised to add an "enable" command:

# pcs resource unmanage galera
# <ensure galera is running fully via systemctl>
# cp /var/run/mariadb/mariadb.pid /var/run/mysql/mysqld.pid
# pcs resource enable galera
# pcs resource cleanup galera
# <wait for pcs status to show it as up>
# pcs resource manage galera
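
The two "wait" placeholders in the sequence above could be automated with a small polling helper. The following is a sketch under stated assumptions, not a tested procedure: wait_for is a generic retry loop, and galera_running_under_systemd and galera_up_in_pcs are hypothetical check functions that the operator would have to define (they are not real commands). The pcs and cp invocations are the ones listed above.

```shell
#!/bin/sh
# wait_for TRIES DELAY CMD...: run CMD until it succeeds, at most TRIES
# times, sleeping DELAY seconds between attempts.
wait_for() {
    tries=$1; delay=$2; shift 2
    while [ "$tries" -gt 0 ]; do
        "$@" && return 0
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] && sleep "$delay"
    done
    return 1
}

# Hand galera back to pacemaker, stopping at the first failing step.
handover_galera() {
    pcs resource unmanage galera || return 1
    # Hypothetical check supplied by the operator: galera fully up via systemctl.
    wait_for 30 10 galera_running_under_systemd || return 1
    cp /var/run/mariadb/mariadb.pid /var/run/mysql/mysqld.pid || return 1
    pcs resource enable galera || return 1
    pcs resource cleanup galera || return 1
    # Hypothetical check supplied by the operator: pcs status shows galera up.
    wait_for 30 10 galera_up_in_pcs || return 1
    pcs resource manage galera
}
```

Stopping at the first failure is deliberate: re-managing the resource while pacemaker still sees it as down would let pacemaker act on a stale state.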

Comment 13 Jaison Raju 2015-12-05 09:15:54 UTC

(In reply to Fabio Massimo Di Nitto from comment #7)
> (In reply to Fabio Massimo Di Nitto from comment #4)
> > (In reply to Jaison Raju from comment #0)
> > > Description of problem:
> > > Pacemaker does not recover resource galera . 
> > > Although the resource can be manually started using systemd script .
> > > After disabling the galera / galera-master resource , manual start of mariadb
> > > forms cluster successfully .
> > 
> > This is a very dangerous operation to do manually, especially if you forgot
> > to apply or roll back config change before handing over galera back to
> > pacemaker.
> > 
> > Please provide very detailed steps that have been used to bring the system
> > up with systemd and how it's been changed back to move it again under
> > pacemaker.
> 
> Please provide those info.

I checked /var/lib/mysql/grastate.dat and found the same content on all nodes.
I cross-checked against the sequence number reported by the following command and used the node with the latest sequence number to perform the operation:
# mysqld_safe --wsrep-recover

Put the pacemaker cluster in standby.
Set wsrep_cluster_address="gcomm://" in /etc/my.cnf.d/galera.cnf on that node.
Started mariadb via systemd.
On the other 2 nodes, updated wsrep_cluster_address with all three nodes.
Started mariadb on the other 2 nodes, one after the other.
Confirmed that the galera cluster formed with all 3 nodes using the command below:
"SHOW STATUS LIKE 'wsrep%';"
Brought the pacemaker cluster out of standby on all nodes.
Waited for services to come up.
Multiple services remain in stopped state, including galera.

I have also tested the same steps with the pacemaker cluster up and the galera resource disabled.

I still need to collect the data requested by Michael.
Thanks a lot for the help.
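
The sequence-number comparison described above could be sketched like this. It assumes grastate.dat contains a line of the form "seqno:   <number>"; a value of -1 there typically means the position has to be recovered with --wsrep-recover instead. Both function names are hypothetical.

```shell
#!/bin/sh
# grastate_seqno FILE: print the seqno recorded in a grastate.dat file.
# Assumption: the file has a line of the form "seqno:   <number>".
# Fails if no such line exists.
grastate_seqno() {
    awk '$1 == "seqno:" { print $2; found = 1; exit }
         END { if (!found) exit 1 }' "$1"
}

# latest_node: read "node seqno" pairs on stdin (one per line) and print
# the node with the highest seqno, i.e. the candidate to bootstrap from.
latest_node() {
    sort -k2,2n | tail -n 1 | awk '{ print $1 }'
}
```

The operator would collect one "node seqno" line per controller (for example by running grastate_seqno on each node) and feed the three lines to latest_node.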

Comment 14 Michael Bayer 2015-12-05 20:27:05 UTC

> Set wsrep_cluster_address="gcomm://" in /etc/my.cnf.d/galera.cnf on that node .
> Started mariadb via systemd .

OK, I just noticed this in the galera.cnf.  Pacemaker's Galera resource agents, at least the version we've run for a long time (not sure if we've improved this), will *NOT* work if this variable is set in the galera.cnf file, and in the SOS reports I can see that it is:

    wsrep_cluster_address="gcomm://pcmk-mac525400eac655,pcmk-mac525400c5c048,pcmk-mac525400cb08b6"


This issue is so common that the resource agent actually states this in its error message when the resource won't start under pcs.

Can we please *remove* wsrep_cluster_address from *all* galera.cnf files and start all over again?

Additionally, don't use systemctl to start or stop galera.   For manual control without pacemaker, use mysqld_safe.   When starting manually, the gcomm address can be specified on the command line:

   mysqld_safe --wsrep-cluster-address="gcomm://pcmk-mac525400eac655,pcmk-mac525400c5c048,pcmk-mac525400cb08b6"

However, if we just take wsrep_cluster_address out of galera.cnf entirely and do a from-the-beginning pacemaker start, ensuring the resource is managed and enabled and all nodes are unbanned, the whole thing should just work.
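
Before restarting under pacemaker, each node's config could be checked for a leftover setting. This is a minimal sketch: has_cluster_address is a hypothetical helper, and it only looks for an uncommented assignment in the one file it is given (it does not follow include directives).

```shell
#!/bin/sh
# has_cluster_address FILE: succeed if FILE assigns wsrep_cluster_address
# on a line that is not commented out. Both the underscore and the dash
# spelling of the option name are accepted.
has_cluster_address() {
    grep -Eq '^[[:space:]]*wsrep[_-]cluster[_-]address[[:space:]]*=' "$1"
}
```

For example, "has_cluster_address /etc/my.cnf.d/galera.cnf && echo 'still set'" run on every controller would flag the nodes that still need the line removed.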

Comment 15 Jaison Raju 2015-12-07 04:29:19 UTC

(In reply to Michael Bayer from comment #14)
> > Set wsrep_cluster_address="gcomm://" in /etc/my.cnf.d/galera.cnf on that node .
> > Started mariadb via systemd .
> 
> OK, so just noticed this in the galera.cnf.  Pacemaker's Galera resource
> agents, at least the version we've run for a long time (not sure if we've
> improved this), will *NOT* work if this variable is filled out in the
> galera.cnf file, which in the SOS reports I can see it is:
> 
>    
> wsrep_cluster_address="gcomm://pcmk-mac525400eac655,pcmk-mac525400c5c048,
> pcmk-mac525400cb08b6"
> 
> 
> This issue is so common that the resource agent actually states this in its
> error message when pcs won't start.
> 
> Can we please *remove* wsrep_cluster_address from *all* galera.cnf files,
> and start all over again.
> 
> Additionally, don't use systemctl to start or stop galera.   For manual
> control without pacemaker, use mysqld_safe.   When starting manually, the
> gcomm address can be specified here:
> 
>    mysqld_safe
> --wsrep-cluster-address="gcomm://pcmk-mac525400eac655,pcmk-mac525400c5c048,
> pcmk-mac525400cb08b6"
> 
> However, if we just take out wsrep_cluster_address out of galera.cnf
> entirely and just do a from-the-beginning pacemaker start, ensuring the
> resource is both managed, enabled, and all nodes unbanned, the whole thing
> should just work.

Hi Michael ,

The gcomm entry was added only to recover using systemd.
I planned to remove it (restoring the initial state) once pacemaker showed the service as up after unstandby.
Trying the steps from comment #12 now.

Regards,
Jaison R
