Bug 1288514 - galera fails to start
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: resource-agents
Version: 7.1
Hardware: All Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Damien Ciabrini
QA Contact: cluster-qe@redhat.com
Depends On:
Blocks:
Reported: 2015-12-04 08:16 EST by Jaison Raju
Modified: 2016-02-10 09:45 EST (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-12-07 05:33:52 EST
Type: Bug


Description Jaison Raju 2015-12-04 08:16:11 EST
Description of problem:
Pacemaker does not recover the galera resource,
although the resource can be started manually via systemd.
After disabling the galera / galera-master resource, a manual start of mariadb
forms the cluster successfully.
However, bringing the pacemaker galera resource up fails.

Other resources have also failed, which we suspect is the result of failed constraints.


Version-Release number of selected component (if applicable):
resource-agents-3.9.5-40.el7_1.9.x86_64
RHOS 6

How reproducible:
Always on customer end

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
We notice the following errors in messages:


Dec  4 20:08:15 mac525400c5c048 galera(galera)[20046]: INFO: attempting to detect last commit version by reading /var/lib/mysql/grastate.dat
Dec  4 20:08:15 mac525400c5c048 galera(galera)[20046]: INFO: Last commit version found:  82281890
Dec  4 20:08:16 mac525400c5c048 galera(galera)[20046]: INFO: Waiting on node <pcmk-mac525400eac655> to report database status before Master instances can start.
Dec  4 20:08:16 mac525400c5c048 galera(galera)[20046]: INFO: Waiting on node <pcmk-mac525400cb08b6> to report database status before Master instances can start.
Dec  4 20:08:17 mac525400c5c048 crmd[22853]: error: generic_get_metadata: Failed to retrieve meta-data for ocf:heartbeat:galera
Dec  4 20:08:17 mac525400c5c048 crmd[22853]: warning: get_rsc_metadata: No metadata found for galera::ocf:heartbeat: Input/output error (-5)
Dec  4 20:08:17 mac525400c5c048 crmd[22853]: notice: process_lrm_event: Operation galera_start_0: ok (node=pcmk-mac525400c5c048, call=51647, rc=0, cib-update=4878, confirmed=true)
Dec  4 20:08:18 mac525400c5c048 galera(galera)[20335]: INFO: MySQL is not running
Dec  4 20:08:19 mac525400c5c048 crmd[22853]: notice: process_lrm_event: Operation galera_stop_0: ok (node=pcmk-mac525400c5c048, call=51648, rc=0, cib-update=4879, confirmed=true)
Comment 1 Fabio Massimo Di Nitto 2015-12-04 08:22:26 EST
Is this a customer escalation?

Also, collect information for the whole cluster as documented here:

https://access.redhat.com/solutions/2055933

otherwise we can't help.
Comment 4 Fabio Massimo Di Nitto 2015-12-04 08:31:13 EST
(In reply to Jaison Raju from comment #0)
> Description of problem:
> Pacemaker does not recover resource galera . 
> Although the resource can be manually started using systemd script .
> After disabling the galera / galera-master resource , manual start of mariadb
> forms cluster successfully .

This is a very dangerous operation to do manually, especially if you forget to apply or roll back a config change before handing galera back to pacemaker.

Please provide very detailed steps showing how the system was brought up with systemd and how it was changed back to run under pacemaker again.

Comment 7 Fabio Massimo Di Nitto 2015-12-04 09:05:08 EST
(In reply to Fabio Massimo Di Nitto from comment #4)
> (In reply to Jaison Raju from comment #0)
> > Description of problem:
> > Pacemaker does not recover resource galera . 
> > Although the resource can be manually started using systemd script .
> > After disabling the galera / galera-master resource , manual start of mariadb
> > forms cluster successfully .
> 
> This is a very dangerous operation to do manually, specially if you forgot
> to apply or roll back config change before handing over galera back to
> pacemaker.
> 
> Please provide very detailed steps that have been used to bring the system
> up with systemd and how it's been changed back to move it again under
> pacemaker.

Please provide this information.
Comment 8 Michael Bayer 2015-12-04 10:06:51 EST
Hi Jaison -

We need the following steps performed:

1. Attach the /var/log/mysqld.log file from all three nodes to the collab-shell case. For subsequent runs of sosreport, please run it with the --all-logs option:

   # sosreport -o mysql --all-logs

2. While Galera is running (assuming it's still running via systemd), test that the "clustercheck" command is using the correct username/password on all nodes:

   [node1] #  clustercheck
   [node2] #  clustercheck
   [node3] #  clustercheck

We should see a 200 status code. Otherwise, if Galera is not running, please verify that the connection info in /etc/sysconfig/clustercheck is correct for the current database settings.

3. If Galera is running via systemd, we'd like to hand it back to pacemaker management via these commands:

   # pcs resource unmanage galera
   # <ensure galera is running fully via systemctl>
   # pcs resource cleanup galera
   # <wait for pcs status to show it as up>
   # pcs resource manage galera
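
The clustercheck test in step 2 can be scripted roughly as follows. This is only a sketch: it assumes clustercheck prints an HTTP-style response whose first line looks like "HTTP/1.1 200 OK" (the thread only says a 200 status code is expected), and the function names clustercheck_status and node_is_synced are made up for illustration.

```shell
# Extract the HTTP status code from clustercheck-style output read on stdin.
# Assumption: the first line has the form "HTTP/1.1 <code> <reason>".
clustercheck_status() {
    awk 'NR == 1 && $1 ~ /^HTTP\// { print $2 }'
}

# Succeed only when the status code read on stdin is 200.
node_is_synced() {
    [ "$(clustercheck_status)" = "200" ]
}

# Example usage on each node (requires clustercheck to be installed):
#   clustercheck | node_is_synced && echo "synced" || echo "NOT synced"
```

Anything other than 200 (or no output at all) is treated as "not synced", which is the point at which /etc/sysconfig/clustercheck should be checked.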
Comment 9 Michael Bayer 2015-12-04 10:22:07 EST
OK, for the last step we might have to tweak it a bit, because pacemaker will be looking at a different pidfile.
Comment 10 Michael Bayer 2015-12-04 10:28:56 EST
OK, so the modification to step 3 is: when Galera is fully running under systemctl, *copy* the /var/run/mariadb/mariadb.pid file to /var/run/mysql/mysqld.pid, then run the "pcs resource cleanup" command. This way pacemaker sees the current pid of mariadb.
Comment 11 Michael Bayer 2015-12-04 10:30:06 EST
   # pcs resource unmanage galera
   # <ensure galera is running fully via systemctl>
   # cp /var/run/mariadb/mariadb.pid /var/run/mysql/mysqld.pid
   # pcs resource cleanup galera
   # <wait for pcs status to show it as up>
   # pcs resource manage galera
Comment 12 Michael Bayer 2015-12-04 10:37:09 EST
Revised to add an "enable" command:

# pcs resource unmanage galera
# <ensure galera is running fully via systemctl>
# cp /var/run/mariadb/mariadb.pid /var/run/mysql/mysqld.pid
# pcs resource enable galera
# pcs resource cleanup galera
# <wait for pcs status to show it as up>
# pcs resource manage galera
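
For anyone scripting the revised sequence, here is a cautious sketch that splits it into three phases so the two manual checkpoints are not skipped. The commands and paths are the ones listed above; the helper names (run, handover_prepare, handover_adopt, handover_finish) and the DRY_RUN variable are hypothetical. By default it only prints what it would run.

```shell
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to really run them (requires pcs and root privileges).
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Phase 1: stop pacemaker from acting on the resource.
handover_prepare() {
    run pcs resource unmanage galera
}

# Phase 2: run only once galera is fully up under systemctl.
handover_adopt() {
    run cp /var/run/mariadb/mariadb.pid /var/run/mysql/mysqld.pid &&
    run pcs resource enable galera &&
    run pcs resource cleanup galera
}

# Phase 3: run only once "pcs status" shows galera as up.
handover_finish() {
    run pcs resource manage galera
}
```

Keeping the phases as separate functions is deliberate: the operator confirms the state by hand between them instead of the script guessing when galera is "fully" running.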
Comment 13 Jaison Raju 2015-12-05 04:15:54 EST
(In reply to Fabio Massimo Di Nitto from comment #7)
> (In reply to Fabio Massimo Di Nitto from comment #4)
> > (In reply to Jaison Raju from comment #0)
> > > Description of problem:
> > > Pacemaker does not recover resource galera . 
> > > Although the resource can be manually started using systemd script .
> > > After disabling the galera / galera-master resource , manual start of mariadb
> > > forms cluster successfully .
> > 
> > This is a very dangerous operation to do manually, specially if you forgot
> > to apply or roll back config change before handing over galera back to
> > pacemaker.
> > 
> > Please provide very detailed steps that have been used to bring the system
> > up with systemd and how it's been changed back to move it again under
> > pacemaker.
> 
> Please provide those info.

I checked /var/lib/mysql/grastate.dat and found the same content on all nodes.
I cross-checked against the sequence number reported by the following command and used the node with the latest sequence number to perform the operation:
# mysqld_safe --wsrep-recover

Put the pacemaker cluster in standby.
Set wsrep_cluster_address="gcomm://" in /etc/my.cnf.d/galera.cnf on that node.
Started mariadb via systemd.
On the other 2 nodes, updated wsrep_cluster_address with all three nodes.
Started mariadb on the other 2 nodes, one after the other.
Confirmed that the galera cluster formed with all 3 nodes using the command below:
"SHOW STATUS LIKE 'wsrep%';"
Brought the pacemaker cluster out of standby for all nodes.
Waited for services to come up.
Multiple services remain in the stopped state, including galera.
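
The sequence-number comparison in the steps above can be sketched like this. It assumes grastate.dat contains a line of the form "seqno: <number>" and that the per-node values have already been gathered into "node seqno" pairs; the function names grastate_seqno and latest_node are hypothetical.

```shell
# Print the seqno recorded in a grastate.dat file.
# Assumption: the file has a line of the form "seqno:   <number>".
grastate_seqno() {
    awk -F: '$1 == "seqno" { gsub(/[ \t]/, "", $2); print $2; exit }' "$1"
}

# Read "node seqno" pairs on stdin and print the node with the highest seqno.
latest_node() {
    awk 'NR == 1 || $2 + 0 > max { max = $2 + 0; node = $1 }
         END { if (NR) print node }'
}

# Example usage on one node:
#   grastate_seqno /var/lib/mysql/grastate.dat
```

If the recorded seqno is not usable on a node, the value reported by mysqld_safe --wsrep-recover would have to be substituted by hand before comparing.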

I have also tested the same steps with the pacemaker cluster up and the galera resource disabled.

I still need to collect the data requested by Michael.
Thanks a lot for the help.
Comment 14 Michael Bayer 2015-12-05 15:27:05 EST
> Set wsrep_cluster_address="gcomm://" in /etc/my.cnf.d/galera.cnf on that node .
> Started mariadb via systemd .

OK, so I just noticed this in the galera.cnf. Pacemaker's Galera resource agent, at least the version we've run for a long time (not sure if we've improved this), will *NOT* work if this variable is set in the galera.cnf file, and I can see in the SOS reports that it is:

    wsrep_cluster_address="gcomm://pcmk-mac525400eac655,pcmk-mac525400c5c048,pcmk-mac525400cb08b6"


This issue is so common that the resource agent actually states this in its error message when the resource won't start under pcs.

Can we please *remove* wsrep_cluster_address from *all* galera.cnf files and start all over again?

Additionally, don't use systemctl to start or stop galera.   For manual control without pacemaker, use mysqld_safe.   When starting manually, the gcomm address can be specified here:

   mysqld_safe --wsrep-cluster-address="gcomm://pcmk-mac525400eac655,pcmk-mac525400c5c048,pcmk-mac525400cb08b6"

However, if we just take wsrep_cluster_address out of galera.cnf entirely and do a from-the-beginning pacemaker start, ensuring the resource is both managed and enabled and all nodes are unbanned, the whole thing should just work.
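
A quick way to confirm on each node that the variable is really gone might look like the sketch below. The function name has_cluster_address is hypothetical, and the check only looks at the one file it is given, so any other included config files would need the same check.

```shell
# Succeed (exit 0) if the given config file sets wsrep_cluster_address
# on a non-commented line. Both the underscore and dash spellings are matched.
has_cluster_address() {
    grep -Eq '^[[:space:]]*wsrep[_-]cluster[_-]address[[:space:]]*=' "$1"
}

# Example usage on each node:
#   if has_cluster_address /etc/my.cnf.d/galera.cnf; then
#       echo "wsrep_cluster_address is still set; remove it before starting via pacemaker"
#   fi
```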
Comment 15 Jaison Raju 2015-12-06 23:29:19 EST
(In reply to Michael Bayer from comment #14)
> > Set wsrep_cluster_address="gcomm://" in /etc/my.cnf.d/galera.cnf on that node .
> > Started mariadb via systemd .
> 
> OK, so just noticed this in the galera.cnf.  Pacemaker's Galera resource
> agents, at least the version we've run for a long time (not sure if we've
> improved this), will *NOT* work if this variable is filled out in the
> galera.cnf file, which in the SOS reports I can see it is:
> 
>     wsrep_cluster_address="gcomm://pcmk-mac525400eac655,pcmk-mac525400c5c048,pcmk-mac525400cb08b6"
> 
> 
> This issue is so common that the resource agent actually states this in its
> error message when pcs won't start.
> 
> Can we please *remove* wsrep_cluster_address from *all* galera.cnf files,
> and start all over again.
> 
> Additionally, don't use systemctl to start or stop galera.   For manual
> control without pacemaker, use mysqld_safe.   When starting manually, the
> gcomm address can be specified here:
> 
>    mysqld_safe --wsrep-cluster-address="gcomm://pcmk-mac525400eac655,pcmk-mac525400c5c048,pcmk-mac525400cb08b6"
> 
> However, if we just take out wsrep_cluster_address out of galera.cnf
> entirely and just do a from-the-beginning pacemaker start, ensuring the
> resource is both managed, enabled, and all nodes unbanned, the whole thing
> should just work.

Hi Michael ,

The gcomm entry was added only to recover using systemd.
I planned to remove it (revert to the initial state) once pacemaker shows the service as up after unstandby.
Trying the steps from comment #12 now.

Regards,
Jaison R
