Red Hat Bugzilla – Bug 1288514
galera fails to start
Last modified: 2016-02-10 09:45:05 EST
Description of problem:
Pacemaker does not recover the galera resource, although the resource can be started manually via the systemd script. After disabling the galera / galera-master resource, a manual start of mariadb forms the cluster successfully, but bringing the pacemaker resource galera up fails. There are other failed resources, which we suspect is the result of constraints that have failed.

Version-Release number of selected component (if applicable):
resource-agents-3.9.5-40.el7_1.9.x86_64
RHOS 6

How reproducible:
Always on customer end

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
We notice the following errors in messages:

Dec  4 20:08:15 mac525400c5c048 galera(galera)[20046]: INFO: attempting to detect last commit version by reading /var/lib/mysql/grastate.dat
Dec  4 20:08:15 mac525400c5c048 galera(galera)[20046]: INFO: Last commit version found: 82281890
Dec  4 20:08:16 mac525400c5c048 galera(galera)[20046]: INFO: Waiting on node <pcmk-mac525400eac655> to report database status before Master instances can start.
Dec  4 20:08:16 mac525400c5c048 galera(galera)[20046]: INFO: Waiting on node <pcmk-mac525400cb08b6> to report database status before Master instances can start.
Dec  4 20:08:17 mac525400c5c048 crmd[22853]: error: generic_get_metadata: Failed to retrieve meta-data for ocf:heartbeat:galera
Dec  4 20:08:17 mac525400c5c048 crmd[22853]: warning: get_rsc_metadata: No metadata found for galera::ocf:heartbeat: Input/output error (-5)
Dec  4 20:08:17 mac525400c5c048 crmd[22853]: notice: process_lrm_event: Operation galera_start_0: ok (node=pcmk-mac525400c5c048, call=51647, rc=0, cib-update=4878, confirmed=true)
Dec  4 20:08:18 mac525400c5c048 galera(galera)[20335]: INFO: MySQL is not running
Dec  4 20:08:19 mac525400c5c048 crmd[22853]: notice: process_lrm_event: Operation galera_stop_0: ok (node=pcmk-mac525400c5c048, call=51648, rc=0, cib-update=4879, confirmed=true)
Is this a customer escalation? Also, collect information for the whole cluster as documented here: https://access.redhat.com/solutions/2055933; otherwise we can't help.
(In reply to Jaison Raju from comment #0)
> Description of problem:
> Pacemaker does not recover resource galera.
> Although the resource can be manually started using the systemd script.
> After disabling the galera / galera-master resource, manual start of mariadb
> forms the cluster successfully.

This is a very dangerous operation to do manually, especially if you forget to apply or roll back a config change before handing galera back to pacemaker.

Please provide very detailed steps for how the system has been brought up with systemd and how it has been changed back to move it under pacemaker again.
(In reply to Fabio Massimo Di Nitto from comment #4)
> Please provide very detailed steps for how the system has been brought up
> with systemd and how it has been changed back to move it under pacemaker
> again.

Please provide that information.
Hi Jaison -

We need the following steps performed:

1. Attach the /var/log/mysqld.log file from all three nodes to the collab-shell case. For subsequent runs of sosreport, please run it with the --all-logs option:

# sosreport -o mysql --all-logs

2. While Galera is running (assuming it's still running via systemd), test that the "clustercheck" command is referring to the correct username/password on all nodes:

[node1] # clustercheck
[node2] # clustercheck
[node3] # clustercheck

We should see a 200 status code. Otherwise, if Galera is not running, please verify that the connectivity info in /etc/sysconfig/clustercheck is correct for the current database settings.

3. If Galera is running via systemd, we'd like to bring pacemaker back to managing it, via these commands:

# pcs resource unmanage galera
# <ensure galera is running fully via systemctl>
# pcs resource cleanup galera
# <wait for pcs status to show it as up>
# pcs resource manage galera
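As a sketch of the step-2 check: clustercheck prints an HTTP-style response, and a "200 OK" status line means the node is synced. The `check_status` helper and the sample status lines below are my own illustration (not part of the real clustercheck script), showing how the first output line would be interpreted:

```shell
#!/bin/sh
# Sketch: interpret a clustercheck-style status line. On a healthy node,
# `clustercheck | head -n1` yields something like "HTTP/1.1 200 OK".
# The helper name and sample lines are illustrative assumptions.

check_status() {
    # $1: first line of clustercheck output
    case "$1" in
        *" 200 "*|*" 200") echo "synced" ;;
        *) echo "not-synced" ;;
    esac
}

check_status "HTTP/1.1 200 OK"                   # -> synced
check_status "HTTP/1.1 503 Service Unavailable"  # -> not-synced
```

Running this against `clustercheck` output on each of the three nodes would confirm whether the username/password in /etc/sysconfig/clustercheck is correct.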
OK, for the last step we might have to tweak it a bit, because pacemaker will be looking at a different pidfile.
OK, so the modification to #3 is: when Galera is running fully under systemctl, *copy* the /var/run/mariadb/mariadb.pid file to /var/run/mysql/mysqld.pid, then run the "pcs resource cleanup" command. That way pacemaker sees the current pid of mariadb.
# pcs resource unmanage galera
# <ensure galera is running fully via systemctl>
# cp /var/run/mariadb/mariadb.pid /var/run/mysql/mysqld.pid
# pcs resource cleanup galera
# <wait for pcs status to show it as up>
# pcs resource manage galera
Revised, adding an "enable" command:

# pcs resource unmanage galera
# <ensure galera is running fully via systemctl>
# cp /var/run/mariadb/mariadb.pid /var/run/mysql/mysqld.pid
# pcs resource enable galera
# pcs resource cleanup galera
# <wait for pcs status to show it as up>
# pcs resource manage galera
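The revised hand-over above can be collected into a small script for review. The `run` helper and `DRY_RUN` guard are my additions (not part of the procedure as given); with DRY_RUN left at its default the script only echoes each step, and with DRY_RUN set empty the same pcs and cp commands run for real on a cluster node:

```shell
#!/bin/sh
# Sketch of the revised hand-over sequence. DRY_RUN=1 (the default here)
# prints the commands instead of executing them; set DRY_RUN= to run
# them for real, pausing at the commented wait points.
DRY_RUN=${DRY_RUN-1}

run() {
    if [ -n "$DRY_RUN" ]; then echo "would run: $*"; else "$@"; fi
}

run pcs resource unmanage galera
# <ensure galera is running fully via systemctl before continuing>
run cp /var/run/mariadb/mariadb.pid /var/run/mysql/mysqld.pid
run pcs resource enable galera
run pcs resource cleanup galera
# <wait for "pcs status" to show galera as up>
run pcs resource manage galera
```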
(In reply to Fabio Massimo Di Nitto from comment #7)
> Please provide those info.

Checked /var/lib/mysql/grastate.dat and found the same content on all nodes. Cross-checked with the sequence number reported by the following command, and used the node with the latest sequence number to perform the operation:

# mysqld_safe --wsrep-recover

Put the pacemaker cluster in standby.
Set wsrep_cluster_address="gcomm://" in /etc/my.cnf.d/galera.cnf on that node.
Started mariadb via systemd.
On the other 2 nodes, updated wsrep_cluster_address with all three nodes.
Started mariadb on the other 2 nodes, one after the other.
Confirmed that the galera cluster is formed with all 3 nodes using the command below:

"SHOW STATUS LIKE 'wsrep%';"

Brought the pacemaker cluster out of standby on all nodes.
Waited for services to come up. Multiple services remain in a stopped state, including galera.

I have also tested with the pacemaker cluster up and galera kept in a disabled state, with the same steps as above.

I need to collect the data requested by Michael. Thanks a lot for the help.
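The "latest sequence number" comparison above can be sketched as follows: each node's /var/lib/mysql/grastate.dat records a seqno (the value 82281890 below matches the one in the logs in comment #0; the uuid and version in the sample file are made up for illustration, and a seqno of -1 generally indicates an unclean shutdown):

```shell
#!/bin/sh
# Sketch: extract the seqno from a grastate.dat file, as used to decide
# which node has the most recent commit and should bootstrap the cluster.
# The sample file is illustrative; the real file is
# /var/lib/mysql/grastate.dat on each node.

seqno_of() {
    awk -F': *' '$1 == "seqno" { print $2 }' "$1"
}

# Simulated grastate.dat for demonstration:
cat > /tmp/grastate.sample <<'EOF'
# GALERA saved state
version: 2.1
uuid:    6f9f4b33-0a44-11e5-9b3a-3e4b2a0468d1
seqno:   82281890
EOF

seqno_of /tmp/grastate.sample    # -> 82281890
```

Comparing this value across the three nodes picks the node to start first.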
> Set wsrep_cluster_address="gcomm://" in /etc/my.cnf.d/galera.cnf on that node.
> Started mariadb via systemd.

OK, so I just noticed this in the galera.cnf. Pacemaker's Galera resource agent, at least the version we've run for a long time (not sure if we've improved this), will *NOT* work if this variable is filled out in the galera.cnf file, which in the SOS reports I can see it is:

wsrep_cluster_address="gcomm://pcmk-mac525400eac655,pcmk-mac525400c5c048,pcmk-mac525400cb08b6"

This issue is so common that the resource agent actually states this in its error message when pcs won't start.

Can we please *remove* wsrep_cluster_address from *all* galera.cnf files, and start all over again.

Additionally, don't use systemctl to start or stop galera. For manual control without pacemaker, use mysqld_safe. When starting manually, the gcomm address can be specified like this:

mysqld_safe --wsrep-cluster-address="gcomm://pcmk-mac525400eac655,pcmk-mac525400c5c048,pcmk-mac525400cb08b6"

However, if we just take wsrep_cluster_address out of galera.cnf entirely and do a from-the-beginning pacemaker start, ensuring the resource is both managed and enabled and all nodes are unbanned, the whole thing should just work.
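Before handing galera back to pacemaker, the "remove it from all galera.cnf files" step can be verified with a quick grep. The `has_cluster_address` helper and the sample file below are my own illustration; the real file on each node is /etc/my.cnf.d/galera.cnf:

```shell
#!/bin/sh
# Sketch: verify that wsrep_cluster_address has been removed from a
# galera.cnf before a from-the-beginning pacemaker start. Helper name,
# sample path, and sample contents are illustrative assumptions.

has_cluster_address() {
    grep -q '^[[:space:]]*wsrep_cluster_address' "$1"
}

# Simulated galera.cnf with the variable already removed:
cat > /tmp/galera.cnf.sample <<'EOF'
[galera]
wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
EOF

if has_cluster_address /tmp/galera.cnf.sample; then
    echo "remove wsrep_cluster_address first"
else
    echo "ok: no wsrep_cluster_address set"
fi
```

Running the check on all three nodes would confirm the config is back to the state the resource agent expects.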
(In reply to Michael Bayer from comment #14)
> Can we please *remove* wsrep_cluster_address from *all* galera.cnf files,
> and start all over again.

Hi Michael,

The gcomm entry was put in only to recover using systemd. I planned to remove it (change it back to the initial state) after pacemaker shows the service as up after unstandby.

Trying the steps as per #12 now.

Regards,
Jaison R