Description of problem:
During a staypuft install of three OpenStack HA controller nodes, the galera cluster is initially in sync on all three nodes. However, one node falls out of sync within about 10 minutes and is ultimately not recovered. A symptom of this issue is a puppet error message (clustercheck) when puppet is run again on the problematic node.

mariadb detected as out of sync (from 192.168.0.7/pacemaker.log):

Dec 11 10:45:36 [13553] maca25400702875.example.com pengine: warning: unpack_rsc_op: Processing failed op monitor for galera:1 on pcmk-maca25400702875: not running (7)

From 192.168.0.7/mysqld.log:

141211 10:45:34 [Note] WSREP: Member 0.0 (maca25400702875.example.com) requested state transfer from '*any*'. Selected 1.0 (maca25400702876.example.com)(SYNCED) as donor.
141211 10:45:34 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 2042)
141211 10:45:34 [Note] WSREP: Requesting state transfer: success, donor: 1
141211 10:45:34 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): discarded 0 bytes
141211 10:45:34 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): found 1/7 locked buffers
terminate called after throwing an instance of 'gu::Exception'
141211 10:45:34 [Note] WSREP: Receiving IST: 8 writesets, seqnos 2027-2035
141211 10:45:34 [Warning] WSREP: 1.0 (maca25400702876.example.com): State transfer to 0.0 (maca25400702875.example.com) failed: -336134278 (Unknown error 336134278)
141211 10:45:34 [ERROR] WSREP: gcs/src/gcs_group.c:gcs_group_handle_join_msg():723: Will never receive state. Need to abort.
141211 10:45:34 [Note] WSREP: gcomm: terminating thread

Pacemaker attempts to recover the node:

Dec 11 10:45:36 [13553] maca25400702875.example.com pengine: notice: LogActions: Recover galera:1 (Master pcmk-maca25400702875)
Dec 11 10:46:38 [13553] maca25400702875.example.com pengine: notice: LogActions: Recover galera:1 (Slave pcmk-maca25400702875)
Dec 11 10:46:42 [13553] maca25400702875.example.com pengine: notice: LogActions: Recover galera:1 (Slave pcmk-maca25400702875)

But fails:

Dec 11 10:46:42 [13553] maca25400702875.example.com pengine: info: clone_print: Master/Slave Set: galera-master [galera]
Dec 11 10:46:42 [13553] maca25400702875.example.com pengine: info: native_print: galera (ocf::heartbeat:galera): FAILED pcmk-maca25400702875
Dec 11 10:46:42 [13553] maca25400702875.example.com pengine: info: short_print: Masters: [ pcmk-maca25400702876 pcmk-maca25400702877 ]

Version-Release number of selected component (if applicable):
mariadb-5.5.40-1.el7_0.x86_64
mariadb-galera-common-5.5.40-3.el7ost.x86_64
mariadb-galera-server-5.5.40-3.el7ost.x86_64
mariadb-libs-5.5.40-1.el7_0.x86_64
pacemaker-1.1.10-32.el7_0.1.x86_64
pacemaker-cli-1.1.10-32.el7_0.1.x86_64
pacemaker-cluster-libs-1.1.10-32.el7_0.1.x86_64
pacemaker-libs-1.1.10-32.el7_0.1.x86_64

Additional info:
QE host: 10.8.30.168
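For reference, a minimal sketch of how the out-of-sync state can be confirmed on the problematic node, assuming the stock clustercheck script and local mysql client access (nothing here is specific to this deployment):

# wsrep state as seen by mariadb on the affected node
mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"

# what puppet's clustercheck sees (HTTP 200 when synced, 503 when not)
clustercheck

# pacemaker's view of the galera master/slave set
pcs status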
Created attachment 967449 [details]
galera, mysql, messages, audit logs for 3 controllers
Looks like this was triggered by some sort of network problem where the galera nodes could not communicate with one another, at least on 192.168.0.7:

141211 10:45:13 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.88361S), skipping check
141211 10:45:14 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.00183S), skipping check
141211 10:45:28 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.75356S), skipping check

Someone can correct me if I am wrong, but it looks like corosync also detected a blip in the network.

Unrelated, but why are there both mysqld.log and mariadb.log? This bug is filed against RHOS-5, but the RA is only usable on RHOS-6. I am not sure what is going on here. Best get one of the pacemaker folks to look at this, as well as figure out why the network is getting interrupted.

Also, SSL is failing. I don't see config files, so I have no idea whether you have SSL enabled or disabled. This could potentially cause an IST to fail, which may be a problem for you.
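For completeness, a hedged sketch of how the corosync-side blip on 192.168.0.7 could be confirmed (standard corosync CLI tools; nothing here is specific to this deployment):

# ring/interface status as corosync sees it
corosync-cfgtool -s

# quorum and membership as corosync sees it
corosync-quorumtool -s

# recent TOTEM/membership churn in the system log
grep -E "TOTEM|membership" /var/log/messages | tail -n 50

If the "Consider token timeout increase" warnings keep showing up, the totem token could be raised in /etc/corosync/corosync.conf (e.g. token: 10000 inside the totem section; the value is only an example).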
Sorry, this was against RHOS-6. I've updated the version.
It would be useful to see galera.cnf from a machine where this happens. As for chasing down the network problem, is there a chance that neutron is using the same interface that galera is using for network communication? Could this be causing the networking hiccup? Just brainstorming here.
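Since no galera.cnf was attached, here is a hypothetical sketch of the lines I would want to see from it, in particular whether SSL is enabled for replication/IST traffic. The certificate paths below are made-up placeholders, not taken from this deployment:

# /etc/my.cnf.d/galera.cnf (hypothetical excerpt)
[mysqld]
bind-address = 192.168.0.7
wsrep_provider = /usr/lib64/galera/libgalera_smm.so
# present only if SSL is enabled for galera replication traffic:
wsrep_provider_options = "socket.ssl_key=/etc/pki/galera/galera.key;socket.ssl_cert=/etc/pki/galera/galera.crt;socket.ssl_ca=/etc/pki/galera/ca.crt"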
(In reply to Ryan O'Hara from comment #3)
> Looks like this was triggered by some sort of network problem where the
> galera nodes could not communicate with one another, at least on 192.168.0.7:
>
> 141211 10:45:13 [Warning] WSREP: last inactive check more than PT1.5S ago
> (PT1.88361S), skipping check
> 141211 10:45:14 [Warning] WSREP: last inactive check more than PT1.5S ago
> (PT2.00183S), skipping check
> 141211 10:45:28 [Warning] WSREP: last inactive check more than PT1.5S ago
> (PT2.75356S), skipping check
>
> Someone can correct me if I am wrong, but it looks like corosync also
> detected a blip in the network.

Yeah, something interesting is going on with 192.168.0.7. I'm seeing a lot of corosync membership-lost sorts of messages around the time this galera issue happens. Corosync and galera use the same interface, I believe, so seeing that they're both hosed points to a network-related issue.

Dec 11 10:45:28 maca25400702875 corosync[13533]: [MAIN ] Corosync main process was not scheduled for 2244.8887 ms (threshold is 800.0000 ms). Consider token timeout increase.
Dec 11 10:45:28 maca25400702875 corosync[13533]: [TOTEM ] A processor failed, forming new configuration.

Then shortly after:

Dec 11 10:45:32 maca25400702875 corosync[13533]: [MAIN ] Corosync main process was not scheduled for 1134.7559 ms (threshold is 800.0000 ms). Consider token timeout increase.

> Unrelated, but why are there both mysqld.log and mariadb.log? This bug is
> filed against RHOS-5, but the RA is only usable on RHOS-6. I am not sure what

I have a theory. The galera OCF agent defaults to /var/log/mysqld.log, but this is configurable. The galera package is hard-coded to /var/log/mariadb/mariadb.log.

If we set log=/var/log/mariadb/mariadb.log during the 'pcs resource create galera galera log=/var/log/mariadb/mariadb.log wsrep_cluster_address="gcom......' command, the log file will remain consistent.

I could make the galera agent default to /var/log/mariadb/mariadb.log in the future if this causes problems.

> is going on here. Best get one of the pacemaker folks to look at this, as
> well as figure out why the network is getting interrupted.
>
> Also, SSL is failing. I don't see config files, so I have no idea whether
> you have SSL enabled or disabled. This could potentially cause an IST to
> fail, which may be a problem for you.
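For illustration, a hedged sketch of the full create command with the log override. The gcomm:// host list and the meta/op options below are made-up placeholders following the usual pacemaker galera setup, not the exact command staypuft generates:

pcs resource create galera galera \
    log=/var/log/mariadb/mariadb.log \
    wsrep_cluster_address="gcomm://node1.example.com,node2.example.com,node3.example.com" \
    meta master-max=3 ordered=true \
    op promote timeout=300s on-fail=block --master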
Shouldn't the resource agent recover the down node after networking is restored? That seems like the more crucial issue to me. Network issues or out-of-sync issues will happen occasionally, but I would expect that they would be recovered.
(In reply to David Vossel from comment #6)
> (In reply to Ryan O'Hara from comment #3)
> > Looks like this was triggered by some sort of network problem where the
> > galera nodes could not communicate with one another, at least on 192.168.0.7:
> >
> > 141211 10:45:13 [Warning] WSREP: last inactive check more than PT1.5S ago
> > (PT1.88361S), skipping check
> > 141211 10:45:14 [Warning] WSREP: last inactive check more than PT1.5S ago
> > (PT2.00183S), skipping check
> > 141211 10:45:28 [Warning] WSREP: last inactive check more than PT1.5S ago
> > (PT2.75356S), skipping check
> >
> > Someone can correct me if I am wrong, but it looks like corosync also
> > detected a blip in the network.
>
> Yeah, something interesting is going on with 192.168.0.7. I'm seeing a lot of
> corosync membership-lost sorts of messages around the time this galera issue
> happens. Corosync and galera use the same interface, I believe, so seeing
> that they're both hosed points to a network-related issue.

Right. I wonder if neutron is also mucking with this interface.

> Dec 11 10:45:28 maca25400702875 corosync[13533]: [MAIN ] Corosync main
> process was not scheduled for 2244.8887 ms (threshold is 800.0000 ms).
> Consider token timeout increase.
> Dec 11 10:45:28 maca25400702875 corosync[13533]: [TOTEM ] A processor
> failed, forming new configuration.
>
> Then shortly after:
>
> Dec 11 10:45:32 maca25400702875 corosync[13533]: [MAIN ] Corosync main
> process was not scheduled for 1134.7559 ms (threshold is 800.0000 ms).
> Consider token timeout increase.
>
> > Unrelated, but why are there both mysqld.log and mariadb.log? This bug is
> > filed against RHOS-5, but the RA is only usable on RHOS-6. I am not sure what
>
> I have a theory. The galera OCF agent defaults to /var/log/mysqld.log, but
> this is configurable. The galera package is hard-coded to
> /var/log/mariadb/mariadb.log.

Yes, that is the issue. Crag pointed this out in a chat we had this morning. We should really use /var/log/mariadb/mariadb.log for consistency.

> If we set log=/var/log/mariadb/mariadb.log during the 'pcs resource create
> galera galera log=/var/log/mariadb/mariadb.log
> wsrep_cluster_address="gcom......' command, the log file will remain
> consistent.
>
> I could make the galera agent default to /var/log/mariadb/mariadb.log in the
> future if this causes problems.

Agreed. If it's not causing problems, it's just a matter of consistency.

> > is going on here. Best get one of the pacemaker folks to look at this, as
> > well as figure out why the network is getting interrupted.
> >
> > Also, SSL is failing. I don't see config files, so I have no idea whether
> > you have SSL enabled or disabled. This could potentially cause an IST to
> > fail, which may be a problem for you.

David, if 2 of the 3 galera nodes are still up and running, does the RA start mysqld in read-only mode? I'm asking because I want to rule out the issue of grastate.dat getting its seqno set to -1 as a side effect of starting in read-only mode. My assumption is that if there is another galera node already running, there is no need to determine the position, and the RA will just join.
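To make the read-only theory easy to check, a minimal sketch of what I would look at on the node that fell out of sync, assuming the default datadir of /var/lib/mysql:

# saved galera state on the affected node
cat /var/lib/mysql/grastate.dat

# a recoverable state has a real seqno, e.g.:
#   # GALERA saved state
#   version: 2.1
#   uuid:    <cluster state uuid>
#   seqno:   2042
# seqno: -1 would point at the read-only/startup side effect mentioned above
# (the seqno value here is only an example)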
(In reply to Mike Burns from comment #7)
> Shouldn't the resource agent recover the down node after networking is
> restored? That seems like the more crucial issue to me. Network issues or
> out-of-sync issues will happen occasionally, but I would expect that they
> would be recovered.

Sure. Unfortunately the machine where this occurred had been reprovisioned, and the galera.cnf was not included. I've also never seen this happen before, so it is difficult to say whether this is a galera problem, an RA problem, or some combination. I'd really like to see the galera.cnf, since rejoining the cluster will almost certainly result in an IST, which will use SSL if configured. I would like to rule out that galera SSL was misconfigured, but since no galera.cnf was provided this may not be possible.
I am closing this since we've heard nothing more about this.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days