Bug 1173370
| Summary: | galera node falls out of sync and resource agent fails to recover it | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Crag Wolfe <cwolfe> |
| Component: | mariadb-galera | Assignee: | Ryan O'Hara <rohara> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Ami Jeain <ajeain> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.0 (Juno) | CC: | cwolfe, dvossel, mbayer, mburns, schuzhoy, sgordon, yeylon |
| Target Milestone: | --- | Keywords: | Unconfirmed, ZStream |
| Target Release: | 6.0 (Juno) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-02-13 17:47:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | galera, mysql, messages, audit logs for 3 controllers (attachment 967449) | | |
Description
Crag Wolfe
2014-12-12 00:43:54 UTC
Created attachment 967449 [details]
galera, mysql, messages, audit logs for 3 controllers
Comment 3
Ryan O'Hara

Looks like this was triggered by some sort of network problem where the galera nodes could not communicate with one another, at least on 192.168.0.7:

141211 10:45:13 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.88361S), skipping check
141211 10:45:14 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.00183S), skipping check
141211 10:45:28 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.75356S), skipping check

Someone can correct me if I am wrong, but it looks like corosync also detected a blip in the network.

Unrelated, but why are there both mysqld.log and mariadb.log? This bug is filed against RHOS-5, but the RA is only usable on RHOS-6. I am not sure what is going on here. Best get one of the pacemaker folks to look at this, as well as figure out why the network is getting interrupted.

Also, SSL is failing. I don't see config files so I have no idea if you have SSL enabled or disabled. This could potentially cause an IST to fail, which may be a problem for you.

Sorry, this was against RHOS-6. I've updated the version.

Would be useful to see galera.cnf from a machine where this happens. As for chasing down the network problem, is there a chance that neutron is using the same interface that galera is using for network communication? Could this be causing the networking hiccup? Just brainstorming here.

Comment 6
David Vossel

(In reply to Ryan O'Hara from comment #3)
> Looks like this was triggered by some sort of network problem where the
> galera nodes could not communicate with one another, at least on 192.168.0.7:
>
> 141211 10:45:13 [Warning] WSREP: last inactive check more than PT1.5S ago
> (PT1.88361S), skipping check
> 141211 10:45:14 [Warning] WSREP: last inactive check more than PT1.5S ago
> (PT2.00183S), skipping check
> 141211 10:45:28 [Warning] WSREP: last inactive check more than PT1.5S ago
> (PT2.75356S), skipping check
>
> Someone can correct me if I am wrong, but it looks like corosync also
> detected a blip in the network.

yeah. something interesting is going on 192.168.0.7. I'm seeing a lot of corosync membership lost sorts of messages around the time this galera issue happens. Corosync and galera use the same interface I believe, so seeing they're both hosed points to a network related issue.

Dec 11 10:45:28 maca25400702875 corosync[13533]: [MAIN ] Corosync main process was not scheduled for 2244.8887 ms (threshold is 800.0000 ms). Consider token timeout increase.
Dec 11 10:45:28 maca25400702875 corosync[13533]: [TOTEM ] A processor failed, forming new configuration.

then shortly after.

Dec 11 10:45:32 maca25400702875 corosync[13533]: [MAIN ] Corosync main process was not scheduled for 1134.7559 ms (threshold is 800.0000 ms). Consider token timeout increase.

> Unrelated, but why are there both mysqld.log and mariadb.log? This bug is
> filed against RHOS-5, but the RA is only usable on RHOS-6. I am not sure what

I have a theory. The galera OCF agent defaults to /var/log/mysqld.log, but this is configurable. The galera package is hard coded to /var/log/mariadb/mariadb.log. If we set the log=/var/log/mariadb/mariadb.log during the 'pcs resource create galera galera log=/var/log/mariadb/mariadb.log wsrep_cluster_address="gcom......' command, the log file will remain consistent. I could make the galera agent default to /var/log/mariadb/mariadb.log in the future if this causes problems.

> is going on here. Best get one of the pacemaker folks to look at this, as
> well as figure out why the network is getting interrupted.
>
> Also, SSL is failing. I don't see config files so I have no idea if you have
> SSL enabled or disabled. This could potentially cause an IST to fail, which
> may be a problem for you.
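For illustration only, the full create command David describes would look roughly like this; the controller hostnames in the cluster address are placeholders, and additional resource options (e.g. master-max meta attributes, operation timeouts) are omitted from this sketch:

    # Sketch: explicitly pointing the galera OCF agent at the mariadb log file.
    # Hostnames are placeholders; adjust to the actual controller nodes.
    pcs resource create galera galera \
        log=/var/log/mariadb/mariadb.log \
        wsrep_cluster_address="gcomm://controller1,controller2,controller3" \
        --master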
Comment 7
Mike Burns

Shouldn't the resource agent recover the down node after networking is restored? That seems like the more crucial issue to me. Network issues or out-of-sync issues will happen occasionally, but I would expect that they would be recovered.

(In reply to David Vossel from comment #6)
> (In reply to Ryan O'Hara from comment #3)
> > Looks like this was triggered by some sort of network problem where the
> > galera nodes could not communicate with one another, at least on 192.168.0.7:
> >
> > 141211 10:45:13 [Warning] WSREP: last inactive check more than PT1.5S ago
> > (PT1.88361S), skipping check
> > 141211 10:45:14 [Warning] WSREP: last inactive check more than PT1.5S ago
> > (PT2.00183S), skipping check
> > 141211 10:45:28 [Warning] WSREP: last inactive check more than PT1.5S ago
> > (PT2.75356S), skipping check
> >
> > Someone can correct me if I am wrong, but it looks like corosync also
> > detected a blip in the network.
>
> yeah. something interesting is going on 192.168.0.7. I'm seeing a lot of
> corosync membership lost sorts of messages around the time this galera issue
> happens. Corosync and galera use the same interface I believe, so seeing
> they're both hosed points to a network related issue.

Right. I wonder if neutron is also mucking with this interface.

> Dec 11 10:45:28 maca25400702875 corosync[13533]: [MAIN ] Corosync main
> process was not scheduled for 2244.8887 ms (threshold is 800.0000 ms).
> Consider token timeout increase.
> Dec 11 10:45:28 maca25400702875 corosync[13533]: [TOTEM ] A processor
> failed, forming new configuration.
>
> then shortly after.
>
> Dec 11 10:45:32 maca25400702875 corosync[13533]: [MAIN ] Corosync main
> process was not scheduled for 1134.7559 ms (threshold is 800.0000 ms).
> Consider token timeout increase.
>
> > Unrelated, but why are there both mysqld.log and mariadb.log? This bug is
> > filed against RHOS-5, but the RA is only usable on RHOS-6. I am not sure what
>
> I have a theory. The galera OCF agent defaults to /var/log/mysqld.log, but
> this is configurable. The galera package is hard coded to
> /var/log/mariadb/mariadb.log

Yes, that is the issue. Crag pointed this out in a chat we had this morning. We should really use /var/log/mariadb/mariadb.log for consistency.

> If we set the log=/var/log/mariadb/mariadb.log during the 'pcs resource
> create galera galera log=/var/log/mariadb/mariadb.log
> wsrep_cluster_address="gcom......' command, the log file will remain
> consistent.
>
> I could make the galera agent default to /var/log/mariadb/mariadb.log in the
> future if this causes problems.

Agreed. If it's not causing problems, it's just a matter of consistency.

> > is going on here. Best get one of the pacemaker folks to look at this, as
> > well as figure out why the network is getting interrupted.
> >
> > Also, SSL is failing. I don't see config files so I have no idea if you have
> > SSL enabled or disabled. This could potentially cause an IST to fail, which
> > may be a problem for you.

David, if 2 of the 3 galera nodes are still up and running, does the RA start mysqld in read-only mode? I'm asking because I want to rule out the issue of grastate.dat getting its seqno set to -1 as a side-effect of starting in read-only. My assumption is that if there is another galera node already running, there is no need to determine the position, and the RA will just join.
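For context on the grastate.dat question: a Galera node records its last committed position in grastate.dat (normally under the datadir, e.g. /var/lib/mysql/grastate.dat). A sketch of its contents with an illustrative uuid; a seqno of -1 means the node has no usable recorded position (for example while mysqld is running or after an unclean shutdown), so the position would have to be recovered another way:

    # GALERA saved state (illustrative values)
    version: 2.1
    uuid:    ac1efb7b-0000-0000-0000-000000000000
    seqno:   -1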
(In reply to Mike Burns from comment #7)
> Shouldn't the resource agent recover the down node after networking is
> restored? That seems like the more crucial issue to me. Network issues or
> out-of-sync issues will happen occasionally, but I would expect that they
> would be recovered.

Sure. Unfortunately the machine where this occurred had been reprovisioned and the galera.cnf was not included. I've also never seen this happen before, so it is difficult to say whether this is a galera problem, an RA problem, or some combination. I'd really like to see the galera.cnf, since rejoining the cluster will almost certainly result in an IST, which will use SSL if configured. I would like to rule out that galera SSL was misconfigured, but since no galera.cnf was provided this may not be possible.

I am closing this since we've heard nothing more about this.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
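For reference on the SSL point that could not be ruled out: Galera only uses SSL for replication and IST if it is enabled in galera.cnf through wsrep_provider_options. A minimal sketch with placeholder certificate paths (the actual galera.cnf from this deployment was never provided, so this is illustrative only):

    [mysqld]
    wsrep_provider = /usr/lib64/galera/libgalera_smm.so
    # socket.ssl_* options enable TLS for group communication and IST;
    # the certificate paths below are placeholders.
    wsrep_provider_options = "socket.ssl_key=/etc/pki/galera/server-key.pem;socket.ssl_cert=/etc/pki/galera/server-cert.pem;socket.ssl_ca=/etc/pki/galera/ca.pem"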