Bug 1173370
| Summary: | galera node falls out of sync and resource agent fails to recover it | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Crag Wolfe <cwolfe> |
| Component: | mariadb-galera | Assignee: | Ryan O'Hara <rohara> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Ami Jeain <ajeain> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.0 (Juno) | CC: | cwolfe, dvossel, mbayer, mburns, schuzhoy, sgordon, yeylon |
| Target Milestone: | --- | Keywords: | Unconfirmed, ZStream |
| Target Release: | 6.0 (Juno) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-02-13 17:47:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | galera, mysql, messages, audit logs for 3 controllers (attachment 967449) | | |
Description
Crag Wolfe
2014-12-12 00:43:54 UTC
Created attachment 967449 [details]
galera, mysql, messages, audit logs for 3 controllers
Comment 3
Ryan O'Hara

Looks like this was triggered by some sort of network problem where the galera nodes could not communicate with one another, at least on 192.168.0.7:

141211 10:45:13 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.88361S), skipping check
141211 10:45:14 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.00183S), skipping check
141211 10:45:28 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.75356S), skipping check

Someone can correct me if I am wrong, but it looks like corosync also detected a blip in the network.

Unrelated, but why are there both mysqld.log and mariadb.log? This bug is filed against RHOS-5, but the RA is only usable on RHOS-6. I am not sure what is going on here. Best get one of the pacemaker folks to look at this, as well as figure out why the network is getting interrupted.

Also, SSL is failing. I don't see config files so I have no idea if you have SSL enabled or disabled. This could potentially cause an IST to fail, which may be a problem for you.

Sorry, this was against RHOS-6. I've updated the version.

Would be useful to see galera.cnf from a machine where this happens. As for chasing down the network problem, is there a chance that neutron is using the same interface that galera is using for network communication? Could this be causing the networking hiccup? Just brainstorming here.

Comment 6
David Vossel

(In reply to Ryan O'Hara from comment #3)
> Looks like this was triggered by some sort of network problem where the
> galera nodes could not communicate with one another, at least on 192.168.0.7:
>
> 141211 10:45:13 [Warning] WSREP: last inactive check more than PT1.5S ago
> (PT1.88361S), skipping check
> 141211 10:45:14 [Warning] WSREP: last inactive check more than PT1.5S ago
> (PT2.00183S), skipping check
> 141211 10:45:28 [Warning] WSREP: last inactive check more than PT1.5S ago
> (PT2.75356S), skipping check
>
> Someone can correct me if I am wrong, but it looks like corosync also
> detected a blip in the network.

yeah. something interesting is going on 192.168.0.7. I'm seeing a lot of corosync membership lost sorts of messages around the time this galera issue happens. Corosync and galera use the same interface I believe, so seeing they're both hosed points to a network related issue.

Dec 11 10:45:28 maca25400702875 corosync[13533]: [MAIN ] Corosync main process was not scheduled for 2244.8887 ms (threshold is 800.0000 ms). Consider token timeout increase.
Dec 11 10:45:28 maca25400702875 corosync[13533]: [TOTEM ] A processor failed, forming new configuration.

then shortly after.

Dec 11 10:45:32 maca25400702875 corosync[13533]: [MAIN ] Corosync main process was not scheduled for 1134.7559 ms (threshold is 800.0000 ms). Consider token timeout increase.

> Unrelated, but why are there both mysqld.log and mariadb.log? This bug is
> filed against RHOS-5, but the RA is only usable on RHOS-6. I am not sure what

I have a theory. The galera OCF agent defaults to /var/log/mysqld.log, but this is configurable. The galera package is hard coded to /var/log/mariadb/mariadb.log. If we set the log=/var/log/mariadb/mariadb.log during the 'pcs resource create galera galera log=/var/log/mariadb/mariadb.log wsrep_cluster_address="gcom......' command, the log file will remain consistent. I could make the galera agent default to /var/log/mariadb/mariadb.log in the future if this causes problems.

> is going on here. Best get one of the pacemaker folks to look at this, as
> well as figure out why the network is getting interrupted.
>
> Also, SSL is failing. I don't see config files so I have no idea if you have
> SSL enabled or disabled. This could potentially cause an IST to fail, which
> may be a problem for you.
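For illustration only, the full create command David describes would look roughly like this; the controller hostnames in the cluster address are placeholders, and additional resource options (e.g. master-max meta attributes, operation timeouts) are omitted from this sketch:

    # Sketch: explicitly pointing the galera OCF agent at the mariadb log file.
    # Hostnames are placeholders; adjust to the actual controller nodes.
    pcs resource create galera galera \
        log=/var/log/mariadb/mariadb.log \
        wsrep_cluster_address="gcomm://controller1,controller2,controller3" \
        --master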
Comment 7
Mike Burns

Shouldn't the resource agent recover the down node after networking is restored? That seems like the more crucial issue to me. Network issues or out-of-sync issues will happen occasionally, but I would expect that they would be recovered.

(In reply to David Vossel from comment #6)
> (In reply to Ryan O'Hara from comment #3)
> > Looks like this was triggered by some sort of network problem where the
> > galera nodes could not communicate with one another, at least on 192.168.0.7:
> >
> > 141211 10:45:13 [Warning] WSREP: last inactive check more than PT1.5S ago
> > (PT1.88361S), skipping check
> > 141211 10:45:14 [Warning] WSREP: last inactive check more than PT1.5S ago
> > (PT2.00183S), skipping check
> > 141211 10:45:28 [Warning] WSREP: last inactive check more than PT1.5S ago
> > (PT2.75356S), skipping check
> >
> > Someone can correct me if I am wrong, but it looks like corosync also
> > detected a blip in the network.
>
> yeah. something interesting is going on 192.168.0.7. I'm seeing a lot of
> corosync membership lost sorts of messages around the time this galera issue
> happens. Corosync and galera use the same interface I believe, so seeing
> they're both hosed points to a network related issue.

Right. I wonder if neutron is also mucking with this interface.

> Dec 11 10:45:28 maca25400702875 corosync[13533]: [MAIN ] Corosync main
> process was not scheduled for 2244.8887 ms (threshold is 800.0000 ms).
> Consider token timeout increase.
> Dec 11 10:45:28 maca25400702875 corosync[13533]: [TOTEM ] A processor
> failed, forming new configuration.
>
> then shortly after.
>
> Dec 11 10:45:32 maca25400702875 corosync[13533]: [MAIN ] Corosync main
> process was not scheduled for 1134.7559 ms (threshold is 800.0000 ms).
> Consider token timeout increase.
>
> > Unrelated, but why are there both mysqld.log and mariadb.log? This bug is
> > filed against RHOS-5, but the RA is only usable on RHOS-6. I am not sure what
>
> I have a theory. The galera OCF agent defaults to /var/log/mysqld.log, but
> this is configurable. The galera package is hard coded to
> /var/log/mariadb/mariadb.log

Yes, that is the issue. Crag pointed this out in a chat we had this morning. We should really use /var/log/mariadb/mariadb.log for consistency.

> If we set the log=/var/log/mariadb/mariadb.log during the 'pcs resource
> create galera galera log=/var/log/mariadb/mariadb.log
> wsrep_cluster_address="gcom......' command, the log file will remain
> consistent.
>
> I could make the galera agent default to /var/log/mariadb/mariadb.log in the
> future if this causes problems.

Agreed. If it's not causing problems, it's just a matter of consistency.

> > is going on here. Best get one of the pacemaker folks to look at this, as
> > well as figure out why the network is getting interrupted.
> >
> > Also, SSL is failing. I don't see config files so I have no idea if you have
> > SSL enabled or disabled. This could potentially cause an IST to fail, which
> > may be a problem for you.

David, if 2 of the 3 galera nodes are still up and running, does the RA start mysqld in read-only mode? I'm asking because I want to rule out the issue of grastate.dat getting its seqno set to -1 as a side-effect of starting in read-only. My assumption is that if there is another galera node already running, there is no need to determine the position, and the RA will just join.
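For context on the grastate.dat question: a Galera node records its last committed position in grastate.dat (normally under the datadir, e.g. /var/lib/mysql/grastate.dat). A sketch of its contents with an illustrative uuid; a seqno of -1 means the node has no usable recorded position (for example while mysqld is running or after an unclean shutdown), so the position would have to be recovered another way:

    # GALERA saved state (illustrative values)
    version: 2.1
    uuid:    ac1efb7b-0000-0000-0000-000000000000
    seqno:   -1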
(In reply to Mike Burns from comment #7)
> Shouldn't the resource agent recover the down node after networking is
> restored? That seems like the more crucial issue to me. Network issues or
> out-of-sync issues will happen occasionally, but I would expect that they
> would be recovered.

Sure. Unfortunately the machine where this occurred had been reprovisioned and the galera.cnf was not included. I've also never seen this happen before, so it is difficult to say whether this is a galera problem, an RA problem, or some combination. I'd really like to see the galera.cnf, since rejoining the cluster will almost certainly result in an IST, which will use SSL if configured. I would like to rule out that galera SSL was misconfigured, but since no galera.cnf was provided this may not be possible.

I am closing this since we've heard nothing more about this.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
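For reference on the SSL point that could not be ruled out: Galera only uses SSL for replication and IST if it is enabled in galera.cnf through wsrep_provider_options. A minimal sketch with placeholder certificate paths (the actual galera.cnf from this deployment was never provided, so this is illustrative only):

    [mysqld]
    wsrep_provider = /usr/lib64/galera/libgalera_smm.so
    # socket.ssl_* options enable TLS for group communication and IST;
    # the certificate paths below are placeholders.
    wsrep_provider_options = "socket.ssl_key=/etc/pki/galera/server-key.pem;socket.ssl_cert=/etc/pki/galera/server-cert.pem;socket.ssl_ca=/etc/pki/galera/ca.pem"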