Bug 1817036

Summary: Cleanup race with wsrep_sst_rsync_tunnel may prevent full galera cluster bootstrap

Product: Red Hat Enterprise Linux 8
Component: mariadb
Version: 8.0
Target Release: 8.0
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: unspecified
Keywords: ZStream
Status: CLOSED CURRENTRELEASE
Last Closed: 2021-01-15 15:41:45 UTC
Type: Bug
Reporter: Damien Ciabrini <dciabrin>
Assignee: Michal Schorm <mschorm>
QA Contact: Lukáš Zachar <lzachar>
CC: databases-maint, dciabrin, hhorak, michele, mschorm
Clones: 1899021, 1899022, 1899024, 1899025
Bug Blocks: 1856699, 1899021, 1899022, 1899024, 1899025

Description Damien Ciabrini 2020-03-25 13:08:57 UTC
Description of problem:
When Galera is configured to use TLS, a node can request a full synchronization from any node in the cluster via rsync encapsulated in a socat tunnel. This is implemented by wsrep_sst_rsync_tunnel [1]. The node requesting a full synchronization is called a joiner; the node sending the data over rsync is called the donor.

Currently, when a donor finishes transferring data via rsync over socat, it signals success to the galera server, and tears down the socat tunnel afterwards.

Donor jobs happen sequentially, but when a node in a galera cluster acts as a donor for two other nodes in a row, it may initiate the second donor job before the first job has had a chance to tear down its socat tunnel. When that happens, the first job unexpectedly kills the socat process spawned by the second job, and the transfer fails.
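To make the race concrete, here is a minimal shell sketch of the cleanup pattern involved (function and variable names are illustrative, not the literal code of wsrep_sst_rsync_tunnel; see [1] for the real script):

       # Racy cleanup (sketch): the donor job spawns a tunnel, but its cleanup
       # kills any socat it can find, including one started by a newer job.
       socat openssl-listen:4444,fork,cert=/tls/mysql.pem tcp:localhost:4445 &
       cleanup_racy() {
           pkill -f '^socat'   # also kills the second job's freshly spawned tunnel
       }

       # Scoped cleanup (sketch): remember the PID of the tunnel this job
       # started and kill only that process.
       socat openssl-listen:4444,fork,cert=/tls/mysql.pem tcp:localhost:4445 &
       TUNNEL_PID=$!
       cleanup_scoped() {
           kill "$TUNNEL_PID" 2>/dev/null
       }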

Version-Release number of selected component (if applicable):
mariadb-10.3.17-1.module+el8.1.0+3974+90eded84.x86_64

How reproducible:
Randomly, although one can add a 'sleep' in wsrep_sst_rsync_tunnel to exhibit the race and make the transfer fail consistently.

Steps to Reproduce:
1. Deploy a 3-node galera cluster with TLS encryption configured in galera.
2. Wait for the first node to bootstrap a new galera cluster.
3. Wait for the first node to serve as a donor for the remaining 2 joining nodes (the SST progression can be followed as shown below).
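For reference, the SST progression can be followed with the standard Galera status variables (nothing specific to this bug, just a convenient check):

       # expect 'Donor/Desynced' on the donor while it feeds a joiner, and
       # 'Joined'/'Synced' on the joiners as they catch up
       mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"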

Actual results:
Sometimes one of the joiner nodes will fail to synchronize because the socat tunnel used to transfer the data gets killed.

Expected results:
Transfer should always succeed.

Additional info:
I'm using ra-tester [2] to spin up 3 VMs and set up a pristine 3-node galera cluster with TLS encryption, so I can help verify the bz quickly if needed.


[1] https://github.com/dciabrin/wsrep_sst_rsync_tunnel
[2] https://github.com/dciabrin/ra-tester

Comment 4 Michal Schorm 2020-04-16 13:55:05 UTC
Hi,
Here's the test build for you:

http://people.redhat.com/mschorm/.rhbz-1817036/

Please let us know if that works for you (both the build itself and the way I gave you access to it)

Comment 5 Damien Ciabrini 2020-05-28 09:16:34 UTC
Hey Michal, sorry I overlooked the answer...

> Please let us know if that works for you (both the build itself and the way
> I gave you access to it)

Thanks for the repo, that made my life easier.
The build works just fine, thank you. Here's how I validated it:

1. start from an empty galera database, and configure galera to use TLS and the tunnelled SST script wsrep_sst_rsync_tunnel:

wsrep_provider_options = gmcast.listen_addr=tcp://172.17.54.218:4567;socket.ssl_key=/tls/mysql.key;socket.ssl_cert=/tls/mysql.crt;socket.ssl_ca=/tls/all-mysql.crt;
wsrep_sst_method=rsync_tunnel
[sst]
tca=/tls/all-mysql.crt
tcert=/tls/mysql.pem
sockopt="verify=1"
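Conceptually, the tunnelling boils down to a socat on each side bridging a local TCP port to the TLS connection between donor and joiner, so rsync itself only ever talks to localhost. A rough sketch (the ports are placeholders; the real port handling lives in wsrep_sst_rsync_tunnel [1]):

       # joiner side (sketch): terminate TLS and hand off to the local rsync daemon
       socat openssl-listen:4444,fork,cert=/tls/mysql.pem,cafile=/tls/all-mysql.crt,verify=1 \
             tcp:localhost:4445 &

       # donor side (sketch): expose a local port that tunnels out to the joiner over TLS
       socat tcp-listen:4445,bind=localhost,fork \
             openssl-connect:joiner:4444,cert=/tls/mysql.pem,cafile=/tls/all-mysql.crt,verify=1 &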


2. confirm that the cluster starts and SST runs properly

3. stop the cluster, and tweak the wsrep_sst_rsync_tunnel script to add an additional delay that keeps the first donor job alive long enough to hit the race; this reproduced the issue 100% of the time before the fix:

       [...]
       echo "done $STATE"
 >>>   sleep 8
       [...]

4. clean grastate.dat and restart the cluster to force SST at bootstrap:

rm -rf /var/lib/mysql/grastate.dat
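To confirm both joiners complete their SST after the restart, a basic sanity check (assuming the 3-node cluster above):

       # expect 3 once both joiners have finished synchronizing
       mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"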



I confirm the cluster restarts just fine, so the problem is fixed.