Bug 1299404 - Galera resource agent cannot connect to custom host/port
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: resource-agents
Version: 7.3
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Damien Ciabrini
QA Contact: Asaf Hirshberg
Keywords: ZStream
Depends On:
Blocks: 1304711
Reported: 2016-01-18 05:05 EST by Damien Ciabrini
Modified: 2016-11-03 20:00 EDT
CC: 11 users

See Also:
Fixed In Version: resource-agents-3.9.5-56.el7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1304711
Environment:
Last Closed: 2016-11-03 20:00:47 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Damien Ciabrini 2016-01-18 05:05:04 EST
Description of problem:

The galera resource agent currently targets a local UNIX socket
to connect to mariadb for polling WSREP status. As such, its
connections cannot be distinguished from regular client connections
and are subject to the same limits and behaviour.

This is problematic because the resource agent can fail due to
uncontrolled events, for instance when the maximum number of
connections is reached or when connections are slow (e.g. when the
mariadb threadpool is engaged). Ultimately such failures can cause
nodes to be fenced.

The galera resource agent should accept custom host/port options, in
order to benefit from mariadb's extra-port and extra-max-connections
settings [1].

[1] https://mariadb.com/kb/en/mariadb/thread-pool-in-mariadb/#best-of-both-worlds-running-with-pool-of-threads-and-with-one-thread-per-connection
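A minimal sketch of the idea in shell (illustrative only; build_mysql_cmd and its arguments are hypothetical names, not the agent's actual code or variables): when a host/port pair is configured, the status poll goes over TCP, where mariadb's dedicated extra-port listener can still accept it even if regular client connections are exhausted.

```shell
# Illustrative sketch -- build_mysql_cmd is a hypothetical helper,
# not the resource agent's real implementation.
build_mysql_cmd() {
    local host="$1" port="$2"
    local cmd="mysql -nNE"
    if [ -n "$host" ]; then
        # TCP connection: can target mariadb's dedicated extra-port
        cmd="$cmd -h $host"
        if [ -n "$port" ]; then
            cmd="$cmd -P $port"
        fi
    fi
    # The agent polls WSREP status with a query along these lines
    printf '%s\n' "$cmd -e \"SHOW STATUS LIKE 'wsrep_local_state';\""
}

build_mysql_cmd                  # no host/port: local UNIX socket
build_mysql_cmd 127.0.0.1 4242   # custom host/port: TCP
```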
Comment 2 Damien Ciabrini 2016-01-18 05:07:32 EST
fixed upstream in https://github.com/ClusterLabs/resource-agents/pull/680
Comment 3 Ofer Blaut 2016-02-02 03:45:23 EST
Hi

We need instructions on how to test this in OSPd

Thanks

Ofer
Comment 9 Damien Ciabrini 2016-02-05 09:07:42 EST
Instructions for testing

While the galera cluster is up, create a new user for the test:

    create user 'testuser'@'%' identified by 'test';

Allow the user to log in over a TCP socket:

    grant all privileges on *.* to 'testuser'@'%';

Stop galera.

    pcs resource disable galera

Once stopped, change /etc/my.cnf.d/galera.cnf to have mysqld listen
on port 4242 on an available interface:

    bind_address=0.0.0.0
    port=4242

Then configure /etc/sysconfig/clustercheck to make the resource
agent contact mysqld on the tcp port above. Modify the entries so
that the file looks like:

    MYSQL_USERNAME="testuser"
    MYSQL_PASSWORD="test"
    MYSQL_HOST="127.0.0.1"
    MYSQL_PORT="4242"

(don't set MYSQL_HOST to "localhost", otherwise mysql will try to
connect via the UNIX socket)

Start galera again.

    pcs resource enable galera

After all machines are marked as Master, verify that mysqld is
listening on port 4242:

    netstat -tnlp | grep mysqld

If so, the test is working as expected since the resource agent polls
mysqld regularly to ensure that galera has started properly and is
still running.

To clean up, stop galera, reset the two config files to their
original values, and restart galera. Once started, you can delete the
test user:

    drop user 'testuser'@'%';
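For reference, the config edits in the steps above can also be scripted with a small helper instead of hand-editing; set_opt is purely illustrative (not part of the agent or clustercheck):

```shell
# set_opt FILE KEY VALUE: replace an existing KEY=VALUE line or append one.
# Hypothetical helper mirroring the manual edits described above.
set_opt() {
    local file="$1" key="$2" value="$3"
    if grep -q "^${key}=" "$file"; then
        sed -i "s|^${key}=.*|${key}=${value}|" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Example (paths as in the instructions above):
# set_opt /etc/my.cnf.d/galera.cnf bind_address 0.0.0.0
# set_opt /etc/my.cnf.d/galera.cnf port 4242
# set_opt /etc/sysconfig/clustercheck MYSQL_HOST '"127.0.0.1"'
# set_opt /etc/sysconfig/clustercheck MYSQL_PORT '"4242"'
```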
Comment 11 Asaf Hirshberg 2016-03-08 05:32:26 EST
After following Damien Ciabrini's instructions in comment 9, the end result is that mysql is listening on the given port:
[root@overcloud-controller-0 ~]# netstat -tnlp | grep mysqld
tcp        0      0 0.0.0.0:4242            0.0.0.0:*               LISTEN      20232/mysqld  

All machines were marked as Master, but all openstack resources managed by pacemaker are in a failed state:
Master/Slave Set: galera-master [galera]
     Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Failed Actions:
* openstack-nova-scheduler_start_0 on overcloud-controller-0 'not running' (7): call=493, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:09:18 2016', queued=0ms, exec=134380ms
* openstack-heat-engine_start_0 on overcloud-controller-0 'not running' (7): call=507, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:09:36 2016', queued=0ms, exec=2108ms
* openstack-cinder-scheduler_monitor_60000 on overcloud-controller-0 'not running' (7): call=483, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:12:11 2016', queued=0ms, exec=0ms
* openstack-cinder-api_monitor_60000 on overcloud-controller-0 'not running' (7): call=476, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:12:09 2016', queued=0ms, exec=0ms
* neutron-server_start_0 on overcloud-controller-0 'not running' (7): call=472, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:09:06 2016', queued=0ms, exec=120568ms
* openstack-nova-scheduler_start_0 on overcloud-controller-1 'not running' (7): call=483, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:09:19 2016', queued=0ms, exec=134432ms
* openstack-heat-engine_start_0 on overcloud-controller-1 'not running' (7): call=497, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:09:36 2016', queued=0ms, exec=2122ms
* openstack-cinder-scheduler_monitor_60000 on overcloud-controller-1 'not running' (7): call=475, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:12:11 2016', queued=0ms, exec=0ms
* openstack-cinder-api_monitor_60000 on overcloud-controller-1 'not running' (7): call=468, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:12:09 2016', queued=0ms, exec=0ms
* neutron-server_start_0 on overcloud-controller-1 'not running' (7): call=464, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:09:06 2016', queued=0ms, exec=120717ms
* openstack-nova-scheduler_start_0 on overcloud-controller-2 'not running' (7): call=474, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:09:19 2016', queued=0ms, exec=134380ms
* openstack-heat-engine_start_0 on overcloud-controller-2 'not running' (7): call=488, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:09:36 2016', queued=0ms, exec=2108ms
* openstack-cinder-scheduler_monitor_60000 on overcloud-controller-2 'not running' (7): call=466, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:12:11 2016', queued=0ms, exec=0ms
* openstack-cinder-api_monitor_60000 on overcloud-controller-2 'not running' (7): call=459, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:12:09 2016', queued=0ms, exec=0ms
* neutron-server_start_0 on overcloud-controller-2 'not running' (7): call=455, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:09:06 2016', queued=0ms, exec=120576ms

After reverting the changes, all resources started.
So I want to make sure that this is the expected result of the changes made.

System info:
RHEL-OSP director 8.0 puddle - 2016-03-03.1
[root@overcloud-controller-0 ~]# rpm -qa|grep resource-agents
resource-agents-3.9.5-54.el7_2.6.x86_64
Comment 12 Damien Ciabrini 2016-03-08 05:49:52 EST
Asaf, I think some steps are missing in #c9.

Could you please check whether haproxy is listening as expected on port 3306 on the VIP, and that the haproxy.cfg settings forward requests to port 4242, i.e. the config looks similar to the lines below:

  server overcloud-controller-0 192.0.2.21:4242 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
  server overcloud-controller-1 192.0.2.20:4242 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
  server overcloud-controller-2 192.0.2.22:4242 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
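That check can also be done mechanically; the awk sketch below is illustrative (check_backend_port is a hypothetical helper, assuming the haproxy.cfg layout shown above):

```shell
# check_backend_port CFG PORT: succeed iff every "server" line in the
# "listen mysql" section of CFG forwards to PORT. Illustrative helper,
# not part of any shipped tooling.
check_backend_port() {
    awk -v want="$2" '
        $1 == "listen" { in_mysql = ($2 == "mysql") }
        in_mysql && $1 == "server" {
            split($3, a, ":")
            if (a[2] != want) bad = 1
        }
        END { exit bad }
    ' "$1"
}

# e.g. check_backend_port /etc/haproxy/haproxy.cfg 4242 && echo "backends ok"
```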
Comment 13 Asaf Hirshberg 2016-03-08 12:18:45 EST
Verified on RHEL-OSP director 8.0 puddle - 2016-03-03.1
[root@overcloud-controller-0 ~]# rpm -qa|grep resource-agents
resource-agents-3.9.5-54.el7_2.6.x86_64


1) Create a new user in mysql and grant it all privileges:
	i) create user 'testuser'@'%' identified by 'test';
	ii) grant all privileges on *.* to 'testuser'@'%';

2) Stop galera cluster using pacemaker:
	i)   pcs resource disable galera

3) After galera is down, change the port mysqld listens on in /etc/my.cnf.d/galera.cnf:
	i) vi /etc/my.cnf.d/galera.cnf :
	bind_address=0.0.0.0
	port=4242

4) Then configure /etc/sysconfig/clustercheck to make the resource
   agent contact mysqld on the tcp port above. Modify the entries so
   that the file looks like:
	i) vi /etc/sysconfig/clustercheck
		MYSQL_USERNAME="testuser"
    		MYSQL_PASSWORD="test"
    		MYSQL_HOST="127.0.0.1"
    		MYSQL_PORT="4242"

5) Also change the ports haproxy forwards to in the mysql section of /etc/haproxy/haproxy.cfg:
	i) vi /etc/haproxy/haproxy.cfg 
		listen mysql
  		bind 172.17.0.11:3306 transparent
  		option tcpka
  		option httpchk
  		stick on dst
  		stick-table type ip size 1000
  		timeout client 90m
  		timeout server 90m
 		server overcloud-controller-0 172.17.0.16:4242 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
  		server overcloud-controller-1 172.17.0.17:4242 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
  		server overcloud-controller-2 172.17.0.15:4242 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2

6) Restart haproxy service:
	i) systemctl restart haproxy
	ii) systemctl status haproxy

7) Start galera cluster using pacemaker:
	i) pcs resource enable galera

8) Verify that the galera cluster is up and machines are marked as Master; also make sure that all openstack resources are started.
	i) pcs status
	ii) netstat -tupln |grep haproxy # haproxy should listen on the old port
	iii) clustercheck
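The listener checks in step 8 can also be expressed as a small filter over `netstat -tnlp`-style output (illustrative; listens_on is a hypothetical helper assuming the column layout shown in comment 11):

```shell
# listens_on PROG PORT: reads netstat -tnlp output on stdin; succeed
# iff some LISTEN line for PROG has a local address ending in :PORT.
# Illustrative helper, not part of any shipped tooling.
listens_on() {
    awk -v prog="$1" -v port="$2" '
        $6 == "LISTEN" && $NF ~ prog && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}

# e.g.:
# netstat -tnlp | listens_on haproxy 3306   # haproxy still on the old port
# netstat -tnlp | listens_on mysqld 4242    # mysqld on the new port
```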
Comment 15 errata-xmlrpc 2016-11-03 20:00:47 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2174.html
