Bug 1299404

Summary: Galera resource agent cannot connect to custom host/port

Product: Red Hat Enterprise Linux 7
Component: resource-agents
Version: 7.3
Reporter: Damien Ciabrini <dciabrin>
Assignee: Damien Ciabrini <dciabrin>
QA Contact: Asaf Hirshberg <ahirshbe>
Docs Contact:
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: agk, ahirshbe, cfeist, cluster-maint, dciabrin, fdinitto, mbayer, mkolaja, oalbrigt, royoung, snagar
Target Milestone: rc
Target Release: ---
Keywords: ZStream
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: resource-agents-3.9.5-56.el7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1304711
Environment:
Last Closed: 2016-11-04 00:00:47 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1304711

Description Damien Ciabrini 2016-01-18 10:05:04 UTC
Description of problem:

The galera resource agent currently targets a local UNIX socket
to connect to mariadb for polling WSREP status. As such, its
connections cannot be distinguished from regular client connections
and are subject to the same limits and behaviour.

This is problematic because the resource agent can fail due to
uncontrolled events, for instance when the maximum number of
connections is reached, or when connections are slow (e.g. when the
mariadb threadpool is engaged). Ultimately such failures can cause
nodes to be fenced.

The galera resource agent should accept custom host/port options, in
order to benefit from mariadb's extra-port and extra-max-connections
features [1].

[1] https://mariadb.com/kb/en/mariadb/thread-pool-in-mariadb/#best-of-both-worlds-running-with-pool-of-threads-and-with-one-thread-per-connection
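
For reference, the extra-port feature pointed to by [1] is enabled on
the mariadb side with a configuration along these lines (a sketch
with illustrative values, not taken from this bug):

    # /etc/my.cnf.d/galera.cnf (illustrative values)
    [mysqld]
    # serve regular clients through the thread pool
    thread_handling=pool-of-threads
    # dedicated port served with one-thread-per-connection, reserved
    # for monitoring clients such as the resource agent
    extra_port=4243
    extra_max_connections=10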

Comment 2 Damien Ciabrini 2016-01-18 10:07:32 UTC
fixed upstream in https://github.com/ClusterLabs/resource-agents/pull/680
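
The gist of the fix: the resource agent reads its connection settings
from /etc/sysconfig/clustercheck, and with the patch it also honours
optional MYSQL_HOST and MYSQL_PORT entries there, passing them to the
mysql client instead of defaulting to the local UNIX socket. A
minimal sketch of the idea (not the literal agent code):

    # sketch only; the real logic lives in ocf:heartbeat:galera
    . /etc/sysconfig/clustercheck
    MYSQL_OPTS="-u$MYSQL_USERNAME -p$MYSQL_PASSWORD"
    # when a host/port is configured, poll over TCP instead of the
    # local UNIX socket
    [ -n "$MYSQL_HOST" ] && MYSQL_OPTS="$MYSQL_OPTS -h $MYSQL_HOST"
    [ -n "$MYSQL_PORT" ] && MYSQL_OPTS="$MYSQL_OPTS -P $MYSQL_PORT"
    mysql $MYSQL_OPTS -e "SHOW STATUS LIKE 'wsrep_local_state';"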

Comment 3 Ofer Blaut 2016-02-02 08:45:23 UTC
Hi

We need instructions on how to test it in OSPd

Thanks

Ofer

Comment 9 Damien Ciabrini 2016-02-05 14:07:42 UTC
Instructions for the test

While the galera cluster is up, create a new user for the test:

    create user 'testuser'@'%' identified by 'test';

Allow the user to log in over a TCP connection:

    grant all privileges on *.* to 'testuser'@'%';

Stop galera.

    pcs resource disable galera

Once stopped, change /etc/my.cnf.d/galera.cnf to have mysql listen on
port 4242 on an available interface:

    bind_address=0.0.0.0
    port=4242

Then configure /etc/sysconfig/clustercheck to make the resource
agent contact mysqld on the tcp port above. Modify the entries so
that the file looks like:

    MYSQL_USERNAME="testuser"
    MYSQL_PASSWORD="test"
    MYSQL_HOST="127.0.0.1"
    MYSQL_PORT="4242"

(do not set MYSQL_HOST to "localhost", otherwise mysql will try to
connect via the UNIX socket)

Start galera again.

    pcs resource enable galera

After all machines are marked as Master, verify that mysqld is
listening on port 4242.

    netstat -tnlp | grep mysqld

If so, the fix works as expected, since the resource agent polls
mysqld regularly to ensure that galera has started properly and is
still running.
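
As an additional check, you can query the WSREP state manually over
the same TCP endpoint, using the test credentials configured above
(a synced node typically reports "Synced"):

    mysql -u testuser -ptest -h 127.0.0.1 -P 4242 \
          -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"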

To clean up, stop galera, reset the two config files to their
original values, and restart galera. Once started, you can delete the
test user:

    drop user 'testuser'@'%';
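
Putting the cleanup together (a sketch; the restore step depends on
your original files):

    pcs resource disable galera
    # restore the original /etc/my.cnf.d/galera.cnf and
    # /etc/sysconfig/clustercheck, then:
    pcs resource enable galera
    mysql -e "drop user 'testuser'@'%';"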

Comment 11 Asaf Hirshberg 2016-03-08 10:32:26 UTC
After following Damien Ciabrini's instructions from comment 9, the end result is that mysql is listening on the given port:
[root@overcloud-controller-0 ~]# netstat -tnlp | grep mysqld
tcp        0      0 0.0.0.0:4242            0.0.0.0:*               LISTEN      20232/mysqld  

All machines were marked as masters, but all openstack resources managed by pacemaker are in a failed state:
Master/Slave Set: galera-master [galera]
     Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Failed Actions:
* openstack-nova-scheduler_start_0 on overcloud-controller-0 'not running' (7): call=493, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:09:18 2016', queued=0ms, exec=134380ms
* openstack-heat-engine_start_0 on overcloud-controller-0 'not running' (7): call=507, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:09:36 2016', queued=0ms, exec=2108ms
* openstack-cinder-scheduler_monitor_60000 on overcloud-controller-0 'not running' (7): call=483, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:12:11 2016', queued=0ms, exec=0ms
* openstack-cinder-api_monitor_60000 on overcloud-controller-0 'not running' (7): call=476, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:12:09 2016', queued=0ms, exec=0ms
* neutron-server_start_0 on overcloud-controller-0 'not running' (7): call=472, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:09:06 2016', queued=0ms, exec=120568ms
* openstack-nova-scheduler_start_0 on overcloud-controller-1 'not running' (7): call=483, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:09:19 2016', queued=0ms, exec=134432ms
* openstack-heat-engine_start_0 on overcloud-controller-1 'not running' (7): call=497, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:09:36 2016', queued=0ms, exec=2122ms
* openstack-cinder-scheduler_monitor_60000 on overcloud-controller-1 'not running' (7): call=475, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:12:11 2016', queued=0ms, exec=0ms
* openstack-cinder-api_monitor_60000 on overcloud-controller-1 'not running' (7): call=468, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:12:09 2016', queued=0ms, exec=0ms
* neutron-server_start_0 on overcloud-controller-1 'not running' (7): call=464, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:09:06 2016', queued=0ms, exec=120717ms
* openstack-nova-scheduler_start_0 on overcloud-controller-2 'not running' (7): call=474, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:09:19 2016', queued=0ms, exec=134380ms
* openstack-heat-engine_start_0 on overcloud-controller-2 'not running' (7): call=488, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:09:36 2016', queued=0ms, exec=2108ms
* openstack-cinder-scheduler_monitor_60000 on overcloud-controller-2 'not running' (7): call=466, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:12:11 2016', queued=0ms, exec=0ms
* openstack-cinder-api_monitor_60000 on overcloud-controller-2 'not running' (7): call=459, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:12:09 2016', queued=0ms, exec=0ms
* neutron-server_start_0 on overcloud-controller-2 'not running' (7): call=455, status=complete, exitreason='none',
    last-rc-change='Tue Mar  8 10:09:06 2016', queued=0ms, exec=120576ms

After reverting the changes, all resources started.
So I want to make sure that this is the expected result of the changes made.

System info:
RHEL-OSP director 8.0 puddle - 2016-03-03.1
[root@overcloud-controller-0 ~]# rpm -qa|grep resource-agents
resource-agents-3.9.5-54.el7_2.6.x86_64

Comment 12 Damien Ciabrini 2016-03-08 10:49:52 UTC
Asaf, I think some steps are missing in #c9.

Could you please check whether haproxy is listening as expected on port 3306 on the VIP, and that the haproxy.cfg setting forwards requests to port 4242, i.e. the config looks similar to the lines below:

  server overcloud-controller-0 192.0.2.21:4242 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
  server overcloud-controller-1 192.0.2.20:4242 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
  server overcloud-controller-2 192.0.2.22:4242 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
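
A quick way to confirm both points (assuming the standard config path
/etc/haproxy/haproxy.cfg):

  # haproxy should still be bound to port 3306 on the VIP
  netstat -tnlp | grep haproxy
  # the mysql backend servers should point at port 4242
  grep -A 12 'listen mysql' /etc/haproxy/haproxy.cfg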

Comment 13 Asaf Hirshberg 2016-03-08 17:18:45 UTC
Verified on RHEL-OSP director 8.0 puddle - 2016-03-03.1
[root@overcloud-controller-0 ~]# rpm -qa|grep resource-agents
resource-agents-3.9.5-54.el7_2.6.x86_64


1) Create a new user in mysql and grant it all privileges:
	i) create user 'testuser'@'%' identified by 'test';
	ii) grant all privileges on *.* to 'testuser'@'%';

2) Stop galera cluster using pacemaker:
	i)   pcs resource disable galera

3) After galera is down, change the port mysql listens on in /etc/my.cnf.d/galera.cnf:
	i) vi /etc/my.cnf.d/galera.cnf
	bind_address=0.0.0.0
	port=4242

4) Then configure /etc/sysconfig/clustercheck to make the resource
   agent contact mysqld on the tcp port above. Modify the entries so
   that the file looks like:
	i) vi /etc/sysconfig/clustercheck
		MYSQL_USERNAME="testuser"
		MYSQL_PASSWORD="test"
		MYSQL_HOST="127.0.0.1"
		MYSQL_PORT="4242"

5) Also change the ports haproxy forwards to in /etc/haproxy/haproxy.cfg, in the mysql section:
	i) vi /etc/haproxy/haproxy.cfg 
		listen mysql
  		bind 172.17.0.11:3306 transparent
  		option tcpka
  		option httpchk
  		stick on dst
  		stick-table type ip size 1000
  		timeout client 90m
  		timeout server 90m
 		server overcloud-controller-0 172.17.0.16:4242 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
  		server overcloud-controller-1 172.17.0.17:4242 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
  		server overcloud-controller-2 172.17.0.15:4242 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2

6) Restart haproxy service:
	i) systemctl restart haproxy
	ii) systemctl status haproxy

7) Start galera cluster using pacemaker:
	i) pcs resource enable galera

8) Verify that the galera cluster is up and the machines are marked as master; also make sure that all openstack resources are started.
	i) pcs status
	ii) netstat -tupln | grep haproxy # haproxy should listen on the old port
	iii) clustercheck
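
For reference, when the node is synced the clustercheck helper
typically replies with an HTTP 200 response along these lines (the
exact wording may vary between versions):

	HTTP/1.1 200 OK
	Content-Type: text/plain

	Galera cluster node is synced.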

Comment 15 errata-xmlrpc 2016-11-04 00:00:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2174.html