Bug 1269652
| Summary: | MySQL state database monitoring | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | David Hill <dhill> |
| Component: | mariadb-galera | Assignee: | Damien Ciabrini <dciabrin> |
| Status: | CLOSED WORKSFORME | QA Contact: | Udi Shkalim <ushkalim> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 5.0 (RHEL 6) | CC: | dhill, mbayer, srevivo |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | 5.0 (RHEL 6) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-11-07 19:56:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
David Hill
2015-10-07 20:39:53 UTC
The events here are fairly simple in that there are database connectivity issues occurring very frequently. I don't know the details of this setup, but one common misconfiguration that causes this is incorrectly set HAProxy timeouts, such that application-pooled database connections, which expect to remain open for 60 minutes, are cut off by a 90-second HAProxy timeout. We'd like to see the haproxy config being used here, as there may be a simple cause for these database connectivity issues. (A sketch of the relevant haproxy.cfg timeout settings is included at the end of this comment.)

They had to bring down the whole cluster because it was being upgraded. When a minor update is applied on a cluster, should they bring the whole cluster down, or is it safe to apply it while the cluster is online?

Well, that depends a lot on what they are upgrading. If the case here is that they shut off the DB to do some upgrades and the services were returning errors during that time, that's not a big deal, as long as the services successfully came back online when the databases were restarted.

Well, they added two hosts to the cluster and brought it down on purpose in order to see what would happen to the various RHOS services. Do we have any kind of monitoring scripts, health checks, or guidance on what to monitor that we could provide to the customer?

The basic monitoring done on an OSP / Galera setup is done by Pacemaker. Running "pcs status" will show the state of the Galera cluster:
```
[root@rhel7-1 ~]# pcs status
Cluster name: rhos6
Last updated: Tue Oct 20 11:57:23 2015
Last change: Fri Oct 16 13:02:10 2015
Stack: corosync
Current DC: rhel7-2 (2) - partition with quorum
Version: 1.1.12-a14efad
3 Nodes configured
7 Resources configured

Online: [ rhel7-1 rhel7-2 rhel7-3 ]

Full list of resources:

 Clone Set: lb-haproxy-clone [lb-haproxy]
     Started: [ rhel7-1 rhel7-2 rhel7-3 ]
 vip-db  (ocf::heartbeat:IPaddr2):  Started rhel7-2
 Master/Slave Set: galera-master [galera]
     Masters: [ rhel7-1 rhel7-2 rhel7-3 ]

PCSD Status:
  rhel7-1: Online
  rhel7-2: Online
  rhel7-3: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
```
Above, the part of the display that says "Masters: [ rhel7-1 rhel7-2 rhel7-3 ]" shows that all three nodes are up and ready for connections. If you instead saw this:
```
 Clone Set: lb-haproxy-clone [lb-haproxy]
     Started: [ rhel7-1 rhel7-2 rhel7-3 ]
 vip-db  (ocf::heartbeat:IPaddr2):  Started rhel7-2
 Master/Slave Set: galera-master [galera]
     galera  (ocf::heartbeat:galera):  FAILED Master rhel7-2
     Masters: [ rhel7-1 rhel7-3 ]

Failed actions:
    galera_monitor_10000 on rhel7-2 'not running' (7): call=32, status=complete,
        exit-reason='none', last-rc-change='Tue Oct 20 12:01:05 2015',
        queued=0ms, exec=0ms
```
that indicates the rhel7-2 node is having a problem; you'd want to look in /var/log/mysqld.log on that server to see what the problem is, and after that run "pcs resource cleanup" to have Pacemaker restart the resource.
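For example, the recovery steps on the failed node would look something like this (the galera resource name is taken from the pcs output above):

```
# On the failed node (rhel7-2 in the example above), look for the
# root cause in the MariaDB error log:
tail -n 100 /var/log/mysqld.log

# After addressing the problem, clear the failed action so that
# Pacemaker restarts the resource:
pcs resource cleanup galera
```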
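Beyond Pacemaker's own monitoring, a node-local health check can query Galera's wsrep status variables directly. The following is a minimal sketch, not an official RHOS script; it assumes the mysql client can connect without an interactive password (e.g. via /root/.my.cnf):

```
#!/bin/bash
# Hypothetical Galera health check: report whether the local node is
# a synced member of a Primary cluster component.
status=$(mysql -Nse "SHOW STATUS LIKE 'wsrep_cluster_status';" | awk '{print $2}')
state=$(mysql -Nse "SHOW STATUS LIKE 'wsrep_local_state_comment';" | awk '{print $2}')

if [ "$status" = "Primary" ] && [ "$state" = "Synced" ]; then
    echo "OK: node is Synced in a Primary component"
    exit 0
else
    echo "CRITICAL: wsrep_cluster_status=$status wsrep_local_state_comment=$state"
    exit 2
fi
```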
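Coming back to the HAProxy timeouts mentioned at the start of this comment, the settings to check look roughly like the following. This is a minimal sketch; the listener name, addresses, and the 90m value are illustrative, not taken from the customer's configuration:

```
# Hypothetical excerpt from /etc/haproxy/haproxy.cfg
listen mysql
    bind 192.168.0.10:3306
    mode tcp
    balance leastconn
    # If these are left at a short default such as 90s, HAProxy will
    # cut idle pooled connections that the applications expect to keep
    # for an hour; set them above the application's pool recycle time.
    timeout client 90m
    timeout server 90m
    server rhel7-1 192.168.0.11:3306 check
    server rhel7-2 192.168.0.12:3306 check backup
    server rhel7-3 192.168.0.13:3306 check backup
```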
The customer's case was closed. We will reopen this bug if needed.