Bug 1269652
| Summary: | MySQL state database monitoring | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | David Hill <dhill> |
| Component: | mariadb-galera | Assignee: | Damien Ciabrini <dciabrin> |
| Status: | CLOSED WORKSFORME | QA Contact: | Udi Shkalim <ushkalim> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 5.0 (RHEL 6) | CC: | dhill, mbayer, srevivo |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | 5.0 (RHEL 6) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-11-07 19:56:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
David Hill
2015-10-07 20:39:53 UTC
The events here are fairly simple in that there are database connectivity issues occurring very frequently. I don't know the details of this setup, but one common misconfiguration that causes this is incorrectly set HAProxy timeouts, such that application-pooled database connections, which expect to remain open for 60 minutes, are cut off by a 90-second HAProxy timeout. We'd like to see the haproxy config being used here, as there may be a simple cause for these database connectivity issues. (A sketch of the relevant haproxy.cfg timeout settings is included at the end of this comment.)

They had to bring down the whole cluster because it was being upgraded. When a minor update is applied on a cluster, should they bring the whole cluster down, or is it safe to apply it while the cluster is online?

Well, that depends a lot on what they are upgrading. If the case here is that they shut off the DB to do some upgrades and the services were returning errors during that time, that's not a big deal, as long as the services successfully came back online when the databases were restarted.

Well, they added two hosts to the cluster and brought it down on purpose in order to see what would happen to the various RHOS services. Do we have any kind of monitoring scripts, health checks, or guidance on what to monitor that we could provide to the customer?

The basic monitoring done on an OSP / Galera setup is done by Pacemaker. Running "pcs status" will show the state of the Galera cluster:
```
[root@rhel7-1 ~]# pcs status
Cluster name: rhos6
Last updated: Tue Oct 20 11:57:23 2015
Last change: Fri Oct 16 13:02:10 2015
Stack: corosync
Current DC: rhel7-2 (2) - partition with quorum
Version: 1.1.12-a14efad
3 Nodes configured
7 Resources configured

Online: [ rhel7-1 rhel7-2 rhel7-3 ]

Full list of resources:

 Clone Set: lb-haproxy-clone [lb-haproxy]
     Started: [ rhel7-1 rhel7-2 rhel7-3 ]
 vip-db  (ocf::heartbeat:IPaddr2):  Started rhel7-2
 Master/Slave Set: galera-master [galera]
     Masters: [ rhel7-1 rhel7-2 rhel7-3 ]

PCSD Status:
  rhel7-1: Online
  rhel7-2: Online
  rhel7-3: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
```
Above, the part of the display that says "Masters: [ rhel7-1 rhel7-2 rhel7-3 ]" shows that all three nodes are up and ready for connections. If you instead saw this:
```
 Clone Set: lb-haproxy-clone [lb-haproxy]
     Started: [ rhel7-1 rhel7-2 rhel7-3 ]
 vip-db  (ocf::heartbeat:IPaddr2):  Started rhel7-2
 Master/Slave Set: galera-master [galera]
     galera  (ocf::heartbeat:galera):  FAILED Master rhel7-2
     Masters: [ rhel7-1 rhel7-3 ]

Failed actions:
    galera_monitor_10000 on rhel7-2 'not running' (7): call=32, status=complete,
        exit-reason='none', last-rc-change='Tue Oct 20 12:01:05 2015',
        queued=0ms, exec=0ms
```
that indicates the rhel7-2 node is having a problem; you'd want to look in /var/log/mysqld.log on that server to see what the problem is, and after that run "pcs resource cleanup" to have Pacemaker restart the resource.
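For example, the recovery steps on the failed node would look something like this (the galera resource name is taken from the pcs output above):

```
# On the failed node (rhel7-2 in the example above), look for the
# root cause in the MariaDB error log:
tail -n 100 /var/log/mysqld.log

# After addressing the problem, clear the failed action so that
# Pacemaker restarts the resource:
pcs resource cleanup galera
```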
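Beyond Pacemaker's own monitoring, a node-local health check can query Galera's wsrep status variables directly. The following is a minimal sketch, not an official RHOS script; it assumes the mysql client can connect without an interactive password (e.g. via /root/.my.cnf):

```
#!/bin/bash
# Hypothetical Galera health check: report whether the local node is
# a synced member of a Primary cluster component.
status=$(mysql -Nse "SHOW STATUS LIKE 'wsrep_cluster_status';" | awk '{print $2}')
state=$(mysql -Nse "SHOW STATUS LIKE 'wsrep_local_state_comment';" | awk '{print $2}')

if [ "$status" = "Primary" ] && [ "$state" = "Synced" ]; then
    echo "OK: node is Synced in a Primary component"
    exit 0
else
    echo "CRITICAL: wsrep_cluster_status=$status wsrep_local_state_comment=$state"
    exit 2
fi
```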
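Coming back to the HAProxy timeouts mentioned at the start of this comment, the settings to check look roughly like the following. This is a minimal sketch; the listener name, addresses, and the 90m value are illustrative, not taken from the customer's configuration:

```
# Hypothetical excerpt from /etc/haproxy/haproxy.cfg
listen mysql
    bind 192.168.0.10:3306
    mode tcp
    balance leastconn
    # If these are left at a short default such as 90s, HAProxy will
    # cut idle pooled connections that the applications expect to keep
    # for an hour; set them above the application's pool recycle time.
    timeout client 90m
    timeout server 90m
    server rhel7-1 192.168.0.11:3306 check
    server rhel7-2 192.168.0.12:3306 check backup
    server rhel7-3 192.168.0.13:3306 check backup
```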
The customer's case was closed. We will reopen this bug if needed.