| Summary: | Galera cluster fails and services are restarted by pacemaker | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Jeremy <jmelvin> |
| Component: | mariadb-galera | Assignee: | Damien Ciabrini <dciabrin> |
| Status: | CLOSED NOTABUG | QA Contact: | Udi Shkalim <ushkalim> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 6.0 (Juno) | CC: | dciabrin, fdinitto, jmelvin, mbayer, srevivo |
| Target Milestone: | --- | Keywords: | Unconfirmed |
| Target Release: | 6.0 (Juno) | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-11-16 16:33:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
|
Description
Jeremy
2016-10-19 19:02:49 UTC
The provided sosreports were generated with an old version of the sos package and lack the galera server logs stored in /var/log/mysqld.log. I would advise upgrading this package so that we have complete logs to analyze.

That being said, I think the galera cluster failed because of DB connection exhaustion. The customer needs to determine whether this high DB workload is usual/expected and, if so, raise two configuration settings:

- max_connections in /etc/my.cnf.d/server.cnf
- open-files-limit in the galera resource definition (pacemaker)

I can see the cinder service health check failing with error 1040 'Too many connections'. Pacemaker's resource monitoring also hit the issue at some point, so the monitor action could not determine the state of the galera server and returned an error:

    notice: operation_finished: galera_monitor_10000:25438:stderr [ ERROR 1040 (08004): Too many connections ]

Consequently, pacemaker stopped the resource on that node, and HAProxy reported the resource as down on that node.

Jeremy, I don't think the SSL warnings are important, but I would need /var/log/mysqld.log to confirm.

Regarding open-files-limit, the warning you saw probably comes from the log file used during the pre-bootstrap phase of the galera cluster (/var/log/mariadb/mariadb.log). As explained in [1], such warnings are not critical; the running galera server should have the open file limit set as expected. You can log in to a MariaDB server locally to confirm with:

    MariaDB [(none)]> show variables like 'open_files_limit';

If you need more help, could you make sure the missing /var/log/mysqld.log from all controllers is included in the sosreports?

[1] http://damien.ciabrini.name/posts/2016/03/troubleshooting-open_files_limit-in-mariadb.html

OK, late in confirming: the SSL warnings are harmless. I can see in the logs that the nodes successfully connect to the galera gcomm communication channel over SSL.

Moreover, I also confirm that the open_files_limit option was correctly taken into account when the galera server started, as I don't see any error messages in /var/log/mysqld.log.

So I think raising the number of connections is enough to fix the customer's issue.
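
For anyone triaging similar reports, here is a minimal sketch of how the monitor failures could be counted in pacemaker log lines. It is not part of the original analysis. The regular expression is modelled only on the single log line quoted in this bug, so other pacemaker versions may format the message differently. The function name `count_mysql_errors` and the second sample line are hypothetical.

```python
import re
from collections import Counter

# Assumption: the stderr payload looks like the one line quoted in this bug:
#   <operation>:<pid>:stderr [ ERROR <code> (<sqlstate>): <message> ]
STDERR_ERROR = re.compile(
    r"(?P<op>\w+):(?P<pid>\d+):stderr \[ ERROR (?P<code>\d+) "
    r"\((?P<sqlstate>\w+)\): (?P<msg>.*?) \]"
)


def count_mysql_errors(lines):
    """Count MySQL/MariaDB client errors per (operation, error code)."""
    counts = Counter()
    for line in lines:
        match = STDERR_ERROR.search(line)
        if match:
            counts[(match.group("op"), int(match.group("code")))] += 1
    return counts


sample = [
    # Line quoted in this bug report.
    "notice: operation_finished: galera_monitor_10000:25438:stderr "
    "[ ERROR 1040 (08004): Too many connections ]",
    # Hypothetical unrelated line, included to show it is ignored.
    "notice: operation_finished: galera_monitor_10000:25438:stdout [ ok ]",
]

counts = count_mysql_errors(sample)
for (operation, code), hits in counts.items():
    print(f"{operation}: error {code} seen {hits} time(s)")
```

A high count for error 1040 on the galera monitor operation would be consistent with the connection exhaustion described above, but it does not by itself show whether the workload is expected.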