Bug 1386901 - Galera cluster fails and services are restarted by pacemaker
Summary: Galera cluster fails and services are restarted by pacemaker
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: mariadb-galera
Version: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 6.0 (Juno)
Assignee: Damien Ciabrini
QA Contact: Udi Shkalim
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-10-19 19:02 UTC by Jeremy
Modified: 2019-12-16 07:11 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-16 16:33:09 UTC
Target Upstream Version:



Description Jeremy 2016-10-19 19:02:49 UTC
Description of problem: Need to determine the root cause of the galera cluster failure, which also caused pacemaker to restart the services. Seeing SSL errors in the logs; could they be the cause?

##controller 2 mariadb.log
150708  1:57:25 [Warning] Failed to setup SSL
150708  1:57:25 [Warning] SSL error: SSL_CTX_set_default_verify_paths failed


Version-Release number of selected component (if applicable):
galera-25.3.5-6.el7ost.x86_64
mariadb-5.5.41-2.el7_0.x86_64

How reproducible:
unknown



Additional info:
Seeing max_open_files warnings, even though open-files-limit appears to be set to 16384:
150708  1:57:24 [Warning] Could not increase number of max_open_files to more than 1024 (request: 1835)


[root@itw-rhos-ctl-2 keystone]# ps aux |grep mysql
root     10192  0.0  0.0 112640   960 pts/0    S+   14:49   0:00 grep --color=auto mysql
root     36331  0.0  0.0  11740  1656 ?        S    Oct18   0:00 /bin/sh /usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --datadir=/var/lib/mysql --log-error=/var/log/mysqld.log --user=mysql --open-files-limit=16384 --wsrep-cluster-address=gcomm://
mysql    37473  1.4  1.1 21895372 733304 ?     Sl   Oct18  14:02 /usr/libexec/mysqld --defaults-file=/etc/my.cnf --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep-provider=/usr/lib64/galera/libgalera_smm.so --wsrep-cluster-address=gcomm:// --log-error=/var/log/mysqld.log --open-files-limit=16384 --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --port=3306 --wsrep_start_position=52d74809-2536-11e5-9a81-cfbc9dd257f0:117153883


[root@itw-rhos-ctl-1 tmp]# ps aux |grep mysql
root     22160  0.0  0.0  11740  1652 ?        S    Oct18   0:00 /bin/sh /usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --datadir=/var/lib/mysql --log-error=/var/log/mysqld.log --user=mysql --open-files-limit=16384 --wsrep-cluster-address=gcomm://pcmk-itw-rhos-ctl-3,pcmk-itw-rhos-ctl-1,pcmk-itw-rhos-ctl-2
mysql    23136  0.5  0.5 1911156 364312 ?      Sl   Oct18   5:16 /usr/libexec/mysqld --defaults-file=/etc/my.cnf --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep-provider=/usr/lib64/galera/libgalera_smm.so --wsrep-cluster-address=gcomm://pcmk-itw-rhos-ctl-3,pcmk-itw-rhos-ctl-1,pcmk-itw-rhos-ctl-2 --log-error=/var/log/mysqld.log --open-files-limit=16384 --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --port=3306 --wsrep_start_position=52d74809-2536-11e5-9a81-cfbc9dd257f0:117153854

ps aux |grep mysql
root     20031  0.0  0.0 112640   960 pts/0    S+   14:50   0:00 grep --color=auto mysql
root     25246  0.0  0.0  11740  1652 ?        S    Oct18   0:00 /bin/sh /usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --datadir=/var/lib/mysql --log-error=/var/logmysqld.log --user=mysql --open-files-limit=16384 --wsrep-cluster-address=gcomm://pcmk-itw-rhos-ctl-3,pcmk-itw-rhos-ctl-1,pcmk-itw-rhos-ctl-2
mysql    26377  0.5  0.5 1908940 383164 ?      Sl   Oct18   5:13 /usr/libexec/mysqld --defaults-file=/etc/my.cnf --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep-provider=/usr/lib64/galera/libgalera_smm.so --wsrep-cluster-address=gcomm://pcmk-itw-rhos-ctl-3,pcmk-itw-rhos-ctl-1,pcmk-itw-rhos-ctl-2 --log-error=/var/log/mysqld.log --open-files-limit=16384 --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --port=3306 --wsrep_start_position=52d74809-2536-11e5-9a81-cfbc9dd257f0:117153851
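To compare the three process listings above without reading each long command line by eye, the mysqld options can be pulled into a dictionary. The following is only a minimal sketch: the helper name `mysqld_options` is hypothetical, and the sample string is an abridged copy of the controller 2 command line shown above.

```python
import shlex

def mysqld_options(cmdline):
    """Return the --key=value options of a mysqld command line as a dict."""
    opts = {}
    for token in shlex.split(cmdline):
        if token.startswith("--"):
            key, _, value = token[2:].partition("=")
            # mysqld accepts '-' and '_' interchangeably in option names,
            # so normalize to '-' to make lookups predictable.
            opts[key.replace("_", "-")] = value
    return opts

# Abridged from the controller 2 "ps aux" output above.
sample = (
    "/usr/libexec/mysqld --defaults-file=/etc/my.cnf "
    "--wsrep-cluster-address=gcomm:// --open-files-limit=16384 "
    "--wsrep_start_position=52d74809-2536-11e5-9a81-cfbc9dd257f0:117153883"
)
opts = mysqld_options(sample)
print(opts["open-files-limit"])
print(opts["wsrep-cluster-address"])
```

Run against all three listings, this makes two things easy to see: every node was started with --open-files-limit=16384, and only controller 2 has an empty gcomm:// address, which is what Galera uses when a node bootstraps a new cluster.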

Comment 3 Damien Ciabrini 2016-10-20 09:05:11 UTC
The provided sosreports were generated with an old version of the sos package and lack the relevant galera server logs stored in /var/log/mysqld.log.
I would advise upgrading this package so that we have complete logs to analyze.

That being said, I think the galera cluster failed because of DB connection exhaustion. The customer needs to determine whether this high DB workload is usual/expected and, if so, raise two configuration settings:
 . max_connections in /etc/my.cnf.d/server.cnf
 . open-files-limit in the galera resource definition (pacemaker)

I can see the cinder service health check failing with error 1040 'Too many connections'. Pacemaker's resource monitoring also hit the issue at some point, so the monitor action could not determine the state of the galera server and returned an error.

  notice: operation_finished: galera_monitor_10000:25438:stderr [ ERROR 1040 (08004): Too many connections ]
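To get a feel for how often the monitor action hit this error, the pacemaker log can be scanned for it. This is a rough sketch that assumes every occurrence has the same shape as the line quoted above; the helper name `count_1040_by_operation` is made up for illustration, and the sample lines are copied from this report.

```python
import re
from collections import Counter

# Matches pacemaker lines such as:
#   galera_monitor_10000:25438:stderr [ ERROR 1040 (08004): Too many connections ]
PATTERN = re.compile(
    r"(?P<op>\w+_monitor_\d+):\d+:stderr \[ ERROR 1040 \(08004\)"
)

def count_1040_by_operation(lines):
    """Count 'Too many connections' failures per monitor operation."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group("op")] += 1
    return counts

sample = [
    "notice: operation_finished: galera_monitor_10000:25438:stderr "
    "[ ERROR 1040 (08004): Too many connections ]",
    "150708  1:57:25 [Warning] Failed to setup SSL",
]
counts = count_1040_by_operation(sample)
print(dict(counts))
```

In practice the lines would come from the pacemaker log on each controller rather than from a hard-coded list.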

Consequently, pacemaker stopped the resource on that node, and HAProxy marked the resource as down on that node.


Jeremy, I don't think the SSL warnings are important, but I would need /var/log/mysqld.log to confirm.

Regarding open-files-limit, the warning you saw probably comes from the log file used during the pre-bootstrap phase of the galera cluster (/var/log/mariadb/mariadb.log). As explained in [1], such warnings are not critical; the running galera server should have the open file limit set as expected. You can log in to a mariadb server locally to confirm with:

  MariaDB [(none)]> show variables like 'open_files_limit';
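As a cross-check from outside the database, on Linux the kernel exposes the limits of a running process in /proc/<pid>/limits. The sketch below parses the "Max open files" row; the function name `max_open_files` is hypothetical, and the sample text is illustrative rather than taken from the sosreport.

```python
def max_open_files(limits_text):
    """Return the (soft, hard) 'Max open files' values from /proc/<pid>/limits text.

    Values are returned as strings because a limit may read 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max open files", soft limit, hard limit, units
            soft, hard = line.split()[3:5]
            return soft, hard
    return None

# Illustrative content in the layout the kernel uses for this file.
sample = (
    "Limit                     Soft Limit           Hard Limit           Units     \n"
    "Max open files            16384                16384                files     \n"
)
print(max_open_files(sample))
```

On a controller, the text would come from reading /proc/<pid>/limits as root, with the pid taken from /var/run/mysql/mysqld.pid (the pid file named in the process listings).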


If you need more help, could you make sure the sosreport includes the missing /var/log/mysqld.log from all controllers?

[1]  http://damien.ciabrini.name/posts/2016/03/troubleshooting-open_files_limit-in-mariadb.html

Comment 5 Damien Ciabrini 2016-11-10 09:30:51 UTC
OK, sorry for the late confirmation. The SSL warnings are harmless: I can see in the logs that the nodes successfully connect to the galera gcomm communication channel over SSL.

I can also confirm that the open-files-limit option was correctly taken into account when the galera server started, as I don't see any error messages in /var/log/mysqld.log.

So I think raising the maximum number of connections is enough to fix the customer's issue.

