Description of problem:
Need to determine root cause for why galera cluster fails. This also caused pacemaker to restart the services. seeing ssl errors that may be the cause?

##controller 2 mariadb.log
150708 1:57:25 [Warning] Failed to setup SSL
150708 1:57:25 [Warning] SSL error: SSL_CTX_set_default_verify_paths failed

Version-Release number of selected component (if applicable):
galera-25.3.5-6.el7ost.x86_64
mariadb-5.5.41-2.el7_0.x86_64

How reproducible:
unknown

Additional info:
Seeing max open files messages, however they seem to be set to 16384:

150708 1:57:24 [Warning] Could not increase number of max_open_files to more than 1024 (request: 1835)

[root@itw-rhos-ctl-2 keystone]# ps aux |grep mysql
root 10192 0.0 0.0 112640 960 pts/0 S+ 14:49 0:00 grep --color=auto mysql
root 36331 0.0 0.0 11740 1656 ? S Oct18 0:00 /bin/sh /usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --datadir=/var/lib/mysql --log-error=/var/log/mysqld.log --user=mysql --open-files-limit=16384 --wsrep-cluster-address=gcomm://
mysql 37473 1.4 1.1 21895372 733304 ? Sl Oct18 14:02 /usr/libexec/mysqld --defaults-file=/etc/my.cnf --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep-provider=/usr/lib64/galera/libgalera_smm.so --wsrep-cluster-address=gcomm:// --log-error=/var/log/mysqld.log --open-files-limit=16384 --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --port=3306 --wsrep_start_position=52d74809-2536-11e5-9a81-cfbc9dd257f0:117153883

[root@itw-rhos-ctl-1 tmp]# ps aux |grep mysql
root 22160 0.0 0.0 11740 1652 ? S Oct18 0:00 /bin/sh /usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --datadir=/var/lib/mysql --log-error=/var/log/mysqld.log --user=mysql --open-files-limit=16384 --wsrep-cluster-address=gcomm://pcmk-itw-rhos-ctl-3,pcmk-itw-rhos-ctl-1,pcmk-itw-rhos-ctl-2
mysql 23136 0.5 0.5 1911156 364312 ? Sl Oct18 5:16 /usr/libexec/mysqld --defaults-file=/etc/my.cnf --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep-provider=/usr/lib64/galera/libgalera_smm.so --wsrep-cluster-address=gcomm://pcmk-itw-rhos-ctl-3,pcmk-itw-rhos-ctl-1,pcmk-itw-rhos-ctl-2 --log-error=/var/log/mysqld.log --open-files-limit=16384 --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --port=3306 --wsrep_start_position=52d74809-2536-11e5-9a81-cfbc9dd257f0:117153854

ps aux |grep mysql
root 20031 0.0 0.0 112640 960 pts/0 S+ 14:50 0:00 grep --color=auto mysql
root 25246 0.0 0.0 11740 1652 ? S Oct18 0:00 /bin/sh /usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --datadir=/var/lib/mysql --log-error=/var/logmysqld.log --user=mysql --open-files-limit=16384 --wsrep-cluster-address=gcomm://pcmk-itw-rhos-ctl-3,pcmk-itw-rhos-ctl-1,pcmk-itw-rhos-ctl-2
mysql 26377 0.5 0.5 1908940 383164 ? Sl Oct18 5:13 /usr/libexec/mysqld --defaults-file=/etc/my.cnf --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep-provider=/usr/lib64/galera/libgalera_smm.so --wsrep-cluster-address=gcomm://pcmk-itw-rhos-ctl-3,pcmk-itw-rhos-ctl-1,pcmk-itw-rhos-ctl-2 --log-error=/var/log/mysqld.log --open-files-limit=16384 --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --port=3306 --wsrep_start_position=52d74809-2536-11e5-9a81-cfbc9dd257f0:117153851
The provided sosreports were generated with an old version of the sos package and lack the interesting logs from the galera servers, which are stored in /var/log/mysqld.log. I would advise upgrading this package so that we have complete logs to analyze.

That being said, I think the galera cluster failed because of DB connection exhaustion. So the customer needs to determine whether this high DB workload is usual/expected and, if so, raise two configuration settings:
 . max_connections in /etc/my.cnf.d/server.cnf
 . open-files-limit in the galera resource definition (pacemaker)

I can see the cinder service health check failing with error 1040 'Too many connections'. Pacemaker's resource monitoring also hit the issue at some point, so the monitor action could not determine the state of the galera server and returned an error:

  notice: operation_finished: galera_monitor_10000:25438:stderr [ ERROR 1040 (08004): Too many connections ]

Consequently, pacemaker stopped the resource on that node, and HAProxy signalled the resource as down on that node.

Jeremy, I don't think the SSL warnings are important, but I would need /var/log/mysqld.log to confirm.

Regarding open-files-limit, the warning you saw probably comes from the logfile used during the pre-bootstrap phase of the galera cluster (/var/log/mariadb/mariadb.log). As explained in [1], these warnings are not critical; the running galera server should have the open file limit set as expected. You can log in to a mariadb server locally to confirm with:

  MariaDB [(none)]> show variables like 'open_files_limit';

If you need more help, could you make sure the sosreport includes the missing /var/log/mysqld.log from all controllers?

[1] http://damien.ciabrini.name/posts/2016/03/troubleshooting-open_files_limit-in-mariadb.html
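In addition to the SQL query, the limit actually applied to the running mysqld process can be read from /proc on the controller. The following is only a minimal sketch: it assumes the standard Linux /proc/<pid>/limits layout, and max_open_files_soft is a hypothetical helper name, not something shipped with mariadb or pacemaker.

```shell
#!/bin/sh
# Hypothetical helper: print the soft "Max open files" limit found in a
# /proc/<pid>/limits style file passed as the first argument.
max_open_files_soft() {
    # The row looks like: "Max open files  <soft>  <hard>  files",
    # so the soft limit is the 4th whitespace-separated field.
    awk '/^Max open files/ { print $4 }' "$1"
}

# Usage on a controller (assumes exactly one mysqld process and that
# pidof is installed):
#   max_open_files_soft "/proc/$(pidof mysqld)/limits"
```

If the value printed matches the 16384 passed on the mysqld command line, the 1024 warning most likely belongs to the pre-bootstrap phase rather than to the running server.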
OK, late in confirming: the SSL warnings are harmless. I can see in the logs that the nodes successfully connect to the galera gcomm communication channel over SSL.

Moreover, I also confirm that the open-files-limit option was correctly taken into account when the galera server started, as I don't see any error messages in /var/log/mysqld.log.

So I think raising the number of connections is enough to fix the customer's issue.
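To get a rough idea of how often the connection limit was hit before and after raising max_connections, the error 1040 lines can be counted in whichever log file holds the pacemaker monitor output. This is a sketch under assumptions: count_too_many_connections is a hypothetical helper, and the log path in the usage line is a guess that depends on how logging is configured on the controllers.

```shell
#!/bin/sh
# Hypothetical helper: count the lines in a log file that report
# MariaDB error 1040 ("Too many connections").
count_too_many_connections() {
    # grep -c prints the number of matching lines (0 if none).
    grep -c 'ERROR 1040' "$1"
}

# Usage (the log path is an assumption, adjust to the local setup):
#   count_too_many_connections /var/log/messages
```

On the database side, comparing the Max_used_connections status counter (show global status like 'Max_used_connections';) with the configured max_connections value gives a complementary view of how close the server gets to the limit.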