Bug 1257130 - Deployment of rhel-osp-installer will stall when deploying galera
Summary: Deployment of rhel-osp-installer will stall when deploying galera
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: foreman
Version: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 8.0 (Liberty)
Assignee: Mike Burns
QA Contact: yeylon@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-08-26 10:30 UTC by Chen
Modified: 2019-08-15 05:13 UTC
CC: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-09-02 08:52:04 UTC
Target Upstream Version:
Embargoed:


Attachments
sosreports of the deployed clients (18.21 MB, application/x-gzip)
2015-08-26 10:32 UTC, Chen
28th Aug sosreports and yaml file (17.99 MB, application/x-gzip)
2015-08-28 09:47 UTC, Chen
foreman server's sosreport (14.81 MB, application/x-xz)
2015-08-29 07:05 UTC, Chen
Reproduced sosreports on 31 Aug. (18.31 MB, application/x-gzip)
2015-08-31 09:51 UTC, Chen
mysqld.log for 3 deployed clients (8.82 KB, application/x-gzip)
2015-09-01 01:52 UTC, Chen

Description Chen 2015-08-26 10:30:17 UTC
Description of problem:

Galera cannot start during the deployment

Version-Release number of selected component (if applicable):

OSP 6.0

How reproducible:

100%

Steps to Reproduce:
1. Deploy a 3-controller-node HA environment using rhel-osp-installer


Actual results:

The Galera-clustered MariaDB cannot start and the deployment fails

Expected results:

The deployment should succeed automatically

Additional info:

Comment 3 Chen 2015-08-26 10:32:00 UTC
Created attachment 1067216 [details]
sosreports of the deployed clients

Comment 4 Jason Guiditta 2015-08-26 15:00:41 UTC
It appears stonith was neither correctly configured nor disabled - one or the other has to happen for things to work correctly (and we strongly discourage disabling stonith for any real testing). I see in /var/log/messages on the cluster control node:

Aug 25 22:26:16 mac5cb9018a8d7c crmd[15973]: warning: do_log: FSA: Input I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
Aug 25 22:26:17 mac5cb9018a8d7c pengine[15972]: error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
Aug 25 22:26:17 mac5cb9018a8d7c pengine[15972]: error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
Aug 25 22:26:17 mac5cb9018a8d7c pengine[15972]: error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity


If this is a test environment, try setting the fencing_type parameter in foreman to disabled to see what effect that has on the next deployment attempt. Assuming that works, the next step would be to work through properly configuring fencing (best done on a fresh deployment, to verify you can get it to run from scratch).
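
If you just want to confirm stonith is the blocker on a throwaway cluster, it can also be disabled directly with pcs (test environments only; shown for illustration):

  pcs property set stonith-enabled=false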

Comment 5 Jason Guiditta 2015-08-26 15:02:43 UTC
Also, if this does not solve the problem, it would be helpful to have the yaml for each of the 3 control nodes in the cluster (you can get this in foreman: click on the host in the list, then click the 'yaml' button on the left-hand side).

Comment 6 Chen 2015-08-28 09:44:31 UTC
Hi Jason,

Thank you for your reply.

I added the fencing_type parameter with the value disabled under Hosts -> Operating Systems -> RHEL Server 7.1 -> Parameters and tried to redeploy twice, but the result didn't change.

The deployment never proceeds after the following messages:

puppet-agent: (/Stage[main])/Quickstack::Pacemaker::Galera/File[/etc/my.cnf.d/galera.cnf]/content content changed '{md5}XXXXXX' to '{md5}XXXXXX'

Please check the newest sosreports and yaml file.

BTW, we are trying to deploy 3 controller nodes without any compute nodes. Not sure whether this is related.

Best Regards,
Chen

Comment 7 Chen 2015-08-28 09:47:06 UTC
Created attachment 1067948 [details]
28th Aug sosreports and yaml file

Comment 8 Michael Bayer 2015-08-28 15:47:58 UTC
I don't see any record of /var/log/mysqld.log, and the sos.log seems like it wasn't looking for it:

2015-08-25 22:32:08,355 INFO: [plugin:mysql] collecting path '/etc/my.cnf'
2015-08-25 22:32:08,355 INFO: [plugin:mysql] collecting path '/etc/my.cnf.d/galera.cnf'
2015-08-25 22:32:08,356 INFO: [plugin:mysql] collecting path '/etc/ld.so.conf.d/mariadb-x86_64.conf'
2015-08-25 22:32:08,356 INFO: [plugin:mysql] collecting path '/var/log/mariadb/mariadb.log'
2015-08-25 22:32:08,356 INFO: [plugin:mysql] collecting path '/etc/my.cnf.d/mysql-clients.cnf'
2015-08-25 22:32:08,357 INFO: [plugin:mysql] collecting path '/etc/my.cnf.d/client.cnf'
2015-08-25 22:32:08,357 INFO: [plugin:mysql] collecting path '/etc/my.cnf.d/server.cnf'

Do these servers have a /var/log/mysqld.log file present? I also notice there's no /var/log/pacemaker.log present.

If this all just means "the installer didn't get that far", that would explain it. But then I don't see why the description here is "mariadb can't start". Where's the pacemaker log?

Comment 9 Chen 2015-08-29 04:06:03 UTC
Hi Michael,

>do these servers have a /var/log/mysqld.log file present?

I don't see /var/log/mysqld.log on the server. Perhaps this file doesn't exist in the RHEL 7 mariadb?

>I also notice there's no /var/log/pacemaker.log present.

You can find pacemaker.log at sos_commands/cluster/crm_report/<host>/pacemaker.log in the sosreport.

>If this all just means "the installer didn't get that far", that would explain it.
>But then I don't see why the description here is "mariadb can't start".

Sorry for the confusion. Currently the installation stalls after the messages I posted earlier. 

Chen

Comment 10 Chen 2015-08-29 07:05:20 UTC
Created attachment 1068215 [details]
foreman server's sosreport

Comment 11 Chen 2015-08-29 07:06:54 UTC
Uploaded the sosreport of the foreman server.

Chen

Comment 13 Andrew Beekhof 2015-08-31 00:41:37 UTC
You need to look for the lrmd logs on each machine:

Aug 25 22:28:28 [15970] mac5cb9018a8d7c.zte.com       lrmd:   notice: operation_finished: 	galera_promote_0:21937:stderr [ ocf-exit-reason:MySQL server failed to start (pid=22049) (rc=0), please check your installation ]
Aug 25 22:28:28 [15970] mac5cb9018a8d7c.zte.com       lrmd:     info: log_finished: 	finished - rsc:galera action:promote call_id:96 pid:21937 exit-code:1 exec-time:43472ms queue-time:1ms

Remember we set on-fail to block for promote operations so that we don't fence nodes while they're trying to sync from their peer.

Also, the OCF_RESKEY "defaults" aren't changed (they're a terrible convention that upstream created at some point).

What happens is that OCF_RESKEY_ gets prefixed to the name=value pairs in the resource definition. I.e.

  pcs resource create galera galera enable_creation=true wsrep_cluster_address="gcomm://${node_list}" additional_parameters='--open-files-limit=16384' meta master-max=3 ordered=true op promote timeout=300s on-fail=block --master

results in OCF_RESKEY_enable_creation=true, etc.

and then lines like this:

: ${OCF_RESKEY_log=${OCF_RESKEY_log_default}}

mean: use OCF_RESKEY_log if it exists (i.e. is in the resource definition, which it isn't by default).
Only if it hasn't been defined do we use the value of OCF_RESKEY_log_default.
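
A minimal shell sketch of how that default-assignment idiom expands (values here are illustrative; /var/log/mysqld.log is the agent default discussed below):

  OCF_RESKEY_log_default=/var/log/mysqld.log
  : ${OCF_RESKEY_log=${OCF_RESKEY_log_default}}
  # prints /var/log/mysqld.log unless OCF_RESKEY_log was already set
  echo "$OCF_RESKEY_log"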

Comment 14 Michael Bayer 2015-08-31 03:06:49 UTC
Andrew - so, any idea why there's no /var/log/mysqld.log on these servers? Or rather, my question was: how do I see what log file location the resource agent *was* using? (E.g. I didn't mean "change the default", I meant, where the default is *overridden*.)

I can't diagnose why galera didn't start without the logs.

Comment 15 Andrew Beekhof 2015-08-31 04:05:24 UTC
(In reply to Michael Bayer from comment #14)
> Andrew - so, any idea why there's no /var/log/mysqld.log on these servers?
> Or rather, my question was: how do I see what log file location the
> resource agent *was* using? (E.g. I didn't mean "change the default", I
> meant, where the default is *overridden*.)

It would have been overridden by setting log=/some/where in the pacemaker configuration.  But all I see are the "normal" ones.

          <instance_attributes id="galera-instance_attributes">
            <nvpair id="galera-instance_attributes-additional_parameters" name="additional_parameters" value="--open-files-limit=16384"/>
            <nvpair id="galera-instance_attributes-enable_creation" name="enable_creation" value="true"/>
            <nvpair id="galera-instance_attributes-wsrep_cluster_address" name="wsrep_cluster_address" value="gcomm://pcmk-mac5cb9018a8d7c,pcmk-mac5cb9018a7fa4,pcmk-mac5cb9018a8d74"/>
          </instance_attributes>

> I can't diagnose why galera didn't start without the logs.

I feel your pain, but as far as I can see, the cluster isn't telling it an alternate location. I do see var/log/mariadb/mariadb.log in the sosreport though - could that be it?
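
For completeness, such an override would be set with something like the following (illustrative only - no one is suggesting setting it here):

  pcs resource update galera log=/var/log/mysqld.log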

Comment 16 Michael Bayer 2015-08-31 04:14:55 UTC
Like I noted above, var/log/mariadb/mariadb.log gets populated when the installer first runs mariadb under systemd for a bit, but you can see from the timestamps above that entries in mariadb.log end well before the much larger amount of galera activity that shows up in the pacemaker log afterwards. All the pacemaker setups I've worked with, including my own here, write out to /var/log/mysqld.log - that's where the resource agent points everything. Here's what it looks like on one of my test environments running the same versions of everything:

root     20654     1  0 Aug17 ?        00:00:00 /bin/sh /usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --datadir=/var/lib/mysql --log-error=/var/log/mysqld.log --user=mysql --read-only=true
mysql    21346 20654  0 Aug17 ?        00:34:05 /usr/libexec/mysqld --defaults-file=/etc/my.cnf --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep-provider=/usr/lib64/galera/libgalera_smm.so --read-only=true --log-error=/var/log/mysqld.log --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --wsrep_start_position=778f2688-3ba5-11e5-8bc5-7725a0ed193e:603


I.e. --log-error is passed on the command line by the OCF resource agent.

Chen - do any of these servers have a mysqld process running right now that can be seen in ps?

Comment 17 Michael Bayer 2015-08-31 04:29:28 UTC
Nevermind, I can see it:

root     21381  0.0  0.0  11740  1652 ?        S    22:26   0:00 /bin/sh /usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --datadir=/var/lib/mysql --log-error=/var/log/mysqld.log --user=mysql --open-files-limit=16384 --wsrep-cluster-address=gcomm://
mysql    22305  0.2  0.1 1384456 99760 ?       Sl   22:26   0:01 /usr/libexec/mysqld --defaults-file=/etc/my.cnf --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep-provider=/usr/lib64/galera/libgalera_smm.so --wsrep-cluster-address=gcomm:// --log-error=/var/log/mysqld.log --open-files-limit=16384 --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --port=3306 --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1

Chen - this is in sosreport-mac5cb9018a8d74.zte.com.01497001-20150825223337/ps. However, there is no sosreport-mac5cb9018a8d74.zte.com.01497001-20150825223337/var/log/mysqld.log. Is it possible to check again? The ps output shows this is the file mysqld is logging to.

Comment 18 Chen 2015-08-31 08:59:32 UTC
Hi Michael,

Please have a look at the sosreports which I uploaded on 2015-08-26. They contain the mysqld.log.

At the same time, I'm using the customer's environment to reproduce the issue again. If I succeed I'll upload the newest sosreports.

Again, please check the sosreports which were uploaded on 2015-08-26.

Thank you !

Chen

Comment 19 Chen 2015-08-31 09:51:25 UTC
Created attachment 1068571 [details]
Reproduced sosreports on 31 Aug.

Comment 20 Chen 2015-08-31 09:52:42 UTC
Hi all,

Sorry for the delay.

I've reproduced this issue again and uploaded the deployed client sosreports.

Chen

Comment 21 Chen 2015-08-31 09:56:44 UTC
In the pcs status output, galera is shown as unmanaged, and there are many SSL certificate errors in the mysqld.log. Is the SSL issue causing the connection timeout here, or is it not relevant at all?

Comment 22 Michael Bayer 2015-08-31 14:42:14 UTC
I don't think the SSL issues are the problem, but a node shown in the "unmanaged" state seems wrong to me; everything should be in the "managed" state, since putting a resource in "unmanaged" requires running a particular pcs command, which should not happen via any automated means. I'll take a look, and then probably have to reach out to others to determine what it means that a Galera node is coming up as unmanaged, with some suspicion towards the installer tools themselves.
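
For reference, the only normal way a resource ends up unmanaged is an explicit command such as (illustrative):

  pcs resource unmanage galera    # take the resource out of cluster management
  pcs resource manage galera      # put it back under cluster management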

Comment 23 Michael Bayer 2015-08-31 15:45:43 UTC
Chen -

OK, there's an older version of sosreport that does not automatically include /var/log/mysqld.log when it is run; in order to get this file included with this older version you need to run sosreport with the --all-logs option.
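
For example, run on each node (assuming that older sosreport version):

  # sosreport --all-logs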

In the latest dump, two of the three servers are running mysqld and they are logging to /var/log/mysqld.log:

classics-MacBook-Pro:sosreport-mac5cb9018a7fa4.zte.com-20150901013533 classic$ cat ps | grep mysql
root     21869  0.0  0.0  11740  1652 ?        S    01:28   0:00 /bin/sh /usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --datadir=/var/lib/mysql --log-error=/var/log/mysqld.log --user=mysql --open-files-limit=16384 --wsrep-cluster-address=gcomm://pcmk-mac5cb9018a7fa4,pcmk-mac5cb9018a8d74,pcmk-mac5cb9018a8d7c
mysql    22838  0.2  0.1 1507780 103384 ?      Sl   01:28   0:01 /usr/libexec/mysqld --defaults-file=/etc/my.cnf --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep-provider=/usr/lib64/galera/libgalera_smm.so --wsrep-cluster-address=gcomm://pcmk-mac5cb9018a7fa4,pcmk-mac5cb9018a8d74,pcmk-mac5cb9018a8d7c --log-error=/var/log/mysqld.log --open-files-limit=16384 --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --port=3306 --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1


sosreport-mac5cb9018a8d7c.zte.com-20150901013406 classic$ cat ps | grep mysql
root     20729  0.0  0.0  11740  1652 ?        S    01:27   0:00 /bin/sh /usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --datadir=/var/lib/mysql --log-error=/var/log/mysqld.log --user=mysql --open-files-limit=16384 --wsrep-cluster-address=gcomm://
mysql    21654  0.2  0.1 1384456 99284 ?       Sl   01:27   0:01 /usr/libexec/mysqld --defaults-file=/etc/my.cnf --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep-provider=/usr/lib64/galera/libgalera_smm.so --wsrep-cluster-address=gcomm:// --log-error=/var/log/mysqld.log --open-files-limit=16384 --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --port=3306 --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1

These daemons are running with --log-error=/var/log/mysqld.log, so that file has to be there. However, it is most important that I see the mysqld.log on the sosreport-mac5cb9018a8d74.zte.com-20150901013224 node, since that is the node that fails to start. Seeing that mysqld.log is present on the other two nodes will still confirm that the file is being written as the ps list reports.
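
A quick way to confirm the log location from each collected sosreport (a hypothetical one-liner, run from inside the sosreport directory against the captured ps file):

  grep -o 'log-error=[^ ]*' ps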

Comment 24 Chen 2015-09-01 01:52:43 UTC
Created attachment 1068841 [details]
mysqld.log for 3 deployed clients

Hi Michael,

Thank you for your patience.
I manually collected mysqld.log and uploaded them.

Comment 25 Michael Bayer 2015-09-01 02:28:58 UTC
OK, there are extensive SSL handshake errors between two of the nodes; one node keeps running and the other is rejected:

150901  1:26:20 [ERROR] WSREP: handshake with remote endpoint ssl://192.168.0.37:40159 failed: 1: 'End of file.' ( )
150901  1:26:21 [ERROR] WSREP: handshake with remote endpoint ssl://192.168.0.62:4567 failed: 1: 'SSL error.' ( 336134278: 'error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed')
150901  1:26:21 [ERROR] WSREP: handshake with remote endpoint ssl://192.168.0.21:4567 failed: 1: 'SSL error.' ( 336134278: 'error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed')
150901  1:26:22 [ERROR] WSREP: handshake with remote endpoint ssl://192.168.0.37:4567 failed: 1: 'SSL error.' ( 336134278: 'error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed')
150901  1:26:22 [ERROR] WSREP: handshake with remote endpoint ssl://192.168.0.37:40162 failed: 1: 'End of file.' ( )
150901  1:26:22 [ERROR] WSREP: handshake with remote endpoint ssl://192.168.0.62:4567 failed: 1: 'SSL error.' ( 336134278: 'error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed')
150901  1:26:23 [ERROR] WSREP: handshake with remote endpoint ssl://192.168.0.21:4567 failed: 1: 'SSL error.' ( 336134278: 'error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed')
150901  1:26:23 [Note] WSREP: view((empty))
150901  1:26:23 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
	 at gcomm/src/pc.cpp:connect():141
150901  1:26:23 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():202: Failed to open backend connection: -110 (Connection timed out)
150901  1:26:23 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1291: Failed to open channel 'galera_cluster' at 'gcomm://pcmk-mac5cb9018a7fa4,pcmk-mac5cb9018a8d74,pcmk-mac5cb9018a8d7c': -110 (Connection timed out)
150901  1:26:23 [ERROR] WSREP: gcs connect failed: Connection timed out
150901  1:26:23 [ERROR] WSREP: wsrep::connect() failed: 7
150901  1:26:23 [ERROR] Aborting

150901  1:26:23 [Note] WSREP: Service disconnected.
150901  1:26:24 [Note] WSREP: Some threads may fail to exit.
150901  1:26:24 [Note] /usr/libexec/mysqld: Shutdown complete

150901 01:26:24 mysqld_safe mysqld from pid file /var/run/mysql/mysqld.pid ended


Hoping Ryan can tell us what this means; I haven't seen this before myself. Ryan, does this look familiar? Is the installer not copying certificates correctly, or something like that?

Comment 26 Chen 2015-09-01 10:21:55 UTC
Hi Michael,

Besides the SSL3 certificate error, are there any other suspicious points to check?

Ryan, could you please give us your thoughts on the certificate error?

The deployment has now been stalled here for several days. Any progress would be appreciated, because we are not sure whether new issues will come up in the next steps of the deployment...

Thank you !

Chen

Comment 27 Ryan O'Hara 2015-09-01 13:25:56 UTC
(In reply to Chen from comment #26)
> Hi Michael,
> 
> Besides the SSL3 certificate error, are there any other suspicious points
> to check?
> 
> Ryan, could you please give us your thoughts on the certificate error?

1. Check that the key and crt files are present, non-empty, and exactly the same on all three nodes. Also check the permissions on each file.

2. Check that each node has the correct date/time. If your system time is off by too much you could get SSL errors.
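
Something along these lines should cover both checks (the certificate paths are illustrative - adjust them to wherever the installer placed the galera key and cert):

  # run on all three nodes; checksums and permissions must match everywhere
  md5sum /etc/pki/galera/galera.key /etc/pki/galera/galera.crt
  ls -l /etc/pki/galera/
  # query the clock offset against your NTP server without stepping the clock
  date; ntpdate -q <your-ntp-server>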

Ryan


> The deployment has now been stalled here for several days. Any progress
> would be appreciated, because we are not sure whether new issues will come
> up in the next steps of the deployment...
> 
> Thank you !
> 
> Chen

Comment 28 Chen 2015-09-02 05:01:33 UTC
Hi Ryan,

>2. Check that each node has the correct date/time. If your system 
>time is off by too much you could get SSL errors.

I checked the servers and they are almost 2 minutes off from each other.

After I specified a correct NTP server and used ntpdate to sync the time in advance, the error disappeared and the deployment finished successfully. Thank you very much for your help!
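
For reference, the manual sync on each node was essentially the following, with our reachable NTP server substituted in:

  ntpdate <ntp-server>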

Please allow me some more time to contact the on-site GPS engineer, and I'll ask him to double-confirm it.

Chen

Comment 29 Chen 2015-09-02 08:52:04 UTC
Hi Ryan, Michael, Andrew and Jason,

Just got the response from the on-site GPS engineer that this issue has been solved by configuring a correct NTP server, per Ryan's suggestion.

Due to the particular environment at ZTE, the deployed clients had no access to the NTP server we specified in advance in rhel-osp-installer, which led to a roughly 2-minute clock skew among those hosts. As a result, SSL errors occurred and galera could not form a cluster.

Thank you very much for your help !

Best Regards,
Chen

