Created attachment 925288 [details]
mariadb.log where the error happens.

Rubygem-Staypuft: HA-neutron deployment fails - the puppet agent run fails on keystone, which reports:
OperationalError: (OperationalError) (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0") None None

Environment:
rhel-osp-installer-0.1.8-1.el6ost.noarch
openstack-foreman-installer-2.0.18-1.el6ost.noarch
ruby193-rubygem-foreman_openstack_simplify-0.0.6-8.el6ost.noarch
openstack-puppet-modules-2014.1-19.9.el6ost.noarch

Steps to reproduce:
1. Install rhel-osp-installer
2. Configure and run an HA neutron deployment (3 controllers + 2 compute)

Result:
The deployment pauses with an error at 60%, while installing the controllers, after being stuck for several hours. Running the puppet agent on one controller yields the following:

Error: /Stage[main]/Keystone::Roles::Admin/Keystone_role[_member_]: Could not evaluate: Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ role-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)
Error: /Stage[main]/Neutron::Keystone::Auth/Keystone_service[neutron]: Could not evaluate: Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ service-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)
Error: /Stage[main]/Ceilometer::Keystone::Auth/Keystone_service[ceilometer]: Could not evaluate: Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ service-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)
Error: /Stage[main]/Keystone::Roles::Admin/Keystone_tenant[admin]: Could not evaluate: Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ tenant-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)
Error: /Stage[main]/Ceilometer::Keystone::Auth/Keystone_role[ResellerAdmin]: Could not evaluate: Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ role-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)
Error: /Stage[main]/Nova::Keystone::Auth/Keystone_service[nova_ec2]: Could not evaluate: Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ service-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)
Error: /Stage[main]/Cinder::Keystone::Auth/Keystone_service[cinderv2]: Could not evaluate: Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ service-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)
Error: /Stage[main]/Heat::Keystone::Auth/Keystone_service[heat]: Could not evaluate: Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ service-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)
Error: Could not prefetch keystone_endpoint provider 'keystone': Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ endpoint-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)

Checking the keystone log file:
2014-08-08 19:17:35.525 6649 TRACE keystone.common.wsgi OperationalError: (OperationalError) (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0") None None
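For anyone triaging a similar failure: the keystone HTTP 500s combined with MySQL error 2013 point at the database layer rather than at keystone itself. A minimal diagnostic sketch, assuming the 10.8.29.214 VIP taken from the errors above and the default keystone.conf location (adjust to the actual deployment):

  # Is the Galera node reachable and joined to the cluster?
  mysql -h 10.8.29.214 -u root -p -e "SHOW STATUS LIKE 'wsrep_cluster_size'; SHOW STATUS LIKE 'wsrep_local_state_comment';"

  # Which database connection string is keystone actually using?
  grep '^connection' /etc/keystone/keystone.conf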
Created attachment 925289 [details] keystone.log where the error happens.
Created attachment 925290 [details] the messages log from the host where the error happens.
Created attachment 925292 [details] mariadb.log where the error doesn't happen.
This looks suspicious in /var/log/messages:

Aug 8 14:56:29 maca25400702876 mysqld_safe: 140808 14:56:29 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.29hlr6' --pid-file='/var/lib/mysql/maca25400702876.example.com-recover.pid'
Aug 8 14:56:32 maca25400702876 mysqld_safe: 140808 14:56:32 mysqld_safe WSREP: Recovered position 00000000-0000-0000-0000-000000000000:-1
Aug 8 14:56:35 maca25400702876 rsyncd[16020]: rsyncd version 3.0.9 starting, listening on port 4444
Aug 8 14:56:35 maca25400702876 rsyncd[16034]: name lookup failed for 192.168.100.138: Name or service not known
Aug 8 14:56:35 maca25400702876 rsyncd[16034]: connect from UNKNOWN (192.168.100.138)
Aug 8 14:56:35 maca25400702876 rsyncd[16034]: rsync to rsync_sst/ from UNKNOWN (192.168.100.138)
Aug 8 14:56:35 maca25400702876 rsyncd[16034]: receiving file list
Aug 8 14:56:36 maca25400702876 rsyncd[16048]: name lookup failed for 192.168.100.138: Name or service not known
Aug 8 14:56:36 maca25400702876 rsyncd[16048]: connect from UNKNOWN (192.168.100.138)
Aug 8 14:56:36 maca25400702876 rsyncd[16034]: sent 72 bytes received 18877038 bytes total size 18874368
Aug 8 14:56:36 maca25400702876 rsyncd[16048]: rsync to rsync_sst-log_dir/ from UNKNOWN (192.168.100.138)
Aug 8 14:56:36 maca25400702876 rsyncd[16048]: receiving file list
Aug 8 14:56:37 maca25400702876 rsyncd[16048]: sent 73 bytes received 10487256 bytes total size 10485760
Aug 8 14:56:37 maca25400702876 rsyncd[16050]: name lookup failed for 192.168.100.138: Name or service not known
Aug 8 14:56:37 maca25400702876 rsyncd[16050]: connect from UNKNOWN (192.168.100.138)
Aug 8 14:56:37 maca25400702876 rsyncd[16050]: rsync to rsync_sst/./mysql from UNKNOWN (192.168.100.138)
Aug 8 14:56:37 maca25400702876 rsyncd[16050]: receiving file list

In my working setup, similar rsync messages (except that they succeed) reference the cluster_control_ip, unlike the above. So the question is where 192.168.100.138 is coming from: either cluster_control_ip is set wrong, or galera or puppet is doing some extra (bad) inference. Based on the attached messages log, the cluster_control_ip *should* be one of 192.168.0.9, 192.168.0.10, or 192.168.0.11.
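To narrow down where the stray address comes from, it may help to compare the Galera settings puppet wrote out with what mysqld is actually using. A rough sketch, assuming the config lives under /etc/my.cnf.d/ as is typical for mariadb-galera on RHEL 6 (the exact file name may differ in this deployment):

  # Address list this node uses to join the cluster
  grep -i wsrep_cluster_address /etc/my.cnf /etc/my.cnf.d/*.cnf 2>/dev/null

  # Addresses the node advertises for state transfers (rsync SST)
  grep -iE 'wsrep_(node_address|sst_receive_address)' /etc/my.cnf /etc/my.cnf.d/*.cnf 2>/dev/null

If 192.168.100.138 shows up in any of these, the value was rendered into the config by puppet rather than inferred at runtime by galera.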
Reproduced with rhel-osp-installer-0.1.9-1.el6ost.noarch
Correction to the previous comment: reproduced with rhel-osp-installer-0.1.9-1.el6ost.noarch with an HA-Nova deployment.
Further investigation with sseago and sasha on IRC revealed that private_ip was not set in the same subnet as the pacemaker cluster members / cluster_control_ip, which is most likely the underlying cause of this bug.
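A quick sanity check for this failure mode is to confirm that the node's private_ip falls in the same subnet as the cluster_control_ip. A sketch using the addresses from this report and assuming a /24 netmask (ipcalc ships with RHEL):

  # Network containing the cluster_control_ip candidates
  ipcalc -n 192.168.0.9/24       # NETWORK=192.168.0.0

  # Network computed from the address rsyncd saw
  ipcalc -n 192.168.100.138/24   # NETWORK=192.168.100.0

The two networks differ, which is consistent with private_ip being set outside the pacemaker/cluster_control_ip subnet.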
Moving back to staypuft based on comment 7
This is related to the bug where the IP address changes randomly, which is fixed in staypuft 0.2.5; please retest with that version.
Did not reproduce on rhel-osp-installer-0.1.10-2.el6ost.noarch.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-1090.html