Bug 1128285 - Rubygem-Staypuft: HA-neutron deployment fails - puppet agent run fails on keystone which reports OperationalError: (OperationalError) (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0") None None
Summary: Rubygem-Staypuft: HA-neutron deployment fails - puppet agent run fails on ke...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rubygem-staypuft
Version: 5.0 (RHEL 7)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ga
Sub Component: Installer
Assignee: Scott Seago
QA Contact: Leonid Natapov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-08-08 19:34 UTC by Alexander Chuzhoy
Modified: 2014-08-21 18:08 UTC (History)
6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-08-21 18:08:34 UTC


Attachments
mariadb.log where the error happens. (54.64 KB, text/x-log)
2014-08-08 19:34 UTC, Alexander Chuzhoy
keystone.log where the error happens. (976.22 KB, text/x-log)
2014-08-08 19:35 UTC, Alexander Chuzhoy
the messages log from the host where the error happens. (3.22 MB, text/plain)
2014-08-08 19:38 UTC, Alexander Chuzhoy
mariadb.log where the error doesn't happen. (55.08 KB, text/x-log)
2014-08-08 19:47 UTC, Alexander Chuzhoy


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2014:1090 normal SHIPPED_LIVE Red Hat Enterprise Linux OpenStack Platform Enhancement Advisory 2014-08-22 15:28:08 UTC

Description Alexander Chuzhoy 2014-08-08 19:34:17 UTC
Created attachment 925288 [details]
mariadb.log where the error happens.

Rubygem-Staypuft:  HA-neutron deployment fails - puppet agent run fails on keystone which reports OperationalError: (OperationalError) (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0") None None


Environment:
rhel-osp-installer-0.1.8-1.el6ost.noarch
openstack-foreman-installer-2.0.18-1.el6ost.noarch
ruby193-rubygem-foreman_openstack_simplify-0.0.6-8.el6ost.noarch
openstack-puppet-modules-2014.1-19.9.el6ost.noarch

Steps to reproduce:
1. Install rhel-osp-installer
2. Configure/run an HA neutron deployment (3 controllers +2 compute)

Result:
The deployment pauses with an error at 60%, while installing the controllers, after several hours of being stuck.

Running puppet agent on one controller yields the following:
Error: /Stage[main]/Keystone::Roles::Admin/Keystone_role[_member_]: Could not evaluate: Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ role-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)                                                                                                                                                             
Error: /Stage[main]/Neutron::Keystone::Auth/Keystone_service[neutron]: Could not evaluate: Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ service-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)                                                                                                                                                       
Error: /Stage[main]/Ceilometer::Keystone::Auth/Keystone_service[ceilometer]: Could not evaluate: Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ service-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)                                                                                                                                                 
Error: /Stage[main]/Keystone::Roles::Admin/Keystone_tenant[admin]: Could not evaluate: Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ tenant-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)
Error: /Stage[main]/Ceilometer::Keystone::Auth/Keystone_role[ResellerAdmin]: Could not evaluate: Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ role-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)
Error: /Stage[main]/Nova::Keystone::Auth/Keystone_service[nova_ec2]: Could not evaluate: Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ service-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)
Error: /Stage[main]/Cinder::Keystone::Auth/Keystone_service[cinderv2]: Could not evaluate: Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ service-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)
Error: /Stage[main]/Heat::Keystone::Auth/Keystone_service[heat]: Could not evaluate: Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ service-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)
Error: Could not prefetch keystone_endpoint provider 'keystone': Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ endpoint-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)



Checking the keystone log file:
2014-08-08 19:17:35.525 6649 TRACE keystone.common.wsgi OperationalError: (OperationalError) (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0") None None

Comment 1 Alexander Chuzhoy 2014-08-08 19:35:42 UTC
Created attachment 925289 [details]
keystone.log where the error happens.

Comment 2 Alexander Chuzhoy 2014-08-08 19:38:16 UTC
Created attachment 925290 [details]
the messages log from the host where the error happens.

Comment 3 Alexander Chuzhoy 2014-08-08 19:47:39 UTC
Created attachment 925292 [details]
mariadb.log where the error doesn't happen.

Comment 4 Crag Wolfe 2014-08-08 21:17:52 UTC
This looks suspicious in /var/log/messages:

Aug  8 14:56:29 maca25400702876 mysqld_safe: 140808 14:56:29 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.29hlr6' --pid-file='/var/lib/mysql/maca25400702876.example.com-recover.pid'
Aug  8 14:56:32 maca25400702876 mysqld_safe: 140808 14:56:32 mysqld_safe WSREP: Recovered position 00000000-0000-0000-0000-000000000000:-1
Aug  8 14:56:35 maca25400702876 rsyncd[16020]: rsyncd version 3.0.9 starting, listening on port 4444
Aug  8 14:56:35 maca25400702876 rsyncd[16034]: name lookup failed for 192.168.100.138: Name or service not known
Aug  8 14:56:35 maca25400702876 rsyncd[16034]: connect from UNKNOWN (192.168.100.138)
Aug  8 14:56:35 maca25400702876 rsyncd[16034]: rsync to rsync_sst/ from UNKNOWN (192.168.100.138)
Aug  8 14:56:35 maca25400702876 rsyncd[16034]: receiving file list
Aug  8 14:56:36 maca25400702876 rsyncd[16048]: name lookup failed for 192.168.100.138: Name or service not known
Aug  8 14:56:36 maca25400702876 rsyncd[16048]: connect from UNKNOWN (192.168.100.138)
Aug  8 14:56:36 maca25400702876 rsyncd[16034]: sent 72 bytes  received 18877038 bytes  total size 18874368
Aug  8 14:56:36 maca25400702876 rsyncd[16048]: rsync to rsync_sst-log_dir/ from UNKNOWN (192.168.100.138)
Aug  8 14:56:36 maca25400702876 rsyncd[16048]: receiving file list
Aug  8 14:56:37 maca25400702876 rsyncd[16048]: sent 73 bytes  received 10487256 bytes  total size 10485760
Aug  8 14:56:37 maca25400702876 rsyncd[16050]: name lookup failed for 192.168.100.138: Name or service not known
Aug  8 14:56:37 maca25400702876 rsyncd[16050]: connect from UNKNOWN (192.168.100.138)
Aug  8 14:56:37 maca25400702876 rsyncd[16050]: rsync to rsync_sst/./mysql from UNKNOWN (192.168.100.138)
Aug  8 14:56:37 maca25400702876 rsyncd[16050]: receiving file list

In my working setup, similar rsync messages (except that they succeed) reference the cluster_control_ip, unlike the above. So the question is where 192.168.100.138 is coming from: either cluster_control_ip is set incorrectly, or galera or puppet is doing some extra (bad) inference. Based on the attached messages, cluster_control_ip *should* be one of 192.168.0.9, 192.168.0.10, or 192.168.0.11.

Comment 5 Alexander Chuzhoy 2014-08-08 21:22:18 UTC
Reproduced with rhel-osp-installer-0.1.9-1.el6ost.noarch

Comment 6 Alexander Chuzhoy 2014-08-08 21:26:47 UTC
Correction to the previous comment:
Reproduced with rhel-osp-installer-0.1.9-1.el6ost.noarch with HA-Nova deployment.

Comment 7 Crag Wolfe 2014-08-08 21:44:40 UTC
Further investigation with sseago and sasha on IRC revealed that private_ip was not set in the same subnet as the pacemaker cluster members / cluster_control_ip, which is most likely the underlying cause of this bug.
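The mismatch above can be caught mechanically: 192.168.100.138 (the address the failing SST traffic came from) is not in the same subnet as the cluster members at 192.168.0.9-11. A small sanity check, not part of staypuft, illustrates the test; the /24 prefix is an assumption for this example:

```python
# Illustrative check: verify that a node's private_ip falls in the
# same subnet as the other cluster members.  The /24 prefix is an
# assumption; the real netmask would come from the host's config.
import ipaddress

cluster_members = ["192.168.0.9", "192.168.0.10", "192.168.0.11"]
subnet = ipaddress.ip_network("192.168.0.0/24")

def in_cluster_subnet(ip, net=subnet):
    return ipaddress.ip_address(ip) in net

# The address the failing SST traffic referenced:
print(in_cluster_subnet("192.168.100.138"))  # -> False
# The expected cluster_control_ip candidates:
print(all(in_cluster_subnet(m) for m in cluster_members))  # -> True
```

A check like this at deployment time would have flagged the bad private_ip before the Galera sync stalled.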

Comment 8 Mike Burns 2014-08-11 12:28:40 UTC
Moving back to staypuft based on comment 7

Comment 9 Mike Burns 2014-08-12 20:03:07 UTC
Related to the randomly changing IP address bug, which is fixed in staypuft 0.2.5; please retest with that version.

Comment 11 Leonid Natapov 2014-08-18 10:01:58 UTC
Did not reproduce on rhel-osp-installer-0.1.10-2.el6ost.noarch.

Comment 12 errata-xmlrpc 2014-08-21 18:08:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1090.html

