
Bug 1128285

Summary: Rubygem-Staypuft: HA-neutron deployment fails - puppet agent run fails on keystone which reports OperationalError: (OperationalError) (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0") None None
Product: Red Hat OpenStack
Reporter: Alexander Chuzhoy <sasha>
Component: rubygem-staypuft
Assignee: Scott Seago <sseago>
Status: CLOSED ERRATA
QA Contact: Leonid Natapov <lnatapov>
Severity: high
Priority: high
Version: 5.0 (RHEL 7)
CC: aberezin, cwolfe, mburns, morazi, rhos-maint, yeylon
Target Milestone: ga
Keywords: TestOnly
Target Release: Installer
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-08-21 18:08:34 UTC
Attachments:
mariadb.log where the error happens. (flags: none)
keystone.log where the error happens. (flags: none)
the messages log from the host where the error happens. (flags: none)
mariadb.log where the error doesn't happen. (flags: none)

Description Alexander Chuzhoy 2014-08-08 19:34:17 UTC
Created attachment 925288 [details]
mariadb.log where the error happens.

Rubygem-Staypuft:  HA-neutron deployment fails - puppet agent run fails on keystone which reports OperationalError: (OperationalError) (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0") None None


Environment:
rhel-osp-installer-0.1.8-1.el6ost.noarch
openstack-foreman-installer-2.0.18-1.el6ost.noarch
ruby193-rubygem-foreman_openstack_simplify-0.0.6-8.el6ost.noarch
openstack-puppet-modules-2014.1-19.9.el6ost.noarch

Steps to reproduce:
1. Install rhel-osp-installer
2. Configure/run an HA neutron deployment (3 controllers +2 compute)

Result:
The deployment gets paused with an error (after being stuck for several hours) at 60%, while installing the controllers.

Running puppet agent on one controller, I get the following:
Error: /Stage[main]/Keystone::Roles::Admin/Keystone_role[_member_]: Could not evaluate: Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ role-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)                                                                                                                                                             
Error: /Stage[main]/Neutron::Keystone::Auth/Keystone_service[neutron]: Could not evaluate: Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ service-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)                                                                                                                                                       
Error: /Stage[main]/Ceilometer::Keystone::Auth/Keystone_service[ceilometer]: Could not evaluate: Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ service-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)                                                                                                                                                 
Error: /Stage[main]/Keystone::Roles::Admin/Keystone_tenant[admin]: Could not evaluate: Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ tenant-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)
Error: /Stage[main]/Ceilometer::Keystone::Auth/Keystone_role[ResellerAdmin]: Could not evaluate: Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ role-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)
Error: /Stage[main]/Nova::Keystone::Auth/Keystone_service[nova_ec2]: Could not evaluate: Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ service-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)
Error: /Stage[main]/Cinder::Keystone::Auth/Keystone_service[cinderv2]: Could not evaluate: Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ service-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)
Error: /Stage[main]/Heat::Keystone::Auth/Keystone_service[heat]: Could not evaluate: Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ service-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)
Error: Could not prefetch keystone_endpoint provider 'keystone': Execution of '/usr/bin/keystone --os-endpoint http://10.8.29.214:35357/v2.0/ endpoint-list' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500)
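For what it's worth, the same 500 can be reproduced by hand, outside of puppet, by running the keystone client against the admin endpoint with the admin token from keystone.conf (default config path assumed; <admin_token> is a placeholder, not a value from this deployment):

grep '^admin_token' /etc/keystone/keystone.conf
keystone --os-token <admin_token> --os-endpoint http://10.8.29.214:35357/v2.0/ role-list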



Checking the keystone log file:
2014-08-08 19:17:35.525 6649 TRACE keystone.common.wsgi OperationalError: (OperationalError) (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0") None None
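That trace means keystone never got past the MySQL handshake. A quick way to confirm it from the same controller (the keystone config path is the default one; <db_host> is a placeholder for whatever host the connection string points at):

# connection string keystone is actually using
grep '^connection' /etc/keystone/keystone.conf
# try that host directly with the mysql client
mysql -h <db_host> -P 3306 -u keystone -p -e 'SELECT 1;'
# check whether the local galera node considers itself part of a Primary component
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status';"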

Comment 1 Alexander Chuzhoy 2014-08-08 19:35:42 UTC
Created attachment 925289 [details]
keystone.log where the error happens.

Comment 2 Alexander Chuzhoy 2014-08-08 19:38:16 UTC
Created attachment 925290 [details]
the messages log from the host where the error happens.

Comment 3 Alexander Chuzhoy 2014-08-08 19:47:39 UTC
Created attachment 925292 [details]
mariadb.log where the error doesn't happen.

Comment 4 Crag Wolfe 2014-08-08 21:17:52 UTC
This looks suspicious in /var/log/messages:

Aug  8 14:56:29 maca25400702876 mysqld_safe: 140808 14:56:29 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.29hlr6' --pid-file='/var/lib/mysql/maca25400702876.example.com-recover.pid'
Aug  8 14:56:32 maca25400702876 mysqld_safe: 140808 14:56:32 mysqld_safe WSREP: Recovered position 00000000-0000-0000-0000-000000000000:-1
Aug  8 14:56:35 maca25400702876 rsyncd[16020]: rsyncd version 3.0.9 starting, listening on port 4444
Aug  8 14:56:35 maca25400702876 rsyncd[16034]: name lookup failed for 192.168.100.138: Name or service not known
Aug  8 14:56:35 maca25400702876 rsyncd[16034]: connect from UNKNOWN (192.168.100.138)
Aug  8 14:56:35 maca25400702876 rsyncd[16034]: rsync to rsync_sst/ from UNKNOWN (192.168.100.138)
Aug  8 14:56:35 maca25400702876 rsyncd[16034]: receiving file list
Aug  8 14:56:36 maca25400702876 rsyncd[16048]: name lookup failed for 192.168.100.138: Name or service not known
Aug  8 14:56:36 maca25400702876 rsyncd[16048]: connect from UNKNOWN (192.168.100.138)
Aug  8 14:56:36 maca25400702876 rsyncd[16034]: sent 72 bytes  received 18877038 bytes  total size 18874368
Aug  8 14:56:36 maca25400702876 rsyncd[16048]: rsync to rsync_sst-log_dir/ from UNKNOWN (192.168.100.138)
Aug  8 14:56:36 maca25400702876 rsyncd[16048]: receiving file list
Aug  8 14:56:37 maca25400702876 rsyncd[16048]: sent 73 bytes  received 10487256 bytes  total size 10485760
Aug  8 14:56:37 maca25400702876 rsyncd[16050]: name lookup failed for 192.168.100.138: Name or service not known
Aug  8 14:56:37 maca25400702876 rsyncd[16050]: connect from UNKNOWN (192.168.100.138)
Aug  8 14:56:37 maca25400702876 rsyncd[16050]: rsync to rsync_sst/./mysql from UNKNOWN (192.168.100.138)
Aug  8 14:56:37 maca25400702876 rsyncd[16050]: receiving file list

In my working setup, similar rsync messages (except that they succeed) reference the cluster_control_ip, unlike the above.  So the question is where 192.168.100.138 is coming from: either cluster_control_ip is set wrong, or galera or puppet is doing some extra (bad) inference.  Based on the attached messages, the cluster_control_ip *should* be one of 192.168.0.9, 192.168.0.10, or 192.168.0.11.
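One way to chase that down is to dump the wsrep addresses that were rendered on each controller (the config path below is the usual one for these packages, but may differ):

grep -E 'wsrep_(cluster|node)_address' /etc/my.cnf.d/galera.cnf
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_incoming_addresses';"

If 192.168.100.138 shows up in either, it was written into the galera configuration rather than inferred at runtime.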

Comment 5 Alexander Chuzhoy 2014-08-08 21:22:18 UTC
Reproduced with rhel-osp-installer-0.1.9-1.el6ost.noarch

Comment 6 Alexander Chuzhoy 2014-08-08 21:26:47 UTC
Correction to the previous comment: reproduced with rhel-osp-installer-0.1.9-1.el6ost.noarch with an HA-Nova deployment.

Comment 7 Crag Wolfe 2014-08-08 21:44:40 UTC
Further investigation with sseago and sasha on IRC revealed that private_ip was not set in the same subnet as the pacemaker cluster members / cluster_control_ip, which is most likely the underlying cause of this bug.
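For reference, a quick way to spot that kind of mismatch on a deployed controller (illustrative commands only; <cluster_member> is a placeholder for one of the node names shown by pcs):

# subnets actually configured on this node
ip -4 addr show
# names of the pacemaker cluster members
pcs status
# address a given member name resolves to
getent hosts <cluster_member>

The private_ip value staypuft passes to the node should fall in the same subnet as those member addresses.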

Comment 8 Mike Burns 2014-08-11 12:28:40 UTC
Moving back to staypuft based on comment 7.

Comment 9 Mike Burns 2014-08-12 20:03:07 UTC
Related to the bug where the IP address changes randomly, which is fixed in staypuft 0.2.5; please retest with that version.

Comment 11 Leonid Natapov 2014-08-18 10:01:58 UTC
Did not reproduce on rhel-osp-installer-0.1.10-2.el6ost.noarch.

Comment 12 errata-xmlrpc 2014-08-21 18:08:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1090.html