Created attachment 899956 [details]
gdb python backtrace from stuck neutron-server

Description of problem:
When deploying OpenStack with packstack on RHOS 5, neutron-server gets stuck trying to communicate with qpid.

Version-Release number of selected component (if applicable):
python-qpid-0.18-10.el7.noarch
python-neutron-2014.1-22.el7ost.noarch
python-neutronclient-2.3.4-1.el7ost.noarch
openstack-neutron-2014.1-22.el7ost.noarch
openstack-neutron-openvswitch-2014.1-22.el7ost.noarch

How reproducible:
always

Steps to Reproduce:
1. Install RHOS and run packstack --allinone --amqp-server=qpid
2. The run fails with the message:

ERROR : Error appeared during Puppet run: <host_ip>_provision.pp
Error: Could not prefetch neutron_network provider 'neutron': Execution of '/usr/bin/neutron net-list --format=csv --column=id --quote=none' returned 1: Connection to neutron failed: Maximum attempts reached
You will find full trace in log /var/tmp/packstack/20140527-114456-DjC6dN/manifests/<host_ip>_provision.pp.log

Actual results:
neutron is stuck in poll

Expected results:
After updating python-qpid to 0.24, everything works fine; see bug #1097306.

Additional info:
The issue also appears when using the latest packages: python-qpid-0.18-11.

# rpm -qa | grep qpid
qpid-cpp-server-0.18-23.el7.x86_64
qpid-qmf-0.18-23.el7.x86_64
qpid-cpp-client-ssl-0.18-23.el7.x86_64
qpid-cpp-server-store-0.18-23.el7.x86_64
qpid-tests-0.18-2.el7.noarch
qpid-cpp-server-ha-0.18-23.el7.x86_64
python-qpid-0.18-11.el7.noarch
qpid-qmf-devel-0.18-23.el7.x86_64
qpid-cpp-client-devel-0.18-23.el7.x86_64
qpid-cpp-server-devel-0.18-23.el7.x86_64
qpid-cpp-server-rdma-0.18-23.el7.x86_64
ruby-qpid-qmf-0.18-23.el7.x86_64
qpid-qmf-debuginfo-0.18-23.el7.x86_64
qpid-cpp-client-rdma-0.18-23.el7.x86_64
qpid-cpp-server-cluster-0.18-23.el7.x86_64
qpid-tools-0.18-10.el7.noarch
rh-qpid-cpp-tests-0.18-23.el7.x86_64
qpid-cpp-client-0.18-23.el7.x86_64
python-qpid-qmf-0.18-23.el7.x86_64
qpid-cpp-server-ssl-0.18-23.el7.x86_64
qpid-cpp-debuginfo-0.18-23.el7.x86_64
(originally for rhbz1097306, but it belongs here):

I see this error also when using rabbitmq.

# packstack --allinone
[...]
Applying Puppet manifests [ ERROR ]
ERROR : Error appeared during Puppet run: <host_ip>_provision.pp
Error: Could not prefetch neutron_network provider 'neutron': Execution of '/usr/bin/neutron net-list --format=csv --column=id --quote=none' returned 1: Connection to neutron failed: Maximum attempts reached
You will find full trace in log /var/tmp/packstack/20140527-185339-Z7G1u4/manifests/<host_ip>_provision.pp.log

# rpm -qa |grep rabbit
rabbitmq-server-3.1.5-6.0.el7ost.noarch

# rpm -qa |grep qpid
python-qpid-0.18-10.el7.noarch
(I didn't install the python-qpid package)

Other relevant packages:
openstack-packstack-2014.1.1-0.15.dev1068.el7ost.noarch
openstack-packstack-puppet-2014.1.1-0.15.dev1068.el7ost.noarch
python-neutron-2014.1-22.el7ost.noarch
python-neutronclient-2.3.4-1.el7ost.noarch
openstack-neutron-2014.1-22.el7ost.noarch
openstack-neutron-openvswitch-2014.1-22.el7ost.noarch
Created attachment 900106 [details]
gdb python backtrace with select() and monkey_patch(select=False)

I suspect this is related to monkey patching. Some detail:

For the next release of the python qpid client we changed the implementation to prefer poll() over select(). This fixes a scale issue hit in the field: select() fails for sockets with an FD value >= 1024, regardless of ulimit settings. See https://issues.apache.org/jira/browse/QPID-5588 for the gory details. This fix was merged into the python-qpid-0.18 release.

With Petr's help, I was able to reproduce this problem quite easily. If I back out the poll() change and use select() instead, the problem goes away. However, once I'm using select() instead of poll(), disabling monkey patching of select brings back the exact same failure. In fact, the stack trace appears to be the same, except that it hangs in select() instead of poll().

Could this problem be caused by monkey patching _not_ handling poll() correctly?
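For readers unfamiliar with the select() limit mentioned above, here is a minimal sketch of how it shows up from Python. It assumes a POSIX platform where FD_SETSIZE is 1024 (typical on Linux); the helper name select_accepts_fd is hypothetical and is not part of python-qpid.

```python
import select
import socket


def select_accepts_fd(fd):
    """Return True if select.select() accepts this descriptor number.

    Hypothetical helper for illustration only. On POSIX builds, CPython
    raises ValueError for a descriptor that does not fit in an fd_set,
    before the descriptor is ever handed to the kernel.
    """
    try:
        select.select([fd], [], [], 0)
    except ValueError:
        return False
    return True


# A freshly created socket normally gets a small descriptor number.
left, right = socket.socketpair()
print(select_accepts_fd(left.fileno()))

# Assumption: FD_SETSIZE is 1024, so descriptor 1024 is already too large
# for select(), whatever the process ulimit says.
print(select_accepts_fd(1024))

# poll() keeps descriptors in a list rather than a fixed-size bitmap,
# so registering the same number is not rejected.
poller = select.poll()
poller.register(1024, select.POLLIN)

left.close()
right.close()
```

This is why a long-running client with many open connections can hit the limit even when ulimit allows far more than 1024 descriptors.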
Confirmed - poll() is not supported by eventlet, only select() is. I've posted a JIRA upstream at qpid proposing a fix for this in the python client.
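One possible shape for such a fix, shown only as a rough sketch and not as the attached patch: let the caller choose between poll() and select(), so that code running under eventlet can force the select() path that monkey patching does handle. The function name wait_readable and the use_poll parameter are assumptions made for this illustration.

```python
import select
import socket


def wait_readable(sock, timeout, use_poll=None):
    """Return True if sock becomes readable within timeout seconds.

    use_poll=None picks poll() when the platform provides it.
    use_poll=False forces select(), which is the call a monkey-patched
    environment is expected to make cooperative.
    """
    if use_poll is None:
        use_poll = hasattr(select, "poll")
    if use_poll:
        poller = select.poll()
        poller.register(sock.fileno(), select.POLLIN)
        # poll() takes its timeout in milliseconds.
        return bool(poller.poll(int(timeout * 1000)))
    readable, _, _ = select.select([sock], [], [], timeout)
    return bool(readable)


# Usage: both code paths report the same readiness on a plain socket pair.
left, right = socket.socketpair()
left.send(b"x")
print(wait_readable(right, 1.0, use_poll=True))
print(wait_readable(right, 1.0, use_poll=False))
left.close()
right.close()
```

The trade-off is that forcing select() brings back the FD >= 1024 limit described earlier, so a real fix would want to use select() only when it has to.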
Created attachment 900480 [details] Proposed patch
The patch works fine; packstack deployed correctly.
This issue is fixed with python-qpid-0.18-12.el7:

# neutron net-list --format=csv --column=id --quote=none
id
0339e1fa-b012-466d-8b40-d87aac4a02f0
f2e01faf-53e1-494b-96dc-a8267835ab6a

Before the fix, communication between openstack and qpidd wasn't working at all; only the following errors were displayed in the qpidd log:

[System] error Connection <host_ip>:5672-<host_ip>:57023 No protocol received closing

With the fix, restarting openstack is quick and all the nodes are created successfully on the qpidd side.

Once the packstack installation has been retested and the package is in the errata, this issue may be moved to VERIFIED.
packstack installation was successful without any issues, see bug 1097306, comment 22 for details.
This issue has been fixed. Verified on rhel7 (x86_64).

packages under test:
python-qpid-0.18-12.el7.noarch
openstack-packstack-2014.1.1-0.19.dev1102.el7ost

-> VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-0832.html