Bug 1102040 - neutron-server gets stuck in poll python-qpid 0.18
Summary: neutron-server gets stuck in poll python-qpid 0.18
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: python-qpid
Version: 2.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: 2.5.1
: ---
Assignee: Ken Giusti
QA Contact: Petr Matousek
URL:
Whiteboard:
Depends On:
Blocks: 1097306 1105094 1143749
 
Reported: 2014-05-28 11:31 UTC by Lukas Bezdicka
Modified: 2014-12-18 19:58 UTC (History)

Fixed In Version: python-qpid-0.18-12
Doc Type: Bug Fix
Doc Text:
Previously, the neutron messaging client rewrote ("monkey-patched") the Python select module to support eventlet threading. This process did not update select.poll(), which qpid-python uses to manage I/O. The result was poll() deadlocks and neutron server hangs. The fix updates the qpid-python library so that it avoids calling poll() when eventlet threading is detected. Instead, it calls the eventlet-aware select(), which prevents the deadlocks and corrects the originally reported issue.
Clone Of:
Clones: 1105094 1143749
Environment:
Last Closed: 2014-07-03 10:18:12 UTC
Target Upstream Version:
Embargoed:


Attachments
gdb python backtrace from stuck neutron-server (11.56 KB, text/plain), 2014-05-28 11:31 UTC, Lukas Bezdicka
gdb python backtrace with select() and monkey_patch(select=False) (33.68 KB, text/plain), 2014-05-28 19:11 UTC, Ken Giusti
Proposed patch (1.30 KB, patch), 2014-05-29 18:20 UTC, Ken Giusti


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Apache JIRA | QPID-5790 | 0 | None | None | None | Never
Red Hat Product Errata | RHBA-2014:0832 | 0 | normal | SHIPPED_LIVE | Red Hat Enterprise MRG Messaging 2 update | 2014-07-03 14:17:58 UTC

Description Lukas Bezdicka 2014-05-28 11:31:21 UTC
Created attachment 899956
gdb python backtrace from stuck neutron-server

Description of problem:
When deploying OpenStack with packstack on RHOS 5, neutron-server gets stuck while trying to communicate with qpid.

Version-Release number of selected component (if applicable):
python-qpid-0.18-10.el7.noarch
python-neutron-2014.1-22.el7ost.noarch
python-neutronclient-2.3.4-1.el7ost.noarch
openstack-neutron-2014.1-22.el7ost.noarch
openstack-neutron-openvswitch-2014.1-22.el7ost.noarch


How reproducible:
always

Steps to Reproduce:
1. Install RHOS and run: packstack --allinone --amqp-server=qpid
2. The run fails with the following message:
ERROR : Error appeared during Puppet run: <host_ip>_provision.pp
Error: Could not prefetch neutron_network provider 'neutron': Execution of '/usr/bin/neutron net-list --format=csv --column=id --quote=none' returned 1: Connection to neutron failed: Maximum attempts reached
You will find full trace in log /var/tmp/packstack/20140527-114456-DjC6dN/manifests/<host_ip>_provision.pp.log

Actual results:
neutron-server is stuck in poll().

Expected results:
After updating python-qpid to 0.24, everything works fine; see bug 1097306.


Additional info:

Comment 2 Petr Matousek 2014-05-28 12:36:22 UTC
The issue also appears with the latest package: python-qpid-0.18-11.

# rpm -qa | grep qpid
qpid-cpp-server-0.18-23.el7.x86_64
qpid-qmf-0.18-23.el7.x86_64
qpid-cpp-client-ssl-0.18-23.el7.x86_64
qpid-cpp-server-store-0.18-23.el7.x86_64
qpid-tests-0.18-2.el7.noarch
qpid-cpp-server-ha-0.18-23.el7.x86_64
python-qpid-0.18-11.el7.noarch
qpid-qmf-devel-0.18-23.el7.x86_64
qpid-cpp-client-devel-0.18-23.el7.x86_64
qpid-cpp-server-devel-0.18-23.el7.x86_64
qpid-cpp-server-rdma-0.18-23.el7.x86_64
ruby-qpid-qmf-0.18-23.el7.x86_64
qpid-qmf-debuginfo-0.18-23.el7.x86_64
qpid-cpp-client-rdma-0.18-23.el7.x86_64
qpid-cpp-server-cluster-0.18-23.el7.x86_64
qpid-tools-0.18-10.el7.noarch
rh-qpid-cpp-tests-0.18-23.el7.x86_64
qpid-cpp-client-0.18-23.el7.x86_64
python-qpid-qmf-0.18-23.el7.x86_64
qpid-cpp-server-ssl-0.18-23.el7.x86_64
qpid-cpp-debuginfo-0.18-23.el7.x86_64

Comment 3 Luigi Toscano 2014-05-28 12:55:58 UTC
(originally for rhbz1097306, but it belongs here):

I see this error also when using rabbitmq.

# packstack --allinone
[...]
Applying Puppet manifests                         [ ERROR ]

ERROR : Error appeared during Puppet run: <host_ip>_provision.pp
Error: Could not prefetch neutron_network provider 'neutron': Execution of '/usr/bin/neutron net-list --format=csv --column=id --quote=none' returned 1: Connection to neutron failed: Maximum attempts reached
You will find full trace in log /var/tmp/packstack/20140527-185339-Z7G1u4/manifests/<host_ip>_provision.pp.log



# rpm -qa |grep rabbit
rabbitmq-server-3.1.5-6.0.el7ost.noarch
# rpm -qa |grep qpid
python-qpid-0.18-10.el7.noarch
(I didn't install the python-qpid package)


Other relevant packages:
openstack-packstack-2014.1.1-0.15.dev1068.el7ost.noarch
openstack-packstack-puppet-2014.1.1-0.15.dev1068.el7ost.noarch
python-neutron-2014.1-22.el7ost.noarch
python-neutronclient-2.3.4-1.el7ost.noarch
openstack-neutron-2014.1-22.el7ost.noarch
openstack-neutron-openvswitch-2014.1-22.el7ost.noarch

Comment 5 Ken Giusti 2014-05-28 19:11:53 UTC
Created attachment 900106
gdb python backtrace with select() and monkey_patch(select=False)

I suspect this is related to monkey patching.

Some detail:

For the next release of the python qpid client we've changed the implementation to prefer poll() over select(). This was to fix a scale issue hit in the field (select() fails for sockets with an FD value >= 1024, regardless of ulimit settings). See https://issues.apache.org/jira/browse/QPID-5588 for the gory details.
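
A minimal sketch of the descriptor limit described above, written for Python 3 and assuming a Linux-like platform where FD_SETSIZE is 1024. The helper names select_accepts and poll_accepts are hypothetical and are not part of python-qpid:

```python
import select

# Assumption: FD_SETSIZE is 1024 on this platform (typical for Linux).
HIGH_FD = 1024


def select_accepts(fd):
    """Return False if select() rejects the descriptor value as out of range."""
    try:
        select.select([fd], [], [], 0)
    except ValueError:
        # CPython raises ValueError when fd >= FD_SETSIZE.
        return False
    except OSError:
        # The value is in range; the descriptor is simply not open.
        return True
    return True


def poll_accepts(fd):
    """Return True if poll() can register and poll the descriptor value."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    # An unopened descriptor is reported as an event, not raised as an error.
    poller.poll(0)
    return True
```

The check only looks at the numeric value of the descriptor, so it does not need to open 1024 real sockets.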

This fix was merged to the python-qpid-0.18 release.

With Petr's help, I was able to reproduce this problem quite easily. If I back out the poll() change and use select() instead, the problem goes away.

However, once I'm using select() instead of poll(), if I disable monkey patching of select, the exact same failure occurs. In fact, the stack trace appears to be the same, except that it hangs in select() instead of poll().

Could this problem be caused by monkey patching _not_ handling poll correctly?

Comment 6 Ken Giusti 2014-05-28 20:13:45 UTC
Confirmed: eventlet does not support poll(), only select().

I've posted a JIRA upstream at qpid (QPID-5790) proposing a fix for this in the python client.
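
A minimal sketch of the kind of guard such a fix could use, written for Python 3. This is not the attached patch: the helper names select_is_monkey_patched and wait_readable are hypothetical, and the sketch assumes that eventlet, when installed, exposes eventlet.patcher.is_monkey_patched():

```python
import select
import socket


def select_is_monkey_patched():
    """Return True if eventlet reports that it has patched the select module."""
    try:
        from eventlet import patcher  # optional dependency, may be absent
    except ImportError:
        return False
    return bool(patcher.is_monkey_patched("select"))


def wait_readable(socks, timeout):
    """Return the sockets in `socks` that report an event within `timeout` seconds."""
    if hasattr(select, "poll") and not select_is_monkey_patched():
        # No eventlet patching detected: poll() is safe and has no FD_SETSIZE limit.
        poller = select.poll()
        by_fd = {}
        for sock in socks:
            by_fd[sock.fileno()] = sock
            poller.register(sock.fileno(), select.POLLIN)
        # poll() takes its timeout in milliseconds.
        millis = None if timeout is None else int(timeout * 1000)
        return [by_fd[fd] for fd, _event in poller.poll(millis)]
    # Patched (or no poll() available): fall back to select().
    readable, _, _ = select.select(socks, [], [], timeout)
    return readable
```

For example, with a, b = socket.socketpair(), calling wait_readable([a], 1.0) after b.sendall(b"x") returns [a]. Under eventlet monkey patching the same call would go through the select() branch.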

Comment 7 Ken Giusti 2014-05-29 18:20:33 UTC
Created attachment 900480
Proposed patch

Comment 10 Lukas Bezdicka 2014-05-30 14:06:15 UTC
The patch works fine; packstack deployed correctly.

Comment 11 Petr Matousek 2014-05-30 16:23:07 UTC
This issue is fixed with python-qpid-0.18-12.el7:

# neutron net-list --format=csv --column=id --quote=none
id
0339e1fa-b012-466d-8b40-d87aac4a02f0
f2e01faf-53e1-494b-96dc-a8267835ab6a


Before the fix, communication between OpenStack and qpidd did not work at all; only the following error appeared in the qpidd log:
[System] error Connection <host_ip>:5672-<host_ip>:57023 No protocol received closing

With the fix, restarting OpenStack is quick and all the nodes are created successfully on the qpidd side.

Once the packstack installation has been retested and the package is in the errata, this issue may be moved to VERIFIED.

Comment 12 Petr Matousek 2014-05-30 18:26:49 UTC
The packstack installation succeeded without any issues; see bug 1097306, comment 22, for details.

Comment 14 Petr Matousek 2014-06-05 11:30:11 UTC
This issue has been fixed. Verified on RHEL 7 (x86_64).

Packages under test:
python-qpid-0.18-12.el7.noarch
openstack-packstack-2014.1.1-0.19.dev1102.el7ost

-> VERIFIED

Comment 16 errata-xmlrpc 2014-07-03 10:18:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0832.html

