Bug 1102040
Summary: | neutron-server gets stuck in poll python-qpid 0.18
---|---
Product: | Red Hat Enterprise MRG
Component: | python-qpid
Version: | 2.4
Status: | CLOSED ERRATA
Severity: | high
Priority: | high
Reporter: | Lukas Bezdicka <lbezdick>
Assignee: | Ken Giusti <kgiusti>
QA Contact: | Petr Matousek <pematous>
CC: | acathrow, freznice, jross, kgiusti, lbezdick, lzhaldyb, mburns, mcressma, mwagner, pematous, tross
Target Milestone: | 2.5.1
Hardware: | Unspecified
OS: | Unspecified
Fixed In Version: | python-qpid-0.18-12
Doc Type: | Bug Fix
Doc Text: | Previously, the neutron messaging client rewrote ("monkey-patched") the Python select module to support eventlet threading. The monkey-patching did not cover select.poll(), which qpid-python uses to manage I/O. This resulted in poll() deadlocks and neutron server hangs. The fix updates the qpid-python library to avoid calling poll() if eventlet threading is detected. Instead, the eventlet-aware select() is called, which prevents the deadlocks and corrects the originally reported issue.
Clone Of: |
: | 1105094 1143749
Last Closed: | 2014-07-03 10:18:12 UTC
Type: | Bug
Bug Blocks: | 1097306, 1105094, 1143749
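The Doc Text above describes the fix as avoiding poll() when eventlet threading is detected and calling the eventlet-aware select() instead. A minimal sketch of that idea follows. It is not the shipped python-qpid patch: the names `select_is_green` and `wait_readable` are hypothetical, and the detection heuristic (checking whether `select.select` is defined in an `eventlet` module) is an assumption made for illustration only.

```python
import select
import socket


def select_is_green(select_module=select):
    """Guess whether select() has been monkey-patched by eventlet.

    Assumption for this sketch: a patched select() is a Python function
    whose __module__ starts with "eventlet". The real fix may detect
    eventlet differently.
    """
    module_name = getattr(select_module.select, "__module__", None) or ""
    return module_name.startswith("eventlet")


def wait_readable(socks, timeout=None, select_module=select):
    """Return the sockets in `socks` that are readable within `timeout` seconds."""
    use_poll = hasattr(select_module, "poll") and not select_is_green(select_module)
    if not use_poll:
        # Either poll() is unavailable or select() is eventlet-aware:
        # stay on select() so a green thread never blocks in a native poll().
        readable, _, _ = select_module.select(socks, [], [], timeout)
        return readable

    # Native path: poll() has no FD_SETSIZE limit on descriptor values.
    poller = select_module.poll()
    by_fd = {}
    for sock in socks:
        by_fd[sock.fileno()] = sock
        poller.register(sock.fileno(), select.POLLIN)
    timeout_ms = None if timeout is None else int(timeout * 1000)
    return [by_fd[fd] for fd, _event in poller.poll(timeout_ms)]


if __name__ == "__main__":
    left, right = socket.socketpair()
    left.send(b"ping")
    print(wait_readable([right], timeout=1.0) == [right])
```

The point of the design is that the choice is made at call time, so the same library code works both in plain processes (keeping the poll() scale benefit) and in processes where eventlet has patched select.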
The issue also appears when using the latest packages (python-qpid-0.18-11):

    # rpm -qa | grep qpid
    qpid-cpp-server-0.18-23.el7.x86_64
    qpid-qmf-0.18-23.el7.x86_64
    qpid-cpp-client-ssl-0.18-23.el7.x86_64
    qpid-cpp-server-store-0.18-23.el7.x86_64
    qpid-tests-0.18-2.el7.noarch
    qpid-cpp-server-ha-0.18-23.el7.x86_64
    python-qpid-0.18-11.el7.noarch
    qpid-qmf-devel-0.18-23.el7.x86_64
    qpid-cpp-client-devel-0.18-23.el7.x86_64
    qpid-cpp-server-devel-0.18-23.el7.x86_64
    qpid-cpp-server-rdma-0.18-23.el7.x86_64
    ruby-qpid-qmf-0.18-23.el7.x86_64
    qpid-qmf-debuginfo-0.18-23.el7.x86_64
    qpid-cpp-client-rdma-0.18-23.el7.x86_64
    qpid-cpp-server-cluster-0.18-23.el7.x86_64
    qpid-tools-0.18-10.el7.noarch
    rh-qpid-cpp-tests-0.18-23.el7.x86_64
    qpid-cpp-client-0.18-23.el7.x86_64
    python-qpid-qmf-0.18-23.el7.x86_64
    qpid-cpp-server-ssl-0.18-23.el7.x86_64
    qpid-cpp-debuginfo-0.18-23.el7.x86_64

(Originally for rhbz1097306, but it belongs here.) I also see this error when using rabbitmq.

    # packstack --allinone
    [...]
    Applying Puppet manifests [ ERROR ]
    ERROR : Error appeared during Puppet run: <host_ip>_provision.pp
    Error: Could not prefetch neutron_network provider 'neutron': Execution of '/usr/bin/neutron net-list --format=csv --column=id --quote=none' returned 1: Connection to neutron failed: Maximum attempts reached
    You will find full trace in log /var/tmp/packstack/20140527-185339-Z7G1u4/manifests/<host_ip>_provision.pp.log

    # rpm -qa |grep rabbit
    rabbitmq-server-3.1.5-6.0.el7ost.noarch
    # rpm -qa |grep qpid
    python-qpid-0.18-10.el7.noarch

(I didn't install the python-qpid package)

Other relevant packages:

    openstack-packstack-2014.1.1-0.15.dev1068.el7ost.noarch
    openstack-packstack-puppet-2014.1.1-0.15.dev1068.el7ost.noarch
    python-neutron-2014.1-22.el7ost.noarch
    python-neutronclient-2.3.4-1.el7ost.noarch
    openstack-neutron-2014.1-22.el7ost.noarch
    openstack-neutron-openvswitch-2014.1-22.el7ost.noarch

Created attachment 900106
gdb python backtrace with select() and monkey_patch(select=False)

I suspect this is related to monkey patching. Some detail:

For the next release of the python qpid client we've changed the implementation to prefer poll() to select(). This was to fix a scale issue hit in the field (select() fails for sockets with an FD value >= 1024, regardless of ulimit settings). See https://issues.apache.org/jira/browse/QPID-5588 for the gory details. This fix was merged to the python-qpid-0.18 release.

With Petr's help, I was able to reproduce this problem quite easily. If I back out the poll() change and use select() instead, the problem goes away. However, once I'm using select() instead of poll(), if I disable monkey patching of select, the exact same failure occurs. In fact, the stack trace appears to be the same, with the exception of hanging in select() instead of poll().

Could this problem be caused by monkey patching _not_ handling poll correctly?

Confirmed - poll() is not supported by eventlet, only select() is. I've posted a JIRA upstream at qpid proposing a fix for this in the python client.

Created attachment 900480
Proposed patch
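The select() limitation that motivated the move to poll() (descriptors with a value >= 1024, per the QPID-5588 comment above) can be probed with a small script. This is a hedged sketch, not code from the bug or the patch: `probe_high_fd` and the target descriptor number 1100 are hypothetical choices, and the probe only runs when the process's open-file limit allows such a descriptor.

```python
import os
import resource
import select
import socket


def probe_high_fd(target_fd=1100):
    """Check whether select() and poll() accept a descriptor numbered `target_fd`.

    Returns None when the probe cannot run (limit too low, or the descriptor
    number is already in use); otherwise returns (select_ok, poll_ok).
    """
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit <= target_fd:
        return None
    try:
        os.fstat(target_fd)
        return None  # descriptor number already taken; do not clobber it
    except OSError:
        pass

    left, right = socket.socketpair()
    os.dup2(right.fileno(), target_fd)  # duplicate the socket onto a high FD
    try:
        try:
            select.select([target_fd], [], [], 0)
            select_ok = True
        except ValueError:
            # CPython rejects descriptors that do not fit in an fd_set.
            select_ok = False
        poller = select.poll()
        poller.register(target_fd, select.POLLIN)
        poller.poll(0)
        poll_ok = True
    finally:
        os.close(target_fd)
        left.close()
        right.close()
    return select_ok, poll_ok


if __name__ == "__main__":
    print(probe_high_fd())
```

On a platform where FD_SETSIZE is 1024, one would expect select() to be rejected and poll() to succeed, which is consistent with why the client preferred poll() before the eventlet interaction was found.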
The patch works fine; packstack deployed correctly.

This issue is fixed with python-qpid-0.18-12.el7:

    # neutron net-list --format=csv --column=id --quote=none
    id
    0339e1fa-b012-466d-8b40-d87aac4a02f0
    f2e01faf-53e1-494b-96dc-a8267835ab6a

Before the fix, communication between openstack and qpidd wasn't working at all; only the following errors were displayed in the qpidd log:

    [System] error Connection <host_ip>:5672-<host_ip>:57023 No protocol received closing

With the fix, restarting openstack is quick and all the nodes are created successfully on the qpidd side. Once the packstack installation has been retested and the package is in the errata, this issue may be moved to verified.

The packstack installation was successful without any issues; see bug 1097306, comment 22 for details. This issue has been fixed. Verified on rhel7 (x86_64).

Packages under test:

    python-qpid-0.18-12.el7.noarch
    openstack-packstack-2014.1.1-0.19.dev1102.el7ost

-> VERIFIED

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0832.html
Created attachment 899956
gdb python backtrace from stuck neutron-server

Description of problem:
When deploying openstack with packstack on rhos 5, neutron-server gets stuck trying to communicate with qpid.

Version-Release number of selected component (if applicable):

    python-qpid-0.18-10.el7.noarch
    python-neutron-2014.1-22.el7ost.noarch
    python-neutronclient-2.3.4-1.el7ost.noarch
    openstack-neutron-2014.1-22.el7ost.noarch
    openstack-neutron-openvswitch-2014.1-22.el7ost.noarch

How reproducible:
always

Steps to Reproduce:
1. Install rhos and run packstack --allinone --amqp-server=qpid
2. The run fails with the message:

    ERROR : Error appeared during Puppet run: <host_ip>_provision.pp
    Error: Could not prefetch neutron_network provider 'neutron': Execution of '/usr/bin/neutron net-list --format=csv --column=id --quote=none' returned 1: Connection to neutron failed: Maximum attempts reached
    You will find full trace in log /var/tmp/packstack/20140527-114456-DjC6dN/manifests/<host_ip>_provision.pp.log

Actual results:
neutron is stuck in poll

Expected results:
After updating python-qpid to 0.24, everything works fine; see bug 1097306.

Additional info: