Description of problem:

We're seeing hangs under heavy load when releasing a connection in ConnectionContext. Specifically, we're left waiting for a response to a channel close that never arrives. When it gets into this state, the conductor process spins on epoll_wait without doing any work between calls, so the root cause could be that eventlet isn't scheduling us correctly, or that the response is never sent. Or something else; we genuinely have no idea.

Unfortunately I can't now find a clean stack trace showing the hang itself. However, here's a stack trace from a timeout generated by the attached patch:

2016-01-28 13:31:30.945 372 WARNING oslo.messaging._drivers.amqp [-] Timed out waiting for connection to reset
2016-01-28 13:31:30.945 372 ERROR oslo.messaging._drivers.amqp [-] Fail to reset the connection, drop it
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp Traceback (most recent call last):
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqp.py", line 141, in _done
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp     self.connection.reset()
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 684, in reset
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp     self.channel.close()
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 179, in close
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp     (20, 41),  # Channel.close_ok
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 67, in wait
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp     self.channel_id, allowed_methods)
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 240, in _wait_method
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp     self.method_reader.read_method()
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 186, in read_method
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp     self._next_method()
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 107, in _next_method
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp     frame_type, channel, payload = read_frame()
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 154, in read_frame
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp     frame_header = read(7, True)
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 277, in _read
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp     s = recv(n - len(rbuf))
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp   File "/usr/lib/python2.7/site-packages/eventlet/greenio.py", line 264, in recv
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp     timeout_exc=socket.timeout("timed out"))
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp   File "/usr/lib/python2.7/site-packages/eventlet/hubs/__init__.py", line 155, in trampoline
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp     return hub.switch()
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp   File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 187, in switch
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp     return self.greenlet.switch()
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp Timeout: 60 seconds

We got the original stack traces from guru meditations of wedged nova conductor processes.

How reproducible:
Occasional under very heavy load.
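For context, the blocked call chain in the traceback boils down to the pattern below. This is a minimal, hypothetical illustration using py-amqp directly (assuming the 2.x-style API with an explicit connect(), and a placeholder broker address); it is not code from the bug:

    import amqp  # py-amqp, the client library in the traceback above

    # Hypothetical broker address, purely for illustration.
    conn = amqp.Connection(host='localhost:5672',
                           userid='guest', password='guest')
    conn.connect()
    channel = conn.channel()

    # Channel.close() sends Channel.Close and then blocks in a socket read
    # until the broker replies with Channel.CloseOk (method id (20, 41)).
    # If that reply is never delivered, or eventlet never schedules the
    # reading greenthread, this call never returns -- the hang we see.
    channel.close()

    conn.close()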
Created attachment 1119319 [details]
Workaround patch

This patch doesn't resolve the underlying issue of why the connection reset hangs. Instead, it wraps the reset in a timeout and throws the connection away after 60 seconds of trying to reset, as sketched below.
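In outline, the workaround does something like the following. This is only a sketch: the method and attribute names are guesses at oslo.messaging's ConnectionContext internals, and the real change is the attached patch:

    import logging

    import eventlet

    LOG = logging.getLogger(__name__)

    RESET_TIMEOUT = 60  # seconds, matching the timeout in the patch


    def _done(self):
        """Release the pooled connection, dropping it if the reset hangs."""
        if self.connection:
            try:
                # Bound the reset: eventlet.Timeout raises in this
                # greenthread if reset() hasn't returned in time.
                with eventlet.Timeout(RESET_TIMEOUT):
                    self.connection.reset()
            except eventlet.Timeout:
                LOG.warning('Timed out waiting for connection to reset')
                LOG.error('Fail to reset the connection, drop it')
                try:
                    self.connection.close()
                except Exception:
                    pass  # the connection is wedged; discard it regardless
            else:
                # Reset succeeded; the connection is safe to reuse.
                self.pool.put(self.connection)
            self.connection = None

The key point is that the wedged connection is never returned to the pool, so a subsequent RPC gets a fresh connection instead of inheriting the hang.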
I've pushed this to rhos-5.0-patches. It's just waiting to be built.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0167.html