| Summary: | Reset connection occasionally hangs | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Matthew Booth <mbooth> |
| Component: | python-oslo-messaging | Assignee: | Flavio Percoco <fpercoco> |
| Status: | CLOSED ERRATA | QA Contact: | Udi Shkalim <ushkalim> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | high | | |
| Version: | 5.0 (RHEL 7) | CC: | apevec, dmaley, lhh, oblaut, yeylon |
| Target Milestone: | async | Keywords: | ZStream |
| Target Release: | 5.0 (RHEL 7) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | python-oslo-messaging-1.3.1-7.el7ost | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| Clones: | 1303903 (view as bug list) | Environment: | |
| Last Closed: | 2016-02-11 13:44:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | | | |
| Bug Blocks: | 1303903 | | |
| Attachments: | Workaround patch (attachment 1119319) | | |
Created attachment 1119319 [details]
Workaround patch
This patch doesn't resolve the underlying issue of why resetting the connection hangs. Instead, it wraps the reset in a timeout and discards the connection after 60 seconds of trying to reset.
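The patch itself is attached rather than inlined; based on the description above, its approach is to bound the reset with an `eventlet.Timeout`. A minimal sketch of that pattern, not the actual patch (the `safe_release` name, the `pool` argument, and the cleanup details are illustrative assumptions; only the 60-second timeout and the log messages come from this report):

```python
import logging

import eventlet

LOG = logging.getLogger(__name__)

RESET_TIMEOUT = 60  # seconds, per the workaround description above


def safe_release(connection, pool):
    """Reset a connection before returning it to the pool, but give up
    and drop the connection if the reset itself hangs.

    Hypothetical sketch of the workaround, not the attached patch.
    """
    try:
        # eventlet.Timeout raises itself in this greenthread if the
        # block doesn't finish within RESET_TIMEOUT seconds.
        with eventlet.Timeout(RESET_TIMEOUT):
            connection.reset()
    except eventlet.Timeout:
        LOG.warning('Timed out waiting for connection to reset')
        LOG.error('Fail to reset the connection, drop it')
        connection.close()  # drop the wedged connection entirely
        return
    pool.put(connection)  # reset succeeded; connection is reusable
```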
I've pushed this to rhos-5.0-patches. It's just waiting to be built.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0167.html
Description of problem:

We're seeing hangs under heavy load when releasing a connection in ConnectionContext. Specifically, we're left waiting for a response to a channel close which never arrives. When a process gets into this state, we see it spinning on epoll_wait without doing any work between calls, so the root cause could be that eventlet isn't scheduling us correctly, or that the response is never sent. Or something else entirely; we genuinely have no idea.

Unfortunately I can't now find a clean stack trace containing this. However, here's a stack trace from a timeout generated by the attached patch:

```
2016-01-28 13:31:30.945 372 WARNING oslo.messaging._drivers.amqp [-] Timed out waiting for connection to reset
2016-01-28 13:31:30.945 372 ERROR oslo.messaging._drivers.amqp [-] Fail to reset the connection, drop it
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp Traceback (most recent call last):
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqp.py", line 141, in _done
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp     self.connection.reset()
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 684, in reset
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp     self.channel.close()
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 179, in close
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp     (20, 41),  # Channel.close_ok
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 67, in wait
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp     self.channel_id, allowed_methods)
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 240, in _wait_method
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp     self.method_reader.read_method()
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 186, in read_method
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp     self._next_method()
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 107, in _next_method
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp     frame_type, channel, payload = read_frame()
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 154, in read_frame
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp     frame_header = read(7, True)
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 277, in _read
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp     s = recv(n - len(rbuf))
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp   File "/usr/lib/python2.7/site-packages/eventlet/greenio.py", line 264, in recv
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp     timeout_exc=socket.timeout("timed out"))
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp   File "/usr/lib/python2.7/site-packages/eventlet/hubs/__init__.py", line 155, in trampoline
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp     return hub.switch()
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp   File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 187, in switch
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp     return self.greenlet.switch()
2016-01-28 13:31:30.945 372 TRACE oslo.messaging._drivers.amqp Timeout: 60 seconds
```

We got the original stack traces from guru meditations of wedged nova-conductor processes.

How reproducible:

Occasional under very heavy load.
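For what it's worth, the final `Timeout: 60 seconds` in the traceback is the workaround's eventlet timer firing inside the blocked `recv()`: the greenthread is parked in `hub.switch()` waiting for a `Channel.close_ok` frame that never arrives, and the timer is the only thing that ever wakes it. A tiny self-contained demonstration of that mechanism, unrelated to the actual AMQP code (the address and the 5-second timeout are arbitrary illustrations):

```python
# Demonstration only: a greenthread blocked in recv() stays parked in
# hub.switch() until an eventlet.Timeout fires, mirroring the hang and
# the workaround's rescue in the traceback above.
import eventlet

eventlet.monkey_patch()  # green sockets, as in nova-conductor

import socket

# A peer that accepts but never sends anything: recv() will block
# forever, like waiting for a close_ok that is never delivered.
listener = socket.socket()
listener.bind(('127.0.0.1', 0))
listener.listen(1)

client = socket.socket()
client.connect(listener.getsockname())
server_side, _ = listener.accept()

try:
    with eventlet.Timeout(5):
        client.recv(7)  # parks this greenthread in hub.switch()
except eventlet.Timeout:
    print('Timed out waiting for data that never arrives')
```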