Bug 963042 - HubProxy timeout setting in ProxyHelper not taking effect for connect timeouts
Summary: HubProxy timeout setting in ProxyHelper not taking effect for connect timeouts
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Beaker
Classification: Retired
Component: lab controller
Version: 0.12
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: 19.2
Assignee: matt jia
QA Contact: tools-bugs
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2013-05-15 03:22 UTC by Dan Callaghan
Modified: 2018-02-06 00:41 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-01-20 03:18:42 UTC
Embargoed:


Attachments (Terms of Use)

Description Dan Callaghan 2013-05-15 03:22:51 UTC
After an iptables-related mishap, beaker-provision was found stuck polling for queued commands:

2013-05-14 19:26:01 [DEBUG   ] { 1622} bkr.labcontroller.provision:117 Finished handling command 8752419
2013-05-14 19:26:17 [DEBUG   ] { 1622} bkr.labcontroller.provision:47 Polling for queued commands
2013-05-14 19:26:44 [DEBUG   ] { 1622} bkr.labcontroller.provision:47 Polling for queued commands
2013-05-14 19:27:09 [DEBUG   ] { 1622} bkr.labcontroller.provision:47 Polling for queued commands
(END)

(The current local time is 23:17:43.)

We already added a timeout for HubProxy in ProxyHelper to cover this exact scenario but evidently it is not working correctly.

Comment 1 Dan Callaghan 2014-11-03 05:29:14 UTC
I suspect the problem is that we are setting a read timeout but not a connect timeout.

I found beaker-watchdog on of our devel LCs stuck for 4 days waiting for connect in an XMLRPC request.

(gdb) py-bt
#9 <built-in method do_handshake of ssl.SSLContext object at remote 0x1b9b4d0>
#11 file '/usr/lib64/python2.6/ssl.py', in 'do_handshake'
#14 file '/usr/lib64/python2.6/ssl.py', in '__init__'
#25 file '/usr/lib64/python2.6/ssl.py', in 'wrap_socket'
#29 file '/usr/lib64/python2.6/httplib.py', in 'connect'
#37 file '/usr/lib/python2.6/site-packages/bkr/common/xmlrpc.py', in 'connect'
#40 file '/usr/lib64/python2.6/httplib.py', in 'send'
#43 file '/usr/lib64/python2.6/httplib.py', in '_send_output'
#46 file '/usr/lib64/python2.6/httplib.py', in 'endheaders'
#49 file '/usr/lib64/python2.6/xmlrpclib.py', in 'send_content'
#52 file '/usr/lib/python2.6/site-packages/bkr/common/xmlrpc.py', in '_request'
#59 file '/usr/lib/python2.6/site-packages/bkr/labcontroller/proxy.py', in 'request'
#63 file '/usr/lib64/python2.6/xmlrpclib.py', in '__request'
#66 file '/usr/lib64/python2.6/xmlrpclib.py', in '__call__'
#76 file '/usr/lib/python2.6/site-packages/bkr/common/hub.py', in '_login'
#80 file '/usr/lib/python2.6/site-packages/bkr/labcontroller/watchdog.py', in 'main_loop'
#83 file '/usr/lib/python2.6/site-packages/bkr/labcontroller/watchdog.py', in 'main'
#86 file '/usr/bin/beaker-watchdog', in '<module>'

Comment 2 Dan Callaghan 2014-11-19 06:04:36 UTC
Matt and I spent some time reproducing both the read and connect timeouts in a dev environment.

We saw that read timeouts are enforced as expected. Hacking the LC source to set timeout=5 gives exceptions like this 35 seconds apart (5 second timeout + 30 seconds sleep in retry_request_decorator):

Nov 19 15:58:56 lab beaker-watchdog[17943]: bkr.labcontroller.proxy WARNING XML-RPC connection to beaker.dcallagh failed:  (4 retries left)
Nov 19 15:58:56 lab beaker-watchdog[17943]:  Traceback (most recent call last):
Nov 19 15:58:56 lab beaker-watchdog[17943]:    File "/usr/lib/python2.6/site-packages/bkr/labcontroller/proxy.py", line 63, in request
Nov 19 15:58:56 lab beaker-watchdog[17943]:      result = transport_class.request(self, *args, **kwargs)
Nov 19 15:58:56 lab beaker-watchdog[17943]:    File "/usr/lib/python2.6/site-packages/bkr/common/xmlrpc.py", line 395, in _request
Nov 19 15:58:56 lab beaker-watchdog[17943]:      errcode, errmsg, headers = h.getreply()
Nov 19 15:58:56 lab beaker-watchdog[17943]:    File "/usr/lib64/python2.6/httplib.py", line 1064, in getreply
Nov 19 15:58:56 lab beaker-watchdog[17943]:      response = self._conn.getresponse()
Nov 19 15:58:56 lab beaker-watchdog[17943]:    File "/usr/lib64/python2.6/httplib.py", line 990, in getresponse
Nov 19 15:58:56 lab beaker-watchdog[17943]:      response.begin()
Nov 19 15:58:56 lab beaker-watchdog[17943]:    File "/usr/lib64/python2.6/httplib.py", line 391, in begin
Nov 19 15:58:56 lab beaker-watchdog[17943]:      version, status, reason = self._read_status()
Nov 19 15:58:56 lab beaker-watchdog[17943]:    File "/usr/lib64/python2.6/httplib.py", line 349, in _read_status
Nov 19 15:58:56 lab beaker-watchdog[17943]:      line = self.fp.readline()
Nov 19 15:58:56 lab beaker-watchdog[17943]:    File "/usr/lib64/python2.6/socket.py", line 433, in readline
Nov 19 15:58:56 lab beaker-watchdog[17943]:      data = recv(1)
Nov 19 15:58:56 lab beaker-watchdog[17943]:    File "/usr/lib64/python2.6/ssl.py", line 215, in recv
Nov 19 15:58:56 lab beaker-watchdog[17943]:      return self.read(buflen)
Nov 19 15:58:56 lab beaker-watchdog[17943]:    File "/usr/lib64/python2.6/ssl.py", line 136, in read
Nov 19 15:58:56 lab beaker-watchdog[17943]:      return self._sslobj.read(len)
Nov 19 15:58:56 lab beaker-watchdog[17943]:  SSLError: The read operation timed out

We also saw that connection timeouts are enforced. The timeout period does not seem right however -- instead of the expected 35 seconds apart, we were seeing the exceptions 133 seconds(?) apart:

Nov 19 15:39:01 lab beaker-watchdog[17876]: bkr.labcontroller.proxy WARNING XML-RPC connection to beaker.dcallagh.beakerdevs.lab.eng.bne.redhat.com failed: Connection timed out (3 retries left)
Nov 19 15:39:01 lab beaker-watchdog[17876]:  Traceback (most recent call last):
Nov 19 15:39:01 lab beaker-watchdog[17876]:    File "/usr/lib/python2.6/site-packages/bkr/labcontroller/proxy.py", line 63, in request
Nov 19 15:39:01 lab beaker-watchdog[17876]:      result = transport_class.request(self, *args, **kwargs)
Nov 19 15:39:01 lab beaker-watchdog[17876]:    File "/usr/lib/python2.6/site-packages/bkr/common/xmlrpc.py", line 394, in _request
Nov 19 15:39:01 lab beaker-watchdog[17876]:      self.send_content(h, request_body)
Nov 19 15:39:01 lab beaker-watchdog[17876]:    File "/usr/lib64/python2.6/xmlrpclib.py", line 1349, in send_content
Nov 19 15:39:01 lab beaker-watchdog[17876]:      connection.endheaders()
Nov 19 15:39:01 lab beaker-watchdog[17876]:    File "/usr/lib64/python2.6/httplib.py", line 908, in endheaders
Nov 19 15:39:01 lab beaker-watchdog[17876]:      self._send_output()
Nov 19 15:39:02 lab beaker-watchdog[17876]:    File "/usr/lib64/python2.6/httplib.py", line 780, in _send_output
Nov 19 15:39:02 lab beaker-watchdog[17876]:      self.send(msg)
Nov 19 15:39:02 lab beaker-watchdog[17876]:    File "/usr/lib64/python2.6/httplib.py", line 739, in send
Nov 19 15:39:02 lab beaker-watchdog[17876]:      self.connect()
Nov 19 15:39:02 lab beaker-watchdog[17876]:    File "/usr/lib/python2.6/site-packages/bkr/common/xmlrpc.py", line 140, in connect
Nov 19 15:39:02 lab beaker-watchdog[17876]:      httplib.HTTPSConnection.connect(self)
Nov 19 15:39:02 lab beaker-watchdog[17876]:    File "/usr/lib64/python2.6/httplib.py", line 1112, in connect
Nov 19 15:39:02 lab beaker-watchdog[17876]:      sock = socket.create_connection((self.host, self.port), self.timeout)
Nov 19 15:39:02 lab beaker-watchdog[17876]:    File "/usr/lib64/python2.6/socket.py", line 567, in create_connection
Nov 19 15:39:02 lab beaker-watchdog[17876]:      raise error, msg
Nov 19 15:39:02 lab beaker-watchdog[17876]:  error: [Errno 110] Connection timed out

However, the 133 seconds is probably some other (insignificant) issue, since 133 seconds is still short enough that it couldn't have added up to a process stuck for 4 days.

Out of curiosity I checked the gdb stack trace in each case, this is what they looked like:

(gdb) py-bt
#4 file '<string>', in 'connect'
#8 file '/usr/lib64/python2.6/socket.py', in 'create_connection'
#12 file '/usr/lib64/python2.6/httplib.py', in 'connect'
#20 file '/usr/lib/python2.6/site-packages/bkr/common/xmlrpc.py', in 'connect'
#23 file '/usr/lib64/python2.6/httplib.py', in 'send'
#26 file '/usr/lib64/python2.6/httplib.py', in '_send_output'
#29 file '/usr/lib64/python2.6/httplib.py', in 'endheaders'
#32 file '/usr/lib64/python2.6/xmlrpclib.py', in 'send_content'
#35 file '/usr/lib/python2.6/site-packages/bkr/common/xmlrpc.py', in '_request'
#42 file '/usr/lib/python2.6/site-packages/bkr/labcontroller/proxy.py', in 'request'
#46 file '/usr/lib64/python2.6/xmlrpclib.py', in '__request'
#49 file '/usr/lib64/python2.6/xmlrpclib.py', in '__call__'
#59 file '/usr/lib/python2.6/site-packages/bkr/common/hub.py', in '_login'
#63 file '/usr/lib/python2.6/site-packages/bkr/labcontroller/watchdog.py', in 'main_loop'
#66 file '/usr/lib/python2.6/site-packages/bkr/labcontroller/watchdog.py', in 'main'
#69 file '/usr/bin/beaker-watchdog', in '<module>'

(gdb) py-bt
#2 <built-in method read of ssl.SSLContext object at remote 0x1404870>
#4 file '/usr/lib64/python2.6/ssl.py', in 'read'
#8 file '/usr/lib64/python2.6/ssl.py', in 'recv'
#12 file '/usr/lib64/python2.6/socket.py', in 'readline'
#16 file '/usr/lib64/python2.6/httplib.py', in '_read_status'
#19 file '/usr/lib64/python2.6/httplib.py', in 'begin'
#22 file '/usr/lib64/python2.6/httplib.py', in 'getresponse'
#25 file '/usr/lib64/python2.6/httplib.py', in 'getreply'
#28 file '/usr/lib/python2.6/site-packages/bkr/common/xmlrpc.py', in '_request'
#35 file '/usr/lib/python2.6/site-packages/bkr/labcontroller/proxy.py', in 'request'
#39 file '/usr/lib64/python2.6/xmlrpclib.py', in '__request'
#42 file '/usr/lib64/python2.6/xmlrpclib.py', in '__call__'
#52 file '/usr/lib/python2.6/site-packages/bkr/labcontroller/watchdog.py', in 'main_loop'
#55 file '/usr/lib/python2.6/site-packages/bkr/labcontroller/watchdog.py', in 'main'
#58 file '/usr/bin/beaker-watchdog', in '<module>'

Interestingly, the stack trace in comment 1 is actually different -- the process was stuck in SSLContext.do_handshake. So this suggests there may be a problem with timeout enforcement in the Python ssl code.

Comment 3 Nick Coghlan 2014-11-20 05:45:05 UTC
The do_handshake stack trace sounds a bit like https://hg.python.org/cpython/rev/ce4916ca06dd/

That was fixed in Python 2.6.6 though, so it shouldn't be affecting current RHEL systems.

Comment 4 Dan Callaghan 2014-11-20 05:53:57 UTC
(In reply to Nick Coghlan from comment #3)

Yep we found http://bugs.python.org/5103 but that is indeed fixed in RHEL6. The difference is that urllib2 seems to be passing down the timeout to ssl somewhere properly, but our code (and/or httplib) is not.

Kinda of tempted to just swap all XMLRPC calls to use requests instead of xmlrpclib, to avoid this whole mess...

Comment 5 matt jia 2014-11-21 06:54:20 UTC
The issue is caused by not setting the default socket timeout. By default, the socket timeout is 0 and it will cause ssl.wrap_socket to hang in httplib.py line 1096 when the server has connection problems.

Patch is on gerrit:

http://gerrit.beaker-project.org/#/c/3498/

The default timeout value could be argued.

Cheers, Matt Jia

Comment 6 matt jia 2014-11-21 06:58:20 UTC
And the issue only happens on python2.6 since python 2.7 is using xmlrpclib.Transport which the timeout is set when initializing.

Cheers,Matt Jia

Comment 7 matt jia 2014-11-25 00:19:52 UTC
I have set the timeout before doing ssl handshake rather than using the default global socket timeout and verfied that it works on the rhel6.5.

Cheers ,Matt Jia

Comment 10 wangdong 2014-12-22 09:35:17 UTC
Verify passed on beaker-lab-controller-19.2-0.git.20.3a02055.el6eng

Steps

1. After install complete open 1 terminal and run 'nc -l 8080'

2. Open another terminal run 'beaker-watchdog -f'

3. After 120s timeout, the trackback contain blow error.
    self._sslobj.do_handshake()
    SSLError: _ssl.c:495: The handshake operation timed out

Change status to verified.

Comment 11 Dan Callaghan 2015-01-20 03:18:42 UTC
Beaker 19.2 has been released.


Note You need to log in before you can comment on or make changes to this bug.