After an iptables-related mishap, beaker-provision was found stuck polling for queued commands: 2013-05-14 19:26:01 [DEBUG ] { 1622} bkr.labcontroller.provision:117 Finished handling command 8752419 2013-05-14 19:26:17 [DEBUG ] { 1622} bkr.labcontroller.provision:47 Polling for queued commands 2013-05-14 19:26:44 [DEBUG ] { 1622} bkr.labcontroller.provision:47 Polling for queued commands 2013-05-14 19:27:09 [DEBUG ] { 1622} bkr.labcontroller.provision:47 Polling for queued commands (END) (The current local time is 23:17:43.) We already added a timeout for HubProxy in ProxyHelper to cover this exact scenario but evidently it is not working correctly.
I suspect the problem is that we are setting a read timeout but not a connect timeout. I found beaker-watchdog on of our devel LCs stuck for 4 days waiting for connect in an XMLRPC request. (gdb) py-bt #9 <built-in method do_handshake of ssl.SSLContext object at remote 0x1b9b4d0> #11 file '/usr/lib64/python2.6/ssl.py', in 'do_handshake' #14 file '/usr/lib64/python2.6/ssl.py', in '__init__' #25 file '/usr/lib64/python2.6/ssl.py', in 'wrap_socket' #29 file '/usr/lib64/python2.6/httplib.py', in 'connect' #37 file '/usr/lib/python2.6/site-packages/bkr/common/xmlrpc.py', in 'connect' #40 file '/usr/lib64/python2.6/httplib.py', in 'send' #43 file '/usr/lib64/python2.6/httplib.py', in '_send_output' #46 file '/usr/lib64/python2.6/httplib.py', in 'endheaders' #49 file '/usr/lib64/python2.6/xmlrpclib.py', in 'send_content' #52 file '/usr/lib/python2.6/site-packages/bkr/common/xmlrpc.py', in '_request' #59 file '/usr/lib/python2.6/site-packages/bkr/labcontroller/proxy.py', in 'request' #63 file '/usr/lib64/python2.6/xmlrpclib.py', in '__request' #66 file '/usr/lib64/python2.6/xmlrpclib.py', in '__call__' #76 file '/usr/lib/python2.6/site-packages/bkr/common/hub.py', in '_login' #80 file '/usr/lib/python2.6/site-packages/bkr/labcontroller/watchdog.py', in 'main_loop' #83 file '/usr/lib/python2.6/site-packages/bkr/labcontroller/watchdog.py', in 'main' #86 file '/usr/bin/beaker-watchdog', in '<module>'
Matt and I spent some time reproducing both the read and connect timeouts in a dev environment. We saw that read timeouts are enforced as expected. Hacking the LC source to set timeout=5 gives exceptions like this 35 seconds apart (5 second timeout + 30 seconds sleep in retry_request_decorator): Nov 19 15:58:56 lab beaker-watchdog[17943]: bkr.labcontroller.proxy WARNING XML-RPC connection to beaker.dcallagh failed: (4 retries left) Nov 19 15:58:56 lab beaker-watchdog[17943]: Traceback (most recent call last): Nov 19 15:58:56 lab beaker-watchdog[17943]: File "/usr/lib/python2.6/site-packages/bkr/labcontroller/proxy.py", line 63, in request Nov 19 15:58:56 lab beaker-watchdog[17943]: result = transport_class.request(self, *args, **kwargs) Nov 19 15:58:56 lab beaker-watchdog[17943]: File "/usr/lib/python2.6/site-packages/bkr/common/xmlrpc.py", line 395, in _request Nov 19 15:58:56 lab beaker-watchdog[17943]: errcode, errmsg, headers = h.getreply() Nov 19 15:58:56 lab beaker-watchdog[17943]: File "/usr/lib64/python2.6/httplib.py", line 1064, in getreply Nov 19 15:58:56 lab beaker-watchdog[17943]: response = self._conn.getresponse() Nov 19 15:58:56 lab beaker-watchdog[17943]: File "/usr/lib64/python2.6/httplib.py", line 990, in getresponse Nov 19 15:58:56 lab beaker-watchdog[17943]: response.begin() Nov 19 15:58:56 lab beaker-watchdog[17943]: File "/usr/lib64/python2.6/httplib.py", line 391, in begin Nov 19 15:58:56 lab beaker-watchdog[17943]: version, status, reason = self._read_status() Nov 19 15:58:56 lab beaker-watchdog[17943]: File "/usr/lib64/python2.6/httplib.py", line 349, in _read_status Nov 19 15:58:56 lab beaker-watchdog[17943]: line = self.fp.readline() Nov 19 15:58:56 lab beaker-watchdog[17943]: File "/usr/lib64/python2.6/socket.py", line 433, in readline Nov 19 15:58:56 lab beaker-watchdog[17943]: data = recv(1) Nov 19 15:58:56 lab beaker-watchdog[17943]: File "/usr/lib64/python2.6/ssl.py", line 215, in recv Nov 19 15:58:56 lab beaker-watchdog[17943]: return self.read(buflen) Nov 19 15:58:56 lab beaker-watchdog[17943]: File "/usr/lib64/python2.6/ssl.py", line 136, in read Nov 19 15:58:56 lab beaker-watchdog[17943]: return self._sslobj.read(len) Nov 19 15:58:56 lab beaker-watchdog[17943]: SSLError: The read operation timed out We also saw that connection timeouts are enforced. The timeout period does not seem right however -- instead of the expected 35 seconds apart, we were seeing the exceptions 133 seconds(?) apart: Nov 19 15:39:01 lab beaker-watchdog[17876]: bkr.labcontroller.proxy WARNING XML-RPC connection to beaker.dcallagh.beakerdevs.lab.eng.bne.redhat.com failed: Connection timed out (3 retries left) Nov 19 15:39:01 lab beaker-watchdog[17876]: Traceback (most recent call last): Nov 19 15:39:01 lab beaker-watchdog[17876]: File "/usr/lib/python2.6/site-packages/bkr/labcontroller/proxy.py", line 63, in request Nov 19 15:39:01 lab beaker-watchdog[17876]: result = transport_class.request(self, *args, **kwargs) Nov 19 15:39:01 lab beaker-watchdog[17876]: File "/usr/lib/python2.6/site-packages/bkr/common/xmlrpc.py", line 394, in _request Nov 19 15:39:01 lab beaker-watchdog[17876]: self.send_content(h, request_body) Nov 19 15:39:01 lab beaker-watchdog[17876]: File "/usr/lib64/python2.6/xmlrpclib.py", line 1349, in send_content Nov 19 15:39:01 lab beaker-watchdog[17876]: connection.endheaders() Nov 19 15:39:01 lab beaker-watchdog[17876]: File "/usr/lib64/python2.6/httplib.py", line 908, in endheaders Nov 19 15:39:01 lab beaker-watchdog[17876]: self._send_output() Nov 19 15:39:02 lab beaker-watchdog[17876]: File "/usr/lib64/python2.6/httplib.py", line 780, in _send_output Nov 19 15:39:02 lab beaker-watchdog[17876]: self.send(msg) Nov 19 15:39:02 lab beaker-watchdog[17876]: File "/usr/lib64/python2.6/httplib.py", line 739, in send Nov 19 15:39:02 lab beaker-watchdog[17876]: self.connect() Nov 19 15:39:02 lab beaker-watchdog[17876]: File "/usr/lib/python2.6/site-packages/bkr/common/xmlrpc.py", line 140, in connect Nov 19 15:39:02 lab beaker-watchdog[17876]: httplib.HTTPSConnection.connect(self) Nov 19 15:39:02 lab beaker-watchdog[17876]: File "/usr/lib64/python2.6/httplib.py", line 1112, in connect Nov 19 15:39:02 lab beaker-watchdog[17876]: sock = socket.create_connection((self.host, self.port), self.timeout) Nov 19 15:39:02 lab beaker-watchdog[17876]: File "/usr/lib64/python2.6/socket.py", line 567, in create_connection Nov 19 15:39:02 lab beaker-watchdog[17876]: raise error, msg Nov 19 15:39:02 lab beaker-watchdog[17876]: error: [Errno 110] Connection timed out However, the 133 seconds is probably some other (insignificant) issue, since 133 seconds is still short enough that it couldn't have added up to a process stuck for 4 days. Out of curiosity I checked the gdb stack trace in each case, this is what they looked like: (gdb) py-bt #4 file '<string>', in 'connect' #8 file '/usr/lib64/python2.6/socket.py', in 'create_connection' #12 file '/usr/lib64/python2.6/httplib.py', in 'connect' #20 file '/usr/lib/python2.6/site-packages/bkr/common/xmlrpc.py', in 'connect' #23 file '/usr/lib64/python2.6/httplib.py', in 'send' #26 file '/usr/lib64/python2.6/httplib.py', in '_send_output' #29 file '/usr/lib64/python2.6/httplib.py', in 'endheaders' #32 file '/usr/lib64/python2.6/xmlrpclib.py', in 'send_content' #35 file '/usr/lib/python2.6/site-packages/bkr/common/xmlrpc.py', in '_request' #42 file '/usr/lib/python2.6/site-packages/bkr/labcontroller/proxy.py', in 'request' #46 file '/usr/lib64/python2.6/xmlrpclib.py', in '__request' #49 file '/usr/lib64/python2.6/xmlrpclib.py', in '__call__' #59 file '/usr/lib/python2.6/site-packages/bkr/common/hub.py', in '_login' #63 file '/usr/lib/python2.6/site-packages/bkr/labcontroller/watchdog.py', in 'main_loop' #66 file '/usr/lib/python2.6/site-packages/bkr/labcontroller/watchdog.py', in 'main' #69 file '/usr/bin/beaker-watchdog', in '<module>' (gdb) py-bt #2 <built-in method read of ssl.SSLContext object at remote 0x1404870> #4 file '/usr/lib64/python2.6/ssl.py', in 'read' #8 file '/usr/lib64/python2.6/ssl.py', in 'recv' #12 file '/usr/lib64/python2.6/socket.py', in 'readline' #16 file '/usr/lib64/python2.6/httplib.py', in '_read_status' #19 file '/usr/lib64/python2.6/httplib.py', in 'begin' #22 file '/usr/lib64/python2.6/httplib.py', in 'getresponse' #25 file '/usr/lib64/python2.6/httplib.py', in 'getreply' #28 file '/usr/lib/python2.6/site-packages/bkr/common/xmlrpc.py', in '_request' #35 file '/usr/lib/python2.6/site-packages/bkr/labcontroller/proxy.py', in 'request' #39 file '/usr/lib64/python2.6/xmlrpclib.py', in '__request' #42 file '/usr/lib64/python2.6/xmlrpclib.py', in '__call__' #52 file '/usr/lib/python2.6/site-packages/bkr/labcontroller/watchdog.py', in 'main_loop' #55 file '/usr/lib/python2.6/site-packages/bkr/labcontroller/watchdog.py', in 'main' #58 file '/usr/bin/beaker-watchdog', in '<module>' Interestingly, the stack trace in comment 1 is actually different -- the process was stuck in SSLContext.do_handshake. So this suggests there may be a problem with timeout enforcement in the Python ssl code.
The do_handshake stack trace sounds a bit like https://hg.python.org/cpython/rev/ce4916ca06dd/ That was fixed in Python 2.6.6 though, so it shouldn't be affecting current RHEL systems.
(In reply to Nick Coghlan from comment #3) Yep we found http://bugs.python.org/5103 but that is indeed fixed in RHEL6. The difference is that urllib2 seems to be passing down the timeout to ssl somewhere properly, but our code (and/or httplib) is not. Kinda of tempted to just swap all XMLRPC calls to use requests instead of xmlrpclib, to avoid this whole mess...
The issue is caused by not setting the default socket timeout. By default, the socket timeout is 0 and it will cause ssl.wrap_socket to hang in httplib.py line 1096 when the server has connection problems. Patch is on gerrit: http://gerrit.beaker-project.org/#/c/3498/ The default timeout value could be argued. Cheers, Matt Jia
And the issue only happens on python2.6 since python 2.7 is using xmlrpclib.Transport which the timeout is set when initializing. Cheers,Matt Jia
I have set the timeout before doing ssl handshake rather than using the default global socket timeout and verfied that it works on the rhel6.5. Cheers ,Matt Jia
Verify passed on beaker-lab-controller-19.2-0.git.20.3a02055.el6eng Steps 1. After install complete open 1 terminal and run 'nc -l 8080' 2. Open another terminal run 'beaker-watchdog -f' 3. After 120s timeout, the trackback contain blow error. self._sslobj.do_handshake() SSLError: _ssl.c:495: The handshake operation timed out Change status to verified.
Beaker 19.2 has been released.