Bug 523712 - SSL.Connection.accept() blocking on a non-blocking Connection
Summary: SSL.Connection.accept() blocking on a non-blocking Connection
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: m2crypto
Version: 5.4
Hardware: All
OS: Linux
low
medium
Target Milestone: rc
: ---
Assignee: Miloslav Trmač
QA Contact: BaseOS QE
URL:
Whiteboard:
Depends On:
Blocks: 610884 622934
TreeView+ depends on / blocked
 
Reported: 2009-09-16 14:21 UTC by Dan Kenigsberg
Modified: 2010-11-09 13:23 UTC (History)
2 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 610884 (view as bug list)
Environment:
Last Closed: 2010-07-02 13:58:17 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
fullest bt that I could get (271.46 KB, text/plain)
2010-07-01 08:03 UTC, Dan Kenigsberg
no flags Details

Description Dan Kenigsberg 2009-09-16 14:21:52 UTC
Description of problem:
we noticed once that python threading was blocked while m2crypto was read()ing data that never arrived. this is similar to bug 472690, but far rarer.

Version-Release number of selected component (if applicable):
m2crypto-0.16-6.el5.6

How reproducible:
it happened once.

Additional info:

(gdb) info threads
  27 Thread 0x423a5940 (LWP 31688)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
  26 Thread 0x42da6940 (LWP 31689)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
  25 Thread 0x437a7940 (LWP 31690)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
  24 Thread 0x441a8940 (LWP 31691)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
  23 Thread 0x44ba9940 (LWP 31692)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
* 22 Thread 0x455aa940 (LWP 31693)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
  21 Thread 0x45fab940 (LWP 31694)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
  20 Thread 0x469ac940 (LWP 31695)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
  19 Thread 0x473ad940 (LWP 31696)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
  18 Thread 0x47dae940 (LWP 31697)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
  17 Thread 0x487af940 (LWP 31698)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
  16 Thread 0x491b0940 (LWP 31699)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
  15 Thread 0x49bb1940 (LWP 31700)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
  14 Thread 0x4a5b2940 (LWP 31701)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
  13 Thread 0x4afb3940 (LWP 31702)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
  12 Thread 0x4b9b4940 (LWP 31703)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
  11 Thread 0x4c3b5940 (LWP 31704)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
  10 Thread 0x4cdb6940 (LWP 31705)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
  9 Thread 0x4d7b7940 (LWP 31706)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
  8 Thread 0x4e1b8940 (LWP 31707)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
  7 Thread 0x4ffbb940 (LWP 31788)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
  6 Thread 0x509bc940 (LWP 31789)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
  5 Thread 0x4158b940 (LWP 31791)  0x00000032ee80d5cb in read () from /lib64/libpthread.so.0
  4 Thread 0x513bd940 (LWP 635)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
  3 Thread 0x4f5ba940 (LWP 844)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
  2 Thread 0x40aca940 (LWP 906)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
  1 Thread 0x2b0edb527f90 (LWP 31676)  0x00000032ee0cced2 in select () from /lib64/libc.so.6
(gdb) thread 5
[Switching to thread 5 (Thread 0x4158b940 (LWP 31791))]#0  0x00000032ee80d5cb in read () from /lib64/libpthread.so.0

(gdb) bt
#0  0x00000032ee80d5cb in read () from /lib64/libpthread.so.0
#1  0x00000032f0878242 in sock_read (b=<value optimized out>, out=<value optimized out>, outl=<value optimized out>)
    at /usr/include/bits/unistd.h:35
#2  0x00000032f087677f in BIO_read (b=<value optimized out>, out=<value optimized out>, outl=<value optimized out>) at bio_lib.c:212
#3  0x00000032f70209dd in ssl3_read_n (s=<value optimized out>, n=<value optimized out>, max=<value optimized out>,
    extend=<value optimized out>) at s3_pkt.c:198
#4  0x00000032f7020e9d in ssl3_read_bytes (s=<value optimized out>, type=<value optimized out>, buf=<value optimized out>,
    len=<value optimized out>, peek=<value optimized out>) at s3_pkt.c:266
#5  0x00000032f7021c80 in ssl3_get_message (s=<value optimized out>, st1=<value optimized out>, stn=<value optimized out>,
    mt=<value optimized out>, max=<value optimized out>, ok=<value optimized out>) at s3_both.c:394
#6  0x00000032f701825a in ssl3_check_client_hello (s=<value optimized out>) at s3_srvr.c:636
#7  0x00000032f701a425 in ssl3_accept (s=<value optimized out>) at s3_srvr.c:431
#8  0x00002b0ee1071317 in ssl_accept (ssl=0x2aaaac016970, timeout=-1) at SWIG/_m2crypto_wrap.c:5148
#9  0x00002b0ee1082bf7 in _wrap_ssl_accept (self=<value optimized out>, args=<value optimized out>) at SWIG/_m2crypto_wrap.c:14756
#10 0x00000032ef0950ba in PyEval_EvalFrame (f=<value optimized out>) at Python/ceval.c:3563
#11 0x00000032ef094b66 in PyEval_EvalFrame (f=<value optimized out>) at Python/ceval.c:3645
#12 0x00000032ef094b66 in PyEval_EvalFrame (f=<value optimized out>) at Python/ceval.c:3645
#13 0x00000032ef094b66 in PyEval_EvalFrame (f=<value optimized out>) at Python/ceval.c:3645
#14 0x00000032ef094b66 in PyEval_EvalFrame (f=<value optimized out>) at Python/ceval.c:3645
#15 0x00000032ef094b66 in PyEval_EvalFrame (f=<value optimized out>) at Python/ceval.c:3645
#16 0x00000032ef095fe5 in PyEval_EvalCodeEx (co=<value optimized out>, globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=<value optimized out>, kws=<value optimized out>, kwcount=<value optimized out>,
    defs=<value optimized out>, defcount=<value optimized out>, closure=<value optimized out>) at Python/ceval.c:2736
#17 0x00000032ef04c367 in function_call (func=<value optimized out>, arg=<value optimized out>, kw=<value optimized out>)
    at Objects/funcobject.c:548
#18 0x00000032ef0360f0 in PyObject_Call (func=<value optimized out>, arg=<value optimized out>, kw=<value optimized out>)
    at Objects/abstract.c:1795
#19 0x00000032ef03c1ef in instancemethod_call (func=<value optimized out>, arg=<value optimized out>, kw=<value optimized out>)
    at Objects/classobject.c:2447
#20 0x00000032ef0360f0 in PyObject_Call (func=<value optimized out>, arg=<value optimized out>, kw=<value optimized out>)
    at Objects/abstract.c:1795
#21 0x00000032ef08fc3d in PyEval_CallObjectWithKeywords (func=<value optimized out>, arg=<value optimized out>, kw=<value optimized out>)
    at Python/ceval.c:3430
#22 0x00000032ef0bbbbd in t_bootstrap (boot_raw=<value optimized out>) at Modules/threadmodule.c:434
#23 0x00000032ee8064a7 in start_thread () from /lib64/libpthread.so.0
#24 0x00000032ee0d3c2d in clone () from /lib64/libc.so.6

(a huge core of the python process is available upon request)

Comment 1 Miloslav Trmač 2009-10-09 13:17:48 UTC
Thanks for your report.

AFAICS from the backtrace alone, the thread is waiting until data come from a socket.  Is there any specific reason to believe there is a bug in m2crypto?

(As #507903 demonstrates, almost anything could be caused by a bug in M2Crypto - but this backtrace is roughly what I'd expect from an ordinary server that has currently nothing to do.)

Having a backtrace with most local variables would be ideal; knowing the parameters of the interrupted select() and read() system calls, as well as possible information about the referenced files (e.g. is it a local file, pipe, or a socket?) would probably be helpful as well.

Comment 2 Dan Kenigsberg 2009-10-09 13:55:35 UTC
while waiting for data (that never came), the thread blocked other threads forever. This reminded me of bug 472690 of m2crypto.

However, unlike back then, we saw the issue only once. I don't have answers to your questions but maybe the core I've set you reveals a few.

Comment 3 Miloslav Trmač 2009-10-09 14:39:01 UTC
The other threads are blocked in select(); if they were blocked on a lock, the backtrace should be showing futex().

Can you try to obtain a complete backtrace from the core, please?  I have tried on a RHEL5 machine, and it didn't show anything useful.

Comment 4 Dan Kenigsberg 2010-07-01 08:03:40 UTC
Created attachment 428196 [details]
fullest bt that I could get

This bug is rather rare, but it happened again. It seems that Thread 1 is stuck in a never-ending ssl_accept(). Until accept returns, my process cannot accept any new connection, obviously.

I'd appreciate if you find what blocks there for hours, and make it fail within a reasonable timeout.

Comment 5 Miloslav Trmač 2010-07-01 19:30:27 UTC
It seems the created sockets are blocking, in which case they correctly do what they should - block.  You can set a timeout for all operations on a socket using SSL.Connection.settimeout(); be prepared to handle the resulting SSLTimeoutError exceptions.

Tomas, could you look at the backtrace, please?  I originally suspected the client just isn't sending anything - actually the server is already in SSL3_ST_SR_FINISHED_A, and blocking (per danken for hours) waiting for a new TLS record.  Apart from figuring out the cause, if you could suggest how to write a client that reproduces such behavior, that would be helpful as well for testing the timeout handling.

Comment 6 Tomas Mraz 2010-07-01 20:09:16 UTC
The client would have to send the ChangeCipherSpec message but then not send the Finished message and just wait instead. Unfortunately I think that writing such client would require calls to openssl internal functions.

I did not look at the problem in depth but it seems to me this is just an application coding error. I don't quite understand why blocked SSL_accept() in one thread should block other threads from processing their requests - especially the main server listening loop. The thread running the SSL_accept should be separate from the main server listening loop - the thread should be created immediately after the accept() syscall.

Comment 7 Dan Kenigsberg 2010-07-02 06:53:13 UTC
(In reply to comment #5)
> It seems the created sockets are blocking, in which case they correctly do what
> they should - block.  You can set a timeout for all operations on a socket
> using SSL.Connection.settimeout(); be prepared to handle the resulting
> SSLTimeoutError exceptions.

I am already using SSL.Connection.settimeout() on the listening socket. Could it be that the timeout is not "inherited" by by the client socket returned by socket.socket.accept()? 

(In reply to comment #6)
> 
> I did not look at the problem in depth but it seems to me this is just an
> application coding error. I don't quite understand why blocked SSL_accept() in
> one thread should block other threads from processing their requests -
> especially the main server listening loop. The thread running the SSL_accept
> should be separate from the main server listening loop - the thread should be
> created immediately after the accept() syscall.    

Please note that my code is almost identical to M2Crypto.SSL.Connection.accept(), where SSL_accept() is called by the same thread that calles accpet().

Comment 8 Tomas Mraz 2010-07-02 07:08:23 UTC
Then I'd say, this is a not-quite-correct implementation in M2Crypto either.

SSL_accept() might definitely block depending on the client's behavior.

Comment 9 Miloslav Trmač 2010-07-02 13:02:09 UTC
(In reply to comment #7)
> (In reply to comment #5)
> > It seems the created sockets are blocking, in which case they correctly do what
> > they should - block.  You can set a timeout for all operations on a socket
> > using SSL.Connection.settimeout(); be prepared to handle the resulting
> > SSLTimeoutError exceptions.
> 
> I am already using SSL.Connection.settimeout() on the listening socket. Could
> it be that the timeout is not "inherited" by by the client socket returned by
> socket.socket.accept()? 
Argh, you are right.  I was so focused on finding a specific reproducer that I missed this possibility.

Comment 11 Miloslav Trmač 2010-07-02 13:58:17 UTC
On second thought, changing this in m2crypto is not a good idea.

Looking at http://bugs.python.org/issue7995 , the Python socket module (for which SSL.Connection() is supposed to be a plug-in replacement where possible) is sets all sockets returned from socket.accept() as blocking.  (Indeed, if the timeout were inherited by TCP sockets managed by the socket module, the timeout would have already been inherited by the current m2crypto code.)

Second, making a socket non-blocking inside SSL.Connection.accept() wouldn't be useful.  If the non-blocking operation times out, the user should later repeat the operation - but SSL.Connection.accept() would call socket.accept() again, getting a _new_ client socket instead of continuing negotiation with the existing client.

So, a non-blocking server needs to use a manually written replacement for SSL.Connection.accept(), and vdsm already does this.  That is also the right place to make the client socket non-blocking.
        sock, addr = self.socket.accept()
        ssl = SSL.Connection(self.ctx, sock)
add
        ssl.settimeout(...)

Comment 12 Dan Kenigsberg 2010-07-02 15:25:04 UTC
I'll have to follow your suggestion and ssl.settimeout() before ssl.accept_ssl() as I reimplement SSL.Connection.accept() already.

However, I think that m2crypto should do that too. socket.socket.accept() on a non-blocking socket never blocks. SSL.Connection() which is supposed to be a plug-in replacement ;-) should have the same semantic. It should simply make sure that it always returns a non-blocking.
To me it seems like the reasonable behavior: accept() timed out, and the user does not really care if it happened before or after tcp handshake. In the latter case, the intermediate client socket goes to waste; too bad, just cut your losses and close it.

Comment 13 Miloslav Trmač 2010-07-07 21:02:27 UTC
That's not how TCP sockets work.

With TCP sockets, the handshake occurs in the background, and accepts() only waits until some handshake finishes and creates a socket for the connection.  A non-blocking accept(), or accept() interrupted with a signal, does not interrupt any handshakes in the background.

A program can therefore use poll()/select() with non-blocking accept() calls to reliably handle all incoming requests (within the limits of the queue length set by listen()).

Similarly, in the Python API non-blocking sockets and/or timeouts are "restartable" - if a read() or accept() times out, you can always try again later.


openssl does not work that way: the socket must be first created using regular accept(), an openssl data structure must be created for it, and only then is SSL_accept() used to perform the SSL negotiation.  This negotiation occurs in the foreground, with the calling thread blocking for I/O.  The calling process must therefore separately integrate accept() and SSL_accept() into their poll()/select() loop; there is no way to make the change transparent.

For true non-blocking sockets (settimeout(0)) such a program change is absolutely necessary: try adding settimeout(0) to your modified implementation - no connection will ever be accepted.

It would be possible to support non-zero timeouts within Connection.accept, dropping the connection - but that would be completely inconsistent with the other cases.


Note You need to log in before you can comment on or make changes to this bug.