Bug 534176 - SIGSEGV within ssl_DefRecv() in multi-threaded libcurl based application
Summary: SIGSEGV within ssl_DefRecv() in multi-threaded libcurl based application
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: curl
Version: 11
Hardware: All
OS: Linux
medium
medium
Target Milestone: ---
Assignee: Kamil Dudka
QA Contact: Fedora Extras Quality Assurance
URL: http://permalink.gmane.org/gmane.comp...
Whiteboard:
Depends On: 534115 643565
Blocks:
TreeView+ depends on / blocked
 
Reported: 2009-11-10 20:09 UTC by Kamil Dudka
Modified: 2010-10-15 23:36 UTC (History)
6 users (show)

Fixed In Version: 7.19.7-2.fc12
Doc Type: Bug Fix
Doc Text:
Clone Of: 534115
Environment:
Last Closed: 2009-12-01 04:46:15 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)
attempt to lock/unlock ss->recvLock in SSL_ForceHandshake() (985 bytes, patch)
2009-11-10 23:29 UTC, Kamil Dudka
no flags Details | Diff
Patch (479 bytes, patch)
2009-11-12 00:29 UTC, Kevin Baughman
no flags Details | Diff

Description Kamil Dudka 2009-11-10 20:09:59 UTC
+++ This bug was initially created as a clone of Bug #534115 +++

Version-Release number of selected component (if applicable):
nss-3.12.4-3.fc11.i586

How reproducible:
occasionally, depends also on count of threads

Actual results:
SIGSEGV, backtrace available

Expected results:
no SIGSEGV

Additional info:
Formerly reported at https://bugzilla.redhat.com/504257#c7

--- Additional comment from kdudka on 2009-11-10 12:59:06 EDT ---

Here is a scratch build of NSS with the mentioned patch partially reverted. It should not trigger the invalid writes:

http://koji.fedoraproject.org/koji/taskinfo?taskID=1798840

Unfortunately it brings back some memory leaks fixed before. So it looks like we need to free the memory somewhere else. For now I am not sure if the invalid writes are actually related to the SEGFAULT or not.

--- Additional comment from curb_pks on 2009-11-10 14:11:17 EDT ---

I ran this build under valgrind, the invalid writes seem to be gone.  However I still saw the core dump and a valgrind invalid read error:

==10352== Thread 60:
==10352== Invalid read of size 4
==10352==    at 0x47F8BA1: ssl_DefRecv (ssldef.c:91)
==10352==    by 0x47F38C5: ssl3_GatherCompleteHandshake (ssl3gthr.c:90)
==10352==    by 0x47F64BA: ssl_GatherRecord1stHandshake (sslcon.c:1258)
==10352==    by 0x47FC6F4: ssl_Do1stHandshake (sslsecur.c:151)
==10352==    by 0x47FDDA6: SSL_ForceHandshake (sslsecur.c:407)
==10352==    by 0x47FDE76: SSL_ForceHandshakeWithTimeout (sslsecur.c:428)
==10352==    by 0x437BBC7: Curl_nss_connect (nss.c:1214)
==10352==    by 0x4372181: Curl_ssl_connect (sslgen.c:185)
==10352==    by 0x43514F8: Curl_http_connect (http.c:1792)
==10352==    by 0x43589AD: Curl_protocol_connect (url.c:3056)
==10352==    by 0x435DC7E: Curl_connect (url.c:4690)
==10352==    by 0x4366A03: Curl_perform (transfer.c:2481)
==10352==  Address 0x8 is not stack'd, malloc'd or (recently) free'd
==10352==
==10352== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==10352==  Access not within mapped region at address 0x8
==10352==    at 0x47F8BA1: ssl_DefRecv (ssldef.c:91)
==10352==    by 0x47F38C5: ssl3_GatherCompleteHandshake (ssl3gthr.c:90)
==10352==    by 0x47F64BA: ssl_GatherRecord1stHandshake (sslcon.c:1258)
==10352==    by 0x47FC6F4: ssl_Do1stHandshake (sslsecur.c:151)
==10352==    by 0x47FDDA6: SSL_ForceHandshake (sslsecur.c:407)
==10352==    by 0x47FDE76: SSL_ForceHandshakeWithTimeout (sslsecur.c:428)
==10352==    by 0x437BBC7: Curl_nss_connect (nss.c:1214)
==10352==    by 0x4372181: Curl_ssl_connect (sslgen.c:185)
==10352==    by 0x43514F8: Curl_http_connect (http.c:1792)
==10352==    by 0x43589AD: Curl_protocol_connect (url.c:3056)
==10352==    by 0x435DC7E: Curl_connect (url.c:4690)
==10352==    by 0x4366A03: Curl_perform (transfer.c:2481)

Comment 1 Kamil Dudka 2009-11-10 20:18:16 UTC
Kevin has confirmed me off-list he can see the same core dump with a hand built libcurl without the PEM support. It means this crash is unrelated to bug 534115 (invalid writes within the PEM module).

Comment 2 Kevin Baughman 2009-11-10 20:47:28 UTC
When the core dump occurs in ssl_DefRecv:

(gdb) p ss
$1 = (sslSocket *) 0x8326338
(gdb) p ss->fd
$2 = (PRFileDesc *) 0x0

When I run under the debugger before the core dump happens I see this happen:

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0xb3ffbb70 (LWP 2379)]
0x00e08422 in __kernel_vsyscall ()
Missing separate debuginfos, use: debuginfo-install sqlite-3.6.12-3.fc11.i586
(gdb) bt
#0  0x00e08422 in __kernel_vsyscall ()
#1  0x00a3d458 in send () from /lib/libpthread.so.0
#2  0x00ca3f43 in ?? () from /lib/libnspr4.so
#3  0x00882d6e in ssl_DefSend (ss=0x85b0d88, buf=0x85c09d0 "\25\3\1", len=23, flags=0) at ssldef.c:127
#4  0x00873824 in ssl3_SendRecord (ss=0x85b0d88, type=<value optimized out>, pIn=0xb3ffae4e "\1", nIn=2, flags=0) at ssl3con.c:2061
#5  0x00873bf1 in SSL3_SendAlert (ss=0x85b0d88, level=alert_warning, desc=close_notify) at ssl3con.c:2316
#6  0x00886bf2 in ssl_SecureClose (ss=0x85b0d88) at sslsecur.c:1039
#7  0x0088b287 in ssl_Close (fd=0xb68fbde0) at sslsock.c:1406
#8  0x00c89181 in PR_Close () from /lib/libnspr4.so
#9  0x002b7291 in Curl_nss_close (conn=0x84e71e8, sockindex=0) at nss.c:908
#10 0x002ae1e4 in Curl_ssl_close (conn=0x84e71e8, sockindex=0) at sslgen.c:368
#11 0x002964ab in Curl_disconnect (conn=0x84e71e8) at url.c:2414
#12 0x002966cb in Curl_done (connp=0xb3ffaff8, status=CURLE_SEND_ERROR, premature=false) at url.c:4897
#13 0x002a3617 in Curl_perform (data=0x845b030) at transfer.c:2625
#14 0x002a3fc3 in curl_easy_perform (curl=0x845b030) at easy.c:557

This stack calls:

ssl_DefClose

Which does this:

ss->fd = NULL;

Could this be causing the issue?

Comment 3 Kamil Dudka 2009-11-10 21:21:37 UTC
(In reply to comment #2)
> This stack calls:
> 
> ssl_DefClose
> 
> Which does this:
> 
> ss->fd = NULL;
> 
> Could this be causing the issue?  

No unless the already closed socket is later used for reading/writing...

Comment 4 Kamil Dudka 2009-11-10 21:40:51 UTC
ssl_Close() should lock the socket for reading/writing, thus wait for any unfinished I/O operation before closing it. See the big comment within that function...

It drives me to another idea. Which library are you using to create that threads in you application?

Comment 5 Kevin Baughman 2009-11-10 22:04:51 UTC
I'm using pthreads.

Comment 6 Kamil Dudka 2009-11-10 22:30:23 UTC
(In reply to comment #5)
> I'm using pthreads.  

This should be OK.

For now I am only reading the code, looking at the backtrace in comment #0. The "recv buf lock" should be held by ssl_GatherRecord1stHandshake(). This implies nobody closes the socket using ssl_Close() until the lock is released. Or am I wrong?

Comment 7 Kamil Dudka 2009-11-10 22:53:30 UTC
(In reply to comment #6)
It's actually not true. I've just mixed up ss->recvLock and ss->recvBufLock in the reasoning...

Comment 8 Kamil Dudka 2009-11-10 23:13:26 UTC
I stopped the debugger within ssl_DefRecv() on the way from SSL_ForceHandshake() and got this:

ss->recvLock->locked = 0
ss->recvBufLock->lock.locked = 1

Then I stopped it at the same place on the way from ssl_Recv() and got this:

ss->recvLock->locked = 1
ss->recvBufLock->lock.locked = 1

Is there actually anything else which can hold the ssl_Close() until the receive operation is completed?

It looks like ForceHandshake() is less thread-safe than other functions. Is it a bug or feature?

Comment 9 Kamil Dudka 2009-11-10 23:29:57 UTC
Created attachment 368969 [details]
attempt to lock/unlock ss->recvLock in SSL_ForceHandshake()

I am just build a scratch build with the patch applied for testing purposes:

http://koji.fedoraproject.org/koji/taskinfo?taskID=1799721

Comment 10 Kamil Dudka 2009-11-10 23:35:34 UTC
Kevin, please test the patch or scratch build and leave a note here if anything changes. I hope it will not just cause a deadlock instead ... I have actually nothing to test it with except the singly-threaded curl tool and gdb :-)

Comment 11 Kevin Baughman 2009-11-11 17:04:26 UTC
Thanks Kamil!!

I tried the patch, and the core dump went away.  But my code now seems to have a bunch of threads stuck in this stack:

Thread 18356 (Thread 0x910c3b70 (LWP 6750)):
#0  0x0041b422 in __kernel_vsyscall ()
#1  0x005f3e43 in __lll_lock_wait_private () from /lib/libc.so.6
#2  0x0057cb94 in _L_lock_9571 () from /lib/libc.so.6
#3  0x0057abf4 in malloc () from /lib/libc.so.6
#4  0x00278018 in PR_Malloc () from /lib/libnspr4.so
#5  0x002e6853 in PORT_Alloc_Util (bytes=18432) at secport.c:113
#6  0x00251b06 in sslBuffer_Grow (b=0xb9c392c, newLen=18432) at sslsecur.c:454
#7  0x00247a02 in ssl3_GatherData (flags=<value optimized out>, gs=<value optimized out>, ss=<value optimized out>) at ssl3gthr.c:149
#8  ssl3_GatherCompleteHandshake (flags=<value optimized out>, gs=<value optimized out>, ss=<value optimized out>) at ssl3gthr.c:195
#9  0x0024a4bb in ssl_GatherRecord1stHandshake (ss=0xb9c36b8) at sslcon.c:1258
#10 0x002506f5 in ssl_Do1stHandshake (ss=0xb9c36b8) at sslsecur.c:151
#11 0x00251df7 in SSL_ForceHandshake (fd=0xb75f3cb8) at sslsecur.c:410
#12 0x00251ee7 in SSL_ForceHandshakeWithTimeout (fd=0xb75f3cb8, timeout=30000) at sslsecur.c:434
#13 0x0014dbc8 in Curl_nss_connect () from /usr/lib/libcurl.so.4
#14 0x00144182 in Curl_ssl_connect () from /usr/lib/libcurl.so.4
#15 0x001234f9 in Curl_http_connect () from /usr/lib/libcurl.so.4
#16 0x0012a9ae in Curl_protocol_connect () from /usr/lib/libcurl.so.4
#17 0x0012fc7f in Curl_connect () from /usr/lib/libcurl.so.4
#18 0x00138a04 in Curl_perform () from /usr/lib/libcurl.so.4
#19 0x00139793 in curl_easy_perform () from /usr/lib/libcurl.so.4

Comment 12 Kevin Baughman 2009-11-11 18:58:38 UTC
I reran the code under valgrind, and eventually it cored.  I saw a 4 of these:

==21045== Invalid read of size 4
==21045==    at 0x4975B02: PR_GetIdentitiesLayer (in /lib/libnspr4.so)
==21045==    by 0x48015F6: ssl_FindSocket (sslsock.c:248)
==21045==    by 0x4803418: SSL_GetChannelInfo (sslinfo.c:54)
==21045==    by 0x437AC91: Curl_nss_connect (in /usr/lib/libcurl.so.4.1.1)
==21045==    by 0x4371181: Curl_ssl_connect (in /usr/lib/libcurl.so.4.1.1)
==21045==    by 0x43504F8: Curl_http_connect (in /usr/lib/libcurl.so.4.1.1)
==21045==    by 0x43579AD: Curl_protocol_connect (in /usr/lib/libcurl.so.4.1.1)
==21045==    by 0x435CC7E: Curl_connect (in /usr/lib/libcurl.so.4.1.1)
==21045==    by 0x4365A03: Curl_perform (in /usr/lib/libcurl.so.4.1.1)
==21045==    by 0x4366792: curl_easy_perform (in /usr/lib/libcurl.so.4.1.1)

One of these:

==21045== Invalid read of size 4
==21045==    at 0x488AA77: CERT_NameToAsciiInvertible (alg1485.c:1100)
==21045==    by 0x488AB94: CERT_NameToAscii (alg1485.c:1149)
==21045==    by 0x4379779: (within /usr/lib/libcurl.so.4.1.1)
==21045==    by 0x437ACD5: Curl_nss_connect (in /usr/lib/libcurl.so.4.1.1)
==21045==    by 0x4371181: Curl_ssl_connect (in /usr/lib/libcurl.so.4.1.1)
==21045==    by 0x43504F8: Curl_http_connect (in /usr/lib/libcurl.so.4.1.1)
==21045==    by 0x43579AD: Curl_protocol_connect (in /usr/lib/libcurl.so.4.1.1)
==21045==    by 0x435CC7E: Curl_connect (in /usr/lib/libcurl.so.4.1.1)
==21045==    by 0x4365A03: Curl_perform (in /usr/lib/libcurl.so.4.1.1)
==21045==    by 0x4366792: curl_easy_perform (in /usr/lib/libcurl.so.4.1.1)


And the actual core itself:

==21045== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==21045==  Access not within mapped region at address 0xD0
==21045==    at 0x488AA77: CERT_NameToAsciiInvertible (alg1485.c:1100)
==21045==    by 0x488AB94: CERT_NameToAscii (alg1485.c:1149)
==21045==    by 0x4379779: (within /usr/lib/libcurl.so.4.1.1)
==21045==    by 0x437ACD5: Curl_nss_connect (in /usr/lib/libcurl.so.4.1.1)
==21045==    by 0x4371181: Curl_ssl_connect (in /usr/lib/libcurl.so.4.1.1)
==21045==    by 0x43504F8: Curl_http_connect (in /usr/lib/libcurl.so.4.1.1)
==21045==    by 0x43579AD: Curl_protocol_connect (in /usr/lib/libcurl.so.4.1.1)
==21045==    by 0x435CC7E: Curl_connect (in /usr/lib/libcurl.so.4.1.1)
==21045==    by 0x4365A03: Curl_perform (in /usr/lib/libcurl.so.4.1.1)
==21045==    by 0x4366792: curl_easy_perform (in /usr/lib/libcurl.so.4.1.1)

Comment 13 Kevin Baughman 2009-11-11 19:07:25 UTC
Last stack trace had no debug symbols for curl:

#0  CERT_NameToAsciiInvertible (name=0xcc, strict=CERT_N2A_READABLE) at alg1485.c:1100
#1  0x0488ab95 in CERT_NameToAscii (name=0xcc) at alg1485.c:1149
#2  0x0437977a in display_cert_info (data=0x7c6e630, cert=0x0) at nss.c:669
#3  0x0437acd6 in display_conn_info (sock=<value optimized out>, conn=<value optimized out>) at nss.c:707
#4  Curl_nss_connect (sock=<value optimized out>, conn=<value optimized out>) at nss.c:1226
#5  0x04371182 in Curl_ssl_connect (conn=0x2f4a8370, sockindex=0) at sslgen.c:185
#6  0x043504f9 in Curl_http_connect (conn=0x2f4a8370, done=0x15a74e7e) at http.c:1792
#7  0x043579ae in Curl_protocol_connect (conn=0x2f4a8370, protocol_done=0x15a74e7e) at url.c:3056
#8  0x0435cc7f in setup_conn (protocol_done=<value optimized out>, hostaddr=<value optimized out>, conn=<value optimized out>) at url.c:4690
#9  Curl_connect (protocol_done=<value optimized out>, hostaddr=<value optimized out>, conn=<value optimized out>) at url.c:4766
#10 0x04365a04 in connect_host (conn=<value optimized out>, data=<value optimized out>) at transfer.c:2481
#11 Curl_perform (conn=<value optimized out>, data=<value optimized out>) at transfer.c:2562
#12 0x04366793 in curl_easy_perform (curl=0x7c6e630) at easy.c:557

Comment 14 Kevin Baughman 2009-11-11 20:26:00 UTC
==24503== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==24503==  Access not within mapped region at address 0xC
==24503==    at 0x47F7C0F: ssl_DefClose (ssldef.c:215)
==24503==    by 0x47FBBFF: ssl_SecureClose (sslsecur.c:1047)
==24503==    by 0x48002F6: ssl_Close (sslsock.c:1406)
==24503==    by 0x49741A0: PR_Close (in /lib/libnspr4.so)
==24503==    by 0x437A255: Curl_nss_connect (nss.c:1268)
==24503==    by 0x4371181: Curl_ssl_connect (sslgen.c:185)
==24503==    by 0x43504F8: Curl_http_connect (http.c:1792)
==24503==    by 0x43579AD: Curl_protocol_connect (url.c:3056)
==24503==    by 0x435CC7E: Curl_connect (url.c:4690)
==24503==    by 0x4365A03: Curl_perform (transfer.c:2481)
==24503==    by 0x4366792: curl_easy_perform (easy.c:557)

I also just got this core just now.  Seems like the same issue.

Comment 15 Kevin Baughman 2009-11-12 00:29:58 UTC
Created attachment 369117 [details]
Patch

Comment 16 Kevin Baughman 2009-11-12 01:30:39 UTC
Hi All,

I spent some time debugging the curl code while Kamil and I were debugging this issue and I think I found out what is going on.  Someplace in the code it looks like a data structure was released twice.  So I looked in curl_nss_connect and saw that the model structure will be freed twice if the handshake fails.

At line 939:
PRFileDesc *model = NULL;

At line 1201:
PR_Close(model); /* We don't need this any more */

Then at line 1214 if SSL_ForceHandshakeWithTimeout fails, goto error is called and at line 1267:

if(model)
  PR_Close(model);

I created a patch that sets model to NULL around line 1201 and that seems to fix my crash.  See below:

*** curl-7.19.6/lib/nss.c       Fri Jul 31 20:15:35 2009
--- /tmp/nss.c  Wed Nov 11 17:52:27 2009
***************
*** 1184,1190 ****
--- 1184,1192 ----
    connssl->handle = SSL_ImportFD(model, connssl->handle);
    if(!connssl->handle)
      goto error;
+
    PR_Close(model); /* We don't need this any more */
+   model = NULL;
 
    /* This is the password associated with the cert that we're using */
    if (data->set.str[STRING_KEY_PASSWD]) {

It looks at some point the handshake failed and between the first and the second close was called someone else was using the NSS socket stucture and the fd element was released.

Patch it attached.

Thanks,
Kevin

Comment 17 Kamil Dudka 2009-11-12 10:34:27 UTC
Kevin, thanks for digging this! Really nice catch!

Could you please test it once again with the NSS distributed along with Fedora? Has it solved all problems from the original report? Have the invalid writes from bug 534115 also disappeared?

Comment 18 Kamil Dudka 2009-11-12 10:55:46 UTC
Applied upstream, the builds for Fedora will follow.

Comment 19 Kamil Dudka 2009-11-12 10:59:37 UTC
Note the bug itself is not specific to multi-threaded applications. The double close occurs in a single thread as well. Only the impact is worse in case of multi-threading probably.

Comment 20 Fedora Update System 2009-11-12 13:22:03 UTC
curl-7.19.6-2.fc10 has been submitted as an update for Fedora 10.
http://admin.fedoraproject.org/updates/curl-7.19.6-2.fc10

Comment 21 Fedora Update System 2009-11-12 13:22:10 UTC
curl-7.19.7-1.fc12 has been submitted as an update for Fedora 12.
http://admin.fedoraproject.org/updates/curl-7.19.7-1.fc12

Comment 22 Fedora Update System 2009-11-12 13:22:16 UTC
curl-7.19.7-2.fc11 has been submitted as an update for Fedora 11.
http://admin.fedoraproject.org/updates/curl-7.19.7-2.fc11

Comment 23 Fedora Update System 2009-11-13 02:26:17 UTC
curl-7.19.7-2.fc11 has been pushed to the Fedora 11 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update curl'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F11/FEDORA-2009-11454

Comment 24 Fedora Update System 2009-11-13 02:27:58 UTC
curl-7.19.6-2.fc10 has been pushed to the Fedora 10 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update curl'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F10/FEDORA-2009-11457

Comment 25 Fedora Update System 2009-11-13 02:29:43 UTC
curl-7.19.7-1.fc12 has been pushed to the Fedora 12 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update curl'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F12/FEDORA-2009-11462

Comment 26 Kevin Baughman 2009-11-13 14:54:34 UTC
Looks good.  Was not able to test last night due to network issues, but it seems to be working.

Comment 27 Fedora Update System 2009-12-01 04:21:24 UTC
curl-7.19.7-3.fc11 has been pushed to the Fedora 11 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update curl'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F11/FEDORA-2009-12245

Comment 28 Fedora Update System 2009-12-01 04:36:36 UTC
curl-7.19.7-2.fc12 has been pushed to the Fedora 12 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update curl'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F12/FEDORA-2009-12235

Comment 29 Fedora Update System 2009-12-01 04:46:10 UTC
curl-7.19.6-2.fc10 has been pushed to the Fedora 10 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 30 Fedora Update System 2009-12-18 04:17:57 UTC
curl-7.19.7-3.fc11 has been pushed to the Fedora 11 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 31 Fedora Update System 2009-12-18 04:21:08 UTC
curl-7.19.7-2.fc12 has been pushed to the Fedora 12 stable repository.  If problems still persist, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.