Bug 902684
| Summary: | Crash seen on ssl_setup_connection() |
|---|---|
| Product: | [Community] GlusterFS |
| Component: | transport |
| Status: | CLOSED CURRENTRELEASE |
| Severity: | high |
| Priority: | unspecified |
| Version: | mainline |
| Target Milestone: | --- |
| Target Release: | --- |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Fixed In Version: | glusterfs-3.4.0 |
| Doc Type: | Bug Fix |
| Reporter: | Avra Sengupta <asengupt> |
| Assignee: | Jeff Darcy <jdarcy> |
| CC: | gluster-bugs, kparthas |
| Keywords: | Reopened |
| Type: | Bug |
| Last Closed: | 2013-07-24 17:47:24 UTC |
Description
Avra Sengupta
2013-01-22 09:19:17 UTC
What is the nature of this alleged race? Which code paths are involved? Bug 873367 was a race, fixed by http://review.gluster.org/#change,4158 a couple of months ago. How does http://review.gluster.org/#change,4118 break it again? The stack trace shows that we haven't even gotten to any glusterd code yet; we're still in the transport code for accepting a connection. The way we crash seems to indicate that we have a bad pointer to an SSL structure, and that it was good when we spawned the thread. How did it become bad? Have the 4118 changes caused us to tear down a connection before we're even finished setting it up? Or maybe corrupted memory somehow? FWIW, I just ran bug-873367.t on my systems with current master plus the 4118 patch ten times with no failures of any sort. That suggests that maybe it is a race condition, but its nature is still unclear.

After debugging this for a while, I realized something: we should never be in any of this code in glusterd. The options that control both SSL and private-thread usage only affect the server and client volfiles - never glusterd, NFS, etc. If use_ssl and own_thread are set for a glusterd socket transport, that means something else is amiss. There should be two INFO messages in the glusterd log saying whether these features are enabled. Avra, can you please check a failed run and report what those messages say the state was when glusterd started?

Hi Jeff, I am extremely sorry, as it was a miss on my part. The core was not generated by glusterd, but by glusterfsd. KP posted another patchset in http://review.gluster.org/#change,4118, after which I don't observe the crash at all. I am closing this bug for now, and will re-open it if I encounter this issue again. Thanks.

Regards,
Avra

Hi Jeff, I am re-opening this bug, as I still see the crash whenever I run bug-873367.t. I isolated the point in bug-873367.t that actually causes the crash. After the last umount (Test 14), we stop the volume in Test 15.
When the volume stop is executed, socket_disconnect is invoked. The ssl_ssl in this call is corrupted and is causing the crash. I am attaching the gdb logs from the core. Please let me know if any other info is required.

Regards,
Avra

Created attachment 689771 [details]
GDB logs
The key here seems to be the fact that ssl_enabled=true and use_ssl=false, which is a combination that only happens for client connections when we're connecting to a glusterd portmapper. Sure enough, elsewhere in the gdb log we see that the translator name is patchy-client-0 and client functions are in the backtraces. Relatedly, the fact that we have a non-NULL value for ssl_ssl implies that use_ssl must have been true at some time in the past, so the fact that it's false now implies that the socket was reconnected since then. Indeed, we can see that socket_gen=5, which indicates multiple disconnect/reconnect cycles. I added some code to print out these values during socket transitions, and the only place I see socket_gen=5 is in the NFS daemon. Here's the full sequence of operations on that socket:

* Connected with ssl_enabled=true and use_ssl=false (portmapper connection)
* Disconnected
* Connected with ssl_enabled=true and use_ssl=true (brick connection)
* Disconnected
* Connected with ssl_enabled=true and use_ssl=false

Another portmapper connection? Why? And why does it cause the test to fail (for you but not for me or during regression tests) when it doesn't even use NFS? I don't know the answers to those questions, but I do know that we can prevent that last connection from blowing up on the invalid ssl_ssl pointer when it's disconnected again. That's what http://review.gluster.com/#change,4449 does, but I'm concerned that we might just be masking a problem in the higher-level disconnect/reconnect logic.

CHANGE: http://review.gluster.org/4449 (socket: null out priv->ssl_ssl on disconnect) merged in master by Anand Avati (avati)

> That's what http://review.gluster.com/#change,4449 does, but I'm concerned that
> we might just be masking a problem in the higher-level disconnect/reconnect
> logic.
The rpc_clnt layer's reconnect logic resets remote-port to zero on the first connect to the brick process. This is done to ensure that the client xl is not taken by surprise when a brick goes down (resulting in a disconnect) and comes back up listening on a different port.
[Ref: rpc-clnt.c: rpc_clnt_notify, RPC_TRANSPORT_CONNECT case.]
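The port-reset behaviour described above can be sketched roughly as follows. This is a hypothetical, much-simplified model using made-up names (`rpc_clnt_conn_t`, `rpc_clnt_handle_connect`), not the actual rpc-clnt.c code:

```c
#include <stdio.h>

/* Hypothetical, stripped-down stand-in for the client's connection state. */
typedef struct {
    int remote_port;   /* 0 means "query the glusterd portmapper first" */
} rpc_clnt_conn_t;

/* Sketch of the RPC_TRANSPORT_CONNECT case in rpc_clnt_notify: once a
 * connect to the brick succeeds, reset remote-port to zero so that the
 * next reconnect goes back through the portmapper instead of assuming
 * the brick is still listening on its old port. */
static void rpc_clnt_handle_connect(rpc_clnt_conn_t *conn)
{
    conn->remote_port = 0;
}
```

The consequence for this bug: every reconnect after a brick disconnect begins with a portmapper connection (use_ssl=false), which is exactly the kind of connection that later trips over the stale ssl_ssl pointer.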
FWIW, I used to get the same crash (on my laptop) that Avra was seeing in his setup.
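For reference, the crash-avoidance fix discussed above (change 4449, "socket: null out priv->ssl_ssl on disconnect") can be sketched as below. This is a hypothetical, much-simplified model, not the actual socket.c code: `socket_priv_t` and `socket_teardown_ssl` are made-up names, and real teardown would go through SSL_shutdown()/SSL_free() rather than free().

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, stripped-down stand-in for the socket private state;
 * the field names mirror those discussed in this bug. */
typedef struct {
    void *ssl_ssl;   /* SSL object left over from an SSL-enabled connect */
    bool  use_ssl;   /* whether the current connection uses SSL */
} socket_priv_t;

/* On disconnect, release the SSL object and null the pointer, so that a
 * later reconnect with use_ssl=false can never dereference (or double-
 * free) a stale ssl_ssl. */
static void socket_teardown_ssl(socket_priv_t *priv)
{
    if (priv->ssl_ssl != NULL) {
        free(priv->ssl_ssl);   /* stand-in for SSL_shutdown + SSL_free */
        priv->ssl_ssl = NULL;
    }
}
```

Nulling the pointer makes the teardown idempotent across the connect/disconnect cycles listed in the thread, though, as noted above, it may only be masking a higher-level reconnect-logic problem.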