Bug 762113 (GLUSTER-381)

Summary: glusterfs crash in ib-verbs
Product: [Community] GlusterFS
Reporter: Raghavendra G <raghavendra>
Component: ib-verbs
Assignee: Raghavendra G <raghavendra>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: low
Docs Contact:
Priority: low
Version: mainline
CC: gluster-bugs
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix

Description Raghavendra G 2009-11-13 18:23:20 UTC
There is a race condition in ib-verbs between the thread running
send_completion_proc (thr 1) and the thread polling for events on
sockets (thr 2). Consider the following code path:
1. wc.status indicates an error in thr 1.
2. A peer is registered for the (qp, device) combination on which this
work request was sent.
3. thr 1 calls transport_disconnect.
4. thr 2 receives pollerr and goes ahead and cleans up the transport.
5. thr 1 continues execution and starts executing ib_verbs_quota_put.
This procedure uses priv->write_mutex, where priv is stored in
trans->private. thr 1 is now accessing freed memory, which may result
in a SIGSEGV.
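
One common way to close this kind of window is for thr 1 to pin the transport with a reference before it triggers the disconnect, and to drop that reference only after it has finished touching priv. The following is a minimal sketch of that idea, not the actual patch: every name in it (sketch_transport_t, sketch_transport_ref, sketch_error_path and so on) is hypothetical and only stands in for the real transport structures and calls. The poll thread is joined before the quota update to force the worst-case ordering from steps 3 to 5.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the real transport structures. */
typedef struct {
    pthread_mutex_t write_mutex;
    int quota;
} sketch_private_t;

typedef struct {
    pthread_mutex_t lock;      /* protects refcount */
    int refcount;
    sketch_private_t *priv;    /* plays the role of trans->private */
} sketch_transport_t;

static int sketch_destroyed = 0;   /* number of transports actually freed */

sketch_transport_t *sketch_transport_new(void)
{
    sketch_transport_t *t = calloc(1, sizeof(*t));
    if (!t)
        return NULL;
    t->priv = calloc(1, sizeof(*t->priv));
    if (!t->priv) {
        free(t);
        return NULL;
    }
    pthread_mutex_init(&t->lock, NULL);
    pthread_mutex_init(&t->priv->write_mutex, NULL);
    t->refcount = 1;           /* the owner's reference */
    return t;
}

sketch_transport_t *sketch_transport_ref(sketch_transport_t *t)
{
    pthread_mutex_lock(&t->lock);
    t->refcount++;
    pthread_mutex_unlock(&t->lock);
    return t;
}

void sketch_transport_unref(sketch_transport_t *t)
{
    int remaining;

    pthread_mutex_lock(&t->lock);
    remaining = --t->refcount;
    pthread_mutex_unlock(&t->lock);

    if (remaining == 0) {      /* last user gone: now it is safe to free */
        pthread_mutex_destroy(&t->priv->write_mutex);
        pthread_mutex_destroy(&t->lock);
        free(t->priv);
        free(t);
        sketch_destroyed++;
    }
}

/* thr 2: cleanup on pollerr drops the owner's reference. */
static void *sketch_poll_thread(void *arg)
{
    sketch_transport_unref(arg);
    return NULL;
}

/* thr 1: error path of the send completion handler.
 * Returns the number of transports freed before the quota update (0 is safe). */
int sketch_error_path(sketch_transport_t *t)
{
    pthread_t thr2;
    int freed_early;

    sketch_transport_ref(t);                 /* pin before disconnecting */

    /* Stand-in for transport_disconnect: lets thr 2 run its cleanup. */
    if (pthread_create(&thr2, NULL, sketch_poll_thread, t) != 0) {
        sketch_transport_unref(t);
        return -1;
    }
    pthread_join(thr2, NULL);                /* worst case: cleanup ran first */
    freed_early = sketch_destroyed;

    /* Stand-in for ib_verbs_quota_put: priv is still valid here. */
    pthread_mutex_lock(&t->priv->write_mutex);
    t->priv->quota++;
    pthread_mutex_unlock(&t->priv->write_mutex);

    sketch_transport_unref(t);               /* last reference: frees now */
    return freed_early;
}
```

With the pin in place, the cleanup in thr 2 only drops a reference, and the memory is released by whichever thread lets go last.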

There is also a second race condition, related to cleaning up the transport.
1. thr 1 looks up the peer in ib_verbs_send_completion_proc.
2. thr 2 unregisters the peer and cleans up the transport (and thereby the peer).
3. thr 1 goes ahead and uses the peer pointer, which now points to freed memory.
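
The lookup race can be sketched the same way. The assumption here is that the lookup takes a reference on the transport while the registry lock is still held, so that a concurrent unregister cannot free it, and that the caller never keeps the peer pointer itself. All names below (demo_registry_t, demo_lookup_and_ref, demo_unregister and so on) are hypothetical and are not the ib-verbs API.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define DEMO_MAX_QP 8

/* Hypothetical stand-ins: a transport and the peer entry that points to it. */
typedef struct { int refcount; int id; } demo_trans_t;
typedef struct { demo_trans_t *trans; } demo_peer_t;

typedef struct {
    pthread_mutex_t lock;              /* protects peers[] and all refcounts */
    demo_peer_t *peers[DEMO_MAX_QP];   /* indexed by a made-up qp number */
} demo_registry_t;

static int demo_freed = 0;             /* number of transports actually freed */

void demo_registry_init(demo_registry_t *reg)
{
    int i;
    pthread_mutex_init(&reg->lock, NULL);
    for (i = 0; i < DEMO_MAX_QP; i++)
        reg->peers[i] = NULL;
}

int demo_register(demo_registry_t *reg, int qp, int id)
{
    demo_peer_t *peer = calloc(1, sizeof(*peer));
    demo_trans_t *trans = calloc(1, sizeof(*trans));
    if (!peer || !trans || qp < 0 || qp >= DEMO_MAX_QP) {
        free(peer);
        free(trans);
        return -1;
    }
    trans->refcount = 1;               /* the registry's own reference */
    trans->id = id;
    peer->trans = trans;
    pthread_mutex_lock(&reg->lock);
    reg->peers[qp] = peer;
    pthread_mutex_unlock(&reg->lock);
    return 0;
}

/* thr 1: look up the peer and pin its transport in one critical section. */
demo_trans_t *demo_lookup_and_ref(demo_registry_t *reg, int qp)
{
    demo_trans_t *trans = NULL;
    pthread_mutex_lock(&reg->lock);
    if (qp >= 0 && qp < DEMO_MAX_QP && reg->peers[qp]) {
        trans = reg->peers[qp]->trans;
        trans->refcount++;             /* taken before the lock is released */
    }
    pthread_mutex_unlock(&reg->lock);
    return trans;
}

void demo_unref(demo_registry_t *reg, demo_trans_t *trans)
{
    int remaining;
    pthread_mutex_lock(&reg->lock);
    remaining = --trans->refcount;
    pthread_mutex_unlock(&reg->lock);
    if (remaining == 0) {
        free(trans);
        demo_freed++;
    }
}

/* thr 2: remove the peer entry and drop the registry's reference. */
void demo_unregister(demo_registry_t *reg, int qp)
{
    demo_peer_t *peer;
    pthread_mutex_lock(&reg->lock);
    peer = reg->peers[qp];
    reg->peers[qp] = NULL;
    pthread_mutex_unlock(&reg->lock);
    if (peer) {
        demo_trans_t *trans = peer->trans;
        free(peer);                    /* nobody holds the peer pointer */
        demo_unref(reg, trans);
    }
}
```

In this sketch a completion handler would call demo_lookup_and_ref once per work completion and demo_unref when it is done, so an unregister that lands in between only delays the free.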

Of these two race conditions, the first can happen only while
processing the work completion structure of the work request that
failed. The second can also happen while processing subsequent
work completions of work requests corresponding to the same peer.

Comment 1 Raghavendra G 2009-11-13 21:22:54 UTC
Relevant information from backend server

================== snip =================
[2009-11-10 13:59:26] E [ib-verbs.c:1229:ib_verbs_send_completion_proc] transport/ib-verbs: send work request on `mlx4_0' returned error wc.status = 12, wc.vendor_err = 129, post->buf = 0x2aab9af5f000, wc.byte_len = 0, post->reused = 2070
[2009-11-10 13:59:26] E [ib-verbs.c:1241:ib_verbs_send_completion_proc] ib-verbs: connection between client and server not working. check by running 'ibv_srq_pingpong'. also make sure subnet manager is running (eg: 'opensm'), or check if ib-verbs port is valid (or active) by running 'ibv_devinfo'. contact Gluster Support Team if the problem persists.
[2009-11-10 13:59:26] E [ib-verbs.c:1229:ib_verbs_send_completion_proc] transport/ib-verbs: send work request on `mlx4_0' returned error wc.status = 5, wc.vendor_err = 249, post->buf = 0x2aaaea385000, wc.byte_len = 0, post->reused = 12465
[2009-11-10 13:59:26] E [ib-verbs.c:1229:ib_verbs_send_completion_proc] transport/ib-verbs: send work request on `mlx4_0' returned error wc.status = 5, wc.vendor_err = 249, post->buf = 0x2aaaec919000, wc.byte_len = 0, post->reused = 48713
[2009-11-10 13:59:26] E [ib-verbs.c:2023:ib_verbs_event_handler] transport/ib-verbs: server: pollin received on tcp socket (peer: 172.20.3.120:985) after handshake is complete
[2009-11-10 13:59:26] E [ib-verbs.c:1229:ib_verbs_send_completion_proc] transport/ib-verbs: send work request on `mlx4_0' returned error wc.status = 5, wc.vendor_err = 249, post->buf = 0x2aab7c34d000, wc.byte_len = 0, post->reused = 5222
[2009-11-10 13:59:26] N [server-protocol.c:7825:notify] server: 172.20.3.120:985 disconnected
pending frames:

patchset: v2.0.7-61-g04de4b6
signal received: 11
time of crash: 2009-11-10 13:59:26
configuration details:
argp 1
backtrace 1
db.h 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 2.0.8rc9
/lib64/libc.so.6[0x3639a30280]
/usr/local/lib/glusterfs/2.0.8rc9/transport/ib-verbs.so[0x2aaaaaab09bb]
/lib64/libpthread.so.0[0x363a206367]
/lib64/libc.so.6(clone+0x6d)[0x3639ad2f7d]
---------

====================== snip ==================

Backtrace from thread '1'

========================
#0 ib_verbs_send_completion_proc (data=<value optimized out>) at ib-verbs.c:349
349 ib-verbs.c: No such file or directory.
in ib-verbs.c
(gdb) bt
#0 ib_verbs_send_completion_proc (data=<value optimized out>) at ib-verbs.c:349
#1 0x000000363a206367 in start_thread () from /lib64/libpthread.so.0
#2 0x0000003639ad2f7d in clone () from /lib64/libc.so.6
(gdb) l
344 in ib-verbs.c
(gdb)
=========================

backtrace thr 2

=========================

#0 0x000000363a20d2cb in read () from /lib64/libpthread.so.0
#1 0x000000336e609eac in __ibv_get_cq_event (channel=<value optimized out>, cq=<value optimized out>, cq_context=<value optimized out>) at /usr/include/bits/unistd.h:35
#2 0x00002aaaaaab04d4 in ib_verbs_recv_completion_proc (data=<value optimized out>) at ib-verbs.c:1095
#3 0x000000363a206367 in start_thread () from /lib64/libpthread.so.0
#4 0x0000003639ad2f7d in clone () from /lib64/libc.so.6

=========================

Comment 2 Anand Avati 2009-11-16 05:41:39 UTC
PATCH: http://patches.gluster.com/patch/2228 in master (transport/ib-verbs: fix race-condition resulting in freeing of transport while it was still being used.)

Comment 3 Anand Avati 2009-11-19 07:52:12 UTC
PATCH: http://patches.gluster.com/patch/2309 in release-2.0 (transport/ib-verbs: fix race-condition resulting in freeing of transport while it was still being used.)

Comment 4 Anand Avati 2009-11-19 07:57:48 UTC
PATCH: http://patches.gluster.com/patch/2310 in master (transport/ib-verbs: assign to qpreg before accessing it in __ib_verbs_lookup_peer.)