There is a race condition in ib-verbs between the thread running ib_verbs_send_completion_proc (thr 1) and the thread polling for events on sockets (thr 2). Consider the following code path:

1. wc.status indicates an error in thr 1.
2. A peer is registered for the (qp, device) combination on which this work request was sent.
3. thr 1 calls transport_disconnect.
4. thr 2 receives pollerr and goes ahead and cleans up the transport.
5. thr 1 continues execution and starts executing ib_verbs_quota_put. This procedure uses priv->write_mutex, where priv is stored in trans->private.

At this point thr 1 is accessing freed memory, which may result in SIGSEGV.

There is also another race condition related to cleaning up of the transport:

1. thr 1 looks up the peer in ib_verbs_send_completion_proc.
2. thr 2 unregisters the peer and cleans up the transport (and thereby the peer).
3. thr 1 goes ahead and uses the pointer peer, which now points to freed memory.

Of these two race conditions, the first can happen only while processing the work completion structure of the work request that failed. The second can also happen while processing subsequent work completions of work requests corresponding to the same peer.
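Both races come down to one thread still holding a pointer to an object that another thread frees. One common way to close that window is reference counting: the lookup takes a reference while holding the registry lock, and the object is freed only when the last reference is dropped. The sketch below only illustrates that idea and is not the code from the patches attached to this bug. All names in it (demo_transport_t, demo_registry_t, demo_lookup_and_ref, demo_unregister and so on) are hypothetical and do not exist in GlusterFS, and the registry holds a single peer for brevity.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the transport; NOT the GlusterFS structure. */
typedef struct demo_transport {
    pthread_mutex_t lock;        /* protects refcount */
    int             refcount;
    int            *freed_flag;  /* set to 1 on destruction, for the demo only */
} demo_transport_t;

/* Hypothetical single-slot peer registry. */
typedef struct demo_registry {
    pthread_mutex_t   lock;      /* protects peer */
    demo_transport_t *peer;
} demo_registry_t;

demo_transport_t *demo_transport_new(int *freed_flag)
{
    demo_transport_t *t = calloc(1, sizeof(*t));
    if (t == NULL)
        return NULL;
    pthread_mutex_init(&t->lock, NULL);
    t->refcount = 1;             /* this reference is owned by the registry */
    t->freed_flag = freed_flag;
    return t;
}

demo_transport_t *demo_transport_ref(demo_transport_t *t)
{
    pthread_mutex_lock(&t->lock);
    t->refcount++;
    pthread_mutex_unlock(&t->lock);
    return t;
}

void demo_transport_unref(demo_transport_t *t)
{
    int last;

    pthread_mutex_lock(&t->lock);
    assert(t->refcount > 0);
    last = (--t->refcount == 0);
    pthread_mutex_unlock(&t->lock);

    if (last) {                  /* nobody else can reach t any more */
        *t->freed_flag = 1;
        pthread_mutex_destroy(&t->lock);
        free(t);
    }
}

/* thr 1: look up the peer and pin it before the registry lock is released. */
demo_transport_t *demo_lookup_and_ref(demo_registry_t *r)
{
    demo_transport_t *t;

    pthread_mutex_lock(&r->lock);
    t = r->peer;
    if (t != NULL)
        demo_transport_ref(t);
    pthread_mutex_unlock(&r->lock);
    return t;
}

/* thr 2: remove the peer from the registry and drop the registry's reference. */
void demo_unregister(demo_registry_t *r)
{
    demo_transport_t *t;

    pthread_mutex_lock(&r->lock);
    t = r->peer;
    r->peer = NULL;
    pthread_mutex_unlock(&r->lock);

    if (t != NULL)
        demo_transport_unref(t);
}
```

In this sketch, the completion thread would call demo_lookup_and_ref at the start of handling a work completion and demo_transport_unref once it is done with the peer. If the polling thread calls demo_unregister in between, the object stays alive until the completion thread drops its reference, and later lookups return NULL instead of a dangling pointer.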
Relevant information from backend server

================== snip =================
[2009-11-10 13:59:26] E [ib-verbs.c:1229:ib_verbs_send_completion_proc] transport/ib-verbs: send work request on `mlx4_0' returned error wc.status = 12, wc.vendor_err = 129, post->buf = 0x2aab9af5f000, wc.byte_len = 0, post->reused = 2070
[2009-11-10 13:59:26] E [ib-verbs.c:1241:ib_verbs_send_completion_proc] ib-verbs: connection between client and server not working. check by running 'ibv_srq_pingpong'. also make sure subnet manager is running (eg: 'opensm'), or check if ib-verbs port is valid (or active) by running 'ibv_devinfo'. contact Gluster Support Team if the problem persists.
[2009-11-10 13:59:26] E [ib-verbs.c:1229:ib_verbs_send_completion_proc] transport/ib-verbs: send work request on `mlx4_0' returned error wc.status = 5, wc.vendor_err = 249, post->buf = 0x2aaaea385000, wc.byte_len = 0, post->reused = 12465
[2009-11-10 13:59:26] E [ib-verbs.c:1229:ib_verbs_send_completion_proc] transport/ib-verbs: send work request on `mlx4_0' returned error wc.status = 5, wc.vendor_err = 249, post->buf = 0x2aaaec919000, wc.byte_len = 0, post->reused = 48713
[2009-11-10 13:59:26] E [ib-verbs.c:2023:ib_verbs_event_handler] transport/ib-verbs: server: pollin received on tcp socket (peer: 172.20.3.120:985) after handshake is complete
[2009-11-10 13:59:26] E [ib-verbs.c:1229:ib_verbs_send_completion_proc] transport/ib-verbs: send work request on `mlx4_0' returned error wc.status = 5, wc.vendor_err = 249, post->buf = 0x2aab7c34d000, wc.byte_len = 0, post->reused = 5222
[2009-11-10 13:59:26] N [server-protocol.c:7825:notify] server: 172.20.3.120:985 disconnected
pending frames:

patchset: v2.0.7-61-g04de4b6
signal received: 11
time of crash: 2009-11-10 13:59:26
configuration details:
argp 1
backtrace 1
db.h 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 2.0.8rc9
/lib64/libc.so.6[0x3639a30280]
/usr/local/lib/glusterfs/2.0.8rc9/transport/ib-verbs.so[0x2aaaaaab09bb]
/lib64/libpthread.so.0[0x363a206367]
/lib64/libc.so.6(clone+0x6d)[0x3639ad2f7d]
---------
====================== snip ==================

Backtrace from thread '1'
========================
#0 ib_verbs_send_completion_proc (data=<value optimized out>) at ib-verbs.c:349
349 ib-verbs.c: No such file or directory.
in ib-verbs.c
(gdb) bt
#0 ib_verbs_send_completion_proc (data=<value optimized out>) at ib-verbs.c:349
#1 0x000000363a206367 in start_thread () from /lib64/libpthread.so.0
#2 0x0000003639ad2f7d in clone () from /lib64/libc.so.6
(gdb) l
344 in ib-verbs.c
(gdb)

=========================
backtrace thr 2
=========================
#0 0x000000363a20d2cb in read () from /lib64/libpthread.so.0
#1 0x000000336e609eac in __ibv_get_cq_event (channel=<value optimized out>, cq=<value optimized out>, cq_context=<value optimized out>) at /usr/include/bits/unistd.h:35
#2 0x00002aaaaaab04d4 in ib_verbs_recv_completion_proc (data=<value optimized out>) at ib-verbs.c:1095
#3 0x000000363a206367 in start_thread () from /lib64/libpthread.so.0
#4 0x0000003639ad2f7d in clone () from /lib64/libc.so.6
=========================
PATCH: http://patches.gluster.com/patch/2228 in master (transport/ib-verbs: fix race-condition resulting in freeing of transport while it was still being used.)
PATCH: http://patches.gluster.com/patch/2309 in release-2.0 (transport/ib-verbs: fix race-condition resulting in freeing of transport while it was still being used.)
PATCH: http://patches.gluster.com/patch/2310 in master (transport/ib-verbs: assign to qpreg before accessing it in __ib_verbs_lookup_peer.)