Bug 849133 - glusterfs rdma fuse client crashed due to possible split brain situation.
Summary: glusterfs rdma fuse client crashed due to possible split brain situation.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: replicate
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Raghavendra G
QA Contact: shylesh
URL:
Whiteboard:
Depends On: 787612
Blocks: 858454
 
Reported: 2012-08-17 11:56 UTC by Vidya Sakar
Modified: 2016-09-17 12:13 UTC (History)
11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 787612
: 858454 (view as bug list)
Environment:
Last Closed: 2015-02-23 07:00:17 UTC
Embargoed:


Attachments (Terms of Use)

Description Vidya Sakar 2012-08-17 11:56:38 UTC
+++ This bug was initially created as a clone of Bug #787612 +++

Created attachment 559599 [details]
rdma fuse client log

Description of problem:
I was running sanity tests on a dist-rep volume with rdma transport type. The rdma fuse client crashed with signal 6.

Version-Release number of selected component (if applicable):
glusterfs-3.3.0qa19

How reproducible:
Often (2/2)

Steps to Reproduce:
1. Create a dist-rep volume with rdma transport type.
2. Start sanity tests.
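A hypothetical sketch of the setup in the steps above (volume name, hostnames, and brick paths are placeholders, not taken from this report; the exact test suite used is not named here):

```shell
# Create a 2x2 distributed-replicate volume over RDMA and mount it
# with the native fuse client (placeholder names throughout).
gluster volume create testvol replica 2 transport rdma \
    server1:/bricks/b1 server2:/bricks/b1 \
    server1:/bricks/b2 server2:/bricks/b2
gluster volume start testvol
mount -t glusterfs -o transport=rdma server1:/testvol /mnt
# ... run the sanity/stress tests against /mnt ...
```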
  
Actual results:
The fuse client crashed with the following backtrace.

Core was generated by `/usr/local/sbin/glusterfs --volfile-id=hosdu --volfile-server=10.1.10.24 /mnt/'.
Program terminated with signal 6, Aborted.
#0  0x0000003d4f232905 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.25.el6_1.3.x86_64 libgcc-4.4.5-6.el6.x86_64 libibverbs-1.1.4-2.el6.x86_64 libmlx4-1.0.1-7.el6.x86_64
(gdb) bt
#0  0x0000003d4f232905 in raise () from /lib64/libc.so.6
#1  0x0000003d4f2340e5 in abort () from /lib64/libc.so.6
#2  0x0000003d4f22b9be in __assert_fail_base () from /lib64/libc.so.6
#3  0x0000003d4f22ba80 in __assert_fail () from /lib64/libc.so.6
#4  0x00007fb723db987d in afr_get_call_child (this=0x17686c0, child_up=0x7fb710011720 "", read_child=-1, fresh_children=0x7fb71000cd60, call_child=0x7fb71d81986c, last_index=0x7fb71001d918) at afr-common.c:670
#5  0x00007fb723d5e599 in afr_stat (frame=0x7fb72ae67c78, this=0x17686c0, loc=0x7fb7100120e8) at afr-inode-read.c:257
#6  0x00007fb723b0e6c9 in dht_stat (frame=0x7fb72ae63ca4, this=0x176a560, loc=0x7fb7100120e8) at dht-inode-read.c:302
#7  0x00007fb72389bc55 in wb_stat (frame=0x7fb72ae66198, this=0x176b810, loc=0x7fb7100120e8) at write-behind.c:753
#8  0x00007fb72c270142 in default_stat (frame=0x7fb72ae68080, this=0x176caf0, loc=0x7fb7100120e8) at defaults.c:1147
#9  0x00007fb72c270142 in default_stat (frame=0x7fb72ae679c8, this=0x176dd20, loc=0x7fb7100120e8) at defaults.c:1147
#10 0x00007fb72c270142 in default_stat (frame=0x7fb72ae64810, this=0x176eee0, loc=0x7fb7100120e8) at defaults.c:1147
#11 0x00007fb72301d661 in sp_stat (frame=0x7fb72ae69ebc, this=0x17701b0, loc=0x7fb7100120e8) at stat-prefetch.c:3644
#12 0x00007fb722dde15b in io_stats_stat (frame=0x7fb72ae64158, this=0x1771510, loc=0x7fb7100120e8) at io-stats.c:1836
#13 0x00007fb72a9124ec in fuse_getattr_resume (state=0x7fb7100120d0) at fuse-bridge.c:536
#14 0x00007fb72a90e804 in fuse_resolve_and_resume (state=0x7fb7100120d0, fn=0x7fb72a911ef5 <fuse_getattr_resume>) at fuse-resolve.c:754
#15 0x00007fb72a913783 in fuse_getattr (this=0x1759d50, finh=0x7fb7100344c0, msg=0x7fb7100344e8) at fuse-bridge.c:615
#16 0x00007fb72a92c56e in fuse_thread_proc (data=0x1759d50) at fuse-bridge.c:3482
#17 0x0000003d4fa077e1 in start_thread () from /lib64/libpthread.so.0
#18 0x0000003d4f2e577d in clone () from /lib64/libc.so.6
(gdb) f 5
#5  0x00007fb723d5e599 in afr_stat (frame=0x7fb72ae67c78, this=0x17686c0, loc=0x7fb7100120e8) at afr-inode-read.c:257
257             ret = afr_get_call_child (this, local->child_up, read_child,
(gdb) f 4
#4  0x00007fb723db987d in afr_get_call_child (this=0x17686c0, child_up=0x7fb710011720 "", read_child=-1, fresh_children=0x7fb71000cd60, call_child=0x7fb71d81986c, last_index=0x7fb71001d918) at afr-common.c:670
670             GF_ASSERT (read_child >= 0);
(gdb) 




Expected results:
There should be no crashes.

Additional info:

Entries from the client log. 


[2012-02-06 01:29:10.891992] W [client3_1-fops.c:418:client3_1_stat_cbk] 0-glusterfs: remote operation failed: Transport endpoint is not connected
[2012-02-06 01:29:10.892069] E [rpc-clnt.c:380:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x186) [0x7fb72c0245d5] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x1c5) [0x7fb72c0234d6] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x45) [0x7fb72c022c0e]))) 0-hosdu-client-2: forced unwinding frame type(GlusterFS 3.1) op(RELEASEDIR(42)) called at 2012-02-06 01:29:10.890713
[2012-02-06 01:29:10.892115] E [rpc-clnt.c:380:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x186) [0x7fb72c0245d5] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x1c5) [0x7fb72c0234d6] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x45) [0x7fb72c022c0e]))) 0-hosdu-client-2: forced unwinding frame type(GlusterFS 3.1) op(LOOKUP(27)) called at 2012-02-06 01:29:10.890985
[2012-02-06 01:29:10.892135] W [client3_1-fops.c:2249:client3_1_lookup_cbk] 0-glusterfs: remote operation failed: Transport endpoint is not connected. Path: /run31647/pa/f2
[2012-02-06 01:29:10.892169] I [client.c:1885:client_rpc_notify] 0-hosdu-client-2: disconnected
[2012-02-06 01:29:10.893072] E [rpc-clnt.c:771:rpc_clnt_handle_reply] 0-hosdu-client-3: cannot lookup the saved frame for reply with xid (1440190)
[2012-02-06 01:29:10.893102] E [rpc-clnt.c:380:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x186) [0x7fb72c0245d5] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x1c5) [0x7fb72c0234d6] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x45) [0x7fb72c022c0e]))) 0-hosdu-client-3: forced unwinding frame type(GlusterFS 3.1) op(INODELK(29)) called at 2012-02-06 01:29:10.892259
[2012-02-06 01:29:10.893137] W [client3_1-fops.c:1235:client3_1_inodelk_cbk] 0-glusterfs: remote operation failed: Transport endpoint is not connected
[2012-02-06 01:29:10.893160] W [client3_1-fops.c:4721:client3_1_inodelk] 0-hosdu-client-2: failed to send the fop: Transport endpoint is not connected
[2012-02-06 01:29:10.896806] W [rpc-clnt.c:1478:rpc_clnt_submit] 0-hosdu-client-3: failed to submit rpc-request (XID: 0x1440192x Program: GlusterFS 3.1, ProgVers: 310, Proc: 29) to rpc-transport (hosdu-client-3)
[2012-02-06 01:29:10.896834] W [client3_1-fops.c:1235:client3_1_inodelk_cbk] 0-hosdu-client-3: remote operation failed: Transport endpoint is not connected
[2012-02-06 01:29:10.896852] I [afr-lk-common.c:993:afr_lock_blocking] 0-hosdu-replicate-1: unable to lock on even one child
[2012-02-06 01:29:10.896869] I [afr-transaction.c:952:afr_post_blocking_inodelk_cbk] 0-hosdu-replicate-1: Blocking inodelks failed.
[2012-02-06 01:29:10.896926] E [rpc-clnt.c:380:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x186) [0x7fb72c0245d5] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x1c5) [0x7fb72c0234d6] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x45) [0x7fb72c022c0e]))) 0-hosdu-client-3: forced unwinding frame type(GlusterFS 3.1) op(READLINK(2)) called at 2012-02-06 01:29:10.891941
[2012-02-06 01:29:10.896947] W [client3_1-fops.c:460:client3_1_readlink_cbk] 0-glusterfs: remote operation failed: Transport endpoint is not connected
[2012-02-06 01:29:10.896968] W [fuse-bridge.c:1127:fuse_readlink_cbk] 0-glusterfs-fuse: 1487166: /run31647/pd/l2 => -1 (Transport endpoint is not connected)
[2012-02-06 01:29:10.897040] E [rpc-clnt.c:380:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x186) [0x7fb72c0245d5] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x1c5) [0x7fb72c0234d6] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x45) [0x7fb72c022c0e]))) 0-hosdu-client-3: forced unwinding frame type(GlusterFS 3.1) op(STAT(1)) called at 2012-02-06 01:29:10.892036
[2012-02-06 01:29:10.897088] W [client3_1-fops.c:418:client3_1_stat_cbk] 0-glusterfs: remote operation failed: Transport endpoint is not connected
[2012-02-06 01:29:10.900609] W [rpc-clnt.c:1478:rpc_clnt_submit] 0-hosdu-client-3: failed to submit rpc-request (XID: 0x1440193x Program: GlusterFS 3.1, ProgVers: 310, Proc: 27) to rpc-transport (hosdu-client-3)
[2012-02-06 01:29:10.900638] W [client3_1-fops.c:2249:client3_1_lookup_cbk] 0-hosdu-client-3: remote operation failed: Transport endpoint is not connected. Path: /run31647/p6/f2
[2012-02-06 01:29:10.904378] W [rpc-clnt.c:1478:rpc_clnt_submit] 0-hosdu-client-3: failed to submit rpc-request (XID: 0x1440194x Program: GlusterFS 3.1, ProgVers: 310, Proc: 29) to rpc-transport (hosdu-client-3)
[2012-02-06 01:29:10.904407] W [client3_1-fops.c:1235:client3_1_inodelk_cbk] 0-hosdu-client-3: remote operation failed: Transport endpoint is not connected


I have attached the client log. I have archived the core file and other logs.

Comment 3 Sachidananda Urs 2013-08-08 05:46:11 UTC
Moving out of Big Bend since RDMA support is not available in Big Bend (2.1).

Comment 7 Mohammed Rafi KC 2015-02-23 07:00:17 UTC
We ran the sanity tests for RDMA and couldn't reproduce the bug. Hence closing the bug as fixed in the current release.

