Bug 849133

Summary: glusterfs rdma fuse client crashed due to a possible split-brain situation.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Vidya Sakar <vinaraya>
Component: replicate
Assignee: Raghavendra G <rgowdapp>
Status: CLOSED CURRENTRELEASE
QA Contact: shylesh <shmohan>
Severity: medium
Docs Contact:
Priority: low
Version: 2.0
CC: aavati, gluster-bugs, rhs-bugs, rkavunga, rwheeler, sdharane, storage-qa-internal, surs, vagarwal, vbellur, vbhat
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 787612
: 858454 (view as bug list)
Environment:
Last Closed: 2015-02-23 07:00:17 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 787612
Bug Blocks: 858454

Description Vidya Sakar 2012-08-17 11:56:38 UTC
+++ This bug was initially created as a clone of Bug #787612 +++

Created attachment 559599 [details]
rdma fuse client log

Description of problem:
I was running sanity tests on a dist-rep volume with rdma transport type. The rdma fuse client crashed with signal 6.
Version-Release number of selected component (if applicable):
glusterfs-3.3.0qa19

How reproducible:
Often (2/2)

Steps to Reproduce:
1. Create a dist-rep volume with rdma transport type.
2. Start sanity tests.
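For reference, setting up such a volume typically looks like the following. The hostnames and brick paths below are placeholders, not the reporter's actual setup; only the volume name "hosdu" comes from the logs. This is a sketch of the standard gluster CLI, not a runnable test.

```shell
# Hypothetical 4-node setup: a 2x2 distribute-replicate volume over RDMA.
gluster volume create hosdu replica 2 transport rdma \
    server1:/bricks/b1 server2:/bricks/b2 \
    server3:/bricks/b3 server4:/bricks/b4
gluster volume start hosdu

# Mount it on the client over RDMA (the crash occurred on such a mount).
mount -t glusterfs -o transport=rdma server1:/hosdu /mnt
```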
  
Actual results:
The fuse client crashed with the following backtrace.

Core was generated by `/usr/local/sbin/glusterfs --volfile-id=hosdu --volfile-server=10.1.10.24 /mnt/'.
Program terminated with signal 6, Aborted.
#0  0x0000003d4f232905 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.25.el6_1.3.x86_64 libgcc-4.4.5-6.el6.x86_64 libibverbs-1.1.4-2.el6.x86_64 libmlx4-1.0.1-7.el6.x86_64
(gdb) bt
#0  0x0000003d4f232905 in raise () from /lib64/libc.so.6
#1  0x0000003d4f2340e5 in abort () from /lib64/libc.so.6
#2  0x0000003d4f22b9be in __assert_fail_base () from /lib64/libc.so.6
#3  0x0000003d4f22ba80 in __assert_fail () from /lib64/libc.so.6
#4  0x00007fb723db987d in afr_get_call_child (this=0x17686c0, child_up=0x7fb710011720 "", read_child=-1, fresh_children=0x7fb71000cd60, call_child=0x7fb71d81986c, last_index=0x7fb71001d918) at afr-common.c:670
#5  0x00007fb723d5e599 in afr_stat (frame=0x7fb72ae67c78, this=0x17686c0, loc=0x7fb7100120e8) at afr-inode-read.c:257
#6  0x00007fb723b0e6c9 in dht_stat (frame=0x7fb72ae63ca4, this=0x176a560, loc=0x7fb7100120e8) at dht-inode-read.c:302
#7  0x00007fb72389bc55 in wb_stat (frame=0x7fb72ae66198, this=0x176b810, loc=0x7fb7100120e8) at write-behind.c:753
#8  0x00007fb72c270142 in default_stat (frame=0x7fb72ae68080, this=0x176caf0, loc=0x7fb7100120e8) at defaults.c:1147
#9  0x00007fb72c270142 in default_stat (frame=0x7fb72ae679c8, this=0x176dd20, loc=0x7fb7100120e8) at defaults.c:1147
#10 0x00007fb72c270142 in default_stat (frame=0x7fb72ae64810, this=0x176eee0, loc=0x7fb7100120e8) at defaults.c:1147
#11 0x00007fb72301d661 in sp_stat (frame=0x7fb72ae69ebc, this=0x17701b0, loc=0x7fb7100120e8) at stat-prefetch.c:3644
#12 0x00007fb722dde15b in io_stats_stat (frame=0x7fb72ae64158, this=0x1771510, loc=0x7fb7100120e8) at io-stats.c:1836
#13 0x00007fb72a9124ec in fuse_getattr_resume (state=0x7fb7100120d0) at fuse-bridge.c:536
#14 0x00007fb72a90e804 in fuse_resolve_and_resume (state=0x7fb7100120d0, fn=0x7fb72a911ef5 <fuse_getattr_resume>) at fuse-resolve.c:754
#15 0x00007fb72a913783 in fuse_getattr (this=0x1759d50, finh=0x7fb7100344c0, msg=0x7fb7100344e8) at fuse-bridge.c:615
#16 0x00007fb72a92c56e in fuse_thread_proc (data=0x1759d50) at fuse-bridge.c:3482
#17 0x0000003d4fa077e1 in start_thread () from /lib64/libpthread.so.0
#18 0x0000003d4f2e577d in clone () from /lib64/libc.so.6
(gdb) f 5
#5  0x00007fb723d5e599 in afr_stat (frame=0x7fb72ae67c78, this=0x17686c0, loc=0x7fb7100120e8) at afr-inode-read.c:257
257             ret = afr_get_call_child (this, local->child_up, read_child,
(gdb) f 4
#4  0x00007fb723db987d in afr_get_call_child (this=0x17686c0, child_up=0x7fb710011720 "", read_child=-1, fresh_children=0x7fb71000cd60, call_child=0x7fb71d81986c, last_index=0x7fb71001d918) at afr-common.c:670
670             GF_ASSERT (read_child >= 0);
(gdb) 

Expected results:
There should be no crashes.

Additional info:

Entries from the client log. 


[2012-02-06 01:29:10.891992] W [client3_1-fops.c:418:client3_1_stat_cbk] 0-glusterfs: remote operation failed: Transport endpoint is not connected
[2012-02-06 01:29:10.892069] E [rpc-clnt.c:380:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x186) [0x7fb72c0245d5] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x1c5) [0x7fb72c0234d6] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x45) [0x7fb72c022c0e]))) 0-hosdu-client-2: forced unwinding frame type(GlusterFS 3.1) op(RELEASEDIR(42)) called at 2012-02-06 01:29:10.890713
[2012-02-06 01:29:10.892115] E [rpc-clnt.c:380:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x186) [0x7fb72c0245d5] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x1c5) [0x7fb72c0234d6] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x45) [0x7fb72c022c0e]))) 0-hosdu-client-2: forced unwinding frame type(GlusterFS 3.1) op(LOOKUP(27)) called at 2012-02-06 01:29:10.890985
[2012-02-06 01:29:10.892135] W [client3_1-fops.c:2249:client3_1_lookup_cbk] 0-glusterfs: remote operation failed: Transport endpoint is not connected. Path: /run31647/pa/f2
[2012-02-06 01:29:10.892169] I [client.c:1885:client_rpc_notify] 0-hosdu-client-2: disconnected
[2012-02-06 01:29:10.893072] E [rpc-clnt.c:771:rpc_clnt_handle_reply] 0-hosdu-client-3: cannot lookup the saved frame for reply with xid (1440190)
[2012-02-06 01:29:10.893102] E [rpc-clnt.c:380:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x186) [0x7fb72c0245d5] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x1c5) [0x7fb72c0234d6] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x45) [0x7fb72c022c0e]))) 0-hosdu-client-3: forced unwinding frame type(GlusterFS 3.1) op(INODELK(29)) called at 2012-02-06 01:29:10.892259
[2012-02-06 01:29:10.893137] W [client3_1-fops.c:1235:client3_1_inodelk_cbk] 0-glusterfs: remote operation failed: Transport endpoint is not connected
[2012-02-06 01:29:10.893160] W [client3_1-fops.c:4721:client3_1_inodelk] 0-hosdu-client-2: failed to send the fop: Transport endpoint is not connected
[2012-02-06 01:29:10.896806] W [rpc-clnt.c:1478:rpc_clnt_submit] 0-hosdu-client-3: failed to submit rpc-request (XID: 0x1440192x Program: GlusterFS 3.1, ProgVers: 310, Proc: 29) to rpc-transport (hosdu-client-3)
[2012-02-06 01:29:10.896834] W [client3_1-fops.c:1235:client3_1_inodelk_cbk] 0-hosdu-client-3: remote operation failed: Transport endpoint is not connected
[2012-02-06 01:29:10.896852] I [afr-lk-common.c:993:afr_lock_blocking] 0-hosdu-replicate-1: unable to lock on even one child
[2012-02-06 01:29:10.896869] I [afr-transaction.c:952:afr_post_blocking_inodelk_cbk] 0-hosdu-replicate-1: Blocking inodelks failed.
[2012-02-06 01:29:10.896926] E [rpc-clnt.c:380:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x186) [0x7fb72c0245d5] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x1c5) [0x7fb72c0234d6] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x45) [0x7fb72c022c0e]))) 0-hosdu-client-3: forced unwinding frame type(GlusterFS 3.1) op(READLINK(2)) called at 2012-02-06 01:29:10.891941
[2012-02-06 01:29:10.896947] W [client3_1-fops.c:460:client3_1_readlink_cbk] 0-glusterfs: remote operation failed: Transport endpoint is not connected
[2012-02-06 01:29:10.896968] W [fuse-bridge.c:1127:fuse_readlink_cbk] 0-glusterfs-fuse: 1487166: /run31647/pd/l2 => -1 (Transport endpoint is not connected)
[2012-02-06 01:29:10.897040] E [rpc-clnt.c:380:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x186) [0x7fb72c0245d5] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x1c5) [0x7fb72c0234d6] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x45) [0x7fb72c022c0e]))) 0-hosdu-client-3: forced unwinding frame type(GlusterFS 3.1) op(STAT(1)) called at 2012-02-06 01:29:10.892036
[2012-02-06 01:29:10.897088] W [client3_1-fops.c:418:client3_1_stat_cbk] 0-glusterfs: remote operation failed: Transport endpoint is not connected
[2012-02-06 01:29:10.900609] W [rpc-clnt.c:1478:rpc_clnt_submit] 0-hosdu-client-3: failed to submit rpc-request (XID: 0x1440193x Program: GlusterFS 3.1, ProgVers: 310, Proc: 27) to rpc-transport (hosdu-client-3)
[2012-02-06 01:29:10.900638] W [client3_1-fops.c:2249:client3_1_lookup_cbk] 0-hosdu-client-3: remote operation failed: Transport endpoint is not connected. Path: /run31647/p6/f2
[2012-02-06 01:29:10.904378] W [rpc-clnt.c:1478:rpc_clnt_submit] 0-hosdu-client-3: failed to submit rpc-request (XID: 0x1440194x Program: GlusterFS 3.1, ProgVers: 310, Proc: 29) to rpc-transport (hosdu-client-3)
[2012-02-06 01:29:10.904407] W [client3_1-fops.c:1235:client3_1_inodelk_cbk] 0-hosdu-client-3: remote operation failed: Transport endpoint is not connected


I have attached the client log. I have archived the core file and other logs.

Comment 3 Sachidananda Urs 2013-08-08 05:46:11 UTC
Moving out of Big Bend (2.1) since RDMA support is not available in Big Bend.

Comment 7 Mohammed Rafi KC 2015-02-23 07:00:17 UTC
We ran the sanity tests for rdma and could not reproduce the bug. Hence closing the bug as CURRENTRELEASE.