Bug 1539680

Summary: RDMA transport bricks crash
Product: [Community] GlusterFS
Component: rdma
Version: mainline
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: urgent
Priority: low
Reporter: Jiri Lunacek <jiri.lunacek>
Assignee: bugs <bugs>
CC: atumball, bugs
Type: Bug
Last Closed: 2019-06-17 11:02:55 UTC
Attachments: brick volume file

Description Jiri Lunacek 2018-01-29 12:28:24 UTC
Created attachment 1387736 [details]
brick volume file

Description of problem:
RDMA transport bricks crash

Version-Release number of selected component (if applicable):
glusterfs-client-xlators-3.12.4-1.el7.x86_64
glusterfs-api-3.12.4-1.el7.x86_64
glusterfs-rdma-3.12.4-1.el7.x86_64
glusterfs-libs-3.12.4-1.el7.x86_64
glusterfs-cli-3.12.4-1.el7.x86_64
glusterfs-server-3.12.4-1.el7.x86_64
glusterfs-fuse-3.12.4-1.el7.x86_64
glusterfs-3.12.4-1.el7.x86_64

How reproducible:
We are experiencing crashes of glusterfsd on the bricks of a replicated RDMA volume.
Notably, the two replicated bricks fail with the same error within several minutes of each other, and both keep running after a restart.
This happens approximately every 5 or 6 days.

/usr/lib64/glusterfs/3.12.4/rpc-transport/rdma.so(+0x467f) is always at the top of the stack.
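
One way to map that offset to a function and source line is addr2line (a sketch; it assumes the matching glusterfs-debuginfo package is installed, otherwise it prints "??"):

# Resolve the crashing offset inside rdma.so to a function name and source line.
addr2line -f -C -e /usr/lib64/glusterfs/3.12.4/rpc-transport/rdma.so 0x467f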

Backtrace:
pending frames:
frame : type(0) op(12)
frame : type(0) op(12)
frame : type(0) op(29)
frame : type(0) op(37)
frame : type(0) op(29)
frame : type(0) op(29)
frame : type(0) op(29)
frame : type(0) op(29)
frame : type(0) op(29)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2018-01-29 11:58:11
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.4
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xa0)[0x7f3a13cc3500]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f3a13ccd434]
/lib64/libc.so.6(+0x35270)[0x7f3a1232c270]
/usr/lib64/glusterfs/3.12.4/rpc-transport/rdma.so(+0x467f)[0x7f39fe8e867f]
/usr/lib64/glusterfs/3.12.4/rpc-transport/rdma.so(+0x48af)[0x7f39fe8e88af]
/usr/lib64/glusterfs/3.12.4/rpc-transport/rdma.so(__gf_rdma_do_gf_rdma_write+0x7d)[0x7f39fe8ec84d]
/usr/lib64/glusterfs/3.12.4/rpc-transport/rdma.so(__gf_rdma_send_reply_type_msg+0x184)[0x7f39fe8ecee4]
/usr/lib64/glusterfs/3.12.4/rpc-transport/rdma.so(__gf_rdma_ioq_churn_reply+0x128)[0x7f39fe8ed418]
/usr/lib64/glusterfs/3.12.4/rpc-transport/rdma.so(__gf_rdma_ioq_churn_entry+0x85)[0x7f39fe8ed665]
/usr/lib64/glusterfs/3.12.4/rpc-transport/rdma.so(+0xa2e0)[0x7f39fe8ee2e0]
/usr/lib64/glusterfs/3.12.4/rpc-transport/rdma.so(gf_rdma_submit_reply+0x96)[0x7f39fe8ee9b6]
/lib64/libgfrpc.so.0(rpcsvc_transport_submit+0x82)[0x7f3a13a832b2]
/lib64/libgfrpc.so.0(rpcsvc_submit_generic+0x180)[0x7f3a13a84f00]
/usr/lib64/glusterfs/3.12.4/xlator/protocol/server.so(+0x91cc)[0x7f39fef0b1cc]
/usr/lib64/glusterfs/3.12.4/xlator/protocol/server.so(+0x206f4)[0x7f39fef226f4]
/usr/lib64/glusterfs/3.12.4/xlator/debug/io-stats.so(+0x153a3)[0x7f39ff37a3a3]
/lib64/libglusterfs.so.0(default_readv_cbk+0x17b)[0x7f3a13d43deb]
/usr/lib64/glusterfs/3.12.4/xlator/features/upcall.so(+0x6cd1)[0x7f3a043e2cd1]
/usr/lib64/glusterfs/3.12.4/xlator/features/leases.so(+0x2a6b)[0x7f3a045f9a6b]
/usr/lib64/glusterfs/3.12.4/xlator/features/locks.so(+0x1025e)[0x7f3a04c3325e]
/usr/lib64/glusterfs/3.12.4/xlator/features/changetimerecorder.so(+0xdc14)[0x7f3a05b8bc14]
/usr/lib64/glusterfs/3.12.4/xlator/storage/posix.so(+0xf3d4)[0x7f3a065d93d4]
/lib64/libglusterfs.so.0(default_readv+0xe1)[0x7f3a13d40101]
/usr/lib64/glusterfs/3.12.4/xlator/features/changetimerecorder.so(+0x8daf)[0x7f3a05b86daf]
/lib64/libglusterfs.so.0(default_readv+0xe1)[0x7f3a13d40101]
/usr/lib64/glusterfs/3.12.4/xlator/features/bitrot-stub.so(+0xd9b1)[0x7f3a050749b1]
/lib64/libglusterfs.so.0(default_readv+0xe1)[0x7f3a13d40101]
/usr/lib64/glusterfs/3.12.4/xlator/features/locks.so(+0x190ec)[0x7f3a04c3c0ec]
/lib64/libglusterfs.so.0(default_readv+0xe1)[0x7f3a13d40101]
/lib64/libglusterfs.so.0(default_readv+0xe1)[0x7f3a13d40101]
/usr/lib64/glusterfs/3.12.4/xlator/features/leases.so(+0x6531)[0x7f3a045fd531]
/usr/lib64/glusterfs/3.12.4/xlator/features/upcall.so(+0x101ba)[0x7f3a043ec1ba]
/lib64/libglusterfs.so.0(default_readv_resume+0x1f3)[0x7f3a13d5a9c3]
/lib64/libglusterfs.so.0(call_resume_wind+0x2da)[0x7f3a13ce78ca]
/lib64/libglusterfs.so.0(call_resume+0x75)[0x7f3a13ce7df5]
/usr/lib64/glusterfs/3.12.4/xlator/performance/io-threads.so(+0x4de4)[0x7f3a041d5de4]
/lib64/libpthread.so.0(+0x7e25)[0x7f3a12b22e25]
/lib64/libc.so.6(clone+0x6d)[0x7f3a123ef34d]
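
If a core file from the crashed brick process is available, a fully symbolized trace for all threads can be pulled with gdb (a sketch; /path/to/core is a placeholder, and debuginfo is assumed to be installed):

# Dump backtraces for every thread of the crashed brick process.
# /path/to/core is a placeholder for the actual core file.
gdb --batch -ex 'thread apply all bt' /usr/sbin/glusterfsd /path/to/core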

Comment 1 Jiri Lunacek 2018-01-29 14:02:46 UTC
Related? https://bugzilla.redhat.com/show_bug.cgi?id=1525850

Comment 2 Shyamsundar 2018-10-23 14:54:05 UTC
Release 3.12 has been EOL'd and this bug was still found to be in the NEW state; hence, moving the version to mainline so it can be triaged and appropriate action taken.

Comment 3 Amar Tumballi 2019-06-17 11:02:55 UTC
Jiri,

Apologies for the delay.

Thanks for the report, but we are not able to look into the RDMA section
actively, and are seriously considering dropping it from active support.

More on this @
https://lists.gluster.org/pipermail/gluster-devel/2018-July/054990.html


> ‘RDMA’ transport support:
> 
> Gluster started supporting RDMA while ib-verbs was still new, and the very high-end infrastructure of that time used Infiniband. Engineers worked
> with Mellanox and got the technology into GlusterFS for faster data migration and data copying. Current-day kernels achieve very good speed with
> the IPoIB module itself, and there is no more bandwidth for experts in this area to maintain the feature, so we recommend migrating your volume
> to a TCP (IP-based) network.
> 
> If you are successfully using the RDMA transport, do get in touch with us so we can prioritize the migration plan for your volume. The plan is to
> work on this after the release, so that by version 6.0 we will have cleaner transport code, which needs to support only one type.
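
For anyone planning that migration, the transport of an existing volume can be switched with the config.transport volume option (a sketch; "myvol" is a placeholder volume name, and the volume must be stopped first):

# Switch an existing volume from RDMA to TCP transport ("myvol" is a placeholder).
gluster volume stop myvol
gluster volume set myvol config.transport tcp
gluster volume start myvol
# Clients must then remount the volume without transport=rdma.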