Bug 1539680 - RDMA transport bricks crash
Summary: RDMA transport bricks crash
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: GlusterFS
Classification: Community
Component: rdma
Version: mainline
Hardware: x86_64
OS: Linux
Priority: low
Severity: urgent
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-01-29 12:28 UTC by Jiri Lunacek
Modified: 2019-06-17 11:02 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-17 11:02:55 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
brick volume file (5.07 KB, text/plain)
2018-01-29 12:28 UTC, Jiri Lunacek

Description Jiri Lunacek 2018-01-29 12:28:24 UTC
Created attachment 1387736 [details]
brick volume file

Description of problem:
RDMA transport bricks crash

Version-Release number of selected component (if applicable):
glusterfs-client-xlators-3.12.4-1.el7.x86_64
glusterfs-api-3.12.4-1.el7.x86_64
glusterfs-rdma-3.12.4-1.el7.x86_64
glusterfs-libs-3.12.4-1.el7.x86_64
glusterfs-cli-3.12.4-1.el7.x86_64
glusterfs-server-3.12.4-1.el7.x86_64
glusterfs-fuse-3.12.4-1.el7.x86_64
glusterfs-3.12.4-1.el7.x86_64
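
(For reference, a package list like the one above can be gathered on the affected nodes with an rpm query; the exact command used here is an assumption:

    rpm -qa | grep glusterfs | sort
)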

How reproducible:
We are experiencing crashes of glusterfsd on bricks of a replicated RDMA volume.
It may be of interest that the two replicated bricks failed with the same error within several minutes of each other, and both have been running again since they were restarted.
This happens approximately every 5 or 6 days.

/usr/lib64/glusterfs/3.12.4/rpc-transport/rdma.so(+0x467f) is always on top of the stack.

Backtrace:
pending frames:
frame : type(0) op(12)
frame : type(0) op(12)
frame : type(0) op(29)
frame : type(0) op(37)
frame : type(0) op(29)
frame : type(0) op(29)
frame : type(0) op(29)
frame : type(0) op(29)
frame : type(0) op(29)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2018-01-29 11:58:11
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.4
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xa0)[0x7f3a13cc3500]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f3a13ccd434]
/lib64/libc.so.6(+0x35270)[0x7f3a1232c270]
/usr/lib64/glusterfs/3.12.4/rpc-transport/rdma.so(+0x467f)[0x7f39fe8e867f]
/usr/lib64/glusterfs/3.12.4/rpc-transport/rdma.so(+0x48af)[0x7f39fe8e88af]
/usr/lib64/glusterfs/3.12.4/rpc-transport/rdma.so(__gf_rdma_do_gf_rdma_write+0x7d)[0x7f39fe8ec84d]
/usr/lib64/glusterfs/3.12.4/rpc-transport/rdma.so(__gf_rdma_send_reply_type_msg+0x184)[0x7f39fe8ecee4]
/usr/lib64/glusterfs/3.12.4/rpc-transport/rdma.so(__gf_rdma_ioq_churn_reply+0x128)[0x7f39fe8ed418]
/usr/lib64/glusterfs/3.12.4/rpc-transport/rdma.so(__gf_rdma_ioq_churn_entry+0x85)[0x7f39fe8ed665]
/usr/lib64/glusterfs/3.12.4/rpc-transport/rdma.so(+0xa2e0)[0x7f39fe8ee2e0]
/usr/lib64/glusterfs/3.12.4/rpc-transport/rdma.so(gf_rdma_submit_reply+0x96)[0x7f39fe8ee9b6]
/lib64/libgfrpc.so.0(rpcsvc_transport_submit+0x82)[0x7f3a13a832b2]
/lib64/libgfrpc.so.0(rpcsvc_submit_generic+0x180)[0x7f3a13a84f00]
/usr/lib64/glusterfs/3.12.4/xlator/protocol/server.so(+0x91cc)[0x7f39fef0b1cc]
/usr/lib64/glusterfs/3.12.4/xlator/protocol/server.so(+0x206f4)[0x7f39fef226f4]
/usr/lib64/glusterfs/3.12.4/xlator/debug/io-stats.so(+0x153a3)[0x7f39ff37a3a3]
/lib64/libglusterfs.so.0(default_readv_cbk+0x17b)[0x7f3a13d43deb]
/usr/lib64/glusterfs/3.12.4/xlator/features/upcall.so(+0x6cd1)[0x7f3a043e2cd1]
/usr/lib64/glusterfs/3.12.4/xlator/features/leases.so(+0x2a6b)[0x7f3a045f9a6b]
/usr/lib64/glusterfs/3.12.4/xlator/features/locks.so(+0x1025e)[0x7f3a04c3325e]
/usr/lib64/glusterfs/3.12.4/xlator/features/changetimerecorder.so(+0xdc14)[0x7f3a05b8bc14]
/usr/lib64/glusterfs/3.12.4/xlator/storage/posix.so(+0xf3d4)[0x7f3a065d93d4]
/lib64/libglusterfs.so.0(default_readv+0xe1)[0x7f3a13d40101]
/usr/lib64/glusterfs/3.12.4/xlator/features/changetimerecorder.so(+0x8daf)[0x7f3a05b86daf]
/lib64/libglusterfs.so.0(default_readv+0xe1)[0x7f3a13d40101]
/usr/lib64/glusterfs/3.12.4/xlator/features/bitrot-stub.so(+0xd9b1)[0x7f3a050749b1]
/lib64/libglusterfs.so.0(default_readv+0xe1)[0x7f3a13d40101]
/usr/lib64/glusterfs/3.12.4/xlator/features/locks.so(+0x190ec)[0x7f3a04c3c0ec]
/lib64/libglusterfs.so.0(default_readv+0xe1)[0x7f3a13d40101]
/lib64/libglusterfs.so.0(default_readv+0xe1)[0x7f3a13d40101]
/usr/lib64/glusterfs/3.12.4/xlator/features/leases.so(+0x6531)[0x7f3a045fd531]
/usr/lib64/glusterfs/3.12.4/xlator/features/upcall.so(+0x101ba)[0x7f3a043ec1ba]
/lib64/libglusterfs.so.0(default_readv_resume+0x1f3)[0x7f3a13d5a9c3]
/lib64/libglusterfs.so.0(call_resume_wind+0x2da)[0x7f3a13ce78ca]
/lib64/libglusterfs.so.0(call_resume+0x75)[0x7f3a13ce7df5]
/usr/lib64/glusterfs/3.12.4/xlator/performance/io-threads.so(+0x4de4)[0x7f3a041d5de4]
/lib64/libpthread.so.0(+0x7e25)[0x7f3a12b22e25]
/lib64/libc.so.6(clone+0x6d)[0x7f3a123ef34d]
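
Note for later triage: the two anonymous rdma.so frames at the top of the stack can be resolved to function names and source lines with binutils once matching debug symbols are available. A minimal sketch, assuming the glusterfs debuginfo matching 3.12.4-1.el7 is installed (e.g. via debuginfo-install from yum-utils); the offsets are taken verbatim from the trace above:

    # pull in debug symbols for the installed glusterfs packages (assumes yum-utils)
    debuginfo-install glusterfs

    # map the unnamed frames (+0x467f, +0x48af) to function and file:line
    addr2line -f -C -e /usr/lib64/glusterfs/3.12.4/rpc-transport/rdma.so 0x467f 0x48af

Without the debuginfo package, addr2line will typically report ??:0 for these frames, since the symbols are not exported from the shared object.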

Comment 1 Jiri Lunacek 2018-01-29 14:02:46 UTC
Related? https://bugzilla.redhat.com/show_bug.cgi?id=1525850

Comment 2 Shyamsundar 2018-10-23 14:54:05 UTC
Release 3.12 has been EOLed and this bug was still in the NEW state, hence the version is being moved to mainline so it can be triaged and appropriate action taken.

Comment 3 Amar Tumballi 2019-06-17 11:02:55 UTC
Jiri,

Apologies for the delay.

Thanks for the report, but we are not able to actively look into the RDMA code,
and are seriously considering dropping it from active support.

More on this @
https://lists.gluster.org/pipermail/gluster-devel/2018-July/054990.html


> ‘RDMA’ transport support:
> 
> Gluster started supporting RDMA when ib-verbs was still new, and the very high-end infrastructure of that time was using InfiniBand. Engineers worked
> with Mellanox and got the technology into GlusterFS for faster data migration and data copying. Current-day kernels achieve very good speed with the
> IPoIB module itself, and there is no more bandwidth for experts in this area to maintain the feature, so we recommend migrating your volume to a TCP (IP
> based) network.
> 
> If you are successfully using the RDMA transport, do get in touch with us so we can prioritize the migration plan for your volume. The plan is to work on this
> after the release, so that by version 6.0 we will have cleaner transport code which needs to support only one type.
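
For anyone planning the suggested move off RDMA: the usual approach is to switch the volume's config.transport option to tcp while the volume is stopped, then remount clients over TCP. A minimal sketch (the volume name "myvol", server "server1", and mount point are placeholders; check the documentation for your release before running this against production data):

    gluster volume stop myvol
    gluster volume set myvol config.transport tcp
    gluster volume start myvol

    # clients then remount without transport=rdma, e.g.
    mount -t glusterfs -o transport=tcp server1:/myvol /mnt/myvol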

