Description of problem:
GlusterFS version 3.13.3 crashes with a segmentation fault in xdr_gf_dump_req on Gentoo Linux (this is the latest version on Gentoo). I think the bug is not in xdr_gf_dump_req itself; rather, it is called with wrong arguments. A further problem is that glusterfs version 3.13.3 is the only version of glusterfs currently available in Gentoo, as the old ones (3.6.5) were removed from the repository for being vulnerable. This bug isn't present in GlusterFS version 3.6.5, which works.

Version-Release number of selected component (if applicable):
3.13.3 on Gentoo Linux

How reproducible:
Install glusterfs version 3.13.3 on Gentoo Linux.

Steps to Reproduce:
1. emerge =sys-cluster/glusterfs-3.13.3
2. /etc/init.d/glusterd restart

Actual results:
glusterd is killed with SIGSEGV

Expected results:
glusterd starts

Additional info:
Gentoo package info page: https://packages.gentoo.org/packages/sys-cluster/glusterfs
Initial post: https://twitter.com/EZscheile/status/934595665283428354
Archive of coredump, strace and gdb backtrace: http://ezscheile.bplaced.net/glusterd-segv-pack.tar.gz

Backtrace:
#0 __GI_xdr_uint64_t (xdrs=0x7fda46ac5b20, uip=0x7fda46ac5c60) at xdr_intXX_t.c:71
#1 0x00007fda504e6a29 in xdr_gf_dump_req (xdrs=<optimized out>, objp=<optimized out>) at rpc-common-xdr.c:167
#2 0x00007fda5070fa83 in xdr_sizeof () from /lib64/libtirpc.so.3
#3 0x00007fda4a9057aa in glusterd_submit_request (rpc=0x1495450, req=req@entry=0x7fda46ac5c60, frame=frame@entry=0x7fda38001ec0, prog=prog@entry=0x7fda4ac4e2c0 <glusterd_dump_prog>, procnum=procnum@entry=1, iobref=iobref@entry=0x0, this=0x142a680, cbkfn=0x7fda4a942040 <glusterd_peer_dump_version_cbk>, xdrproc=0x7fda504e6a20 <xdr_gf_dump_req>) at glusterd-utils.c:428
#4 0x00007fda4a9473ca in glusterd_peer_dump_version (this=this@entry=0x142a680, rpc=rpc@entry=0x1495450, peerctx=peerctx@entry=0x1494400) at glusterd-handshake.c:2319
#5 0x00007fda4a8ed516 in __glusterd_peer_rpc_notify (rpc=rpc@entry=0x1495450, mydata=mydata@entry=0x1494400, event=event@entry=RPC_CLNT_CONNECT, data=data@entry=0x0) at glusterd-handler.c:6295
#6 0x00007fda4a8e404d in glusterd_big_locked_notify (rpc=0x1495450, mydata=0x1494400, event=RPC_CLNT_CONNECT, data=0x0, notify_fn=0x7fda4a8ed200 <__glusterd_peer_rpc_notify>) at glusterd-handler.c:70
#7 0x00007fda50933f7c in rpc_clnt_notify (trans=<optimized out>, mydata=0x1495480, event=<optimized out>, data=0x1495680) at rpc-clnt.c:1004
#8 0x00007fda50930143 in rpc_transport_notify (this=this@entry=0x1495680, event=event@entry=RPC_TRANSPORT_CONNECT, data=data@entry=0x1495680) at rpc-transport.c:538
#9 0x00007fda47954f8f in socket_connect_finish (this=this@entry=0x1495680) at socket.c:2404
#10 0x00007fda47959511 in socket_event_handler (fd=fd@entry=13, idx=idx@entry=4, gen=gen@entry=1, data=data@entry=0x1495680, poll_in=0, poll_out=4, poll_err=0) at socket.c:2456
#11 0x00007fda50bc23da in event_dispatch_epoll_handler (event=0x7fda46ac5e7c, event_pool=0x1417770) at event-epoll.c:583
#12 event_dispatch_epoll_worker (data=0x1496e60) at event-epoll.c:659
#13 0x00007fda500b7839 in start_thread (arg=0x7fda46ac6700) at pthread_create.c:456
#14 0x00007fda4fdf5adf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

XDRS x_ops:
*(xdrs->x_ops) = {x_getlong = 0x7fda5070f900, x_putlong = 0x7fda5070f880, x_getbytes = 0x7fda5070f900, x_putbytes = 0x7fda5070f8a0, x_getpostn = 0x7fda5070f8c0, x_setpostn = 0x7fda5070f8e0, x_inline = 0x7fda5070f960, x_destroy = 0x7fda5070f920, x_getint32 = 0x0, x_putint32 = 0x165296f147c52f00}
Possibly related to: https://bugs.gentoo.org/635172
Created attachment 1360977: output of `emerge --info`
compiled sources (with applied gentoo patches): http://ezscheile.bplaced.net/glusterd-segv-work.tar.gz
libtirpc version: 1.0.2-r1 (Gentoo). There might be a bug in xdr_sizeof, which sets up the x_ops structure but doesn't set x_ops->x_getint32.
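To illustrate the suspicion above: a callback table that lives on the stack and is only partly filled in leaves the remaining function pointers indeterminate, and calling through one of them is a wild jump. The following is only a sketch of a defensive pattern, not the libtirpc or glibc code. `struct demo_ops` and every `demo_*` name are hypothetical stand-ins, and the three-member layout is an assumption made for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an XDR-style ops table (not the real layout). */
struct demo_ops {
    int (*put_long)(long value);
    int (*put_int32)(uint32_t value);
    int (*get_int32)(uint32_t *out);
};

/* Counts the bytes that would have been encoded, as a sizing pass does. */
static size_t demo_bytes_counted;

static int demo_put_long(long value)
{
    (void)value;
    demo_bytes_counted += 4;
    return 1;
}

static int demo_put_int32(uint32_t value)
{
    (void)value;
    demo_bytes_counted += 4;
    return 1;
}

/*
 * Fill the table defensively: reset every member to a null pointer first,
 * then assign only the callbacks that are actually provided. Anything not
 * assigned stays NULL instead of holding leftover stack contents.
 */
static void demo_ops_init(struct demo_ops *ops, int provide_put_int32)
{
    assert(ops != NULL);
    *ops = (struct demo_ops){0};
    ops->put_long = demo_put_long;
    if (provide_put_int32)
        ops->put_int32 = demo_put_int32;
}

/*
 * Encode a 64-bit value as two 32-bit halves. A missing callback is
 * reported as failure (0) rather than being called.
 */
static int demo_encode_uint64(const struct demo_ops *ops, uint64_t value)
{
    uint32_t hi = (uint32_t)(value >> 32);
    uint32_t lo = (uint32_t)value;

    if (ops->put_int32 == NULL)
        return 0;
    return ops->put_int32(hi) && ops->put_int32(lo);
}
```

With the table zeroed first, `demo_encode_uint64` returns 0 when `put_int32` was never provided, and counts 8 bytes when it was. Whether the real xdr_sizeof should zero its ops structure or set the int32 members explicitly is a separate question for the library maintainers.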
Relevant code snippets

parts from glibc xdr_intXX_t.c:
---- __GI_xdr_uint64_t
/* XDR 64bit integers */
bool_t
xdr_int64_t (XDR *xdrs, int64_t *ip)
{
  int32_t t1, t2;

  switch (xdrs->x_op)
    {
    case XDR_ENCODE:
      t1 = (int32_t) ((*ip) >> 32);
      t2 = (int32_t) (*ip);
      return (XDR_PUTINT32(xdrs, &t1) && XDR_PUTINT32(xdrs, &t2));
    case XDR_DECODE:
      /*** SEGFAULT HERE ***/
      if (!XDR_GETINT32(xdrs, &t1) || !XDR_GETINT32(xdrs, &t2))
        return FALSE;
      *ip = ((int64_t) t1) << 32;
      *ip |= (uint32_t) t2; /* Avoid sign extension. */
      return TRUE;
    case XDR_FREE:
      return TRUE;
    default:
      return FALSE;
    }
}
libc_hidden_nolink_sunrpc (xdr_int64_t, GLIBC_2_1_1)

bool_t
xdr_quad_t (XDR *xdrs, quad_t *ip)
{
  return xdr_int64_t (xdrs, (int64_t *) ip);
}
libc_hidden_nolink_sunrpc (xdr_quad_t, GLIBC_2_3_4)
----

parts from libtirpc-1.0.2/src/ glusterfs-3.12.3/contrib/sunrpc/ xdr_sizeof.c
---- xdr_sizeof
unsigned long
xdr_sizeof (xdrproc_t func, void *data)
{
  XDR x;
  struct xdr_ops ops;
  bool_t stat;
#ifdef GF_DARWIN_HOST_OS
  typedef bool_t (*dummyfunc1) (XDR *, int *);
#else
  typedef bool_t (*dummyfunc1) (XDR *, long *);
#endif
  typedef bool_t (*dummyfunc2) (XDR *, caddr_t, u_int);

  ops.x_putlong = x_putlong;
  ops.x_putbytes = x_putbytes;
  ops.x_inline = x_inline;
  ops.x_getpostn = x_getpostn;
  ops.x_setpostn = x_setpostn;
  ops.x_destroy = x_destroy;

  /* the other harmless ones */
  ops.x_getlong = (dummyfunc1) harmless;
  ops.x_getbytes = (dummyfunc2) harmless;
  /*** ops.x_getint32 NOT SET ***/

  x.x_op = XDR_ENCODE;
  x.x_ops = &ops;
  x.x_handy = 0;
  x.x_private = (caddr_t) NULL;
  x.x_base = (caddr_t) 0;

  stat = func (&x, data, 0);
  if (x.x_private)
    free (x.x_private);
  return (stat == TRUE ? (unsigned) x.x_handy : 0);
}
----
NOTE: glibc source taken from https://code.woboq.org/userspace/glibc/sunrpc/xdr_intXX_t.c.html
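For readers unfamiliar with the 64-bit handling in the glibc excerpt: the value is carried as two 32-bit halves, and the low half must be treated as unsigned when the halves are recombined, otherwise a set top bit in the low word would smear ones over the high word. A minimal standalone sketch of that split and join follows. The `demo_*` names are hypothetical and not part of glibc or libtirpc, and the narrowing casts assume the usual two's-complement wraparound of common compilers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split a 64-bit value into high and low 32-bit words. */
static void demo_split_int64(int64_t value, int32_t *hi, int32_t *lo)
{
    assert(hi != NULL && lo != NULL);
    uint64_t bits = (uint64_t)value;
    *hi = (int32_t)(uint32_t)(bits >> 32);
    *lo = (int32_t)(uint32_t)bits;
}

/*
 * Recombine the halves. Both words go through uint32_t first, so a
 * negative low word does not sign-extend into the high 32 bits.
 */
static int64_t demo_join_int64(int32_t hi, int32_t lo)
{
    uint64_t bits = ((uint64_t)(uint32_t)hi << 32) | (uint32_t)lo;
    return (int64_t)bits;
}
```

For example, joining hi = 1 and lo = -1 gives 0x1FFFFFFFF. Without the unsigned cast on the low word, the OR would set every high bit and the result would be -1.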
glusterd compiled without libtirpc works.
This bug only happens when glusterfs-3.12.3 is compiled against libtirpc-1.0.2-r1. It works with libtirpc-1.0.1-r1. So this is a bug in libtirpc.
see https://bugzilla.redhat.com/show_bug.cgi?id=1521004#c1
I contributed the patch for explicitly using libtirpc to master and backported it to 3.12.3 for Gentoo. I had been under the impression that libtirpc is just a drop-in replacement for glibc's RPC, but after investigating this report, I have found that it segfaults unless you give --with-ipv6-default, which is new in 3.13.0. More specifically, the crash is avoided if I change addr_family in rpc_transport_inet_options_build (rpc/rpc-lib/src/rpc-transport.c) from inet to inet6.

I don't understand why the former causes a crash. Is it not possible to use libtirpc without IPv6? Our libtirpc package allows you to build it with IPv6 support disabled, though this doesn't seem to make any difference to the crash. Do I also have to change the other instances of inet to inet6, or is the new --with-libtirpc flag effectively redundant because all the --with-ipv6-default code is actually required?

Sorry for being slightly clueless here. I have used Gluster occasionally but I'm not the official Gentoo package maintainer, just a dev who thought he'd give the package some attention. I'm not yet familiar with IPv6 either. I am concerned that the flag I've added to master is effectively only good for causing segfaults, so I'd like to resolve this before it ends up in a release.

CC'ing Kevin Vigor because he added the --with-ipv6-default flag.
Backlink to a new backtrace, which happens with the patch applied: https://bugs.gentoo.org/639838#c16

The bug is probably in the part that uses RPC/XDR to communicate with peers, and it depends on libtirpc, the availability of peers, and IPv4/IPv6.
Following Erik's testing, we have determined that changing rpc_transport_inet_options_build alone is not sufficient. Still looking for guidance here.
REVIEW: https://review.gluster.org/19334 (build: Fix redefinitions when using libtirpc without IPv6 by default) posted (#1) for review on master by James Le Cuirot
I initially thought the above commit would fix this issue. It doesn't but it is still related.
This is actually fixed by https://review.gluster.org/#/c/19330.
This update is done in bulk based on the state of the patch and the time since last activity. If the issue is still seen, please reopen the bug.