Description of problem:

Best described in the mail chain copied below:
http://www.gluster.org/pipermail/gluster-devel/2015-February/043958.html

On 02/23/2015 01:58 PM, Justin Clift wrote:
> Short version:
>
> 75% of the Jenkins regression tests we run in Rackspace (on
> glusterfs master branch) fail from spurious errors.
>
> This is why we're having capacity problems with our Jenkins
> slave nodes... we need to run our tests 4x for each CR just
> to get a potentially valid result. :/
>
>
> Longer version:
>
> Ran some regression test runs (20) on git master head over the
> weekend, to better understand our spurious failure situation.
>
> 75% of the regression runs failed in various ways. Oops.
>
> The failures:
>
> * 5 x tests/bugs/fuse/bug-1126048.t
>   Failed test: 10
>
> * 3 x tests/bugs/quota/bug-1087198.t
>   Failed test: 18
>
> * 3 x tests/performance/open-behind.t
>   Failed test: 17
>
> * 2 x tests/bugs/geo-replication/bug-877293.t
>   Failed test: 11
>
> * 2 x tests/basic/afr/split-brain-heal-info.t
>   Failed tests: 20-41
>
> * 1 x tests/bugs/distribute/bug-1117851.t
>   Failed test: 15
>
> * 1 x tests/basic/uss.t
>   Failed test: 26
>
> * 1 x hung on tests/bugs/posix/bug-1113960.t
>   No idea which test it was on. Left it running
>   several hours, then killed the VM along with the rest.
>
> 4 of the regression runs also created coredumps. Uploaded the
> archived_builds and logs here:
>
>   http://mirror.salasaga.org/gluster/
>
> (are those useful?)

Yes, these are useful: they contain a very similar crash in each of the cores, so we could be looking at a single problem to fix here.

Here is a short update on the core. At a broad level, cleanup_and_exit is racing with a list deletion in the following 2 threads.
Those interested can download and extract the tarballs from the link provided (ex: http://mirror.salasaga.org/gluster/bulkregression12/archived_builds/build-install-20150222%3a19%3a58%3a21.tar.bz2) and, from the root of the extracted tarball, run:

  gdb -ex 'set sysroot ./' -ex 'core-file ./build/install/cores/core.28008' ./build/install/sbin/glusterfsd

to look at the details from the core dump.

Core was generated by `/build/install/sbin/glusterfsd -s bulkregression12.localdomain --volfile-id pat'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fd84a1340de in list_del_init (old=0x7fd834000d50) at /root/glusterfs/libglusterfs/src/list.h:88
88      /root/glusterfs/libglusterfs/src/list.h: No such file or directory.

1) The list deletion that generates the core:

(gdb) bt
#0  0x00007fd84a1340de in list_del_init (old=0x7fd834000d50) at /root/glusterfs/libglusterfs/src/list.h:88
#1  0x00007fd84a1352ae in pl_inodelk_client_cleanup (this=0x7fd84400b7e0, ctx=0x7fd834000b50) at /root/glusterfs/xlators/features/locks/src/inodelk.c:471
#2  0x00007fd84a131805 in pl_client_disconnect_cbk (this=0x7fd84400b7e0, client=0x7fd83c002fd0) at /root/glusterfs/xlators/features/locks/src/posix.c:2563
#3  0x00007fd85bd52139 in gf_client_disconnect (client=0x7fd83c002fd0) at /root/glusterfs/libglusterfs/src/client_t.c:393
#4  0x00007fd849262296 in server_connection_cleanup (this=0x7fd844014350, client=0x7fd83c002fd0, flags=3) at /root/glusterfs/xlators/protocol/server/src/server-helpers.c:353
#5  0x00007fd84925dcca in server_rpc_notify (rpc=0x7fd844023b70, xl=0x7fd844014350, event=RPCSVC_EVENT_DISCONNECT, data=0x7fd83c001440) at /root/glusterfs/xlators/protocol/server/src/server.c:532
#6  0x00007fd85baaa021 in rpcsvc_handle_disconnect (svc=0x7fd844023b70, trans=0x7fd83c001440) at /root/glusterfs/rpc/rpc-lib/src/rpcsvc.c:741
#7  0x00007fd85baaa1ba in rpcsvc_notify (trans=0x7fd83c001440, mydata=0x7fd844023b70, event=RPC_TRANSPORT_DISCONNECT, data=0x7fd83c001440) at /root/glusterfs/rpc/rpc-lib/src/rpcsvc.c:779
#8  0x00007fd85baaf4a4 in rpc_transport_notify (this=0x7fd83c001440, event=RPC_TRANSPORT_DISCONNECT, data=0x7fd83c001440) at /root/glusterfs/rpc/rpc-lib/src/rpc-transport.c:543
#9  0x00007fd850c8fbc0 in socket_event_poll_err (this=0x7fd83c001440) at /root/glusterfs/rpc/rpc-transport/socket/src/socket.c:1185
#10 0x00007fd850c9457e in socket_event_handler (fd=14, idx=5, data=0x7fd83c001440, poll_in=1, poll_out=0, poll_err=0) at /root/glusterfs/rpc/rpc-transport/socket/src/socket.c:2386
#11 0x00007fd85bd55333 in event_dispatch_epoll_handler (event_pool=0x1d835d0, event=0x7fd84b5a9e70) at /root/glusterfs/libglusterfs/src/event-epoll.c:551
#12 0x00007fd85bd5561d in event_dispatch_epoll_worker (data=0x1db0790) at /root/glusterfs/libglusterfs/src/event-epoll.c:643
#13 0x00007fd85b24f9d1 in start_thread () from ./lib64/libpthread.so.0
#14 0x00007fd85abb98fd in clone () from ./lib64/libc.so.6

2) Parallel cleanup in progress (see frame #12, cleanup_and_exit):

Thread 12 (LWP 28010):
#0  0x00007f8620a31f48 in _nss_files_parse_servent () from ./lib64/libnss_files.so.2
#1  0x00007f8620a326b0 in _nss_files_getservbyport_r () from ./lib64/libnss_files.so.2
#2  0x00007f862b595c39 in getservbyport_r@@GLIBC_2.2.5 () from ./lib64/libc.so.6
#3  0x00007f862b59c536 in getnameinfo () from ./lib64/libc.so.6
#4  0x00007f862c6beb64 in gf_resolve_ip6 (hostname=0x1702860 "bulkregression16.localdomain", port=24007, family=2, dnscache=0x1715748, addr_info=0x7f861b662930) at /root/glusterfs/libglusterfs/src/common-utils.c:240
#5  0x00007f86220594c3 in af_inet_client_get_remote_sockaddr (this=0x17156d0, sockaddr=0x7f861b662a10, sockaddr_len=0x7f861b662aa8) at /root/glusterfs/rpc/rpc-transport/socket/src/name.c:238
#6  0x00007f8622059eba in socket_client_get_remote_sockaddr (this=0x17156d0, sockaddr=0x7f861b662a10, sockaddr_len=0x7f861b662aa8, sa_family=0x7f861b662aa6) at /root/glusterfs/rpc/rpc-transport/socket/src/name.c:496
#7  0x00007f8622055c1b in socket_connect (this=0x17156d0, port=0) at /root/glusterfs/rpc/rpc-transport/socket/src/socket.c:2914
#8  0x00007f862c46dfe1 in rpc_transport_connect (this=0x17156d0, port=0) at /root/glusterfs/rpc/rpc-lib/src/rpc-transport.c:426
#9  0x00007f862c473655 in rpc_clnt_submit (rpc=0x1713c80, prog=0x614620 <clnt_pmap_prog>, procnum=5, cbkfn=0x40f0e9 <mgmt_pmap_signout_cbk>, proghdr=0x7f861b662cf0, proghdrcount=1, progpayload=0x0, progpayloadcount=0, iobref=0x7f85fc000f60, frame=0x7f862a513de0, rsphdr=0x0, rsphdr_count=0, rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0) at /root/glusterfs/rpc/rpc-lib/src/rpc-clnt.c:1554
#10 0x000000000040d725 in mgmt_submit_request (req=0x7f861b663d60, frame=0x7f862a513de0, ctx=0x16cb010, prog=0x614620 <clnt_pmap_prog>, procnum=5, cbkfn=0x40f0e9 <mgmt_pmap_signout_cbk>, xdrproc=0x4048d0 <xdr_pmap_signout_req@plt>) at /root/glusterfs/glusterfsd/src/glusterfsd-mgmt.c:1445
#11 0x000000000040f38d in glusterfs_mgmt_pmap_signout (ctx=0x16cb010) at /root/glusterfs/glusterfsd/src/glusterfsd-mgmt.c:2258
#12 0x0000000000407903 in cleanup_and_exit (signum=15) at /root/glusterfs/glusterfsd/src/glusterfsd.c:1201
#13 0x0000000000408ecf in glusterfs_sigwaiter (arg=0x7fff49a90520) at /root/glusterfs/glusterfsd/src/glusterfsd.c:1761
#14 0x00007f862bc0e9d1 in start_thread () from ./lib64/libpthread.so.0
#15 0x00007f862b5788fd in clone () from ./lib64/libc.so.6

> We should probably concentrate on fixing the most common
> spurious failures soon, and look into the less common ones
> later on.
>
> I'll do some runs on release-3.6 soon too, as I suspect that'll
> be useful.
>
> + Justin

This regression testing was done on the master branch as of the date specified above.
REVIEW: http://review.gluster.org/10167 (tests: remove tests for clear-locks) posted (#1) for review on master by Jeff Darcy (jdarcy)
COMMIT: http://review.gluster.org/10167 committed in master by Vijay Bellur (vbellur)
------
commit 0086a55bb7de1ef5dc7a24583f5fc2b560e835fd
Author: Jeff Darcy <jdarcy>
Date:   Wed Apr 8 17:17:13 2015 -0400

    tests: remove tests for clear-locks

    These are suspected of causing core dumps during regression tests,
    leading to spurious failures. Per email conversation, since this
    isn't a supported feature anyway, the tests are being removed to
    facilitate testing of features we do support.

    Change-Id: I7fd5c76d26dd6c3ffa91f89fc10469ae3a63afdf
    BUG: 1195415
    Signed-off-by: Jeff Darcy <jdarcy>
    Reviewed-on: http://review.gluster.org/10167
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Kaleb KEITHLEY <kkeithle>
    Reviewed-by: Vijay Bellur <vbellur>
Pranith pointed out this may be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1184417.
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user