Bug 1195415 - glusterfsd core dumps when cleanup and socket disconnect routines race
Summary: glusterfsd core dumps when cleanup and socket disconnect routines race
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: rpc
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: glusterfs-3.7.0 1219967
 
Reported: 2015-02-23 19:41 UTC by Shyamsundar
Modified: 2015-05-14 17:35 UTC
CC List: 4 users

Fixed In Version: glusterfs-3.7.0beta1
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1219967 (view as bug list)
Environment:
Last Closed: 2015-05-14 17:26:29 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Shyamsundar 2015-02-23 19:41:10 UTC
Description of problem:

Best described in the mail chain which is copied below:
http://www.gluster.org/pipermail/gluster-devel/2015-February/043958.html

On 02/23/2015 01:58 PM, Justin Clift wrote:
> Short version:
>
> 75% of the Jenkins regression tests we run in Rackspace (on
> glusterfs master branch) fail from spurious errors.
>
> This is why we're having capacity problems with our Jenkins
> slave nodes... we need to run our tests 4x for each CR just
> to get a potentially valid result. :/
>
>
> Longer version:
>
> Ran some regression test runs (20) on git master head over the
> weekend, to better understand our spurious failure situation.
>
> 75% of the regression runs failed in various ways.  Oops.
>
> The failures:
>
>    * 5 x tests/bugs/fuse/bug-1126048.t
>          Failed test:  10
>
>    * 3 x tests/bugs/quota/bug-1087198.t
>          Failed test:  18
>
>    * 3 x tests/performance/open-behind.t
>          Failed test:  17
>
>    * 2 x tests/bugs/geo-replication/bug-877293.t
>          Failed test:  11
>
>    * 2 x tests/basic/afr/split-brain-heal-info.t
>          Failed tests:  20-41
>
>    * 1 x tests/bugs/distribute/bug-1117851.t
>          Failed test:  15
>
>    * 1 x tests/basic/uss.t
>          Failed test:  26
>
>    * 1 x hung on tests/bugs/posix/bug-1113960.t
>
>          No idea which test it was on.  Left it running
>          several hours, then killed the VM along with the rest.
>
> 4 of the regression runs also created coredumps.  Uploaded the
> archived_builds and logs here:
>
>      http://mirror.salasaga.org/gluster/
>
> (are those useful?)

Yes, these are useful, as they contain a very similar crash in each of the cores, so we could be looking at a single problem to fix here. Here is a short update on the core: at a broad level, cleanup_and_exit is racing with a list deletion across the following two threads.

Those interested can download and extract the tarballs from the link provided (for example, http://mirror.salasaga.org/gluster/bulkregression12/archived_builds/build-install-20150222%3a19%3a58%3a21.tar.bz2)
and run "gdb -ex 'set sysroot ./' -ex 'core-file ./build/install/cores/core.28008' ./build/install/sbin/glusterfsd" from the root of the extracted tarball to look at the details of the core dump.

Core was generated by `/build/install/sbin/glusterfsd -s bulkregression12.localdomain --volfile-id pat'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fd84a1340de in list_del_init (old=0x7fd834000d50) at /root/glusterfs/libglusterfs/src/list.h:88
88      /root/glusterfs/libglusterfs/src/list.h: No such file or directory.
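
For context, list_del_init() is the kernel-style doubly linked list helper in libglusterfs/src/list.h. A rough sketch of what it does (paraphrased here, not copied verbatim from the tree) shows why it faults once the node or its neighbours have been freed or reinitialised by another thread:

struct list_head {
        struct list_head *next;
        struct list_head *prev;
};

static inline void
list_del_init (struct list_head *old)
{
        /* If another thread has already freed or reset the neighbouring
         * nodes, these dereferences touch invalid memory, which matches
         * the SIGSEGV reported at list.h above. */
        old->prev->next = old->next;
        old->next->prev = old->prev;

        old->next = old;
        old->prev = old;
}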

1) The list deletion that generates the core:

(gdb) bt
#0  0x00007fd84a1340de in list_del_init (old=0x7fd834000d50) at /root/glusterfs/libglusterfs/src/list.h:88
#1  0x00007fd84a1352ae in pl_inodelk_client_cleanup (this=0x7fd84400b7e0, ctx=0x7fd834000b50) at /root/glusterfs/xlators/features/locks/src/inodelk.c:471
#2  0x00007fd84a131805 in pl_client_disconnect_cbk (this=0x7fd84400b7e0, client=0x7fd83c002fd0) at /root/glusterfs/xlators/features/locks/src/posix.c:2563
#3  0x00007fd85bd52139 in gf_client_disconnect (client=0x7fd83c002fd0) at /root/glusterfs/libglusterfs/src/client_t.c:393
#4  0x00007fd849262296 in server_connection_cleanup (this=0x7fd844014350, client=0x7fd83c002fd0, flags=3) at /root/glusterfs/xlators/protocol/server/src/server-helpers.c:353
#5  0x00007fd84925dcca in server_rpc_notify (rpc=0x7fd844023b70, xl=0x7fd844014350, event=RPCSVC_EVENT_DISCONNECT, data=0x7fd83c001440) at /root/glusterfs/xlators/protocol/server/src/server.c:532
#6  0x00007fd85baaa021 in rpcsvc_handle_disconnect (svc=0x7fd844023b70, trans=0x7fd83c001440) at /root/glusterfs/rpc/rpc-lib/src/rpcsvc.c:741
#7  0x00007fd85baaa1ba in rpcsvc_notify (trans=0x7fd83c001440, mydata=0x7fd844023b70, event=RPC_TRANSPORT_DISCONNECT, data=0x7fd83c001440) at /root/glusterfs/rpc/rpc-lib/src/rpcsvc.c:779
#8  0x00007fd85baaf4a4 in rpc_transport_notify (this=0x7fd83c001440, event=RPC_TRANSPORT_DISCONNECT, data=0x7fd83c001440) at /root/glusterfs/rpc/rpc-lib/src/rpc-transport.c:543
#9  0x00007fd850c8fbc0 in socket_event_poll_err (this=0x7fd83c001440) at /root/glusterfs/rpc/rpc-transport/socket/src/socket.c:1185
#10 0x00007fd850c9457e in socket_event_handler (fd=14, idx=5, data=0x7fd83c001440, poll_in=1, poll_out=0, poll_err=0) at /root/glusterfs/rpc/rpc-transport/socket/src/socket.c:2386
#11 0x00007fd85bd55333 in event_dispatch_epoll_handler (event_pool=0x1d835d0, event=0x7fd84b5a9e70) at /root/glusterfs/libglusterfs/src/event-epoll.c:551
#12 0x00007fd85bd5561d in event_dispatch_epoll_worker (data=0x1db0790) at /root/glusterfs/libglusterfs/src/event-epoll.c:643
#13 0x00007fd85b24f9d1 in start_thread () from ./lib64/libpthread.so.0
#14 0x00007fd85abb98fd in clone () from ./lib64/libc.so.6

2) Parallel cleanup in progress (see frame #12, cleanup_and_exit):

Thread 12 (LWP 28010):
#0  0x00007f8620a31f48 in _nss_files_parse_servent () from ./lib64/libnss_files.so.2
#1  0x00007f8620a326b0 in _nss_files_getservbyport_r () from ./lib64/libnss_files.so.2
#2  0x00007f862b595c39 in getservbyport_r@@GLIBC_2.2.5 () from ./lib64/libc.so.6
#3  0x00007f862b59c536 in getnameinfo () from ./lib64/libc.so.6
#4  0x00007f862c6beb64 in gf_resolve_ip6 (hostname=0x1702860 "bulkregression16.localdomain", port=24007, family=2, dnscache=0x1715748, addr_info=0x7f861b662930) at /root/glusterfs/libglusterfs/src/common-utils.c:240
#5  0x00007f86220594c3 in af_inet_client_get_remote_sockaddr (this=0x17156d0, sockaddr=0x7f861b662a10, sockaddr_len=0x7f861b662aa8) at /root/glusterfs/rpc/rpc-transport/socket/src/name.c:238
#6  0x00007f8622059eba in socket_client_get_remote_sockaddr (this=0x17156d0, sockaddr=0x7f861b662a10, sockaddr_len=0x7f861b662aa8, sa_family=0x7f861b662aa6) at /root/glusterfs/rpc/rpc-transport/socket/src/name.c:496
#7  0x00007f8622055c1b in socket_connect (this=0x17156d0, port=0) at /root/glusterfs/rpc/rpc-transport/socket/src/socket.c:2914
#8  0x00007f862c46dfe1 in rpc_transport_connect (this=0x17156d0, port=0) at /root/glusterfs/rpc/rpc-lib/src/rpc-transport.c:426
#9  0x00007f862c473655 in rpc_clnt_submit (rpc=0x1713c80, prog=0x614620 <clnt_pmap_prog>, procnum=5, cbkfn=0x40f0e9 <mgmt_pmap_signout_cbk>, proghdr=0x7f861b662cf0, proghdrcount=1, progpayload=0x0, progpayloadcount=0,
    iobref=0x7f85fc000f60, frame=0x7f862a513de0, rsphdr=0x0, rsphdr_count=0, rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0) at /root/glusterfs/rpc/rpc-lib/src/rpc-clnt.c:1554
#10 0x000000000040d725 in mgmt_submit_request (req=0x7f861b663d60, frame=0x7f862a513de0, ctx=0x16cb010, prog=0x614620 <clnt_pmap_prog>, procnum=5, cbkfn=0x40f0e9 <mgmt_pmap_signout_cbk>, xdrproc=0x4048d0 <xdr_pmap_signout_req@plt>)
    at /root/glusterfs/glusterfsd/src/glusterfsd-mgmt.c:1445
#11 0x000000000040f38d in glusterfs_mgmt_pmap_signout (ctx=0x16cb010) at /root/glusterfs/glusterfsd/src/glusterfsd-mgmt.c:2258
#12 0x0000000000407903 in cleanup_and_exit (signum=15) at /root/glusterfs/glusterfsd/src/glusterfsd.c:1201
#13 0x0000000000408ecf in glusterfs_sigwaiter (arg=0x7fff49a90520) at /root/glusterfs/glusterfsd/src/glusterfsd.c:1761
#14 0x00007f862bc0e9d1 in start_thread () from ./lib64/libpthread.so.0
#15 0x00007f862b5788fd in clone () from ./lib64/libc.so.6
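
To illustrate the class of race (this is not GlusterFS code, only a minimal hypothetical sketch; struct node, shared_ctx, disconnect_cleanup and teardown_all are made-up names): the disconnect path and the teardown path both touch a shared doubly linked list, so both have to serialise on the same lock. If the teardown path frees entries without that serialisation, the list_del_init() in the disconnect thread ends up dereferencing freed memory, which is the pattern the two backtraces suggest.

#include <pthread.h>
#include <stdlib.h>

struct node {
        struct node *next;
        struct node *prev;
};

struct shared_ctx {
        pthread_mutex_t lock;
        struct node     head;   /* list sentinel; head.next == &head when empty */
};

static void
disconnect_cleanup (struct shared_ctx *ctx, struct node *entry)
{
        /* Disconnect path: unlink one entry, analogous to list_del_init(). */
        pthread_mutex_lock (&ctx->lock);
        entry->prev->next = entry->next;
        entry->next->prev = entry->prev;
        entry->next = entry;
        entry->prev = entry;
        pthread_mutex_unlock (&ctx->lock);
}

static void
teardown_all (struct shared_ctx *ctx)
{
        /* Teardown path: drain and free every entry.  If this ran without
         * taking ctx->lock, the disconnect thread above could still be
         * unlinking nodes that have already been freed here -- the same
         * use-after-free pattern seen in the core dumps. */
        pthread_mutex_lock (&ctx->lock);
        while (ctx->head.next != &ctx->head) {
                struct node *n = ctx->head.next;

                ctx->head.next = n->next;
                n->next->prev = &ctx->head;
                free (n);
        }
        pthread_mutex_unlock (&ctx->lock);
}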

>
> We should probably concentrate on fixing the most common
> spurious failures soon, and look into the less common ones
> later on.
>
> I'll do some runs on release-3.6 soon too, as I suspect that'll
> be useful.
>
> + Justin

This was observed in regression test runs on the master branch as of the date specified above.

Comment 1 Anand Avati 2015-04-08 21:21:17 UTC
REVIEW: http://review.gluster.org/10167 (tests: remove tests for clear-locks) posted (#1) for review on master by Jeff Darcy (jdarcy)

Comment 2 Anand Avati 2015-04-09 09:51:17 UTC
COMMIT: http://review.gluster.org/10167 committed in master by Vijay Bellur (vbellur) 
------
commit 0086a55bb7de1ef5dc7a24583f5fc2b560e835fd
Author: Jeff Darcy <jdarcy>
Date:   Wed Apr 8 17:17:13 2015 -0400

    tests: remove tests for clear-locks
    
    These are suspected of causing core dumps during regression tests,
    leading to spurious failures.  Per email conversation, since this
    isn't a supported feature anyway, the tests are being removed to
    facilitate testing of features we do support.
    
    Change-Id: I7fd5c76d26dd6c3ffa91f89fc10469ae3a63afdf
    BUG: 1195415
    Signed-off-by: Jeff Darcy <jdarcy>
    Reviewed-on: http://review.gluster.org/10167
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Kaleb KEITHLEY <kkeithle>
    Reviewed-by: Vijay Bellur <vbellur>

Comment 3 Justin Clift 2015-04-09 11:18:48 UTC
Pranith pointed out this may be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1184417.

Comment 4 Niels de Vos 2015-05-14 17:26:29 UTC
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user


