Bug 1066128 - glusterfsd crashes with SEGV during catalyst run
Summary: glusterfsd crashes with SEGV during catalyst run
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: GlusterFS
Classification: Community
Component: core
Version: 3.5.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-02-17 19:21 UTC by Nick Dokos
Modified: 2016-01-25 18:21 UTC
CC List: 6 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2016-01-19 12:21:33 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
/var/log/glusterfs/bricks/brick-vol0.log excerpt (6.03 KB, text/plain)
2014-02-17 19:21 UTC, Nick Dokos
Backtrace from core dump (3.04 KB, text/plain)
2014-02-17 19:24 UTC, Nick Dokos
First error that gluster-swift encounters (followed by many more like this) (1.23 KB, text/plain)
2014-02-17 19:25 UTC, Nick Dokos

Description Nick Dokos 2014-02-17 19:21:36 UTC
Created attachment 864250 [details]
/var/log/glusterfs/bricks/brick-vol0.log excerpt

Description of problem:
I have been running the catalyst test from one client node with 64 threads to one server node. The load consists of 10 filesets with 10000 files each: catalyst first does a bunch of PUTs against the gluster-swift servers and then does a bunch of GETs; this is repeated three times altogether. I was able to run this load once against a set of vanilla GlusterFS 3.5 bits that Kaleb Keithley provided, comparing it against a set of mods that Kaleb wanted tested. For the purposes of this test, I had eliminated all translators except the posix translator.

I was trying to repeat the run when I ran into the problem: running against the vanilla bits, the test runs without error for a while, but eventually every operation returns an error. It turns out the glusterfsd process is gone, so the test cannot cross the mount point. "A while" means thousands of files; in fact, the first two repetitions of all 100K files have succeeded and glusterfsd died sometime during the third one. AFAICT, the time of glusterfsd's demise is not predictable.

What *is* predictable is the manner of the crash: I ran it a few times, and the backtrace always looks like this:


Core was generated by `/usr/sbin/glusterfsd -s gprfs029-b-10ge --volfile-id vol0.gprfs029-b-10ge.brick'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000375e209220 in pthread_mutex_lock () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.107.el6.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.10.3-10.el6.x86_64 libaio-0.3.107-10.el6.x86_64 libcom_err-1.41.12-14.el6.x86_64 libgcc-4.4.7-3.el6.x86_64 libselinux-2.0.94-5.3.el6.x86_64 openssl-1.0.1e-16.el6_5.4.x86_64 python-libs-2.6.6-36.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x000000375e209220 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00000038d9c31f29 in inode_link (inode=0x10a2edc, parent=0x0, name=0x0, iatt=0x7fff5e295460) at inode.c:890
#2  0x00007f2c5739ba0f in resolve_gfid_cbk (frame=<value optimized out>, cookie=<value optimized out>, this=0x106f2f0, 
    op_ret=<value optimized out>, op_errno=<value optimized out>, inode=0x10a2edc, buf=0x7fff5e295460, xdata=0x0, 
    postparent=0x7fff5e2953f0) at server-resolve.c:128
#3  0x00007f2c577dcec6 in posix_lookup (frame=0x7f2c5ae0602c, this=<value optimized out>, loc=0x12779e8, 
    xdata=<value optimized out>) at posix.c:189
#4  0x00007f2c5739b1eb in resolve_gfid (frame=0x7f2c5ac3adc0) at server-resolve.c:182
#5  0x00007f2c5739b600 in server_resolve_entry (frame=0x7f2c5ac3adc0) at server-resolve.c:318
#6  0x00007f2c5739b478 in server_resolve (frame=0x7f2c5ac3adc0) at server-resolve.c:510
#7  0x00007f2c5739b59e in server_resolve_all (frame=<value optimized out>) at server-resolve.c:572
#8  0x00007f2c5739b5d5 in server_resolve_entry (frame=0x7f2c5ac3adc0) at server-resolve.c:325
#9  0x00007f2c5739b478 in server_resolve (frame=0x7f2c5ac3adc0) at server-resolve.c:510
#10 0x00007f2c5739b57e in server_resolve_all (frame=<value optimized out>) at server-resolve.c:565
#11 0x00007f2c5739b634 in resolve_and_resume (frame=<value optimized out>, fn=<value optimized out>) at server-resolve.c:595
#12 0x00007f2c573a9f23 in server3_3_rename (req=0x7f2c56c8c02c) at server-rpc-fops.c:5813
#13 0x00000038da409615 in rpcsvc_handle_rpc_call (svc=<value optimized out>, trans=<value optimized out>, msg=0x1069480)
    at rpcsvc.c:631
#14 0x00000038da409853 in rpcsvc_notify (trans=0x1098980, mydata=<value optimized out>, event=<value optimized out>, 
    data=0x1069480) at rpcsvc.c:725
#15 0x00000038da40b008 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, 
    data=<value optimized out>) at rpc-transport.c:512
#16 0x00007f2c58611fb5 in socket_event_poll_in (this=0x1098980) at socket.c:2119
#17 0x00007f2c586139fd in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x1098980, poll_in=1, 
    poll_out=0, poll_err=0) at socket.c:2232
#18 0x00000038d9c672f7 in event_dispatch_epoll_handler (event_pool=0x104a710) at event-epoll.c:384
#19 event_dispatch_epoll (event_pool=0x104a710) at event-epoll.c:445
#20 0x00000000004075e4 in main (argc=19, argv=0x7fff5e296ba8) at glusterfsd.c:1983
(gdb) 
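
Frame #1 shows inode_link() being entered with parent=0x0 and name=0x0, and the fault itself is inside pthread_mutex_lock(), which usually means the mutex is being reached through a NULL or stale pointer (for example, an inode whose table pointer is invalid) rather than a problem in pthread itself. The following is a minimal, self-contained C sketch of that general failure mode; the struct and function names are simplified stand-ins for illustration only, not the actual definitions from inode.c:

#include <pthread.h>
#include <stdio.h>

struct fake_table {
        pthread_mutex_t lock;
};

struct fake_inode {
        struct fake_table *table;   /* NULL or dangling in the crash scenario */
};

static void fake_inode_link(struct fake_inode *inode)
{
        struct fake_table *table = inode->table;

        /* With this guard the call degrades gracefully; without it, a NULL or
         * freed 'table' makes pthread_mutex_lock() dereference invalid memory,
         * which is the pattern seen in frame #0 above. */
        if (table == NULL) {
                fprintf(stderr, "inode has no table; skipping link\n");
                return;
        }

        pthread_mutex_lock(&table->lock);
        /* ... hash the inode into the table, take references, etc. ... */
        pthread_mutex_unlock(&table->lock);
}

int main(void)
{
        struct fake_inode bad = { .table = NULL };

        fake_inode_link(&bad);
        return 0;
}

Whether the bad pointer in the real crash comes from a corrupted inode, a freed inode table, or something else entirely is exactly what the core dump would need to answer.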



Version-Release number of selected component (if applicable):
GlusterFS 3.5git

How reproducible:
Every time except the first.

Steps to Reproduce:
1. I have an automated setup to run catalyst in this configuration. The process is too long to describe here.

Actual results:
glusterfsd gets a SEGV and crashes. After that, any attempt to cross the mount point (e.g. cd into the directory) returns an error.

Expected results:
It should not crash.

Additional info:

Comment 1 Nick Dokos 2014-02-17 19:24:56 UTC
Created attachment 864251 [details]
Backtrace from core dump

Comment 2 Nick Dokos 2014-02-17 19:25:57 UTC
Created attachment 864252 [details]
First error that gluster-swift encounters (followed by many more like this)

Comment 3 Ben England 2014-02-21 22:22:59 UTC
Added myself to the CC list.

Comment 5 Niels de Vos 2015-12-22 12:43:58 UTC
Hi Nick,

Sorry for the late reply; we're trying to catch up on old bugs. Could you let us know if this problem still occurs on current releases? We have fixed quite a few bugs that look related to this problem (also backported to 3.5.x).

If you still have your automated setup, could you run the test again and let us know which exact version (or git commit) you use?

Thanks,
Niels

Comment 6 Soumya Koduri 2016-01-19 12:21:33 UTC
As we haven't yet received any response, closing the bug. Please re-open if the issue still exists on any of the supported GlusterFS versions.

Comment 7 Nick Dokos 2016-01-25 18:21:16 UTC
I don't know whether the problem still exists, and at this point I don't have the time or the hardware to recreate the scenario. So closing it for now and reopening if necessary seems like the right course of action.

