Created attachment 864250 [details]
/var/log/glusterfs/bricks/brick-vol0.log excerpt

Description of problem:

I have been running the catalyst test from one client node with 64 threads against one server node. The load consists of 10 filesets with 10000 files each: catalyst first does a bunch of PUTs against the gluster-swift servers and then a bunch of GETs, and this sequence is repeated three times in all.

I was able to run this load once against a set of vanilla GlusterFS 3.5 bits that Kaleb Keithley provided, comparing it against a set of mods that Kaleb wanted tested. For the purposes of this test, I had eliminated all translators except the posix translator.

I was trying to repeat the run when I ran into the problem: running against the vanilla bits, the test runs without error for a while, but eventually every operation returns an error. It turns out the glusterfsd process is gone, so the test cannot cross the mount point. "A while" means thousands of files; in fact, it has happened that the first two repetitions of all 100K files succeeded and glusterfsd died sometime during the third one. AFAICT, the time of glusterfsd's demise is not predictable. What *is* predictable is the manner. I ran it a few times and the backtrace from the crash always looks like this:

Core was generated by `/usr/sbin/glusterfsd -s gprfs029-b-10ge --volfile-id vol0.gprfs029-b-10ge.brick'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000375e209220 in pthread_mutex_lock () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.107.el6.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.10.3-10.el6.x86_64 libaio-0.3.107-10.el6.x86_64 libcom_err-1.41.12-14.el6.x86_64 libgcc-4.4.7-3.el6.x86_64 libselinux-2.0.94-5.3.el6.x86_64 openssl-1.0.1e-16.el6_5.4.x86_64 python-libs-2.6.6-36.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x000000375e209220 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00000038d9c31f29 in inode_link (inode=0x10a2edc, parent=0x0, name=0x0, iatt=0x7fff5e295460) at inode.c:890
#2  0x00007f2c5739ba0f in resolve_gfid_cbk (frame=<value optimized out>, cookie=<value optimized out>, this=0x106f2f0, op_ret=<value optimized out>, op_errno=<value optimized out>, inode=0x10a2edc, buf=0x7fff5e295460, xdata=0x0, postparent=0x7fff5e2953f0) at server-resolve.c:128
#3  0x00007f2c577dcec6 in posix_lookup (frame=0x7f2c5ae0602c, this=<value optimized out>, loc=0x12779e8, xdata=<value optimized out>) at posix.c:189
#4  0x00007f2c5739b1eb in resolve_gfid (frame=0x7f2c5ac3adc0) at server-resolve.c:182
#5  0x00007f2c5739b600 in server_resolve_entry (frame=0x7f2c5ac3adc0) at server-resolve.c:318
#6  0x00007f2c5739b478 in server_resolve (frame=0x7f2c5ac3adc0) at server-resolve.c:510
#7  0x00007f2c5739b59e in server_resolve_all (frame=<value optimized out>) at server-resolve.c:572
#8  0x00007f2c5739b5d5 in server_resolve_entry (frame=0x7f2c5ac3adc0) at server-resolve.c:325
#9  0x00007f2c5739b478 in server_resolve (frame=0x7f2c5ac3adc0) at server-resolve.c:510
#10 0x00007f2c5739b57e in server_resolve_all (frame=<value optimized out>) at server-resolve.c:565
#11 0x00007f2c5739b634 in resolve_and_resume (frame=<value optimized out>, fn=<value optimized out>) at server-resolve.c:595
#12 0x00007f2c573a9f23 in server3_3_rename (req=0x7f2c56c8c02c) at server-rpc-fops.c:5813
#13 0x00000038da409615 in rpcsvc_handle_rpc_call (svc=<value optimized out>, trans=<value optimized out>, msg=0x1069480) at rpcsvc.c:631
#14 0x00000038da409853 in rpcsvc_notify (trans=0x1098980, mydata=<value optimized out>, event=<value optimized out>, data=0x1069480) at rpcsvc.c:725
#15 0x00000038da40b008 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:512
#16 0x00007f2c58611fb5 in socket_event_poll_in (this=0x1098980) at socket.c:2119
#17 0x00007f2c586139fd in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x1098980, poll_in=1, poll_out=0, poll_err=0) at socket.c:2232
#18 0x00000038d9c672f7 in event_dispatch_epoll_handler (event_pool=0x104a710) at event-epoll.c:384
#19 event_dispatch_epoll (event_pool=0x104a710) at event-epoll.c:445
#20 0x00000000004075e4 in main (argc=19, argv=0x7fff5e296ba8) at glusterfsd.c:1983
(gdb)

Version-Release number of selected component (if applicable):
GlusterFS 3.5git

How reproducible:
Every time except the first.

Steps to Reproduce:
1. I have an automated setup to run catalyst in this configuration. The process is too long to describe here.

Actual results:
glusterfsd gets a SEGV and crashes. After that, any attempt to cross the mount point gets an error, e.g. cd to the directory.

Expected results:
It should not crash.

Additional info:
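My reading of the backtrace (not a confirmed root cause): frame #1 shows inode_link() being entered with parent=0x0 and name=0x0 from resolve_gfid_cbk() while handling a rename, and the fault lands inside pthread_mutex_lock(), which is the pattern you get when the lock is reached through a NULL or already-destroyed pointer. The sketch below is NOT GlusterFS code; struct inode_table, its lock field and link_entry() are made-up names used only to illustrate that failure mode.

/* Minimal illustration (not GlusterFS code): if a lock is reached through a
 * NULL struct pointer, the process dies with SIGSEGV and pthread_mutex_lock()
 * at the top of the stack, just like frame #0 above. */
#include <pthread.h>
#include <stddef.h>

struct inode_table {
        pthread_mutex_t lock;
};

static void link_entry(struct inode_table *table)
{
        /* When table == NULL, this call dereferences a near-NULL address
         * inside pthread_mutex_lock() and the process segfaults there. */
        pthread_mutex_lock(&table->lock);
        /* ... table insertion would happen here ... */
        pthread_mutex_unlock(&table->lock);
}

int main(void)
{
        struct inode_table *table = NULL;   /* stands in for the bad pointer */
        link_entry(table);                  /* SIGSEGV, matching the backtrace */
        return 0;
}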
Created attachment 864251 [details]
Backtrace from core dump
Created attachment 864252 [details]
First error that gluster-swift encounters (followed by many more like this)
added myself to cclist
Hi Nick,

Sorry for the late reply; we're trying to catch up on old bugs. Could you let us know if this problem still occurs on current releases? We have fixed quite a few bugs that look related to this problem (also backported to 3.5.x). If you still have your automated setup, could you run the test again and let us know which exact version (or git commit) you used?

Thanks, Niels
As we haven't yet received any response, closing the bug. Please re-open if the issue still exists on any of the currently supported GlusterFS versions.
I don't know whether the problem still exists, and at this point I don't have the time or the hardware to recreate the scenario. So closing it for now and reopening if necessary seems like the right course of action.