Description of problem:
Running a 4-node (replica 4) GlusterFS cluster, the glusterd servers all eventually crash within about 15-60 minutes of starting, with the backtrace below. Restarting the service brings them back online. When the servers crash, they are under heavy load: approximately 5-10 concurrent git/hg clone operations are occurring on the cluster, via client nodes (running the same nightly version) that mount the volume with `mount -t glusterfs`.

Backtrace of crash:
...
[2014-05-30 03:56:52.745039] I [server-helpers.c:291:do_fd_cleanup] 0-gv0-server: fd cleanup on /git/wu/nO/_tmp_pwM6m1gEIVosHmM7110146A=-696488624/hooks/pre-commit.sample
[2014-05-30 03:56:52.745050] E [client_t.c:384:gf_client_unref] (-->/usr/lib64/glusterfs/3.5qa2/xlator/features/locks.so(pl_flush_cbk+0xb9) [0x7f4d5c15d3b9] (-->/usr/lib64/libglusterfs.so.0(default_flush_cbk+0xb9) [0x7f4d68adb699] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_flush_cbk+0xed) [0x7f4d56b9952d]))) 0-client_t: client is NULL
[2014-05-30 03:56:52.745061] I [server-helpers.c:291:do_fd_cleanup] 0-gv0-server: fd cleanup on /git/ks/gM/_tmp_Y8a2fAoft1r14R_fVc-UTMs=-026718122/config
[2014-05-30 03:56:52.745088] I [server-helpers.c:291:do_fd_cleanup] 0-gv0-server: fd cleanup on /git/Cw/o-/_tmp_40mBnMCmLUj_fZfpTnk3hXs=-746944916/refs
[2014-05-30 03:56:52.745101] E [client_t.c:384:gf_client_unref] (-->/usr/lib64/glusterfs/3.5qa2/xlator/features/locks.so(pl_flush_cbk+0xb9) [0x7f4d5c15d3b9] (-->/usr/lib64/libglusterfs.so.0(default_flush_cbk+0xb9) [0x7f4d68adb699] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_flush_cbk+0xed) [0x7f4d56b9952d]))) 0-client_t: client is NULL
[2014-05-30 03:56:52.745123] I [server-helpers.c:291:do_fd_cleanup] 0-gv0-server: fd cleanup on /git/u6/X_/oA1rqlC7cUsMD81_6MN5ogo=/objects/pack/pack-2527e3670442207f7b23bdea5fc2dfe7359bf7ba.pack
[2014-05-30 03:56:52.745137] E [client_t.c:384:gf_client_unref] (-->/usr/lib64/glusterfs/3.5qa2/xlator/features/locks.so(pl_flush_cbk+0xb9) [0x7f4d5c15d3b9] (-->/usr/lib64/libglusterfs.so.0(default_flush_cbk+0xb9) [0x7f4d68adb699] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_flush_cbk+0xed) [0x7f4d56b9952d]))) 0-client_t: client is NULL
[2014-05-30 03:56:52.745144] I [server-helpers.c:291:do_fd_cleanup] 0-gv0-server: fd cleanup on /git/h7/Nz/_tmp_Lo7MT0i_Ul-79_3ZvbF5mMo=-606962255/hooks/applypatch-msg.sample
[2014-05-30 03:56:52.745157] E [client_t.c:384:gf_client_unref] (-->/usr/lib64/glusterfs/3.5qa2/xlator/features/locks.so(pl_flush_cbk+0xb9) [0x7f4d5c15d3b9] (-->/usr/lib64/libglusterfs.so.0(default_flush_cbk+0xb9) [0x7f4d68adb699] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_flush_cbk+0xed) [0x7f4d56b9952d]))) 0-client_t: client is NULL
[2014-05-30 03:56:52.745164] I [server-helpers.c:291:do_fd_cleanup] 0-gv0-server: fd cleanup on /git/u6/X_/oA1rqlC7cUsMD81_6MN5ogo=/objects/pack/pack-2527e3670442207f7b23bdea5fc2dfe7359bf7ba.idx
[2014-05-30 03:56:52.745181] E [client_t.c:384:gf_client_unref] (-->/usr/lib64/glusterfs/3.5qa2/xlator/features/locks.so(pl_flush_cbk+0xb9) [0x7f4d5c15d3b9] (-->/usr/lib64/libglusterfs.so.0(default_flush_cbk+0xb9) [0x7f4d68adb699] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_flush_cbk+0xed) [0x7f4d56b9952d]))) 0-client_t: client is NULL
[2014-05-30 03:56:52.745197] E [client_t.c:384:gf_client_unref] (-->/usr/lib64/glusterfs/3.5qa2/xlator/features/locks.so(pl_flush_cbk+0xb9) [0x7f4d5c15d3b9] (-->/usr/lib64/libglusterfs.so.0(default_flush_cbk+0xb9) [0x7f4d68adb699] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_flush_cbk+0xed) [0x7f4d56b9952d]))) 0-client_t: client is NULL
[2014-05-30 03:56:52.745221] E [client_t.c:384:gf_client_unref] (-->/usr/lib64/glusterfs/3.5qa2/xlator/features/locks.so(pl_flush_cbk+0xb9) [0x7f4d5c15d3b9] (-->/usr/lib64/libglusterfs.so.0(default_flush_cbk+0xb9) [0x7f4d68adb699] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_flush_cbk+0xed) [0x7f4d56b9952d]))) 0-client_t: client is NULL
[2014-05-30 03:56:52.745250] E [client_t.c:384:gf_client_unref] (-->/usr/lib64/glusterfs/3.5qa2/xlator/features/locks.so(pl_flush_cbk+0xb9) [0x7f4d5c15d3b9] (-->/usr/lib64/libglusterfs.so.0(default_flush_cbk+0xb9) [0x7f4d68adb699] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_flush_cbk+0xed) [0x7f4d56b9952d]))) 0-client_t: client is NULL
[2014-05-30 03:56:52.745407] E [client_t.c:384:gf_client_unref] (-->/usr/lib64/glusterfs/3.5qa2/xlator/features/locks.so(pl_flush_cbk+0xb9) [0x7f4d5c15d3b9] (-->/usr/lib64/libglusterfs.so.0(default_flush_cbk+0xb9) [0x7f4d68adb699] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_flush_cbk+0xed) [0x7f4d56b9952d]))) 0-client_t: client is NULL
[2014-05-30 03:56:52.745444] E [client_t.c:384:gf_client_unref] (-->/usr/lib64/glusterfs/3.5qa2/xlator/features/locks.so(pl_flush_cbk+0xb9) [0x7f4d5c15d3b9] (-->/usr/lib64/libglusterfs.so.0(default_flush_cbk+0xb9) [0x7f4d68adb699] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_flush_cbk+0xed) [0x7f4d56b9952d]))) 0-client_t: client is NULL
[2014-05-30 03:57:21.704523] I [server.c:519:server_rpc_notify] 0-gv0-server: disconnecting connectionfrom ip-10-220-14-24-2534-2014/05/30-00:04:37:888277-gv0-client-0-0-1
[2014-05-30 03:57:21.704583] I [client_t.c:294:gf_client_put] 0-gv0-server: Shutting down connection ip-10-220-14-24-2534-2014/05/30-00:04:37:888277-gv0-client-0-0-1

pending frames:
frame : type(0) op(12)
frame : type(0) op(12)
frame : type(0) op(12)
frame : type(0) op(12)
frame : type(0) op(12)
frame : type(0) op(12)
frame : type(0) op(29)
frame : type(0) op(30)
frame : type(0) op(30)
frame : type(0) op(30)
frame : type(0) op(30)
frame : type(0) op(30)
frame : type(0) op(30)
frame : type(0) op(30)
frame : type(0) op(30)
frame : type(0) op(30)
frame : type(0) op(30)
frame : type(0) op(30)
frame : type(0) op(29)
frame : type(0) op(29)
frame : type(0) op(30)
frame : type(0) op(29)
frame : type(0) op(30)

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2014-05-30 03:57:21
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.5qa2
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7f4d68acdcb6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x7f4d68ae801f]
/lib64/libc.so.6(+0x33fc0)[0x7f4d67b05fc0]
/usr/lib64/libglusterfs.so.0(uuid_unpack+0x0)[0x7f4d68b08c00]
/usr/lib64/libglusterfs.so.0(+0x5a8d6)[0x7f4d68b088d6]
/usr/lib64/libglusterfs.so.0(uuid_utoa+0x37)[0x7f4d68ae5727]
/usr/lib64/glusterfs/3.5qa2/xlator/features/locks.so(pl_inodelk_client_cleanup+0x4d8)[0x7f4d5c166a58]
/usr/lib64/glusterfs/3.5qa2/xlator/features/locks.so(+0x6f3a)[0x7f4d5c15bf3a]
/usr/lib64/libglusterfs.so.0(gf_client_disconnect+0x5d)[0x7f4d68b211dd]
/usr/lib64/glusterfs/3.5qa2/xlator/protocol/server.so(server_connection_cleanup+0x448)[0x7f4d5695fd48]
/usr/lib64/glusterfs/3.5qa2/xlator/protocol/server.so(server_rpc_notify+0x17c)[0x7f4d5695be2c]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_disconnect+0x105)[0x7f4d6889c195]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x1a0)[0x7f4d6889dc70]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x7f4d6889f678]
/usr/lib64/glusterfs/3.5qa2/rpc-transport/socket.so(+0xa971)[0x7f4d5e3f1971]
/usr/lib64/libglusterfs.so.0(+0x755a7)[0x7f4d68b235a7]
/usr/sbin/glusterfsd(main+0x603)[0x407e93]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f4d67af37d5]
/usr/sbin/glusterfsd[0x4049a9]
---------

# cat /etc/glusterfs/glusterd.vol
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket,rdma
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
    option transport.socket.read-fail-log off
    option ping-timeout 30
#   option base-port 49152
end-volume

and `gluster volume set gv0 performance.write-behind off` (so that hg clone works).

Version-Release number of selected component (if applicable):
3.5qa2, from http://download.gluster.org/pub/gluster/glusterfs/nightly/glusterfs/epel-6-x86_64/glusterfs-3.5qa2-0.544.git58b9ede.autobuild/glusterfs-server-3.5qa2-0.544.git58b9ede.el6.x86_64.rpm

How reproducible:
Occurs 100% of the time we run this load against the FS.

Steps to Reproduce:
1. Create a 2-4 node cluster with the version specified above:
   yum install -y \
     http://download.gluster.org/pub/gluster/glusterfs/nightly/glusterfs/epel-6-x86_64/glusterfs-3.5qa2-0.544.git58b9ede.autobuild/glusterfs-server-3.5qa2-0.544.git58b9ede.el6.x86_64.rpm \
     http://download.gluster.org/pub/gluster/glusterfs/nightly/glusterfs/epel-6-x86_64/glusterfs-3.5qa2-0.544.git58b9ede.autobuild/glusterfs-libs-3.5qa2-0.544.git58b9ede.el6.x86_64.rpm \
     http://download.gluster.org/pub/gluster/glusterfs/nightly/glusterfs/epel-6-x86_64/glusterfs-3.5qa2-0.544.git58b9ede.autobuild/glusterfs-fuse-3.5qa2-0.544.git58b9ede.el6.x86_64.rpm \
     http://download.gluster.org/pub/gluster/glusterfs/nightly/glusterfs/epel-6-x86_64/glusterfs-3.5qa2-0.544.git58b9ede.autobuild/glusterfs-cli-3.5qa2-0.544.git58b9ede.el6.x86_64.rpm \
     http://download.gluster.org/pub/gluster/glusterfs/nightly/glusterfs/epel-6-x86_64/glusterfs-3.5qa2-0.544.git58b9ede.autobuild/glusterfs-3.5qa2-0.544.git58b9ede.el6.x86_64.rpm
2. Run many git clone operations concurrently against the FS.
3. Observe the crash in `gluster v status MYVOL` (some of the bricks go offline).

Actual results:
glusterd services crash with the above stacktrace.

Expected results:
glusterd servers do not crash.

Additional info:
# uname -a
Linux ip-10-213-14-6 3.10.35-43.137.amzn1.x86_64 #1 SMP Wed Apr 2 09:36:59 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Server is an AWS EC2 c3.xlarge with bricks on an EBS IO-optimized 800GB volume.

Potentially related issues?
https://bugzilla.redhat.com/show_bug.cgi?id=885008
https://bugzilla.redhat.com/show_bug.cgi?id=1097102
https://bugzilla.redhat.com/show_bug.cgi?id=1090837
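For reference, the concurrent-clone load in step 2 can be sketched as a small shell helper. This is only a sketch of the reported workload: the mount point, repository URL, and clone count below are placeholders, not values taken from this report.

```shell
#!/bin/sh
# clone_load N MOUNT URL -- start N concurrent `git clone`s of URL into
# distinct directories under MOUNT, then wait for all of them to finish.
# The report describes roughly 5-10 concurrent git/hg clones sustained
# until the crash appears (typically within 15-60 minutes).
clone_load() {
    n=$1 mount=$2 url=$3
    i=1
    while [ "$i" -le "$n" ]; do
        git clone "$url" "$mount/clone-$$-$i" &
        i=$((i + 1))
    done
    wait  # block until every background clone has finished (or failed)
}
```

A run might look like `clone_load 10 /mnt/gv0 git://git.gluster.com/glusterfs.git`, repeated in an outer loop for the duration of the test.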
Hi Quinn,

I tried recreating the crash with the steps suggested in the Description, but to no avail. Here's what I did:
1. Created a 2x2 distributed-replicate volume, started it, and disabled write-behind.
2. Created about 10 different fuse mounts of the same volume on 10 different mountpoints.
3. Simultaneously started 'git clone' of glusterfs into 10 different directories, one from each mountpoint (something like 'git clone' into dir1 from /mnt1, 'git clone' into dir2 from /mnt2, ... 'git clone' into dir10 from /mnt10).
4. Kept monitoring the volume's status to look for crashes.

All 10 instances of 'git clone' ran to completion with no crashes in between. Is this the exact set of steps you said would lead to the crash, or am I missing something here?
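The monitoring in step 4 could be automated with a helper along these lines. This is a sketch under an assumption about the output format: it treats the Online column (Y/N) of `gluster volume status` as the second-to-last field, just before Pid, which matches 3.5-era output but should be verified against your build.

```shell
#!/bin/sh
# brick_offline VOL -- succeed (exit 0) if any brick row in
# `gluster volume status VOL` shows "N" in the Online column.
# Assumption: Online is the second-to-last field and Pid (digits or "-")
# is the last field of a brick row; adjust if your CLI output differs.
brick_offline() {
    gluster volume status "$1" 2>/dev/null |
        awk '$(NF-1) == "N" && $NF ~ /^([0-9]+|-)$/ { found = 1 }
             END { exit !found }'
}
```

Polling it in a loop, e.g. `while sleep 60; do brick_offline gv0 && echo "brick down at $(date)"; done`, gives a timestamp for when a brick first drops offline under load.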
Hi Krutika,

Thanks for trying to repro. That sounds correct, but our environment essentially does that repeatedly (with more concurrent operations) for many hours. We typically see the crash occur within 15-60 minutes of beginning this.

Some more information:
* This particular crash does not occur on 3.5.0-2.el6.x86_64 (in an otherwise identical environment, with the same types of operations being performed against the GlusterFS cluster). We did see another crash there, in __get_entrylk_count, but I haven't filed an issue about that one because I didn't think it was related.
* Under light load (1 clone every 10 minutes, mostly read operations), this crash does not occur (at least for the several days we've been running it).

Would it be helpful if I extracted our deployment scripts into a standalone repro script? It would launch some EC2 instances using AWS CloudFormation. I could even boot it up and give you the ssh private key after I've confirmed that it exhibits the issue.
Thanks, Quinn, for that input. A repro script is definitely what we need. Would it be possible for you to attach a modified version of the repro script you're talking about, so that I can run it on my local VMs (as opposed to an EC2 instance) to recreate the bug? If not, could you mail your script to me privately (since this bug is public and your script might contain sensitive information) at kdhananj ?
Yes, I'll see if I can extract it to run on any VMs. Otherwise, I'll send you an email. It will probably take me about a week to get to this. Thanks!
Hi Quinn,

I recently sent a fix for this bug for review at http://review.gluster.org/#/c/7981/

Meanwhile, could you download the rpms that contain this fix from the following locations and let me know whether it fixes the crash for you, with the repro you have?
http://build.gluster.org/job/glusterfs-devrpms-el6/1586/artifact/
http://build.gluster.org/job/glusterfs-devrpms/1613/artifact/

Thanks,
Krutika
The patch that fixes this issue is merged in master and awaiting review in release-3.5: http://review.gluster.org/#/c/7981/. Moving the state of the bug to MODIFIED.
Hi Krutika,

This appears to have fixed it. Thank you!

It looks like those Jenkins rpm links you sent have been deleted, since it took me a few days to get to this. However, I am using the ones from http://build.gluster.org/job/glusterfs-devrpms-el6/1740/, which is a build of a commit that appears to include your commit.

I ran the repro and have not seen the problem reoccur yet. It is well beyond the point where the problem would have reoccurred (virtually with certainty), so it does appear to be fixed.
(In reply to Quinn Slack from comment #7)
> Hi Krutika,
>
> This appears to have fixed it. Thank you!
>
> It looks like those Jenkins rpm links you sent have been deleted, since it
> took me a few days to get to this. However, I am using the ones from
> http://build.gluster.org/job/glusterfs-devrpms-el6/1740/, which is a build
> of a commit that appears to include your commit.
>
> I ran the repro and have not seen the problem reoccur yet. It is well beyond
> the point where the problem would have reoccurred (virtually with
> certainty), so it does appear to be fixed.

That is awesome, Quinn! Good to know. :) Thanks a lot for testing it out.

-Krutika
I am running 3.5.0 GA and still getting this error in quotad.log. Is the patch part of the GA, or should it be in 3.5.1?
(In reply to Peter Auyeung from comment #9)
> I am running 3.5.0 GA and still getting this error on quotad.log
> is the patch part of the GA or should be in 3.5.1?

Peter,

This bug is about a brick process (glusterfsd) crash seen in 3.5.0, whose fix is available in 3.5.1. I am not sure what error you are talking about. Could you check whether there's already a bug for the problem you described and post your quotad.log there? If not, you could raise a new bug to track the issue.

Thanks,
Krutika