Description of problem:
Running a 4-node (replica 4) GlusterFS cluster, the glusterd servers all eventually crash within about 15-60 minutes of starting, with the backtrace below. Restarting the service brings them back online. When the servers crash, they are under heavy load: approximately 5-10 concurrent git/hg clone operations are occurring on the cluster, via client nodes (running the same nightly version) that mount the volume with `mount -t glusterfs`.

Backtrace of crash:
...
[2014-05-30 03:56:52.745039] I [server-helpers.c:291:do_fd_cleanup] 0-gv0-server: fd cleanup on /git/wu/nO/_tmp_pwM6m1gEIVosHmM7110146A=-696488624/hooks/pre-commit.sample
[2014-05-30 03:56:52.745050] E [client_t.c:384:gf_client_unref] (-->/usr/lib64/glusterfs/3.5qa2/xlator/features/locks.so(pl_flush_cbk+0xb9) [0x7f4d5c15d3b9] (-->/usr/lib64/libglusterfs.so.0(default_flush_cbk+0xb9) [0x7f4d68adb699] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_flush_cbk+0xed) [0x7f4d56b9952d]))) 0-client_t: client is NULL
[2014-05-30 03:56:52.745061] I [server-helpers.c:291:do_fd_cleanup] 0-gv0-server: fd cleanup on /git/ks/gM/_tmp_Y8a2fAoft1r14R_fVc-UTMs=-026718122/config
[2014-05-30 03:56:52.745088] I [server-helpers.c:291:do_fd_cleanup] 0-gv0-server: fd cleanup on /git/Cw/o-/_tmp_40mBnMCmLUj_fZfpTnk3hXs=-746944916/refs
[2014-05-30 03:56:52.745101] E [client_t.c:384:gf_client_unref] (-->/usr/lib64/glusterfs/3.5qa2/xlator/features/locks.so(pl_flush_cbk+0xb9) [0x7f4d5c15d3b9] (-->/usr/lib64/libglusterfs.so.0(default_flush_cbk+0xb9) [0x7f4d68adb699] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_flush_cbk+0xed) [0x7f4d56b9952d]))) 0-client_t: client is NULL
[2014-05-30 03:56:52.745123] I [server-helpers.c:291:do_fd_cleanup] 0-gv0-server: fd cleanup on /git/u6/X_/oA1rqlC7cUsMD81_6MN5ogo=/objects/pack/pack-2527e3670442207f7b23bdea5fc2dfe7359bf7ba.pack
[2014-05-30 03:56:52.745137] E [client_t.c:384:gf_client_unref] (-->/usr/lib64/glusterfs/3.5qa2/xlator/features/locks.so(pl_flush_cbk+0xb9) [0x7f4d5c15d3b9] (-->/usr/lib64/libglusterfs.so.0(default_flush_cbk+0xb9) [0x7f4d68adb699] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_flush_cbk+0xed) [0x7f4d56b9952d]))) 0-client_t: client is NULL
[2014-05-30 03:56:52.745144] I [server-helpers.c:291:do_fd_cleanup] 0-gv0-server: fd cleanup on /git/h7/Nz/_tmp_Lo7MT0i_Ul-79_3ZvbF5mMo=-606962255/hooks/applypatch-msg.sample
[2014-05-30 03:56:52.745157] E [client_t.c:384:gf_client_unref] (-->/usr/lib64/glusterfs/3.5qa2/xlator/features/locks.so(pl_flush_cbk+0xb9) [0x7f4d5c15d3b9] (-->/usr/lib64/libglusterfs.so.0(default_flush_cbk+0xb9) [0x7f4d68adb699] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_flush_cbk+0xed) [0x7f4d56b9952d]))) 0-client_t: client is NULL
[2014-05-30 03:56:52.745164] I [server-helpers.c:291:do_fd_cleanup] 0-gv0-server: fd cleanup on /git/u6/X_/oA1rqlC7cUsMD81_6MN5ogo=/objects/pack/pack-2527e3670442207f7b23bdea5fc2dfe7359bf7ba.idx
[2014-05-30 03:56:52.745181] E [client_t.c:384:gf_client_unref] (-->/usr/lib64/glusterfs/3.5qa2/xlator/features/locks.so(pl_flush_cbk+0xb9) [0x7f4d5c15d3b9] (-->/usr/lib64/libglusterfs.so.0(default_flush_cbk+0xb9) [0x7f4d68adb699] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_flush_cbk+0xed) [0x7f4d56b9952d]))) 0-client_t: client is NULL
[2014-05-30 03:56:52.745197] E [client_t.c:384:gf_client_unref] (-->/usr/lib64/glusterfs/3.5qa2/xlator/features/locks.so(pl_flush_cbk+0xb9) [0x7f4d5c15d3b9] (-->/usr/lib64/libglusterfs.so.0(default_flush_cbk+0xb9) [0x7f4d68adb699] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_flush_cbk+0xed) [0x7f4d56b9952d]))) 0-client_t: client is NULL
[2014-05-30 03:56:52.745221] E [client_t.c:384:gf_client_unref] (-->/usr/lib64/glusterfs/3.5qa2/xlator/features/locks.so(pl_flush_cbk+0xb9) [0x7f4d5c15d3b9] (-->/usr/lib64/libglusterfs.so.0(default_flush_cbk+0xb9) [0x7f4d68adb699] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_flush_cbk+0xed) [0x7f4d56b9952d]))) 0-client_t: client is NULL
[2014-05-30 03:56:52.745250] E [client_t.c:384:gf_client_unref] (-->/usr/lib64/glusterfs/3.5qa2/xlator/features/locks.so(pl_flush_cbk+0xb9) [0x7f4d5c15d3b9] (-->/usr/lib64/libglusterfs.so.0(default_flush_cbk+0xb9) [0x7f4d68adb699] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_flush_cbk+0xed) [0x7f4d56b9952d]))) 0-client_t: client is NULL
[2014-05-30 03:56:52.745407] E [client_t.c:384:gf_client_unref] (-->/usr/lib64/glusterfs/3.5qa2/xlator/features/locks.so(pl_flush_cbk+0xb9) [0x7f4d5c15d3b9] (-->/usr/lib64/libglusterfs.so.0(default_flush_cbk+0xb9) [0x7f4d68adb699] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_flush_cbk+0xed) [0x7f4d56b9952d]))) 0-client_t: client is NULL
[2014-05-30 03:56:52.745444] E [client_t.c:384:gf_client_unref] (-->/usr/lib64/glusterfs/3.5qa2/xlator/features/locks.so(pl_flush_cbk+0xb9) [0x7f4d5c15d3b9] (-->/usr/lib64/libglusterfs.so.0(default_flush_cbk+0xb9) [0x7f4d68adb699] (-->/usr/lib64/glusterfs/3.5qa2/xlator/debug/io-stats.so(io_stats_flush_cbk+0xed) [0x7f4d56b9952d]))) 0-client_t: client is NULL
[2014-05-30 03:57:21.704523] I [server.c:519:server_rpc_notify] 0-gv0-server: disconnecting connectionfrom ip-10-220-14-24-2534-2014/05/30-00:04:37:888277-gv0-client-0-0-1
[2014-05-30 03:57:21.704583] I [client_t.c:294:gf_client_put] 0-gv0-server: Shutting down connection ip-10-220-14-24-2534-2014/05/30-00:04:37:888277-gv0-client-0-0-1

pending frames:
frame : type(0) op(12)
frame : type(0) op(12)
frame : type(0) op(12)
frame : type(0) op(12)
frame : type(0) op(12)
frame : type(0) op(12)
frame : type(0) op(29)
frame : type(0) op(30)
frame : type(0) op(30)
frame : type(0) op(30)
frame : type(0) op(30)
frame : type(0) op(30)
frame : type(0) op(30)
frame : type(0) op(30)
frame : type(0) op(30)
frame : type(0) op(30)
frame : type(0) op(30)
frame : type(0) op(30)
frame : type(0) op(29)
frame : type(0) op(29)
frame : type(0) op(30)
frame : type(0) op(29)
frame : type(0) op(30)

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2014-05-30 03:57:21
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.5qa2
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7f4d68acdcb6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x7f4d68ae801f]
/lib64/libc.so.6(+0x33fc0)[0x7f4d67b05fc0]
/usr/lib64/libglusterfs.so.0(uuid_unpack+0x0)[0x7f4d68b08c00]
/usr/lib64/libglusterfs.so.0(+0x5a8d6)[0x7f4d68b088d6]
/usr/lib64/libglusterfs.so.0(uuid_utoa+0x37)[0x7f4d68ae5727]
/usr/lib64/glusterfs/3.5qa2/xlator/features/locks.so(pl_inodelk_client_cleanup+0x4d8)[0x7f4d5c166a58]
/usr/lib64/glusterfs/3.5qa2/xlator/features/locks.so(+0x6f3a)[0x7f4d5c15bf3a]
/usr/lib64/libglusterfs.so.0(gf_client_disconnect+0x5d)[0x7f4d68b211dd]
/usr/lib64/glusterfs/3.5qa2/xlator/protocol/server.so(server_connection_cleanup+0x448)[0x7f4d5695fd48]
/usr/lib64/glusterfs/3.5qa2/xlator/protocol/server.so(server_rpc_notify+0x17c)[0x7f4d5695be2c]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_disconnect+0x105)[0x7f4d6889c195]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x1a0)[0x7f4d6889dc70]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x7f4d6889f678]
/usr/lib64/glusterfs/3.5qa2/rpc-transport/socket.so(+0xa971)[0x7f4d5e3f1971]
/usr/lib64/libglusterfs.so.0(+0x755a7)[0x7f4d68b235a7]
/usr/sbin/glusterfsd(main+0x603)[0x407e93]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f4d67af37d5]
/usr/sbin/glusterfsd[0x4049a9]
---------

# cat /etc/glusterfs/glusterd.vol
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket,rdma
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
    option transport.socket.read-fail-log off
    option ping-timeout 30
#   option base-port 49152
end-volume

and `gluster volume set gv0 performance.write-behind off` (so that hg clone works).

Version-Release number of selected component (if applicable):
3.5qa2, from http://download.gluster.org/pub/gluster/glusterfs/nightly/glusterfs/epel-6-x86_64/glusterfs-3.5qa2-0.544.git58b9ede.autobuild/glusterfs-server-3.5qa2-0.544.git58b9ede.el6.x86_64.rpm

How reproducible:
Occurs 100% of the time we run this load against the FS.

Steps to Reproduce:
1. Create a 2-4 node cluster with the version specified above:
   yum install -y \
     http://download.gluster.org/pub/gluster/glusterfs/nightly/glusterfs/epel-6-x86_64/glusterfs-3.5qa2-0.544.git58b9ede.autobuild/glusterfs-server-3.5qa2-0.544.git58b9ede.el6.x86_64.rpm \
     http://download.gluster.org/pub/gluster/glusterfs/nightly/glusterfs/epel-6-x86_64/glusterfs-3.5qa2-0.544.git58b9ede.autobuild/glusterfs-libs-3.5qa2-0.544.git58b9ede.el6.x86_64.rpm \
     http://download.gluster.org/pub/gluster/glusterfs/nightly/glusterfs/epel-6-x86_64/glusterfs-3.5qa2-0.544.git58b9ede.autobuild/glusterfs-fuse-3.5qa2-0.544.git58b9ede.el6.x86_64.rpm \
     http://download.gluster.org/pub/gluster/glusterfs/nightly/glusterfs/epel-6-x86_64/glusterfs-3.5qa2-0.544.git58b9ede.autobuild/glusterfs-cli-3.5qa2-0.544.git58b9ede.el6.x86_64.rpm \
     http://download.gluster.org/pub/gluster/glusterfs/nightly/glusterfs/epel-6-x86_64/glusterfs-3.5qa2-0.544.git58b9ede.autobuild/glusterfs-3.5qa2-0.544.git58b9ede.el6.x86_64.rpm
2. Run many git clone operations concurrently against the FS.
3. Observe the crash in `gluster v status MYVOL` (some of the bricks go offline).

Actual results:
glusterd services crash with the above stacktrace.

Expected results:
glusterd servers do not crash.

Additional info:
# uname -a
Linux ip-10-213-14-6 3.10.35-43.137.amzn1.x86_64 #1 SMP Wed Apr 2 09:36:59 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Server is an AWS EC2 c3.xlarge with bricks on an EBS IO-optimized 800GB volume.

Potentially related issues?
https://bugzilla.redhat.com/show_bug.cgi?id=885008
https://bugzilla.redhat.com/show_bug.cgi?id=1097102
https://bugzilla.redhat.com/show_bug.cgi?id=1090837
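For reference, the concurrent-clone load in step 2 can be sketched as a small shell helper. This is only a sketch of the reported workload: the mount point, repository URL, and clone count below are placeholders, not values taken from this report.

```shell
#!/bin/sh
# clone_load N MOUNT URL -- start N concurrent `git clone`s of URL into
# distinct directories under MOUNT, then wait for all of them to finish.
# The report describes roughly 5-10 concurrent git/hg clones sustained
# until the crash appears (typically within 15-60 minutes).
clone_load() {
    n=$1 mount=$2 url=$3
    i=1
    while [ "$i" -le "$n" ]; do
        git clone "$url" "$mount/clone-$$-$i" &
        i=$((i + 1))
    done
    wait  # block until every background clone has finished (or failed)
}
```

A run might look like `clone_load 10 /mnt/gv0 git://git.gluster.com/glusterfs.git`, repeated in an outer loop for the duration of the test.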
Hi Quinn,

I tried recreating the crash with the steps suggested in the Description, but to no avail. Here's what I did:
1. Created a 2x2 distributed-replicate volume, started it, and disabled write-behind.
2. Created about 10 different fuse mounts of the same volume on 10 different mountpoints.
3. Simultaneously started 'git clone' of glusterfs into 10 different directories, one from each mountpoint (something like 'git clone' into dir1 from /mnt1, 'git clone' into dir2 from /mnt2, ... 'git clone' into dir10 from /mnt10).
4. Kept monitoring the volume's status to look for crashes.

All 10 instances of 'git clone' ran to completion with no crashes in between. Is this the exact set of steps you said would lead to the crash, or am I missing something here?
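The monitoring in step 4 could be automated with a helper along these lines. This is a sketch under an assumption about the output format: it treats the Online column (Y/N) of `gluster volume status` as the second-to-last field, just before Pid, which matches 3.5-era output but should be verified against your build.

```shell
#!/bin/sh
# brick_offline VOL -- succeed (exit 0) if any brick row in
# `gluster volume status VOL` shows "N" in the Online column.
# Assumption: Online is the second-to-last field and Pid (digits or "-")
# is the last field of a brick row; adjust if your CLI output differs.
brick_offline() {
    gluster volume status "$1" 2>/dev/null |
        awk '$(NF-1) == "N" && $NF ~ /^([0-9]+|-)$/ { found = 1 }
             END { exit !found }'
}
```

Polling it in a loop, e.g. `while sleep 60; do brick_offline gv0 && echo "brick down at $(date)"; done`, gives a timestamp for when a brick first drops offline under load.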
Hi Krutika,

Thanks for trying to repro. That sounds correct, but our environment essentially does that repeatedly (with more concurrent operations) for many hours. We typically see the crash occur within 15-60 minutes of beginning this.

Some more information:
* This particular crash does not occur on 3.5.0-2.el6.x86_64 (in an otherwise identical environment, with the same types of operations being performed against the GlusterFS cluster). We did see another crash there, in __get_entrylk_count, but I haven't filed an issue about that one because I didn't think it was related.
* Under light load (1 clone every 10 minutes, mostly read operations), this crash does not occur (at least for the several days we've been running it).

Would it be helpful if I extracted our deployment scripts into a standalone repro script? It would launch some EC2 instances using AWS CloudFormation. I could even boot it up and give you the ssh private key after I've confirmed that it exhibits the issue.
Thanks, Quinn, for that input. A repro script is definitely what we need. Would it be possible for you to attach a modified version of the repro script you're talking about, so that I can run it on my local VMs (as opposed to an EC2 instance) to recreate the bug? If not, could you mail your script to me privately (since this bug is public and your script might contain sensitive information) at kdhananj ?
Yes, I'll see if I can extract it to run on any VMs. Otherwise, I'll send you an email. It will probably take me about a week to get to this. Thanks!
Hi Quinn,

I recently sent a fix for this bug for review at http://review.gluster.org/#/c/7981/

Meanwhile, could you download the rpms that contain this fix from the following locations and let me know whether it fixes the crash for you, with the repro you have?
http://build.gluster.org/job/glusterfs-devrpms-el6/1586/artifact/
http://build.gluster.org/job/glusterfs-devrpms/1613/artifact/

Thanks,
Krutika
The patch that fixes this issue is merged in master and awaiting review in release-3.5: http://review.gluster.org/#/c/7981/. Moving the state of the bug to MODIFIED.
Hi Krutika,

This appears to have fixed it. Thank you!

It looks like those Jenkins rpm links you sent have been deleted, since it took me a few days to get to this. However, I am using the ones from http://build.gluster.org/job/glusterfs-devrpms-el6/1740/, which is a build of a commit that appears to include your commit.

I ran the repro and have not seen the problem reoccur yet. It is well beyond the point where the problem would have reoccurred (virtually with certainty), so it does appear to be fixed.
(In reply to Quinn Slack from comment #7)
> Hi Krutika,
>
> This appears to have fixed it. Thank you!
>
> It looks like those Jenkins rpm links you sent have been deleted, since it
> took me a few days to get to this. However, I am using the ones from
> http://build.gluster.org/job/glusterfs-devrpms-el6/1740/, which is a build
> of a commit that appears to include your commit.
>
> I ran the repro and have not seen the problem reoccur yet. It is well beyond
> the point where the problem would have reoccurred (virtually with
> certainty), so it does appear to be fixed.

That is awesome, Quinn! Good to know. :) Thanks a lot for testing it out.

-Krutika
I am running 3.5.0 GA and still getting this error in quotad.log. Is the patch part of the GA, or should it be in 3.5.1?
(In reply to Peter Auyeung from comment #9)
> I am running 3.5.0 GA and still getting this error on quotad.log
> is the patch part of the GA or should be in 3.5.1?

Peter,

This bug is about a brick process (glusterfsd) crash seen in 3.5.0, whose fix is available in 3.5.1. I am not sure what error you are talking about. Could you check whether there's already a bug for the problem you described and post your quotad.log there? If not, you could raise a new bug to track the issue.

Thanks,
Krutika