Bug 761853 (GLUSTER-121)

Summary: Client Crashes Syncing Lots of Small Files
Product: [Community] GlusterFS
Component: replicate
Reporter: Jonathan Steffan <jonathansteffan>
Assignee: Vikas Gorur <vikas>
Status: CLOSED WORKSFORME
Severity: medium
Priority: low
Version: 2.0.3
CC: amarts, avati, cory.meyer, gluster-bugs, jonathansteffan, shehjart, vikas
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix

Description Anand Avati 2009-07-08 17:46:59 UTC
Jonathan,
   Can you retry the operation without the filter translator and let us know if you still face the issue? We are trying to understand whether the cause of the problem is associated with the filter translator.

Avati

Comment 1 Jonathan Steffan 2009-07-08 20:08:00 UTC
Running an rsync of ~20 million small files crashes the client. The client sits there at 100% CPU. strace reveals:

futex(0xcd32a70, FUTEX_WAIT, 2, NULL

The client layout is:

testing/features/filter -> cluster/distribute -> cluster/replicate -> protocol/client -> protocol/server -> performance/io-cache -> performance/io-threads -> features/locks -> storage/posix

The server does not crash/lock up, the client does.

Comment 2 Jonathan Steffan 2009-07-08 21:42:56 UTC
I will test that tomorrow. Right now I am running:

for i in {1..1000000}; do ls -alR /gluster/mount; done

against ~2TB of small files, a workload that has crashed the client in the past. If this passes, I'll run the rsyncs again with the filter translator removed.

Comment 3 Jonathan Steffan 2009-07-09 12:28:14 UTC
The recursive ls also caused the issue. I've removed the filter translator and am running the ls test again.

Comment 4 Jonathan Steffan 2009-07-09 15:56:09 UTC
After removing the filter translator, we still see 100% CPU usage on the client, and the filesystem is no longer responding. strace reveals:

epoll_wait(3,

on the servers and client.

Comment 5 Jonathan Steffan 2009-07-09 15:59:13 UTC
Restarting the servers does not recover the client. I'm going to do a test with the following layout:

testing/features/filter -> cluster/distribute -> cluster/replicate ->
protocol/client -> protocol/server -> features/locks -> storage/posix

Comment 6 Jonathan Steffan 2009-07-09 16:02:04 UTC
The attempted restart of glusterfsd resulted in a defunct process on both servers.

Comment 7 Jonathan Steffan 2009-07-09 17:08:03 UTC
"testing/features/filter -> cluster/distribute -> cluster/replicate ->
protocol/client -> protocol/server -> features/locks -> storage/posix" still causes the client to stop responding when running the for loop listing content.

Additionally, if the client process is killed before the servers are restarted, this does not result in a defunct process.

Comment 8 Anand Avati 2009-07-09 17:58:25 UTC
> Additionally, if the client process is killed before the servers are restarted
> this does not result in a defunct process.

Can you get us gdb backtraces of all the process threads while it is in the high-CPU state, or while it is hung?

Avati

Comment 9 Jonathan Steffan 2009-07-09 21:14:46 UTC
I just found this in DEBUG logging:

[..... running .....]

[2009-07-09 18:11:36] D [inode.c:572:__inode_replace] fuse/inode: inode(428478) replaced (428478
[2009-07-09 18:11:37] D [inode.c:572:__inode_replace] fuse/inode: inode(428482) replaced (428482
[2009-07-09 18:11:37] D [inode.c:572:__inode_replace] fuse/inode: inode(428249) replaced (428249
[2009-07-09 18:11:37] D [inode.c:572:__inode_replace] fuse/inode: inode(428481) replaced (428481
[2009-07-09 18:11:37] D [inode.c:572:__inode_replace] fuse/inode: inode(2148348676) replaced (2148348676
[2009-07-09 18:11:37] D [inode.c:572:__inode_replace] fuse/inode: inode(428493) replaced (428493
[2009-07-09 18:11:37] D [inode.c:572:__inode_replace] fuse/inode: inode(2148348682) replaced (2148348682
[2009-07-09 18:11:37] D [inode.c:572:__inode_replace] fuse/inode: inode(428499) replaced (428499
[2009-07-09 18:11:37] D [client-protocol.c:4746:client_lookup_cbk] iops_lun_1: LOOKUP 2188059948/6000 (/path/to/dir): inode number changed from 428499 to 428244
[2009-07-09 18:11:37] D [client-protocol.c:4746:client_lookup_cbk] iops_lun_0: LOOKUP 2188059948/6000 (/path/to/dir): inode number changed from 428787 to 428499
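
The last two LOOKUP callbacks above show the replicate subvolumes returning different inode numbers for the same path. A toy sketch of the kind of consistency check that produces such a message (hypothetical names and logic, not the GlusterFS implementation):

```python
# Toy sketch of a lookup consistency check (hypothetical, not the
# GlusterFS code): each subvolume reports an inode number for a path,
# and a mismatch against the previously seen value is flagged.

def check_lookup(cache, subvol, path, ino):
    """Record ino for (subvol, path); return False if it changed."""
    key = (subvol, path)
    old = cache.get(key)
    cache[key] = ino
    if old is not None and old != ino:
        print(f"{subvol}: LOOKUP {path}: inode number changed "
              f"from {old} to {ino}")
        return False
    return True

cache = {}
check_lookup(cache, "iops_lun_1", "/path/to/dir", 428499)
ok = check_lookup(cache, "iops_lun_1", "/path/to/dir", 428244)  # mismatch
```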

Comment 10 Jonathan Steffan 2009-07-09 21:16:02 UTC
Sorry, after the client_lookup_cbk calls the process goes to 100% and the fs stops responding. This looks like an issue with cluster/replicate.

Comment 11 Anand Avati 2009-07-09 21:34:57 UTC
(In reply to comment #10)
> Sorry, after the client_lookup_cbk calls the process goes to 100% and the fs
> stops responding. This looks like an issue with cluster/replicate.

Jonathan,
  We would really be interested in seeing a backtrace of all the threads while the FS is at 100% CPU. Could you do that for us, or give us remote login access to your setup so that we can inspect it in gdb?

Comment 12 Jonathan Steffan 2009-07-09 23:11:43 UTC
I have just verified it looks like an issue with cluster/replicate. I removed the second subvolume from the cluster/replicate volume and have not experienced the issue running the same ls loop. gdb on the way (I'm going to let this test run a little while longer.)

Comment 13 Jonathan Steffan 2009-07-10 15:36:25 UTC
Having run the test(s) successfully with only one cluster/replicate subvolume, I'm unable to reproduce this issue anymore. I'm going to start another rsync against this volume, the workload which caused/exposed the issue, and hopefully it re-creates it. There is another user on IRC who can reproduce the issue; I will prod them to add comments and hopefully get a gdb backtrace for us (assuming I cannot reproduce the situation myself).

Comment 14 cory.meyer 2009-07-10 17:01:54 UTC
I was also able to reproduce the issue under GlusterFS 2.0.2 using the following volfile. Currently attempting to reproduce it again and generate a gdb backtrace.


volume one01.brick01
 type protocol/client
 option transport-type tcp/client
 option remote-host 192.168.2.101
 option remote-subvolume brick01
end-volume

volume two01.brick01
 type protocol/client
 option transport-type tcp/client
 option remote-host 192.168.1.101
 option remote-subvolume brick01
end-volume

volume afr01
 type cluster/afr
 option read-subvolume one01.brick01
 subvolumes one01.brick01 two01.brick01
end-volume

## Add writeback feature
volume write-behind_afr01
  type performance/write-behind
  # option aggregate-size 1MB
  option cache-size 3MB
  option flush-behind on
  subvolumes afr01
end-volume

### Add IO-Cache feature
volume iocache_afr01
  type performance/io-cache
  option cache-size 64MB             # default is 32MB
  option page-size 1MB               #128KB is default option
  option force-revalidate-timeout 2  # default is 1
  subvolumes write-behind_afr01
end-volume

### Add readahead feature
volume readahead_afr01
  type performance/read-ahead
  option page-count 16      # cache per file  = (page-count x page-size)
  subvolumes iocache_afr01
end-volume

Comment 15 Jonathan Steffan 2009-07-13 12:00:09 UTC
[Switching to Thread 47931321951968 (LWP 31120)]
0x00002b97e114f888 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00002b97e114f888 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
#1  0x00002b97e114b8a5 in _L_mutex_lock_107 () from /lib64/libpthread.so.0
#2  0x00002b97e114b333 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00002b97e0d146ac in inode_unref (inode=0x20015ab0) at inode.c:374
#4  0x00002b97e0d1dc48 in fd_unref (fd=0x2aaab17d2e50) at fd.c:461
#5  0x00002b97e2154363 in afr_sh_data_flush_cbk (frame=0x20a25680, cookie=<value optimized out>, this=0x1ff52c50, op_ret=<value optimized out>, 
    op_errno=<value optimized out>) at afr-self-heal-data.c:99
#6  0x00002b97e1f1df47 in client_flush_cbk (frame=0x2aaaac41dc20, hdr=0x2aaaacaa0b70, hdrlen=<value optimized out>, iobuf=<value optimized out>) at client-protocol.c:4578
#7  0x00002b97e1f1d8aa in protocol_client_pollin (this=0x1ff52890, trans=0x1ff57630) at client-protocol.c:6169
#8  0x00002b97e1f24372 in notify (this=0x1ff58a70, event=2, data=0x1ff57630) at client-protocol.c:6213
#9  0x00002aaaaaaafdeb in socket_event_handler (fd=<value optimized out>, idx=2, data=0x1ff57630, poll_in=1, poll_out=0, poll_err=0) at socket.c:814
#10 0x00002b97e0d1f5a5 in event_dispatch_epoll (event_pool=0x1ff4d5e0) at event.c:804
#11 0x0000000000403b59 in main (argc=6, argv=0x7fffc9fcc978) at glusterfsd.c:1226

Comment 16 Jonathan Steffan 2009-07-13 12:13:38 UTC
(gdb) backtrace full
#0  0x00002b97e114f888 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00002b97e114b8a5 in _L_mutex_lock_107 () from /lib64/libpthread.so.0
No symbol table info available.
#2  0x00002b97e114b333 in pthread_mutex_lock () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00002b97e0d146ac in inode_unref (inode=0x20015ab0) at inode.c:374
	table = (inode_table_t *) 0x1ff58a70
#4  0x00002b97e0d1dc48 in fd_unref (fd=0x2aaab17d2e50) at fd.c:461
	refcount = <value optimized out>
	__FUNCTION__ = "fd_unref"
#5  0x00002b97e2154363 in afr_sh_data_flush_cbk (frame=0x20a25680, cookie=<value optimized out>, this=0x1ff52c50, op_ret=<value optimized out>, op_errno=<value optimized out>) at afr-self-heal-data.c:99
	sh = (afr_self_heal_t *) 0x20cd21d8
	call_count = -4
#6  0x00002b97e1f1df47 in client_flush_cbk (frame=0x2aaaac41dc20, hdr=0x2aaaacaa0b70, hdrlen=<value optimized out>, iobuf=<value optimized out>) at client-protocol.c:4578
	fn = (ret_fn_t) 0xe87991bc
	_parent = (call_frame_t *) 0x1ff58a70
	op_errno = -1332406208
#7  0x00002b97e1f1d8aa in protocol_client_pollin (this=0x1ff52890, trans=0x1ff57630) at client-protocol.c:6169
	conf = (client_conf_t *) 0x1ff57190
	ret = 0
	iobuf = (struct iobuf *) 0x0
	hdr = 0x2aaaacaa0b70 ""
	hdrlen = 32
#8  0x00002b97e1f24372 in notify (this=0x1ff58a70, event=2, data=0x1ff57630) at client-protocol.c:6213
	handshake = <value optimized out>
	ret = <value optimized out>
	child_down = <value optimized out>
	was_not_down = <value optimized out>
	trans = (transport_t *) 0x0
	conn = <value optimized out>
	conf = (client_conf_t *) 0x1ff57190
	parent = <value optimized out>
	__FUNCTION__ = "notify"
#9  0x00002aaaaaaafdeb in socket_event_handler (fd=<value optimized out>, idx=2, data=0x1ff57630, poll_in=1, poll_out=0, poll_err=0) at socket.c:814
	this = (transport_t *) 0x1ff58a70
	priv = (socket_private_t *) 0x1ff57a50
	ret = -4
#10 0x00002b97e0d1f5a5 in event_dispatch_epoll (event_pool=0x1ff4d5e0) at event.c:804
	events = (struct epoll_event *) 0x1ff5a100
	i = 3
	ret = 4
	__FUNCTION__ = "event_dispatch_epoll"
#11 0x0000000000403b59 in main (argc=6, argv=0x7fffc9fcc978) at glusterfsd.c:1226
	ctx = <value optimized out>
	cmd_args = <value optimized out>
	stbuf = {st_dev = 0, st_ino = 47931309588925, st_nlink = 0, st_mode = 1243904375, st_uid = 0, st_gid = 0, pad0 = 0, st_rdev = 47931309596326, st_size = 140736582173824, st_blksize = 47931309588864, st_blocks = 140736582173887, st_atim = {tv_sec = 140736582173872, tv_nsec = 140736582173864}, st_mtim = {tv_sec = 47931311753976, tv_nsec = 1}, st_ctim = {tv_sec = 0, tv_nsec = 4198641}, __unused = {47931318545606, 140736582173936, 47931321924464}}
	tmp_logfile = '\0' <repeats 1023 times>
	timestr = '\0' <repeats 255 times>
	utime = 1247351137
	tm = <value optimized out>
	ret = <value optimized out>
	lim = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
	specfp = <value optimized out>
	graph = (xlator_t *) 0x0
	trav = (xlator_t *) 0x1ff4c060
	fuse_volume_found = 0
	xl_count = 0
	__FUNCTION__ = "main"
(gdb) info registers
rax            0xfffffffffffffffc	-4
rbx            0x1ff58a70	536185456
rcx            0xffffffffffffffff	-1
rdx            0x2	2
rsi            0x0	0
rdi            0x1ff58a70	536185456
rbp            0x7990	0x7990
rsp            0x7fffc9fcb998	0x7fffc9fcb998
r8             0x2aaab0951840	46912595368000
r9             0xe87991bc	3900281276
r10            0x0	0
r11            0x202	514
r12            0x1ff58a70	536185456
r13            0x1ff52c50	536161360
r14            0x0	0
r15            0x1	1
rip            0x2b97e114f888	0x2b97e114f888 <__lll_mutex_lock_wait+24>
eflags         0x202	[ IF ]
cs             0x33	51
ss             0x2b	43
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0
(gdb) thread apply all backtrace

Thread 3 (Thread 1092397376 (LWP 31121)):
#0  0x00002b97e13f6611 in nanosleep () from /lib64/libc.so.6
#1  0x00002b97e1429b24 in usleep () from /lib64/libc.so.6
#2  0x00002b97e0d12dd4 in gf_timer_proc (ctx=0x1ff4c010) at timer.c:177
#3  0x00002b97e11492f7 in start_thread () from /lib64/libpthread.so.0
#4  0x00002b97e142fe3d in clone () from /lib64/libc.so.6

Thread 2 (Thread 1112729920 (LWP 31126)):
#0  __dentry_search_arbit (inode=0x21bc4e80) at inode.c:798
#1  0x00002b97e0d1495e in inode_path (inode=0x20de66d0, name=0x2b97e1e4e028 "60", bufp=0x4252e070) at inode.c:859
#2  0x00002b97e278dd96 in fuse_loc_fill (loc=0x217c7938, state=0x217c7920, ino=0, par=4974744968, name=0x2b97e1e4e028 "60") at fuse-bridge.c:248
#3  0x00002b97e2790de5 in fuse_lookup (req=0x20386510, par=4974744968, name=0x2b97e1e4e028 "60") at fuse-bridge.c:423
#4  0x00002b97e2792f6d in fuse_thread_proc (data=<value optimized out>) at fuse-bridge.c:2480
#5  0x00002b97e11492f7 in start_thread () from /lib64/libpthread.so.0
#6  0x00002b97e142fe3d in clone () from /lib64/libc.so.6

Thread 1 (Thread 47931321951968 (LWP 31120)):
#0  0x00002b97e114f888 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
#1  0x00002b97e114b8a5 in _L_mutex_lock_107 () from /lib64/libpthread.so.0
#2  0x00002b97e114b333 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00002b97e0d146ac in inode_unref (inode=0x20015ab0) at inode.c:374
#4  0x00002b97e0d1dc48 in fd_unref (fd=0x2aaab17d2e50) at fd.c:461
#5  0x00002b97e2154363 in afr_sh_data_flush_cbk (frame=0x20a25680, cookie=<value optimized out>, this=0x1ff52c50, op_ret=<value optimized out>, op_errno=<value optimized out>) at afr-self-heal-data.c:99
#6  0x00002b97e1f1df47 in client_flush_cbk (frame=0x2aaaac41dc20, hdr=0x2aaaacaa0b70, hdrlen=<value optimized out>, iobuf=<value optimized out>) at client-protocol.c:4578
#7  0x00002b97e1f1d8aa in protocol_client_pollin (this=0x1ff52890, trans=0x1ff57630) at client-protocol.c:6169
#8  0x00002b97e1f24372 in notify (this=0x1ff58a70, event=2, data=0x1ff57630) at client-protocol.c:6213
#9  0x00002aaaaaaafdeb in socket_event_handler (fd=<value optimized out>, idx=2, data=0x1ff57630, poll_in=1, poll_out=0, poll_err=0) at socket.c:814
#10 0x00002b97e0d1f5a5 in event_dispatch_epoll (event_pool=0x1ff4d5e0) at event.c:804
#11 0x0000000000403b59 in main (argc=6, argv=0x7fffc9fcc978) at glusterfsd.c:1226
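
Thread 1 above is blocked in pthread_mutex_lock inside inode_unref, called from fd_unref in the self-heal flush callback: a callback that needs a non-recursive lock the calling context already holds will block forever. A minimal, standalone illustration of that self-deadlock pattern (not GlusterFS code; a timeout is used so the demo itself does not hang):

```python
import threading

# Minimal illustration of the self-deadlock seen in the backtrace:
# a callback tries to take a non-recursive lock that its caller
# already holds. A real pthread_mutex_lock would block forever; here
# acquire(timeout=...) lets the demo report the failure instead.

table_lock = threading.Lock()

def unref(results):
    # Stand-in for inode_unref(): needs the table lock.
    got = table_lock.acquire(timeout=0.2)
    results.append(got)
    if got:
        table_lock.release()

results = []
with table_lock:       # caller already holds the lock...
    unref(results)     # ...so the inner acquire cannot succeed
unref(results)         # after release, the same call succeeds

print(results)         # [False, True]
```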

Comment 17 Jonathan Steffan 2009-07-13 15:11:59 UTC
(gdb) thread apply all backtrace full

Thread 3 (Thread 1109051712 (LWP 4066)):
#0  0x00002ab177e25611 in nanosleep () from /lib64/libc.so.6
No symbol table info available.
#1  0x00002ab177e58b24 in usleep () from /lib64/libc.so.6
No symbol table info available.
#2  0x00002ab177741dd4 in gf_timer_proc (ctx=0x24f0010) at timer.c:177
	now = 1247508494661553
	now_tv = {tv_sec = 1247508494, tv_usec = 661553}
	event = (gf_timer_t *) 0x24feef0
	reg = (gf_timer_registry_t *) 0x24f99d0
	__FUNCTION__ = "gf_timer_proc"
#3  0x00002ab177b782f7 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4  0x00002ab177e5ee3d in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 2 (Thread 1119541568 (LWP 4071)):
#0  __dentry_search_arbit (inode=0x2aaac09e1010) at inode.c:799
	dentry = (dentry_t *) 0x0
	trav = (dentry_t *) 0x2aaabcb18600
#1  0x00002ab17774395e in inode_path (inode=0x2aaac09e1010, name=0x2ab1788bf028 "60", bufp=0x42bad070) at inode.c:859
	table = (inode_table_t *) 0x24fca70
	trav = (dentry_t *) 0x2aaabcb1b0a0
	i = 121758588324
	size = <value optimized out>
	ret = <value optimized out>
	len = <value optimized out>
	buf = <value optimized out>
	__FUNCTION__ = "inode_path"
#2  0x00002ab1791bcd96 in fuse_loc_fill (loc=0x2aaac09dd6c8, state=0x2aaac09dd6b0, ino=0, par=3780234481, name=0x2ab1788bf028 "60") at fuse-bridge.c:248
	inode = (inode_t *) 0x0
	parent = (inode_t *) 0x2aaac09e1010
	ret = <value optimized out>
	path = 0x0
	__FUNCTION__ = "fuse_loc_fill"
#3  0x00002ab1791bfde5 in fuse_lookup (req=0x2aaabcaeff80, par=3780234481, name=0x2ab1788bf028 "60") at fuse-bridge.c:423
	state = (fuse_state_t *) 0x1
	ret = <value optimized out>
	__FUNCTION__ = "fuse_lookup"
#4  0x00002ab1791c1f6d in fuse_thread_proc (data=<value optimized out>) at fuse-bridge.c:2480
	mount_point = <value optimized out>
	this = (xlator_t *) 0x24f1b20
	priv = (fuse_private_t *) 0x24fcb40
	res = 43
	iobuf = (struct iobuf *) 0x24f15a8
	chan_size = 135168
	__FUNCTION__ = "fuse_thread_proc"
#5  0x00002ab177b782f7 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#6  0x00002ab177e5ee3d in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 46941711767264 (LWP 4065)):
#0  0x00002ab177e5f228 in epoll_wait () from /lib64/libc.so.6
No symbol table info available.
#1  0x00002ab17774e4f2 in event_dispatch_epoll (event_pool=0x24f15e0) at event.c:850
	events = (struct epoll_event *) 0x24fe100
	i = 1
	ret = 1
	__FUNCTION__ = "event_dispatch_epoll"
#2  0x0000000000403b59 in main (argc=6, argv=0x7fff3359cf48) at glusterfsd.c:1226
	ctx = <value optimized out>
	cmd_args = <value optimized out>
	stbuf = {st_dev = 0, st_ino = 46941699404221, st_nlink = 0, st_mode = 1243904375, st_uid = 0, st_gid = 0, pad0 = 0, st_rdev = 46941699411622, st_size = 140734054910544, st_blksize = 46941699404160, st_blocks = 140734054910607, st_atim = {tv_sec = 140734054910592, tv_nsec = 140734054910584}, st_mtim = {tv_sec = 46941701569272, tv_nsec = 1}, st_ctim = {tv_sec = 0, tv_nsec = 4198641}, __unused = {46941708360902, 140734054910656, 46941711739760}}
	tmp_logfile = '\0' <repeats 1023 times>
	timestr = '\0' <repeats 255 times>
	utime = 1247505826
	tm = <value optimized out>
	ret = <value optimized out>
	lim = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
	specfp = <value optimized out>
	graph = (xlator_t *) 0x0
	trav = (xlator_t *) 0x24f0060
	fuse_volume_found = 0
	xl_count = 0
	__FUNCTION__ = "main"
#0  0x00002ab177e5f228 in epoll_wait () from /lib64/libc.so.6
(gdb) thread apply all backtrace

Thread 3 (Thread 1109051712 (LWP 4066)):
#0  0x00002ab177e25611 in nanosleep () from /lib64/libc.so.6
#1  0x00002ab177e58b24 in usleep () from /lib64/libc.so.6
#2  0x00002ab177741dd4 in gf_timer_proc (ctx=0x24f0010) at timer.c:177
#3  0x00002ab177b782f7 in start_thread () from /lib64/libpthread.so.0
#4  0x00002ab177e5ee3d in clone () from /lib64/libc.so.6

Thread 2 (Thread 1119541568 (LWP 4071)):
#0  __dentry_search_arbit (inode=0x2aaac09e1010) at inode.c:799
#1  0x00002ab17774395e in inode_path (inode=0x2aaac09e1010, name=0x2ab1788bf028 "60", bufp=0x42bad070) at inode.c:859
#2  0x00002ab1791bcd96 in fuse_loc_fill (loc=0x2aaac09dd6c8, state=0x2aaac09dd6b0, ino=0, par=3780234481, name=0x2ab1788bf028 "60") at fuse-bridge.c:248
#3  0x00002ab1791bfde5 in fuse_lookup (req=0x2aaabcaeff80, par=3780234481, name=0x2ab1788bf028 "60") at fuse-bridge.c:423
#4  0x00002ab1791c1f6d in fuse_thread_proc (data=<value optimized out>) at fuse-bridge.c:2480
#5  0x00002ab177b782f7 in start_thread () from /lib64/libpthread.so.0
#6  0x00002ab177e5ee3d in clone () from /lib64/libc.so.6

Thread 1 (Thread 46941711767264 (LWP 4065)):
#0  0x00002ab177e5f228 in epoll_wait () from /lib64/libc.so.6
#1  0x00002ab17774e4f2 in event_dispatch_epoll (event_pool=0x24f15e0) at event.c:850
#2  0x0000000000403b59 in main (argc=6, argv=0x7fff3359cf48) at glusterfsd.c:1226
#0  0x00002ab177e5f228 in epoll_wait () from /lib64/libc.so.6

Comment 18 Jonathan Steffan 2009-07-13 17:54:12 UTC
Please note this appears to have been happening in versions 2.0.0 through 2.0.3. I'm updating the version field back to 2.0.3.

Comment 19 Amar Tumballi 2009-07-13 18:00:51 UTC
Jonathan, thanks for all the detailed reports. We are on it, and will update here when this bug gets fixed (or a patch is submitted).

Comment 20 Amar Tumballi 2009-07-13 20:34:54 UTC
(In reply to comment #17)
> (gdb) thread apply all backtrace full
> 
> Thread 2 (Thread 1119541568 (LWP 4071)):
> #0  __dentry_search_arbit (inode=0x2aaac09e1010) at inode.c:799
> #1  0x00002ab17774395e in inode_path (inode=0x2aaac09e1010, name=0x2ab1788bf028
> "60", bufp=0x42bad070) at inode.c:859
> #2  0x00002ab1791bcd96 in fuse_loc_fill (loc=0x2aaac09dd6c8,
> state=0x2aaac09dd6b0, ino=0, par=3780234481, name=0x2ab1788bf028 "60") at
> fuse-bridge.c:248
> #3  0x00002ab1791bfde5 in fuse_lookup (req=0x2aaabcaeff80, par=3780234481,
> name=0x2ab1788bf028 "60") at fuse-bridge.c:423
> #4  0x00002ab1791c1f6d in fuse_thread_proc (data=<value optimized out>) at
> fuse-bridge.c:2480
> #5  0x00002ab177b782f7 in start_thread () from /lib64/libpthread.so.0
> #6  0x00002ab177e5ee3d in clone () from /lib64/libc.so.6


Hi Jonathan,
 The above trace is the part that should be causing the 100% CPU usage and the glusterfs hang. Can you try applying patch http://patches.gluster.com/patch/742/ and see if it fixes the whole problem? This patch should fix the issue, in my opinion.

Regards,
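
The stuck frame in the traces is __dentry_search_arbit walking the dentry parent chain, and the referenced patch reportedly breaks out when the walk fails to terminate. A generic sketch of such a loop guard (hypothetical names; the actual fix is in the linked patch):

```python
# Generic sketch of a bounded parent-chain walk (hypothetical; the
# real fix is in the patch referenced above). If the chain contains
# a cycle, the guard logs and breaks instead of spinning at 100% CPU.

MAX_DEPTH = 1024

def build_path(node, parent):
    parts = []
    for _ in range(MAX_DEPTH):
        if node is None:  # reached the root: normal termination
            return "/" + "/".join(reversed(parts))
        parts.append(node)
        node = parent.get(node)
    print("possible infinite loop, breaking")
    return None

parent = {"c": "b", "b": "a", "a": None}
ok_path = build_path("c", parent)   # "/a/b/c"

parent["a"] = "c"                   # introduce a cycle a -> c -> b -> a
bad_path = build_path("c", parent)  # guard trips, returns None
```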

Comment 21 Jonathan Steffan 2009-07-13 21:55:43 UTC
Building and running tests now. Running a full test case to be sure it's working will likely take overnight.

Comment 22 Jonathan Steffan 2009-07-14 14:31:40 UTC
I took the 2.0.4 tag and applied this patch. So far, we have not run into the issue, which we had been able to reproduce easily with previous commits. I'll be doing more testing, but this looks to fix the issue we were seeing.

Comment 23 Amar Tumballi 2009-07-14 14:49:52 UTC
(In reply to comment #22)
> I took the 2.0.4 tag and applied this patch. So far, we have not run into the
> issue that we have been able to reproduce easily with previous commits. I'll be
> doing more testing but this looks to fix the issue we were seeing.

Hi Jonathan, 
 Can you check for the line 'possible infinite loop, breaking' in the log file? That would confirm that this case was actually being hit earlier.

Thanks,

Comment 24 Jonathan Steffan 2009-07-25 17:07:44 UTC
I will schedule some time next week to test for the log entries. Sorry for the delay.

Comment 25 Shehjar Tikoo 2010-01-28 03:37:46 UTC
(In reply to comment #24)
> I will schedule some time next week to test for the log entries. Sorry for the
> delay.

Hi Jonathan,

Are you running into this in recent releases, specifically 2.0.9 and 3.0.0?
We'd like an update on this bug to determine its future status.

Thanks

Comment 26 Shehjar Tikoo 2010-03-02 08:25:51 UTC
If you see this error in 3.0.x, kindly reopen this bug.