Created attachment 584919 [details]
the multi threaded program running on one of the fuse clients which hung

Description of problem:
3x2 distributed replicate volume with 2 fuse clients. One of the clients is running a multi-threaded application and the other fuse client is running dbench. Volume set operations are running in parallel, and one brick from each replica pair is brought down at regular intervals.

The multi-threaded application running on the fuse client hung, and so did the fuse client itself. Attached to the fuse client process via gdb and found this backtrace.

Loaded symbols for /lib/x86_64-linux-gnu/libgcc_s.so.1
0x00007fb06a3806a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:82
82      ../sysdeps/unix/syscall-template.S: No such file or directory.
        in ../sysdeps/unix/syscall-template.S
(gdb) bt
#0  0x00007fb06a3806a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:82
#1  0x00007fb06b05eea3 in event_dispatch_epoll (event_pool=0x2195cd0) at ../../../libglusterfs/src/event.c:830
#2  0x00007fb06b05f27d in event_dispatch (event_pool=0x2195cd0) at ../../../libglusterfs/src/event.c:947
#3  0x0000000000408858 in main (argc=4, argv=0x7ffff57c2368) at ../../../glusterfsd/src/glusterfsd.c:1674
(gdb) info thr
  25 Thread 0x7fb068b2f700 (LWP 20751)  do_sigwait (set=<value optimized out>, sig=0x7fb068b2eeb8) at ../nptl/sysdeps/unix/sysv/linux/../../../../../sysdeps/unix/sysv/linux/sigwait.c:65
  24 Thread 0x7fb06832e700 (LWP 20753)  0x00007fb06a9c98f5 in ?? () from /lib/x86_64-linux-gnu/libpthread.so.0
  23 Thread 0x7fb067b2d700 (LWP 20754)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  22 Thread 0x7fb066f0a700 (LWP 20757)  0x00007fb06a9cc4bd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
  21 Thread 0x7fb06513a700 (LWP 20758)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  20 Thread 0x7fb064939700 (LWP 20759)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  19 Thread 0x7fb0621ad700 (LWP 20760)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  18 Thread 0x7fb0619ac700 (LWP 20761)  0x00007fb06a9cbcbd in read () at ../sysdeps/unix/syscall-template.S:82
  17 Thread 0x7fb0611a9700 (LWP 22946)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  16 Thread 0x7fb0609a8700 (LWP 22947)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  15 Thread 0x7fb05a169700 (LWP 26148)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  14 Thread 0x7fb059968700 (LWP 26149)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  13 Thread 0x7fb05769b700 (LWP 3095)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  12 Thread 0x7fb056e9a700 (LWP 3096)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  11 Thread 0x7fb054e6c700 (LWP 3108)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  10 Thread 0x7fb04ffff700 (LWP 3109)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  9 Thread 0x7fb04e660700 (LWP 3361)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  8 Thread 0x7fb04de5f700 (LWP 3362)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  7 Thread 0x7fb04bd59700 (LWP 3455)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  6 Thread 0x7fb04b558700 (LWP 3456)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  5 Thread 0x7fb049689700 (LWP 3550)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  4 Thread 0x7fb048e88700 (LWP 3551)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  3 Thread 0x7fb046fb9700 (LWP 4150)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  2 Thread 0x7fb0467b8700 (LWP 4151)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
* 1 Thread 0x7fb06b499720 (LWP 20750)  0x00007fb06a3806a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:82
(gdb) t 24
[Switching to thread 24 (Thread 0x7fb06832e700 (LWP 20753))]#0  0x00007fb06a9c98f5 in ?? () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007fb06a9c98f5 in ?? () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007fb068b4dcc2 in fuse_migrate_fd (this=0x2196b20, fd=0x39f1178, old_subvol=0x7fb050036880, new_subvol=0x7fb050a18270) at ../../../../../xlators/mount/fuse/src/fuse-bridge.c:3562
#2  0x00007fb068b4e228 in fuse_handle_opened_fds (this=0x2196b20, old_subvol=0x7fb050036880, new_subvol=0x7fb050a18270) at ../../../../../xlators/mount/fuse/src/fuse-bridge.c:3678
#3  0x00007fb068b4e31f in fuse_graph_switch_task (data=0x3b5cd40) at ../../../../../xlators/mount/fuse/src/fuse-bridge.c:3725
#4  0x00007fb06b0700cd in synctask_wrap (old_task=0x3b67610) at ../../../libglusterfs/src/syncop.c:120
#5  0x00007fb06a2df1a0 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) f 1
#1  0x00007fb068b4dcc2 in fuse_migrate_fd (this=0x2196b20, fd=0x39f1178, old_subvol=0x7fb050036880, new_subvol=0x7fb050a18270) at ../../../../../xlators/mount/fuse/src/fuse-bridge.c:3562
3562            LOCK (&fd->inode->lock);
(gdb) p *fd
$1 = {pid = 3305, flags = 0, refcount = 2, inode_list = {next = 0x7fb04e904d00, prev = 0x39f1250}, inode = 0x7fb04e904cd0, lock = 1, _ctx = 0x7fb0509de1a0, xl_count = 18, lk_ctx = 0x7fb0509d3910}
(gdb) p*fd->inode
$2 = {table = 0x2cfe8a0, gfid = "jI\226\367\261\063A\373\214y3\373K:w", <incomplete sequence \373>, lock = -1, nlookup = 9, ref = 6032, ia_type = IA_IFDIR, fd_list = {next = 0x39f3000, prev = 0x39f1188}, dentry_list = {next = 0x7fb04e662c30, prev = 0x7fb04e662c30}, hash = {next = 0x391b1c0, prev = 0x391b1c0}, list = {next = 0x7fb04e9054b4, prev = 0x7fb04e903f50}, _ctx = 0x2d04cc0}
(gdb)

Version-Release number of selected component (if applicable):

How reproducible:
always

Steps to Reproduce:
1. Create a 3x2 distributed replicate volume and mount it via 2 fuse clients.
2. Run the multi-threaded application (attached) on one of the fuse clients and run dbench (dbench 22) on the other fuse client.
3. Run volume set operations (toggling xlators on/off with a 300-second gap) in parallel.
4. Bring down one brick from each of the replica pairs at regular intervals (300 seconds in this case), sleep for some time, and then do a volume start force.
5. Heal the volume via both the gluster cli heal command and find | xargs stat on both the mount points.

Actual results:
The multi-threaded application on one of the mount points hung.

Expected results:
Applications should not hang.

Additional info:

./a.out -t 1315
Switching over to the working directory /mnt/client/playground
time 1315
Total Statistics ======>
Opens : 1180/1319
Reads : 9965970/9965971
Writes : 1827/1982
Flocks : 134/138
fcntl locks : 138/138
Truncates : 15/15
Fstat : 23617395/23617396
Chown : 1656/1657
Opendir : 1010/1012
Readdir : 4031/5041
^C^C^C^C^C

gluster volume info

Volume Name: mirror
Type: Distributed-Replicate
Volume ID: c15b0415-46ec-485d-a1c6-989783bb154a
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: hyperspace:/mnt/sda7/export4
Brick2: hyperspace:/mnt/sda8/export4
Brick3: hyperspace:/mnt/sda7/export5
Brick4: hyperspace:/mnt/sda8/export5
Brick5: hyperspace:/mnt/sda7/export6
Brick6: hyperspace:/mnt/sda8/export6
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
features.quota: on
performance.quick-read: on
performance.read-ahead: on
performance.stat-prefetch: off
features.limit-usage: /:250GB

D-UP
[2012-05-16 15:17:09.415011] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 874)
[2012-05-16 15:17:09.415116] I [client-handshake.c:453:client_set_lk_version_cbk] 4-mirror-client-5: Server lk version = 1
[2012-05-16 15:17:09.415333] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 875)
[2012-05-16 15:17:09.415743] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 876)
[2012-05-16 15:17:09.415944] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 877)
[2012-05-16 15:17:09.416163] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 878)
[2012-05-16 15:17:09.416367] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 879)
[2012-05-16 15:17:09.416632] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 880)
[2012-05-16 15:17:09.416821] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 881)
[2012-05-16 15:17:09.417013] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 882)
[2012-05-16 15:17:09.417221] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 883)
[2012-05-16 15:17:09.417408] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 884)
[2012-05-16 15:17:09.417769] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 885)
[2012-05-16 15:17:09.417954] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 886)
[2012-05-16 15:17:09.418323] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 887)
[2012-05-16 15:17:09.418525] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 888)
[2012-05-16 15:17:09.418720] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 889)
[2012-05-16 15:17:09.418936] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 890)
[2012-05-16 15:17:09.419294] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 891)
[2012-05-16 15:17:09.421271] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 892)
[2012-05-16 15:17:09.422842] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 893)
[2012-05-16 15:17:09.423105] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 894)

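Note on the backtrace: thread 24 is stopped at LOCK (&fd->inode->lock) in fuse_migrate_fd, inside the graph switch task. The following is a minimal C sketch, not GlusterFS code, of how a thread can stay blocked at such a call. It assumes LOCK resolves to pthread_spin_lock in this build, which the report does not confirm; probe_held_lock and probe_free_lock are hypothetical names used only for illustration. The sketch shows pthread_spin_trylock returning EBUSY on a lock that is already held, which is the situation in which a plain pthread_spin_lock call would keep spinning instead of returning.

```c
/* Sketch only: illustrates a blocked spinlock acquire.
 * Assumption: LOCK() in the report maps to pthread_spin_lock().
 * The function names below are hypothetical, not GlusterFS APIs. */
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Probe a spinlock that is already held.
 * pthread_spin_trylock() fails with EBUSY here; a blocking
 * pthread_spin_lock() in the same position would never return
 * until the holder unlocks. */
int probe_held_lock(void)
{
    pthread_spinlock_t lock;
    int rc;

    rc = pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
    assert(rc == 0);

    rc = pthread_spin_lock(&lock);      /* first acquire succeeds */
    assert(rc == 0);

    rc = pthread_spin_trylock(&lock);   /* lock is held: expect EBUSY */

    pthread_spin_unlock(&lock);
    pthread_spin_destroy(&lock);
    return rc;
}

/* Probe a spinlock that nobody holds: the acquire succeeds. */
int probe_free_lock(void)
{
    pthread_spinlock_t lock;
    int rc;

    rc = pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
    assert(rc == 0);

    rc = pthread_spin_trylock(&lock);   /* lock is free: expect 0 */
    if (rc == 0)
        pthread_spin_unlock(&lock);

    pthread_spin_destroy(&lock);
    return rc;
}
```

Build with cc -pthread and call both functions from a small main. If whichever code path holds inode->lock never releases it, every later LOCK caller stays in the EBUSY case indefinitely. Which path held the lock in this hang is not established by the report.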
Created attachment 584920 [details] header file for the program attached
Checked with the latest master (cf63a76ca03240eb617ca5bd2aa9b3f7abe7b6a4). The same set of tests runs fine without causing any hang in the filesystem or the application. This seems to have been fixed by the commit http://review.gluster.org/3566.