Bug 822083

Summary: [27ae1677eb2a6ed4a04bda0df5cc92f2780c11ed]: glusterfs client hangs, thus the application running on it
Product: [Community] GlusterFS
Reporter: Raghavendra Bhat <rabhat>
Component: fuse
Assignee: Raghavendra G <rgowdapp>
Status: CLOSED WORKSFORME
Severity: high
Priority: high
Version: mainline
CC: gluster-bugs, vbellur
Keywords: Triaged
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2012-10-23 08:46:51 UTC
Type: Bug
Bug Blocks: 848330
Attachments:
Description: the multi threaded program running on one of the fuse clients which hung; Flags: none
Description: header file for the program attached; Flags: none

Description Raghavendra Bhat 2012-05-16 09:55:35 UTC
Created attachment 584919
the multi threaded program running on one of the fuse clients which hung

Description of problem:
The setup is a 3x2 distributed-replicate volume with 2 fuse clients. One of the clients is running a multi-threaded application and the other fuse client is running dbench. Volume set operations are running in parallel, and one brick from each replica pair is brought down at regular intervals.

The multithreaded application running on the fuse client hung, and so did the fuse client itself.

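Attached to the glusterfs client process via gdb and found this backtrace. Thread 24 is in fuse_migrate_fd at fuse-bridge.c:3562, on the LOCK (&fd->inode->lock) call, and the inode's lock field reads -1.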

Loaded symbols for /lib/x86_64-linux-gnu/libgcc_s.so.1
0x00007fb06a3806a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:82
82	../sysdeps/unix/syscall-template.S: No such file or directory.
	in ../sysdeps/unix/syscall-template.S
(gdb) bt
#0  0x00007fb06a3806a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:82
#1  0x00007fb06b05eea3 in event_dispatch_epoll (event_pool=0x2195cd0) at ../../../libglusterfs/src/event.c:830
#2  0x00007fb06b05f27d in event_dispatch (event_pool=0x2195cd0) at ../../../libglusterfs/src/event.c:947
#3  0x0000000000408858 in main (argc=4, argv=0x7ffff57c2368) at ../../../glusterfsd/src/glusterfsd.c:1674
(gdb) info thr
  25 Thread 0x7fb068b2f700 (LWP 20751)  do_sigwait (set=<value optimized out>, sig=0x7fb068b2eeb8)
    at ../nptl/sysdeps/unix/sysv/linux/../../../../../sysdeps/unix/sysv/linux/sigwait.c:65
  24 Thread 0x7fb06832e700 (LWP 20753)  0x00007fb06a9c98f5 in ?? () from /lib/x86_64-linux-gnu/libpthread.so.0
  23 Thread 0x7fb067b2d700 (LWP 20754)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  22 Thread 0x7fb066f0a700 (LWP 20757)  0x00007fb06a9cc4bd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
  21 Thread 0x7fb06513a700 (LWP 20758)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  20 Thread 0x7fb064939700 (LWP 20759)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  19 Thread 0x7fb0621ad700 (LWP 20760)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  18 Thread 0x7fb0619ac700 (LWP 20761)  0x00007fb06a9cbcbd in read () at ../sysdeps/unix/syscall-template.S:82
  17 Thread 0x7fb0611a9700 (LWP 22946)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  16 Thread 0x7fb0609a8700 (LWP 22947)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  15 Thread 0x7fb05a169700 (LWP 26148)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  14 Thread 0x7fb059968700 (LWP 26149)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  13 Thread 0x7fb05769b700 (LWP 3095)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  12 Thread 0x7fb056e9a700 (LWP 3096)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  11 Thread 0x7fb054e6c700 (LWP 3108)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  10 Thread 0x7fb04ffff700 (LWP 3109)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  9 Thread 0x7fb04e660700 (LWP 3361)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  8 Thread 0x7fb04de5f700 (LWP 3362)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  7 Thread 0x7fb04bd59700 (LWP 3455)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  6 Thread 0x7fb04b558700 (LWP 3456)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  5 Thread 0x7fb049689700 (LWP 3550)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  4 Thread 0x7fb048e88700 (LWP 3551)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  3 Thread 0x7fb046fb9700 (LWP 4150)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  2 Thread 0x7fb0467b8700 (LWP 4151)  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
* 1 Thread 0x7fb06b499720 (LWP 20750)  0x00007fb06a3806a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:82
(gdb) t 24
[Switching to thread 24 (Thread 0x7fb06832e700 (LWP 20753))]#0  0x00007fb06a9c98f5 in ?? () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007fb06a9c98f5 in ?? () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007fb068b4dcc2 in fuse_migrate_fd (this=0x2196b20, fd=0x39f1178, old_subvol=0x7fb050036880, new_subvol=0x7fb050a18270)
    at ../../../../../xlators/mount/fuse/src/fuse-bridge.c:3562
#2  0x00007fb068b4e228 in fuse_handle_opened_fds (this=0x2196b20, old_subvol=0x7fb050036880, new_subvol=0x7fb050a18270)
    at ../../../../../xlators/mount/fuse/src/fuse-bridge.c:3678
#3  0x00007fb068b4e31f in fuse_graph_switch_task (data=0x3b5cd40) at ../../../../../xlators/mount/fuse/src/fuse-bridge.c:3725
#4  0x00007fb06b0700cd in synctask_wrap (old_task=0x3b67610) at ../../../libglusterfs/src/syncop.c:120
#5  0x00007fb06a2df1a0 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) f 1
#1  0x00007fb068b4dcc2 in fuse_migrate_fd (this=0x2196b20, fd=0x39f1178, old_subvol=0x7fb050036880, new_subvol=0x7fb050a18270)
    at ../../../../../xlators/mount/fuse/src/fuse-bridge.c:3562
3562	                LOCK (&fd->inode->lock);
(gdb) p *fd
$1 = {pid = 3305, flags = 0, refcount = 2, inode_list = {next = 0x7fb04e904d00, prev = 0x39f1250}, inode = 0x7fb04e904cd0, lock = 1, 
  _ctx = 0x7fb0509de1a0, xl_count = 18, lk_ctx = 0x7fb0509d3910}
(gdb) p*fd->inode
$2 = {table = 0x2cfe8a0, gfid = "jI\226\367\261\063A\373\214y3\373K:w", <incomplete sequence \373>, lock = -1, nlookup = 9, ref = 6032, 
  ia_type = IA_IFDIR, fd_list = {next = 0x39f3000, prev = 0x39f1188}, dentry_list = {next = 0x7fb04e662c30, prev = 0x7fb04e662c30}, hash = {
    next = 0x391b1c0, prev = 0x391b1c0}, list = {next = 0x7fb04e9054b4, prev = 0x7fb04e903f50}, _ctx = 0x2d04cc0}
(gdb) 
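The gfid that gdb prints for fd->inode above is a C string with octal escapes, which is hard to compare against the client log by eye. A minimal Python sketch follows, assuming gdb's usual three-digit octal escaping and assuming that the byte gdb reports under "<incomplete sequence \373>" is the 16th byte of the gfid. The helper name decode_gdb_bytes is made up for this note and is not part of gdb or glusterfs.

```python
import re
import uuid

# Escaped gfid exactly as gdb printed it for *fd->inode, followed by the
# byte gdb reported separately as "<incomplete sequence \373>".
GDB_GFID = r"jI\226\367\261\063A\373\214y3\373K:w"
GDB_TAIL = r"\373"

_OCTAL = re.compile(r"\\([0-7]{1,3})")


def decode_gdb_bytes(escaped):
    """Turn a gdb-escaped C string into raw bytes (octal escapes only)."""
    out = bytearray()
    pos = 0
    while pos < len(escaped):
        match = _OCTAL.match(escaped, pos)
        if match:
            # "\226" style escape: one byte written as octal digits.
            out.append(int(match.group(1), 8))
            pos = match.end()
        else:
            # Plain printable character: take its byte value directly.
            out.append(ord(escaped[pos]))
            pos += 1
    return bytes(out)


raw = decode_gdb_bytes(GDB_GFID + GDB_TAIL)
gfid = uuid.UUID(bytes=raw)
print(gfid)
```

This prints 6a4996f7-b133-41fb-8c79-33fb4b3a77fb, which is the gfid named in the reopendir messages in the client log under Additional info. So the inode whose lock thread 24 is waiting on appears to be the same directory that is being reopened.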



Version-Release number of selected component (if applicable):


How reproducible:
always

Steps to Reproduce:
1. Create a 3x2 distributed-replicate volume and mount it on 2 fuse clients.
2. Run the multithreaded application (attached) on one of the fuse clients and run dbench (dbench 22) on the other fuse client.
3. Run volume set operations (toggling an xlator on/off with a 300-second gap) in parallel.
4. Bring down one brick from each replica pair at regular intervals (300 seconds in this case), sleep for some time, and then do volume start force.
5. Heal the volume via both the gluster CLI command and find | xargs stat on both mount points.
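
The steps above can be sketched as a dry-run schedule. This is only an outline under stated assumptions, not the script that was used: the option being toggled (performance.quick-read), the brick down time of 60 seconds, and the single mount path are guesses, because the report names none of them. The sketch only prints the actions and does not execute anything. Bringing the bricks down is left as a manual step since the report does not say how it was done.

```python
import itertools

VOLUME = "mirror"                  # volume name from "gluster volume info"
OPTION = "performance.quick-read"  # assumption: the toggled xlator is not named in the report
INTERVAL = 300                     # seconds, gap given in steps 3 and 4
DOWN_TIME = 60                     # assumption: "sleep for some time" is not quantified
MOUNTS = ["/mnt/client"]           # only mount path visible in the report


def one_cycle(state):
    """Return the actions of one cycle as (kind, payload) tuples."""
    steps = [
        ("run", ["gluster", "volume", "set", VOLUME, OPTION, state]),
        ("manual", "bring down one brick from each replica pair"),
        ("sleep", DOWN_TIME),
        ("run", ["gluster", "volume", "start", VOLUME, "force"]),
        ("run", ["gluster", "volume", "heal", VOLUME]),
    ]
    for mount in MOUNTS:
        steps.append(("shell", f"find {mount} | xargs stat > /dev/null"))
    steps.append(("sleep", INTERVAL))
    return steps


def schedule(cycles):
    """Alternate the option between off and on, one cycle per state."""
    states = itertools.cycle(["off", "on"])
    for _ in range(cycles):
        yield from one_cycle(next(states))


for kind, payload in schedule(2):
    print(kind, payload)
```

The multithreaded application and dbench 22 run on the two clients for the whole duration and are not part of this loop.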

  
Actual results:
The multithreaded application on one of the mount points hung

Expected results:
Applications should not hang.

Additional info:

./a.out -t 1315
Switching over to the working directory /mnt/client/playground time 1315
Total Statistics ======>
Opens        : 1180/1319
Reads        : 9965970/9965971
Writes       : 1827/1982
Flocks       : 134/138
fcntl locks  : 138/138
Truncates    : 15/15
Fstat        : 23617395/23617396
Chown        : 1656/1657
Opendir      : 1010/1012
Readdir      : 4031/5041
^C^C^C^C^C
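
To see which operations were lagging when the application was interrupted, the counters above can be compared pairwise. The report does not define the two numbers in each pair. The sketch below assumes they are completed/attempted and simply prints the difference; parse_stats is a name made up for this note.

```python
STATS = """\
Opens        : 1180/1319
Reads        : 9965970/9965971
Writes       : 1827/1982
Flocks       : 134/138
fcntl locks  : 138/138
Truncates    : 15/15
Fstat        : 23617395/23617396
Chown        : 1656/1657
Opendir      : 1010/1012
Readdir      : 4031/5041
"""


def parse_stats(text):
    """Map each operation name to its (first, second) counter pair."""
    result = {}
    for line in text.splitlines():
        name, _, counts = line.partition(":")
        first, second = (int(part) for part in counts.split("/"))
        result[name.strip()] = (first, second)
    return result


# Assumption: the gap is "attempted minus completed".
gaps = {name: second - first for name, (first, second) in parse_stats(STATS).items()}
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:12s} {gap}")
```

Under that assumption Readdir shows the largest gap (1010), followed by Writes (155) and Opens (139).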

gluster volume info
 
Volume Name: mirror
Type: Distributed-Replicate
Volume ID: c15b0415-46ec-485d-a1c6-989783bb154a
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: hyperspace:/mnt/sda7/export4
Brick2: hyperspace:/mnt/sda8/export4
Brick3: hyperspace:/mnt/sda7/export5
Brick4: hyperspace:/mnt/sda8/export5
Brick5: hyperspace:/mnt/sda7/export6
Brick6: hyperspace:/mnt/sda8/export6
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
features.quota: on
performance.quick-read: on
performance.read-ahead: on
performance.stat-prefetch: off
features.limit-usage: /:250GB


D-UP
[2012-05-16 15:17:09.415011] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 874)
[2012-05-16 15:17:09.415116] I [client-handshake.c:453:client_set_lk_version_cbk] 4-mirror-client-5: Server lk version = 1
[2012-05-16 15:17:09.415333] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 875)
[2012-05-16 15:17:09.415743] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 876)
[2012-05-16 15:17:09.415944] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 877)
[2012-05-16 15:17:09.416163] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 878)
[2012-05-16 15:17:09.416367] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 879)
[2012-05-16 15:17:09.416632] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 880)
[2012-05-16 15:17:09.416821] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 881)
[2012-05-16 15:17:09.417013] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 882)
[2012-05-16 15:17:09.417221] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 883)
[2012-05-16 15:17:09.417408] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 884)
[2012-05-16 15:17:09.417769] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 885)
[2012-05-16 15:17:09.417954] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 886)
[2012-05-16 15:17:09.418323] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 887)
[2012-05-16 15:17:09.418525] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 888)
[2012-05-16 15:17:09.418720] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 889)
[2012-05-16 15:17:09.418936] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 890)
[2012-05-16 15:17:09.419294] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 891)
[2012-05-16 15:17:09.421271] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 892)
[2012-05-16 15:17:09.422842] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 893)
[2012-05-16 15:17:09.423105] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 894)
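
The log excerpt above shows the same directory gfid being reopened once per open fd. A small Python sketch for counting reopendir messages per gfid in a client log is given here; it assumes each log entry sits on a single line in the format shown above, and the function name reopendirs_per_gfid is made up for this note. The sample embeds only the first three entries of the excerpt.

```python
import re
from collections import Counter

# First three entries of the excerpt, each joined onto one line.
LOG = """\
[2012-05-16 15:17:09.415011] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 874)
[2012-05-16 15:17:09.415116] I [client-handshake.c:453:client_set_lk_version_cbk] 4-mirror-client-5: Server lk version = 1
[2012-05-16 15:17:09.415333] I [client-handshake.c:1033:client3_1_reopendir_cbk] 5-mirror-client-5: reopendir on <gfid:6a4996f7-b133-41fb-8c79-33fb4b3a77fb> succeeded (fd = 875)
"""

REOPENDIR = re.compile(
    r"reopendir on <gfid:([0-9a-f-]{36})> succeeded \(fd = (\d+)\)"
)


def reopendirs_per_gfid(text):
    """Count successful reopendir messages for each gfid."""
    counts = Counter()
    for match in REOPENDIR.finditer(text):
        counts[match.group(1)] += 1
    return counts


for gfid, count in reopendirs_per_gfid(LOG).items():
    print(gfid, count)
```

Run against the full client log, this would give the number of directory fds reopened on that gfid after the brick came back up.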

Comment 1 Raghavendra Bhat 2012-05-16 09:56:02 UTC
Created attachment 584920
header file for the program attached

Comment 2 Raghavendra Bhat 2012-10-23 08:46:51 UTC
Checked with the latest master (cf63a76ca03240eb617ca5bd2aa9b3f7abe7b6a4). The same set of tests runs fine without causing any hang in the filesystem or the application. This seems to have been fixed by the commit http://review.gluster.org/3566.