Bug 1194525 - Perf: Crash seen during IOZone performance regression runs.
Summary: Perf: Crash seen during IOZone performance regression runs.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: core
Version: rhgs-3.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.0.4
Assignee: krishnan parthasarathi
QA Contact: Ben Turner
URL:
Whiteboard:
Depends On: 1197118
Blocks: 1104459 1182947 1193995
 
Reported: 2015-02-20 03:20 UTC by Ben Turner
Modified: 2016-09-17 14:38 UTC
CC List: 10 users

Fixed In Version: glusterfs-3.6.0.48-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Cloned to: 1197118
Environment:
Last Closed: 2015-03-26 06:36:17 UTC
Embargoed:




Links:
System: Red Hat Product Errata
ID: RHBA-2015:0682
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Storage 3.0 enhancement and bug fix update #4
Last Updated: 2015-03-26 10:32:55 UTC

Description Ben Turner 2015-02-20 03:20:18 UTC
Description of problem:

When running iozone tests I am seeing a crash on the client side.

Version-Release number of selected component (if applicable):

glusterfs-3.6.0.45-1.el6.x86_64

How reproducible:

Occasionally, not every time.

Steps to Reproduce:
1.  Run IOZone perf regression tests.
2.  Check client logs for crashes.

Actual results:

Poor performance, transport endpoint not connected messages, and disconnects.

Expected results:

Normal operation.

Additional info:

I have this reproduced in the lab.

Comment 1 Ben Turner 2015-02-20 03:21:14 UTC
In the log:

[2015-02-20 02:50:47.140507] I [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Exited thread with index 4
pending frames:
frame : type(0) op(0)
frame : type(1) op(WRITE)
frame : type(0) op(0)
frame : type(1) op(WRITE)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(WRITE)
frame : type(0) op(0)
frame : type(1) op(WRITE)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 
2015-02-20 02:51:12
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.6.0.45
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7f3f4806d7b6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x7f3f480883cf]
/lib64/libc.so.6[0x38acc329a0]
/usr/lib64/glusterfs/3.6.0.45/rpc-transport/socket.so(+0x9594)[0x7f3f3dc56594]
/usr/lib64/glusterfs/3.6.0.45/rpc-transport/socket.so(+0xad1d)[0x7f3f3dc57d1d]
/usr/lib64/libglusterfs.so.0(+0x77d1c)[0x7f3f480c4d1c]
/lib64/libpthread.so.0[0x38ad0079d1]
/lib64/libc.so.6(clone+0x6d)[0x38acce8b6d]
---------

Comment 2 Ben Turner 2015-02-20 03:23:39 UTC
I see this in the client logs:

[2015-02-20 02:07:16.497365] E [dht-helper.c:1364:dht_inode_ctx_set] (-->/usr/lib64/glusterfs/3.6.0.45/xlator/cluster/distribute.so(dht_readdirp_cbk+0x30c) [0x7feb0a9d7b6c] (-->/usr/lib64/glusterfs/3.6.0.45/xlator/cluster/distribute.so(dht_layout_preset+0x5e) [0x7feb0a9afa0e] (-->/usr/lib64/glusterfs/3.6.0.45/xlator/cluster/distribute.so(dht_inode_ctx_layout_set+0x52) [0x7feb0a9b1cc2]))) 0-rep-dht: invalid argument: inode
[2015-02-20 02:07:16.497405] E [dht-helper.c:1345:dht_inode_ctx_get] (-->/usr/lib64/glusterfs/3.6.0.45/xlator/cluster/distribute.so(dht_readdirp_cbk+0x30c) [0x7feb0a9d7b6c] (-->/usr/lib64/glusterfs/3.6.0.45/xlator/cluster/distribute.so(dht_layout_preset+0x5e) [0x7feb0a9afa0e] (-->/usr/lib64/glusterfs/3.6.0.45/xlator/cluster/distribute.so(dht_inode_ctx_layout_set+0x34) [0x7feb0a9b1ca4]))) 0-rep-dht: invalid argument: inode
[2015-02-20 02:07:16.497427] E [dht-helper.c:1364:dht_inode_ctx_set] (-->/usr/lib64/glusterfs/3.6.0.45/xlator/cluster/distribute.so(dht_readdirp_cbk+0x30c) [0x7feb0a9d7b6c] (-->/usr/lib64/glusterfs/3.6.0.45/xlator/cluster/distribute.so(dht_layout_preset+0x5e) [0x7feb0a9afa0e] (-->/usr/lib64/glusterfs/3.6.0.45/xlator/cluster/distribute.so(dht_inode_ctx_layout_set+0x52) [0x7feb0a9b1cc2]))) 0-rep-dht: invalid argument: inode
[2015-02-20 02:07:21.123976] E [dht-helper.c:1345:dht_inode_ctx_get] (-->/usr/lib64/glusterfs/3.6.0.45/xlator/cluster/distribute.so(dht_readdirp_cbk+0x30c) [0x7feb0a9d7b6c] (-->/usr/lib64/glusterfs/3.6.0.45/xlator/cluster/distribute.so(dht_layout_preset+0x5e) [0x7feb0a9afa0e] (-->/usr/lib64/glusterfs/3.6.0.45/xlator/cluster/distribute.so(dht_inode_ctx_layout_set+0x34) [0x7feb0a9b1ca4]))) 0-rep-dht: invalid argument: inode
[2015-02-20 02:07:21.124006] E [dht-helper.c:1364:dht_inode_ctx_set] (-->/usr/lib64/glusterfs/3.6.0.45/xlator/cluster/distribute.so(dht_readdirp_cbk+0x30c) [0x7feb0a9d7b6c] (-->/usr/lib64/glusterfs/3.6.0.45/xlator/cluster/distribute.so(dht_layout_preset+0x5e) [0x7feb0a9afa0e] (-->/usr/lib64/glusterfs/3.6.0.45/xlator/cluster/distribute.so(dht_inode_ctx_layout_set+0x52) [0x7feb0a9b1cc2]))) 0-rep-dht: invalid argument: inode

And here is the output from iozone:

	Iozone: Performance Test of File I/O
	        Version $Revision: 3.408 $
		Compiled for 64 bit mode.
		Build: linux 

	Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
	             Al Slater, Scott Rhine, Mike Wisner, Ken Goss
	             Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
	             Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
	             Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
	             Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
	             Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer.
	             Ben England.

	Run began: Thu Feb 19 22:15:53 2015

	Network distribution mode enabled.
	Hostname = gqac031.sbu.lab.eng.bos.redhat.com
	Setting no_unlink
	Include close in write timing
	Include fsync in write timing
	No retest option selected
	Record Size 1024 KB
	File size set to 2097152 KB
	Command line used: iozone -+m /root/clients.ioz -+h gqac031.sbu.lab.eng.bos.redhat.com -C -w -c -e -i 2 -J 3 -+n -r 1024k -s 2g -t 8
	Output is in Kbytes/sec
	Compute time 0.003000 seconds for reads and writes.
	Time Resolution = 0.000001 seconds.
	Processor cache size set to 1024 Kbytes.
	Processor cache line size set to 32 bytes.
	File stride size set to 17 * record size.
	Throughput test with 8 processes
	Each process writes a 2097152 Kbyte file in 1024 Kbyte records

	Test running:
Client # 7 reporting an error Transport endpoint is not connected !

Client # 6 reporting an error Transport endpoint is not connected !

Client # 4 reporting an error Transport endpoint is not connected !

Client # 5 reporting an error Transport endpoint is not connected !

	Children see throughput for 8 random readers 	=  481237.70 KB/sec
	Min throughput per process 			=   10994.98 KB/sec 
	Max throughput per process 			=  112833.18 KB/sec
	Avg throughput per process 			=   60154.71 KB/sec
	Min xfer 					=  264192.00 KB
	Child[0] xfer count = 2097152.00 KB, Throughput =  112833.18 KB/sec
	Child[1] xfer count = 1937408.00 KB, Throughput =  101591.91 KB/sec
	Child[2] xfer count = 2081792.00 KB, Throughput =  111696.71 KB/sec
	Child[3] xfer count = 1999872.00 KB, Throughput =  105917.29 KB/sec
	Child[4] xfer count =  264192.00 KB, Throughput =   10994.98 KB/sec
	Child[5] xfer count =  314368.00 KB, Throughput =   13162.18 KB/sec
	Child[6] xfer count =  316416.00 KB, Throughput =   13252.79 KB/sec
	Child[7] xfer count =  282624.00 KB, Throughput =   11788.66 KB/sec

	Test running:
Client # 5 reporting an error Transport endpoint is not connected !

Client # 4 reporting an error Transport endpoint is not connected !

Client # 7 reporting an error Transport endpoint is not connected !

Client # 6 reporting an error Transport endpoint is not connected !

	Children see throughput for 8 random writers 	= 1029431.46 KB/sec
	Min throughput per process 			=   22222.78 KB/sec 
	Max throughput per process 			=  237961.33 KB/sec
	Avg throughput per process 			=  128678.93 KB/sec
	Min xfer 					= 1366016.00 KB
	Child[0] xfer count = 2097152.00 KB, Throughput =  237961.33 KB/sec
	Child[1] xfer count = 2097152.00 KB, Throughput =  237687.33 KB/sec
	Child[2] xfer count = 2097152.00 KB, Throughput =  226001.80 KB/sec
	Child[3] xfer count = 2097152.00 KB, Throughput =  237303.77 KB/sec
	Child[4] xfer count = 1377280.00 KB, Throughput =   22405.48 KB/sec
	Child[5] xfer count = 1366016.00 KB, Throughput =   23476.16 KB/sec
	Child[6] xfer count = 1366016.00 KB, Throughput =   22222.78 KB/sec
	Child[7] xfer count = 1375232.00 KB, Throughput =   22372.82 KB/sec

Comment 3 Ben Turner 2015-02-20 16:16:21 UTC
Here are some updates from troubleshooting the running system:

Busy process:

top - 10:53:27 up 3 days, 19:32,  3 users,  load average: 1.04, 1.05, 1.05
Tasks: 570 total,   2 running, 568 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.5%us,  0.2%sy,  0.0%ni, 98.9%id,  0.4%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  49409280k total,  1420676k used, 47988604k free,    20792k buffers
Swap: 24772604k total,      484k used, 24772120k free,   321564k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                          
 8707 root      20   0 1816m  62m 3192 S 98.3  0.1 602:30.84 glusterfsd


gstack of busy process:

[root@gqas015 ~]# gstack 8707
Thread 12 (Thread 0x7f7cd92a6700 (LWP 8708)):
#0  0x0000003b89a0ef3d in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f7ce0a9f03c in gf_timer_proc () from /usr/lib64/libglusterfs.so.0
#2  0x0000003b89a079d1 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003b896e89dd in clone () from /lib64/libc.so.6
Thread 11 (Thread 0x7f7cd88a5700 (LWP 8709)):
#0  0x0000003b89a0f4b5 in sigwait () from /lib64/libpthread.so.0
#1  0x00000000004057bb in glusterfs_sigwaiter ()
#2  0x0000003b89a079d1 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003b896e89dd in clone () from /lib64/libc.so.6
Thread 10 (Thread 0x7f7cd7ea4700 (LWP 8710)):
#0  0x0000003b89a0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f7ce0abeccf in syncenv_task () from /usr/lib64/libglusterfs.so.0
#2  0x00007f7ce0ac4c00 in syncenv_processor () from /usr/lib64/libglusterfs.so.0
#3  0x0000003b89a079d1 in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b896e89dd in clone () from /lib64/libc.so.6
Thread 9 (Thread 0x7f7ccffff700 (LWP 8711)):
#0  0x0000003b89a0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f7ce0abeccf in syncenv_task () from /usr/lib64/libglusterfs.so.0
#2  0x00007f7ce0ac4c00 in syncenv_processor () from /usr/lib64/libglusterfs.so.0
#3  0x0000003b89a079d1 in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b896e89dd in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x7f7cd7079700 (LWP 8722)):
#0  0x0000003b896e8fd3 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f7ce0ad9c11 in event_dispatch_epoll_worker () from /usr/lib64/libglusterfs.so.0
#2  0x0000003b89a079d1 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003b896e89dd in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7f7cd4f5c700 (LWP 8737)):
#0  0x0000003b896e8fd3 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f7ce0ad9c11 in event_dispatch_epoll_worker () from /usr/lib64/libglusterfs.so.0
#2  0x0000003b89a079d1 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003b896e89dd in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7f7cd4156700 (LWP 8738)):
#0  0x0000003b89a0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f7cd57f46ab in index_worker () from /usr/lib64/glusterfs/3.6.0.45/xlator/features/index.so
#2  0x0000003b89a079d1 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003b896e89dd in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7f7cbf5fe700 (LWP 8753)):
#0  0x0000003b89a0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f7cd64690a8 in posix_janitor_thread_proc () from /usr/lib64/glusterfs/3.6.0.45/xlator/storage/posix.so
#2  0x0000003b89a079d1 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003b896e89dd in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7f7cbebfd700 (LWP 8754)):
#0  0x0000003b89a0c392 in ?? () from /lib64/libpthread.so.0
#1  0x00007f7ce0ad999f in event_slot_get () from /usr/lib64/libglusterfs.so.0
#2  0x00007f7ce0ad9eb4 in event_select_on_epoll () from /usr/lib64/libglusterfs.so.0
#3  0x00007f7cd7293ef7 in socket_throttle () from /usr/lib64/glusterfs/3.6.0.45/rpc-transport/socket.so
#4  0x00007f7ce0851d11 in rpcsvc_request_outstanding () from /usr/lib64/libgfrpc.so.0
#5  0x00007f7ce0851d9d in rpcsvc_request_destroy () from /usr/lib64/libgfrpc.so.0
#6  0x00007f7ce0852006 in rpcsvc_submit_generic () from /usr/lib64/libgfrpc.so.0
#7  0x00007f7cd4f685a5 in server_submit_reply () from /usr/lib64/glusterfs/3.6.0.45/xlator/protocol/server.so
#8  0x00007f7cd4f7be52 in server_fsync_cbk () from /usr/lib64/glusterfs/3.6.0.45/xlator/protocol/server.so
#9  0x00007f7cd51a33c0 in io_stats_fsync_cbk () from /usr/lib64/glusterfs/3.6.0.45/xlator/debug/io-stats.so
#10 0x00007f7cd59fbbbb in barrier_fsync_cbk () from /usr/lib64/glusterfs/3.6.0.45/xlator/features/barrier.so
#11 0x00007f7ce0a9014c in default_fsync_cbk () from /usr/lib64/libglusterfs.so.0
#12 0x00007f7ce0aa2f36 in call_resume_unwind () from /usr/lib64/libglusterfs.so.0
#13 0x00007f7ce0aa472b in call_unwind_error () from /usr/lib64/libglusterfs.so.0
#14 0x00007f7cd6468bbc in posix_fsyncer_process () from /usr/lib64/glusterfs/3.6.0.45/xlator/storage/posix.so
#15 0x00007f7cd6468d6a in posix_fsyncer () from /usr/lib64/glusterfs/3.6.0.45/xlator/storage/posix.so
#16 0x0000003b89a079d1 in start_thread () from /lib64/libpthread.so.0
#17 0x0000003b896e89dd in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f7ccd689700 (LWP 15292)):
#0  0x0000003b89a0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f7cd5c092c0 in iot_worker () from /usr/lib64/glusterfs/3.6.0.45/xlator/performance/io-threads.so
#2  0x0000003b89a079d1 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003b896e89dd in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f7cbe1fc700 (LWP 21503)):
#0  0x0000003b896aca3d in nanosleep () from /lib64/libc.so.6
#1  0x0000003b896ac8b0 in sleep () from /lib64/libc.so.6
#2  0x00007f7cd6468710 in posix_health_check_thread_proc () from /usr/lib64/glusterfs/3.6.0.45/xlator/storage/posix.so
#3  0x0000003b89a079d1 in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b896e89dd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f7ce062d700 (LWP 8707)):
#0  0x0000003b89a0822d in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f7ce0ad985d in event_dispatch_epoll () from /usr/lib64/libglusterfs.so.0
#2  0x0000000000408022 in main ()

Comment 4 Ben Turner 2015-02-21 03:07:52 UTC
Also, here is one of the core dumps:

[root@gqac006 tmp]# gdb /usr/sbin/glusterfs /core.21666
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6_4.1)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/glusterfs...(no debugging symbols found)...done.
[New Thread 21672]
[New Thread 21666]
[New Thread 21667]
[New Thread 21679]
[New Thread 21678]
[New Thread 21673]
[New Thread 21670]
[New Thread 21674]
[New Thread 21668]
[New Thread 21671]
Reading symbols from /usr/lib64/libglusterfs.so.0...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libglusterfs.so.0
Reading symbols from /usr/lib64/libgfrpc.so.0...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libgfrpc.so.0
Reading symbols from /usr/lib64/libgfxdr.so.0...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libgfxdr.so.0
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /usr/lib64/libcrypto.so.10...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libcrypto.so.10
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /usr/lib64/glusterfs/3.6.0.45/xlator/mount/fuse.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/glusterfs/3.6.0.45/xlator/mount/fuse.so
Reading symbols from /usr/lib64/glusterfs/3.6.0.45/rpc-transport/socket.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/glusterfs/3.6.0.45/rpc-transport/socket.so
Reading symbols from /usr/lib64/libssl.so.10...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libssl.so.10
Reading symbols from /lib64/libgssapi_krb5.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgssapi_krb5.so.2
Reading symbols from /lib64/libkrb5.so.3...(no debugging symbols found)...done.
Loaded symbols for /lib64/libkrb5.so.3
Reading symbols from /lib64/libcom_err.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libcom_err.so.2
Reading symbols from /lib64/libk5crypto.so.3...(no debugging symbols found)...done.
Loaded symbols for /lib64/libk5crypto.so.3
Reading symbols from /lib64/libkrb5support.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib64/libkrb5support.so.0
Reading symbols from /lib64/libkeyutils.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libkeyutils.so.1
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /lib64/libselinux.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libselinux.so.1
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /usr/lib64/glusterfs/3.6.0.45/xlator/protocol/client.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/glusterfs/3.6.0.45/xlator/protocol/client.so
Reading symbols from /usr/lib64/glusterfs/3.6.0.45/xlator/cluster/distribute.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/glusterfs/3.6.0.45/xlator/cluster/distribute.so
Reading symbols from /usr/lib64/glusterfs/3.6.0.45/xlator/performance/write-behind.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/glusterfs/3.6.0.45/xlator/performance/write-behind.so
Reading symbols from /usr/lib64/glusterfs/3.6.0.45/xlator/performance/read-ahead.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/glusterfs/3.6.0.45/xlator/performance/read-ahead.so
Reading symbols from /usr/lib64/glusterfs/3.6.0.45/xlator/performance/readdir-ahead.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/glusterfs/3.6.0.45/xlator/performance/readdir-ahead.so
Reading symbols from /usr/lib64/glusterfs/3.6.0.45/xlator/performance/io-cache.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/glusterfs/3.6.0.45/xlator/performance/io-cache.so
Reading symbols from /usr/lib64/glusterfs/3.6.0.45/xlator/performance/quick-read.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/glusterfs/3.6.0.45/xlator/performance/quick-read.so
Reading symbols from /usr/lib64/glusterfs/3.6.0.45/xlator/performance/md-cache.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/glusterfs/3.6.0.45/xlator/performance/md-cache.so
Reading symbols from /usr/lib64/glusterfs/3.6.0.45/xlator/debug/io-stats.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/glusterfs/3.6.0.45/xlator/debug/io-stats.so
Reading symbols from /usr/lib64/glusterfs/3.6.0.45/xlator/meta.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/glusterfs/3.6.0.45/xlator/meta.so
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
Core was generated by `/usr/sbin/glusterfs --volfile-server=gqas014.sbu.lab.eng.bos.redhat.com --volfi'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f3f3dc56594 in ?? () from /usr/lib64/glusterfs/3.6.0.45/rpc-transport/socket.so
Missing separate debuginfos, use: debuginfo-install glusterfs-3.6.0.45-1.el6.x86_64
(gdb) thread info
No symbol table is loaded.  Use the "file" command.
(gdb) info thread
  10 Thread 0x7f3f3e860700 (LWP 21671)  0x00000038ad00b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  9 Thread 0x7f3f3fc62700 (LWP 21668)  0x00000038ad00f4b5 in sigwait () from /lib64/libpthread.so.0
  8 Thread 0x7f3f36fc1700 (LWP 21674)  0x00000038acce9163 in epoll_wait () from /lib64/libc.so.6
  7 Thread 0x7f3f3f261700 (LWP 21670)  0x00000038ad00b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  6 Thread 0x7f3f379c2700 (LWP 21673)  0x00000038ad00e264 in __lll_lock_wait () from /lib64/libpthread.so.0
  5 Thread 0x7f3f352bb700 (LWP 21678)  0x00000038ad00e264 in __lll_lock_wait () from /lib64/libpthread.so.0
  4 Thread 0x7f3f2bfff700 (LWP 21679)  0x00000038ad00e75d in read () from /lib64/libpthread.so.0
  3 Thread 0x7f3f40663700 (LWP 21667)  0x00000038ad00ef3d in nanosleep () from /lib64/libpthread.so.0
  2 Thread 0x7f3f47c18700 (LWP 21666)  0x00000038ad00822d in pthread_join () from /lib64/libpthread.so.0
* 1 Thread 0x7f3f3da31700 (LWP 21672)  0x00007f3f3dc56594 in ?? () from /usr/lib64/glusterfs/3.6.0.45/rpc-transport/socket.so
(gdb) bt
#0  0x00007f3f3dc56594 in ?? () from /usr/lib64/glusterfs/3.6.0.45/rpc-transport/socket.so
#1  0x00007f3f3dc57d1d in ?? () from /usr/lib64/glusterfs/3.6.0.45/rpc-transport/socket.so
#2  0x00007f3f480c4d1c in ?? () from /usr/lib64/libglusterfs.so.0
#3  0x00000038ad0079d1 in start_thread () from /lib64/libpthread.so.0
#4  0x00000038acce8b6d in clone () from /lib64/libc.so.6
(gdb) t 2
[Switching to thread 2 (Thread 0x7f3f47c18700 (LWP 21666))]#0  0x00000038ad00822d in pthread_join () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00000038ad00822d in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f3f480c485d in ?? () from /usr/lib64/libglusterfs.so.0
#2  0x0000000000408022 in main ()
(gdb) t 3
[Switching to thread 3 (Thread 0x7f3f40663700 (LWP 21667))]#0  0x00000038ad00ef3d in nanosleep () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00000038ad00ef3d in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f3f4808a03c in gf_timer_proc () from /usr/lib64/libglusterfs.so.0
#2  0x00000038ad0079d1 in start_thread () from /lib64/libpthread.so.0
#3  0x00000038acce8b6d in clone () from /lib64/libc.so.6
(gdb) t 4
[Switching to thread 4 (Thread 0x7f3f2bfff700 (LWP 21679))]#0  0x00000038ad00e75d in read () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00000038ad00e75d in read () from /lib64/libpthread.so.0
#1  0x00007f3f40676d03 in ?? () from /usr/lib64/glusterfs/3.6.0.45/xlator/mount/fuse.so
#2  0x00000038ad0079d1 in start_thread () from /lib64/libpthread.so.0
#3  0x00000038acce8b6d in clone () from /lib64/libc.so.6
(gdb) t 5
[Switching to thread 5 (Thread 0x7f3f352bb700 (LWP 21678))]#0  0x00000038ad00e264 in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00000038ad00e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00000038ad009508 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00000038ad0093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f3f3dc54666 in ?? () from /usr/lib64/glusterfs/3.6.0.45/rpc-transport/socket.so
#4  0x00007f3f47e42bd7 in rpc_clnt_submit () from /usr/lib64/libgfrpc.so.0
#5  0x00007f3f3ce127c2 in client_submit_vec_request () from /usr/lib64/glusterfs/3.6.0.45/xlator/protocol/client.so
#6  0x00007f3f3ce12d06 in client3_3_writev () from /usr/lib64/glusterfs/3.6.0.45/xlator/protocol/client.so
#7  0x00007f3f3cdff01f in client_writev () from /usr/lib64/glusterfs/3.6.0.45/xlator/protocol/client.so
#8  0x00007f3f3cbd5d89 in dht_writev () from /usr/lib64/glusterfs/3.6.0.45/xlator/cluster/distribute.so
#9  0x00007f3f3c990ee4 in wb_fulfill_head () from /usr/lib64/glusterfs/3.6.0.45/xlator/performance/write-behind.so
#10 0x00007f3f3c99107b in wb_fulfill () from /usr/lib64/glusterfs/3.6.0.45/xlator/performance/write-behind.so
#11 0x00007f3f3c992a92 in wb_process_queue () from /usr/lib64/glusterfs/3.6.0.45/xlator/performance/write-behind.so
#12 0x00007f3f3c993398 in wb_writev () from /usr/lib64/glusterfs/3.6.0.45/xlator/performance/write-behind.so
#13 0x00007f3f3c78357c in ra_writev () from /usr/lib64/glusterfs/3.6.0.45/xlator/performance/read-ahead.so
#14 0x00007f3f48073b30 in default_writev () from /usr/lib64/libglusterfs.so.0
#15 0x00007f3f3c3701c6 in ioc_writev () from /usr/lib64/glusterfs/3.6.0.45/xlator/performance/io-cache.so
#16 0x00007f3f3c164000 in qr_writev () from /usr/lib64/glusterfs/3.6.0.45/xlator/performance/quick-read.so
#17 0x00007f3f37df5b0b in mdc_writev () from /usr/lib64/glusterfs/3.6.0.45/xlator/performance/md-cache.so
#18 0x00007f3f37be0f37 in io_stats_writev () from /usr/lib64/glusterfs/3.6.0.45/xlator/debug/io-stats.so
#19 0x00007f3f48073b30 in default_writev () from /usr/lib64/libglusterfs.so.0
#20 0x00007f3f379c815e in meta_writev () from /usr/lib64/glusterfs/3.6.0.45/xlator/meta.so
#21 0x00007f3f40684943 in fuse_write_resume () from /usr/lib64/glusterfs/3.6.0.45/xlator/mount/fuse.so
#22 0x00007f3f4066c856 in ?? () from /usr/lib64/glusterfs/3.6.0.45/xlator/mount/fuse.so
---Type <return> to continue, or q <return> to quit---
#23 0x00007f3f4066c586 in ?? () from /usr/lib64/glusterfs/3.6.0.45/xlator/mount/fuse.so
#24 0x00007f3f4066c89e in ?? () from /usr/lib64/glusterfs/3.6.0.45/xlator/mount/fuse.so
#25 0x00007f3f4066c921 in fuse_resolve_continue () from /usr/lib64/glusterfs/3.6.0.45/xlator/mount/fuse.so
#26 0x00007f3f4066c52e in ?? () from /usr/lib64/glusterfs/3.6.0.45/xlator/mount/fuse.so
#27 0x00007f3f4066c87e in ?? () from /usr/lib64/glusterfs/3.6.0.45/xlator/mount/fuse.so
#28 0x00007f3f4066c8c8 in fuse_resolve_and_resume () from /usr/lib64/glusterfs/3.6.0.45/xlator/mount/fuse.so
#29 0x00007f3f406842e0 in ?? () from /usr/lib64/glusterfs/3.6.0.45/xlator/mount/fuse.so
#30 0x00000038ad0079d1 in start_thread () from /lib64/libpthread.so.0
#31 0x00000038acce8b6d in clone () from /lib64/libc.so.6 
(gdb) t 6
[Switching to thread 6 (Thread 0x7f3f379c2700 (LWP 21673))]#0  0x00000038ad00e264 in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00000038ad00e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00000038ad009508 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00000038ad0093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f3f3dc55f2b in ?? () from /usr/lib64/glusterfs/3.6.0.45/rpc-transport/socket.so
#4  0x00007f3f3dc57d1d in ?? () from /usr/lib64/glusterfs/3.6.0.45/rpc-transport/socket.so
#5  0x00007f3f480c4d1c in ?? () from /usr/lib64/libglusterfs.so.0
#6  0x00000038ad0079d1 in start_thread () from /lib64/libpthread.so.0
#7  0x00000038acce8b6d in clone () from /lib64/libc.so.6
(gdb) t 7
[Switching to thread 7 (Thread 0x7f3f3f261700 (LWP 21670))]#0  0x00000038ad00b98e in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
(gdb) bt
#0  0x00000038ad00b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f3f480a9ccf in syncenv_task () from /usr/lib64/libglusterfs.so.0
#2  0x00007f3f480afc00 in syncenv_processor () from /usr/lib64/libglusterfs.so.0
#3  0x00000038ad0079d1 in start_thread () from /lib64/libpthread.so.0
#4  0x00000038acce8b6d in clone () from /lib64/libc.so.6
(gdb) t 8
[Switching to thread 8 (Thread 0x7f3f36fc1700 (LWP 21674))]#0  0x00000038acce9163 in epoll_wait () from /lib64/libc.so.6
(gdb) bt
#0  0x00000038acce9163 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f3f480c4c11 in ?? () from /usr/lib64/libglusterfs.so.0
#2  0x00000038ad0079d1 in start_thread () from /lib64/libpthread.so.0
#3  0x00000038acce8b6d in clone () from /lib64/libc.so.6
(gdb) t 9
[Switching to thread 9 (Thread 0x7f3f3fc62700 (LWP 21668))]#0  0x00000038ad00f4b5 in sigwait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00000038ad00f4b5 in sigwait () from /lib64/libpthread.so.0
#1  0x00000000004057bb in glusterfs_sigwaiter ()
#2  0x00000038ad0079d1 in start_thread () from /lib64/libpthread.so.0
#3  0x00000038acce8b6d in clone () from /lib64/libc.so.6
(gdb) t 10
[Switching to thread 10 (Thread 0x7f3f3e860700 (LWP 21671))]#0  0x00000038ad00b98e in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
(gdb) bt
#0  0x00000038ad00b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f3f480a9ccf in syncenv_task () from /usr/lib64/libglusterfs.so.0
#2  0x00007f3f480afc00 in syncenv_processor () from /usr/lib64/libglusterfs.so.0
#3  0x00000038ad0079d1 in start_thread () from /lib64/libpthread.so.0
#4  0x00000038acce8b6d in clone () from /lib64/libc.so.6
(gdb) t 11
Thread ID 11 not known.
(gdb) quit

Comment 5 Ben Turner 2015-02-21 03:19:46 UTC
Also to note, I am not seeing problems on pure distribute volumes; so far these problems have only happened on volumes with replication. I am fairly confident I cannot reproduce this on pure distribute volumes, having run for 3 hours straight on pure distribute. When I switched to replica 2, the FS hung on a replica volume on the first, third, fourth, and sixth runs.

Comment 7 Shyamsundar 2015-02-23 15:50:29 UTC
(In reply to Ben Turner from comment #3)

That comment captured a hang/busy CPU on gluster, where the thread seems to be spinning on a pthread spin lock in the epoll event maintenance code. This has no bearing on the crashes (so far, at least). If it is observed again, we need to open a separate bug so that we do not spin and hang systems as described in comment #3.

Comment 9 Shyamsundar 2015-02-23 17:37:45 UTC
(In reply to Ben Turner from comment #4)

More from the core post installation of the gluster debug info:

Core was generated by `/usr/sbin/glusterfs --volfile-server=gqas014.sbu.lab.eng.bos.redhat.com --volfi'.
Program terminated with signal 11, Segmentation fault.
#0  __socket_proto_state_machine (this=0x7f3f30088360) at socket.c:2068
2068                                    vector[count].iov_base = iobuf_ptr (in->iobuf);

(gdb) bt
#0  __socket_proto_state_machine (this=0x7f3f30088360) at socket.c:2068
#1  socket_proto_state_machine (this=0x7f3f30088360) at socket.c:2139
#2  socket_event_poll_in (this=0x7f3f30088360) at socket.c:2160
#3  0x00007f3f3dc57d1d in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x7f3f30088360, poll_in=1, poll_out=0, poll_err=0) at socket.c:2280
#4  0x00007f3f480c4d1c in event_dispatch_epoll_handler (data=0x1242bc0) at event-epoll.c:551
#5  event_dispatch_epoll_worker (data=0x1242bc0) at event-epoll.c:643
#6  0x00000038ad0079d1 in start_thread (arg=0x7f3f3da31700) at pthread_create.c:301
#7  0x00000038acce8b6d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

(gdb) list
2063                                                    ret = -1;
2064                                                    goto out;
2065                                            }
2066                                    }
2067
2068                                    vector[count].iov_base = iobuf_ptr (in->iobuf);
2069                                    vector[count].iov_len = in->iobuf_size;
2070
2071                                    iobref = in->iobref;
2072

(gdb) info args 
pollin = <value optimized out>
this = 0x7f3f30088360

(gdb) info locals 
count = 1
ret = <value optimized out>
iobuf = <value optimized out>
frag = 0x7f3f30088fa8
priv = 0x7f3f30088f80
iobref = 0x7f3f30d524a0
vector = {{iov_base = 0x0, iov_len = 0}, {iov_base = 0x0, iov_len = 0}}
in = 0x7f3f30088fa0

(gdb) p in
$1 = (struct gf_sock_incoming *) 0x7f3f30088fa0

(gdb) p *in
$2 = {record_state = SP_STATE_NADA, frag = {fragcurrent = 0x7f3f47bf6dd8 "", bytes_read = 0, remaining_size = 0, vector = {iov_base = 0x7f3f47bf6cec, iov_len = 236}, pending_vector = 0x7f3f30088fb8, call_body = {request = {
        header_state = SP_STATE_REQUEST_HEADER_INIT, vector_state = SP_STATE_VECTORED_REQUEST_INIT, vector_sizer_state = 0}, reply = {accept_status = 0, status_state = SP_STATE_VECTORED_REPLY_STATUS_INIT, 
        accepted_state = SP_STATE_ACCEPTED_REPLY_INIT, accepted_success_state = SP_STATE_ACCEPTED_SUCCESS_REPLY_INIT}}, simple_state = SP_STATE_SIMPLE_MSG_INIT, state = SP_STATE_RPCFRAG_INIT}, proghdr_base_addr = 0x0, iobuf = 0x0, 
  iobuf_size = 236, vector = {{iov_base = 0x7f3f30089054, iov_len = 4}, {iov_base = 0x0, iov_len = 0}}, count = 0, payload_vector = {iov_base = 0x0, iov_len = 0}, iobref = 0x7f3f30d524a0, request_info = 0x0, 
  pending_vector = 0x7f3f30088fc8, pending_count = 0, fraghdr = 2147483884, complete_record = 0 '\000', msg_type = REPLY, total_bytes_read = 236, ra_read = 236, ra_max = 236, ra_served = 236, ra_buf = 0x7f3f47bf6c00 ""}

in->iobuf is NULL; hence the dereference via iobuf_ptr(in->iobuf) caused the segfault in this case.

Interestingly, record_state should have changed from NADA to some other state before reaching this part of the code. The core needs more analysis to make progress.
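
To make the failure mode concrete, here is a minimal, self-contained sketch (simplified stand-in types, not the real socket.c structures and not the fix that shipped) of what the crash site does when in->iobuf is unexpectedly NULL:

/* Sketch only: stand-ins for gluster's iobuf and gf_sock_incoming. */
#include <stdio.h>
#include <stddef.h>
#include <sys/uio.h>

struct iobuf { void *ptr; };                          /* stand-in for gluster's iobuf */
struct gf_sock_incoming { struct iobuf *iobuf; size_t iobuf_size; };

#define iobuf_ptr(iob) ((iob)->ptr)                   /* same shape as the real macro */

int main (void)
{
        /* Values mirror the core: iobuf == NULL, iobuf_size == 236. */
        struct gf_sock_incoming in = { .iobuf = NULL, .iobuf_size = 236 };
        struct iovec vector[2] = { { .iov_base = NULL, .iov_len = 0 } };
        int count = 1;

        /* socket.c:2068 performs the assignment below unconditionally; with a
         * garbled incoming record in.iobuf is NULL and the dereference is the
         * SIGSEGV in the backtrace above.  The guard here is purely
         * illustrative. */
        if (in.iobuf == NULL) {
                fprintf (stderr, "in->iobuf is NULL: the real code would crash here\n");
                return 1;
        }
        vector[count].iov_base = iobuf_ptr (in.iobuf);
        vector[count].iov_len  = in.iobuf_size;
        printf ("payload vector: base=%p len=%zu\n",
                vector[count].iov_base, vector[count].iov_len);
        return 0;
}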

Comment 10 Shyamsundar 2015-02-23 21:15:40 UTC
The analysis below shows that we are probably racing between 2 epoll threads, hitting the race noted in the review at http://review.gluster.org/#/c/3842/19/libglusterfs/src/event-epoll.c

The issue is that we drop the socket priv lock in __socket_read_reply, while a call to __socket_proto_state_machine assumes the lock is not dropped in between.

From the various threads in the core, we can see 3 threads operating on the same socket, one in write and 2 in read (one waiting on the lock in __socket_proto_state_machine and the other causing the crash above).

From the stack trace in comment #9 above, we see that record_state should not be NADA at the point of the crash. This could only happen if another thread raced on the same socket and the processing within __socket_proto_state_machine was garbled.

The other 2 threads that are presently operating on the same socket are as follows:
Thread 6 (Thread 0x7f3f379c2700 (LWP 21673)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1  0x00000038ad009508 in _L_lock_854 () from /lib64/libpthread-2.12.so
#2  0x00000038ad0093d7 in __pthread_mutex_lock (mutex=0x7f3f30089088) at pthread_mutex_lock.c:61
#3  0x00007f3f3dc55f2b in socket_proto_state_machine (this=0x7f3f30088360) at socket.c:2137
#4  socket_event_poll_in (this=0x7f3f30088360) at socket.c:2160
#5  0x00007f3f3dc57d1d in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x7f3f30088360, poll_in=1, poll_out=0, poll_err=0) at socket.c:2280
#6  0x00007f3f480c4d1c in event_dispatch_epoll_handler (data=0x7f3f3002bf50) at event-epoll.c:551
#7  event_dispatch_epoll_worker (data=0x7f3f3002bf50) at event-epoll.c:643
#8  0x00000038ad0079d1 in start_thread (arg=0x7f3f379c2700) at pthread_create.c:301
#9  0x00000038acce8b6d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 5 (Thread 0x7f3f352bb700 (LWP 21678)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1  0x00000038ad009508 in _L_lock_854 () from /lib64/libpthread-2.12.so
#2  0x00000038ad0093d7 in __pthread_mutex_lock (mutex=0x7f3f30089088) at pthread_mutex_lock.c:61
#3  0x00007f3f3dc54666 in socket_submit_request (this=0x7f3f30088360, req=0x7f3f352ba050) at socket.c:3166
#4  0x00007f3f47e42bd7 in rpc_clnt_submit (rpc=0x7f3f30078760, prog=<value optimized out>, procnum=<value optimized out>, cbkfn=0x7f3f3ce1c3d0 <client3_3_writev_cbk>, proghdr=0x7f3f352ba1a0, proghdrcount=1, progpayload=0x7f3f352ba690, 
    progpayloadcount=1, iobref=<value optimized out>, frame=0x7f3f46f0b474, rsphdr=0x0, rsphdr_count=0, rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0) at rpc-clnt.c:1555
#5  0x00007f3f3ce127c2 in client_submit_vec_request (this=0x7f3f300060f0, req=0x7f3f352ba460, frame=0x7f3f46f0b474, prog=0x7f3f3d030820, procnum=13, cbkfn=0x7f3f3ce1c3d0 <client3_3_writev_cbk>, payload=0x7f3f352ba690, payloadcnt=1, 
    iobref=0x7f3f20025a90, xdrproc=0x7f3f47c25ec0 <xdr_gfs3_write_req>) at client-rpc-fops.c:88
#6  0x00007f3f3ce12d06 in client3_3_writev (frame=0x7f3f46f0b474, this=0x7f3f300060f0, data=0x7f3f352ba4f0) at client-rpc-fops.c:4187
#7  0x00007f3f3cdff01f in client_writev (frame=0x7f3f46f0b474, this=0x7f3f300060f0, fd=<value optimized out>, vector=<value optimized out>, count=<value optimized out>, off=<value optimized out>, flags=32770, iobref=0x7f3f20025a90, 
    xdata=0x0) at client.c:1016

So basically I believe we hit the race condition because the lock is dropped in between; errno in the main thread that crashed is also 11 (EAGAIN), but it is not handled overall. We need to verify this analysis to move forward with a fix.

To verify:
1) Can __socket_read_reply drop the lock while processing map_xid?
2) The incoming state for this request seems to have read a few bytes or an entire RPC request (check ra_read/max/served); is it possible that the 2 threads caused this garbling of data?
3) The write seems to need poll_out, as the socket ioq appears to have a request cached (inspect priv->ioq, where priv is socket_private_t).
4) The socket is also primed for poll_out, as seen in slot->events in frame 5 (0xc0000007, where 07 is IN/OUT/PRI and ONESHOT and ET are the higher bits; see sys/epoll.h). This means a poll-out was requested, and hence there is a chance for the socket to be armed once more in the said race.
5) The slot in event_dispatch_epoll_handler also shows an in-handler count of 2, which the threads evidently demonstrate.

Bottom line: even with LT mode of epoll (which is another patch, to handle the throttle) we could hit this condition; as a result, we need a fix for the race mentioned in the review.
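
As an illustration only (simplified stand-ins, not the GlusterFS sources), the following self-contained program models the suspected race: one epoll worker drops the per-connection lock part-way through the incoming state machine, a second worker handling poll-in on the same connection resets the incoming record, and the first worker then resumes with assumptions that no longer hold:

/* Sketch of the race pattern; build with: gcc -pthread race.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

enum record_state { SP_STATE_NADA, SP_STATE_READING_PAYLOAD };

struct conn {
        pthread_mutex_t   lock;          /* stand-in for priv->lock             */
        enum record_state state;         /* stand-in for in->record_state       */
        void             *iobuf;         /* stand-in for in->iobuf, may be NULL */
};

static void *reader_a (void *arg)        /* epoll worker that crashes           */
{
        struct conn *c = arg;

        pthread_mutex_lock (&c->lock);
        c->state = SP_STATE_READING_PAYLOAD;
        c->iobuf = &c->state;            /* pretend a payload buffer was set up */

        /* Equivalent of __socket_read_reply dropping the lock mid-way. */
        pthread_mutex_unlock (&c->lock);
        usleep (1000);                   /* window for the other worker         */
        pthread_mutex_lock (&c->lock);

        /* Resuming here, the code assumes state/iobuf are untouched. */
        if (c->state == SP_STATE_NADA || c->iobuf == NULL)
                fprintf (stderr, "state garbled by the other reader: "
                         "this is where socket.c:2068 would fault\n");
        pthread_mutex_unlock (&c->lock);
        return NULL;
}

static void *reader_b (void *arg)        /* second epoll worker, same socket    */
{
        struct conn *c = arg;

        pthread_mutex_lock (&c->lock);
        c->state = SP_STATE_NADA;        /* re-initialise the incoming record   */
        c->iobuf = NULL;
        pthread_mutex_unlock (&c->lock);
        return NULL;
}

int main (void)
{
        struct conn c = { PTHREAD_MUTEX_INITIALIZER, SP_STATE_NADA, NULL };
        pthread_t a, b;

        pthread_create (&a, NULL, reader_a, &c);
        usleep (500);                    /* let reader_a drop the lock first    */
        pthread_create (&b, NULL, reader_b, &c);
        pthread_join (a, NULL);
        pthread_join (b, NULL);
        return 0;
}

In the real code the analogous window is the lock drop in __socket_read_reply between two passes through __socket_proto_state_machine; closing that window (or serialising pollers per socket) is presumably what a fix for the mentioned race would need to achieve.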

Comment 12 Raghavendra G 2015-02-24 04:28:46 UTC
@shyam,

Good one!! I was racking my head for the past two days wondering how someone could modify the state while we are holding the lock. Now that you point out the code, it's clearer. Yes, dropping the lock in __socket_read_reply is very likely the reason for the crash.

regards,
Raghavendra

Comment 16 Ben Turner 2015-03-11 14:30:28 UTC
Verified on glusterfs-3.6.0.50-1.

Comment 18 errata-xmlrpc 2015-03-26 06:36:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0682.html

