Bug 960141 - NFS no longer responds, get "Reply submission failed" errors
Summary: NFS no longer responds, get "Reply submission failed" errors
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: GlusterFS
Classification: Community
Component: nfs
Version: 3.3.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-05-06 14:55 UTC by Michael Brown
Modified: 2014-12-14 19:40 UTC
CC List: 4 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2014-12-14 19:40:31 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Michael Brown 2013-05-06 14:55:37 UTC
Description of problem:
Using glusterfs-3.3.1 (plus a few patches, see git://github.com/Supermathie/glusterfs.git branch release-3.3-oracle, latest commit is 3bebd52812906f9b48677779fa15f91341437cb8), my setup gets into a state where GlusterFS is no longer able to respond to RPCs.

Version-Release number of selected component (if applicable):
3.3.1+

How reproducible:
100%

Steps to Reproduce:
1. Run heavy workload against gluster
2. NFS stops responding:
[2013-05-05 01:28:56.165044] I [afr-self-heal-data.c:712:afr_sh_data_fix] 0-gv0-replicate-5: no active sinks for performing self-heal on file /fleming4/db0/ALTUS_data/users04.dbf
[2013-05-05 01:28:56.218572] I [afr-self-heal-data.c:712:afr_sh_data_fix] 0-gv0-replicate-4: no active sinks for performing self-heal on file /fleming2/db0/ALTUS_data/users01.dbf
[2013-05-05 01:31:16.259641] W [socket.c:195:__socket_rwv] 0-socket.nfs-server: writev failed (Connection reset by peer)
[2013-05-05 01:31:16.315405] I [socket.c:1798:socket_event_handler] 0-transport: disconnecting now
[2013-05-05 01:31:16.421480] I [socket.c:2387:socket_submit_reply] 0-socket.nfs-server: not connected (priv->connected = 255)
[2013-05-05 01:31:16.421507] E [rpcsvc.c:1080:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x3705786983x, Program: NFS3, ProgVers: 3, Proc: 6) to rpc-transport (socket.nfs-server)
[2013-05-05 01:31:16.421528] E [nfs3.c:627:nfs3svc_submit_vector_reply] 0-nfs-nfsv3: Reply submission failed
[2013-05-05 01:31:16.966684] E [rpcsvc.c:1080:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x3705788008x, Program: NFS3, ProgVers: 3, Proc: 6) to rpc-transport (socket.nfs-server)
[2013-05-05 01:31:16.966725] E [nfs3.c:627:nfs3svc_submit_vector_reply] 0-nfs-nfsv3: Reply submission failed
[2013-05-05 01:31:16.999766] E [rpcsvc.c:1080:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x3705795067x, Program: NFS3, ProgVers: 3, Proc: 6) to rpc-transport (socket.nfs-server)
[2013-05-05 01:31:16.999794] E [nfs3.c:627:nfs3svc_submit_vector_reply] 0-nfs-nfsv3: Reply submission failed
[2013-05-05 01:31:17.191748] E [rpcsvc.c:1080:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x3705795452x, Program: NFS3, ProgVers: 3, Proc: 6) to rpc-transport (socket.nfs-server)

3. Restart NFS (or all of GlusterFS itself) and the same thing happens
  
Actual results:
[2013-05-05 01:31:16.421507] E [rpcsvc.c:1080:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x3705786983x, Program: NFS3, ProgVers: 3, Proc: 6) to rpc-transport (socket.nfs-server)
[2013-05-05 01:31:16.421528] E [nfs3.c:627:nfs3svc_submit_vector_reply] 0-nfs-nfsv3: Reply submission failed

Expected results:
NFS works

Additional info:
 The NFS daemon also gets rather large - I suspect it's storing up data for all these RPCs:
  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
30674 root       20   0 37.6G 37.3G  2288 R 99.0 29.6  5:29.88 /usr/local/glusterfs/sbin/glusterfs
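
(For what it's worth: the "not connected (priv->connected = 255)" message above is -1 stored in an unsigned byte - my reading is that the transport has been disconnected, or never completed its handshake, while replies for previously-accepted calls are still being pushed at it, which is what produces the "Reply submission failed" storm.)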

Comment 1 Michael Brown 2013-05-06 15:57:11 UTC
Same situation, with DEBUG on:

[2013-05-06 11:53:56.971002] D [mem-pool.c:457:mem_get] (-->/usr/local/glusterfs/lib/glusterfs/3.3git/xlator/nfs/server.so(nfs3_read_fd_resume+0x86) [0x7f9ea1a29e66] (-->/usr/local/glusterfs/lib/glusterfs/3.3git/xlator/nfs/server.so(nfs_fop_read+0x86) [0x7f9ea1a11996] (-->/usr/local/glusterfs/lib/glusterfs/3.3git/xlator/nfs/server.so(nfs_fop_local_init+0x1b) [0x7f9ea1a112bb]))) 0-mem-pool: Mem pool is full. Callocing mem
[2013-05-06 11:53:56.971052] D [mem-pool.c:457:mem_get] (-->/usr/local/glusterfs/lib/glusterfs/3.3git/xlator/nfs/server.so(nfs3_read_fd_resume+0x86) [0x7f9ea1a29e66] (-->/usr/local/glusterfs/lib/glusterfs/3.3git/xlator/nfs/server.so(nfs_fop_read+0xe1) [0x7f9ea1a119f1] (-->/usr/local/glusterfs/lib/libglusterfs.so.0(mem_get0+0x1b) [0x7f9ea6b05feb]))) 0-mem-pool: Mem pool is full. Callocing mem
[2013-05-06 11:53:56.971094] D [mem-pool.c:457:mem_get] (-->/usr/local/glusterfs/lib/glusterfs/3.3git/xlator/nfs/server.so(nfs_fop_read+0x1cd) [0x7f9ea1a11add] (-->/usr/local/glusterfs/lib/glusterfs/3.3git/xlator/debug/io-stats.so(io_stats_readv+0x75) [0x7f9ea1c5b4d5] (-->/usr/local/glusterfs/lib/libglusterfs.so.0(mem_get0+0x1b) [0x7f9ea6b05feb]))) 0-mem-pool: Mem pool is full. Callocing mem
[2013-05-06 11:53:56.971132] D [mem-pool.c:457:mem_get] (-->/usr/local/glusterfs/lib/glusterfs/3.3git/xlator/debug/io-stats.so(io_stats_readv+0x164) [0x7f9ea1c5b5c4] (-->/usr/local/glusterfs/lib/glusterfs/3.3git/xlator/performance/write-behind.so(wb_readv+0x1a2) [0x7f9ea1e77cf2] (-->/usr/local/glusterfs/lib/libglusterfs.so.0(mem_get0+0x1b) [0x7f9ea6b05feb]))) 0-mem-pool: Mem pool is full. Callocing mem
[2013-05-06 11:53:56.971188] D [mem-pool.c:457:mem_get] (-->/usr/local/glusterfs/lib/libglusterfs.so.0(fop_readv_stub+0x45) [0x7f9ea6afa565] (-->/usr/local/glusterfs/lib/libglusterfs.so.0(+0x312cd) [0x7f9ea6af72cd] (-->/usr/local/glusterfs/lib/libglusterfs.so.0(mem_get0+0x1b) [0x7f9ea6b05feb]))) 0-mem-pool: Mem pool is full. Callocing mem
[2013-05-06 11:53:56.973938] D [nfs3-helpers.c:3421:nfs3_log_read_res] 0-nfs-nfsv3: XID: 73f5c6e8, READ: NFS: 0(Call completed successfully.), POSIX: 0(Success), count: 8192, is_eof: 0, vector: count: 1, len: 8192
[2013-05-06 11:53:56.973970] E [rpcsvc.c:1080:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x1945487080x, Program: NFS3, ProgVers: 3, Proc: 6) to rpc-transport (socket.nfs-server)
[2013-05-06 11:53:56.973985] E [nfs3.c:627:nfs3svc_submit_vector_reply] 0-nfs-nfsv3: Reply submission failed
[2013-05-06 11:53:56.974320] D [nfs3-helpers.c:3421:nfs3_log_read_res] 0-nfs-nfsv3: XID: 740b0400, READ: NFS: 0(Call completed successfully.), POSIX: 0(Success), count: 8192, is_eof: 0, vector: count: 1, len: 8192
[2013-05-06 11:53:56.975805] D [nfs3-helpers.c:3421:nfs3_log_read_res] 0-nfs-nfsv3: XID: 740adb6e, READ: NFS: 0(Call completed successfully.), POSIX: 0(Success), count: 8192, is_eof: 0, vector: count: 1, len: 8192
[2013-05-06 11:53:56.975830] E [rpcsvc.c:1080:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x1946868590x, Program: NFS3, ProgVers: 3, Proc: 6) to rpc-transport (socket.nfs-server)
[2013-05-06 11:53:56.975844] E [nfs3.c:627:nfs3svc_submit_vector_reply] 0-nfs-nfsv3: Reply submission failed
[2013-05-06 11:53:56.975878] D [afr-lk-common.c:403:transaction_lk_op] 0-gv0-replicate-1: lk op is for a transaction
[2013-05-06 11:53:56.975962] D [mem-pool.c:457:mem_get] (-->/usr/local/glusterfs/lib/libgfrpc.so.0(rpcsvc_notify+0x93) [0x7f9ea68b8463] (-->/usr/local/glusterfs/lib/libgfrpc.so.0(rpcsvc_handle_rpc_call+0xee) [0x7f9ea68b814e] (-->/usr/local/glusterfs/lib/libgfrpc.so.0(rpcsvc_request_create+0x79) [0x7f9ea68b7d39]))) 0-mem-pool: Mem pool is full. Callocing mem
[2013-05-06 11:53:56.975990] D [nfs3-helpers.c:1757:nfs3_log_rw_call] 0-nfs-nfsv3: XID: 7436fcc8, READ: args: FH: exportid 52b830b4-f6b0-4103-b90b-ce665dfbafec, gfid f4aca6a7-9210-435c-b2ca-0008f0650c39, offset: 422511616,  count: 65536
[2013-05-06 11:53:56.976036] D [mem-pool.c:457:mem_get] (-->/usr/local/glusterfs/lib/glusterfs/3.3git/xlator/nfs/server.so(nfs3svc_read+0x92) [0x7f9ea1a29cf2] (-->/usr/local/glusterfs/lib/glusterfs/3.3git/xlator/nfs/server.so(nfs3_read+0x31e) [0x7f9ea1a29aae] (-->/usr/local/glusterfs/lib/glusterfs/3.3git/xlator/nfs/server.so(nfs3_call_state_init+0x45) [0x7f9ea1a25435]))) 0-mem-pool: Mem pool is full. Callocing mem
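
(Side note: the "Mem pool is full. Callocing mem" debug lines are mem-pool.c's fallback path - once the preallocated pool is exhausted, every mem_get() degrades to a plain calloc. Harmless in itself, but seeing it on every request suggests calls are being accepted far faster than their replies are being retired.)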

Comment 2 Michael Brown 2013-05-06 16:00:35 UTC
Trace output:

[2013-05-05 02:49:16.628001] D [mem-pool.c:457:mem_get] (-->/usr/local/glusterfs/lib/libglusterfs.so.0(fop_writev_stub+0x52) [0x7f1c05dc7d22] (-->/usr/local/glusterfs/lib/libglusterfs.so.0(+0x312cd) [0x7f1c05dbf2cd] (-->/usr/local/glusterfs/lib/libglusterfs.so.0(mem_get0+0x1b) [0x7f1c05dcdfeb]))) 0-mem-pool: Mem pool is full. Callocing mem
[2013-05-05 02:49:16.628045] D [nfs3-helpers.c:3441:nfs3_log_write_res] 0-nfs-nfsv3: XID: 72786bb1, WRITE: NFS: 0(Call completed successfully.), POSIX: 0(Success), count: 8192, STABLE,wverf: 1367736515
[2013-05-05 02:49:16.628065] T [rpcsvc.c:1050:rpcsvc_submit_generic] 0-rpc-service: Tx message: 28
[2013-05-05 02:49:16.628079] T [rpcsvc.c:676:rpcsvc_record_build_header] 0-rpc-service: Reply fraglen 52, payload: 28, rpc hdr: 24
[2013-05-05 02:49:16.628107] T [rpcsvc.c:1087:rpcsvc_submit_generic] 0-rpc-service: submitted reply for rpc-message (XID: 0x1920494513x, Program: NFS3, ProgVers: 3, Proc: 7) to rpc-transport (socket.nfs-server)
[2013-05-05 02:49:16.628124] T [nfs3.c:447:nfs3_call_state_wipe] 0-nfs-nfsv3: fd 0x264f8dc ref: 321
[2013-05-05 02:49:16.628139] T [write-behind.c:456:wb_sync] 0-gv0-write-behind: no vectors are to be synced
[2013-05-05 02:49:16.628169] T [rpc-clnt.c:685:rpc_clnt_reply_init] 0-gv0-client-8: received rpc message (RPC XID: 0x2495x Program: GlusterFS 3.1, ProgVers: 330, Proc: 12) from rpc-transport (gv0-client-8)
[2013-05-05 02:49:16.629067] T [write-behind.c:456:wb_sync] 0-gv0-write-behind: no vectors are to be synced
[2013-05-05 02:49:16.629091] D [nfs3-helpers.c:3421:nfs3_log_read_res] 0-nfs-nfsv3: XID: 73f42f15, READ: NFS: 0(Call completed successfully.), POSIX: 0(Success), count: 8192, is_eof: 0, vector: count: 1, len: 8192
[2013-05-05 02:49:16.629109] T [rpcsvc.c:1050:rpcsvc_submit_generic] 0-rpc-service: Tx message: 8296
[2013-05-05 02:49:16.629123] T [rpcsvc.c:676:rpcsvc_record_build_header] 0-rpc-service: Reply fraglen 8320, payload: 8296, rpc hdr: 24
[2013-05-05 02:49:16.629136] E [rpcsvc.c:1080:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x1945382677x, Program: NFS3, ProgVers: 3, Proc: 6) to rpc-transport (socket.nfs-server)
[2013-05-05 02:49:16.629150] E [nfs3.c:627:nfs3svc_submit_vector_reply] 0-nfs-nfsv3: Reply submission failed
[2013-05-05 02:49:16.629162] T [nfs3.c:447:nfs3_call_state_wipe] 0-nfs-nfsv3: fd 0x264f7b0 ref: 44403

Comment 3 Michael Brown 2013-05-06 16:21:51 UTC
Looks like NFS is still responding to *some* requests:

[root@fearless1 ~]# tshark -n -i eth2 -R "rpc" -c 800
Running as user "root" and group "root". This could be dangerous.
Capturing on eth2
  0.144428 192.168.10.4 -> 192.168.10.1 NFS V3 NULL Call
  0.575769 192.168.10.1 -> 192.168.10.4 NFS V3 NULL Reply (Call In 3)
  1.076045 192.168.10.6 -> 192.168.10.1 NFS V3 NULL Call
  1.076527 192.168.10.3 -> 192.168.10.1 NFS V3 NULL Call
  1.138787 192.168.10.1 -> 192.168.10.4 NFS [RPC duplicate of #17]V3 NULL Reply (Call In 3)
  1.139004 192.168.10.6 -> 192.168.10.1 NFS V3 GETATTR Call, FH:0x821472b3
  1.226347 192.168.10.6 -> 192.168.10.1 NFS [TCP ZeroWindow] V3 READ Call, FH:0xb45a3c21 Offset:1593188352 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1593196544 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1802338304 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1593204736 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1593212928 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1593221120 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1593229312 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1593237504 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1593245696 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1800503296 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1593253888 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1917083648 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:2022531072 Len:8192 ;
    V3 READ Call, FH:0x10d045d8 Offset:4241022976 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1918623744 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1591746560 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1918279680 Len:8192 ;
    V3 NULL Call ;
    V3 READ Call, FH:0x10d045d8 Offset:4241055744 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1590812672 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:2022137856 Len:8192 ;
    V3 READ Call, FH:0x10d045d8 Offset:4241063936 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1917304832 Len:8192 ;
    V3 READ Call, FH:0x10d045d8 Offset:4241072128 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1591279616 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1591754752 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:2022146048 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1591296000 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1590870016 Len:8192 ;
    V3 READ Call, FH:0x99b731c7 Offset:121511936 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1590181888 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1918525440 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1590820864 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1917313024 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1590190080 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1918763008 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1917100032 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1590198272 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1591762944 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1918844928 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1590206464 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1918533632 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1590214656 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1918771200 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1591287808 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1920507904 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:2022555648 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1590878208 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1917108224 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1590222848 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1920516096 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1591304192 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1591984128 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1918853120 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1919655936 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1590231040 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1590370304 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1590902784 Len:8192 ;
    V3 READ Call, FH:0x99b731c7 Offset:121610240 Len:8192 ;
    V3 READ Call, FH:0xb45a3c21 Offset:1590239232 Len:8192
  1.438041 192.168.10.6 -> 192.168.10.1 RPC [TCP ZeroWindow] Continuation
  1.681157 192.168.10.6 -> 192.168.10.1 NFS V3 WRITE Call, FH:0x86e03073 Offset:39110 Len:3030 UNSTABLE
  1.722757 192.168.10.1 -> 192.168.10.6 NFS V3 NULL Reply (Call In 43)
  1.722784 192.168.10.1 -> 192.168.10.3 NFS V3 NULL Reply (Call In 45)
  2.817842 192.168.10.1 -> 192.168.10.6 NFS V3 WRITE Reply (Call In 82) Len:3030 UNSTABLE
  2.817944 192.168.10.6 -> 192.168.10.1 NFS V3 WRITE Call, FH:0x7444543b Offset:31672 Len:3001 UNSTABLE
  3.318538 192.168.10.1 -> 192.168.10.6 NFS V3 GETATTR Reply (Call In 56)  Regular File mode:0640 uid:500 gid:1000
  3.318613 192.168.10.6 -> 192.168.10.1 NFS V3 WRITE Call, FH:0xcbdf97b5 Offset:1618 Len:90 UNSTABLE
  3.850941 192.168.10.1 -> 192.168.10.6 NFS V3 WRITE Reply (Call In 111) Len:3001 UNSTABLE

Comment 4 Michael Brown 2013-05-06 18:03:42 UTC
I'm starting to suspect that Gluster is unable to properly handle and respond to RPC packets with a large number of calls bundled together.

Notice the packet at t=1.226347 above - there are around 60 READ calls all bundled together, none of which appears to get a response.

Is it possible that constructing the reply/replies is overwhelming Gluster somehow?
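
For anyone following along: this is how ONC RPC record marking over TCP works (RFC 5531, section 11) - nothing in the protocol stops a client from packing dozens of complete calls into a single segment, and the server then owes a reply for every one of them. A minimal standalone C sketch (my illustration, not GlusterFS source):

#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RPC_LASTFRAG 0x80000000u

/* Count the complete RPC records sitting in one raw TCP payload.
 * Each fragment is prefixed by a 4-byte big-endian marker: the top
 * bit flags the last fragment of a record, the low 31 bits give the
 * fragment length. */
static size_t count_rpc_records(const unsigned char *buf, size_t len)
{
        size_t off = 0, records = 0;

        while (len - off >= 4) {
                uint32_t marker;
                memcpy(&marker, buf + off, 4);
                marker = ntohl(marker);    /* record mark is big-endian */

                uint32_t fraglen = marker & ~RPC_LASTFRAG;
                if (len - off - 4 < fraglen)
                        break;             /* partial fragment: need more data */

                off += 4 + fraglen;
                if (marker & RPC_LASTFRAG)
                        records++;         /* one complete RPC message ends here */
        }
        return records;
}

Each of those records becomes its own call frame and, eventually, its own reply submission - so a single TCP read can fan out into ~60 READs like the packet above.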

Comment 5 Michael Brown 2013-05-06 19:10:06 UTC
GlusterFS is definitely getting into this state (not responding to bulk READ calls but still responding to other calls, glusterfs/nfs server at 99% CPU) *before* the errors start appearing in the log.

Comment 6 Michael Brown 2013-05-09 06:24:42 UTC
(there's probably going to be a TON of wrapped lines, so you can see the unwrapped version at http://paste.fedoraproject.org/11223/80643136/)

OK, after taking a much more serious look at this, I've got:
* a full packet capture of a situation that causes this problem (358MB compressed)
* logs from gluster, truncated before starting, at TRACE level (88MB compressed)
* a corefile from the gluster/nfs process generated near the end of the test (10GB, compressible to 133MB or so)

My configuration:
Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 52b830b4-f6b0-4103-b90b-ce665dfbafec
Status: Stopped
Number of Bricks: 8 x 2 = 16
Transport-type: tcp
Bricks:
Brick1: fearless1:/export/bricks/500117310007a6d8/glusterdata
Brick2: fearless2:/export/bricks/500117310007a674/glusterdata
Brick3: fearless1:/export/bricks/500117310007a714/glusterdata
Brick4: fearless2:/export/bricks/500117310007a684/glusterdata
Brick5: fearless1:/export/bricks/500117310007a7dc/glusterdata
Brick6: fearless2:/export/bricks/500117310007a694/glusterdata
Brick7: fearless1:/export/bricks/500117310007a7e4/glusterdata
Brick8: fearless2:/export/bricks/500117310007a720/glusterdata
Brick9: fearless1:/export/bricks/500117310007a7ec/glusterdata
Brick10: fearless2:/export/bricks/500117310007a74c/glusterdata
Brick11: fearless1:/export/bricks/500117310007a838/glusterdata
Brick12: fearless2:/export/bricks/500117310007a814/glusterdata
Brick13: fearless1:/export/bricks/500117310007a850/glusterdata
Brick14: fearless2:/export/bricks/500117310007a84c/glusterdata
Brick15: fearless1:/export/bricks/500117310007a858/glusterdata
Brick16: fearless2:/export/bricks/500117310007a8f8/glusterdata
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.nfs.io-threads: on
performance.client-io-threads: on
performance.io-cache: on
performance.write-behind-window-size: 4MB
performance.nfs.write-behind: on
nfs.ports-insecure: on
diagnostics.client-log-level: INFO


Notable "problems":
RPC context pointer, requested by JoeJulian:
(gdb) print *this
$22 = {ops = 0x7f05d50d2580, listener = 0x0, private = 0x27b6d90, xl_private = 0x0, xl = 0x261ffb0, mydata = 0x27a70d0, lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, 
      __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, refcount = 2, ctx = 0x25dd010, options = 0x7f05d70ba3c8, 
  name = 0x26215e0 "gv0-client-9", dnscache = 0x28fda50, buf = 0x0, init = 0x7f05d4ec5650 <init>, fini = 0x7f05d4ec6780 <fini>, reconfigure = 0x7f05d4ec56d0 <reconfigure>, 
  notify = 0x7f05d83f2bd0 <rpc_clnt_notify>, notify_data = 0x0, peerinfo = {sockaddr = {ss_family = 2, __ss_align = 0, __ss_padding = '\000' <repeats 111 times>}, sockaddr_len = 16, 
    identifier = "192.168.11.2:24037", '\000' <repeats 89 times>}, myinfo = {sockaddr = {ss_family = 2, __ss_align = 0, __ss_padding = '\000' <repeats 111 times>}, sockaddr_len = 16, 
    identifier = "192.168.11.1:933", '\000' <repeats 91 times>}, total_bytes_read = 134672272, total_bytes_write = 3678711, list = {next = 0x0, prev = 0x0}, bind_insecure = 0}

40 million outstanding frames at this point:
(gdb) print ((call_pool_t)this->ctx->pool)
$23 = {{all_frames = {next = 0x25f96c0, prev = 0x25f9620}, all_stacks = {next_call = 0x25f96c0, prev_call = 0x25f9620}}, cnt = 39705952, lock = 0, frame_mem_pool = 0x0, stack_mem_pool = 0x0}
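
(Back-of-the-envelope: ~39.7 million in-flight frames at even ~1KB of frame/stack/saved-request state apiece would account for tens of GB - which lines up with the 37.3G RES figure in the original report.)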

gluster/nfs gets into a state (very quickly) where it's spending all of its time spinning somewhere:

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command                                                                                                                                  
26240 root       20   0 4287M 2062M  2736 R 1687  1.6 21:53.95 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26656 root       20   0 4287M 2062M  2736 R 99.0  1.6  1:06.76 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26649 root       20   0 4287M 2062M  2736 R 99.0  1.6  1:07.32 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26684 root       20   0 4287M 2062M  2736 R 99.0  1.6  1:08.14 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26682 root       20   0 4287M 2062M  2736 R 99.0  1.6  1:10.35 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26678 root       20   0 4287M 2062M  2736 R 99.0  1.6  1:07.98 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26675 root       20   0 4287M 2062M  2736 R 99.0  1.6  1:07.45 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26679 root       20   0 4287M 2062M  2736 R 99.0  1.6  1:10.32 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26676 root       20   0 4287M 2062M  2736 R 99.0  1.6  1:10.89 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26681 root       20   0 4287M 2062M  2736 R 99.0  1.6  1:06.87 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26677 root       20   0 4287M 2062M  2736 R 99.0  1.6  1:06.87 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26672 root       20   0 4287M 2062M  2736 R 99.0  1.6  1:11.07 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26674 root       20   0 4287M 2062M  2736 R 99.0  1.6  1:07.54 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26680 root       20   0 4287M 2062M  2736 R 99.0  1.6  1:10.12 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26280 root       20   0 4287M 2062M  2736 R 99.0  1.6  1:10.37 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26683 root       20   0 4287M 2062M  2736 R 99.0  1.6  1:08.90 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
26652 root       20   0 4287M 2062M  2736 R 98.0  1.6  1:09.92 /usr/local/glusterfs/sbin/glusterfs -s localhost --volfile-id gluster/nfs 
 6580 michael    20   0  111M  3112  1232 R  3.0  0.0  6:56.91 htop
 1611 root       20   0     0     0     0 S  0.0  0.0  0:25.56 fio-wq/0 
 1647 root       20   0     0     0     0 S  0.0  0.0  0:21.33 fioa-scan
26228 root       20   0 1386M 57112  2172 S  0.0  0.0  0:09.38 /usr/local/glusterfs/sbin/glusterfsd -s localhost --volfile-id gv0.fearles
26267 root       20   0 1386M 57112  2172 S  0.0  0.0  0:01.46 /usr/local/glusterfs/sbin/glusterfsd -s localhost --volfile-id gv0.fearles
26646 root       20   0 1386M 57112  2172 S  0.0  0.0  0:00.59 /usr/local/glusterfs/sbin/glusterfsd -s localhost --volfile-id gv0.fearles
26671 root       20   0 1386M 57112  2172 S  0.0  0.0  0:00.57 /usr/local/glusterfs/sbin/glusterfsd -s localhost --volfile-id gv0.fearles
26685 root       20   0 1386M 57112  2172 S  0.0  0.0  0:00.46 /usr/local/glusterfs/sbin/glusterfsd -s localhost --volfile-id gv0.fearles
26686 root       20   0 1386M 57112  2172 S  0.0  0.0  0:00.25 /usr/local/glusterfs/sbin/glusterfsd -s localhost --volfile-id gv0.fearles

yet the disks are idle:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          18.41    0.00    0.26    0.00    0.00   81.33

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdf               0.00     0.00    0.00    0.67     0.00     6.17    18.50     0.00    2.00   2.00   0.13
sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda               0.00     0.00    0.00    0.33     0.00     1.33     8.00     0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00    0.67     0.00     6.17    18.50     0.00    2.00   2.00   0.13
sdi               0.00     0.00    0.00    0.33     0.00     1.33     8.00     0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
fioa              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

and volume profile says that operations aren't being responded to:
NFS Server : localhost
----------------------
Cumulative Stats:
   Block Size:                  2b+                   8b+                  16b+ 
 No. of Reads:                    0                     0                     0 
No. of Writes:                    2                     6                    18 
 
   Block Size:                 32b+                  64b+                 128b+ 
 No. of Reads:                    0                     0                     0 
No. of Writes:                   85                    30                    19 
 
   Block Size:                256b+                 512b+                1024b+ 
 No. of Reads:                    0                    23                    18 
No. of Writes:                   81                    28                    54 
 
   Block Size:               2048b+                4096b+                8192b+ 
 No. of Reads:                    6                    28                385676 
No. of Writes:                   21                    16                     7 
 
   Block Size:              16384b+               32768b+               65536b+ 
 No. of Reads:                  289                   135                 31300 
No. of Writes:                  235                     0                     0 
 
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      0.00       0.00 us       0.00 us       0.00 us              1      FORGET
      0.00       0.00 us       0.00 us       0.00 us             81     RELEASE
      0.00     671.00 us     671.00 us     671.00 us              1      UNLINK
      0.00     151.88 us     139.00 us     166.00 us              8        OPEN
      0.00    1622.00 us    1622.00 us    1622.00 us              1      RENAME
      0.00     411.88 us     345.00 us     502.00 us              8       FSTAT
      0.00    1515.50 us     549.00 us    3835.00 us              8    READDIRP
      0.00     488.23 us     155.00 us    4467.00 us             30          LK
      0.00     832.00 us     211.00 us    3206.00 us            119     SETATTR
      0.00     167.32 us     115.00 us    5326.00 us            677      STATFS
      0.00   11469.73 us     464.00 us  672289.00 us             74      CREATE
      0.00    3335.09 us      60.00 us 3806186.00 us           1200      ACCESS
      0.00  130144.47 us     235.00 us 26674232.00 us            256      LOOKUP
      0.00  223278.53 us       5.00 us 5554879.00 us            602       WRITE
      0.00  582000.45 us      51.00 us 75297085.00 us            278       FLUSH
      0.01  129560.57 us      65.00 us 70316547.00 us           3446        STAT
     99.98 8454338.20 us      99.00 us 115469515.00 us         417475        READ
 
    Duration: 6021 seconds
   Data Read: 5221816212 bytes
Data Written: 4168883 bytes

glusterfs really is taking a long time to respond:
rpc.xid == 0x8879128a
[2013-05-08 21:16:00.564817] T [rpcsvc.c:382:rpcsvc_request_create] 0-rpc-service: received rpc-message (XID: 0x8879128a, Ver: 2, Program: 100003, ProgVers: 3, Proc: 6) from rpc-transport (socket.nfs-server)
[2013-05-08 21:16:00.564860] D [nfs3-helpers.c:1757:nfs3_log_rw_call] 0-nfs-nfsv3: XID: 8879128a, READ: args: FH: exportid 52b830b4-f6b0-4103-b90b-ce665dfbafec, gfid 77c67b0a-57a1-4b56-a6dd-15899529facd, offset: 7456178176,  count: 8192
[2013-05-08 21:16:03.161708] T [rpcsvc.c:382:rpcsvc_request_create] 0-rpc-service: received rpc-message (XID: 0x8879128a, Ver: 2, Program: 100003, ProgVers: 3, Proc: 6) from rpc-transport (socket.nfs-server)
[2013-05-08 21:16:03.161784] D [nfs3-helpers.c:1757:nfs3_log_rw_call] 0-nfs-nfsv3: XID: 8879128a, READ: args: FH: exportid 52b830b4-f6b0-4103-b90b-ce665dfbafec, gfid 77c67b0a-57a1-4b56-a6dd-15899529facd, offset: 7456178176,  count: 8192
[2013-05-08 21:16:13.940617] T [rpcsvc.c:382:rpcsvc_request_create] 0-rpc-service: received rpc-message (XID: 0x8879128a, Ver: 2, Program: 100003, ProgVers: 3, Proc: 6) from rpc-transport (socket.nfs-server)
[2013-05-08 21:16:13.940662] D [nfs3-helpers.c:1757:nfs3_log_rw_call] 0-nfs-nfsv3: XID: 8879128a, READ: args: FH: exportid 52b830b4-f6b0-4103-b90b-ce665dfbafec, gfid 77c67b0a-57a1-4b56-a6dd-15899529facd, offset: 7456178176,  count: 8192
[2013-05-08 21:16:41.028161] T [rpcsvc.c:382:rpcsvc_request_create] 0-rpc-service: received rpc-message (XID: 0x8879128a, Ver: 2, Program: 100003, ProgVers: 3, Proc: 6) from rpc-transport (socket.nfs-server)
[2013-05-08 21:16:41.028205] D [nfs3-helpers.c:1757:nfs3_log_rw_call] 0-nfs-nfsv3: XID: 8879128a, READ: args: FH: exportid 52b830b4-f6b0-4103-b90b-ce665dfbafec, gfid 77c67b0a-57a1-4b56-a6dd-15899529facd, offset: 7456178176,  count: 8192
[2013-05-08 21:17:03.172976] D [nfs3-helpers.c:3421:nfs3_log_read_res] 0-nfs-nfsv3: XID: 8879128a, READ: NFS: 0(Call completed successfully.), POSIX: 0(Success), count: 8192, is_eof: 0, vector: count: 1, len: 8192
[2013-05-08 21:17:03.173042] T [rpcsvc.c:1087:rpcsvc_submit_generic] 0-rpc-service: submitted reply for rpc-message (XID: 0x8879128a, Program: NFS3, ProgVers: 3, Proc: 6) to rpc-transport (socket.nfs-server)
[2013-05-08 21:17:40.996580] D [nfs3-helpers.c:3421:nfs3_log_read_res] 0-nfs-nfsv3: XID: 8879128a, READ: NFS: 0(Call completed successfully.), POSIX: 0(Success), count: 8192, is_eof: 0, vector: count: 1, len: 8192
[2013-05-08 21:17:40.996628] T [rpcsvc.c:1087:rpcsvc_submit_generic] 0-rpc-service: submitted reply for rpc-message (XID: 0x8879128a, Program: NFS3, ProgVers: 3, Proc: 6) to rpc-transport (socket.nfs-server)
[2013-05-08 21:17:44.372915] T [rpcsvc.c:382:rpcsvc_request_create] 0-rpc-service: received rpc-message (XID: 0x8879128a, Ver: 2, Program: 100003, ProgVers: 3, Proc: 6) from rpc-transport (socket.nfs-server)
[2013-05-08 21:17:44.372956] D [nfs3-helpers.c:1757:nfs3_log_rw_call] 0-nfs-nfsv3: XID: 8879128a, READ: args: FH: exportid 52b830b4-f6b0-4103-b90b-ce665dfbafec, gfid 77c67b0a-57a1-4b56-a6dd-15899529facd, offset: 7456178176,  count: 8192
[2013-05-08 21:18:12.439583] T [rpcsvc.c:382:rpcsvc_request_create] 0-rpc-service: received rpc-message (XID: 0x8879128a, Ver: 2, Program: 100003, ProgVers: 3, Proc: 6) from rpc-transport (socket.nfs-server)
[2013-05-08 21:18:12.439628] D [nfs3-helpers.c:1757:nfs3_log_rw_call] 0-nfs-nfsv3: XID: 8879128a, READ: args: FH: exportid 52b830b4-f6b0-4103-b90b-ce665dfbafec, gfid 77c67b0a-57a1-4b56-a6dd-15899529facd, offset: 7456178176,  count: 8192
[2013-05-08 21:19:29.949901] D [nfs3-helpers.c:3421:nfs3_log_read_res] 0-nfs-nfsv3: XID: 8879128a, READ: NFS: 0(Call completed successfully.), POSIX: 0(Success), count: 8192, is_eof: 0, vector: count: 1, len: 8192
[2013-05-08 21:19:29.949953] T [rpcsvc.c:1087:rpcsvc_submit_generic] 0-rpc-service: submitted reply for rpc-message (XID: 0x8879128a, Program: NFS3, ProgVers: 3, Proc: 6) to rpc-transport (socket.nfs-server)
[2013-05-08 21:23:02.353017] D [nfs3-helpers.c:3421:nfs3_log_read_res] 0-nfs-nfsv3: XID: 8879128a, READ: NFS: 0(Call completed successfully.), POSIX: 0(Success), count: 8192, is_eof: 0, vector: count: 1, len: 8192
[2013-05-08 21:23:02.353066] T [rpcsvc.c:1087:rpcsvc_submit_generic] 0-rpc-service: submitted reply for rpc-message (XID: 0x8879128a, Program: NFS3, ProgVers: 3, Proc: 6) to rpc-transport (socket.nfs-server)
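
Note the same XID arriving over and over: those are the client's retransmissions of the same READ while the original reply is stuck. Each retransmission becomes yet another call frame to service, so the server falls further behind the longer it lags.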

Comment 7 Michael Brown 2013-05-09 18:48:57 UTC
I have a coredump, binaries, pcap and gluster trace available for download... I don't mind passing it around to the gluster devs, but I don't want it public.

Comment 8 Michael Brown 2013-05-14 14:40:26 UTC
This situation also gets files on disk into a bad state. It looks like some writes made it *ONLY TO ONE REPLICA*.

I have many files reported as needing healing, but gluster doesn't detect them as split-brained. Looking at the files on disk:

/fleming4/db0/ALTUS_config/diagnostics/diag/rdbms/altus/ALTUS/trace/ALTUS_ora_17473.trc

FILE=/export/bricks/5*/glusterdata/fleming4/db0/ALTUS_config/diagnostics/diag/rdbms/altus/ALTUS/trace/ALTUS_ora_17473.trc

[root@fearless1 michael]# ls -al $FILE
-rw-r-----. 2 500 1000 8409 May  6 12:02 /export/bricks/500117310007a858/glusterdata/fleming4/db0/ALTUS_config/diagnostics/diag/rdbms/altus/ALTUS/trace/ALTUS_ora_17473.trc

[root@fearless2 michael]# ls -al $FILE
-rw-r-----. 2 500 1000 8409 May  6 12:02 /export/bricks/500117310007a8f8/glusterdata/fleming4/db0/ALTUS_config/diagnostics/diag/rdbms/altus/ALTUS/trace/ALTUS_ora_17473.trc

[root@fearless1 michael]# stat $FILE
  File: `/export/bricks/500117310007a858/glusterdata/fleming4/db0/ALTUS_config/diagnostics/diag/rdbms/altus/ALTUS/trace/ALTUS_ora_17473.trc'
  Size: 8409            Blocks: 32         IO Block: 4096   regular file
Device: 801h/2049d      Inode: 6543        Links: 2
Access: (0640/-rw-r-----)  Uid: (  500/ UNKNOWN)   Gid: ( 1000/ UNKNOWN)
Access: 2013-05-14 10:16:59.213714324 -0400
Modify: 2013-05-06 12:02:39.010328111 -0400
Change: 2013-05-09 13:22:13.544569304 -0400

[root@fearless2 michael]# stat $FILE
  File: `/export/bricks/500117310007a8f8/glusterdata/fleming4/db0/ALTUS_config/diagnostics/diag/rdbms/altus/ALTUS/trace/ALTUS_ora_17473.trc'
  Size: 8409            Blocks: 32         IO Block: 4096   regular file
Device: 861h/2145d      Inode: 6607        Links: 2
Access: (0640/-rw-r-----)  Uid: (  500/ UNKNOWN)   Gid: ( 1000/ UNKNOWN)
Access: 2013-05-14 10:16:59.216179634 -0400
Modify: 2013-05-06 12:02:39.014695431 -0400
Change: 2013-05-09 13:22:13.160553133 -0400

[root@fearless1 michael]# getfattr -m - -d -e hex $FILE
# file: export/bricks/500117310007a858/glusterdata/fleming4/db0/ALTUS_config/diagnostics/diag/rdbms/altus/ALTUS/trace/ALTUS_ora_17473.trc
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.afr.gv0-client-14=0x000000000000000000000000
trusted.afr.gv0-client-15=0x000000000000000000000000
trusted.gfid=0x0ec8adb2caa447e59b6e4b845bc414a2

[root@fearless2 michael]# getfattr -m - -d -e hex $FILE
# file: export/bricks/500117310007a8f8/glusterdata/fleming4/db0/ALTUS_config/diagnostics/diag/rdbms/altus/ALTUS/trace/ALTUS_ora_17473.trc
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.afr.gv0-client-14=0x000000000000000000000000
trusted.afr.gv0-client-15=0x000000000000000000000000
trusted.gfid=0x0ec8adb2caa447e59b6e4b845bc414a2

[root@fearless1 michael]# md5sum $FILE
b100f12f152f51a3b7594b4b1f465b77  /export/bricks/500117310007a858/glusterdata/fleming4/db0/ALTUS_config/diagnostics/diag/rdbms/altus/ALTUS/trace/ALTUS_ora_17473.trc

[root@fearless2 michael]# md5sum $FILE
ab8eeccb15a9b26faeb44fba73903d00  /export/bricks/500117310007a8f8/glusterdata/fleming4/db0/ALTUS_config/diagnostics/diag/rdbms/altus/ALTUS/trace/ALTUS_ora_17473.trc

Ah! There's the difference.

fearless1:
…
*** 2013-05-05 02:08:38.152
Recovery of Online Redo Log: Thread 1 Group 15 Seq 804 Reading mem 0

*** 2013-05-05 02:09:00.409
Recovery of Online Redo Log: Thread 1 Group 16 Seq 805 Reading mem 0

*** 2013-05-05 02:09:20.836
[1367734160836420] kgnfs_check_for_timedout_requests:5463: assert ch->pops_kgnfschnl at kgnfs.c:kgnfs_check_for_timedout_requests:5463
 [1367735871359808] kgnfs_check_for_timedout_requests:5463: assert ch->pops_kgnfschnl at kgnfs.c:kgnfs_check_for_timedout_requests:5463
…

fearless2:
…
*** 2013-05-05 02:08:38.152
Recovery of Online Redo Log: Thread 1 Group 15 Seq 804 Reading mem 0
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ [1367735871359808] kgnfs_check_for_timedout_requests:5463: assert ch->pops_kgnfschnl at kgnfs.c:kgnfs_check_for_timedout_requests:5463
…

With another file (/fleming1/db0/ALTUS_config/diagnostics/diag/rdbms/altus/ALTUS/trace/ALTUS_p020_6803.trm), however, all the tests above show the files as identical. Yet according to 'volume heal gv0 info' it still requires healing and a 'volume heal gv0 full' doesn't process it.
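
For anyone decoding those trusted.afr values above: as I understand the AFR changelog convention, the 12-byte value holds three big-endian 32-bit counters - pending data, metadata, and entry operations against the partner brick. All-zero counters on both bricks tell self-heal "nothing pending", which would explain why these files are neither healed nor flagged as split-brain even though their contents differ. A quick standalone sketch (my illustration, layout assumed as described, not GlusterFS source):

#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct afr_pending {
        uint32_t data;          /* pending data (write) ops    */
        uint32_t metadata;      /* pending metadata ops        */
        uint32_t entry;         /* pending directory-entry ops */
};

static struct afr_pending decode_afr_xattr(const unsigned char v[12])
{
        uint32_t w[3];
        struct afr_pending p;

        memcpy(w, v, sizeof(w));
        p.data     = ntohl(w[0]);    /* counters are stored big-endian */
        p.metadata = ntohl(w[1]);
        p.entry    = ntohl(w[2]);
        return p;
}

int main(void)
{
        /* the value shown by getfattr above: 0x000000000000000000000000 */
        unsigned char raw[12] = { 0 };
        struct afr_pending p = decode_afr_xattr(raw);

        printf("data=%u metadata=%u entry=%u\n", p.data, p.metadata, p.entry);
        return 0;
}

If writes really landed on only one replica without the changelog ever being incremented, that would match the symptom exactly: differing md5sums under identical (clean) AFR metadata.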

Comment 9 David Bomba 2013-05-20 10:26:44 UTC
We are experiencing the exact same issue. Gluster under high load (we use it to store Virtual Machine Images) locks up.


nfs.log reports the following error:

[2013-05-20 19:53:04.491580] E [nfs3.c:627:nfs3svc_submit_vector_reply] 0-nfs-nfsv3: Reply submission failed
[2013-05-20 19:54:53.004965] I [socket.c:2378:socket_submit_reply] 0-socket.nfs-server: not connected (priv->connected = 255)
[2013-05-20 19:54:53.005037] E [rpcsvc.c:1080:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x3192027547x, Program: NFS3, ProgVers: 3, Proc: 6) to rpc-transport (socket.nfs-server)

The machine is a dual-core Xeon X5560 with 96GB RAM and works fine under normal loads, but once CPU hits 90%+ we see the issue pop up.

Interestingly, we note that other gluster volumes continue to function normally whilst the high-load volume simply stalls.

Comment 10 Niels de Vos 2014-11-27 14:54:21 UTC
The version that this bug has been reported against does not get any updates from the Gluster Community anymore. Please verify whether this report is still valid against a current (3.4, 3.5 or 3.6) release and update the version, or close this bug.

If there has been no update before 9 December 2014, this bug will get automatically closed.

