Bug 761829 (GLUSTER-97) - 5 servers, 1 crashes intermittently at server-protocol.c:877
Summary: 5 servers, 1 crashes intermittently at server-protocol.c:877
Keywords:
Status: CLOSED WONTFIX
Alias: GLUSTER-97
Product: GlusterFS
Classification: Community
Component: protocol
Version: pre-2.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Vijay Bellur
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2009-06-25 19:11 UTC by Yang
Modified: 2010-01-29 13:47 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: RTNR
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:



Description Yang 2009-06-25 16:20:36 UTC
(gdb) backtrace
#0  server_mknod_cbk (frame=0x2aaab37eab20, cookie=<value optimized out>, this=<value optimized out>, op_ret=0, 
    op_errno=<value optimized out>, inode=0x2aaab380bfe0, stbuf=0x7fff7e40d170) at server-protocol.c:877
#1  0x00002aaaaaab507a in posix_mknod (frame=0x2aaab37eb660, this=0x60b380, loc=0x7fff7e40d2f0, mode=32896, dev=0)
    at posix.c:499
#2  0x00002aaaaaecbd34 in server_mknod (frame=<value optimized out>, bound_xl=0x60b380, params=<value optimized out>)
    at server-protocol.c:4367
#3  0x00002aaaaaec3f67 in server_protocol_interpret (trans=0x2aaab00bfb40, blk=0x2aaab37d9d50) at server-protocol.c:6074
#4  0x00002aaaaaecf5dc in notify (this=<value optimized out>, event=<value optimized out>, data=0x2aaab00bfb40)
    at server-protocol.c:6444
#5  0x00002ad72c8c8bfb in sys_epoll_iteration (ctx=<value optimized out>) at epoll.c:64
#6  0x00002ad72c8c7fb9 in poll_iteration (ctx=0x2aaab37ebec8) at transport.c:312
#7  0x000000000040291c in main (argc=7, argv=0x7fff7e40d628) at glusterfs.c:565

Comment 1 Yang 2009-06-25 19:11:36 UTC
Code version:  glusterfs-1.3.12, fuse-2.7.4glfs11
Latest code update:  2009-06-09
Code location:  feed6:/data1/nfs_share/, with local copies of libraries
Usage:  As a shared, pooled-resource drive.  Machines feed1~feed5 act as servers.  Machines feed0~feed13 act as clients (including feed1~feed5).  Processes on clients read and write simultaneously to the same directory, although not to the same files.  One process monitors a specific file via tail -f.
Crashing machine:  feed1
Recent crashing dates:  2009-05-02, 2009-06-03, 2009-06-08, 2009-06-25 09:42

glusterfs-client.vol
-------------------------------
volume feed1
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.1.90.201
  option remote-subvolume brick
end-volume

volume feed2
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.1.90.202
  option remote-subvolume brick
end-volume

volume feed3
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.1.90.203
  option remote-subvolume brick
end-volume

volume feed4
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.1.90.204
  option remote-subvolume brick
end-volume

volume feed5
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.1.90.205
  option remote-subvolume brick
end-volume

volume client-ns
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.1.90.201
  option remote-subvolume brick-ns
end-volume

volume bricks
  type cluster/unify
  option namespace client-ns
  subvolumes feed1 feed2 feed3 feed4 feed5
  option scheduler alu
  option alu.limits.min-free-disk 5%
  option alu.limits.max-open-files 100000
  option alu.order disk-usage:read-usage:write-usage:open-files-usage:disk-speed-usage
  option alu.disk-usage.entry-threshold 2GB
  option alu.disk-usage.exit-threshold  60MB
  option alu.open-files-usage.entry-threshold 1024
  option alu.open-files-usage.exit-threshold 32
  option alu.stat-refresh.interval 10sec
end-volume
----------------------------------------------------------------------------

glusterfs-server.vol
-----------------------------
volume brick
  type storage/posix
  option directory /data1/glusterfs
end-volume

volume brick-ns
  type storage/posix
  option directory /data1/glusterfs-ns
end-volume

volume locks
  type features/posix-locks  # posix-locks
  subvolumes brick
end-volume

volume server
  type protocol/server
  option transport-type tcp/server
  subvolumes brick brick-ns
  option auth.ip.brick.allow 10.1.90.*
  option auth.ip.brick-ns.allow 10.1.90.*
end-volume
----------------------------------------------------------------------------

Sample gluster.client.log error messages on feed1
--------------------------------
2009-06-10 15:47:22 E [unify.c:882:unify_open] bricks: /persistent_data/piddata-23480.bin: entry_count is 3
2009-06-10 15:47:22 E [unify.c:885:unify_open] bricks: /persistent_data/piddata-23480.bin: found on feed2
2009-06-10 15:47:22 E [unify.c:885:unify_open] bricks: /persistent_data/piddata-23480.bin: found on feed4
2009-06-10 15:47:22 E [unify.c:885:unify_open] bricks: /persistent_data/piddata-23480.bin: found on client-ns
...
2009-06-10 23:06:21 E [unify.c:182:unify_lookup_cbk] bricks: feed1 returned 2
2009-06-10 23:06:21 E [unify.c:265:unify_lookup_cbk] bricks: Revalidate failed for /persistent_data/piddata-313.bin
2009-06-10 23:06:21 E [fuse-bridge.c:468:fuse_entry_cbk] glusterfs-fuse: 95612: (34) /persistent_data/piddata-313.bin => -1 (2)
...
2009-06-12 01:49:06 W [client-protocol.c:205:call_bail] client-ns: activating bail-out. pending frames = 1. last sent = 2009-06-12 01:48:16. last received = 2009-06-12 01:48:16 transport-timeout = 42
2009-06-12 01:49:06 C [client-protocol.c:212:call_bail] client-ns: bailing transport
2009-06-12 01:49:06 W [client-protocol.c:4784:client_protocol_cleanup] client-ns: cleaning up state in transport object 0x61bf00
2009-06-12 01:49:06 E [client-protocol.c:4834:client_protocol_cleanup] client-ns: forced unwinding frame type(1) op(40) reply=@0x61e6c0
2009-06-12 01:49:06 E [client-protocol.c:4613:client_checksum_cbk] client-ns: no proper reply from server, returning ENOTCONN
2009-06-12 01:49:06 E [client-protocol.c:325:client_protocol_xfer] client-ns: transport_submit failed
...
2009-06-12 01:49:57 W [client-protocol.c:205:call_bail] client-ns: activating bail-out. pending frames = 1. last sent = 2009-06-12 01:49:06. last received = 1969-12-31 16:00:00 transport-timeout = 42
2009-06-12 01:49:57 C [client-protocol.c:212:call_bail] client-ns: bailing transport
2009-06-12 01:49:57 W [client-protocol.c:4784:client_protocol_cleanup] client-ns: cleaning up state in transport object 0x61bf00
2009-06-12 01:49:57 E [client-protocol.c:4834:client_protocol_cleanup] client-ns: forced unwinding frame type(1) op(22) reply=@0x61e6c0
2009-06-12 01:49:57 E [client-protocol.c:3767:client_opendir_cbk] client-ns: no proper reply from server, returning ENOTCONN
2009-06-12 01:49:57 W [client-protocol.c:1711:client_closedir] client-ns: no proper fd found, returning
2009-06-12 01:49:57 W [client-protocol.c:280:client_protocol_xfer] client-ns: attempting to pipeline request type(1) op(34) with handshake
...
2009-06-14 05:14:17 W [client-protocol.c:205:call_bail] feed4: activating bail-out. pending frames = 1. last sent = 2009-06-14 05:13:31. last received = 2009-06-14 05:05:37 transport-timeout = 42
2009-06-14 05:14:17 C [client-protocol.c:212:call_bail] feed4: bailing transport
2009-06-14 05:14:17 W [client-protocol.c:4784:client_protocol_cleanup] feed4: cleaning up state in transport object 0x616760
2009-06-14 05:14:17 E [client-protocol.c:4834:client_protocol_cleanup] feed4: forced unwinding frame type(1) op(15) reply=@0x2aaaac005e70
2009-06-14 05:14:17 E [client-protocol.c:3891:client_statfs_cbk] feed4: no proper reply from server, returning ENOTCONN
2009-06-14 05:14:17 E [unify.c:1402:unify_statfs_cbk] bricks: operation failed on feed4  (107)
2009-06-14 05:14:17 W [client-protocol.c:280:client_protocol_xfer] feed4: attempting to pipeline request type(1) op(15) with handshake
...
2009-06-14 05:17:29 W [client-protocol.c:205:call_bail] feed1: activating bail-out. pending frames = 1. last sent = 2009-06-14 05:16:37. last received = 2009-06-14 05:14:17 transport-timeout = 42
2009-06-14 05:17:29 C [client-protocol.c:212:call_bail] feed1: bailing transport
2009-06-14 05:17:29 W [client-protocol.c:4784:client_protocol_cleanup] feed1: cleaning up state in transport object 0x60f4e0
2009-06-14 05:17:29 E [client-protocol.c:4834:client_protocol_cleanup] feed1: forced unwinding frame type(1) op(15) reply=@0x2aaaac005b60
2009-06-14 05:17:29 E [client-protocol.c:3891:client_statfs_cbk] feed1: no proper reply from server, returning ENOTCONN
2009-06-14 05:17:29 E [unify.c:1402:unify_statfs_cbk] bricks: operation failed on feed1  (107)
2009-06-14 05:17:29 W [client-protocol.c:205:call_bail] feed3: activating bail-out. pending frames = 1. last sent = 2009-06-14 05:16:37. last received = 2009-06-14 05:14:20 transport-timeout = 42
2009-06-14 05:17:29 C [client-protocol.c:212:call_bail] feed3: bailing transport
2009-06-14 05:17:29 W [client-protocol.c:4784:client_protocol_cleanup] feed3: cleaning up state in transport object 0x614310
2009-06-14 05:17:29 W [client-protocol.c:205:call_bail] feed2: activating bail-out. pending frames = 1. last sent = 2009-06-14 05:16:37. last received = 2009-06-14 05:14:17 transport-timeout = 42
2009-06-14 05:17:29 C [client-protocol.c:212:call_bail] feed2: bailing transport
2009-06-14 05:17:29 E [client-protocol.c:4834:client_protocol_cleanup] feed3: forced unwinding frame type(1) op(15) reply=@0x2aaaac005eb0
2009-06-14 05:17:29 E [client-protocol.c:3891:client_statfs_cbk] feed3: no proper reply from server, returning ENOTCONN
2009-06-14 05:17:29 E [unify.c:1402:unify_statfs_cbk] bricks: operation failed on feed3  (107)
2009-06-14 05:17:29 W [client-protocol.c:4784:client_protocol_cleanup] feed2: cleaning up state in transport object 0x611f00
2009-06-14 05:17:29 E [client-protocol.c:4834:client_protocol_cleanup] feed2: forced unwinding frame type(1) op(15) reply=@0x2aaaac005cd0
2009-06-14 05:17:29 E [client-protocol.c:3891:client_statfs_cbk] feed2: no proper reply from server, returning ENOTCONN
2009-06-14 05:17:29 E [unify.c:1402:unify_statfs_cbk] bricks: operation failed on feed2  (107)
2009-06-14 05:17:29 W [client-protocol.c:332:client_protocol_xfer] feed1: not connected at the moment to submit frame type(1) op(15)
2009-06-14 05:17:29 E [client-protocol.c:3891:client_statfs_cbk] feed1: no proper reply from server, returning ENOTCONN
2009-06-14 05:17:29 E [unify.c:1402:unify_statfs_cbk] bricks: operation failed on feed1  (107)
2009-06-14 05:17:29 W [client-protocol.c:280:client_protocol_xfer] feed2: attempting to pipeline request type(1) op(15) with handshake
2009-06-14 05:17:29 W [client-protocol.c:280:client_protocol_xfer] feed3: attempting to pipeline request type(1) op(15) with handshake
2009-06-14 05:17:29 W [client-protocol.c:280:client_protocol_xfer] feed4: attempting to pipeline request type(1) op(15) with handshake
...
2009-06-16 16:17:34 E [unify.c:325:unify_lookup] bricks: returning ESTALE for / [translator generation (10) inode generation (6)]
2009-06-16 16:17:34 E [fuse-bridge.c:468:fuse_entry_cbk] glusterfs-fuse: 2016545: (34) / => -1 (116)
...
2009-06-17 12:01:25 W [client-protocol.c:205:call_bail] client-ns: activating bail-out. pending frames = 2. last sent = 2009-06-17 12:00:40. last received = 2009-06-17 12:00:40 transport-timeout = 42
2009-06-17 12:01:25 C [client-protocol.c:212:call_bail] client-ns: bailing transport
2009-06-17 12:01:25 W [client-protocol.c:4784:client_protocol_cleanup] client-ns: cleaning up state in transport object 0x61bf00
2009-06-17 12:01:25 E [client-protocol.c:4834:client_protocol_cleanup] client-ns: forced unwinding frame type(1) op(12) reply=@0x2aaaac000c30
2009-06-17 12:01:25 E [client-protocol.c:2627:client_open_cbk] client-ns: no proper reply from server, returning ENOTCONN
2009-06-17 12:01:25 E [unify.c:820:unify_open_cbk] bricks: Open success on child node, failed on namespace
2009-06-17 12:01:25 E [client-protocol.c:4834:client_protocol_cleanup] client-ns: forced unwinding frame type(1) op(40) reply=@0x2aaaac000c30
2009-06-17 12:01:25 E [client-protocol.c:4613:client_checksum_cbk] client-ns: no proper reply from server, returning ENOTCONN
2009-06-17 12:01:25 E [client-protocol.c:325:client_protocol_xfer] client-ns: transport_submit failed
2009-06-17 12:01:25 E [fuse-bridge.c:715:fuse_fd_cbk] glusterfs-fuse: 3345478: (12) /feed_processing/11460/11460.218.feed => -1 (107)
...
2009-06-21 14:37:40 W [client-protocol.c:205:call_bail] feed1: activating bail-out. pending frames = 2. last sent = 2009-06-21 14:36:53. last received = 2009-06-21 14:36:49 transport-timeout = 42
2009-06-21 14:37:40 C [client-protocol.c:212:call_bail] feed1: bailing transport
2009-06-21 14:37:40 W [client-protocol.c:4784:client_protocol_cleanup] feed1: cleaning up state in transport object 0x60f4e0
2009-06-21 14:37:40 E [client-protocol.c:4834:client_protocol_cleanup] feed1: forced unwinding frame type(1) op(34) reply=@0x2aaab005eea0
2009-06-21 14:37:40 E [client-protocol.c:4430:client_lookup_cbk] feed1: no proper reply from server, returning ENOTCONN
2009-06-21 14:37:40 E [unify.c:182:unify_lookup_cbk] bricks: feed1 returned 107
2009-06-21 14:37:40 E [client-protocol.c:4834:client_protocol_cleanup] feed1: forced unwinding frame type(1) op(34) reply=@0x2aaab005eea0
2009-06-21 14:37:40 E [client-protocol.c:4430:client_lookup_cbk] feed1: no proper reply from server, returning ENOTCONN
2009-06-21 14:37:40 E [unify.c:182:unify_lookup_cbk] bricks: feed1 returned 107
...
2009-06-25 04:44:14 W [fuse-bridge.c:411:fuse_entry_cbk] glusterfs-fuse: 4270680: (34) /feed_processing/12879/12879.206.feed => 1074574335 Rehashing because st_nlink less than dentry maps
...
2009-06-25 09:42:00 E [protocol.c:271:gf_block_unserialize_transport] feed1: EOF from peer (10.1.90.201:6996)
2009-06-25 09:42:00 W [client-protocol.c:4784:client_protocol_cleanup] feed1: cleaning up state in transport object 0x60f4e0
2009-06-25 09:42:00 E [protocol.c:271:gf_block_unserialize_transport] client-ns: EOF from peer (10.1.90.201:6996)
2009-06-25 09:42:00 W [client-protocol.c:4784:client_protocol_cleanup] client-ns: cleaning up state in transport object 0x61bf00
2009-06-25 09:42:00 E [tcp-client.c:190:tcp_connect] client-ns: non-blocking connect() returned: 111 (Connection refused)
-----------------------------------------------------------------------------

Sample gluster.server.log error messages on feed1
--------------------------------
2009-06-11 13:40:06 W [posix.c:486:posix_mknod] brick-ns: mknod on /condor/var/PS/query/query_data.lock: File exists
...
2009-06-11 20:37:49 W [posix.c:486:posix_mknod] brick-ns: mknod on /condor/var/PS/query/query_par.lock: File exists
...
2009-06-12 10:17:33 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.1.90.200:1016)
...
2009-06-13 04:15:00 W [posix.c:486:posix_mknod] brick-ns: mknod on /condor/var/PS/item/item_list_27.lock: File exists
...
2009-06-14 01:51:25 E [server-protocol.c:186:generic_reply] server: transport_writev failed
...
2009-06-14 01:51:25 E [tcp.c:124:tcp_except] server: shutdown () - error: Transport endpoint is not connected
...
2009-06-14 05:28:00 W [posix.c:486:posix_mknod] brick-ns: mknod on /condor/var/PS/token/token_data.lock: File exists
...
TLA Repo Revision: glusterfs--mainline--2.5--patch-797
Time : 2009-06-25 09:41:58
Signal Number : 11

/usr/local/sbin/glusterfsd -f /etc/glusterfs/glusterfs-server.vol -l /var/log/glusterfs/gluster.server.log -L WARNING --pidfile /var/log/glusterfs/pid
volume server
  type protocol/server
  option auth.ip.brick-ns.allow 10.1.90.*
  option auth.ip.brick.allow 10.1.90.*
  option transport-type tcp/server
  subvolumes brick brick-ns
end-volume

volume locks
  type features/posix-locks
  subvolumes brick
end-volume

volume brick-ns
  type storage/posix
  option directory /data1/glusterfs-ns
end-volume

volume brick
  type storage/posix
  option directory /data1/glusterfs
end-volume

frame : type(1) op(2)

/lib64/libc.so.6[0x2ad72cf535b0]
/mnt/nfs_share/feed6/apps/install/glusterfs-1.3.12/lib/glusterfs/1.3.12/xlator/protocol/server.so(server_mknod_cbk+0xf0)[0x2aaaaaecbb70]
/mnt/nfs_share/feed6/apps/install/glusterfs-1.3.12/lib/glusterfs/1.3.12/xlator/storage/posix.so(posix_mknod+0x12a)[0x2aaaaaab507a]
/mnt/nfs_share/feed6/apps/install/glusterfs-1.3.12/lib/glusterfs/1.3.12/xlator/protocol/server.so(server_mknod+0x164)[0x2aaaaaecbd34]
/mnt/nfs_share/feed6/apps/install/glusterfs-1.3.12/lib/glusterfs/1.3.12/xlator/protocol/server.so(server_protocol_interpret+0x2e7)[0x2aaaaaec3f67]
/mnt/nfs_share/feed6/apps/install/glusterfs-1.3.12/lib/glusterfs/1.3.12/xlator/protocol/server.so(notify+0x9c)[0x2aaaaaecf5dc]
/mnt/nfs_share/feed6/apps/install/glusterfs-1.3.12/lib/libglusterfs.so.0(sys_epoll_iteration+0xbb)[0x2ad72c8c8bfb]
/mnt/nfs_share/feed6/apps/install/glusterfs-1.3.12/lib/libglusterfs.so.0(poll_iteration+0x79)[0x2ad72c8c7fb9]
[glusterfs](main+0x67c)[0x40291c]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x2ad72cf40ae4]
[glusterfs][0x401de9]
-----------------------------------------------------------------------------

Sample core data:
--------------------------------
feed1:/ # gdb /usr/local/sbin/glusterfsd core.12030
GNU gdb 6.5
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-suse-linux"...Using host libthread_db library "/lib64/libthread_db.so.1".
...
Core was generated by `[glusterfs]                                                                   '.
Program terminated with signal 11, Segmentation fault.
#0  server_mknod_cbk (frame=0x2aaab37eab20, cookie=<value optimized out>, this=<value optimized out>, op_ret=0, 
    op_errno=<value optimized out>, inode=0x2aaab380bfe0, stbuf=0x7fff7e40d170) at server-protocol.c:877
877         dict_set (reply, "INODE", data_from_uint64 (inode->ino));
(gdb) p *inode
Cannot access memory at address 0x2aaab380bfe0
(gdb) p *reply
$3 = {is_static = 0 '\0', is_locked = 1 '\001', hash_size = 1, count = 3, refcount = 0, members = 0x2aaab2b71ab0, 
  members_list = 0x2aaab2e9b750, extra_free = 0x0, lock = 1}
-----------------------------------------------------------------------------

Comment 2 Amar Tumballi 2009-11-26 00:57:54 UTC
Yang, 
 We are no longer supporting the 1.3.12 release. Please try 2.0.8+, or the 3.0.0 release when it is done.

We also no longer support the 'cluster/unify' translator. Please use 'cluster/distribute' instead; that should fix all these issues.

Regards,
Amar
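
For reference, a minimal sketch (not taken from this bug) of how the aggregated client volume above could look with 'cluster/distribute' on a 2.0.x or later release, assuming the same five protocol/client subvolumes:

volume bricks
  type cluster/distribute
  subvolumes feed1 feed2 feed3 feed4 feed5
end-volume

Unlike unify, distribute uses no separate namespace volume and no ALU scheduler options, so the client-ns volume referenced throughout the logs above would not be needed.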

