| Summary: | 5 servers, 1 crashes intermittently at server-protocol.c:877 | ||
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Yang <yang.wang> |
| Component: | protocol | Assignee: | Vijay Bellur <vbellur> |
| Status: | CLOSED WONTFIX | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | pre-2.0 | CC: | amarts, gluster-bugs, rabhat |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | Type: | --- | |
| Regression: | RTNR | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Code version: glusterfs-1.3.12, fuse-2.7.4glfs11
Latest code update: 2009-06-09
Code location: feed6:/data1/nfs_share/, with local copies of libraries
Usage: As shared, pooled-resource drive. Machines feed1~feed5 act as servers. Machines feed0~feed13 act as clients (including feed1~feed5). Processes on clients read and write simultaneously to the same directory, although not same files. One process monitors a specific file via tail -f.
Crashing machine: feed1
Recent crashing dates: 2009-05-02, 2009-06-03, 2009-06-08, 2009-06-25 09:42
glusterfs-client.vol
-------------------------------
volume feed1
type protocol/client
option transport-type tcp/client
option remote-host 10.1.90.201
option remote-subvolume brick
end-volume
volume feed2
type protocol/client
option transport-type tcp/client
option remote-host 10.1.90.202
option remote-subvolume brick
end-volume
volume feed3
type protocol/client
option transport-type tcp/client
option remote-host 10.1.90.203
option remote-subvolume brick
end-volume
volume feed4
type protocol/client
option transport-type tcp/client
option remote-host 10.1.90.204
option remote-subvolume brick
end-volume
volume feed5
type protocol/client
option transport-type tcp/client
option remote-host 10.1.90.205
option remote-subvolume brick
end-volume
volume client-ns
type protocol/client
option transport-type tcp/client
option remote-host 10.1.90.201
option remote-subvolume brick-ns
end-volume
volume bricks
type cluster/unify
option namespace client-ns
subvolumes feed1 feed2 feed3 feed4 feed5
option scheduler alu
option alu.limits.min-free-disk 5%
option alu.limits.max-open-files 100000
option alu.order disk-usage:read-usage:write-usage:open-files-usage:disk-speed-usage
option alu.disk-usage.entry-threshold 2GB
option alu.disk-usage.exit-threshold 60MB
option alu.open-files-usage.entry-threshold 1024
option alu.open-files-usage.exit-threshold 32
option alu.stat-refresh.interval 10sec
end-volume
----------------------------------------------------------------------------
glusterfs-server.vol
-----------------------------
volume brick
type storage/posix
option directory /data1/glusterfs
end-volume
volume brick-ns
type storage/posix
option directory /data1/glusterfs-ns
end-volume
volume locks
type features/posix-locks # posix-locks
subvolumes brick
end-volume
volume server
type protocol/server
option transport-type tcp/server
subvolumes brick brick-ns
option auth.ip.brick.allow 10.1.90.*
option auth.ip.brick-ns.allow 10.1.90.*
end-volume
----------------------------------------------------------------------------
Sample gluster.client.log error messages on feed1
--------------------------------
2009-06-10 15:47:22 E [unify.c:882:unify_open] bricks: /persistent_data/piddata-23480.bin: entry_count is 3
2009-06-10 15:47:22 E [unify.c:885:unify_open] bricks: /persistent_data/piddata-23480.bin: found on feed2
2009-06-10 15:47:22 E [unify.c:885:unify_open] bricks: /persistent_data/piddata-23480.bin: found on feed4
2009-06-10 15:47:22 E [unify.c:885:unify_open] bricks: /persistent_data/piddata-23480.bin: found on client-ns
...
2009-06-10 23:06:21 E [unify.c:182:unify_lookup_cbk] bricks: feed1 returned 2
2009-06-10 23:06:21 E [unify.c:265:unify_lookup_cbk] bricks: Revalidate failed for /persistent_data/piddata-313.bin
2009-06-10 23:06:21 E [fuse-bridge.c:468:fuse_entry_cbk] glusterfs-fuse: 95612: (34) /persistent_data/piddata-313.bin => -1 (2)
...
2009-06-12 01:49:06 W [client-protocol.c:205:call_bail] client-ns: activating bail-out. pending frames = 1. last sent = 2009-06-12 01:48:16. last received = 2009-06-12 01:48:16 transport-timeout = 42
2009-06-12 01:49:06 C [client-protocol.c:212:call_bail] client-ns: bailing transport
2009-06-12 01:49:06 W [client-protocol.c:4784:client_protocol_cleanup] client-ns: cleaning up state in transport object 0x61bf00
2009-06-12 01:49:06 E [client-protocol.c:4834:client_protocol_cleanup] client-ns: forced unwinding frame type(1) op(40) reply=@0x61e6c0
2009-06-12 01:49:06 E [client-protocol.c:4613:client_checksum_cbk] client-ns: no proper reply from server, returning ENOTCONN
2009-06-12 01:49:06 E [client-protocol.c:325:client_protocol_xfer] client-ns: transport_submit failed
...
2009-06-12 01:49:57 W [client-protocol.c:205:call_bail] client-ns: activating bail-out. pending frames = 1. last sent = 2009-06-12 01:49:06. last received = 1969-12-31 16:00:00 transport-timeout = 42
2009-06-12 01:49:57 C [client-protocol.c:212:call_bail] client-ns: bailing transport
2009-06-12 01:49:57 W [client-protocol.c:4784:client_protocol_cleanup] client-ns: cleaning up state in transport object 0x61bf00
2009-06-12 01:49:57 E [client-protocol.c:4834:client_protocol_cleanup] client-ns: forced unwinding frame type(1) op(22) reply=@0x61e6c0
2009-06-12 01:49:57 E [client-protocol.c:3767:client_opendir_cbk] client-ns: no proper reply from server, returning ENOTCONN
2009-06-12 01:49:57 W [client-protocol.c:1711:client_closedir] client-ns: no proper fd found, returning
2009-06-12 01:49:57 W [client-protocol.c:280:client_protocol_xfer] client-ns: attempting to pipeline request type(1) op(34) with handshake
...
2009-06-14 05:14:17 W [client-protocol.c:205:call_bail] feed4: activating bail-out. pending frames = 1. last sent = 2009-06-14 05:13:31. last received = 2009-06-14 05:05:37 transport-timeout = 42
2009-06-14 05:14:17 C [client-protocol.c:212:call_bail] feed4: bailing transport
2009-06-14 05:14:17 W [client-protocol.c:4784:client_protocol_cleanup] feed4: cleaning up state in transport object 0x616760
2009-06-14 05:14:17 E [client-protocol.c:4834:client_protocol_cleanup] feed4: forced unwinding frame type(1) op(15) reply=@0x2aaaac005e70
2009-06-14 05:14:17 E [client-protocol.c:3891:client_statfs_cbk] feed4: no proper reply from server, returning ENOTCONN
2009-06-14 05:14:17 E [unify.c:1402:unify_statfs_cbk] bricks: operation failed on feed4 (107)
2009-06-14 05:14:17 W [client-protocol.c:280:client_protocol_xfer] feed4: attempting to pipeline request type(1) op(15) with handshake
...
2009-06-14 05:17:29 W [client-protocol.c:205:call_bail] feed1: activating bail-out. pending frames = 1. last sent = 2009-06-14 05:16:37. last received = 2009-06-14 05:14:17 transport-timeout = 42
2009-06-14 05:17:29 C [client-protocol.c:212:call_bail] feed1: bailing transport
2009-06-14 05:17:29 W [client-protocol.c:4784:client_protocol_cleanup] feed1: cleaning up state in transport object 0x60f4e0
2009-06-14 05:17:29 E [client-protocol.c:4834:client_protocol_cleanup] feed1: forced unwinding frame type(1) op(15) reply=@0x2aaaac005b60
2009-06-14 05:17:29 E [client-protocol.c:3891:client_statfs_cbk] feed1: no proper reply from server, returning ENOTCONN
2009-06-14 05:17:29 E [unify.c:1402:unify_statfs_cbk] bricks: operation failed on feed1 (107)
2009-06-14 05:17:29 W [client-protocol.c:205:call_bail] feed3: activating bail-out. pending frames = 1. last sent = 2009-06-14 05:16:37. last received = 2009-06-14 05:14:20 transport-timeout = 42
2009-06-14 05:17:29 C [client-protocol.c:212:call_bail] feed3: bailing transport
2009-06-14 05:17:29 W [client-protocol.c:4784:client_protocol_cleanup] feed3: cleaning up state in transport object 0x614310
2009-06-14 05:17:29 W [client-protocol.c:205:call_bail] feed2: activating bail-out. pending frames = 1. last sent = 2009-06-14 05:16:37. last received = 2009-06-14 05:14:17 transport-timeout = 42
2009-06-14 05:17:29 C [client-protocol.c:212:call_bail] feed2: bailing transport
2009-06-14 05:17:29 E [client-protocol.c:4834:client_protocol_cleanup] feed3: forced unwinding frame type(1) op(15) reply=@0x2aaaac005eb0
2009-06-14 05:17:29 E [client-protocol.c:3891:client_statfs_cbk] feed3: no proper reply from server, returning ENOTCONN
2009-06-14 05:17:29 E [unify.c:1402:unify_statfs_cbk] bricks: operation failed on feed3 (107)
2009-06-14 05:17:29 W [client-protocol.c:4784:client_protocol_cleanup] feed2: cleaning up state in transport object 0x611f00
2009-06-14 05:17:29 E [client-protocol.c:4834:client_protocol_cleanup] feed2: forced unwinding frame type(1) op(15) reply=@0x2aaaac005cd0
2009-06-14 05:17:29 E [client-protocol.c:3891:client_statfs_cbk] feed2: no proper reply from server, returning ENOTCONN
2009-06-14 05:17:29 E [unify.c:1402:unify_statfs_cbk] bricks: operation failed on feed2 (107)
2009-06-14 05:17:29 W [client-protocol.c:332:client_protocol_xfer] feed1: not connected at the moment to submit frame type(1) op(15)
2009-06-14 05:17:29 E [client-protocol.c:3891:client_statfs_cbk] feed1: no proper reply from server, returning ENOTCONN
2009-06-14 05:17:29 E [unify.c:1402:unify_statfs_cbk] bricks: operation failed on feed1 (107)
2009-06-14 05:17:29 W [client-protocol.c:280:client_protocol_xfer] feed2: attempting to pipeline request type(1) op(15) with handshake
2009-06-14 05:17:29 W [client-protocol.c:280:client_protocol_xfer] feed3: attempting to pipeline request type(1) op(15) with handshake
2009-06-14 05:17:29 W [client-protocol.c:280:client_protocol_xfer] feed4: attempting to pipeline request type(1) op(15) with handshake
...
2009-06-16 16:17:34 E [unify.c:325:unify_lookup] bricks: returning ESTALE for / [translator generation (10) inode generation (6)]
2009-06-16 16:17:34 E [fuse-bridge.c:468:fuse_entry_cbk] glusterfs-fuse: 2016545: (34) / => -1 (116)
...
2009-06-17 12:01:25 W [client-protocol.c:205:call_bail] client-ns: activating bail-out. pending frames = 2. last sent = 2009-06-17 12:00:40. last received = 2009-06-17 12:00:40 transport-timeout = 42
2009-06-17 12:01:25 C [client-protocol.c:212:call_bail] client-ns: bailing transport
2009-06-17 12:01:25 W [client-protocol.c:4784:client_protocol_cleanup] client-ns: cleaning up state in transport object 0x61bf00
2009-06-17 12:01:25 E [client-protocol.c:4834:client_protocol_cleanup] client-ns: forced unwinding frame type(1) op(12) reply=@0x2aaaac000c30
2009-06-17 12:01:25 E [client-protocol.c:2627:client_open_cbk] client-ns: no proper reply from server, returning ENOTCONN
2009-06-17 12:01:25 E [unify.c:820:unify_open_cbk] bricks: Open success on child node, failed on namespace
2009-06-17 12:01:25 E [client-protocol.c:4834:client_protocol_cleanup] client-ns: forced unwinding frame type(1) op(40) reply=@0x2aaaac000c30
2009-06-17 12:01:25 E [client-protocol.c:4613:client_checksum_cbk] client-ns: no proper reply from server, returning ENOTCONN
2009-06-17 12:01:25 E [client-protocol.c:325:client_protocol_xfer] client-ns: transport_submit failed
2009-06-17 12:01:25 E [fuse-bridge.c:715:fuse_fd_cbk] glusterfs-fuse: 3345478: (12) /feed_processing/11460/11460.218.feed => -1 (107)
...
2009-06-21 14:37:40 W [client-protocol.c:205:call_bail] feed1: activating bail-out. pending frames = 2. last sent = 2009-06-21 14:36:53. last received = 2009-06-21 14:36:49 transport-timeout = 42
2009-06-21 14:37:40 C [client-protocol.c:212:call_bail] feed1: bailing transport
2009-06-21 14:37:40 W [client-protocol.c:4784:client_protocol_cleanup] feed1: cleaning up state in transport object 0x60f4e0
2009-06-21 14:37:40 E [client-protocol.c:4834:client_protocol_cleanup] feed1: forced unwinding frame type(1) op(34) reply=@0x2aaab005eea0
2009-06-21 14:37:40 E [client-protocol.c:4430:client_lookup_cbk] feed1: no proper reply from server, returning ENOTCONN
2009-06-21 14:37:40 E [unify.c:182:unify_lookup_cbk] bricks: feed1 returned 107
2009-06-21 14:37:40 E [client-protocol.c:4834:client_protocol_cleanup] feed1: forced unwinding frame type(1) op(34) reply=@0x2aaab005eea0
2009-06-21 14:37:40 E [client-protocol.c:4430:client_lookup_cbk] feed1: no proper reply from server, returning ENOTCONN
2009-06-21 14:37:40 E [unify.c:182:unify_lookup_cbk] bricks: feed1 returned 107
...
2009-06-25 04:44:14 W [fuse-bridge.c:411:fuse_entry_cbk] glusterfs-fuse: 4270680: (34) /feed_processing/12879/12879.206.feed => 1074574335 Rehashing because st_nlink less than dentry maps
...
2009-06-25 09:42:00 E [protocol.c:271:gf_block_unserialize_transport] feed1: EOF from peer (10.1.90.201:6996)
2009-06-25 09:42:00 W [client-protocol.c:4784:client_protocol_cleanup] feed1: cleaning up state in transport object 0x60f4e0
2009-06-25 09:42:00 E [protocol.c:271:gf_block_unserialize_transport] client-ns: EOF from peer (10.1.90.201:6996)
2009-06-25 09:42:00 W [client-protocol.c:4784:client_protocol_cleanup] client-ns: cleaning up state in transport object 0x61bf00
2009-06-25 09:42:00 E [tcp-client.c:190:tcp_connect] client-ns: non-blocking connect() returned: 111 (Connection refused)
-----------------------------------------------------------------------------
Sample gluster.server.log error messages on feed1
--------------------------------
2009-06-11 13:40:06 W [posix.c:486:posix_mknod] brick-ns: mknod on /condor/var/PS/query/query_data.lock: File exists
...
2009-06-11 20:37:49 W [posix.c:486:posix_mknod] brick-ns: mknod on /condor/var/PS/query/query_par.lock: File exists
...
2009-06-12 10:17:33 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (10.1.90.200:1016)
...
2009-06-13 04:15:00 W [posix.c:486:posix_mknod] brick-ns: mknod on /condor/var/PS/item/item_list_27.lock: File exists
...
2009-06-14 01:51:25 E [server-protocol.c:186:generic_reply] server: transport_writev failed
...
2009-06-14 01:51:25 E [tcp.c:124:tcp_except] server: shutdown () - error: Transport endpoint is not connected
...
2009-06-14 05:28:00 W [posix.c:486:posix_mknod] brick-ns: mknod on /condor/var/PS/token/token_data.lock: File exists
...
TLA Repo Revision: glusterfs--mainline--2.5--patch-797
Time : 2009-06-25 09:41:58
Signal Number : 11
/usr/local/sbin/glusterfsd -f /etc/glusterfs/glusterfs-server.vol -l /var/log/glusterfs/gluster.server.log -L WARNING --pidfile /var/log/glusterfs/pid
volume server
type protocol/server
option auth.ip.brick-ns.allow 10.1.90.*
option auth.ip.brick.allow 10.1.90.*
option transport-type tcp/server
subvolumes brick brick-ns
end-volume
volume locks
type features/posix-locks
subvolumes brick
end-volume
volume brick-ns
type storage/posix
option directory /data1/glusterfs-ns
end-volume
volume brick
type storage/posix
option directory /data1/glusterfs
end-volume
frame : type(1) op(2)
/lib64/libc.so.6[0x2ad72cf535b0]
/mnt/nfs_share/feed6/apps/install/glusterfs-1.3.12/lib/glusterfs/1.3.12/xlator/protocol/server.so(server_mknod_cbk+0xf0)[0x2aaaaaecbb70]
/mnt/nfs_share/feed6/apps/install/glusterfs-1.3.12/lib/glusterfs/1.3.12/xlator/storage/posix.so(posix_mknod+0x12a)[0x2aaaaaab507a]
/mnt/nfs_share/feed6/apps/install/glusterfs-1.3.12/lib/glusterfs/1.3.12/xlator/protocol/server.so(server_mknod+0x164)[0x2aaaaaecbd34]
/mnt/nfs_share/feed6/apps/install/glusterfs-1.3.12/lib/glusterfs/1.3.12/xlator/protocol/server.so(server_protocol_interpret+0x2e7)[0x2aaaaaec3f67]
/mnt/nfs_share/feed6/apps/install/glusterfs-1.3.12/lib/glusterfs/1.3.12/xlator/protocol/server.so(notify+0x9c)[0x2aaaaaecf5dc]
/mnt/nfs_share/feed6/apps/install/glusterfs-1.3.12/lib/libglusterfs.so.0(sys_epoll_iteration+0xbb)[0x2ad72c8c8bfb]
/mnt/nfs_share/feed6/apps/install/glusterfs-1.3.12/lib/libglusterfs.so.0(poll_iteration+0x79)[0x2ad72c8c7fb9]
[glusterfs](main+0x67c)[0x40291c]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x2ad72cf40ae4]
[glusterfs][0x401de9]
-----------------------------------------------------------------------------
Sample core data:
--------------------------------
feed1:/ # gdb /usr/local/sbin/glusterfsd core.12030
GNU gdb 6.5
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-suse-linux"...Using host libthread_db library "/lib64/libthread_db.so.1".
...
Core was generated by `[glusterfs] '.
Program terminated with signal 11, Segmentation fault.
#0 server_mknod_cbk (frame=0x2aaab37eab20, cookie=<value optimized out>, this=<value optimized out>, op_ret=0,
op_errno=<value optimized out>, inode=0x2aaab380bfe0, stbuf=0x7fff7e40d170) at server-protocol.c:877
877 dict_set (reply, "INODE", data_from_uint64 (inode->ino));
(gdb) p *inode
Cannot access memory at address 0x2aaab380bfe0
(gdb) p *reply
$3 = {is_static = 0 '\0', is_locked = 1 '\001', hash_size = 1, count = 3, refcount = 0, members = 0x2aaab2b71ab0,
members_list = 0x2aaab2e9b750, extra_free = 0x0, lock = 1}
-----------------------------------------------------------------------------
Yang, We are no more supporting 1.3.12 release. Please try 2.0.8+ or 3.0.0 release when it is done. Also we no longer support 'cluster/unify' translator too. Please use 'cluster/distribute' instead. That should fix all these issues. Regards, Amar |
(gdb) backtrace #0 server_mknod_cbk (frame=0x2aaab37eab20, cookie=<value optimized out>, this=<value optimized out>, op_ret=0, op_errno=<value optimized out>, inode=0x2aaab380bfe0, stbuf=0x7fff7e40d170) at server-protocol.c:877 #1 0x00002aaaaaab507a in posix_mknod (frame=0x2aaab37eb660, this=0x60b380, loc=0x7fff7e40d2f0, mode=32896, dev=0) at posix.c:499 #2 0x00002aaaaaecbd34 in server_mknod (frame=<value optimized out>, bound_xl=0x60b380, params=<value optimized out>) at server-protocol.c:4367 #3 0x00002aaaaaec3f67 in server_protocol_interpret (trans=0x2aaab00bfb40, blk=0x2aaab37d9d50) at server-protocol.c:6074 #4 0x00002aaaaaecf5dc in notify (this=<value optimized out>, event=<value optimized out>, data=0x2aaab00bfb40) at server-protocol.c:6444 #5 0x00002ad72c8c8bfb in sys_epoll_iteration (ctx=<value optimized out>) at epoll.c:64 #6 0x00002ad72c8c7fb9 in poll_iteration (ctx=0x2aaab37ebec8) at transport.c:312 #7 0x000000000040291c in main (argc=7, argv=0x7fff7e40d628) at glusterfs.c:565