Bug 762685 (GLUSTER-953)

Summary: Crash on replica coming back up under nfs
Product: [Community] GlusterFS Reporter: Shehjar Tikoo <shehjart>
Component: replicateAssignee: Pavan Vilas Sondur <pavan>
Status: CLOSED DUPLICATE QA Contact:
Severity: high Docs Contact:
Priority: urgent    
Version: nfs-alphaCC: gluster-bugs
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: ---
Regression: RTP Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Shehjar Tikoo 2010-05-26 13:42:29 UTC
FWIW, know that the previous run of the same volfile had also crashed due to a bug in nfsx, which could've left the replicate state in disarray.

Comment 1 Shehjar Tikoo 2010-05-26 16:05:35 UTC
User described the situation as follows:

The following action of steps crashes glusterfs:

- take down replica glusterfsd
- bring it back
- access storage on those replicas over nfs

The log is here is annotions:

================================================================================
Version      : glusterfs nfs-beta-rc5 built on May 19 2010 04:27:21
git: v3.0.0-213-gf424a88
Starting Time: 2010-05-20 19:48:54
Command line : glusterfs -f /root/nfs_triple.vol -l /tmp/nfs_triple.log 
PID          : 4057
System name  : Linux
Nodename     : gluster-node-01
Kernel Release : 2.6.18-164.el5
Hardware Identifier: x86_64

Given volfile:
+------------------------------------------------------------------------------+
  1: volume node-11
  2: 	type protocol/client
  3: 	option transport-type tcp
  4: 	option remote-port 6102
  5: 	option remote-host 10.6.145.114
  6: 	option remote-subvolume posix-1-locked-iot
  7: end-volume
  8: volume node-12
  9: 	type protocol/client
 10: 	option transport-type tcp
 11: 	option remote-port 6102
 12: 	option remote-host 10.6.145.114
 13: 	option remote-subvolume posix-2-locked-iot
 14: end-volume
 15: volume node-13
 16: 	type protocol/client
 17: 	option transport-type tcp
 18: 	option remote-port 6102
 19: 	option remote-host 10.6.145.114
 20: 	option remote-subvolume posix-3-locked-iot
 21: end-volume
 22: 
 23: volume node-21
 24: 	type protocol/client
 25: 	option transport-type tcp
 26: 	option remote-port 6102
 27: 	option remote-host 10.6.145.116
 28: 	option remote-subvolume posix-1-locked-iot
 29: end-volume
 30: volume node-22
 31: 	type protocol/client
 32: 	option transport-type tcp
 33: 	option remote-port 6102
 34: 	option remote-host 10.6.145.116
 35: 	option remote-subvolume posix-2-locked-iot
 36: end-volume
 37: volume node-23
 38: 	type protocol/client
 39: 	option transport-type tcp
 40: 	option remote-port 6102
 41: 	option remote-host 10.6.145.116
 42: 	option remote-subvolume posix-3-locked-iot
 43: end-volume
 44: 
 45: volume node-31
 46: 	type protocol/client
 47: 	option transport-type tcp
 48: 	option remote-port 6102
 49: 	option remote-host 10.6.145.118
 50: 	option remote-subvolume posix-1-locked-iot
 51: end-volume
 52: volume node-32
 53: 	type protocol/client
 54: 	option transport-type tcp
 55: 	option remote-port 6102
 56: 	option remote-host 10.6.145.118
 57: 	option remote-subvolume posix-2-locked-iot
 58: end-volume
 59: volume node-33
 60: 	type protocol/client
 61: 	option transport-type tcp
 62: 	option remote-port 6102
 63: 	option remote-host 10.6.145.118
 64: 	option remote-subvolume posix-3-locked-iot
 65: end-volume
 66: 
 67: volume node-41
 68: 	type protocol/client
 69: 	option transport-type tcp
 70: 	option remote-port 6102
 71: 	option remote-host 10.6.145.120
 72: 	option remote-subvolume posix-1-locked-iot
 73: end-volume
 74: volume node-42
 75: 	type protocol/client
 76: 	option transport-type tcp
 77: 	option remote-port 6102
 78: 	option remote-host 10.6.145.120
 79: 	option remote-subvolume posix-2-locked-iot
 80: end-volume
 81: volume node-43
 82: 	type protocol/client
 83: 	option transport-type tcp
 84: 	option remote-port 6102
 85: 	option remote-host 10.6.145.120
 86: 	option remote-subvolume posix-3-locked-iot
 87: end-volume
 88: 
 89: volume mirror-01
 90: 	type cluster/replicate
 91: 	subvolumes node-11 node-21 node-31
 92: end-volume
 93: volume mirror-02
 94: 	type cluster/replicate
 95: 	subvolumes node-22 node-32 node-42
 96: end-volume
 97: volume mirror-03
 98: 	type cluster/replicate
 99: 	subvolumes node-33 node-43 node-12
100: end-volume
101: volume mirror-04
102: 	type cluster/replicate
103: 	subvolumes node-41 node-13 node-23
104: end-volume
105: 
106: volume distribute
107: 	type cluster/distribute
108: 	subvolumes mirror-01 mirror-02 mirror-03 mirror-04
109: end-volume
110: 
111: volume nfs-server
112: 	type nfs/server
113: 	subvolumes distribute
114: 	option rpc-auth.addr.allow *
115: end-volume

+------------------------------------------------------------------------------+
[2010-05-20 19:48:54] N [glusterfsd.c:1477:main] glusterfs: Successfully started
[2010-05-20 19:48:54] N [client-protocol.c:5813:client_setvolume_cbk] node-11: Connected to 10.6.145.114:6102, attached to remote volume 'posix-1-locked-iot'.
[2010-05-20 19:48:54] N [afr.c:2636:notify] mirror-01: Subvolume 'node-11' came back up; going online.
[2010-05-20 19:48:54] N [client-protocol.c:5813:client_setvolume_cbk] node-11: Connected to 10.6.145.114:6102, attached to remote volume 'posix-1-locked-iot'.
[2010-05-20 19:48:54] N [afr.c:2636:notify] mirror-01: Subvolume 'node-11' came back up; going online.
[2010-05-20 19:48:54] N [client-protocol.c:5813:client_setvolume_cbk] node-42: Connected to 10.6.145.120:6102, attached to remote volume 'posix-2-locked-iot'.
[2010-05-20 19:48:54] N [afr.c:2636:notify] mirror-02: Subvolume 'node-42' came back up; going online.
[2010-05-20 19:48:54] N [client-protocol.c:5813:client_setvolume_cbk] node-42: Connected to 10.6.145.120:6102, attached to remote volume 'posix-2-locked-iot'.
[2010-05-20 19:48:54] N [afr.c:2636:notify] mirror-02: Subvolume 'node-42' came back up; going online.
[2010-05-20 19:48:54] N [client-protocol.c:5813:client_setvolume_cbk] node-33: Connected to 10.6.145.118:6102, attached to remote volume 'posix-3-locked-iot'.
[2010-05-20 19:48:54] N [afr.c:2636:notify] mirror-03: Subvolume 'node-33' came back up; going online.
[2010-05-20 19:48:54] N [client-protocol.c:5813:client_setvolume_cbk] node-33: Connected to 10.6.145.118:6102, attached to remote volume 'posix-3-locked-iot'.
[2010-05-20 19:48:54] N [afr.c:2636:notify] mirror-03: Subvolume 'node-33' came back up; going online.
[2010-05-20 19:48:54] N [client-protocol.c:5813:client_setvolume_cbk] node-43: Connected to 10.6.145.120:6102, attached to remote volume 'posix-3-locked-iot'.
[2010-05-20 19:48:54] N [client-protocol.c:5813:client_setvolume_cbk] node-43: Connected to 10.6.145.120:6102, attached to remote volume 'posix-3-locked-iot'.
[2010-05-20 19:48:54] N [client-protocol.c:5813:client_setvolume_cbk] node-12: Connected to 10.6.145.114:6102, attached to remote volume 'posix-2-locked-iot'.
[2010-05-20 19:48:54] N [client-protocol.c:5813:client_setvolume_cbk] node-12: Connected to 10.6.145.114:6102, attached to remote volume 'posix-2-locked-iot'.
[2010-05-20 19:48:54] N [client-protocol.c:5813:client_setvolume_cbk] node-41: Connected to 10.6.145.120:6102, attached to remote volume 'posix-1-locked-iot'.
[2010-05-20 19:48:54] N [afr.c:2636:notify] mirror-04: Subvolume 'node-41' came back up; going online.
[2010-05-20 19:48:54] N [client-protocol.c:5813:client_setvolume_cbk] node-41: Connected to 10.6.145.120:6102, attached to remote volume 'posix-1-locked-iot'.
[2010-05-20 19:48:54] N [afr.c:2636:notify] mirror-04: Subvolume 'node-41' came back up; going online.
[2010-05-20 19:48:54] N [client-protocol.c:5813:client_setvolume_cbk] node-13: Connected to 10.6.145.114:6102, attached to remote volume 'posix-3-locked-iot'.
[2010-05-20 19:48:54] N [client-protocol.c:5813:client_setvolume_cbk] node-13: Connected to 10.6.145.114:6102, attached to remote volume 'posix-3-locked-iot'.
[2010-05-20 19:48:54] N [client-protocol.c:5813:client_setvolume_cbk] node-23: Connected to 10.6.145.116:6102, attached to remote volume 'posix-3-locked-iot'.
[2010-05-20 19:48:54] N [client-protocol.c:5813:client_setvolume_cbk] node-31: Connected to 10.6.145.118:6102, attached to remote volume 'posix-1-locked-iot'.
[2010-05-20 19:48:54] N [client-protocol.c:5813:client_setvolume_cbk] node-31: Connected to 10.6.145.118:6102, attached to remote volume 'posix-1-locked-iot'.
[2010-05-20 19:48:54] N [client-protocol.c:5813:client_setvolume_cbk] node-21: Connected to 10.6.145.116:6102, attached to remote volume 'posix-1-locked-iot'.
[2010-05-20 19:48:54] N [client-protocol.c:5813:client_setvolume_cbk] node-21: Connected to 10.6.145.116:6102, attached to remote volume 'posix-1-locked-iot'.
[2010-05-20 19:48:54] N [client-protocol.c:5813:client_setvolume_cbk] node-22: Connected to 10.6.145.116:6102, attached to remote volume 'posix-2-locked-iot'.
[2010-05-20 19:48:54] N [client-protocol.c:5813:client_setvolume_cbk] node-22: Connected to 10.6.145.116:6102, attached to remote volume 'posix-2-locked-iot'.
[2010-05-20 19:48:54] N [client-protocol.c:5813:client_setvolume_cbk] node-32: Connected to 10.6.145.118:6102, attached to remote volume 'posix-2-locked-iot'.
[2010-05-20 19:48:54] N [client-protocol.c:5813:client_setvolume_cbk] node-32: Connected to 10.6.145.118:6102, attached to remote volume 'posix-2-locked-iot'.
[2010-05-20 19:48:54] N [client-protocol.c:5813:client_setvolume_cbk] node-23: Connected to 10.6.145.116:6102, attached to remote volume 'posix-3-locked-iot'.






########################################################################
######  USER TOOK DOWN A SERVER, EACH SERVER HERE CONTAINS THREE  ######
###### REPLICAS, SO THE DISCONNECTION MESSAGE IS FOR THREE SUBVOS ######
########################################################################
[2010-05-21 03:33:14] N [client-protocol.c:6558:notify] node-23: disconnected
[2010-05-21 03:33:14] N [client-protocol.c:6558:notify] node-22: disconnected
[2010-05-21 03:33:14] N [client-protocol.c:6558:notify] node-21: disconnected
[2010-05-21 03:33:17] E [socket.c:762:socket_connect_finish] node-23: connection to 10.6.145.116:6102 failed (Connection refused)
[2010-05-21 03:33:17] E [socket.c:762:socket_connect_finish] node-23: connection to 10.6.145.116:6102 failed (Connection refused)
[2010-05-21 03:33:17] E [socket.c:762:socket_connect_finish] node-22: connection to 10.6.145.116:6102 failed (Connection refused)
[2010-05-21 03:33:17] E [socket.c:762:socket_connect_finish] node-22: connection to 10.6.145.116:6102 failed (Connection refused)
[2010-05-21 03:33:17] E [socket.c:762:socket_connect_finish] node-21: connection to 10.6.145.116:6102 failed (Connection refused)
[2010-05-21 03:33:17] E [socket.c:762:socket_connect_finish] node-21: connection to 10.6.145.116:6102 failed (Connection refused)
[2010-05-21 03:37:54] N [client-protocol.c:6558:notify] node-33: disconnected
[2010-05-21 03:37:54] N [client-protocol.c:6558:notify] node-32: disconnected
[2010-05-21 03:37:54] N [client-protocol.c:6558:notify] node-31: disconnected
[2010-05-21 03:37:54] E [socket.c:762:socket_connect_finish] node-32: connection to 10.6.145.118:6102 failed (Connection refused)
[2010-05-21 03:37:54] E [socket.c:762:socket_connect_finish] node-32: connection to 10.6.145.118:6102 failed (Connection refused)
[2010-05-21 03:37:54] E [socket.c:762:socket_connect_finish] node-33: connection to 10.6.145.118:6102 failed (Connection refused)
[2010-05-21 03:37:54] E [socket.c:762:socket_connect_finish] node-33: connection to 10.6.145.118:6102 failed (Connection refused)
[2010-05-21 03:37:57] E [socket.c:762:socket_connect_finish] node-31: connection to 10.6.145.118:6102 failed (Connection refused)
[2010-05-21 03:37:57] E [socket.c:762:socket_connect_finish] node-31: connection to 10.6.145.118:6102 failed (Connection refused)



#####################################
######  REPLICAS CAME BACK UP  ######
#####################################
[2010-05-21 03:46:18] N [client-protocol.c:5813:client_setvolume_cbk] node-23: Connected to 10.6.145.116:6102, attached to remote volume 'posix-3-locked-iot'.
[2010-05-21 03:46:18] N [client-protocol.c:5813:client_setvolume_cbk] node-23: Connected to 10.6.145.116:6102, attached to remote volume 'posix-3-locked-iot'.
[2010-05-21 03:46:18] N [client-protocol.c:5813:client_setvolume_cbk] node-22: Connected to 10.6.145.116:6102, attached to remote volume 'posix-2-locked-iot'.
[2010-05-21 03:46:18] N [client-protocol.c:5813:client_setvolume_cbk] node-22: Connected to 10.6.145.116:6102, attached to remote volume 'posix-2-locked-iot'.
[2010-05-21 03:46:18] N [client-protocol.c:5813:client_setvolume_cbk] node-21: Connected to 10.6.145.116:6102, attached to remote volume 'posix-1-locked-iot'.
[2010-05-21 03:46:18] N [client-protocol.c:5813:client_setvolume_cbk] node-21: Connected to 10.6.145.116:6102, attached to remote volume 'posix-1-locked-iot'.
[2010-05-21 03:46:19] E [afr-self-heal-common.c:1196:sh_missing_entries_create] mirror-01: conflicing filetypes exist for path /gluster-test-01/.lck-b715060000000000. returning.
[2010-05-21 03:46:19] E [afr-self-heal-common.c:1215:sh_missing_entries_create] mirror-04: no missing files - /gluster-test-01/.lck-c106120000000000. proceeding to metadata check
[2010-05-21 03:46:19] E [afr.c:96:afr_set_split_brain] mirror-01: invalid argument: inode
pending frames:

patchset: v3.0.0-213-gf424a88
signal received: 11
time of crash: 2010-05-21 03:46:19
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs nfs-beta-rc5
/lib64/libc.so.6[0x3ff08302d0]
/usr/local/lib/glusterfs/nfs-beta-rc5/xlator/cluster/replicate.so(afr_sh_data_fix+0x1d1)[0x2b4d89ba3961]
/usr/local/lib/glusterfs/nfs-beta-rc5/xlator/cluster/replicate.so(afr_sh_data_fstat_cbk+0x11e)[0x2b4d89ba3c3e]
/usr/local/lib/glusterfs/nfs-beta-rc5/xlator/protocol/client.so(client_fstat_cbk+0xce)[0x2b4d89971cde]
/usr/local/lib/glusterfs/nfs-beta-rc5/xlator/protocol/client.so(protocol_client_pollin+0xca)[0x2b4d899591ba]
/usr/local/lib/glusterfs/nfs-beta-rc5/xlator/protocol/client.so(notify+0x212)[0x2b4d89967972]
/usr/local/lib/libglusterfs.so.0(xlator_notify+0x43)[0x2b4d88e93433]
/usr/local/lib/glusterfs/nfs-beta-rc5/transport/socket.so(socket_event_handler+0xd3)[0x2aaaaab9b073]
/usr/local/lib/libglusterfs.so.0[0x2b4d88eae4e5]
glusterfs(main+0xb17)[0x404367]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3ff081d994]
glusterfs[0x4027a9]
---------

Comment 2 Shehjar Tikoo 2010-05-31 04:12:14 UTC

*** This bug has been marked as a duplicate of bug 869 ***