Pranith -

I read your notes, and I saw the dropped frames in the logs. These two machines are running as VMs on the same KVM host. There is NO way there are network issues. The VM host has zero load and is the fastest machine in the building. This patch may fix the problem, but I think you are wrong about the root cause.

Thanks,

Craig
(In reply to comment #1)
> Pranith -
> I read your notes, I saw the dropped frames in the logs. These two machines
> are running as VM's on the same KVM host. There is NO way there are network
> issues. The VM host has 0 load and is the fastest machine in the building. This
> patch may fix the problem but I think you are wrong as to root cause.
>
> Thanks,
>
> Craig

Craig,

The logs on box 240 had a lot of errors about resolving the IP addresses of its peers; there were errors about getaddrinfo() failures. We also tried pinging one machine from the other, and that gave problems too. So, based on those errors together with the "frames lost" logs, we concluded that this was the cause. Let me also talk to some of the folks who implemented the communication interface and get back to you on what else could cause this.

Thanks,
Pranith
There are two friends in the cluster. When a detach is initiated from one machine, and the other machine receives the request but its response is lost because of network issues, a second peer detach aimed at the same machine causes the following crash on the machine being detached:

[2010-10-05 22:05:20.202143] I [glusterd-sm.c:707:glusterd_friend_sm_inject_event] glusterd: Enqueuing event: 9

pending frames:

patchset: v3.1.0qa7-513-gca86151
signal received: 6
time of crash: 2010-10-05 22:05:20
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.1.0qa40
/lib64/libc.so.6[0x3e14a302d0]
/lib64/libc.so.6(gsignal+0x35)[0x3e14a30265]
/lib64/libc.so.6(abort+0x110)[0x3e14a31d10]
/lib64/libc.so.6(__assert_fail+0xf6)[0x3e14a296e6]
/usr/lib64/glusterfs/3.1.0qa40/xlator/mgmt/glusterd.so(glusterd_friend_sm+0xaa)[0x2aaaaaad51f7]
/usr/lib64/glusterfs/3.1.0qa40/xlator/mgmt/glusterd.so(glusterd_handle_rpc_msg+0x362)[0x2aaaaaaee227]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x314)[0x2adefc83d9d9]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x184)[0x2adefc83dd5e]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0xeb)[0x2adefc843bc5]
/usr/lib64/glusterfs/3.1.0qa40/rpc-transport/socket.so(socket_event_poll_in+0x4b)[0x2aaaaada13ff]
/usr/lib64/glusterfs/3.1.0qa40/rpc-transport/socket.so(socket_event_handler+0xfc)[0x2aaaaada1770]
/usr/lib64/libglusterfs.so.0[0x2adefc600ae4]
/usr/lib64/libglusterfs.so.0[0x2adefc600cd3]
/usr/lib64/libglusterfs.so.0(event_dispatch+0x81)[0x2adefc60102f]
/usr/sbin/glusterd(main+0xec)[0x405e19]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3e14a1d994]
/usr/sbin/glusterd[0x402d39]
> The logs on the box 240, had a lot of error logs about resolving the the
> ip-address of its peers. There were errors about get_addrinfo () failures.
> We tried pinging one machine from the other, that gave problems too. So we came
> to the conclusion based on the frames lost logs that this is the reason. Let me

Pranith,

Could this be a platform issue? You need to check whether the DNS resolver was working correctly, since we have been seeing issues with it of late. If a hostname does not get resolved, this is a general problem for RPC, and hence also for NFS and so forth.

Regards
--
Harshavardhana
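A quick, standalone way to check whether the resolver itself is failing, independent of GlusterFS, is to call getaddrinfo() directly (a minimal sketch; `resolve_peer` is an illustrative helper, not a glusterd function):

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* Try to resolve a peer hostname the way an RPC transport would before
 * connecting. A failure shows up as a non-zero return from
 * getaddrinfo(), with gai_strerror() giving the reason. */
static int resolve_peer(const char *host)
{
    struct addrinfo hints, *res = NULL;
    int ret;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;     /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM; /* TCP */

    ret = getaddrinfo(host, NULL, &hints, &res);
    if (ret != 0) {
        fprintf(stderr, "resolving %s failed: %s\n", host, gai_strerror(ret));
        return -1;
    }
    freeaddrinfo(res);
    return 0;
}

int main(void)
{
    /* A numeric address never needs DNS; a name in the reserved
     * .invalid TLD demonstrates the failure path from the logs. */
    printf("127.0.0.1: %s\n",
           resolve_peer("127.0.0.1") == 0 ? "ok" : "failed");
    printf("no-such-host.invalid: %s\n",
           resolve_peer("no-such-host.invalid") == 0 ? "ok" : "failed");
    return 0;
}
```

If the bad name fails while the numeric address succeeds, resolution (not the network path) is the broken layer, matching the getaddrinfo() errors in the box 240 logs.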
PATCH: http://patches.gluster.com/patch/5421 in master (mgmt/glusterd: handle reqs from unknown peers for friend sm)
This bug is resolved in glusterd. Since we are not going to actively work on the platform side, we decided to resolve it here. Glusterd was not checking whether the senders of the requests it receives are in its friend list before performing friend state machine operations. We added that validation to fix the problem.