Bug 763570 - (GLUSTER-1838) handle peer detach gracefully in case of lost frames
Status: CLOSED CURRENTRELEASE
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 3.1-alpha
Hardware: All
OS: Linux
Priority: low
Severity: low
Assigned To: Pranith Kumar K
Reported: 2010-10-06 09:37 EDT by Pranith Kumar K
Modified: 2015-12-01 11:45 EST (History)
Doc Type: Bug Fix
Description Craig Carl 2010-10-06 06:47:55 EDT
Pranith -
   I read your notes and saw the dropped frames in the logs. These two machines are running as VMs on the same KVM host. There is NO way there are network issues. The VM host has 0 load and is the fastest machine in the building. This patch may fix the problem, but I think you are wrong about the root cause.

Thanks,

Craig
Comment 1 Pranith Kumar K 2010-10-06 07:34:36 EDT
(In reply to comment #1)
> Pranith -
>    I read your notes, I saw the dropped frames in the logs. These two machines
> are running as VM's on the same KVM host. There is NO way there are network
> issues. The VM host has 0 load and is the fastest machine in the building. This
> patch may fix the problem but I think you are wrong as to root cause.
> 
> Thanks,
> 
> Craig

Craig, 
      The logs on box 240 had many errors about resolving the IP addresses of its peers, including get_addrinfo () failures.
We also tried pinging one machine from the other, and that gave problems too. Based on that and the lost-frame logs, we concluded that network issues were the cause. Let me also talk to some of the folks who implemented the communication interface and get back to you on what else could cause this.

Thanks
Pranith.
Comment 2 Pranith Kumar K 2010-10-06 09:37:44 EDT
There are two friends in the cluster. A detach is initiated from one machine; the other machine receives the request, but its response is lost because of network issues. A second peer detach to the same machine then causes the following crash on the machine that is being detached.


[2010-10-05 22:05:20.202143] I [glusterd-sm.c:707:glusterd_friend_sm_inject_event] glusterd: Enqueuing event: 9
pending frames:

patchset: v3.1.0qa7-513-gca86151
signal received: 6
time of crash: 2010-10-05 22:05:20
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.1.0qa40
/lib64/libc.so.6[0x3e14a302d0]
/lib64/libc.so.6(gsignal+0x35)[0x3e14a30265]
/lib64/libc.so.6(abort+0x110)[0x3e14a31d10]
/lib64/libc.so.6(__assert_fail+0xf6)[0x3e14a296e6]
/usr/lib64/glusterfs/3.1.0qa40/xlator/mgmt/glusterd.so(glusterd_friend_sm+0xaa)[0x2aaaaaad51f7]
/usr/lib64/glusterfs/3.1.0qa40/xlator/mgmt/glusterd.so(glusterd_handle_rpc_msg+0x362)[0x2aaaaaaee227]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x314)[0x2adefc83d9d9]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x184)[0x2adefc83dd5e]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0xeb)[0x2adefc843bc5]
/usr/lib64/glusterfs/3.1.0qa40/rpc-transport/socket.so(socket_event_poll_in+0x4b)[0x2aaaaada13ff]
/usr/lib64/glusterfs/3.1.0qa40/rpc-transport/socket.so(socket_event_handler+0xfc)[0x2aaaaada1770]
/usr/lib64/libglusterfs.so.0[0x2adefc600ae4]
/usr/lib64/libglusterfs.so.0[0x2adefc600cd3]
/usr/lib64/libglusterfs.so.0(event_dispatch+0x81)[0x2adefc60102f]
/usr/sbin/glusterd(main+0xec)[0x405e19]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3e14a1d994]
/usr/sbin/glusterd[0x402d39]
Comment 3 Harshavardhana 2010-10-06 18:48:02 EDT
>       The logs on the box 240, had a lot of error logs about resolving the the
> ip-address of its peers. There were errors about get_addrinfo () failures.
> We tried pinging one machine from the other, that gave problems too. So we came
> to the conclusion based on the frames lost logs that this is the reason. Let me

Pranith,

Could this be a platform issue? You need to check whether the DNS resolver was working fine, since we have been seeing issues with that of late.

If a hostname doesn't get resolved, this will be a general problem for RPC, NFS, and so forth.

Regards
--
Harshavardhana
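
One way to run the resolver check suggested above is a small standalone program around getaddrinfo(). The following is only a sketch: `peer_resolves` is a hypothetical helper name, not a glusterd function, and it reports only whether the name resolves, not which address it resolves to.

```c
#define _POSIX_C_SOURCE 200112L

#include <assert.h>
#include <netdb.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper (not part of glusterd): returns 0 if `host` resolves,
 * otherwise the non-zero error code returned by getaddrinfo(). */
int peer_resolves(const char *host)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    int rc;

    assert(host != NULL);

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;     /* accept IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM; /* TCP, as a management daemon would use */

    rc = getaddrinfo(host, NULL, &hints, &res);
    if (rc == 0)
        freeaddrinfo(res);           /* only the success/failure result is needed */
    return rc;
}
```

A non-zero return value can be passed to gai_strerror() to get a readable message. Running the check on each machine against the other peer's hostname would show whether name resolution is failing on one side only.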
Comment 4 Vijay Bellur 2010-10-11 03:32:49 EDT
PATCH: http://patches.gluster.com/patch/5421 in master (mgmt/glusterd: handle reqs from unknown peers for friend sm)
Comment 5 Pranith Kumar K 2011-01-11 21:19:42 EST
This bug is resolved in Glusterd. Since we are not going to actively work on the platform, we decided to resolve it.

Glusterd was not checking whether the senders of the requests it receives are in its friend list before performing friend state machine operations. We added validations to fix that problem.
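
The kind of validation described above can be sketched in a few lines of C. This is a simplified illustration, not the actual patch: the types, the function names (`peer_find`, `friend_sm_step`, `handle_friend_request`), and the idea of identifying a sender by a UUID string are all assumptions made for the example. The assertion in `friend_sm_step` is only a stand-in for whatever assertion failed in glusterd_friend_sm, which this report does not show.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PEERS 8

/* Hypothetical, simplified peer record. */
struct peer {
    char uuid[37]; /* room for a 36-character UUID string plus NUL */
};

struct peer_list {
    struct peer peers[MAX_PEERS];
    size_t count;
};

/* Look up a sender in the friend list; returns NULL if it is unknown. */
static const struct peer *peer_find(const struct peer_list *list,
                                    const char *uuid)
{
    for (size_t i = 0; i < list->count; i++) {
        if (strcmp(list->peers[i].uuid, uuid) == 0)
            return &list->peers[i];
    }
    return NULL;
}

/* Stand-in for a state machine step that assumes a known peer. */
static void friend_sm_step(const struct peer *p)
{
    assert(p != NULL);
}

/* Validate the sender before touching the state machine.
 * Returns 0 if the request was handled, -1 if the sender is not a friend. */
int handle_friend_request(const struct peer_list *list,
                          const char *sender_uuid)
{
    const struct peer *p = peer_find(list, sender_uuid);

    if (p == NULL)
        return -1; /* unknown sender: reject instead of aborting */

    friend_sm_step(p);
    return 0;
}
```

With a check like this, a repeated detach request from a machine that is no longer in the friend list is rejected with an error rather than reaching code that assumes the peer exists.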
