Pranith -

I read your notes, and I saw the dropped frames in the logs. These two machines are running as VMs on the same KVM host. There is NO way there are network issues. The VM host has zero load and is the fastest machine in the building. This patch may fix the problem, but I think you are wrong about the root cause.

Thanks,

Craig
(In reply to comment #1)
> Pranith -
> I read your notes, I saw the dropped frames in the logs. These two machines
> are running as VM's on the same KVM host. There is NO way there are network
> issues. The VM host has 0 load and is the fastest machine in the building. This
> patch may fix the problem but I think you are wrong as to root cause.
>
> Thanks,
>
> Craig

Craig,

The logs on box 240 had a lot of errors about resolving the IP addresses of its peers; there were errors about getaddrinfo() failures. We also tried pinging one machine from the other, and that gave problems too. So, based on those errors together with the "frames lost" logs, we concluded that this was the cause. Let me also talk to some of the folks who implemented the communication interface and get back to you on what else could cause this.

Thanks,
Pranith
There are two friends in the cluster. When a detach is initiated from one machine, and the other machine receives the request but its response is lost because of network issues, a second peer detach aimed at the same machine causes the following crash on the machine being detached:

[2010-10-05 22:05:20.202143] I [glusterd-sm.c:707:glusterd_friend_sm_inject_event] glusterd: Enqueuing event: 9

pending frames:

patchset: v3.1.0qa7-513-gca86151
signal received: 6
time of crash: 2010-10-05 22:05:20
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.1.0qa40
/lib64/libc.so.6[0x3e14a302d0]
/lib64/libc.so.6(gsignal+0x35)[0x3e14a30265]
/lib64/libc.so.6(abort+0x110)[0x3e14a31d10]
/lib64/libc.so.6(__assert_fail+0xf6)[0x3e14a296e6]
/usr/lib64/glusterfs/3.1.0qa40/xlator/mgmt/glusterd.so(glusterd_friend_sm+0xaa)[0x2aaaaaad51f7]
/usr/lib64/glusterfs/3.1.0qa40/xlator/mgmt/glusterd.so(glusterd_handle_rpc_msg+0x362)[0x2aaaaaaee227]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x314)[0x2adefc83d9d9]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x184)[0x2adefc83dd5e]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0xeb)[0x2adefc843bc5]
/usr/lib64/glusterfs/3.1.0qa40/rpc-transport/socket.so(socket_event_poll_in+0x4b)[0x2aaaaada13ff]
/usr/lib64/glusterfs/3.1.0qa40/rpc-transport/socket.so(socket_event_handler+0xfc)[0x2aaaaada1770]
/usr/lib64/libglusterfs.so.0[0x2adefc600ae4]
/usr/lib64/libglusterfs.so.0[0x2adefc600cd3]
/usr/lib64/libglusterfs.so.0(event_dispatch+0x81)[0x2adefc60102f]
/usr/sbin/glusterd(main+0xec)[0x405e19]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3e14a1d994]
/usr/sbin/glusterd[0x402d39]
> The logs on the box 240, had a lot of error logs about resolving the the
> ip-address of its peers. There were errors about get_addrinfo () failures.
> We tried pinging one machine from the other, that gave problems too. So we came
> to the conclusion based on the frames lost logs that this is the reason. Let me

Pranith,

Could this be a platform issue? You need to check whether the DNS resolver was working correctly, since we have been seeing issues with it of late. If a hostname does not get resolved, this is a general problem for RPC, and hence also for NFS and so forth.

Regards
--
Harshavardhana
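A quick, standalone way to check whether the resolver itself is failing, independent of GlusterFS, is to call getaddrinfo() directly (a minimal sketch; `resolve_peer` is an illustrative helper, not a glusterd function):

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* Try to resolve a peer hostname the way an RPC transport would before
 * connecting. A failure shows up as a non-zero return from
 * getaddrinfo(), with gai_strerror() giving the reason. */
static int resolve_peer(const char *host)
{
    struct addrinfo hints, *res = NULL;
    int ret;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;     /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM; /* TCP */

    ret = getaddrinfo(host, NULL, &hints, &res);
    if (ret != 0) {
        fprintf(stderr, "resolving %s failed: %s\n", host, gai_strerror(ret));
        return -1;
    }
    freeaddrinfo(res);
    return 0;
}

int main(void)
{
    /* A numeric address never needs DNS; a name in the reserved
     * .invalid TLD demonstrates the failure path from the logs. */
    printf("127.0.0.1: %s\n",
           resolve_peer("127.0.0.1") == 0 ? "ok" : "failed");
    printf("no-such-host.invalid: %s\n",
           resolve_peer("no-such-host.invalid") == 0 ? "ok" : "failed");
    return 0;
}
```

If the bad name fails while the numeric address succeeds, resolution (not the network path) is the broken layer, matching the getaddrinfo() errors in the box 240 logs.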
PATCH: http://patches.gluster.com/patch/5421 in master (mgmt/glusterd: handle reqs from unknown peers for friend sm)
This bug is resolved in glusterd. Since we are not going to actively work on the platform side, we decided to resolve it here. Glusterd was not checking whether the senders of the requests it receives are in its friend list before performing friend state machine operations. We added that validation to fix the problem.