Description of problem:
One of the common problems we encounter is frequent connects/disconnects. A disconnect can be either:
1. voluntary, where the process calls shutdown(2)/close(2) on an otherwise healthy socket connection, or
2. involuntary, where we get a POLLERR event from the network.
While debugging this class of issues, it would help if we could identify which of these two categories a particular disconnect falls into. We need to add enough log messages to help us classify them.
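To make the two categories concrete, here is a minimal sketch of the decision the new log messages should let us make. It is not the GlusterFS code: the function name, the closed_locally flag, and the "unclassified" fallback label are hypothetical and only for illustration.

```python
import select


def classify_disconnect(revents, closed_locally):
    """Label one disconnect using the two categories from the description.

    revents        -- event mask reported by poll() for the socket
    closed_locally -- hypothetical flag: True if this process itself called
                      shutdown()/close() on the connection
    """
    if closed_locally:
        # The process tore down an otherwise healthy connection.
        return "voluntary"
    if revents & select.POLLERR:
        # The network layer reported an error on the socket.
        return "involuntary"
    # Neither signal is present; more logging is needed to decide.
    return "unclassified"
```

The fallback label is deliberate: without a log message at the point where shutdown/close is called, the absence of POLLERR alone does not prove a disconnect was voluntary.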
mainline patch: https://review.gluster.org/16732 moving to POST
downstream patch: https://code.engineering.redhat.com/gerrit/#/c/101324/
on_qa validation: moving to verified after discussion with the Assignee to confirm expected behavior.
Test version: 3.8.4-35
Talked with the Assignee and confirmed the following: killed a brick with trace enabled on the client and the brick. I see the EPOLLERR message below, which the Assignee confirmed. Checked on both replicate and EC volumes.
fuse log:
[2017-07-28 10:44:46.925685] D [socket.c:564:__socket_rwv] 0-rep-client-0: EOF on socket
[2017-07-28 10:44:46.926240] W [socket.c:595:__socket_rwv] 0-rep-client-0: readv on 10.70.35.14:49157 failed (No data available)
[2017-07-28 10:44:46.926257] D [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2017-07-28 10:44:46.926287] D [MSGID: 0] [client.c:2264:client_rpc_notify] 0-rep-client-0: got RPC_CLNT_DISCONNECT
[2017-07-28 10:44:46.926371] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-rep-client-0: disconnected from rep-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2017-07-28 10:44:46.928289] D [rpc-clnt-ping.c:93:rpc_clnt_remove_ping_timer_locked] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fccb73781e2] (--> /lib64/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0x8b)[0x7fccb7142ccb] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x5f)[0x7fccb713f0ff] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7fccb713fbe0] (--> /lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccb713b9e3] ))))) 0-: 10.70.35.14:49157: ping timer event already removed
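For anyone triaging a larger log, the check done by eye above can be roughly automated. The sketch below scans log lines and labels each RPC_CLNT_DISCONNECT as involuntary if an "EPOLLERR - disconnecting now" line was seen since the previous disconnect. The helper name and the adjacency heuristic are my own assumptions: the EPOLLERR line is logged under 0-transport rather than the client name, so this pairing can mislabel disconnects when several connections drop at once. The sample lines are copied from the fuse log above.

```python
import re

# Message texts taken from the fuse log excerpt in this comment.
EPOLLERR_RE = re.compile(r"EPOLLERR - disconnecting now")
DISCONNECT_RE = re.compile(r"got RPC_CLNT_DISCONNECT")
TIMESTAMP_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]")


def classify_disconnects(lines):
    """Return a list of (timestamp, label), one per RPC_CLNT_DISCONNECT line.

    Heuristic (an assumption, not GlusterFS behaviour): a disconnect is
    'involuntary' if an EPOLLERR line appeared since the last disconnect,
    otherwise 'unclassified'.
    """
    results = []
    saw_epollerr = False
    for line in lines:
        if EPOLLERR_RE.search(line):
            saw_epollerr = True
        elif DISCONNECT_RE.search(line):
            match = TIMESTAMP_RE.match(line)
            timestamp = match.group(1) if match else None
            label = "involuntary" if saw_epollerr else "unclassified"
            results.append((timestamp, label))
            saw_epollerr = False  # reset for the next disconnect
    return results


SAMPLE = [
    "[2017-07-28 10:44:46.925685] D [socket.c:564:__socket_rwv] 0-rep-client-0: EOF on socket",
    "[2017-07-28 10:44:46.926257] D [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now",
    "[2017-07-28 10:44:46.926287] D [MSGID: 0] [client.c:2264:client_rpc_notify] 0-rep-client-0: got RPC_CLNT_DISCONNECT",
]

if __name__ == "__main__":
    for timestamp, label in classify_disconnects(SAMPLE):
        print(timestamp, label)
```

Note that the EPOLLERR message is logged at debug level in the excerpt, so the client log level must be DEBUG or TRACE for this scan to find anything.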
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774