Hide Forgot
Description of problem: ======================= In my systemic setup, https://docs.google.com/spreadsheets/d/1iP5Mi1TewBFVh8HTmlcBm9072Bgsbgkr3CLcGmawDys/edit#gid=632186609 amongst many clients, I had two rhel 6 clients from where I was doing append to the same file from both the clients parallely as below: for i in {1..100000};do for j in {1..100000};do echo "hi this line is from $HOSTNAME" >> deadlockfile ;done; echo "########################" >> deadlockfile ; echo " LOOP $i from $HOSTNAME" >> deadlockfile ;done It ran for about a week, and while this was happening , I brought down one brick each from each replica pair as below: [root@dhcp35-191 bricks]# gluster v status Status of volume: distrepvol Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.70.35.191:/rhs/brick1/distrepvol N/A N/A N N/A Brick 10.70.37.187:/rhs/brick1/distrepvol 49154 0 Y 3867 Brick 10.70.35.3:/rhs/brick1/distrepvol N/A N/A N N/A Brick 10.70.37.150:/rhs/brick1/distrepvol 49154 0 Y 3918 Brick 10.70.35.191:/rhs/brick2/distrepvol 49155 0 Y 7568 Brick 10.70.37.187:/rhs/brick2/distrepvol N/A N/A N N/A Brick 10.70.35.3:/rhs/brick2/distrepvol 49155 0 Y 5341 Brick 10.70.37.150:/rhs/brick2/distrepvol N/A N/A N N/A Snapshot Daemon on localhost 49152 0 Y 8358 Self-heal Daemon on localhost N/A N/A Y 7588 Quota Daemon on localhost N/A N/A Y 8211 Snapshot Daemon on 10.70.37.150 49152 0 Y 4477 Self-heal Daemon on 10.70.37.150 N/A N/A Y 3957 Quota Daemon on 10.70.37.150 N/A N/A Y 4380 Snapshot Daemon on 10.70.37.187 49152 0 Y 4428 Self-heal Daemon on 10.70.37.187 N/A N/A Y 3907 Quota Daemon on 10.70.37.187 N/A N/A Y 4330 Snapshot Daemon on 10.70.35.3 49152 0 Y 5858 Self-heal Daemon on 10.70.35.3 N/A N/A Y 5361 Quota Daemon on 10.70.35.3 N/A N/A Y 5762 Task Status of Volume distrepvol ------------------------------------------------------------------------------ There are no active volume tasks When the bricks were brought down I noticed the following error logs on fuse client: [2016-10-05 11:18:25.461727] W [socket.c:590:__socket_rwv] 0-distrepvol-client-0: readv on 10.70.35.191:49154 failed (Connection reset by peer) [2016-10-05 11:18:25.462270] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1f2)[0x7fc88e04a442] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7fc88de12747] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fc88de1285e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x99)[0x7fc88de12919] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1cf)[0x7fc88de12b3f] ))))) 0-distrepvol-client-0: forced unwinding frame type(GlusterFS 3.3) op(FINODELK(30)) called at 2016-10-05 11:18:25.347100 (xid=0x50cde43f) [2016-10-05 11:18:25.462293] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-distrepvol-client-0: remote operation failed [Transport endpoint is not connected] [2016-10-05 11:18:25.462487] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1f2)[0x7fc88e04a442] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7fc88de12747] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fc88de1285e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x99)[0x7fc88de12919] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1cf)[0x7fc88de12b3f] ))))) 0-distrepvol-client-0: forced unwinding frame type(GlusterFS 3.3) op(RELEASE(41)) called at 2016-10-05 11:18:25.346857 (xid=0x50cde43d) [2016-10-05 11:18:25.462631] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1f2)[0x7fc88e04a442] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7fc88de12747] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fc88de1285e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x99)[0x7fc88de12919] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1cf)[0x7fc88de12b3f] ))))) 0-distrepvol-client-0: forced unwinding frame type(GlusterFS 3.3) op(FLUSH(15)) called at 2016-10-05 11:18:25.346926 (xid=0x50cde43e) [2016-10-05 11:18:25.462650] E [MSGID: 114031] [client-rpc-fops.c:919:client3_3_flush_cbk] 0-distrepvol-client-0: remote operation failed [Transport endpoint is not connected] [2016-10-05 11:18:25.462696] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-distrepvol-client-0: disconnected from distrepvol-client-0. Client process will keep trying to connect to glusterd until brick's port is available [2016-10-05 11:18:35.826378] I [rpc-clnt.c:1947:rpc_clnt_reconfig] 0-distrepvol-client-0: changing port to 49154 (from 0) [2016-10-05 11:18:35.829761] E [socket.c:2309:socket_connect_finish] 0-distrepvol-client-0: connection to 10.70.35.191:49154 failed (Connection refused) [2016-10-05 11:18:39.829510] I [rpc-clnt.c:1947:rpc_clnt_reconfig] 0-distrepvol-client-0: changing port to 49154 (from 0) [2016-10-05 11:18:39.832450] E [socket.c:2309:socket_connect_finish] 0-distrepvol-client-0: connection to 10.70.35.191:49154 failed (Connection refused) [2016-10-05 11:18:43.833228] I [rpc-clnt.c:1947:rpc_clnt_reconfig] 0-distrepvol-client-0: changing port to 49154 (from 0) [2016-10-05 11:18:43.836276] E [socket.c:2309:socket_connect_finish] 0-distrepvol-client-0: connection to 10.70.35.191:49154 failed (Connection refused) [2016-10-05 11:18:47.836093] I [rpc-clnt.c:1947:rpc_clnt_reconfig] 0-distrepvol-client-0: changing port to 49154 (from 0) [2016-10-05 11:18:47.838921] E [socket.c:2309:socket_connect_finish] 0-distrepvol-client-0: connection to 10.70.35.191:49154 failed (Connection refused) [2016-10-05 11:18:51.839247] I [rpc-clnt.c:1947:rpc_clnt_reconfig] 0-distrepvol-client-0: changing port to 49154 (from 0) [2016-10-05 11:18:51.842059] E [socket.c:2309:socket_connect_finish] 0-distrepvol-client-0: connection to 10.70.35.191:49154 failed (Connection refused) [2016-10-05 11:18:52.474766] W [socket.c:590:__socket_rwv] 0-distrepvol-client-5: readv on 10.70.37.187:49155 failed (No data available) [2016-10-05 11:18:52.474821] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-distrepvol-client-5: disconnected from distrepvol-client-5. Client process will keep trying to connect to glusterd until brick's port is available [2016-10-05 11:18:55.841724] I [rpc-clnt.c:1947:rpc_clnt_reconfig] 0-distrepvol-client-0: changing port to 49154 (from 0) [2016-10-05 11:18:55.844622] E [socket.c:2309:socket_connect_finish] 0-distrepvol-client-0: connection to 10.70.35.191:49154 failed (Connection refused) [2016-10-05 11:18:59.845327] I [rpc-clnt.c:1947:rpc_clnt_reconfig] 0-distrepvol-client-0: changing port to 49154 (from 0) [2016-10-05 11:18:59.848412] E [socket.c:2309:socket_connect_finish] 0-distrepvol-client-0: connection to 10.70.35.191:49154 failed (Connection refused) [2016-10-05 11:18:25.462824] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-distrepvol-client-0: remote operation failed [Transport endpoint is not connected] Version-Release number of selected component (if applicable): ============== client side glusterfs-3.8.4-1.el6.x86_64 glusterfs-cli-3.8.4-1.el6.x86_64 glusterfs-libs-3.8.4-1.el6.x86_64 glusterfs-client-xlators-3.8.4-1.el6.x86_64 glusterfs-api-3.8.4-1.el6.x86_64 glusterfs-debuginfo-3.8.4-1.el6.x86_64 glusterfs-fuse-3.8.4-1.el6.x86_64 [root@dhcp35-29 glusterfs]#
These logs are important for debugging ping-timeout related bugs. So it is not a good idea to remove these from appearing. When a disconnect happens after the operations are sent to bricks, these logs appear. It prints the time at which operation was sent etc, which is crucial in debugging ping-timeouts.