Bug 809180

Summary: random "failed to send the fop: Transport endpoint is not connected" errors crop up from time to time

Product: [Red Hat Storage] Red Hat Gluster Storage
Component: rdma
Version: 1.0
Hardware: x86_64
OS: Linux
Severity: medium
Priority: medium
Status: CLOSED CURRENTRELEASE
Reporter: csb sysadmin <admin>
Assignee: Raghavendra G <rgowdapp>
QA Contact: shylesh <shmohan>
CC: gluster-bugs, nsathyan, poelstra, rwheeler, sdharane, surs, vagarwal
Doc Type: Bug Fix
Last Closed: 2015-02-13 09:43:21 UTC

Description csb sysadmin 2012-04-02 16:58:18 UTC
Description of problem:

We have a 5-brick, InfiniBand-connected distributed Gluster setup running 3.2.5. Sometimes it looks like glusterfsd or glusterd on one of the bricks stops responding, and users can then no longer access some of the files. Doing /etc/init.d/glusterd restart seems to fix the issue (this restarts both glusterd and glusterfsd). Here are some errors from etc-glusterfs-glusterd.vol.log on the brick where glusterd or glusterfsd stopped responding:

[2012-03-27 13:07:27.913560] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (160.129.138.235:991)
[2012-03-27 13:07:27.975652] I [glusterd-handler.c:2693:glusterd_handle_cluster_unlock] 0-glusterd: Received UNLOCK from uuid: 23ad1eee-3a2f-4481-9896-1ff35ba8bbc3
[2012-03-27 13:07:27.975714] I [glusterd-handler.c:2671:glusterd_op_unlock_send_resp] 0-glusterd: Responded to unlock, ret: 0
[2012-03-27 13:07:28.183484] E [rdma.c:4468:rdma_event_handler] 0-rpc-transport/rdma: rdma.management: pollin received on tcp socket (peer: 10.2.178.24:968) after handshake is complete
[... several more of the messages above ...]
[2012-04-02 01:05:51.590620] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (127.0.0.1:1023)
[2012-04-02 01:05:51.645985] W [socket.c:1494:__socket_proto_state_machine] 0-management: reading from socket failed. Error (Transport endpoint is not connected), peer (/tmp/9263d7875a61e0b2671a8ca2095c9492.socket)
[2012-04-02 01:05:52.221546] E [socket.c:2080:socket_connect] 0-management: connection attempt failed (Connection refused)

and the log from the client (tons of these warnings/errors) :

[2012-04-02 10:42:01.978360] W [client3_1-fops.c:2606:client3_1_lookup] 0-pirdist-client-3: failed to send the fop: Transport endpoint is not connected
[2012-04-02 10:42:01.982460] W [client3_1-fops.c:5253:client3_1_readdirp] 0-pirdist-client-3: (1): failed to get fd ctx. EBADFD
[2012-04-02 10:42:01.982495] W [client3_1-fops.c:5317:client3_1_readdirp] 0-pirdist-client-3: failed to send the fop: File descriptor in bad state
[2012-04-02 10:42:03.970684] E [rdma.c:4417:tcp_connect_finish] 0-pirdist-client-3: tcp connect to 10.2.178.27:24010 failed (Connection refused)
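
For reference, the check-and-restart workaround described above looks roughly like the following on the affected brick. This is only a sketch of what we run by hand; the volume name pirdist matches our setup, and the exact commands may differ on other installs:

# Check whether glusterd and the brick daemons are still alive
ps -C glusterd,glusterfsd -o pid,etime,args

# Ask glusterd for its view of the peers/volume; this tends to hang or
# error out when the daemon is wedged
gluster peer status
gluster volume info pirdist

# Workaround: restart glusterd (this also respawns glusterfsd for the bricks)
/etc/init.d/glusterd restart

# Confirm the brick/management ports are listening again
netstat -ltnp | grep gluster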

Version-Release number of selected component (if applicable):

3.2.5

How reproducible:

It happens every few days or every other week, depending on glusterfs load. I'm guessing that higher I/O load might be causing this, but I don't know for sure.

Steps to Reproduce:
1. Set up a 5-brick, IB-connected distributed glusterfs volume.
2. Set up some clients that connect via IB and others that connect via TCP/IP.
3. Generate I/O load, e.g. using bonnie or iozone, mainly from the IB-connected clients but from the TCP/IP-connected ones as well (see the sketch below).
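
For illustration, the kind of load generated in step 3 looks roughly like this; the mount point /mnt/pirdist and the sizes are placeholders, not the exact values we used:

# Run from a client with the volume mounted (mount point is a placeholder)
cd /mnt/pirdist

# iozone: automatic mode over a range of file/record sizes, capped at 4 GB files
iozone -a -g 4G -f /mnt/pirdist/iozone.tmp

# bonnie++: sequential/random I/O plus small-file create/stat/delete
bonnie++ -d /mnt/pirdist -s 8g -n 64 -u nobody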
  
Actual results:

glusterd and/or glusterfsd stops responding on one of the bricks.

Expected results:

glusterd and/or glusterfsd should not stop responding. If for some reason it does, perhaps there should be some kind of watchdog process that restarts it on the brick (a rough sketch of what that might look like is below).
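
As an illustration only, the watchdog could be as simple as a cron-driven script along these lines. The health check via 'gluster volume info' and the use of pgrep/timeout are assumptions about what might work here, not something we have in place; a proper fix inside glusterd would obviously be preferable:

#!/bin/bash
# Hypothetical watchdog sketch (not part of glusterfs): run from cron every
# few minutes on each brick. Assumes a wedged glusterd/glusterfsd either
# stops answering CLI queries or disappears from the process table.

if ! timeout 30 gluster volume info > /dev/null 2>&1 \
   || ! pgrep -x glusterfsd > /dev/null; then
    logger -t gluster-watchdog "glusterd/glusterfsd unresponsive, restarting"
    /etc/init.d/glusterd restart
fi

A crontab entry such as */5 * * * * root /usr/local/sbin/gluster-watchdog.sh (path hypothetical) would run the check every five minutes.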

Additional info:

Comment 1 Amar Tumballi 2012-04-17 11:06:21 UTC
Moving to the rdma component, as the volumes are of type RDMA.

Comment 3 Amar Tumballi 2012-07-11 11:40:16 UTC
Looks like a readlink buffer issue. It should be fixed in the first update.

Comment 4 Amar Tumballi 2012-08-23 06:45:26 UTC
This bug is not seen on the current master branch (which will be branched as RHS 2.1.0 soon). To consider it for fixing, we want to make sure the bug still exists on RHS servers. If it cannot be reproduced, we would like to close it.

Comment 5 Amar Tumballi 2012-12-21 10:30:22 UTC
Not seen on the master branch (glusterfs-3.4.0qa6).

Comment 6 Sachidananda Urs 2013-08-08 05:42:45 UTC
Moving out of Big Bend (2.1), since RDMA support is not available in Big Bend.

Comment 9 Nagaprasad Sathyanarayana 2014-05-27 04:48:00 UTC
As per comment #8, moving this BZ out of Denali.