Bug 809180 - Random "failed to send the fop: Transport endpoint is not connected" errors crop up from time to time
Status: CLOSED CURRENTRELEASE
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: rdma
Version: 1.0
Hardware: x86_64 Linux
Priority: medium  Severity: medium
Assigned To: Raghavendra G
QA Contact: shylesh
Depends On:
Blocks:
Reported: 2012-04-02 12:58 EDT by csb sysadmin
Modified: 2015-02-13 04:43 EST
CC: 7 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-02-13 04:43:21 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description csb sysadmin 2012-04-02 12:58:18 EDT
Description of problem:

We have a 5-brick, InfiniBand-connected distributed Gluster setup running 3.2.5. Sometimes it looks like glusterfsd or glusterd on one of the bricks stops responding, and then users can't access some of the files. Running /etc/init.d/glusterd restart seems to fix the issue (this restarts both glusterd and glusterfsd). Here are some errors from etc-glusterfs-glusterd.vol.log on the brick where glusterd or glusterfsd stopped responding:

[2012-03-27 13:07:27.913560] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (160.129.138.235:991)
[2012-03-27 13:07:27.975652] I [glusterd-handler.c:2693:glusterd_handle_cluster_unlock] 0-glusterd: Received UNLOCK from uuid: 23ad1eee-3a2f-4481-9896-1ff35ba8bbc3
[2012-03-27 13:07:27.975714] I [glusterd-handler.c:2671:glusterd_op_unlock_send_resp] 0-glusterd: Responded to unlock, ret: 0
[2012-03-27 13:07:28.183484] E [rdma.c:4468:rdma_event_handler] 0-rpc-transport/rdma: rdma.management: pollin received on tcp socket (peer: 10.2.178.24:968) after handshake is complete
[... several more of the messages above ...]
[2012-04-02 01:05:51.590620] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (127.0.0.1:1023)
[2012-04-02 01:05:51.645985] W [socket.c:1494:__socket_proto_state_machine] 0-management: reading from socket failed. Error (Transport endpoint is not connected), peer (/tmp/9263d7875a61e0b2671a8ca2095c9492.socket)
[2012-04-02 01:05:52.221546] E [socket.c:2080:socket_connect] 0-management: connection attempt failed (Connection refused)

and the log from the client (tons of these warnings/errors):

[2012-04-02 10:42:01.978360] W [client3_1-fops.c:2606:client3_1_lookup] 0-pirdist-client-3: failed to send the fop: Transport endpoint is not connected
[2012-04-02 10:42:01.982460] W [client3_1-fops.c:5253:client3_1_readdirp] 0-pirdist-client-3: (1): failed to get fd ctx. EBADFD
[2012-04-02 10:42:01.982495] W [client3_1-fops.c:5317:client3_1_readdirp] 0-pirdist-client-3: failed to send the fop: File descriptor in bad state
[2012-04-02 10:42:03.970684] E [rdma.c:4417:tcp_connect_finish] 0-pirdist-client-3: tcp connect to 10.2.178.27:24010 failed (Connection refused)
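
A rough sketch of how one might confirm which daemon has stopped responding on the affected brick before applying the restart workaround described above. Port 24010 is the brick port from the client log; everything else here is illustrative, not our exact layout:

# check whether the other peers still see this brick
gluster peer status

# check whether the management and brick daemons are still running
pgrep -l glusterd
pgrep -l glusterfsd

# check whether the brick port still accepts connections (24010 as seen in the client log)
nc -z -w 5 localhost 24010

# workaround: restart glusterd, which also restarts glusterfsd (as noted above)
/etc/init.d/glusterd restart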

Version-Release number of selected component (if applicable):

3.2.5

How reproducible:

It happens every few days or every other week, depending on the GlusterFS load. I'm guessing higher I/O load might be causing this, but I don't know for sure.

Steps to Reproduce:
1. Set up a 5-brick distributed, IB-connected GlusterFS setup.
2. Set up clients that connect via IB as well as clients that connect via TCP/IP.
3. Generate I/O load, e.g. using bonnie or iozone, mainly from the IB-connected clients but from the TCP/IP-connected ones as well (see the sketch after this list).
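
A rough sketch of what these steps might look like. The volume name pirdist comes from the client log; the hostnames, brick paths, and load-generator options below are illustrative assumptions, not the exact commands used:

# on the servers: 5-brick distributed volume over RDMA (hostnames/paths are placeholders)
gluster volume create pirdist transport rdma \
    server1:/export/brick server2:/export/brick server3:/export/brick \
    server4:/export/brick server5:/export/brick
gluster volume start pirdist

# on an IB-connected client
mount -t glusterfs -o transport=rdma server1:/pirdist /mnt/pirdist

# on a TCP/IP-connected client
mount -t glusterfs -o transport=tcp server1:/pirdist /mnt/pirdist

# generate sustained I/O from several clients at once
iozone -a -g 8g -f /mnt/pirdist/iozone.tmp
bonnie++ -d /mnt/pirdist -s 16g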
  
Actual results:

glusterd and/or glusterfsd stops responding on one of the bricks.

Expected results:

glusterd and/or glusterfsd should not stop responding. If for some reason it does, perhaps there should be some kind of watchdog process that restarts it on the brick.
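
A minimal sketch of the kind of watchdog suggested here, purely illustrative (the port, interval, and checks are assumptions; 24010 is the brick port seen in the client log):

#!/bin/bash
# Hypothetical watchdog: restart glusterd (which also restarts glusterfsd)
# when the brick process is gone or its port stops accepting connections.
BRICK_PORT=24010   # brick port from the client log; adjust per brick
while true; do
    if ! pgrep glusterfsd >/dev/null || \
       ! timeout 5 bash -c "exec 3<>/dev/tcp/127.0.0.1/$BRICK_PORT" 2>/dev/null; then
        logger "glusterfsd not responding on port $BRICK_PORT, restarting glusterd"
        /etc/init.d/glusterd restart
    fi
    sleep 60
done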

Additional info:
Comment 1 Amar Tumballi 2012-04-17 07:06:21 EDT
Moving to the rdma component, as the volumes are of type RDMA.
Comment 3 Amar Tumballi 2012-07-11 07:40:16 EDT
Looks like a readlink buffer issue. It should be fixed in the first update.
Comment 4 Amar Tumballi 2012-08-23 02:45:26 EDT
This bug is not seen on the current master branch (which will be branched as RHS 2.1.0 soon). To consider it for fixing, we want to make sure the bug still exists on RHS servers. If it cannot be reproduced, we would like to close this.
Comment 5 Amar Tumballi 2012-12-21 05:30:22 EST
on master branch (glusterfs-3.4.0qa6)
Comment 6 Sachidananda Urs 2013-08-08 01:42:45 EDT
Moving out of Big Bend since RDMA support is not available in Big Bend (2.1).
Comment 9 Nagaprasad Sathyanarayana 2014-05-27 00:48:00 EDT
As per comment #8, moving this BZ out of Denali.
