809180 – random, failed to send the fop: Transport endpoint is not connected errors crop up from time to time

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	rdma
Sub Component:
Version:	1.0
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Raghavendra G
QA Contact:	shylesh
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2012-04-02 16:58 UTC by csb sysadmin
Modified:	2015-02-13 09:43 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-02-13 09:43:21 UTC
Embargoed:
Dependent Products:

Description csb sysadmin 2012-04-02 16:58:18 UTC

Description of problem:

We have a 5-brick, InfiniBand-connected distributed Gluster setup running 3.2.5. Occasionally glusterfsd or glusterd on one of the bricks appears to stop responding, and users can then no longer access some of the files. Running /etc/init.d/glusterd restart seems to fix the issue (this restarts both glusterd and glusterfsd). Here are some errors from etc-glusterfs-glusterd.vol.log on the brick where glusterd or glusterfsd stopped responding:

[2012-03-27 13:07:27.913560] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (160.129.138.235:991)
[2012-03-27 13:07:27.975652] I [glusterd-handler.c:2693:glusterd_handle_cluster_unlock] 0-glusterd: Received UNLOCK from uuid: 23ad1eee-3a2f-4481-9896-1ff35ba8bbc3
[2012-03-27 13:07:27.975714] I [glusterd-handler.c:2671:glusterd_op_unlock_send_resp] 0-glusterd: Responded to unlock, ret: 0
[2012-03-27 13:07:28.183484] E [rdma.c:4468:rdma_event_handler] 0-rpc-transport/rdma: rdma.management: pollin received on tcp socket (peer: 10.2.178.24:968) after handshake is complete
[... several more of the messages above ...]
[2012-04-02 01:05:51.590620] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (127.0.0.1:1023)
[2012-04-02 01:05:51.645985] W [socket.c:1494:__socket_proto_state_machine] 0-management: reading from socket failed. Error (Transport endpoint is not connected), peer (/tmp/9263d7875a61e0b2671a8ca2095c9492.socket)
[2012-04-02 01:05:52.221546] E [socket.c:2080:socket_connect] 0-management: connection attempt failed (Connection refused)

And the log from the client (there are tons of these warnings/errors):

[2012-04-02 10:42:01.978360] W [client3_1-fops.c:2606:client3_1_lookup] 0-pirdist-client-3: failed to send the fop: Transport endpoint is not connected
[2012-04-02 10:42:01.982460] W [client3_1-fops.c:5253:client3_1_readdirp] 0-pirdist-client-3: (1): failed to get fd ctx. EBADFD
[2012-04-02 10:42:01.982495] W [client3_1-fops.c:5317:client3_1_readdirp] 0-pirdist-client-3: failed to send the fop: File descriptor in bad state
[2012-04-02 10:42:03.970684] E [rdma.c:4417:tcp_connect_finish] 0-pirdist-client-3: tcp connect to 10.2.178.27:24010 failed (Connection refused)
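When a client log holds "tons" of these lines, it can help to see which client subvolume is reporting the disconnects. The following is a minimal sketch, not a GlusterFS tool: the regular expression is inferred only from the log excerpts in this report (timestamp, level letter, file:line:function, a domain such as 0-pirdist-client-3, then the message), and the names LOG_LINE, parse_line and count_disconnects are hypothetical.

```python
import re
from collections import Counter

# Layout inferred from the excerpts in this report (an assumption, not a spec):
# [timestamp] LEVEL [file.c:line:function] domain: message
LOG_LINE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] "
    r"(?P<level>[A-Z]) "
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\] "
    r"(?P<domain>\S+): (?P<msg>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def count_disconnects(lines, needle="Transport endpoint is not connected"):
    """Count lines whose message contains `needle`, grouped by log domain."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record and needle in record["msg"]:
            counts[record["domain"]] += 1
    return counts
```

Calling count_disconnects on an open log file (for example `count_disconnects(open(path))`, with path pointing at the client log) yields one counter entry per domain, which makes it easier to tell whether a single brick connection or several are affected.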

Version-Release number of selected component (if applicable):

3.2.5

How reproducible:

Happens every few days to every other week, depending on glusterfs load. I'm guessing higher I/O load might be causing this, but I don't know for sure.

Steps to Reproduce:
1. Set up a 5-brick, distributed, IB-connected glusterfs setup.
2. Set up other clients, some that connect via IB and some that connect via TCP/IP.
3. Generate I/O load, e.g. using bonnie or iozone, mainly from the IB-connected clients but from the TCP/IP-connected ones as well.
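Where bonnie or iozone is not installed on a client, step 3 could be approximated with a small write/read-back loop. This is only a rough stand-in under stated assumptions, not an equivalent benchmark: the function name generate_load and its parameters are hypothetical, and the directory passed in is assumed to be somewhere on the mounted volume.

```python
import hashlib
import os


def generate_load(directory, files=4, size=1 << 20):
    """Write `files` files of `size` random bytes, read each back and verify it.

    Returns the total number of bytes written and verified.
    """
    verified = 0
    for i in range(files):
        path = os.path.join(directory, f"load-{i}.bin")
        data = os.urandom(size)
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the write out of the local page cache
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).digest() != hashlib.sha256(data).digest():
                raise IOError(f"read-back mismatch on {path}")
        os.remove(path)
        verified += size
    return verified
```

Running it in a loop from several clients at once, with larger values for files and size, would come closer to the mixed IB and TCP/IP load described in the steps above.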
  
Actual results:

glusterd and/or glusterfsd stops responding on one of the bricks.

Expected results:

glusterd and/or glusterfsd should not stop responding. If for some reason one does, perhaps there should be some kind of watchdog process that restarts it on the brick.
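As an illustration of the watchdog idea, here is a minimal sketch that could run on a brick. It rests on several assumptions: a successful TCP connect is treated as "responding", which is a weak check and may miss a daemon that accepts connections but is otherwise hung; port 24007 is assumed to be the glusterd management port, while 24010 is the brick port that appears in the client log above; and the function names are hypothetical. The restart command is the one from the description.

```python
import socket
import subprocess


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def restart_glusterd():
    # Command taken from the report; it restarts both glusterd and glusterfsd.
    subprocess.run(["/etc/init.d/glusterd", "restart"], check=False)


def watchdog_once(endpoints, probe=port_is_open, restart=restart_glusterd):
    """Probe each (host, port) pair and restart once if any probe fails.

    Returns the list of endpoints that failed the probe.
    """
    failed = [(host, port) for host, port in endpoints if not probe(host, port)]
    if failed:
        restart()
    return failed
```

A cron job or a simple loop with a sleep could call `watchdog_once([("127.0.0.1", 24007), ("127.0.0.1", 24010)])` every minute or so. The probe and restart callables are parameters so the check can be replaced with something stronger than a TCP connect.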

Additional info:

Comment 1 Amar Tumballi 2012-04-17 11:06:21 UTC

Moving to component rdma, as the volumes are of type RDMA.

Comment 3 Amar Tumballi 2012-07-11 11:40:16 UTC

Looks like a readlink buffer issue. Should be fixed in the first update.

Comment 4 Amar Tumballi 2012-08-23 06:45:26 UTC

This bug is not seen in the current master branch (which will get branched as RHS 2.1.0 soon). Before considering it for a fix, we want to make sure the bug still exists on RHS servers. If it cannot be reproduced, we would like to close this.

Comment 5 Amar Tumballi 2012-12-21 10:30:22 UTC

on master branch (glusterfs-3.4.0qa6)

Comment 6 Sachidananda Urs 2013-08-08 05:42:45 UTC

Moving out of Big Bend since RDMA support is not available in Big Bend, 2.1.

Comment 9 Nagaprasad Sathyanarayana 2014-05-27 04:48:00 UTC

As per comment #8, moving this BZ out of Denali.
