Description of problem:
========================
In a 1 x 2 replicate volume, when a storage node goes offline, dd on the FUSE mount fails with "Transport endpoint is not connected" and dd on the NFS mount hangs.

Version-Release number of selected component (if applicable):
===========================================================
glusterfs 3.4.0.18rhs built on Aug 7 2013 08:02:45

How reproducible:

Steps to Reproduce:
=======================
1. Create a 1 x 2 replicate volume with 2 storage nodes and 1 brick per storage node. Set background-self-heal-count to 0, data-self-heal to "off" and self-heal-daemon to "on". (A scripted condensation of steps 1-7 follows this list.)
2. Create FUSE and NFS mounts. { the NFS client mounts from storage_node2's NFS server }
3. From the FUSE mount, execute "dd if=/dev/urandom of=test_file bs=1M count=10240".
4. From the NFS mount, execute "dd if=/dev/urandom of=test_file bs=1M count=10240".
5. Set "self-heal-daemon" to off from one of the storage nodes.
6. While the dd on both mount points is in progress, kill all the gluster processes on storage_node1.
7. Delete the brick directory and recreate it on storage_node1.
8. After a while, dd on the FUSE mount failed with "Transport endpoint is not connected" and dd on the NFS mount hung.
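For convenience, a condensed, untested sketch of steps 1-7. Host names and brick paths are taken from the volume info below; the mount points /mnt/fuse and /mnt/nfs and the choice of server for the FUSE mount are assumptions:

# sketch only -- hicks is storage_node1, king is storage_node2
gluster volume create vol_rep replica 2 hicks:/rhs/bricks/b0 king:/rhs/bricks/b1
gluster volume set vol_rep cluster.background-self-heal-count 0
gluster volume set vol_rep cluster.data-self-heal off
gluster volume set vol_rep cluster.self-heal-daemon on
gluster volume start vol_rep

mount -t glusterfs king:/vol_rep /mnt/fuse        # FUSE client
mount -t nfs -o vers=3 king:/vol_rep /mnt/nfs     # NFS client, storage_node2's server

dd if=/dev/urandom of=/mnt/fuse/test_file bs=1M count=10240 &
dd if=/dev/urandom of=/mnt/nfs/test_file bs=1M count=10240 &

gluster volume set vol_rep cluster.self-heal-daemon off

# on storage_node1 (hicks), while both dd runs are in flight:
pkill gluster                                     # kills glusterd/glusterfsd/glusterfs
rm -rf /rhs/bricks/b0 && mkdir -p /rhs/bricks/b0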
Actual results:
===============
FUSE mount
~~~~~~~~~~~~~~~
root@darrel [Aug-08-2013-17:06:44] >dd if=/dev/urandom of=./test_file bs=1M count=10240
dd: writing `./test_file': Transport endpoint is not connected
dd: closing output file `./test_file': Transport endpoint is not connected

NFS mount
~~~~~~~~~~~~~~~~
root@darrel [Aug-08-2013-17:06:44] >dd if=/dev/urandom of=./test_file bs=1M count=10240
^C
^C
^C
^C

Expected results:
dd shouldn't fail.

Additional info:
======================
The FUSE mount did not get a response from storage_node2, which was online throughout:

[2013-08-08 11:58:39.996077] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-vol_rep-client-1: server 10.70.34.119:49153 has not responded in the last 42 seconds, disconnecting.

root@king [Aug-08-2013-18:09:26] >gluster v info

Volume Name: vol_rep
Type: Replicate
Volume ID: b5e2a708-3442-410d-b3ad-f9f1edbda67b
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: hicks:/rhs/bricks/b0
Brick2: king:/rhs/bricks/b1
Options Reconfigured:
cluster.self-heal-daemon: off
cluster.background-self-heal-count: 0
cluster.data-self-heal: off

root@king [Aug-08-2013-18:09:29] >gluster v status
Status of volume: vol_rep
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick king:/rhs/bricks/b1                               49153   Y       12354
NFS Server on localhost                                 2049    Y       13018

There are no active volume tasks

root@king [Aug-08-2013-18:09:32] >./get_info.sh
ls -lh /rhs/bricks/b1/test_file
-rw-r--r-- 2 root root 5.7G Aug 8 17:27 /rhs/bricks/b1/test_file

getfattr -d -e hex -m . /rhs/bricks/b1/test_file
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/b1/test_file
trusted.afr.vol_rep-client-0=0x0000ddf70000000000000000
trusted.afr.vol_rep-client-1=0x0000004d0000000000000000
trusted.gfid=0x560314511b7e4f1587f2c4b3187b3bfd

ls -l /proc/`cat /var/lib/glusterd/vols/vol_rep/run/king-rhs-bricks-b1.pid`/fd
cat /var/lib/glusterd/vols/vol_rep/run/king-rhs-bricks-b1.pid
total 0
lr-x------ 1 root root 64 Aug 8 17:29 0 -> /dev/null
l-wx------ 1 root root 64 Aug 8 17:29 1 -> /dev/null
lrwx------ 1 root root 64 Aug 8 17:29 10 -> socket:[784900]
lr-x------ 1 root root 64 Aug 8 17:29 11 -> /dev/urandom
lr-x------ 1 root root 64 Aug 8 17:29 12 -> /rhs/bricks/b1
lrwx------ 1 root root 64 Aug 8 17:29 13 -> socket:[797630]
lrwx------ 1 root root 64 Aug 8 17:29 14 -> socket:[831017]
lrwx------ 1 root root 64 Aug 8 17:29 17 -> socket:[786965]
l-wx------ 1 root root 64 Aug 8 17:29 2 -> /dev/null
lrwx------ 1 root root 64 Aug 8 17:29 3 -> anon_inode:[eventpoll]
l-wx------ 1 root root 64 Aug 8 17:29 4 -> /var/log/glusterfs/bricks/rhs-bricks-b1.log
lrwx------ 1 root root 64 Aug 8 17:29 5 -> /var/lib/glusterd/vols/vol_rep/run/king-rhs-bricks-b1.pid
lrwx------ 1 root root 64 Aug 8 17:29 6 -> socket:[784884]
lrwx------ 1 root root 64 Aug 8 17:29 7 -> socket:[784911]
lrwx------ 1 root root 64 Aug 8 17:29 8 -> socket:[784893]
lrwx------ 1 root root 64 Aug 8 17:29 9 -> socket:[797451]

Tried to take statedumps after the dd failed. The brick statedump grew to 14 GB.
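For readability: each trusted.afr value above is the AFR changelog, three 32-bit big-endian counters for pending data, metadata and entry operations against the named client. A minimal bash sketch of the decoding (the hex values are pasted in, not fetched):

for val in 0000ddf70000000000000000 0000004d0000000000000000; do
    # bytes 0-3: data, 4-7: metadata, 8-11: entry pending counts
    echo "data=$((16#${val:0:8})) metadata=$((16#${val:8:8})) entry=$((16#${val:16:8}))"
done

This prints data=56823 for vol_rep-client-0 and data=77 for vol_rep-client-1, i.e. the surviving brick on king has tens of thousands of data operations recorded as pending against the wiped brick on hicks.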
After some time, dd on the NFS mount failed (EIO on the write, and EBADF when closing the input file):

root@darrel [Aug-08-2013-17:06:44] >dd if=/dev/urandom of=./test_file bs=1M count=10240
dmesg
^C
^C
^C
^C
dd: writing `./test_file': Input/output error
6372+0 records in
6371+0 records out
6680477696 bytes (6.7 GB) copied, 5331.16 s, 1.3 MB/s
dd: closing input file `/dev/urandom': Bad file descriptor
root@darrel [Aug-08-2013-18:35:49] >
root@darrel [Aug-08-2013-18:35:49] >
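The FUSE disconnect logged earlier matches the default network.ping-timeout of 42 seconds. Not a fix, but for triage it may help to confirm or temporarily raise the timeout to see whether the client rides out the window in which the brick is unresponsive (a sketch; standard volume tunable assumed):

gluster volume set help | grep -A2 network.ping-timeout   # show the option and its default
gluster volume set vol_rep network.ping-timeout 120       # raise it for the duration of the test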
SOS reports, statedumps: http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/995032/
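For reference, statedumps like the archived ones are generated per volume; a sketch, assuming the default dump directory:

gluster volume statedump vol_rep        # dump every brick process of the volume
gluster volume statedump vol_rep nfs    # dump the gluster NFS server process as well
ls /var/run/gluster/                    # dump files land here, named after brick path and PID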
Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release against which it was reported is now End of Life. Please see https://access.redhat.com/support/policy/updates/rhs/ If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.