Bug 995032

Summary: dd on fuse mount failed with "Transport endpoint is not connected" when a node goes offline
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: spandura
Component: glusterfs
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED EOL
QA Contact: storage-qa-internal <storage-qa-internal>
Severity: high
Docs Contact:
Priority: unspecified
Version: 2.1
CC: rhs-bugs, vbellur
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-12-03 17:11:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description spandura 2013-08-08 12:42:40 UTC
Description of problem:
========================
In a 1 x 2 replicate volume, when a storage node goes offline, dd on the fuse mount fails with "Transport endpoint is not connected" and dd on the nfs mount hangs.

Version-Release number of selected component (if applicable):
===========================================================
glusterfs 3.4.0.18rhs built on Aug  7 2013 08:02:45

How reproducible:


Steps to Reproduce:
=======================
1. Create a 1 x 2 replicate volume with 2 storage nodes and 1 brick per storage node.
Set background-self-heal-count to 0, data-self-heal to "off", and self-heal-daemon to "on" (see the command sketch after these steps).

2. Create a fuse mount and an nfs mount (the nfs mount points to storage_node2's nfs server).

3. From the fuse mount, execute "dd if=/dev/urandom of=test_file bs=1M count=10240".

4. From the nfs mount, execute "dd if=/dev/urandom of=test_file bs=1M count=10240".

5. Set "self-heal-daemon" to off from one of the storage nodes.

6. While dd is in progress on both mount points, kill all the gluster processes on storage_node1.

7. Delete and recreate the brick directory on storage_node1.

8. After a while, dd on the fuse mount fails with "Transport endpoint is not connected" and dd on the nfs mount hangs.
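
A minimal command sketch of steps 1-7, for reference. Hostnames and brick paths are taken from the "gluster v info" output below; it assumes storage_node1 is hicks and storage_node2 is king (as the later volume status taken on king suggests), and the mount points, nfs mount options, and pkill pattern are illustrative:

# Step 1: create and start the volume, then set the options
gluster volume create vol_rep replica 2 hicks:/rhs/bricks/b0 king:/rhs/bricks/b1
gluster volume start vol_rep
gluster volume set vol_rep cluster.background-self-heal-count 0
gluster volume set vol_rep cluster.data-self-heal off
gluster volume set vol_rep cluster.self-heal-daemon on

# Step 2: fuse mount and nfs mount (nfs from storage_node2, i.e. king)
mount -t glusterfs king:/vol_rep /mnt/fuse
mount -t nfs -o vers=3,proto=tcp king:/vol_rep /mnt/nfs

# Step 5: turn the self-heal daemon off
gluster volume set vol_rep cluster.self-heal-daemon off

# Step 6 (on storage_node1): kill all gluster processes
pkill gluster

# Step 7 (on storage_node1): delete and recreate the brick directory
rm -rf /rhs/bricks/b0 && mkdir -p /rhs/bricks/b0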

Actual results:
===============
Fuse mount
~~~~~~~~~~~~~~~
root@darrel [Aug-08-2013-17:06:44] >dd if=/dev/urandom of=./test_file bs=1M count=10240
dd: writing `./test_file': Transport endpoint is not connected
dd: closing output file `./test_file': Transport endpoint is not connected

Nfs mount
~~~~~~~~~~~~~~~~
root@darrel [Aug-08-2013-17:06:44] >dd if=/dev/urandom of=./test_file bs=1M count=10240



^C

^C
^C
^C

Expected results:
=================
dd should not fail; the other replica (storage_node2) remained online throughout.

Additional info:
======================
The fuse mount did not get a response from storage_node2, which stayed online throughout the test.


[2013-08-08 11:58:39.996077] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-vol_rep-client-1: server 10.70.34.119:49153 has not responded in the last 42 seconds, disconnecting.
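
The 42-second window in that message corresponds to the default GlusterFS network.ping-timeout. As a rough illustration (the 60-second value is only an example), the timeout can be changed per volume, and the change then shows up under "Options Reconfigured" in "gluster v info":

gluster volume set vol_rep network.ping-timeout 60
gluster volume info vol_rep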



root@king [Aug-08-2013-18:09:26] >gluster v info
 
Volume Name: vol_rep
Type: Replicate
Volume ID: b5e2a708-3442-410d-b3ad-f9f1edbda67b
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: hicks:/rhs/bricks/b0
Brick2: king:/rhs/bricks/b1
Options Reconfigured:
cluster.self-heal-daemon: off
cluster.background-self-heal-count: 0
cluster.data-self-heal: off

root@king [Aug-08-2013-18:09:29] >gluster v status
Status of volume: vol_rep
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick king:/rhs/bricks/b1				49153	Y	12354
NFS Server on localhost					2049	Y	13018
 
There are no active volume tasks

root@king [Aug-08-2013-18:09:32] >./get_info.sh 
ls -lh /rhs/bricks/b1/test_file
-rw-r--r-- 2 root root 5.7G Aug  8 17:27 /rhs/bricks/b1/test_file

getfattr -d -e hex -m . /rhs/bricks/b1/test_file
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/b1/test_file
trusted.afr.vol_rep-client-0=0x0000ddf70000000000000000
trusted.afr.vol_rep-client-1=0x0000004d0000000000000000
trusted.gfid=0x560314511b7e4f1587f2c4b3187b3bfd
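
For reference, each trusted.afr.<volname>-client-N value above is the AFR changelog: three big-endian 32-bit counters for pending data, metadata, and entry operations against that brick. A minimal bash sketch to decode the client-0 value shown above, assuming the standard counter order:

v=0000ddf70000000000000000                 # trusted.afr.vol_rep-client-0
echo "data pending:     $((16#${v:0:8}))"  # 0x0000ddf7 = 56823 pending data ops
echo "metadata pending: $((16#${v:8:8}))"  # 0
echo "entry pending:    $((16#${v:16:8}))" # 0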


ls -l /proc/`cat /var/lib/glusterd/vols/vol_rep/run/king-rhs-bricks-b1.pid`/fd
cat /var/lib/glusterd/vols/vol_rep/run/king-rhs-bricks-b1.pid
total 0
lr-x------ 1 root root 64 Aug  8 17:29 0 -> /dev/null
l-wx------ 1 root root 64 Aug  8 17:29 1 -> /dev/null
lrwx------ 1 root root 64 Aug  8 17:29 10 -> socket:[784900]
lr-x------ 1 root root 64 Aug  8 17:29 11 -> /dev/urandom
lr-x------ 1 root root 64 Aug  8 17:29 12 -> /rhs/bricks/b1
lrwx------ 1 root root 64 Aug  8 17:29 13 -> socket:[797630]
lrwx------ 1 root root 64 Aug  8 17:29 14 -> socket:[831017]
lrwx------ 1 root root 64 Aug  8 17:29 17 -> socket:[786965]
l-wx------ 1 root root 64 Aug  8 17:29 2 -> /dev/null
lrwx------ 1 root root 64 Aug  8 17:29 3 -> anon_inode:[eventpoll]
l-wx------ 1 root root 64 Aug  8 17:29 4 -> /var/log/glusterfs/bricks/rhs-bricks-b1.log
lrwx------ 1 root root 64 Aug  8 17:29 5 -> /var/lib/glusterd/vols/vol_rep/run/king-rhs-bricks-b1.pid
lrwx------ 1 root root 64 Aug  8 17:29 6 -> socket:[784884]
lrwx------ 1 root root 64 Aug  8 17:29 7 -> socket:[784911]
lrwx------ 1 root root 64 Aug  8 17:29 8 -> socket:[784893]
lrwx------ 1 root root 64 Aug  8 17:29 9 -> socket:[797451]


Statedumps were taken after the dd failed. The brick statedump grew up to 14 GB.
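
For reference, a minimal sketch of how such statedumps are taken with the standard gluster CLI (dumps land in the default statedump directory, e.g. /var/run/gluster, unless server.statedump-path is set):

gluster volume statedump vol_rep        # brick processes
gluster volume statedump vol_rep nfs    # gluster NFS server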

Comment 2 spandura 2013-08-08 13:08:08 UTC
After some time, dd on the nfs mount failed with EBADFD.


root@darrel [Aug-08-2013-17:06:44] >dd if=/dev/urandom of=./test_file bs=1M count=10240
dmesg


^C

^C
^C
^C
dd: writing `./test_file': Input/output error
6372+0 records in
6371+0 records out
6680477696 bytes (6.7 GB) copied, 5331.16 s, 1.3 MB/s
dd: closing input file `/dev/urandom': Bad file descriptor
root@darrel [Aug-08-2013-18:35:49] >
root@darrel [Aug-08-2013-18:35:49] >

Comment 3 spandura 2013-08-08 13:20:06 UTC
SOS Reports , Statedumps : http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/995032/

Comment 4 Vivek Agarwal 2015-12-03 17:11:14 UTC
Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release you requested us to review is now End of Life. Please see https://access.redhat.com/support/policy/updates/rhs/

If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.