Bug 995032 - dd on fuse mount failed with "Transport endpoint is not connected" when a node goes offline
Status: CLOSED EOL
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterfs
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Assigned To: Bug Updates Notification Mailing List
storage-qa-internal@redhat.com
Reported: 2013-08-08 08:42 EDT by spandura
Modified: 2015-12-03 12:11 EST

Doc Type: Bug Fix
Last Closed: 2015-12-03 12:11:14 EST
Type: Bug


Attachments: None
Description spandura 2013-08-08 08:42:40 EDT
Description of problem:
========================
In a 1 x 2 replicate volume, when a storage node goes offline, dd on the fuse mount fails with "Transport endpoint is not connected" and dd on the nfs mount hangs.

Version-Release number of selected component (if applicable):
===========================================================
glusterfs 3.4.0.18rhs built on Aug  7 2013 08:02:45

How reproducible:


Steps to Reproduce:
=======================
1. Create a 1 x 2 replicate volume with 2 storage nodes and 1 brick per storage node. Set background-self-heal-count to 0, data-self-heal to "off", and self-heal-daemon to "on".

2. Create a fuse mount and an nfs mount (the nfs mount points at storage_node2's nfs server).

3. From the fuse mount, execute "dd if=/dev/urandom of=test_file bs=1M count=10240".

4. From the nfs mount, execute "dd if=/dev/urandom of=test_file bs=1M count=10240".

5. Set "self-heal-daemon" to off from one of the storage nodes.

6. While the dd on both mount points is in progress, kill all the gluster processes on storage_node1.

7. Delete the brick directory on storage_node1 and recreate it.

8. After a while, dd on the fuse mount fails with "Transport endpoint is not connected" and dd on the nfs mount hangs.
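The steps above can be sketched as shell commands. Host names, brick paths, and mount points are taken from the volume info later in this report; the exact commands are a best-effort reconstruction, not the reporter's script:

```shell
# On one storage node: create and configure the 1 x 2 replicate volume.
gluster volume create vol_rep replica 2 hicks:/rhs/bricks/b0 king:/rhs/bricks/b1
gluster volume set vol_rep cluster.background-self-heal-count 0
gluster volume set vol_rep cluster.data-self-heal off
gluster volume set vol_rep cluster.self-heal-daemon on
gluster volume start vol_rep

# On the clients: one fuse mount, one nfs mount against storage_node2 (king).
mount -t glusterfs hicks:/vol_rep /mnt/fuse
mount -t nfs -o vers=3 king:/vol_rep /mnt/nfs

# From each mount point, start the large sequential write.
dd if=/dev/urandom of=test_file bs=1M count=10240 &

# While dd runs, simulate failure of storage_node1 (hicks), as root:
pkill -f gluster                               # kill all gluster processes
rm -rf /rhs/bricks/b0 && mkdir -p /rhs/bricks/b0   # delete and recreate the brick
```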

Actual results:
===============
Fuse mount
~~~~~~~~~~~~~~~
root@darrel [Aug-08-2013-17:06:44] >dd if=/dev/urandom of=./test_file bs=1M count=10240
dd: writing `./test_file': Transport endpoint is not connected
dd: closing output file `./test_file': Transport endpoint is not connected

Nfs mount
~~~~~~~~~~~~~~~~
root@darrel [Aug-08-2013-17:06:44] >dd if=/dev/urandom of=./test_file bs=1M count=10240



^C

^C
^C
^C

Expected results:
dd should not fail; writes should continue to be served by the replica that stayed online.

Additional info:
======================
The fuse mount did not get a response from storage_node2, which was online the whole time:


[2013-08-08 11:58:39.996077] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-vol_rep-client-1: server 10.70.34.119:49153 has not responded in the last 42 seconds, disconnecting.
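The 42-second window in the message above is GlusterFS's default network.ping-timeout: a client that gets no reply from a brick within that window disconnects from it. It can be tuned per volume; the 60-second value below is illustrative only:

```shell
# Raise the client ping timeout for this volume (default is 42 seconds).
gluster volume set vol_rep network.ping-timeout 60
```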



root@king [Aug-08-2013-18:09:26] >gluster v info
 
Volume Name: vol_rep
Type: Replicate
Volume ID: b5e2a708-3442-410d-b3ad-f9f1edbda67b
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: hicks:/rhs/bricks/b0
Brick2: king:/rhs/bricks/b1
Options Reconfigured:
cluster.self-heal-daemon: off
cluster.background-self-heal-count: 0
cluster.data-self-heal: off

root@king [Aug-08-2013-18:09:29] >gluster v status
Status of volume: vol_rep
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick king:/rhs/bricks/b1				49153	Y	12354
NFS Server on localhost					2049	Y	13018
 
There are no active volume tasks

root@king [Aug-08-2013-18:09:32] >./get_info.sh 
ls -lh /rhs/bricks/b1/test_file
-rw-r--r-- 2 root root 5.7G Aug  8 17:27 /rhs/bricks/b1/test_file

getfattr -d -e hex -m . /rhs/bricks/b1/test_file
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/b1/test_file
trusted.afr.vol_rep-client-0=0x0000ddf70000000000000000
trusted.afr.vol_rep-client-1=0x0000004d0000000000000000
trusted.gfid=0x560314511b7e4f1587f2c4b3187b3bfd
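The trusted.afr values above are AFR's pending-operation changelog. Assuming the standard layout of three big-endian 32-bit counters (data, metadata, entry), they can be decoded with a small shell sketch:

```shell
# Sketch: decode a trusted.afr changelog value into its three counters,
# assuming the standard AFR layout: data | metadata | entry (32 bits each).
decode_afr() {
    hex=${1#0x}                                   # strip the 0x prefix
    data=$((0x$(echo "$hex" | cut -c1-8)))        # pending data operations
    meta=$((0x$(echo "$hex" | cut -c9-16)))       # pending metadata operations
    entry=$((0x$(echo "$hex" | cut -c17-24)))     # pending entry operations
    echo "data=$data metadata=$meta entry=$entry"
}

# Values from the getfattr output above:
decode_afr 0x0000ddf70000000000000000   # changelog blaming client-0
decode_afr 0x0000004d0000000000000000   # changelog blaming client-1
```

Decoded, king's brick blames client-0 (the killed brick on hicks) for 0xddf7 = 56823 pending data operations, versus 0x4d = 77 for client-1, consistent with writes having continued on the surviving replica.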


ls -l /proc/`cat /var/lib/glusterd/vols/vol_rep/run/king-rhs-bricks-b1.pid`/fd
cat /var/lib/glusterd/vols/vol_rep/run/king-rhs-bricks-b1.pid
total 0
lr-x------ 1 root root 64 Aug  8 17:29 0 -> /dev/null
l-wx------ 1 root root 64 Aug  8 17:29 1 -> /dev/null
lrwx------ 1 root root 64 Aug  8 17:29 10 -> socket:[784900]
lr-x------ 1 root root 64 Aug  8 17:29 11 -> /dev/urandom
lr-x------ 1 root root 64 Aug  8 17:29 12 -> /rhs/bricks/b1
lrwx------ 1 root root 64 Aug  8 17:29 13 -> socket:[797630]
lrwx------ 1 root root 64 Aug  8 17:29 14 -> socket:[831017]
lrwx------ 1 root root 64 Aug  8 17:29 17 -> socket:[786965]
l-wx------ 1 root root 64 Aug  8 17:29 2 -> /dev/null
lrwx------ 1 root root 64 Aug  8 17:29 3 -> anon_inode:[eventpoll]
l-wx------ 1 root root 64 Aug  8 17:29 4 -> /var/log/glusterfs/bricks/rhs-bricks-b1.log
lrwx------ 1 root root 64 Aug  8 17:29 5 -> /var/lib/glusterd/vols/vol_rep/run/king-rhs-bricks-b1.pid
lrwx------ 1 root root 64 Aug  8 17:29 6 -> socket:[784884]
lrwx------ 1 root root 64 Aug  8 17:29 7 -> socket:[784911]
lrwx------ 1 root root 64 Aug  8 17:29 8 -> socket:[784893]
lrwx------ 1 root root 64 Aug  8 17:29 9 -> socket:[797451]


Tried to take statedumps after the dd failed; the brick statedump grew to 14 GB.
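For reference, statedumps are generated with the gluster CLI; a sketch (the default dump directory may vary by build):

```shell
# Dump the state of all of vol_rep's brick and nfs processes.
gluster volume statedump vol_rep
ls -lh /var/run/gluster/*.dump.*   # default statedump location
```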
Comment 2 spandura 2013-08-08 09:08:08 EDT
After some time, dd on the nfs mount failed as well: "Input/output error" (EIO) on write and "Bad file descriptor" (EBADF) on closing the input file.


root@darrel [Aug-08-2013-17:06:44] >dd if=/dev/urandom of=./test_file bs=1M count=10240
dmesg


^C

^C
^C
^C
dd: writing `./test_file': Input/output error
6372+0 records in
6371+0 records out
6680477696 bytes (6.7 GB) copied, 5331.16 s, 1.3 MB/s
dd: closing input file `/dev/urandom': Bad file descriptor
root@darrel [Aug-08-2013-18:35:49] >
root@darrel [Aug-08-2013-18:35:49] >
Comment 3 spandura 2013-08-08 09:20:06 EDT
SOS Reports , Statedumps : http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/995032/
Comment 4 Vivek Agarwal 2015-12-03 12:11:14 EST
Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release you requested us to review is now End of Life. Please see https://access.redhat.com/support/policy/updates/rhs/

If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.
