Description of problem:
========================
In a 1 x 2 replicate volume, when a storage node goes offline, dd on the FUSE mount fails with "Transport endpoint is not connected" and dd on the NFS mount hangs.

Version-Release number of selected component (if applicable):
===========================================================
glusterfs 3.4.0.18rhs built on Aug 7 2013 08:02:45

How reproducible:

Steps to Reproduce:
=======================
1. Create a 1 x 2 replicate volume with 2 storage nodes and 1 brick per storage node. Set background-self-heal-count to 0, data-self-heal to "off" and self-heal-daemon to "on". (A scripted condensation of steps 1-7 follows this list.)
2. Create FUSE and NFS mounts. { the NFS client mounts from storage_node2's NFS server }
3. From the FUSE mount, execute "dd if=/dev/urandom of=test_file bs=1M count=10240".
4. From the NFS mount, execute "dd if=/dev/urandom of=test_file bs=1M count=10240".
5. Set "self-heal-daemon" to off from one of the storage nodes.
6. While the dd on both mount points is in progress, kill all the gluster processes on storage_node1.
7. Delete the brick directory and recreate it on storage_node1.
8. After a while, dd on the FUSE mount failed with "Transport endpoint is not connected" and dd on the NFS mount hung.
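For convenience, a condensed, untested sketch of steps 1-7. Host names and brick paths are taken from the volume info below; the mount points /mnt/fuse and /mnt/nfs and the choice of server for the FUSE mount are assumptions:

# sketch only -- hicks is storage_node1, king is storage_node2
gluster volume create vol_rep replica 2 hicks:/rhs/bricks/b0 king:/rhs/bricks/b1
gluster volume set vol_rep cluster.background-self-heal-count 0
gluster volume set vol_rep cluster.data-self-heal off
gluster volume set vol_rep cluster.self-heal-daemon on
gluster volume start vol_rep

mount -t glusterfs king:/vol_rep /mnt/fuse        # FUSE client
mount -t nfs -o vers=3 king:/vol_rep /mnt/nfs     # NFS client, storage_node2's server

dd if=/dev/urandom of=/mnt/fuse/test_file bs=1M count=10240 &
dd if=/dev/urandom of=/mnt/nfs/test_file bs=1M count=10240 &

gluster volume set vol_rep cluster.self-heal-daemon off

# on storage_node1 (hicks), while both dd runs are in flight:
pkill gluster                                     # kills glusterd/glusterfsd/glusterfs
rm -rf /rhs/bricks/b0 && mkdir -p /rhs/bricks/b0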
Actual results:
===============
FUSE mount
~~~~~~~~~~~~~~~
root@darrel [Aug-08-2013-17:06:44] >dd if=/dev/urandom of=./test_file bs=1M count=10240
dd: writing `./test_file': Transport endpoint is not connected
dd: closing output file `./test_file': Transport endpoint is not connected

NFS mount
~~~~~~~~~~~~~~~~
root@darrel [Aug-08-2013-17:06:44] >dd if=/dev/urandom of=./test_file bs=1M count=10240
^C
^C
^C
^C

Expected results:
dd shouldn't fail.

Additional info:
======================
The FUSE mount did not get a response from storage_node2, which was online throughout:

[2013-08-08 11:58:39.996077] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-vol_rep-client-1: server 10.70.34.119:49153 has not responded in the last 42 seconds, disconnecting.

root@king [Aug-08-2013-18:09:26] >gluster v info

Volume Name: vol_rep
Type: Replicate
Volume ID: b5e2a708-3442-410d-b3ad-f9f1edbda67b
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: hicks:/rhs/bricks/b0
Brick2: king:/rhs/bricks/b1
Options Reconfigured:
cluster.self-heal-daemon: off
cluster.background-self-heal-count: 0
cluster.data-self-heal: off

root@king [Aug-08-2013-18:09:29] >gluster v status
Status of volume: vol_rep
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick king:/rhs/bricks/b1                               49153   Y       12354
NFS Server on localhost                                 2049    Y       13018

There are no active volume tasks

root@king [Aug-08-2013-18:09:32] >./get_info.sh
ls -lh /rhs/bricks/b1/test_file
-rw-r--r-- 2 root root 5.7G Aug 8 17:27 /rhs/bricks/b1/test_file

getfattr -d -e hex -m . /rhs/bricks/b1/test_file
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/b1/test_file
trusted.afr.vol_rep-client-0=0x0000ddf70000000000000000
trusted.afr.vol_rep-client-1=0x0000004d0000000000000000
trusted.gfid=0x560314511b7e4f1587f2c4b3187b3bfd

ls -l /proc/`cat /var/lib/glusterd/vols/vol_rep/run/king-rhs-bricks-b1.pid`/fd
cat /var/lib/glusterd/vols/vol_rep/run/king-rhs-bricks-b1.pid
total 0
lr-x------ 1 root root 64 Aug 8 17:29 0 -> /dev/null
l-wx------ 1 root root 64 Aug 8 17:29 1 -> /dev/null
lrwx------ 1 root root 64 Aug 8 17:29 10 -> socket:[784900]
lr-x------ 1 root root 64 Aug 8 17:29 11 -> /dev/urandom
lr-x------ 1 root root 64 Aug 8 17:29 12 -> /rhs/bricks/b1
lrwx------ 1 root root 64 Aug 8 17:29 13 -> socket:[797630]
lrwx------ 1 root root 64 Aug 8 17:29 14 -> socket:[831017]
lrwx------ 1 root root 64 Aug 8 17:29 17 -> socket:[786965]
l-wx------ 1 root root 64 Aug 8 17:29 2 -> /dev/null
lrwx------ 1 root root 64 Aug 8 17:29 3 -> anon_inode:[eventpoll]
l-wx------ 1 root root 64 Aug 8 17:29 4 -> /var/log/glusterfs/bricks/rhs-bricks-b1.log
lrwx------ 1 root root 64 Aug 8 17:29 5 -> /var/lib/glusterd/vols/vol_rep/run/king-rhs-bricks-b1.pid
lrwx------ 1 root root 64 Aug 8 17:29 6 -> socket:[784884]
lrwx------ 1 root root 64 Aug 8 17:29 7 -> socket:[784911]
lrwx------ 1 root root 64 Aug 8 17:29 8 -> socket:[784893]
lrwx------ 1 root root 64 Aug 8 17:29 9 -> socket:[797451]

Tried to take statedumps after the dd failed. The brick statedump grew to 14 GB.
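For readability: each trusted.afr value above is the AFR changelog, three 32-bit big-endian counters for pending data, metadata and entry operations against the named client. A minimal bash sketch of the decoding (the hex values are pasted in, not fetched):

for val in 0000ddf70000000000000000 0000004d0000000000000000; do
    # bytes 0-3: data, 4-7: metadata, 8-11: entry pending counts
    echo "data=$((16#${val:0:8})) metadata=$((16#${val:8:8})) entry=$((16#${val:16:8}))"
done

This prints data=56823 for vol_rep-client-0 and data=77 for vol_rep-client-1, i.e. the surviving brick on king has tens of thousands of data operations recorded as pending against the wiped brick on hicks.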
After some time, dd on the NFS mount failed (EIO on the write, and EBADF when closing the input file):

root@darrel [Aug-08-2013-17:06:44] >dd if=/dev/urandom of=./test_file bs=1M count=10240
dmesg
^C
^C
^C
^C
dd: writing `./test_file': Input/output error
6372+0 records in
6371+0 records out
6680477696 bytes (6.7 GB) copied, 5331.16 s, 1.3 MB/s
dd: closing input file `/dev/urandom': Bad file descriptor
root@darrel [Aug-08-2013-18:35:49] >
root@darrel [Aug-08-2013-18:35:49] >
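The FUSE disconnect logged earlier matches the default network.ping-timeout of 42 seconds. Not a fix, but for triage it may help to confirm or temporarily raise the timeout to see whether the client rides out the window in which the brick is unresponsive (a sketch; standard volume tunable assumed):

gluster volume set help | grep -A2 network.ping-timeout   # show the option and its default
gluster volume set vol_rep network.ping-timeout 120       # raise it for the duration of the test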
SOS reports, statedumps: http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/995032/
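For reference, statedumps like the archived ones are generated per volume; a sketch, assuming the default dump directory:

gluster volume statedump vol_rep        # dump every brick process of the volume
gluster volume statedump vol_rep nfs    # dump the gluster NFS server process as well
ls /var/run/gluster/                    # dump files land here, named after brick path and PID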
Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release against which it was reported is now End of Life. Please see https://access.redhat.com/support/policy/updates/rhs/ If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.