+++ This bug was initially created as a clone of Bug #787258 +++

Description of problem:
We're using Gluster in an HPC environment over RDMA transport. We're trying to isolate the workload that is triggering this, but until we do, I'll provide as much info as I can.

Version-Release number of selected component (if applicable):
GlusterFS 3.2.5 RPMs, distributed from gluster.org for RHEL 6.1

Gluster packages:
glusterfs-core-3.2.5-2.el6.x86_64
glusterfs-rdma-3.2.5-2.el6.x86_64
glusterfs-fuse-3.2.5-2.el6.x86_64

OS: Scientific Linux 6.1
Client kernel: 2.6.32-220.2.1.el6.x86_64
Brick kernel: 2.6.32-220.2.1.el6.x86_64

OFED/IB packages (stock, SL 6.1):
ibsim-0.5-4.el6.x86_64
ibutils-1.5.4-3.el6.x86_64
infiniband-diags-1.5.5-1.el6.x86_64
infinipath-psm-1.13-2.el6.x86_64
infinipath-psm-devel-1.13-2.el6.x86_64
libibcm-1.0.5-2.el6.x86_64
libibcm-devel-1.0.5-2.el6.x86_64
libibmad-1.3.4-1.el6.x86_64
libibmad-devel-1.3.4-1.el6.x86_64
libibumad-1.3.4-1.el6.x86_64
libibumad-devel-1.3.4-1.el6.x86_64
libibverbs-1.1.4-2.el6.x86_64
libibverbs-devel-1.1.4-2.el6.x86_64
libibverbs-utils-1.1.4-2.el6.x86_64
libipathverbs-1.2-2.el6.x86_64
libmlx4-1.0.1-7.el6.x86_64
libmthca-1.0.5-7.el6.x86_64
librdmacm-1.0.10-2.el6.x86_64
librdmacm-devel-1.0.10-2.el6.x86_64
librdmacm-utils-1.0.10-2.el6.x86_64
mstflint-1.4-3.el6.x86_64
opensm-libs-3.3.5-1.el6.x86_64
perftest-1.2.3-3.el6.x86_64
qperf-0.4.6-2.el6.x86_64
rdma-1.0-9.el6.noarch

fstab entry:
wh-hpcfs:work.rdma /work glusterfs rw,noatime 0 0

volume info:
volume work-client-0
    type protocol/client
    option remote-host wh-hpcfs-01
    option remote-subvolume /hpcfs/objects
    option transport-type rdma
end-volume

volume work-client-1
    type protocol/client
    option remote-host wh-hpcfs-02
    option remote-subvolume /hpcfs/objects
    option transport-type rdma
end-volume

volume work-client-2
    type protocol/client
    option remote-host wh-hpcfs-03
    option remote-subvolume /hpcfs/objects
    option transport-type rdma
end-volume

volume work-client-3
    type protocol/client
    option remote-host wh-hpcfs-04
    option remote-subvolume /hpcfs/objects
    option transport-type rdma
end-volume

volume work-dht
    type cluster/distribute
    subvolumes work-client-0 work-client-1 work-client-2 work-client-3
end-volume

volume work-write-behind
    type performance/write-behind
    subvolumes work-dht
end-volume

volume work-read-ahead
    type performance/read-ahead
    subvolumes work-write-behind
end-volume

volume work-io-cache
    type performance/io-cache
    subvolumes work-read-ahead
end-volume

volume work-quick-read
    type performance/quick-read
    subvolumes work-io-cache
end-volume

volume work-stat-prefetch
    type performance/stat-prefetch
    subvolumes work-quick-read
end-volume

volume work
    type debug/io-stats
    option latency-measurement off
    option count-fop-hits off
    subvolumes work-stat-prefetch
end-volume

How reproducible:
It seems fairly consistent within a certain subset of jobs. I will track down this information.

Steps to Reproduce:
1. Mount the filesystem on the compute nodes.
2. Let jobs run through, using wh-fs:work.rdma as their working directory (will find specific cases).
3. Various nodes lose the mount ("transport endpoint not connected"); tasks remain attached to the dead mount.
Actual results:
Jobs fail because the FS mount is dropped.

Expected results:
Jobs should complete and the FS mount should remain stable.

Additional info:
Log snippet of /var/log/glusterfs/work.log (on a compute node): http://pastie.org/3291330
Will try to get more logs and debugging information.

--- Additional comment from amarts on 2012-02-27 05:35:57 EST ---

This is the priority for the immediate future (before the 3.3.0 GA release). Will bump the priority up once we take up the RDMA-related tasks.

--- Additional comment from amarts on 2012-07-11 07:29:06 EDT ---

Mostly looks like a readlink buffer-length issue. Need to verify.
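For reference, a "readlink buffer-length issue" usually points at the classic pitfall that readlink(2) neither NUL-terminates its result nor reports truncation other than by returning a length equal to the supplied buffer size. Below is a minimal, hypothetical C sketch of that pitfall and how to guard against it; the helper name resolve_link and the example path are made up for illustration, and this is not the GlusterFS code path under suspicion.

/*
 * Illustrative sketch only -- NOT taken from the GlusterFS source.
 */
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: resolve a symlink into a caller-supplied buffer. */
static int resolve_link(const char *path, char *buf, size_t buflen)
{
    /* readlink() does NOT NUL-terminate and silently truncates when the
     * target is longer than the buffer; both are easy to mishandle. */
    ssize_t len = readlink(path, buf, buflen - 1);
    if (len < 0) {
        perror("readlink");
        return -1;
    }
    if ((size_t)len == buflen - 1) {
        /* Target may have been truncated; treating it as complete would
         * hand a bogus path to later code (e.g. an RPC reply). */
        fprintf(stderr, "link target possibly truncated\n");
        return -1;
    }
    buf[len] = '\0';   /* explicit termination, since readlink() omits it */
    return 0;
}

int main(void)
{
    char target[PATH_MAX];

    if (resolve_link("/tmp/example-symlink", target, sizeof(target)) == 0)
        printf("-> %s\n", target);
    return 0;
}

Forgetting either the explicit NUL termination or the truncation check is the kind of bug the comment above hints at; whether that is actually what bit the RDMA client here still needs to be verified.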
*** This bug has been marked as a duplicate of bug 822337 ***