Description of problem:
We're using Gluster in an HPC environment over RDMA transport. We're trying to isolate the workload that is triggering this, but until we do, I'll provide as much info as I can.

Version-Release number of selected component (if applicable):
GlusterFS 3.2.5 RPMs, distributed from gluster.org for RHEL 6.1

Gluster packages:
glusterfs-core-3.2.5-2.el6.x86_64
glusterfs-rdma-3.2.5-2.el6.x86_64
glusterfs-fuse-3.2.5-2.el6.x86_64

OS: Scientific Linux 6.1
Client kernel: 2.6.32-220.2.1.el6.x86_64
Brick kernel: 2.6.32-220.2.1.el6.x86_64

OFED/IB packages (stock, SL 6.1):
ibsim-0.5-4.el6.x86_64
ibutils-1.5.4-3.el6.x86_64
infiniband-diags-1.5.5-1.el6.x86_64
infinipath-psm-1.13-2.el6.x86_64
infinipath-psm-devel-1.13-2.el6.x86_64
libibcm-1.0.5-2.el6.x86_64
libibcm-devel-1.0.5-2.el6.x86_64
libibmad-1.3.4-1.el6.x86_64
libibmad-devel-1.3.4-1.el6.x86_64
libibumad-1.3.4-1.el6.x86_64
libibumad-devel-1.3.4-1.el6.x86_64
libibverbs-1.1.4-2.el6.x86_64
libibverbs-devel-1.1.4-2.el6.x86_64
libibverbs-utils-1.1.4-2.el6.x86_64
libipathverbs-1.2-2.el6.x86_64
libmlx4-1.0.1-7.el6.x86_64
libmthca-1.0.5-7.el6.x86_64
librdmacm-1.0.10-2.el6.x86_64
librdmacm-devel-1.0.10-2.el6.x86_64
librdmacm-utils-1.0.10-2.el6.x86_64
mstflint-1.4-3.el6.x86_64
opensm-libs-3.3.5-1.el6.x86_64
perftest-1.2.3-3.el6.x86_64
qperf-0.4.6-2.el6.x86_64
rdma-1.0-9.el6.noarch

fstab entry:
wh-hpcfs:work.rdma /work glusterfs rw,noatime 0 0

volume info:
volume work-client-0
    type protocol/client
    option remote-host wh-hpcfs-01
    option remote-subvolume /hpcfs/objects
    option transport-type rdma
end-volume

volume work-client-1
    type protocol/client
    option remote-host wh-hpcfs-02
    option remote-subvolume /hpcfs/objects
    option transport-type rdma
end-volume

volume work-client-2
    type protocol/client
    option remote-host wh-hpcfs-03
    option remote-subvolume /hpcfs/objects
    option transport-type rdma
end-volume

volume work-client-3
    type protocol/client
    option remote-host wh-hpcfs-04
    option remote-subvolume /hpcfs/objects
    option transport-type rdma
end-volume

volume work-dht
    type cluster/distribute
    subvolumes work-client-0 work-client-1 work-client-2 work-client-3
end-volume

volume work-write-behind
    type performance/write-behind
    subvolumes work-dht
end-volume

volume work-read-ahead
    type performance/read-ahead
    subvolumes work-write-behind
end-volume

volume work-io-cache
    type performance/io-cache
    subvolumes work-read-ahead
end-volume

volume work-quick-read
    type performance/quick-read
    subvolumes work-io-cache
end-volume

volume work-stat-prefetch
    type performance/stat-prefetch
    subvolumes work-quick-read
end-volume

volume work
    type debug/io-stats
    option latency-measurement off
    option count-fop-hits off
    subvolumes work-stat-prefetch
end-volume

How reproducible:
It seems fairly consistent within a certain subset of jobs. I will track down this information.

Steps to Reproduce:
1. Mount the filesystem on compute nodes.
2. Let jobs run through, using wh-fs:work.rdma as their working directory (will find specific cases).
3. Various nodes lose the mount: "Transport endpoint is not connected". Tasks are still attached to the mount.

Actual results:
Jobs fail because the FS mount is dropped.

Expected results:
Jobs should complete and the FS mount should be stable.

Additional info:
Log snippet of /var/log/glusterfs/work.log (on compute node): http://pastie.org/3291330
Will try to get more logs and debugging information.
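For what it's worth, the mount and failure check from the steps above can be sketched as a shell fragment. This is an illustrative, environment-specific sketch using the hostname and mount point from the report's fstab entry, not a verified reproduction script:

```shell
# Mount the volume on a compute node, matching the report's fstab entry
# (wh-hpcfs:work.rdma on /work, RDMA transport, rw,noatime).
mount -t glusterfs -o rw,noatime wh-hpcfs:work.rdma /work

# Detect the reported failure mode: once the FUSE mount is dropped,
# any access returns "Transport endpoint is not connected".
if ! stat /work >/dev/null 2>&1; then
    echo "mount lost: $(stat /work 2>&1 | tail -n1)"
fi
```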
This is a priority for the immediate future (before the 3.3.0 GA release). Will bump the priority up once we take up the RDMA-related tasks.
This mostly looks like a readlink buffer-length issue. Need to verify.
*** This bug has been marked as a duplicate of bug 822337 ***