Bug 787258

Summary: RDMA-connected clients drop mount with "transport endpoint not connected"
Product: [Community] GlusterFS
Reporter: Brian Smith <brs>
Component: rdma
Assignee: Raghavendra G <rgowdapp>
Status: CLOSED DUPLICATE
QA Contact:
Severity: high
Docs Contact:
Priority: low
Version: 3.2.5
CC: amarts, gluster-bugs, vbellur
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 849132
Environment:
Last Closed: 2012-12-26 05:09:08 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Bug Depends On:
Bug Blocks: 849132, 858453

Description Brian Smith 2012-02-03 12:40:57 EST
Description of problem:

We're using Gluster in an HPC environment over RDMA transport.  We're trying to isolate the workload that is triggering this, but until we do, I'll provide as much info as I can.

Version-Release number of selected component (if applicable):

GlusterFS 3.2.5 RPMs, distributed from gluster.org for RHEL 6.1

Gluster packages:
glusterfs-core-3.2.5-2.el6.x86_64
glusterfs-rdma-3.2.5-2.el6.x86_64
glusterfs-fuse-3.2.5-2.el6.x86_64

OS: Scientific Linux 6.1
Client kernel: 2.6.32-220.2.1.el6.x86_64
Brick kernel: 2.6.32-220.2.1.el6.x86_64

OFED/IB packages (stock, SL 6.1): 
ibsim-0.5-4.el6.x86_64
ibutils-1.5.4-3.el6.x86_64
infiniband-diags-1.5.5-1.el6.x86_64
infinipath-psm-1.13-2.el6.x86_64
infinipath-psm-devel-1.13-2.el6.x86_64
libibcm-1.0.5-2.el6.x86_64
libibcm-devel-1.0.5-2.el6.x86_64
libibmad-1.3.4-1.el6.x86_64
libibmad-devel-1.3.4-1.el6.x86_64
libibumad-1.3.4-1.el6.x86_64
libibumad-devel-1.3.4-1.el6.x86_64
libibverbs-1.1.4-2.el6.x86_64
libibverbs-devel-1.1.4-2.el6.x86_64
libibverbs-utils-1.1.4-2.el6.x86_64
libipathverbs-1.2-2.el6.x86_64
libmlx4-1.0.1-7.el6.x86_64
libmthca-1.0.5-7.el6.x86_64
librdmacm-1.0.10-2.el6.x86_64
librdmacm-devel-1.0.10-2.el6.x86_64
librdmacm-utils-1.0.10-2.el6.x86_64
mstflint-1.4-3.el6.x86_64
opensm-libs-3.3.5-1.el6.x86_64
perftest-1.2.3-3.el6.x86_64
qperf-0.4.6-2.el6.x86_64
rdma-1.0-9.el6.noarch
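
For reference, a quick way to sanity-check the IB fabric from a client when the mount drops, using utilities from the packages listed above (the brick hostname is taken from the volume info below; the qperf test names are standard):

# HCA port should be Active/LinkUp with a LID assigned by the subnet manager
ibstat
ibv_devinfo | grep -iE 'state|active_mtu'

# Basic RDMA RC latency/bandwidth check against a brick host
# (start "qperf" with no arguments on wh-hpcfs-01 first)
qperf wh-hpcfs-01 rc_lat rc_bw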

fstab entry:
wh-hpcfs:work.rdma	/work	glusterfs	rw,noatime	0	0
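
For reference, the equivalent manual mount (assuming mount.glusterfs accepts the same options as the fstab entry; the .rdma suffix selects the RDMA transport for the volume):

mount -t glusterfs -o noatime wh-hpcfs:work.rdma /work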

volume info:
volume work-client-0
    type protocol/client
    option remote-host wh-hpcfs-01
    option remote-subvolume /hpcfs/objects
    option transport-type rdma
end-volume

volume work-client-1
    type protocol/client
    option remote-host wh-hpcfs-02
    option remote-subvolume /hpcfs/objects
    option transport-type rdma
end-volume

volume work-client-2
    type protocol/client
    option remote-host wh-hpcfs-03
    option remote-subvolume /hpcfs/objects
    option transport-type rdma
end-volume

volume work-client-3
    type protocol/client
    option remote-host wh-hpcfs-04
    option remote-subvolume /hpcfs/objects
    option transport-type rdma
end-volume

volume work-dht
    type cluster/distribute
    subvolumes work-client-0 work-client-1 work-client-2 work-client-3
end-volume

volume work-write-behind
    type performance/write-behind
    subvolumes work-dht
end-volume

volume work-read-ahead
    type performance/read-ahead
    subvolumes work-write-behind
end-volume

volume work-io-cache
    type performance/io-cache
    subvolumes work-read-ahead
end-volume

volume work-quick-read
    type performance/quick-read
    subvolumes work-io-cache
end-volume

volume work-stat-prefetch
    type performance/stat-prefetch
    subvolumes work-quick-read
end-volume

volume work
    type debug/io-stats
    option latency-measurement off
    option count-fop-hits off
    subvolumes work-stat-prefetch
end-volume
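
For completeness, the same graph can be cross-checked from the servers with the gluster CLI (volume name "work" taken from the graph above); the transport type reported there should be rdma:

gluster volume info work
gluster peer status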

How reproducible:
It seems fairly consistent within a certain subset of jobs.  I will track down this information.


Steps to Reproduce:
1. Mount filesystem on compute nodes 
2. Let jobs run through, using the wh-hpcfs:work.rdma mount (/work) as their working directory (will find specific cases)
3. Various nodes lose the mount with "transport endpoint not connected"; tasks are still attached to the mount (see the check sketch below).
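
A rough check to run on an affected node once step 3 hits, to confirm the disconnected state and capture the relevant client log lines (mount point from the fstab entry above; log path as noted under Additional info):

# stat fails with "Transport endpoint is not connected" once the mount is dropped
stat /work

# pull disconnect-related messages from the client log
grep -iE 'disconnect|not connected' /var/log/glusterfs/work.log | tail -20

# recovery workaround only, after draining jobs: lazy-unmount and remount from fstab
umount -l /work && mount /work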
  
Actual results:
Jobs fail because the filesystem mount is dropped.


Expected results:
Jobs should complete and the filesystem mount should remain stable.


Additional info:
Log snippet of /var/log/glusterfs/work.log (on compute node): http://pastie.org/3291330

Will try to get more logs and debugging information.
Comment 1 Amar Tumballi 2012-02-27 05:35:57 EST
This is the priority for the immediate future (before the 3.3.0 GA release). Will bump the priority up once we take up the RDMA-related tasks.
Comment 2 Amar Tumballi 2012-07-11 07:29:06 EDT
Mostly looks like a readlink buffer-length issue. Need to verify.
Comment 3 Raghavendra G 2012-12-26 05:09:08 EST

*** This bug has been marked as a duplicate of bug 822337 ***