Bug 978148

Summary: Attempting to mount distributed-replicate volume on RHEL 6.4 hangs in upstream 3.4.0 Beta 3
Product: [Community] GlusterFS
Component: rdma
Version: pre-release
Hardware: x86_64
OS: Linux
Status: CLOSED EOL
Severity: high
Priority: unspecified
Reporter: Justin Clift <jclift>
Assignee: bugs <bugs>
CC: bugs, gluster-bugs, kwade
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-10-22 15:40:20 UTC
Attachments:
  Client node gluster log directory
  First storage node gluster log directory
  Second storage node log directory
  Sosreport from client node
  Sosreport from first storage node
  Sosreport from second storage node
  /var/lib/glusterd/ from gluster storage node 1
  /var/lib/glusterd/ from gluster storage node 2

Description Justin Clift 2013-06-26 04:57:29 UTC
Description of problem:

  Attempting to mount a Distributed-Replicate volume over the RDMA
  transport hangs with upstream GlusterFS 3.4.0 Beta 3 on RHEL 6.4.
  (This is a late entry for the 3.4.0 Beta 3 RDMA "Test Day".)

  The mount was tried several times, cancelling with Control-C
  after a few minutes each time:

    # mount -t glusterfs gluster1-2:test4 /foo4
    ^C
    # mount -t glusterfs gluster1-2:test4 /foo4
    ^C
    # mount -t glusterfs gluster1-2:test4 /foo4
    ^C
    # ps -ef|grep -i gluster
    root      1808     1  0 05:21 ?        00:00:00 /usr/sbin/glusterd -p /var/run/glusterd.pid
    root      1874     1  0 05:21 ?        00:00:00 /usr/sbin/glusterfs --volfile-id=test4 --volfile-server=gluster1-2 /foo4
    root      1882     1  0 05:21 pts/0    00:00:00 /bin/sh /sbin/mount.glusterfs gluster1-2:test4 /foo4 -o rw
    root      1888     1  0 05:22 pts/0    00:00:00 /bin/sh /sbin/mount.glusterfs gluster1-2:test4 /foo4 -o rw
    root      1904     1  0 05:22 pts/0    00:00:00 /bin/sh /sbin/mount.glusterfs gluster1-2:test4 /foo4 -o rw
    #
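
  (One possible way to clean up the stale mount helpers between attempts
  and capture a more verbose client log would be roughly the following.
  This is a sketch only; the log level and the pkill patterns are
  illustrative, matched against the ps output above.)

    # Lazy-unmount the hung mount point, stop the leftover
    # glusterfs / mount.glusterfs processes from the cancelled
    # attempts, then retry the mount with debug logging.
    umount -l /foo4
    pkill -f 'mount.glusterfs gluster1-2:test4'
    pkill -f 'volfile-id=test4 --volfile-server=gluster1-2'
    mount -t glusterfs -o log-level=DEBUG gluster1-2:test4 /foo4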

  The Gluster log for the mount point (attached) suggests the client
  is having trouble connecting to one of the subvolumes, although
  none of the other volumes on the same servers show any problems.
  The issue seems limited to the Distributed-Replicate volume, even
  though it uses the same backend physical storage as the other
  volumes (just different directories).
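
  A follow-up check worth running here (suggested only; its output was
  not captured for this report) would be glusterd's own view of the
  test4 bricks, which also shows the ports the client is expected to
  reach:

    # Show brick status and listen ports for the test4 bricks,
    # as glusterd sees them.
    gluster volume status test4

  The volume layout at the time: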

    # gluster volume info

    Volume Name: test4
    Type: Distributed-Replicate
    Volume ID: 384780ee-306c-419e-a6d6-d58abbd24a58
    Status: Started
    Number of Bricks: 2 x 2 = 4
    Transport-type: rdma
    Bricks:
    Brick1: gluster1-2:/export/brick1/test4
    Brick2: gluster2-2:/export/brick1/test4
    Brick3: gluster1-2:/export/brick2/test4
    Brick4: gluster2-2:/export/brick2/test4

    Volume Name: test3
    Type: Replicate
    Volume ID: a3b5e22c-28c9-4963-84fd-c192f4b9261b
    Status: Started
    Number of Bricks: 1 x 2 = 2
    Transport-type: rdma
    Bricks:
    Brick1: gluster1-2:/export/brick2/test3
    Brick2: gluster2-2:/export/brick2/test3

    Volume Name: test1
    Type: Distribute
    Volume ID: 63017415-d946-473e-8aa4-8746e5265f9c
    Status: Started
    Number of Bricks: 1
    Transport-type: rdma
    Bricks:
    Brick1: gluster1-2:/export/brick1/test1

    Volume Name: test5
    Type: Stripe
    Volume ID: e6e78330-3bc8-445c-b500-93fb15ebdf6d
    Status: Started
    Number of Bricks: 1 x 2 = 2
    Transport-type: rdma
    Bricks:
    Brick1: gluster1-2:/export/brick1/test5
    Brick2: gluster2-2:/export/brick1/test5

    Volume Name: test2
    Type: Distribute
    Volume ID: 694f1cbf-e1e5-42e6-9f07-605d409ff95f
    Status: Started
    Number of Bricks: 2
    Transport-type: rdma
    Bricks:
    Brick1: gluster1-2:/export/brick2/test2
    Brick2: gluster2-2:/export/brick2/test2


Version-Release number of selected component (if applicable):

  glusterfs-3.4.0-0.6.beta3.el6.x86_64
  glusterfs-api-3.4.0-0.6.beta3.el6.x86_64
  glusterfs-debuginfo-3.4.0-0.6.beta3.el6.x86_64
  glusterfs-devel-3.4.0-0.6.beta3.el6.x86_64
  glusterfs-fuse-3.4.0-0.6.beta3.el6.x86_64
  glusterfs-rdma-3.4.0-0.6.beta3.el6.x86_64
  glusterfs-server-3.4.0-0.6.beta3.el6.x86_64


How reproducible:

  Every time, even after several reboots of every server in
  the environment. :(


Steps to Reproduce:

  1. With a 3-node setup (2 Gluster storage nodes and one
     "client" node for mounting), create all of the volumes
     as per the Gluster 3.4.0 Beta 3 RDMA test day
     instructions, adding "transport rdma" to every volume
     creation command so the volumes use RDMA (see the
     sketch after these steps).

       http://www.gluster.org/community/documentation/index.php/3.4.0_Beta_1_Tests

  2. Attempt to mount all of the volumes as per Test 4a
     (native client mounting).

     The problem occurs here: mounting hangs on the
     "test4" (Distributed-Replicate) volume.


Additional info:

  Tarballs of the /var/log/glusterfs/ directories from all three
  nodes are attached.  The logs were wiped just before reproducing
  the issue, so everything in them should be relevant.

  Sosreports for all three nodes are also included, generated
  using "sosreport -e infiniband".

Comment 1 Justin Clift 2013-06-26 04:59:39 UTC
Created attachment 765385 [details]
Client node gluster log directory.

The "foo4.log" file is the log for the hanging mount point.

Comment 2 Justin Clift 2013-06-26 05:00:19 UTC
Created attachment 765386 [details]
First storage node gluster log directory

Comment 3 Justin Clift 2013-06-26 05:01:10 UTC
Created attachment 765387 [details]
Second storage node log directory

Comment 4 Justin Clift 2013-06-26 05:03:57 UTC
Created attachment 765389 [details]
Sosreport from client node

Generated using "sosreport -e infiniband"

Comment 5 Justin Clift 2013-06-26 05:04:44 UTC
Created attachment 765390 [details]
Sosreport from first storage node

Generated using "sosreport -e infiniband"

Comment 6 Justin Clift 2013-06-26 05:05:26 UTC
Created attachment 765391 [details]
Sosreport from second storage node

Generated using "sosreport -e infiniband"

Comment 7 Justin Clift 2013-06-26 05:47:48 UTC
Interestingly, "ps -ef|grep -i glusterfsd" is showing two glusterfsd processes for the test4 volume.  Probably relevant to the problem.

# ps -ef|grep -i glusterfsd
root      1895     1  0 06:29 ?        00:00:00 /usr/sbin/glusterfsd -s gluster1-2 --volfile-id test4.gluster1-2.export-brick1-test4 -p /var/lib/glusterd/vols/test4/run/gluster1-2-export-brick1-test4.pid -S /var/run/0e97d6ffbb276bbdda66eefdfa0177a3.socket --brick-name /export/brick1/test4 -l /var/log/glusterfs/bricks/export-brick1-test4.log --xlator-option *-posix.glusterd-uuid=6776b03c-9406-46bd-ae4f-4b6c63a46503 --brick-port 49163 --xlator-option test4-server.listen-port=49163
root      1899     1  0 06:29 ?        00:00:00 /usr/sbin/glusterfsd -s gluster1-2 --volfile-id test4.gluster1-2.export-brick2-test4 -p /var/lib/glusterd/vols/test4/run/gluster1-2-export-brick2-test4.pid -S /var/run/bfabdcacf2d8f16138631e941242b7c3.socket --brick-name /export/brick2/test4 -l /var/log/glusterfs/bricks/export-brick2-test4.log --xlator-option *-posix.glusterd-uuid=6776b03c-9406-46bd-ae4f-4b6c63a46503 --brick-port 49164 --xlator-option test4-server.listen-port=49164
root      1909     1  0 06:29 ?        00:00:00 /usr/sbin/glusterfsd -s gluster1-2 --volfile-id test3.gluster1-2.export-brick2-test3 -p /var/lib/glusterd/vols/test3/run/gluster1-2-export-brick2-test3.pid -S /var/run/e351ec752b247414834521b8ee755418.socket --brick-name /export/brick2/test3 -l /var/log/glusterfs/bricks/export-brick2-test3.log --xlator-option *-posix.glusterd-uuid=6776b03c-9406-46bd-ae4f-4b6c63a46503 --brick-port 49162 --xlator-option test3-server.listen-port=49162
root      1913     1  0 06:29 ?        00:00:00 /usr/sbin/glusterfsd -s gluster1-2 --volfile-id test1.gluster1-2.export-brick1-test1 -p /var/lib/glusterd/vols/test1/run/gluster1-2-export-brick1-test1.pid -S /var/run/265f49a2e421e2c1c52658c1215db1e8.socket --brick-name /export/brick1/test1 -l /var/log/glusterfs/bricks/export-brick1-test1.log --xlator-option *-posix.glusterd-uuid=6776b03c-9406-46bd-ae4f-4b6c63a46503 --brick-port 49160 --xlator-option test1-server.listen-port=49160
root      1922     1  0 06:29 ?        00:00:00 /usr/sbin/glusterfsd -s gluster1-2 --volfile-id test5.gluster1-2.export-brick1-test5 -p /var/lib/glusterd/vols/test5/run/gluster1-2-export-brick1-test5.pid -S /var/run/1a66008adc837acb4b95ef45312a69f1.socket --brick-name /export/brick1/test5 -l /var/log/glusterfs/bricks/export-brick1-test5.log --xlator-option *-posix.glusterd-uuid=6776b03c-9406-46bd-ae4f-4b6c63a46503 --brick-port 49165 --xlator-option test5-server.listen-port=49165
root      1927     1  0 06:29 ?        00:00:00 /usr/sbin/glusterfsd -s gluster1-2 --volfile-id test2.gluster1-2.export-brick2-test2 -p /var/lib/glusterd/vols/test2/run/gluster1-2-export-brick2-test2.pid -S /var/run/4f0257c2b42a1968d918e757d0ad9779.socket --brick-name /export/brick2/test2 -l /var/log/glusterfs/bricks/export-brick2-test2.log --xlator-option *-posix.glusterd-uuid=6776b03c-9406-46bd-ae4f-4b6c63a46503 --brick-port 49161 --xlator-option test2-server.listen-port=49161
#

Comment 8 Justin Clift 2013-06-26 05:59:09 UTC
Created attachment 765399 [details]
/var/lib/glusterd/ from gluster storage node 1

Comment 9 Justin Clift 2013-06-26 05:59:33 UTC
Created attachment 765400 [details]
/var/lib/glusterd/ from gluster storage node 2

Comment 11 Kaleb KEITHLEY 2015-10-22 15:40:20 UTC
The "pre-release" version is ambiguous and is about to be removed as a choice.

If you believe this is still a bug, please change the status back to NEW and choose the appropriate, applicable version for it.