Bug 985424 - Gluster 3.4.0 RDMA stops working with more than a small handful of nodes
Summary: Gluster 3.4.0 RDMA stops working with more than a small handful of nodes
Alias: None
Product: GlusterFS
Classification: Community
Component: rdma
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: GlusterFS Bugs list
QA Contact:
Depends On:
Reported: 2013-07-17 13:07 UTC by Ryan
Modified: 2014-10-14 13:15 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2014-10-14 13:15:06 UTC
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:


Description Ryan 2013-07-17 13:07:07 UTC
Description of problem:
Has anyone had a similar experience? When creating and mounting RDMA-only volumes across roughly half a dozen nodes or fewer, I can successfully create, start, and mount them.

However, if I try to scale this to 20, 50, or even 100 nodes, RDMA-only volumes fall over completely. Some of the basic symptoms I'm seeing are:

* Volume create always completes successfully. However, starting the volume reports failure, yet the volume info command for that volume then shows it as started
* Attempting to mount this "started" volume either fails or hangs in the mount process
* Attempting to stop this "started" volume fails with no error or success reported (the command simply times out and returns an empty status)
* Attempting to delete this "started" volume also fails, again with no error or success status reported

To clear the state, I have to stop or killall the gluster processes and then respawn them. After that, the volume info command still shows the volume as started, but I can now stop and delete the volume, and both commands report success:

root@cs1-p:~# gluster volume stop perftest
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: perftest: success

Volume Name: perftest
Type: Distributed-Replicate
Volume ID: ef206a76-7b26-4c12-9ccf-b3d250f36403
Status: Stopped
Number of Bricks: 50 x 2 = 100
Transport-type: rdma

root@cs1-p:~# gluster volume delete perftest
Deleting volume will erase all information about the volume. Do you want to continue? (y/n) y
volume delete: perftest: success
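
The workaround described above could be sketched roughly as follows. This is a dry run that only prints the commands. The service name in `SERVICE` is an assumption (it differs between distributions), and the report does not say on which nodes the processes had to be restarted, so that is left open here. The process names passed to `killall` are the usual GlusterFS daemons, not a list taken from the report.

```shell
#!/bin/sh
# Dry-run sketch: prints the recovery commands rather than running them.
# SERVICE is an assumption; the init service name differs by distribution.
SERVICE=${SERVICE:-glusterd}
VOLUME=perftest

print_recovery() {
    echo "service $SERVICE stop"
    # glusterd is the management daemon, glusterfsd serves bricks,
    # glusterfs runs client-side and auxiliary processes.
    echo "killall glusterd glusterfsd glusterfs"
    echo "service $SERVICE start"
    # Once the daemons are back, stop and delete succeed per the report.
    echo "gluster volume stop $VOLUME"
    echo "gluster volume delete $VOLUME"
}

print_recovery
```

Piping the output through `sh` would run the commands for real, which is only sensible on a test cluster.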

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Create a gluster volume of more than 20-30 nodes
2. Start the volume and observe that the start command reports failure, even though the volume is flagged as "started"
3. Attempt to mount the volume; the mount fails.
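
The steps above might be scripted along these lines. This is a dry-run sketch that only prints the commands. The hostnames (`node1` through `nodeN`), the brick path, and the mount point are hypothetical placeholders, not values from the report. `replica 2` is chosen to match the 50 x 2 Distributed-Replicate layout shown in the volume info output, and the `transport=rdma` mount option is assumed to be how the RDMA-only volume was mounted.

```shell
#!/bin/sh
# Dry-run sketch: prints the gluster commands rather than running them.
# Hostnames, brick path, and mount point are hypothetical placeholders.
VOLUME=perftest
NODES=${NODES:-100}
BRICK_PATH=/data/brick1

build_bricks() {
    # Emit "nodeK:/path" for K = 1..$1, space separated, on one line.
    i=1
    bricks=""
    while [ "$i" -le "$1" ]; do
        bricks="$bricks node$i:$BRICK_PATH"
        i=$((i + 1))
    done
    # Drop the leading space left by the first append.
    echo "${bricks# }"
}

print_repro() {
    echo "gluster volume create $VOLUME replica 2 transport rdma $(build_bricks "$NODES")"
    echo "gluster volume start $VOLUME"
    echo "gluster volume info $VOLUME"
    echo "mount -t glusterfs -o transport=rdma node1:/$VOLUME /mnt/$VOLUME"
}

print_repro
```

Setting `NODES=6` versus `NODES=100` gives the small and large cases the report compares.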

Actual results:

Volume fails to mount

Expected results:

Volume should mount successfully

Additional info:

Comment 1 Kaleb KEITHLEY 2014-10-14 13:15:06 UTC
RDMA on IB works in 3.5. Please upgrade to a current 3.5 release if you need RDMA.
