Hi Dheeraj,

Can you provide the correct device name using 'option transport.rdma.device-name <device-name>' (in your case it will be 'option transport.rdma.device-name mthca0') in the volume file (usually found in /etc/glusterd/vols/<vol-name>/ on the volume-file server) for both protocol/client and protocol/server, and let us know whether it works fine?

Also, bug 764869 seems to be a duplicate of this. Can you confirm this?

regards,
Raghavendra.
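For reference, a sketch of where the option would sit inside the client-side translator block of the volfile (the volume/subvolume names below follow the usual <vol>-client-N naming convention and are illustrative, not taken from an actual setup):

volume crlgfs1-client-0
    type protocol/client
    option transport-type rdma
    option transport.rdma.device-name mthca0
    option remote-host glus01-ib
    option remote-subvolume /data/gluster/brick-1
end-volume

The same 'option transport.rdma.device-name mthca0' line goes into the protocol/server block of the brick volfiles.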
(In reply to comment #1)

Bug 3137 is a duplicate of bug 3139. I will try the option given above on the server.
*** Bug 3137 has been marked as a duplicate of this bug. ***
(In reply to comment #3)
> *** Bug 3137 has been marked as a duplicate of this bug. ***

These are the filesystem details:

# gluster volume info

Volume Name: crlgfs1
Type: Distribute
Status: Started
Number of Bricks: 2
Transport-type: rdma
Bricks:
Brick1: glus01-ib:/data/gluster/brick-1
Brick2: glus02-ib:/data/gluster/brick-2
Options Reconfigured:
features.quota: off
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
features.quota-timeout: 5

I have added 'option transport.rdma.device-name mthca0' to /etc/glusterd/vols/crlgfs1/crlgfs1.glus01-ib.data-gluster-brick-1.vol and /etc/glusterd/vols/crlgfs1/crlgfs1.glus02-ib.data-gluster-brick-2.vol on both nodes, and restarted glusterd.

I didn't add any line on the clients. I tried mounting the volume and it still failed. Should I add a line on the client as well? If yes, in which file?
(In reply to comment #4)

Yes, you should add that option to the client volfile as well. You can find the file at /etc/glusterd/vols/crlgfs1/crlgfs1-fuse.vol. Please make sure you make these changes on the volfile server (or on all the servers, if you don't know which one is the volfile server), so that the modified configuration files are used when gluster starts up. You should be able to see these changes in the volfile dump present in all the client and server log files (which can be found in <install-dir>/var/log/glusterfs).

regards,
Raghavendra.
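One way to verify that the client picked up the change is to grep the volfile dump in the client log after mounting (the log file name below is hypothetical; it is derived from the mount point):

# grep device-name /var/log/glusterfs/mnt-crlgfs1.log

If the modified volfile was served, the 'option transport.rdma.device-name mthca0' line should appear in the dump.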
Instead of picking the ACTIVE device (mthca0), the gluster client tried to connect through mthca1, which is down. The error message includes:

[2011-07-06 21:36:41.567718] W [write-behind.c:3023:init] 0-crlgfs1-write-behind: disabling write-behind for first 0 bytes
[2011-07-06 21:36:41.572288] W [rdma.c:3742:rdma_get_device] 0-rpc-transport/rdma: On device mthca1: provided port:1 is found to be offline, continuing to use the same port

ibv_devinfo gives the output below:

# ibv_devinfo
hca_id: mthca1
        transport:              InfiniBand (0)
        fw_ver:                 1.2.400
        node_guid:              0019:bbff:fff7:9bfc
        sys_image_guid:         0019:bbff:fff7:9bff
        vendor_id:              0x02c9
        vendor_part_id:         25204
        hw_ver:                 0xA0
        board_id:               HP_0010000001
        phys_port_cnt:          1
                port:   1
                        state:          PORT_DOWN (1)
                        max_mtu:        2048 (4)
                        active_mtu:     512 (2)
                        sm_lid:         0
                        port_lid:       0
                        port_lmc:       0x00
                        link_layer:     IB

hca_id: mthca0
        transport:              InfiniBand (0)
        fw_ver:                 1.2.400
        node_guid:              0019:bbff:fff7:abbc
        sys_image_guid:         0019:bbff:fff7:abbf
        vendor_id:              0x02c9
        vendor_part_id:         25204
        hw_ver:                 0xA0
        board_id:               HP_0010000001
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         1
                        port_lid:       3
                        port_lmc:       0x00
                        link_layer:     IB

Hope this helps give you an idea of the issue.
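For a quick summary of just the device and port states (a convenience one-liner; output condensed from the full dump above):

# ibv_devinfo | grep -E 'hca_id|state'
hca_id: mthca1
                        state:          PORT_DOWN (1)
hca_id: mthca0
                        state:          PORT_ACTIVE (4)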
Created attachment 546
Hi Dheeraj,

Can you check whether the attached patch fixes the issue (even without modifying configuration files)?

# cd <glusterfs-source-dir>
# patch -p1 < patch
# (make && sudo make install) > /dev/null

regards,
Raghavendra.
(In reply to comment #7)

The patch also didn't seem to work for me. I didn't try too much on this case; I just removed the additional IB card and it's working fine now. Thanks for your effort. I will try using both the cards and will let you know the result.

Note: I noticed a peculiar thing: on RHEL5 (2.6.18) it works fine with 2 HCA cards, but when I recompiled the kernel with 2.6.25 it gives the issue; similarly, it gave the issue on FC12 (2.6.32) as well.
We have to confirm that the patch enables gluster to pick the correct default device.
(In reply to comment #9)
> We have to confirm that the patch enables gluster to pick the correct default device.

While applying the patch it gave me the following error:

[root@st0 glusterfs-3.2.0]# patch -p1 < 0001-rpc-transport-rdma-Use-a-device-with-an-active-port.patch
patching file rpc/rpc-transport/rdma/src/rdma.c
Hunk #5 FAILED at 3911.
Hunk #6 succeeded at 3989 (offset -1 lines).
1 out of 6 hunks FAILED -- saving rejects to file rpc/rpc-transport/rdma/src/rdma.c.rej

I then ran:

# ./configure --prefix=<path>
# make

and make gave me this error:

rdma.c:3972: error: too few arguments to function 'rdma_get_device'
make[5]: *** [rdma.lo] Error 1
make[5]: Leaving directory `/root/glusterfs-3.2.0/rpc/rpc-transport/rdma/src'
make[4]: *** [all-recursive] Error 1
make[4]: Leaving directory `/root/glusterfs-3.2.0/rpc/rpc-transport/rdma'
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory `/root/glusterfs-3.2.0/rpc/rpc-transport'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/root/glusterfs-3.2.0/rpc'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/glusterfs-3.2.0'
make: *** [all] Error 2
This bug will be automatically fixed if we use rdma-cm for connection establishment, since rdma-cm finds an active device/port for us.
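To illustrate why rdma-cm sidesteps device selection entirely, here is a minimal standalone sketch (not GlusterFS code; the peer address and port are hypothetical, and it assumes librdmacm is available; compile with -lrdmacm):

#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <rdma/rdma_cma.h>

int main(void)
{
    struct rdma_cm_id *id;
    struct sockaddr_in dst;

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(24008);                      /* hypothetical port */
    inet_pton(AF_INET, "192.168.1.2", &dst.sin_addr); /* hypothetical peer */

    /* a NULL event channel puts the cm_id into synchronous mode */
    if (rdma_create_id(NULL, &id, NULL, RDMA_PS_TCP))
        return 1;

    /* rdma_resolve_addr() binds the cm_id to whichever local
     * device/port has a route to dst, so a PORT_DOWN HCA such as
     * mthca1 is never chosen and no device-name option is needed */
    if (rdma_resolve_addr(id, NULL, (struct sockaddr *)&dst, 2000))
        return 1;

    printf("selected device: %s\n", id->verbs->device->name);

    rdma_destroy_id(id);
    return 0;
}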
*** This bug has been marked as a duplicate of bug 3319 ***
Since rdma-cm might not be integrated into 3.2 and 3.1, a patch fixing this is needed.
Planning to keep the 3.4.x branch as an "internal enhancements" release without any new features, so moving these bugs to the 3.4.0 target milestone.
'rdma-cm' will make it only into the 3.3.x branch. Raghavendra has sent a patch to handle this case in release-3.2 at http://review.gluster.com/483
This is the priority for the immediate future (before the 3.3.0 GA release). Will bump the priority up once we take up RDMA-related tasks.
*** This bug has been marked as a duplicate of bug 765051 ***