Description of problem:
We have a 5-brick, InfiniBand-connected distributed Gluster setup running 3.2.5. Sometimes glusterfsd or glusterd on one of the bricks stops responding, and users can then no longer access some of the files. Running /etc/init.d/glusterd restart seems to fix the issue (this restarts both glusterd and glusterfsd).

Here are some errors from etc-glusterfs-glusterd.vol.log on the brick where glusterd or glusterfsd stopped responding:

[2012-03-27 13:07:27.913560] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (160.129.138.235:991)
[2012-03-27 13:07:27.975652] I [glusterd-handler.c:2693:glusterd_handle_cluster_unlock] 0-glusterd: Received UNLOCK from uuid: 23ad1eee-3a2f-4481-9896-1ff35ba8bbc3
[2012-03-27 13:07:27.975714] I [glusterd-handler.c:2671:glusterd_op_unlock_send_resp] 0-glusterd: Responded to unlock, ret: 0
[2012-03-27 13:07:28.183484] E [rdma.c:4468:rdma_event_handler] 0-rpc-transport/rdma: rdma.management: pollin received on tcp socket (peer: 10.2.178.24:968) after handshake is complete
.
. Several of the messages from above
.
[2012-04-02 01:05:51.590620] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (127.0.0.1:1023)
[2012-04-02 01:05:51.645985] W [socket.c:1494:__socket_proto_state_machine] 0-management: reading from socket failed. Error (Transport endpoint is not connected), peer (/tmp/9263d7875a61e0b2671a8ca2095c9492.socket)
[2012-04-02 01:05:52.221546] E [socket.c:2080:socket_connect] 0-management: connection attempt failed (Connection refused)

and the log from the client (tons of these warnings/errors):

[2012-04-02 10:42:01.978360] W [client3_1-fops.c:2606:client3_1_lookup] 0-pirdist-client-3: failed to send the fop: Transport endpoint is not connected
[2012-04-02 10:42:01.982460] W [client3_1-fops.c:5253:client3_1_readdirp] 0-pirdist-client-3: (1): failed to get fd ctx. EBADFD
[2012-04-02 10:42:01.982495] W [client3_1-fops.c:5317:client3_1_readdirp] 0-pirdist-client-3: failed to send the fop: File descriptor in bad state
[2012-04-02 10:42:03.970684] E [rdma.c:4417:tcp_connect_finish] 0-pirdist-client-3: tcp connect to 10.2.178.27:24010 failed (Connection refused)

Version-Release number of selected component (if applicable):
3.2.5

How reproducible:
Happens every few days or every other week, depending on glusterfs load. Higher I/O load may be the trigger, but that is not confirmed.

Steps to Reproduce:
1. Set up a 5-brick distributed, InfiniBand-connected glusterfs volume.
2. Set up some clients that connect via InfiniBand and others that connect via TCP/IP.
3. Generate I/O load, e.g. using bonnie or iozone, mainly from the InfiniBand-connected clients but from the TCP/IP-connected ones as well.

Actual results:
glusterd and/or glusterfsd stops responding on one of the bricks.

Expected results:
glusterd and/or glusterfsd should not stop responding. If for some reason it does, there should perhaps be some kind of watchdog process that restarts it on the brick (see the illustrative sketch under Additional info below).

Additional info:
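As an illustration only of the watchdog idea suggested under Expected results (not part of the original report or of GlusterFS), a minimal sketch could look like the C program below. The process names and the init script path are taken from this report; everything else (pidof availability, the 60-second interval) is an assumption:

    /* watchdog.c - hypothetical sketch: restart the gluster daemons if either dies.
       Assumes pidof(8) is available and the init script path given in the report. */
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        for (;;) {
            /* pidof exits non-zero when no process with that name is running */
            int glusterd_down   = system("pidof glusterd > /dev/null 2>&1");
            int glusterfsd_down = system("pidof glusterfsd > /dev/null 2>&1");
            if (glusterd_down != 0 || glusterfsd_down != 0)
                system("/etc/init.d/glusterd restart");
            sleep(60);  /* check once a minute */
        }
        return 0;
    }

Note that a liveness check based only on pidof would not catch the "running but not responding" state described above; a real watchdog would also have to probe the daemons (for example over the management port) before deciding to restart.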
Moving to component rdma, as the volumes are of type RDMA.
Looks like a readlink buffer issue. Should be fixed in the first update.
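For readers unfamiliar with the term, a "readlink buffer issue" usually refers to the classic readlink(2) pitfall: the call does not NUL-terminate the buffer, and an undersized buffer silently truncates the link target. The snippet below is purely illustrative of that class of bug and its usual fix; it is not the affected GlusterFS code:

    /* Hypothetical illustration of the readlink(2) pitfall; not GlusterFS code. */
    #include <limits.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[PATH_MAX];
        /* /proc/self/exe is just a convenient symlink to resolve */
        ssize_t len = readlink("/proc/self/exe", buf, sizeof(buf) - 1);
        if (len < 0) {
            perror("readlink");
            return 1;
        }
        buf[len] = '\0';  /* readlink() never NUL-terminates; omitting this is the usual bug */
        printf("%s\n", buf);
        return 0;
    }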
This bug is not seen on the current master branch (which will be branched as RHS 2.1.0 soon). Before considering it for a fix, we want to make sure the bug still exists on RHS servers. If it cannot be reproduced, we would like to close this.
On the master branch (glusterfs-3.4.0qa6).
Moving out of Big Bend, since RDMA support is not available in Big Bend (2.1).
As per comment #8, moving this BZ out of Denali.