Description of problem:
While rebalance is in progress, adding a new peer to the cluster causes one of the nodes to crash and dump a core.

Details:
After adding a brick, while rebalance was in progress, a peer probe was done for a new node. The peer probe was successful, but after that gluster volume commands on the nodes hung. It was observed that one of the nodes in the cluster crashed and dumped a core. The new node added to the cluster was also running an older version of glusterfs than the other two nodes; the gluster version on this new node was upgraded and glusterd was restarted.

volume logs:
[2013-12-06 11:09:36.019225] E [glusterd-utils.c:3801:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/8f7a3961f7bf2a66e38daec99628ffa1.socket error: No such file or directory
[2013-12-06 11:09:36.031131] I [rpc-clnt.c:976:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2013-12-06 11:09:36.031252] I [socket.c:3505:socket_init] 0-management: SSL support is NOT enabled
[2013-12-06 11:09:36.031273] I [socket.c:3520:socket_init] 0-management: using system polling thread
[2013-12-06 11:09:41.091829] I [rpc-clnt.c:976:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2013-12-06 11:09:41.091960] I [socket.c:3505:socket_init] 0-management: SSL support is NOT enabled
[2013-12-06 11:09:41.091980] I [socket.c:3520:socket_init] 0-management: using system polling thread
[2013-12-06 11:09:41.092438] I [glusterd-handshake.c:556:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 2
[2013-12-06 11:09:41.115784] I [socket.c:2235:socket_event_handler] 0-transport: disconnecting now
[2013-12-06 11:09:41.115883] I [socket.c:2235:socket_event_handler] 0-transport: disconnecting now

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.44.1u2rhs-1.el6rhs.x86_64

How reproducible:
Tried it once.
Steps to Reproduce:
1. Create a volume, mount it via SMB on a Windows client, and run I/O.
2. Add a brick and start rebalance.
3. While rebalance is in progress, do a peer probe to a new node.

The peer probe is successful, but one of the nodes crashes.

Actual results:
One of the nodes in the cluster crashed and dumped a core.

Expected results:
The node should not crash.

Additional info:
Will update the logs and sosreports.
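The steps above can be sketched as the following gluster CLI sequence. Hostnames (server1..server3), the volume name (testvol), and brick paths are placeholders, not values from the original setup; this must be run on a live cluster with at least two peered nodes.

```shell
# Reproduction sketch (names and paths are placeholders).
# Run from one of the existing, already-peered cluster nodes.

# 1. Create and start a volume, then mount it from a client and run I/O
#    (the original report used an SMB mount on a Windows client).
gluster volume create testvol server1:/bricks/b1 server2:/bricks/b2
gluster volume start testvol

# 2. Add a brick and start rebalance while client I/O is still running.
gluster volume add-brick testvol server1:/bricks/b3
gluster volume rebalance testvol start

# 3. While rebalance is still in progress, probe a new peer.
#    Per the report, the probe succeeds but an existing node crashes.
gluster peer probe server3
gluster volume rebalance testvol status
```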
Tried the rebalance test and saw the crash again.

Glusterfs version:
glusterfs-fuse-3.4.0.49rhs-1.el6rhs.x86_64
glusterfs-server-3.4.0.49rhs-1.el6rhs.x86_64

[2013-12-17 11:24:10.216412] I [socket.c:3520:socket_init] 0-management: using system polling thread
[2013-12-17 11:24:15.391601] E [glusterd-utils.c:7825:glusterd_volume_rebalance_use_rsp_dict] 0-: failed to get index
[2013-12-17 11:24:15.406033] E [glusterd-utils.c:7825:glusterd_volume_rebalance_use_rsp_dict] 0-: failed to get index
[2013-12-17 11:24:15.426745] E [glusterd-utils.c:7825:glusterd_volume_rebalance_use_rsp_dict] 0-: failed to get index
[2013-12-17 11:26:05.756373] I [glusterd-handshake.c:364:__server_event_notify] 0-: received defrag status updated
[2013-12-17 11:26:05.763349] W [socket.c:522:__socket_rwv] 0-management: readv o

Latest sosreports placed in above location.
Please add doctext for this known issue.
Could you please retry this with the latest patches? There have been a couple of fixes in the 3.4.0.54rhs build that address similar issues. A similar issue: https://bugzilla.redhat.com/show_bug.cgi?id=1024316
I will try it on glusterfs-3.4.0.55rhs-1.el6rhs.x86_64 and update the results.
I tried it on build 33 and was able to reproduce the bug. Here are the details:

Creating directory at /mnt/withreaddir//TestDir0/TestDir2/TestDir2
Creating files in /mnt/withreaddir//TestDir0/TestDir2/TestDir2......
Cannot open file: No such file or directory
flock() on closed filehandle FH at ./CreateDirAndFileTree.pl line 74.
Cannot lock - Bad file descriptor

root.42.178[Jan-08-2014- 6:30:55] >rpm -qa | grep gluster
glusterfs-fuse-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-rdma-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-libs-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-geo-replication-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-api-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-server-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-devel-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-api-devel-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-debuginfo-3.4.0.33rhs-1.el6rhs.x86_64

Analysis as of now: Gluster fails to create/open a file when:
a. The file's hash corresponds to the new brick.
b. The file is not directly under the root (/) of the volume.
c. The folder (or folders) under which the file lies have not yet been created on the new brick.
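The failure mode in the analysis above can be illustrated with a toy model of hash-based file placement. This is not GlusterFS's actual DHT code (which uses a Davies-Meyer hash over per-directory layout ranges); the function names and the md5-mod-N placement are made up for illustration. The point it shows: a file whose name hashes to the newly added brick cannot be created when its parent directory tree does not yet exist on that brick.

```python
import hashlib
import os

def pick_brick(name, bricks):
    # Toy stand-in for DHT placement: hash the basename, pick a brick.
    # GlusterFS really consults per-directory layout ranges; this is
    # only a simplified illustration.
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return bricks[h % len(bricks)]

def create_file(path, bricks):
    """Create 'path' on the brick its basename hashes to.

    Mirrors the bug analysis: if the parent directory has not been
    created on that brick yet, open() raises FileNotFoundError,
    i.e. "Cannot open file: No such file or directory".
    """
    brick = pick_brick(os.path.basename(path), bricks)
    target = os.path.join(brick, path.lstrip("/"))
    with open(target, "w") as f:  # fails if parent dir is missing
        f.write("data")
    return brick
```

With three brick directories where only the first two contain the parent folder, any filename that hashes to the third (new) brick fails to create until that folder is made on the new brick as well, matching conditions (a) and (c) above.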
The above analysis is for BZ 1049181.
Tried it on glusterfs-3.4.0.55rhs-1.el6rhs.x86_64: a core is no longer generated, but the failures seen while doing rebalance are still present.

[2014-01-20 06:16:57.077109] E [glusterd-utils.c:4007:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/fdc31c62f15c054be9507d58711f3d14.socket error: No such file or directory
[2014-01-20 06:16:57.079450] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=2236 max=0 total=0
[2014-01-20 06:16:57.079473] I [rpc-clnt.c:977:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-01-20 06:17:22.171858] E [glusterd-utils.c:7964:glusterd_volume_rebalance_use_rsp_dict] 0-: failed to get index
[2014-01-20 06:17:22.189004] E [glusterd-utils.c:7964:glusterd_volume_rebalance_use_rsp_dict] 0-: failed to get index
[2014-01-20 06:17:22.307779] E [glusterd-utils.c:7964:glusterd_volume_rebalance_use_rsp_dict] 0-: failed to get index

Sosreports are updated.
Please review the edited doc text and sign off.
Doc text looks fine
Cloning to 3.1. To be fixed in a future release.