Bug 1039533 - SMB:Glusterd on one node crashes while doing add-brick operation followed by rebalance.
Status: CLOSED DEFERRED
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: distribute
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assigned To: Nithya Balachandran
QA Contact: surabhi
Whiteboard: dht-add-brick
Depends On:
Blocks: 1035040 1286074
Reported: 2013-12-09 06:17 EST by surabhi
Modified: 2015-11-27 05:36 EST
CC List: 10 users

See Also:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
While rebalance is in progress, adding a brick to the cluster logs the error message "failed to get index" in the gluster log file.
Story Points: ---
Clone Of:
Clones: 1286074
Environment:
Last Closed: 2015-11-27 05:35:59 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description surabhi 2013-12-09 06:17:49 EST
Description of problem:
While rebalance is in progress, adding a new peer to the cluster causes one of the nodes to crash and dump a core.

Details:
After adding a brick, while rebalance was in progress, a peer probe was done for a new node. The peer probe was successful, but after that gluster volume commands on the nodes hung. It was observed that one of the nodes in the cluster had crashed and dumped a core.

Also, the new node that was added to the cluster was running an older version of glusterfs than the other two nodes. The gluster packages on this node were upgraded and glusterd was restarted.
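
A minimal sketch of that upgrade step, assuming yum-managed packages on el6 (the package glob and init-script name are assumptions):

yum update 'glusterfs*'      # bring the new node up to the same build as the rest of the cluster
service glusterd restart     # restart the management daemon
rpm -qa | grep gluster       # verify the installed versions now match the other nodes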

volume logs:
[2013-12-06 11:09:36.019225] E [glusterd-utils.c:3801:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/8f7a3961f7bf2a66e38daec99628ffa1.socket error: No such file or directory
[2013-12-06 11:09:36.031131] I [rpc-clnt.c:976:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2013-12-06 11:09:36.031252] I [socket.c:3505:socket_init] 0-management: SSL support is NOT enabled
[2013-12-06 11:09:36.031273] I [socket.c:3520:socket_init] 0-management: using system polling thread
[2013-12-06 11:09:41.091829] I [rpc-clnt.c:976:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2013-12-06 11:09:41.091960] I [socket.c:3505:socket_init] 0-management: SSL support is NOT enabled
[2013-12-06 11:09:41.091980] I [socket.c:3520:socket_init] 0-management: using system polling thread
[2013-12-06 11:09:41.092438] I [glusterd-handshake.c:556:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 2
[2013-12-06 11:09:41.115784] I [socket.c:2235:socket_event_handler] 0-transport: disconnecting now
[2013-12-06 11:09:41.115883] I [socket.c:2235:socket_event_handler] 0-transport: disconnecting now



Version-Release number of selected component (if applicable):
glusterfs-3.4.0.44.1u2rhs-1.el6rhs.x86_64

How reproducible:
Tried it once.

Steps to Reproduce:
1. Create a volume, mount it via SMB on a Windows client, and run I/O.
2. Add a brick and start rebalance.
3. While rebalance is in progress, do a peer probe to a new node.
The peer probe succeeds, but one of the nodes crashes. A command sketch of these steps follows below.
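
A minimal command sketch of the steps above, assuming hypothetical hostnames (node1, node2, node3) and brick paths; the SMB mount and Windows-side I/O are not shown:

gluster volume create testvol node1:/bricks/b1 node2:/bricks/b2
gluster volume start testvol
# ... mount the volume over SMB from a Windows client and start I/O ...
gluster volume add-brick testvol node1:/bricks/b3    # step 2: add a brick
gluster volume rebalance testvol start               # start rebalance
gluster peer probe node3                             # step 3: probe a new node mid-rebalance
gluster volume rebalance testvol status              # watch for hung commands / a crashed node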

Actual results:
One of the nodes in the cluster crashed and dumped a core.

Expected results:
The node should not crash.

Additional info:
Will update the logs and sosreports.
Comment 3 surabhi 2013-12-17 06:41:19 EST
Tried the rebalance test and saw the crash again.

Glusterfs version: 
glusterfs-fuse-3.4.0.49rhs-1.el6rhs.x86_64
glusterfs-server-3.4.0.49rhs-1.el6rhs.x86_64

[2013-12-17 11:24:10.216412] I [socket.c:3520:socket_init] 0-management: using system polling thread
[2013-12-17 11:24:15.391601] E [glusterd-utils.c:7825:glusterd_volume_rebalance_use_rsp_dict] 0-: failed to get index
[2013-12-17 11:24:15.406033] E [glusterd-utils.c:7825:glusterd_volume_rebalance_use_rsp_dict] 0-: failed to get index
[2013-12-17 11:24:15.426745] E [glusterd-utils.c:7825:glusterd_volume_rebalance_use_rsp_dict] 0-: failed to get index
[2013-12-17 11:26:05.756373] I [glusterd-handshake.c:364:__server_event_notify] 0-: received defrag status updated
[2013-12-17 11:26:05.763349] W [socket.c:522:__socket_rwv] 0-management: readv o

Latest Sosreports placed in above location.
Comment 4 Shalaka 2014-01-03 03:59:33 EST
Please add doctext for this known issue.
Comment 5 Poornima G 2014-01-06 05:53:56 EST
Could you please retry this with the latest patches? 
There have been a couple of fixes in the 3.4.0.54rhs build that address similar issues.

A similar issue: https://bugzilla.redhat.com/show_bug.cgi?id=1024316
Comment 6 surabhi 2014-01-08 01:26:24 EST
I will try it on glusterfs-3.4.0.55rhs-1.el6rhs.x86_64 and update the results.
Comment 7 Raghavendra Talur 2014-01-08 02:37:42 EST
I tried it on build 33 and was able to reproduce the bug on it.

Here are the details:
Creating directory at /mnt/withreaddir//TestDir0/TestDir2/TestDir2
Creating files in /mnt/withreaddir//TestDir0/TestDir2/TestDir2......
Cannot open file: No such file or directory
flock() on closed filehandle FH at ./CreateDirAndFileTree.pl line 74.
Cannot lock - Bad file descriptor


root@10.70.42.178[Jan-08-2014- 6:30:55] >rpm -qa | grep gluster
glusterfs-fuse-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-rdma-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-libs-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-geo-replication-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-api-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-server-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-devel-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-api-devel-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-debuginfo-3.4.0.33rhs-1.el6rhs.x86_64



Analysis as of now:

Gluster fails to create/open a file when (a check for condition (a) is sketched after this list):
a. The file's hash corresponds to the new brick.
b. The file is not directly under the root (/) of the volume.
c. The folder (or folders) under which the file lies has not yet been created on the new brick.
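
One way to check condition (a) from a FUSE mount is the trusted.glusterfs.pathinfo extended attribute, which reports the backend brick(s) serving a file; the mount point and file name here are illustrative:

# prints the host:/brick path that holds the file's data
getfattr -n trusted.glusterfs.pathinfo /mnt/withreaddir/TestDir0/TestDir2/TestDir2/somefile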
Comment 8 surabhi 2014-01-08 05:27:35 EST
The above analysis is for BZ 1049181.
Comment 9 surabhi 2014-01-20 01:23:00 EST
Tried it on glusterfs-3.4.0.55rhs-1.el6rhs.x86_64:

A core is no longer generated, but the failures seen while doing rebalance are still present.

[2014-01-20 06:16:57.077109] E [glusterd-utils.c:4007:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/fdc31c62f15c054be9507d58711f3d14.socket error: No such file or directory
[2014-01-20 06:16:57.079450] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=2236 max=0 total=0
[2014-01-20 06:16:57.079473] I [rpc-clnt.c:977:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-01-20 06:17:22.171858] E [glusterd-utils.c:7964:glusterd_volume_rebalance_use_rsp_dict] 0-: failed to get index
[2014-01-20 06:17:22.189004] E [glusterd-utils.c:7964:glusterd_volume_rebalance_use_rsp_dict] 0-: failed to get index
[2014-01-20 06:17:22.307779] E [glusterd-utils.c:7964:glusterd_volume_rebalance_use_rsp_dict] 0-: failed to get index

Sosreports are updated.
Comment 10 Shalaka 2014-02-06 05:53:51 EST
Please review the edited doc text and sign off.
Comment 11 Poornima G 2014-02-18 05:00:08 EST
Doc text looks fine
Comment 12 Susant Kumar Palai 2015-11-27 05:35:59 EST
Cloning to 3.1. To be fixed in a future release.
