Bug 1039533 - SMB: Glusterd on one node crashes while doing add-brick operation followed by rebalance.
Summary: SMB: Glusterd on one node crashes while doing add-brick operation followed by rebalance.
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Nithya Balachandran
QA Contact: surabhi
URL:
Whiteboard: dht-add-brick
Depends On:
Blocks: 1035040 1286074
 
Reported: 2013-12-09 11:17 UTC by surabhi
Modified: 2015-11-27 10:36 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
While rebalance is in progress, adding a brick to the cluster results in a "failed to get index" error message in the gluster log file.
Clone Of:
Clones: 1286074
Environment:
Last Closed: 2015-11-27 10:35:59 UTC
Embargoed:



Description surabhi 2013-12-09 11:17:49 UTC
Description of problem:
While rebalance is in progress, adding a new peer to the cluster causes one of the nodes to crash and dump a core.

Details:
After adding a brick, while rebalance was in progress, a peer probe was done for a new node. The peer probe was successful, but after that the gluster volume commands on the nodes were hanging. It was observed that one of the nodes in the cluster had crashed and dumped a core.

Also, the new node that was added to the cluster was running an older version of glusterfs than the other two nodes. The gluster version on this new node was upgraded and glusterd was restarted.
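
A minimal sketch of that recovery step, assuming the new node is an el6-based RHS machine with the newer glusterfs packages already available via yum (the hostname and repository setup are not shown and are assumptions):

# on the newly probed node
yum update 'glusterfs*'        # bring the glusterfs packages up to the cluster's version
service glusterd restart       # restart the management daemon
rpm -qa | grep gluster         # verify the installed versions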

volume logs:
[2013-12-06 11:09:36.019225] E [glusterd-utils.c:3801:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/8f7a3961f7bf2a66e38daec99628ffa1.socket error: No such file or directory
[2013-12-06 11:09:36.031131] I [rpc-clnt.c:976:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2013-12-06 11:09:36.031252] I [socket.c:3505:socket_init] 0-management: SSL support is NOT enabled
[2013-12-06 11:09:36.031273] I [socket.c:3520:socket_init] 0-management: using system polling thread
[2013-12-06 11:09:41.091829] I [rpc-clnt.c:976:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2013-12-06 11:09:41.091960] I [socket.c:3505:socket_init] 0-management: SSL support is NOT enabled
[2013-12-06 11:09:41.091980] I [socket.c:3520:socket_init] 0-management: using system polling thread
[2013-12-06 11:09:41.092438] I [glusterd-handshake.c:556:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 2
[2013-12-06 11:09:41.115784] I [socket.c:2235:socket_event_handler] 0-transport: disconnecting now
[2013-12-06 11:09:41.115883] I [socket.c:2235:socket_event_handler] 0-transport: disconnecting now



Version-Release number of selected component (if applicable):
glusterfs-3.4.0.44.1u2rhs-1.el6rhs.x86_64

How reproducible:
Tried it once.

Steps to Reproduce:
1. Create a volume, mount it via SMB on a Windows client, and run I/O.
2. Add a brick and start rebalance.
3. While rebalance is in progress, do a peer probe to a new node.
The peer probe is successful, but one of the nodes crashes (a command-level sketch of these steps follows).
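
A command-level sketch of the steps above, assuming a plain distribute volume named testvol, bricks under /rhs/brick1 and /rhs/brick2, and hostnames server1/server2/server3 (all of these names are placeholders); the Samba export and the Windows-side mount and I/O are omitted:

# step 1: create and start the volume (run on server1)
gluster volume create testvol server1:/rhs/brick1/testvol server2:/rhs/brick1/testvol
gluster volume start testvol

# step 2: add a brick and start rebalance
gluster volume add-brick testvol server1:/rhs/brick2/testvol
gluster volume rebalance testvol start

# step 3: while rebalance is still running, probe a new node and watch the cluster
gluster peer probe server3
gluster volume rebalance testvol status
gluster peer status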

Actual results:
One of the nodes in the cluster crashed and dumped a core.

Expected results:
The node should not crash.

Additional info:
Will update the logs and sos reports.

Comment 3 surabhi 2013-12-17 11:41:19 UTC
Tried the rebalance test and saw the crash again.

Glusterfs version: 
glusterfs-fuse-3.4.0.49rhs-1.el6rhs.x86_64
glusterfs-server-3.4.0.49rhs-1.el6rhs.x86_64

[2013-12-17 11:24:10.216412] I [socket.c:3520:socket_init] 0-management: using system polling thread
[2013-12-17 11:24:15.391601] E [glusterd-utils.c:7825:glusterd_volume_rebalance_use_rsp_dict] 0-: failed to get index
[2013-12-17 11:24:15.406033] E [glusterd-utils.c:7825:glusterd_volume_rebalance_use_rsp_dict] 0-: failed to get index
[2013-12-17 11:24:15.426745] E [glusterd-utils.c:7825:glusterd_volume_rebalance_use_rsp_dict] 0-: failed to get index
[2013-12-17 11:26:05.756373] I [glusterd-handshake.c:364:__server_event_notify] 0-: received defrag status updated
[2013-12-17 11:26:05.763349] W [socket.c:522:__socket_rwv] 0-management: readv o

Latest sosreports are placed in the above location.

Comment 4 Shalaka 2014-01-03 08:59:33 UTC
Please add doc text for this known issue.

Comment 5 Poornima G 2014-01-06 10:53:56 UTC
Could you please retry this with the latest patches? 
There have been a couple of fixes, included in the 3.4.0.54rhs build, that address similar issues.

A similar issue: https://bugzilla.redhat.com/show_bug.cgi?id=1024316

Comment 6 surabhi 2014-01-08 06:26:24 UTC
I will try it on glusterfs-3.4.0.55rhs-1.el6rhs.x86_64 and update the results.

Comment 7 Raghavendra Talur 2014-01-08 07:37:42 UTC
I tried it on build 33 and was able to reproduce the bug on it.

Here are the details:
Creating directory at /mnt/withreaddir//TestDir0/TestDir2/TestDir2
Creating files in /mnt/withreaddir//TestDir0/TestDir2/TestDir2......
Cannot open file: No such file or directory
flock() on closed filehandle FH at ./CreateDirAndFileTree.pl line 74.
Cannot lock - Bad file descriptor


root.42.178[Jan-08-2014- 6:30:55] >rpm -qa | grep gluster
glusterfs-fuse-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-rdma-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-libs-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-geo-replication-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-api-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-server-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-devel-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-api-devel-3.4.0.33rhs-1.el6rhs.x86_64
glusterfs-debuginfo-3.4.0.33rhs-1.el6rhs.x86_64



Analysis as of now:

Gluster fails to create/open a file when:
a. The file's hash corresponds to the new brick.
b. The file is not directly under the root (/) of the volume.
c. The folder (or folders) under which the file lies has not yet been created on the new brick (a sketch of how to check this on the new brick follows).
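
A sketch of how condition (c) could be checked, assuming the new brick sits at /rhs/brick2/testvol on the node it was added from and using the TestDir0/TestDir2 path from the run above (the brick path and volume name are placeholders):

# does the parent directory tree exist on the newly added brick?
ls -ld /rhs/brick2/testvol/TestDir0/TestDir2

# if it exists, dump its xattrs; a missing trusted.glusterfs.dht entry means
# no hash range has been assigned to this directory on the new brick yet
getfattr -d -m . -e hex /rhs/brick2/testvol/TestDir0/TestDir2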

Comment 8 surabhi 2014-01-08 10:27:35 UTC
The above analysis is for BZ 1049181.

Comment 9 surabhi 2014-01-20 06:23:00 UTC
Tried it on glusterfs-3.4.0.55rhs-1.el6rhs.x86_64:

A core is no longer generated, but the failures seen while doing rebalance are still present.

[2014-01-20 06:16:57.077109] E [glusterd-utils.c:4007:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/fdc31c62f15c054be9507d58711f3d14.socket error: No such file or directory
[2014-01-20 06:16:57.079450] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=2236 max=0 total=0
[2014-01-20 06:16:57.079473] I [rpc-clnt.c:977:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-01-20 06:17:22.171858] E [glusterd-utils.c:7964:glusterd_volume_rebalance_use_rsp_dict] 0-: failed to get index
[2014-01-20 06:17:22.189004] E [glusterd-utils.c:7964:glusterd_volume_rebalance_use_rsp_dict] 0-: failed to get index
[2014-01-20 06:17:22.307779] E [glusterd-utils.c:7964:glusterd_volume_rebalance_use_rsp_dict] 0-: failed to get index

Sosreports are updated.

Comment 10 Shalaka 2014-02-06 10:53:51 UTC
Please review the edited doc text and sign off.

Comment 11 Poornima G 2014-02-18 10:00:08 UTC
Doc text looks fine

Comment 12 Susant Kumar Palai 2015-11-27 10:35:59 UTC
Cloning to 3.1. To be fixed in a future release.

