| Summary: | [DHT-Rebalance]: with a few brick processes down, the rebalance process isn't killed even after issuing rebalance stop | |||
|---|---|---|---|---|
| Product: | Red Hat Gluster Storage | Reporter: | krishnaram Karthick <kramdoss> | |
| Component: | distribute | Assignee: | Susant Kumar Palai <spalai> | |
| Status: | CLOSED ERRATA | QA Contact: | Sweta Anandpara <sanandpa> | |
| Severity: | medium | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | rhgs-3.1 | CC: | asrivast, nbalacha, rgowdapp, rhinduja, sanandpa, spalai | |
| Target Milestone: | --- | Keywords: | ZStream | |
| Target Release: | RHGS 3.1.3 | |||
| Hardware: | All | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | glusterfs-3.7.9-3 | Doc Type: | Bug Fix | |
| Doc Text: |
When the crawler thread for rebalance reaches its queue limit, it sleeps until it receives a signal from the migration threads. However, on a REBALANCE_STOP event the migration threads exited without notifying the crawler thread. This meant that when 'rebalance stop' was issued while the crawler thread was sleeping at its queue limit, the rebalance process did not actually stop. Migration threads now signal the crawler before they exit, and the 'rebalance stop' command works as expected.
|
Story Points: | --- | |
| Clone Of: | ||||
| : | 1327507 (view as bug list) | Environment: | ||
| Last Closed: | 2016-06-23 05:17:23 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Bug Depends On: | ||||
| Bug Blocks: | 1311817 | |||
Description
krishnaram Karthick
2016-04-13 10:10:18 UTC
Will wait for the sos-report. In case the issue is seen again, please take a state dump as well.

During validation of this BZ, I tried to reproduce the issue on build glusterfs-3.7.9-2, following the steps mentioned in the description. I had a 4*2 volume, created 10k files of size 1k, added 4 more bricks (making it 6*2), and started rebalance. I killed one brick of each subvolume and saw connection-refused errors in the rebalance logs:

```
[2016-05-03 06:34:43.186445] W [dict.c:429:dict_set] (-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_lookup_xattr_req_prepare+0xb0) [0x7f0227a542f0] -->/lib64/libglusterfs.so.0(dict_set_str+0x2c) [0x7f0235700c6c] -->/lib64/libglusterfs.so.0(dict_set+0xa6) [0x7f02356feb16] ) 0-dict: !this || !value for key=link-count [Invalid argument]
[2016-05-03 06:34:43.192118] E [socket.c:2279:socket_connect_finish] 0-dist-rep2-client-1: connection to 10.70.35.85:49152 failed (Connection refused)
[2016-05-03 06:34:43.207999] I [dht-rebalance.c:1214:dht_migrate_file] 0-dist-rep2-dht: /1k_files/file778: attempting to move from dist-rep2-replicate-2 to dist-rep2-replicate-4
[2016-05-03 06:34:43.208787] W [dict.c:429:dict_set] (-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_lookup_xattr_req_prepare+0xb0) [0x7f0227a542f0] -->/lib64/libglusterfs.so.0(dict_set_str+0x2c) [0x7f0235700c6c] -->/lib64/libglusterfs.so.0(dict_set+0xa6) [0x7f02356feb16] ) 0-dict: !this || !value for key=link-count [Invalid argument]
[2016-05-03 06:35:13.189547] W [dict.c:429:dict_set] (-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_lookup_xattr_req_prepare+0xb0) [0x7f0227a542f0] -->/lib64/libglusterfs.so.0(dict_set_str+0x2c) [0x7f0235700c6c] -->/lib64/libglusterfs.so.0(dict_set+0xa6) [0x7f02356feb16] ) 0-dict: !this || !value for key=link-count [Invalid argument]
[2016-05-03 06:35:13.249361] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-dist-rep2-client-3: changing port to 49152 (from 0)
[2016-05-03 06:35:13.254546] E [socket.c:2279:socket_connect_finish] 0-dist-rep2-client-3: connection to 10.70.35.13:49152 failed (Connection refused)
[2016-05-03 06:35:13.264568] W [MSGID: 114031] [client-rpc-fops.c:2539:client3_3_lk_cbk] 0-dist-rep2-client-1: remote operation failed [Transport endpoint is not connected]
```

Before I could stop the rebalance operation, it had already completed. I added 4 more bricks (making it an 8*2 volume), started renaming 2k files from the mount point, and then restarted the rebalance operation. Status showed that the operation was in progress. I attempted to stop rebalance and that went through successfully. However, I saw similar errors in the logs:

```
[2016-05-03 06:43:08.413496] E [MSGID: 109016] [dht-rebalance.c:3120:gf_defrag_fix_layout] 0-dist-rep2-dht: Fix layout failed for /1k_files
[2016-05-03 06:43:08.434422] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-dist-rep2-client-4: changing port to 49153 (from 0)
[2016-05-03 06:43:08.440845] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-dist-rep2-client-6: changing port to 49153 (from 0)
[2016-05-03 06:43:08.442697] E [socket.c:2279:socket_connect_finish] 0-dist-rep2-client-4: connection to 10.70.35.210:49153 failed (Connection refused)
[2016-05-03 06:43:08.445697] E [socket.c:2279:socket_connect_finish] 0-dist-rep2-client-6: connection to 10.70.35.137:49153 failed (Connection refused)
[2016-05-03 06:43:08.818536] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-dist-rep2-client-9: changing port to 49154 (from 0)
[2016-05-03 06:43:08.823343] E [socket.c:2279:socket_connect_finish] 0-dist-rep2-client-9: connection to 10.70.35.85:49154 failed (Connection refused)
```

Is there any step that I am missing? Karthick, do you remember doing anything else that was preventing you from stopping the started rebalance process?

To reproduce this issue, the rebalance stop command has to be issued exactly when the crawler has reached its queue limit and gone to sleep.

Redid the steps as mentioned, invoking gdb this time with Susant's help. Waited for the crawler to reach its queue limit of 500 and issued 'gluster v rebalance stop'. The rebalance process did not get killed on the setup with the 3.7.9-2 build, but was killed on the setup with the 3.7.9-3 build. Moving this BZ to verified in 3.1.3.

Moving the QA contact, as I am answerable for the verification of this bug.

Yes Laura, the doc text is fine.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240