Bug 1326663 - [DHT-Rebalance]: with a few brick processes down, the rebalance process isn't killed even after stopping rebalance
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: rhgs-3.1
Hardware: All
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: RHGS 3.1.3
Assignee: Susant Kumar Palai
QA Contact: Sweta Anandpara
URL:
Whiteboard:
Depends On:
Blocks: 1311817
 
Reported: 2016-04-13 10:10 UTC by krishnaram Karthick
Modified: 2020-04-15 14:26 UTC
CC: 6 users

Fixed In Version: glusterfs-3.7.9-3
Doc Type: Bug Fix
Doc Text:
When the crawler thread for rebalance reaches its queue limit, it sleeps until it receives a signal from the migration threads. However, the migration threads exited on a REBALANCE_STOP event without notifying the crawler thread. This meant that if 'rebalance stop' was issued while the crawler thread was at its queue limit, the rebalance process did not actually stop. The migration threads now signal the crawler before they exit, and the 'rebalance stop' command works as expected.
Clone Of:
Clones: 1327507
Environment:
Last Closed: 2016-06-23 05:17:23 UTC




Links
System ID: Red Hat Product Errata RHBA-2016:1240
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Gluster Storage 3.1 Update 3
Last Updated: 2016-06-23 08:51:28 UTC

Description krishnaram Karthick 2016-04-13 10:10:18 UTC
Description of problem:
On a 4x2 volume, when one of the brick processes in each sub-volume is brought down, 'rebalance stop' does not actually stop the rebalance process. The following error messages are continuously thrown in the rebalance logs.

snippet of rebalance-logs:
[2016-04-13 09:53:28.534599] E [socket.c:2279:socket_connect_finish] 0-testvol-client-0: connection to 10.70.47.90:49152 failed (Connection refused)
[2016-04-13 09:53:28.541054] E [socket.c:2279:socket_connect_finish] 0-testvol-client-6: connection to 10.70.47.9:49153 failed (Connection refused)
[2016-04-13 09:53:29.535727] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-testvol-client-2: changing port to 49152 (from 0)
[2016-04-13 09:53:29.543157] E [socket.c:2279:socket_connect_finish] 0-testvol-client-2: connection to 10.70.47.9:49152 failed (Connection refused)
[2016-04-13 09:53:30.541661] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-testvol-client-4: changing port to 49153 (from 0)
[2016-04-13 09:53:30.548590] E [socket.c:2279:socket_connect_finish] 0-testvol-client-4: connection to 10.70.47.90:49153 failed (Connection refused)
[2016-04-13 09:53:31.549488] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-testvol-client-0: changing port to 49152 (from 0)
[2016-04-13 09:53:31.556892] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-testvol-client-6: changing port to 49153 (from 0)
[2016-04-13 09:53:31.562246] E [socket.c:2279:socket_connect_finish] 0-testvol-client-0: connection to 10.70.47.90:49152 failed (Connection refused)
[2016-04-13 09:53:31.568492] E [socket.c:2279:socket_connect_finish] 0-testvol-client-6: connection to 10.70.47.9:49153 failed (Connection refused)


Version-Release number of selected component (if applicable):
glusterfs-server-3.7.9-1.el7rhgs.x86_64

How reproducible:
1/1; yet to check whether it is consistently reproducible

Steps to Reproduce:
1. On a 4x2 dist-rep volume, add 4 more bricks
2. Create 10k files of 1 KB each
3. Trigger the rebalance process
4. Kill one brick process from each sub-volume among the existing bricks [so one brick process in each sub-volume remains up]
5. Rename a few files
6. After a while, stop the rebalance process
7. Check whether the rebalance process has actually stopped

Actual results:
The rebalance process is still running.

Expected results:
The rebalance process should have stopped.

Additional info:
gluster v info
 
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 02427025-adcf-48a2-ac58-ae494839e9f8
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.47.90:/bricks/brick0/ct
Brick2: 10.70.47.105:/bricks/brick0/ct
Brick3: 10.70.47.9:/bricks/brick0/ct
Brick4: 10.70.46.94:/bricks/brick0/ct
Brick5: 10.70.47.90:/bricks/brick1/ct
Brick6: 10.70.47.105:/bricks/brick1/ct
Brick7: 10.70.47.9:/bricks/brick1/ct
Brick8: 10.70.46.94:/bricks/brick1/ct
Brick9: 10.70.47.90:/bricks/brick2/ct
Brick10: 10.70.47.105:/bricks/brick2/ct
Brick11: 10.70.47.9:/bricks/brick2/ct
Brick12: 10.70.46.94:/bricks/brick2/ct
Options Reconfigured:
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on

Sosreports will be attached shortly.

Comment 2 Susant Kumar Palai 2016-04-15 06:19:43 UTC
Will wait for the sosreport. If the issue is seen again, please take a statedump as well.

Comment 3 Susant Kumar Palai 2016-04-19 07:05:25 UTC
http://review.gluster.org/#/c/14004/

Comment 6 Raghavendra G 2016-04-28 06:33:04 UTC
https://code.engineering.redhat.com/gerrit/73093

Comment 8 Sweta Anandpara 2016-05-03 06:50:54 UTC
While validating this BZ, I tried to reproduce the issue on build glusterfs-3.7.9-2, following the steps mentioned in the description.

I had a 4*2 volume, created 10k files of 1 KB each, added 4 more bricks (making it 6*2), and started rebalance. I killed one brick of each sub-volume and saw connection-refused errors in the rebalance logs.

[2016-05-03 06:34:43.186445] W [dict.c:429:dict_set] (-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_lookup_xattr_req_prepare+0xb0) [0x7f0227a542f0] -->/lib64/libglusterfs.so.0(dict_set_str+0x2c) [0x7f0235700c6c] -->/lib64/libglusterfs.so.0(dict_set+0xa6) [0x7f02356feb16] ) 0-dict: !this || !value for key=link-count [Invalid argument]
[2016-05-03 06:34:43.192118] E [socket.c:2279:socket_connect_finish] 0-dist-rep2-client-1: connection to 10.70.35.85:49152 failed (Connection refused)
[2016-05-03 06:34:43.207999] I [dht-rebalance.c:1214:dht_migrate_file] 0-dist-rep2-dht: /1k_files/file778: attempting to move from dist-rep2-replicate-2 to dist-rep2-replicate-4
[2016-05-03 06:34:43.208787] W [dict.c:429:dict_set] (-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_lookup_xattr_req_prepare+0xb0) [0x7f0227a542f0] -->/lib64/libglusterfs.so.0(dict_set_str+0x2c) [0x7f0235700c6c] -->/lib64/libglusterfs.so.0(dict_set+0xa6) [0x7f02356feb16] ) 0-dict: !this || !value for key=link-count [Invalid argument]


[2016-05-03 06:35:13.189547] W [dict.c:429:dict_set] (-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_lookup_xattr_req_prepare+0xb0) [0x7f0227a542f0] -->/lib64/libglusterfs.so.0(dict_set_str+0x2c) [0x7f0235700c6c] -->/lib64/libglusterfs.so.0(dict_set+0xa6) [0x7f02356feb16] ) 0-dict: !this || !value for key=link-count [Invalid argument]
[2016-05-03 06:35:13.249361] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-dist-rep2-client-3: changing port to 49152 (from 0)
[2016-05-03 06:35:13.254546] E [socket.c:2279:socket_connect_finish] 0-dist-rep2-client-3: connection to 10.70.35.13:49152 failed (Connection refused)
[2016-05-03 06:35:13.264568] W [MSGID: 114031] [client-rpc-fops.c:2539:client3_3_lk_cbk] 0-dist-rep2-client-1: remote operation failed [Transport endpoint is not connected]


Before I could stop the rebalance operation, it had already completed. I added 4 more bricks (making it an 8*2 volume), started renaming 2k files from the mountpoint, and then restarted the rebalance operation. Status showed the operation in progress. I attempted to stop rebalance, and that went through successfully. However, I still saw similar errors in the logs.

[2016-05-03 06:43:08.413496] E [MSGID: 109016] [dht-rebalance.c:3120:gf_defrag_fix_layout] 0-dist-rep2-dht: Fix layout failed for /1k_files
[2016-05-03 06:43:08.434422] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-dist-rep2-client-4: changing port to 49153 (from 0)
[2016-05-03 06:43:08.440845] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-dist-rep2-client-6: changing port to 49153 (from 0)
[2016-05-03 06:43:08.442697] E [socket.c:2279:socket_connect_finish] 0-dist-rep2-client-4: connection to 10.70.35.210:49153 failed (Connection refused)
[2016-05-03 06:43:08.445697] E [socket.c:2279:socket_connect_finish] 0-dist-rep2-client-6: connection to 10.70.35.137:49153 failed (Connection refused)
[2016-05-03 06:43:08.818536] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-dist-rep2-client-9: changing port to 49154 (from 0)
[2016-05-03 06:43:08.823343] E [socket.c:2279:socket_connect_finish] 0-dist-rep2-client-9: connection to 10.70.35.85:49154 failed (Connection refused)


Is there any step I am missing? Karthick, do you remember doing anything else that prevented you from stopping the running rebalance process?

Comment 9 Sweta Anandpara 2016-05-03 10:21:45 UTC
To reproduce the above issue, the rebalance stop command has to be issued exactly when the crawler has reached its queue limit and gone to sleep.
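
To make the race easier to picture for anyone revisiting this later, here is a minimal sketch of the handoff the doc text describes (hypothetical names and simplified logic, not the actual dht-rebalance code): the crawler sleeps on a condition variable when its queue is full, so migration threads must wake it before exiting on a stop, or it sleeps forever.

/*
 * Minimal sketch, assuming a pthread condition-variable handoff;
 * all names are hypothetical, not the actual gluster code.
 */
#include <pthread.h>
#include <stdbool.h>

#define QUEUE_LIMIT 500   /* the queue limit mentioned below */

struct defrag_ctx {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             queued;  /* entries waiting for migration */
    bool            stop;    /* set by "rebalance stop" */
};

/* Crawler side: sleep while the queue is full, but also wake on stop. */
static bool crawler_wait_for_room(struct defrag_ctx *ctx)
{
    bool keep_going;

    pthread_mutex_lock(&ctx->lock);
    while (ctx->queued >= QUEUE_LIMIT && !ctx->stop)
        pthread_cond_wait(&ctx->cond, &ctx->lock);  /* sleeps here */
    keep_going = !ctx->stop;
    pthread_mutex_unlock(&ctx->lock);
    return keep_going;
}

/* Migration side: the essence of the fix is the broadcast before the
 * thread exits. Without it, a crawler asleep at the queue limit never
 * re-checks ctx->stop, and the rebalance process never goes away. */
static void *migration_worker(void *arg)
{
    struct defrag_ctx *ctx = arg;

    for (;;) {
        pthread_mutex_lock(&ctx->lock);
        if (ctx->stop) {
            pthread_cond_broadcast(&ctx->cond);  /* wake the crawler */
            pthread_mutex_unlock(&ctx->lock);
            break;
        }
        if (ctx->queued > 0) {
            ctx->queued--;                   /* "migrate" one entry */
            pthread_cond_signal(&ctx->cond); /* room freed: wake crawler */
        }
        pthread_mutex_unlock(&ctx->lock);
    }
    return NULL;
}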

I redid the steps as mentioned, this time under gdb with Susant's help. I waited for the crawler to reach the queue limit of 500, then issued a 'gluster v rebalance stop'.

The rebalance process did not get killed on the setup with the 3.7.9-2 build, but was killed on the setup with the 3.7.9-3 build.

Moving this BZ to verified in 3.1.3.

Comment 10 Sweta Anandpara 2016-05-03 10:23:11 UTC
Changing the QA contact, as I am responsible for the verification of this bug.

Comment 12 Susant Kumar Palai 2016-06-06 06:40:12 UTC
Yes Laura, the doc text is fine.

Comment 14 errata-xmlrpc 2016-06-23 05:17:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240

