Description of problem:
On a 4x2 volume, when one of the brick processes in each sub-volume is brought down, rebalance stop doesn't actually stop the rebalance process. The following error messages are continuously thrown in the rebalance logs.

Snippet of rebalance logs:
[2016-04-13 09:53:28.534599] E [socket.c:2279:socket_connect_finish] 0-testvol-client-0: connection to 10.70.47.90:49152 failed (Connection refused)
[2016-04-13 09:53:28.541054] E [socket.c:2279:socket_connect_finish] 0-testvol-client-6: connection to 10.70.47.9:49153 failed (Connection refused)
[2016-04-13 09:53:29.535727] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-testvol-client-2: changing port to 49152 (from 0)
[2016-04-13 09:53:29.543157] E [socket.c:2279:socket_connect_finish] 0-testvol-client-2: connection to 10.70.47.9:49152 failed (Connection refused)
[2016-04-13 09:53:30.541661] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-testvol-client-4: changing port to 49153 (from 0)
[2016-04-13 09:53:30.548590] E [socket.c:2279:socket_connect_finish] 0-testvol-client-4: connection to 10.70.47.90:49153 failed (Connection refused)
[2016-04-13 09:53:31.549488] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-testvol-client-0: changing port to 49152 (from 0)
[2016-04-13 09:53:31.556892] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-testvol-client-6: changing port to 49153 (from 0)
[2016-04-13 09:53:31.562246] E [socket.c:2279:socket_connect_finish] 0-testvol-client-0: connection to 10.70.47.90:49152 failed (Connection refused)
[2016-04-13 09:53:31.568492] E [socket.c:2279:socket_connect_finish] 0-testvol-client-6: connection to 10.70.47.9:49153 failed (Connection refused)

Version-Release number of selected component (if applicable):
glusterfs-server-3.7.9-1.el7rhgs.x86_64

How reproducible:
1/1; yet to try whether it is consistently reproducible.

Steps to Reproduce (a rough command sketch follows this comment):
1. On a 4x2 dist-rep volume, add 4 more bricks.
2. Create 10k files of 1 KB size.
3. Trigger the rebalance process.
4. Kill one brick process from each subvolume on the existing bricks (so one brick process in each sub-volume remains up).
5. Rename a few files.
6. After a while, stop the rebalance process.
7. Check whether the rebalance process has actually stopped.

Actual results:
The rebalance process is still running.

Expected results:
The rebalance process should have stopped.

Additional info:

gluster v info

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 02427025-adcf-48a2-ac58-ae494839e9f8
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.47.90:/bricks/brick0/ct
Brick2: 10.70.47.105:/bricks/brick0/ct
Brick3: 10.70.47.9:/bricks/brick0/ct
Brick4: 10.70.46.94:/bricks/brick0/ct
Brick5: 10.70.47.90:/bricks/brick1/ct
Brick6: 10.70.47.105:/bricks/brick1/ct
Brick7: 10.70.47.9:/bricks/brick1/ct
Brick8: 10.70.46.94:/bricks/brick1/ct
Brick9: 10.70.47.90:/bricks/brick2/ct
Brick10: 10.70.47.105:/bricks/brick2/ct
Brick11: 10.70.47.9:/bricks/brick2/ct
Brick12: 10.70.46.94:/bricks/brick2/ct
Options Reconfigured:
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on

sosreports shall be attached shortly.
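For reference, a minimal command sketch of the steps above, assuming a client mount at /mnt/testvol and the brick paths from the volume info; the mount point, file/rename counts, and brick PIDs are placeholders, not a verified script:

# 1. Expand the existing 4x2 volume by 4 bricks (making it 6x2).
gluster volume add-brick testvol \
    10.70.47.90:/bricks/brick2/ct 10.70.47.105:/bricks/brick2/ct \
    10.70.47.9:/bricks/brick2/ct 10.70.46.94:/bricks/brick2/ct

# 2. Create 10k files of 1 KB each from a client mount (mount point assumed).
for i in $(seq 1 10000); do
    dd if=/dev/urandom of=/mnt/testvol/file$i bs=1k count=1 status=none
done

# 3. Trigger rebalance.
gluster volume rebalance testvol start

# 4. Kill one brick process per replica pair; PIDs taken from the volume status output.
gluster volume status testvol   # note the PID column for the chosen bricks
kill <brick-pid>                # repeat for one brick in each replica pair

# 5. Rename a few files from the mount.
for i in $(seq 1 100); do mv /mnt/testvol/file$i /mnt/testvol/renamed$i; done

# 6./7. Stop rebalance and check whether the rebalance daemon really exited.
gluster volume rebalance testvol stop
gluster volume rebalance testvol status
pgrep -af rebalance             # the rebalance daemon should no longer be listed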
Will wait for sos-report. In case the issue is seen again, please take a state-dump as well.
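For reference, a statedump could be taken along these lines (a sketch; the process match pattern for the rebalance daemon is an assumption, and dumps land in /var/run/gluster by default):

# Statedump of the volume's brick processes.
gluster volume statedump testvol

# For the rebalance daemon itself, SIGUSR1 makes a glusterfs process write its
# statedump to the statedump directory.
kill -USR1 $(pgrep -f 'glusterfs.*rebalance')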
http://review.gluster.org/#/c/14004/
https://code.engineering.redhat.com/gerrit/73093
During validation of this BZ, I tried to reproduce the issue on the glusterfs-3.7.9-2 build, following the steps mentioned in the description. I had a 4x2 volume, created 10k files of size 1 KB, added 4 more bricks (making it 6x2), and started rebalance. I then killed one brick of each subvolume and saw connection-refused errors in the rebalance logs:

[2016-05-03 06:34:43.186445] W [dict.c:429:dict_set] (-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_lookup_xattr_req_prepare+0xb0) [0x7f0227a542f0] -->/lib64/libglusterfs.so.0(dict_set_str+0x2c) [0x7f0235700c6c] -->/lib64/libglusterfs.so.0(dict_set+0xa6) [0x7f02356feb16] ) 0-dict: !this || !value for key=link-count [Invalid argument]
[2016-05-03 06:34:43.192118] E [socket.c:2279:socket_connect_finish] 0-dist-rep2-client-1: connection to 10.70.35.85:49152 failed (Connection refused)
[2016-05-03 06:34:43.207999] I [dht-rebalance.c:1214:dht_migrate_file] 0-dist-rep2-dht: /1k_files/file778: attempting to move from dist-rep2-replicate-2 to dist-rep2-replicate-4
[2016-05-03 06:34:43.208787] W [dict.c:429:dict_set] (-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_lookup_xattr_req_prepare+0xb0) [0x7f0227a542f0] -->/lib64/libglusterfs.so.0(dict_set_str+0x2c) [0x7f0235700c6c] -->/lib64/libglusterfs.so.0(dict_set+0xa6) [0x7f02356feb16] ) 0-dict: !this || !value for key=link-count [Invalid argument]
[2016-05-03 06:35:13.189547] W [dict.c:429:dict_set] (-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_lookup_xattr_req_prepare+0xb0) [0x7f0227a542f0] -->/lib64/libglusterfs.so.0(dict_set_str+0x2c) [0x7f0235700c6c] -->/lib64/libglusterfs.so.0(dict_set+0xa6) [0x7f02356feb16] ) 0-dict: !this || !value for key=link-count [Invalid argument]
[2016-05-03 06:35:13.249361] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-dist-rep2-client-3: changing port to 49152 (from 0)
[2016-05-03 06:35:13.254546] E [socket.c:2279:socket_connect_finish] 0-dist-rep2-client-3: connection to 10.70.35.13:49152 failed (Connection refused)
[2016-05-03 06:35:13.264568] W [MSGID: 114031] [client-rpc-fops.c:2539:client3_3_lk_cbk] 0-dist-rep2-client-1: remote operation failed [Transport endpoint is not connected]

Before I could stop the rebalance operation, it had already completed. I added 4 more bricks (making it an 8x2 volume), started a rename of 2k files from the mount point, and then restarted the rebalance operation. Status showed that the operation was in progress. I attempted to stop rebalance and that went through successfully. However, I still saw similar errors in the logs.
[2016-05-03 06:43:08.413496] E [MSGID: 109016] [dht-rebalance.c:3120:gf_defrag_fix_layout] 0-dist-rep2-dht: Fix layout failed for /1k_files
[2016-05-03 06:43:08.434422] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-dist-rep2-client-4: changing port to 49153 (from 0)
[2016-05-03 06:43:08.440845] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-dist-rep2-client-6: changing port to 49153 (from 0)
[2016-05-03 06:43:08.442697] E [socket.c:2279:socket_connect_finish] 0-dist-rep2-client-4: connection to 10.70.35.210:49153 failed (Connection refused)
[2016-05-03 06:43:08.445697] E [socket.c:2279:socket_connect_finish] 0-dist-rep2-client-6: connection to 10.70.35.137:49153 failed (Connection refused)
[2016-05-03 06:43:08.818536] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-dist-rep2-client-9: changing port to 49154 (from 0)
[2016-05-03 06:43:08.823343] E [socket.c:2279:socket_connect_finish] 0-dist-rep2-client-9: connection to 10.70.35.85:49154 failed (Connection refused)

Is there any step that I am missing? Karthick, do you remember doing anything else that was preventing you from stopping the started rebalance process?
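For reference, the check I used for the last step was roughly along these lines (a sketch; the process match pattern for the rebalance daemon is an assumption and should be confirmed against the actual ps output on each server):

gluster volume rebalance dist-rep2 stop
gluster volume rebalance dist-rep2 status

# On each server, confirm the rebalance daemon has actually exited.
pgrep -af 'glusterfs.*rebalance.*dist-rep2' || echo "rebalance process gone"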
To reproduce the above issue, the rebalance stop command has to be issued exactly when the crawler has reached its queue limit and has gone off to sleep. I redid the steps as mentioned, this time invoking gdb with Sushant's help: I waited for the crawler to reach the queue limit of 500 and then issued gluster v rebalance stop. The rebalance process did not get killed on the setup with the 3.7.9-2 build, but it did get killed on the setup with the 3.7.9-3 build. Moving this BZ to verified in 3.1.3.
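The gdb step was roughly as follows (a sketch; the process match pattern is an assumption, and gf_defrag_fix_layout is only taken from the log lines above as a convenient crawler-path symbol, not necessarily the exact sleep point):

# Attach gdb to the rebalance daemon and hold it in the crawler path.
REB_PID=$(pgrep -f 'glusterfs.*rebalance.*dist-rep2')
gdb -p "$REB_PID" -ex 'break gf_defrag_fix_layout' -ex 'continue'

# While the crawler is parked on its full queue, issue the stop from another terminal.
gluster volume rebalance dist-rep2 stop
gluster volume rebalance dist-rep2 status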
Changing the QA contact, as I am responsible for the verification of this bug.
Yes Laura, doc is fine.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240