| Summary: | [DHT-Rebalance]: with a few brick processes down, the rebalance process isn't killed even after issuing rebalance stop | |||
|---|---|---|---|---|
| Product: | Red Hat Gluster Storage | Reporter: | krishnaram Karthick <kramdoss> | |
| Component: | distribute | Assignee: | Susant Kumar Palai <spalai> | |
| Status: | CLOSED ERRATA | QA Contact: | Sweta Anandpara <sanandpa> | |
| Severity: | medium | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | rhgs-3.1 | CC: | asrivast, nbalacha, rgowdapp, rhinduja, sanandpa, spalai | |
| Target Milestone: | --- | Keywords: | ZStream | |
| Target Release: | RHGS 3.1.3 | |||
| Hardware: | All | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | glusterfs-3.7.9-3 | Doc Type: | Bug Fix | |
| Doc Text: |
When the crawler thread for rebalance reaches its queue limit, it sleeps until it receives a signal from the migration threads. However, on a REBALANCE_STOP event the migration threads exited without notifying the crawler thread. This meant that when 'rebalance stop' was issued while the crawler thread was sleeping at its queue limit, the rebalance process did not actually stop. Migration threads now signal the crawler before they exit, and the 'rebalance stop' command works as expected.
|
Story Points: | --- | |
| Clone Of: | ||||
| : | 1327507 (view as bug list) | Environment: | ||
| Last Closed: | 2016-06-23 05:17:23 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Bug Depends On: | ||||
| Bug Blocks: | 1311817 | |||
Description
krishnaram Karthick
2016-04-13 10:10:18 UTC
Will wait for the sos-report. In case the issue is seen again, please take a state dump as well.

During validation of this BZ, I tried to reproduce the issue on build glusterfs-3.7.9-2, following the steps mentioned in the description. I had a 4*2 volume, created 10k files of size 1k, added 4 more bricks (making it 6*2), and started rebalance. I killed one brick of each subvolume and saw connection-refused errors in the rebalance logs:

```
[2016-05-03 06:34:43.186445] W [dict.c:429:dict_set] (-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_lookup_xattr_req_prepare+0xb0) [0x7f0227a542f0] -->/lib64/libglusterfs.so.0(dict_set_str+0x2c) [0x7f0235700c6c] -->/lib64/libglusterfs.so.0(dict_set+0xa6) [0x7f02356feb16] ) 0-dict: !this || !value for key=link-count [Invalid argument]
[2016-05-03 06:34:43.192118] E [socket.c:2279:socket_connect_finish] 0-dist-rep2-client-1: connection to 10.70.35.85:49152 failed (Connection refused)
[2016-05-03 06:34:43.207999] I [dht-rebalance.c:1214:dht_migrate_file] 0-dist-rep2-dht: /1k_files/file778: attempting to move from dist-rep2-replicate-2 to dist-rep2-replicate-4
[2016-05-03 06:34:43.208787] W [dict.c:429:dict_set] (-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_lookup_xattr_req_prepare+0xb0) [0x7f0227a542f0] -->/lib64/libglusterfs.so.0(dict_set_str+0x2c) [0x7f0235700c6c] -->/lib64/libglusterfs.so.0(dict_set+0xa6) [0x7f02356feb16] ) 0-dict: !this || !value for key=link-count [Invalid argument]
[2016-05-03 06:35:13.189547] W [dict.c:429:dict_set] (-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_lookup_xattr_req_prepare+0xb0) [0x7f0227a542f0] -->/lib64/libglusterfs.so.0(dict_set_str+0x2c) [0x7f0235700c6c] -->/lib64/libglusterfs.so.0(dict_set+0xa6) [0x7f02356feb16] ) 0-dict: !this || !value for key=link-count [Invalid argument]
[2016-05-03 06:35:13.249361] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-dist-rep2-client-3: changing port to 49152 (from 0)
[2016-05-03 06:35:13.254546] E [socket.c:2279:socket_connect_finish] 0-dist-rep2-client-3: connection to 10.70.35.13:49152 failed (Connection refused)
[2016-05-03 06:35:13.264568] W [MSGID: 114031] [client-rpc-fops.c:2539:client3_3_lk_cbk] 0-dist-rep2-client-1: remote operation failed [Transport endpoint is not connected]
```

Before I could stop the rebalance operation, it had already completed. I added 4 more bricks (making it an 8*2 volume), started renaming 2k files from the mount point, and then restarted the rebalance operation. Status showed that the operation was in progress. I attempted to stop rebalance and that went through successfully. However, I saw similar errors in the logs:

```
[2016-05-03 06:43:08.413496] E [MSGID: 109016] [dht-rebalance.c:3120:gf_defrag_fix_layout] 0-dist-rep2-dht: Fix layout failed for /1k_files
[2016-05-03 06:43:08.434422] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-dist-rep2-client-4: changing port to 49153 (from 0)
[2016-05-03 06:43:08.440845] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-dist-rep2-client-6: changing port to 49153 (from 0)
[2016-05-03 06:43:08.442697] E [socket.c:2279:socket_connect_finish] 0-dist-rep2-client-4: connection to 10.70.35.210:49153 failed (Connection refused)
[2016-05-03 06:43:08.445697] E [socket.c:2279:socket_connect_finish] 0-dist-rep2-client-6: connection to 10.70.35.137:49153 failed (Connection refused)
[2016-05-03 06:43:08.818536] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-dist-rep2-client-9: changing port to 49154 (from 0)
[2016-05-03 06:43:08.823343] E [socket.c:2279:socket_connect_finish] 0-dist-rep2-client-9: connection to 10.70.35.85:49154 failed (Connection refused)
```

Is there any step that I am missing? Karthick, do you remember doing anything else that was preventing you from stopping the started rebalance process?

To reproduce this issue, the rebalance stop command has to be issued exactly when the crawler has reached its queue limit and gone to sleep.

Redid the steps as mentioned, invoking gdb this time with Susant's help. Waited for the crawler to reach its queue limit of 500 and issued 'gluster v rebalance stop'. The rebalance process did not get killed on the setup with the 3.7.9-2 build, but was killed on the setup with the 3.7.9-3 build. Moving this BZ to verified in 3.1.3.

Moving the QA contact, as I am answerable for the verification of this bug.

Yes Laura, the doc text is fine.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240