Bug 1237059
Summary: | DHT-rebalance: rebalance status shows failed when replica pair bricks are brought down in distrep volume while rename of files is in progress | |||
---|---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Triveni Rao <trao> | |
Component: | distribute | Assignee: | Susant Kumar Palai <spalai> | |
Status: | CLOSED ERRATA | QA Contact: | RajeshReddy <rmekala> | |
Severity: | high | Docs Contact: | ||
Priority: | high | |||
Version: | rhgs-3.1 | CC: | annair, asriram, asrivast, bmohanra, byarlaga, mzywusko, nbalacha, rgowdapp, rhs-bugs, rmekala, sankarshan, sashinde, shmohan, smohan, spalai | |
Target Milestone: | --- | Keywords: | ZStream | |
Target Release: | RHGS 3.1.2 | Flags: | bmohanra: needinfo+ |
Hardware: | x86_64 | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | glusterfs-3.7.5-5 | Doc Type: | Bug Fix | |
Doc Text: |
Previously, the rebalance status was reported as failed when bricks of a replica pair were brought down in a distributed-replicated volume. With this fix, rebalance skips the error-affected directory and continues with the other directories, so the rebalance process on a distributed-replicated volume is not aborted even if a brick from a replica pair goes down.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1257076 1319592 | Environment: ||
Last Closed: | 2016-03-01 05:27:38 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1216951, 1243815, 1257076, 1260783, 1318196 |
Description
Triveni Rao
2015-06-30 09:58:08 UTC
I just tried the single-brick-down scenario and could reproduce the problem on the volume.

Volume Name: kit
Type: Distributed-Replicate
Volume ID: ae805fc4-45c2-4d80-94e8-ce50336bc3c4
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.35.57:/rhs/brick1/s0
Brick2: 10.70.35.136:/rhs/brick1/s0
Brick3: 10.70.35.57:/rhs/brick2/s0
Brick4: 10.70.35.136:/rhs/brick2/s0
Options Reconfigured:
performance.readdir-ahead: on

[root@casino-vm1 ~]# gluster v status kit
Status of volume: kit
Gluster process                            TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.57:/rhs/brick1/s0           49218     0          Y       26079
Brick 10.70.35.136:/rhs/brick1/s0          49196     0          Y       15477
Brick 10.70.35.57:/rhs/brick2/s0           49219     0          Y       26097
Brick 10.70.35.136:/rhs/brick2/s0          49197     0          Y       15495
NFS Server on localhost                    2049      0          Y       26116
Self-heal Daemon on localhost              N/A       N/A        Y       26124
NFS Server on 10.70.35.136                 2049      0          Y       15514
Self-heal Daemon on 10.70.35.136           N/A       N/A        Y       15522

Task Status of Volume kit
------------------------------------------------------------------------------
There are no active volume tasks

[root@casino-vm1 ~]# gluster v add-brick kit 10.70.35.57:/rhs/brick4/s0 10.70.35.136:/rhs/brick4/s0
volume add-brick: success

[root@casino-vm1 ~]# gluster v rebalance kit start force
volume rebalance: kit: success: Rebalance on kit has been started successfully. Use rebalance status command to check status of the rebalance process. ID: 6062fa7e-094b-48b9-9885-9a0ccb75e668

[root@casino-vm1 ~]# gluster v rebalance kit status
Node          Rebalanced-files  size    scanned  failures  skipped  status       run time in secs
---------     ----------------  ------  -------  --------  -------  -----------  ----------------
localhost     116               0Bytes  762      0         0        in progress  4.00
10.70.35.136  0                 0Bytes  0        0         0        in progress  4.00
volume rebalance: kit: success:

[root@casino-vm1 ~]# kill -9 26079

[root@casino-vm1 ~]# gluster v status kit
Status of volume: kit
Gluster process                            TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.57:/rhs/brick1/s0           N/A       N/A        N       N/A
Brick 10.70.35.136:/rhs/brick1/s0          49196     0          Y       15477
Brick 10.70.35.57:/rhs/brick2/s0           49219     0          Y       26097
Brick 10.70.35.136:/rhs/brick2/s0          49197     0          Y       15495
Brick 10.70.35.57:/rhs/brick4/s0           49220     0          Y       26498
Brick 10.70.35.136:/rhs/brick4/s0          49198     0          Y       15878
NFS Server on localhost                    2049      0          Y       26517
Self-heal Daemon on localhost              N/A       N/A        Y       26525
NFS Server on 10.70.35.136                 2049      0          Y       15897
Self-heal Daemon on 10.70.35.136           N/A       N/A        Y       15905

Task Status of Volume kit
------------------------------------------------------------------------------
Task   : Rebalance
ID     : 6062fa7e-094b-48b9-9885-9a0ccb75e668
Status : failed

[root@casino-vm1 ~]# gluster v rebalance kit status
Node          Rebalanced-files  size    scanned  failures  skipped  status     run time in secs
---------     ----------------  ------  -------  --------  -------  ---------  ----------------
localhost     649               0Bytes  2222     2         2        failed     17.00
10.70.35.136  0                 0Bytes  0        0         0        completed  7.00
volume rebalance: kit: success:

Log messages:

[root@casino-vm1 ~]# less /var/log/glusterfs/kit-rebalance.log
[2015-06-30 03:52:02.022740] I [MSGID: 100030] [glusterfsd.c:2301:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.1 (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/kit
--xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=5 --xlator-option *dht.node-uuid=38c8a3d7-e190-4d7b-9a79-f7bbac5e146b --xlator-option *dht.commit-hash=2895155977 --socket-file /var/run/gluster/gluster-rebalance-ae805fc4-45c2-4d80-94e8-ce50336bc3c4.sock --pid-file /var/lib/glusterd/vols/kit/rebalance/38c8a3d7-e190-4d7b-9a79-f7bbac5e146b.pid -l /var/log/glusterfs/kit-rebalance.log)
[2015-06-30 03:52:02.038890] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2015-06-30 03:52:07.042428] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'commit-hash' for volume 'kit-dht' with value '2895155977'
[2015-06-30 03:52:07.042448] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'node-uuid' for volume 'kit-dht' with value '38c8a3d7-e190-4d7b-9a79-f7bbac5e146b'
[2015-06-30 03:52:07.042458] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'rebalance-cmd' for volume 'kit-dht' with value '5'
[2015-06-30 03:52:07.042467] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'readdir-optimize' for volume 'kit-dht' with value 'on'
[2015-06-30 03:52:07.042477] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'assert-no-child-down' for volume 'kit-dht' with value 'yes'
[2015-06-30 03:52:07.042486] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'lookup-unhashed' for volume 'kit-dht' with value 'yes'
[2015-06-30 03:52:07.042495] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-dht: adding option 'use-readdirp' for volume 'kit-dht' with value 'yes'
[2015-06-30 03:52:07.042505] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-replicate-2: adding option 'readdir-failover' for volume 'kit-replicate-2' with value 'off'
[2015-06-30 03:52:07.042515] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-replicate-2: adding option 'entry-self-heal' for volume 'kit-replicate-2' with value 'off'
[2015-06-30 03:52:07.042524] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-replicate-2: adding option 'metadata-self-heal' for volume 'kit-replicate-2' with value 'off'
[2015-06-30 03:52:07.042533] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-replicate-2: adding option 'data-self-heal' for volume 'kit-replicate-2' with value 'off'
[2015-06-30 03:52:07.042543] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-kit-replicate-1: adding option 'readdir-failover' for volume 'ki...skipping...
[2015-06-30 03:52:29.762368] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b36 from subvolume kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.766353] I [dht-rebalance.c:1002:dht_migrate_file] 0-kit-dht: /a9/b37: attempting to move from kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.808333] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b10 from subvolume kit-replicate-0 to kit-replicate-2
[2015-06-30 03:52:29.813728] I [dht-rebalance.c:1002:dht_migrate_file] 0-kit-dht: /a9/b11: attempting to move from kit-replicate-0 to kit-replicate-2
[2015-06-30 03:52:29.819245] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b37 from subvolume kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.826413] I [dht-rebalance.c:1002:dht_migrate_file] 0-kit-dht: /a9/b51: attempting to move from kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.874437] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b11 from subvolume kit-replicate-0 to kit-replicate-2
[2015-06-30 03:52:29.878284] I [dht-rebalance.c:1002:dht_migrate_file] 0-kit-dht: /a9/b13: attempting to move from kit-replicate-0 to kit-replicate-2
[2015-06-30 03:52:29.883269] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b51 from subvolume kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.890494] I [dht-rebalance.c:1002:dht_migrate_file] 0-kit-dht: /a9/b56: attempting to move from kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.908997] W [MSGID: 114061] [client-rpc-fops.c:5987:client3_3_readdirp] 0-kit-client-0: (0cb784e4-4545-4f7c-b2ad-c05cfb8b072e) remote_fd is -1. EBADFD [File descriptor in bad state]
[2015-06-30 03:52:29.909498] E [MSGID: 109021] [dht-rebalance.c:1872:gf_defrag_get_entry] 0-kit-dht: /x11: Migrate data failed: Readdir returned File descriptor in bad state. Aborting migrate-data
[2015-06-30 03:52:29.909513] I [dht-rebalance.c:2289:gf_defrag_process_dir] 0-DHT: Found critical error from gf_defrag_get_entry
[2015-06-30 03:52:29.909681] E [MSGID: 109016] [dht-rebalance.c:2550:gf_defrag_fix_layout] 0-kit-dht: Fix layout failed for /x11
[2015-06-30 03:52:29.909912] I [dht-rebalance.c:1764:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 3
[2015-06-30 03:52:29.910002] I [dht-rebalance.c:1764:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 4
[2015-06-30 03:52:29.930303] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b13 from subvolume kit-replicate-0 to kit-replicate-2
[2015-06-30 03:52:29.933258] I [MSGID: 109022] [dht-rebalance.c:1282:dht_migrate_file] 0-kit-dht: completed migration of /a9/b56 from subvolume kit-replicate-1 to kit-replicate-2
[2015-06-30 03:52:29.934002] I [MSGID: 109028] [dht-rebalance.c:3029:gf_defrag_status_get] 0-kit-dht: Rebalance is failed. Time taken is 17.00 secs
[2015-06-30 03:52:29.934024] I [MSGID: 109028] [dht-rebalance.c:3033:gf_defrag_status_get] 0-kit-dht: Files migrated: 649, size: 0, lookups: 2222, failures: 2, skipped: 2
[2015-06-30 03:52:29.934153] W [glusterfsd.c:1219:cleanup_and_exit] (--> 0-: received signum (15), shutting down

Doc text is edited. Please sign off to be included in Known Issues.

Doc text is edited. Please sign off to be included in Known Issues.

Sorry, I am providing qe-ack again.
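For anyone triaging a similar failure, a quick way to pull the critical error lines out of a rebalance log such as the excerpt above. The log path follows the <volname>-rebalance.log pattern seen in this bug, and the message strings are the ones visible in the excerpt; exact wording may differ between glusterfs versions, so treat this as a sketch rather than an official diagnostic:

```sh
# Illustrative only: scan the rebalance log of volume "kit" for the errors
# seen in this bug (readdirp failure, aborted migrate-data, failed fix-layout).
grep -E "Migrate data failed|Fix layout failed|remote_fd is -1" \
    /var/log/glusterfs/kit-rebalance.log
```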
Upstream patch: http://review.gluster.org/#/c/12013/

As this is yet to be merged upstream, moving this bug back to POST.

Tested with build glusterfs-server-3.7.5-5. After killing the brick process of one of the replica pairs, rebalance is not continuing from the other replica pair:

[root@rhs-client19 glusterfs]# gluster vol rebalance dht1x2 status
Node                                  Rebalanced-files  size    scanned  failures  skipped  status     run time in secs
---------                             ----------------  ------  -------  --------  -------  ---------  ----------------
localhost                             0                 0Bytes  0        0         0        completed  1.00
rhs-client18.lab.eng.blr.redhat.com   625               2.7KB   1519     1         0        completed  25.00
volume rebalance: dht1x2: success:
[root@rhs-client19 glusterfs]#

Out of 15k files it scanned only 1519. When all brick processes were up and running, I re-initiated the rebalance, and this time it scanned 15k files:

[root@rhs-client19 glusterfs]# gluster vol rebalance dht1x2 status
Node                                  Rebalanced-files  size    scanned  failures  skipped  status     run time in secs
---------                             ----------------  ------  -------  --------  -------  ---------  ----------------
localhost                             2342              12.3KB  6421     0         0        completed  189.00
rhs-client18.lab.eng.blr.redhat.com   3028              15.9KB  8990     0         0        completed  217.00
volume rebalance: dht1x2: success:

Tested with multiple directories: if one of the brick processes of any replica pair is down, rebalance of the other directories proceeds fine, and if the volume has only one directory then rebalance cannot continue with the other replica, which is expected, so marking this as verified.

*** Bug 1064481 has been marked as a duplicate of this bug. ***

Hi Susant, the doc text is edited. Do sign off on the same if it looks OK.

Is there a way I can see the doc text I submitted? I just want to verify, as if I remember correctly there was no reference to rename in it.

Updated the doc text based on my discussion with Susant.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0193.html
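For reference, a condensed sketch of the verification flow used in this bug, for re-checking the fixed behaviour on glusterfs-3.7.5-5 or later. The volume name and brick path are placeholders taken from the reproduction in the Description, and the PID extraction is an assumption based on the "gluster volume status" output format shown above; the expectation after the fix is that the status does not flip to "failed" when one brick of a replica pair goes down:

```sh
# Illustrative verification sketch, not an official test case.
VOL=kit    # placeholder volume name from the Description

# Start a forced rebalance on the volume.
gluster volume rebalance "$VOL" start force

# Kill one brick of a replica pair while rebalance is running.
# Assumes the brick PID is the last column of "gluster volume status",
# as in the status listings earlier in this bug.
BRICK_PID=$(gluster volume status "$VOL" | awk '/brick1\/s0/ {print $NF; exit}')
kill -9 "$BRICK_PID"

# Before the fix the status flipped to "failed"; with the fix, rebalance
# skips the error-affected directory and continues with the others.
gluster volume rebalance "$VOL" status
```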