Description of problem:
-----------------------
4 Node Master, 2 Node Slave. Geo-rep syncing was in progress (unsure if this has anything to do with it). Was running Bonnie, smallfiles, iozone, and kernel untar from 8 FUSE mounts. Added two bricks and triggered a rebalance. It first showed a negative value, then it failed:

[root@gqas013 glusterfs]# gluster v rebalance testvol status
                                    Node  Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------       -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                 0        0Bytes             0             1             0               failed            0:00:25
      gqas005.sbu.lab.eng.bos.redhat.com                 0        0Bytes             0             1             0               failed            0:00:20
      gqas006.sbu.lab.eng.bos.redhat.com                 0        0Bytes             0             1             0               failed            0:00:20
      gqas008.sbu.lab.eng.bos.redhat.com                 0        0Bytes             0             1             0               failed            0:00:20
volume rebalance: testvol: success
[root@gqas013 glusterfs]#

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
glusterfs-3.8.4-28.el7rhgs.x86_64

How reproducible:
-----------------
1/1

Actual results:
---------------
Rebalance fails.

Expected results:
-----------------
No rebalance failures.
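Note that the CLI summary line still prints "success" even though every node reports "failed"; scripted checks should parse the per-node status column rather than trust the summary. A minimal sketch, assuming the column layout shown above (the helper name `count_failed_nodes` is hypothetical, not a gluster command):

```shell
# Hypothetical helper: count nodes whose rebalance status is "failed".
# Expects `gluster v rebalance <vol> status` output on stdin; the status
# column is the second-to-last field on each data row.
count_failed_nodes() {
    awk '$(NF-1) == "failed" { n++ } END { print n+0 }'
}

# Example against a trimmed copy of the output above:
count_failed_nodes <<'EOF'
                                    Node  Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               localhost                 0        0Bytes             0             1             0               failed            0:00:25
      gqas005.sbu.lab.eng.bos.redhat.com                 0        0Bytes             0             1             0               failed            0:00:20
volume rebalance: testvol: success
EOF
# prints: 2
```

In a real run this would be fed from `gluster v rebalance testvol status`, and a non-zero count treated as a failure regardless of the summary line.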
Additional info:
----------------
*Master* :

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 29f18f45-c822-4c0e-84ef-737e128e0368
Status: Started
Snapshot Count: 0
Number of Bricks: 12 x 2 = 24
Transport-type: tcp
Bricks:
Brick1: gqas005.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick2: gqas013.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick4: gqas008.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick5: gqas005.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick6: gqas013.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick7: gqas006.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick8: gqas008.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick9: gqas005.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick10: gqas013.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick11: gqas006.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick12: gqas008.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick13: gqas005.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick14: gqas013.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick15: gqas006.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick16: gqas008.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick17: gqas005.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick18: gqas013.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick19: gqas006.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick20: gqas008.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick21: gqas005.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick22: gqas013.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick23: gqas006.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick24: gqas008.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Options Reconfigured:
nfs.disable: off
transport.address-family: inet
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 50000
server.event-threads: 4
client.event-threads: 4
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
changelog.changelog: on
cluster.enable-shared-storage: enable
[root@gqas013 glusterfs]#

*Slave* :

[root@gqas015 ~]# gluster v info

Volume Name: butcher
Type: Distributed-Disperse
Volume ID: 6de155ee-2200-44bb-a8ed-bdae5acf348f
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (4 + 2) = 24
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick2: gqas015.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick3: gqas014.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick4: gqas015.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick5: gqas014.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick6: gqas015.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick7: gqas014.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick8: gqas015.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick9: gqas014.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick10: gqas015.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick11: gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick12: gqas015.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick13: gqas014.sbu.lab.eng.bos.redhat.com:/bricks7/A1
Brick14: gqas015.sbu.lab.eng.bos.redhat.com:/bricks7/A1
Brick15: gqas014.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Brick16: gqas015.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Brick17: gqas014.sbu.lab.eng.bos.redhat.com:/bricks9/A1
Brick18: gqas015.sbu.lab.eng.bos.redhat.com:/bricks9/A1
Brick19: gqas014.sbu.lab.eng.bos.redhat.com:/bricks10/A1
Brick20: gqas015.sbu.lab.eng.bos.redhat.com:/bricks10/A1
Brick21: gqas014.sbu.lab.eng.bos.redhat.com:/bricks11/A1
Brick22: gqas015.sbu.lab.eng.bos.redhat.com:/bricks11/A1
Brick23: gqas014.sbu.lab.eng.bos.redhat.com:/bricks12/A1
Brick24: gqas015.sbu.lab.eng.bos.redhat.com:/bricks12/A1
Options Reconfigured:
network.inode-lru-limit: 50000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
transport.address-family: inet
nfs.disable: off
[root@gqas015 ~]#

[root@gqas013 glusterfs]# gluster volume geo-replication testvol
gqas014.sbu.lab.eng.bos.redhat.com::butcher status detail

MASTER NODE                          MASTER VOL    MASTER BRICK    SLAVE USER    SLAVE                                           SLAVE NODE                            STATUS     CRAWL STATUS    LAST_SYNCED    ENTRY    DATA    META    FAILURES    CHECKPOINT TIME        CHECKPOINT COMPLETED    CHECKPOINT COMPLETION TIME
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
gqas013.sbu.lab.eng.bos.redhat.com   testvol       /bricks1/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher     gqas014.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7299    0       0           2017-06-16 05:32:22    No                      N/A
gqas013.sbu.lab.eng.bos.redhat.com   testvol       /bricks2/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher     gqas014.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7321    0       0           2017-06-16 05:32:22    No                      N/A
gqas013.sbu.lab.eng.bos.redhat.com   testvol       /bricks3/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher     gqas014.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7303    0       0           2017-06-16 05:32:22    No                      N/A
gqas013.sbu.lab.eng.bos.redhat.com   testvol       /bricks4/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher     gqas014.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            8192     7318    0       0           2017-06-16 05:32:22    No                      N/A
gqas013.sbu.lab.eng.bos.redhat.com   testvol       /bricks5/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher     gqas014.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7308    0       0           2017-06-16 05:32:22    No                      N/A
gqas013.sbu.lab.eng.bos.redhat.com   testvol       /bricks6/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher     gqas014.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7293    0       0           2017-06-16 05:32:22    No                      N/A
gqas005.sbu.lab.eng.bos.redhat.com   testvol       /bricks1/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher     gqas015.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A
gqas005.sbu.lab.eng.bos.redhat.com   testvol       /bricks2/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher     gqas015.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A
gqas005.sbu.lab.eng.bos.redhat.com   testvol       /bricks3/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher     gqas015.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A
gqas005.sbu.lab.eng.bos.redhat.com   testvol       /bricks4/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher     gqas015.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A
gqas005.sbu.lab.eng.bos.redhat.com   testvol       /bricks5/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher     gqas015.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A
gqas005.sbu.lab.eng.bos.redhat.com   testvol       /bricks6/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher     gqas015.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A
gqas008.sbu.lab.eng.bos.redhat.com   testvol       /bricks1/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher     gqas014.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A
gqas008.sbu.lab.eng.bos.redhat.com   testvol       /bricks2/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher     gqas014.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A
gqas008.sbu.lab.eng.bos.redhat.com   testvol       /bricks3/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher     gqas014.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7310    0       0           2017-06-16 05:32:22    No                      N/A
gqas008.sbu.lab.eng.bos.redhat.com   testvol       /bricks4/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher     gqas014.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A
gqas008.sbu.lab.eng.bos.redhat.com   testvol       /bricks5/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher     gqas014.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A
gqas008.sbu.lab.eng.bos.redhat.com   testvol       /bricks6/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher     gqas014.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A
gqas006.sbu.lab.eng.bos.redhat.com   testvol       /bricks1/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher     gqas015.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7305    0       0           2017-06-16 05:32:22    No                      N/A
gqas006.sbu.lab.eng.bos.redhat.com   testvol       /bricks2/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher     gqas015.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            8192     7311    0       0           2017-06-16 05:32:22    No                      N/A
gqas006.sbu.lab.eng.bos.redhat.com   testvol       /bricks3/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher     gqas015.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A
gqas006.sbu.lab.eng.bos.redhat.com   testvol       /bricks4/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher     gqas015.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7311    0       0           2017-06-16 05:32:22    No                      N/A
gqas006.sbu.lab.eng.bos.redhat.com   testvol       /bricks5/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher     gqas015.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7309    0       0           2017-06-16 05:32:22    No                      N/A
gqas006.sbu.lab.eng.bos.redhat.com   testvol       /bricks6/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher     gqas015.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7311    0       0           2017-06-16 05:32:22    No                      N/A
[root@gqas013 glusterfs]#
From the logs, I see failures while fixing the layout on / and a heal failure before that:

[2017-06-16 11:07:50.852073] I [MSGID: 109028] [dht-rebalance.c:4717:gf_defrag_status_get] 0-glusterfs: Files migrated: 0, size: 0, lookups: 0, failures: 0, skipped: 0
[2017-06-16 11:08:05.034660] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-testvol-dht: Found anomalies in / (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0
[2017-06-16 11:08:05.034692] W [MSGID: 109005] [dht-selfheal.c:2111:dht_selfheal_directory] 0-testvol-dht: Directory selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =
[2017-06-16 11:08:05.034734] I [MSGID: 108031] [afr-common.c:2264:afr_local_discovery_cbk] 0-testvol-replicate-2: selecting local read_child testvol-client-5
[2017-06-16 11:08:05.034831] I [MSGID: 108006] [afr-common.c:4854:afr_local_init] 0-testvol-replicate-1: no subvolumes up
[2017-06-16 11:08:05.034865] W [MSGID: 109075] [dht-diskusage.c:44:dht_du_info_cbk] 0-testvol-dht: failed to get disk info from testvol-replicate-1 [Transport endpoint is not connected]
[2017-06-16 11:08:05.035468] I [dht-rebalance.c:4211:gf_defrag_start_crawl] 0-testvol-dht: gf_defrag_start_crawl using commit hash 3390955361
[2017-06-16 11:08:05.035548] I [MSGID: 108006] [afr-common.c:4854:afr_local_init] 0-testvol-replicate-1: no subvolumes up
[2017-06-16 11:08:05.042751] I [MSGID: 109081] [dht-common.c:4258:dht_setxattr] 0-testvol-dht: fixing the layout of /
[2017-06-16 11:08:05.042778] W [MSGID: 109016] [dht-selfheal.c:1738:dht_fix_layout_of_directory] 0-testvol-dht: Layout fix failed: 1 subvolume(s) are down. Skipping fix layout.
[2017-06-16 11:08:05.043029] E [MSGID: 109026] [dht-rebalance.c:4253:gf_defrag_start_crawl] 0-testvol-dht: fix layout on / failed
[2017-06-16 11:08:05.043176] I [MSGID: 109028] [dht-rebalance.c:4713:gf_defrag_status_get] 0-testvol-dht: Rebalance is failed. Time taken is 25.00 secs
[2017-06-16 11:08:05.043188] I [MSGID: 109028] [dht-rebalance.c:4717:gf_defrag_status_get] 0-testvol-dht: Files migrated: 0, size: 0, lookups: 0, failures: 1, skipped: 0
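The decisive lines are the "no subvolumes up" / "1 subvolume(s) are down" messages on testvol-replicate-1, followed by the fix-layout failure on /. A small, hypothetical triage helper (the patterns below are taken from this excerpt; in a real run the input would be something like /var/log/glusterfs/testvol-rebalance.log, which is an assumption here):

```shell
# Hypothetical helper: surface the lines that explain this failure
# pattern (fix-layout aborted because a subvolume was down) from a
# DHT rebalance log read on stdin.
triage_rebalance_log() {
    grep -E 'fix layout on .* failed|subvolume\(s\) are down|no subvolumes up|Transport endpoint is not connected'
}

# Example against a few lines from the excerpt above:
triage_rebalance_log <<'EOF'
[2017-06-16 11:08:05.034831] I [MSGID: 108006] [afr-common.c:4854:afr_local_init] 0-testvol-replicate-1: no subvolumes up
[2017-06-16 11:08:05.042778] W [MSGID: 109016] [dht-selfheal.c:1738:dht_fix_layout_of_directory] 0-testvol-dht: Layout fix failed: 1 subvolume(s) are down. Skipping fix layout.
[2017-06-16 11:08:05.043029] E [MSGID: 109026] [dht-rebalance.c:4253:gf_defrag_start_crawl] 0-testvol-dht: fix layout on / failed
EOF
```

All three sample lines match, which is the point: the rebalance failure here is a symptom of the disconnection, so the first question is why replicate-1's bricks were unreachable at crawl start.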
Milind, can you take a look at the logs and identify the cause of the disconnection?
What's the latest on this bug? Can we confirm whether it is still valid in the latest releases? Given that this bug is quite old, we should try to take it to closure.
Requesting re-validation of this BZ from Nag.
See comment #16.
Closing - if it happens again or we have more information, please re-open.