Bug 1462181 - [Stress] : Rebalance fails when Geo Rep is in progress
[Stress] : Rebalance fails when Geo Rep is in progress
Status: ASSIGNED
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: rpc (Show other bugs)
3.3
x86_64 Linux
unspecified Severity high
: ---
: ---
Assigned To: Milind Changire
Rahul Hinduja
rpc-3.4.0?
: ZStream
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2017-06-16 07:42 EDT by Ambarish
Modified: 2017-09-28 13:11 EDT (History)
7 users (show)

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Ambarish 2017-06-16 07:42:28 EDT
Description of problem:
-----------------------


4 Node Master
2 Node Slave.

Geo-rep syncing was I.P. (unsure if this has got anything to do with it).

Was running Bonnie,smallfiles,iozone and kernel untar from 8 FUSE mounts.

Added two bricks and triggered a rebalance.

Firstly it showed a negative value.

Then it failed :

[root@gqas013 glusterfs]# gluster v rebalance testvol  status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             1             0               failed        0:00:25
      gqas005.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             1             0               failed        0:00:20
      gqas006.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             1             0               failed        0:00:20
      gqas008.sbu.lab.eng.bos.redhat.com                0        0Bytes             0             1             0               failed        0:00:20
volume rebalance: testvol: success
[root@gqas013 glusterfs]# 



Version-Release number of selected component (if applicable):
-------------------------------------------------------------

glusterfs-3.8.4-28.el7rhgs.x86_64

How reproducible:
-----------------

1/1



Actual results:
----------------

Rebalance fails.

Expected results:
-----------------

No rebalance failures.

Additional info:
-----------------


*MAster* :

 
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 29f18f45-c822-4c0e-84ef-737e128e0368
Status: Started
Snapshot Count: 0
Number of Bricks: 12 x 2 = 24
Transport-type: tcp
Bricks:
Brick1: gqas005.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick2: gqas013.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick4: gqas008.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick5: gqas005.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick6: gqas013.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick7: gqas006.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick8: gqas008.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick9: gqas005.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick10: gqas013.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick11: gqas006.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick12: gqas008.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick13: gqas005.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick14: gqas013.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick15: gqas006.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick16: gqas008.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick17: gqas005.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick18: gqas013.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick19: gqas006.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick20: gqas008.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick21: gqas005.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick22: gqas013.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick23: gqas006.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick24: gqas008.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Options Reconfigured:
nfs.disable: off
transport.address-family: inet
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 50000
server.event-threads: 4
client.event-threads: 4
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
changelog.changelog: on
cluster.enable-shared-storage: enable
[root@gqas013 glusterfs]# 

*Slave* :

[root@gqas015 ~]# gluster v info
 
Volume Name: butcher
Type: Distributed-Disperse
Volume ID: 6de155ee-2200-44bb-a8ed-bdae5acf348f
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (4 + 2) = 24
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick2: gqas015.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick3: gqas014.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick4: gqas015.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick5: gqas014.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick6: gqas015.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick7: gqas014.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick8: gqas015.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick9: gqas014.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick10: gqas015.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick11: gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick12: gqas015.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick13: gqas014.sbu.lab.eng.bos.redhat.com:/bricks7/A1
Brick14: gqas015.sbu.lab.eng.bos.redhat.com:/bricks7/A1
Brick15: gqas014.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Brick16: gqas015.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Brick17: gqas014.sbu.lab.eng.bos.redhat.com:/bricks9/A1
Brick18: gqas015.sbu.lab.eng.bos.redhat.com:/bricks9/A1
Brick19: gqas014.sbu.lab.eng.bos.redhat.com:/bricks10/A1
Brick20: gqas015.sbu.lab.eng.bos.redhat.com:/bricks10/A1
Brick21: gqas014.sbu.lab.eng.bos.redhat.com:/bricks11/A1
Brick22: gqas015.sbu.lab.eng.bos.redhat.com:/bricks11/A1
Brick23: gqas014.sbu.lab.eng.bos.redhat.com:/bricks12/A1
Brick24: gqas015.sbu.lab.eng.bos.redhat.com:/bricks12/A1
Options Reconfigured:
network.inode-lru-limit: 50000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
transport.address-family: inet
nfs.disable: off
[root@gqas015 ~]# 



[root@gqas013 glusterfs]# gluster volume geo-replication testvol gqas014.sbu.lab.eng.bos.redhat.com::butcher  status  detail
 
MASTER NODE                           MASTER VOL    MASTER BRICK    SLAVE USER    SLAVE                                          SLAVE NODE                            STATUS     CRAWL STATUS    LAST_SYNCED    ENTRY    DATA    META    FAILURES    CHECKPOINT TIME        CHECKPOINT COMPLETED    CHECKPOINT COMPLETION TIME   
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
gqas013.sbu.lab.eng.bos.redhat.com    testvol       /bricks1/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas014.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7299    0       0           2017-06-16 05:32:22    No                      N/A                          
gqas013.sbu.lab.eng.bos.redhat.com    testvol       /bricks2/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas014.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7321    0       0           2017-06-16 05:32:22    No                      N/A                          
gqas013.sbu.lab.eng.bos.redhat.com    testvol       /bricks3/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas014.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7303    0       0           2017-06-16 05:32:22    No                      N/A                          
gqas013.sbu.lab.eng.bos.redhat.com    testvol       /bricks4/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas014.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            8192     7318    0       0           2017-06-16 05:32:22    No                      N/A                          
gqas013.sbu.lab.eng.bos.redhat.com    testvol       /bricks5/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas014.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7308    0       0           2017-06-16 05:32:22    No                      N/A                          
gqas013.sbu.lab.eng.bos.redhat.com    testvol       /bricks6/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas014.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7293    0       0           2017-06-16 05:32:22    No                      N/A                          
gqas005.sbu.lab.eng.bos.redhat.com    testvol       /bricks1/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas015.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
gqas005.sbu.lab.eng.bos.redhat.com    testvol       /bricks2/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas015.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
gqas005.sbu.lab.eng.bos.redhat.com    testvol       /bricks3/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas015.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
gqas005.sbu.lab.eng.bos.redhat.com    testvol       /bricks4/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas015.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
gqas005.sbu.lab.eng.bos.redhat.com    testvol       /bricks5/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas015.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
gqas005.sbu.lab.eng.bos.redhat.com    testvol       /bricks6/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas015.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
gqas008.sbu.lab.eng.bos.redhat.com    testvol       /bricks1/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas014.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
gqas008.sbu.lab.eng.bos.redhat.com    testvol       /bricks2/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas014.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
gqas008.sbu.lab.eng.bos.redhat.com    testvol       /bricks3/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas014.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7310    0       0           2017-06-16 05:32:22    No                      N/A                          
gqas008.sbu.lab.eng.bos.redhat.com    testvol       /bricks4/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas014.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
gqas008.sbu.lab.eng.bos.redhat.com    testvol       /bricks5/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas014.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
gqas008.sbu.lab.eng.bos.redhat.com    testvol       /bricks6/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas014.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
gqas006.sbu.lab.eng.bos.redhat.com    testvol       /bricks1/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas015.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7305    0       0           2017-06-16 05:32:22    No                      N/A                          
gqas006.sbu.lab.eng.bos.redhat.com    testvol       /bricks2/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas015.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            8192     7311    0       0           2017-06-16 05:32:22    No                      N/A                          
gqas006.sbu.lab.eng.bos.redhat.com    testvol       /bricks3/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas015.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
gqas006.sbu.lab.eng.bos.redhat.com    testvol       /bricks4/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas015.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7311    0       0           2017-06-16 05:32:22    No                      N/A                          
gqas006.sbu.lab.eng.bos.redhat.com    testvol       /bricks5/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas015.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7309    0       0           2017-06-16 05:32:22    No                      N/A                          
gqas006.sbu.lab.eng.bos.redhat.com    testvol       /bricks6/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas015.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7311    0       0           2017-06-16 05:32:22    No                      N/A                          
[root@gqas013 glusterfs]#
Comment 2 Ambarish 2017-06-16 07:46:40 EDT
From logs,I see failures while fixing layout on / and a heal failure before that :


[2017-06-16 11:07:50.852073] I [MSGID: 109028] [dht-rebalance.c:4717:gf_defrag_status_get] 0-glusterfs: Files migrated: 0, size: 0, lookups: 0, failures: 0, skipped: 0
[2017-06-16 11:08:05.034660] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-testvol-dht: Found anomalies in / (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0
[2017-06-16 11:08:05.034692] W [MSGID: 109005] [dht-selfheal.c:2111:dht_selfheal_directory] 0-testvol-dht: Directory selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =
[2017-06-16 11:08:05.034734] I [MSGID: 108031] [afr-common.c:2264:afr_local_discovery_cbk] 0-testvol-replicate-2: selecting local read_child testvol-client-5
[2017-06-16 11:08:05.034831] I [MSGID: 108006] [afr-common.c:4854:afr_local_init] 0-testvol-replicate-1: no subvolumes up
[2017-06-16 11:08:05.034865] W [MSGID: 109075] [dht-diskusage.c:44:dht_du_info_cbk] 0-testvol-dht: failed to get disk info from testvol-replicate-1 [Transport endpoint is not connected]
[2017-06-16 11:08:05.035468] I [dht-rebalance.c:4211:gf_defrag_start_crawl] 0-testvol-dht: gf_defrag_start_crawl using commit hash 3390955361
[2017-06-16 11:08:05.035548] I [MSGID: 108006] [afr-common.c:4854:afr_local_init] 0-testvol-replicate-1: no subvolumes up
[2017-06-16 11:08:05.042751] I [MSGID: 109081] [dht-common.c:4258:dht_setxattr] 0-testvol-dht: fixing the layout of /
[2017-06-16 11:08:05.042778] W [MSGID: 109016] [dht-selfheal.c:1738:dht_fix_layout_of_directory] 0-testvol-dht: Layout fix failed: 1 subvolume(s) are down. Skipping fix layout.
[2017-06-16 11:08:05.043029] E [MSGID: 109026] [dht-rebalance.c:4253:gf_defrag_start_crawl] 0-testvol-dht: fix layout on / failed
[2017-06-16 11:08:05.043176] I [MSGID: 109028] [dht-rebalance.c:4713:gf_defrag_status_get] 0-testvol-dht: Rebalance is failed. Time taken is 25.00 secs
[2017-06-16 11:08:05.043188] I [MSGID: 109028] [dht-rebalance.c:4717:gf_defrag_status_get] 0-testvol-dht: Files migrated: 0, size: 0, lookups: 0, failures: 1, skipped: 0
Comment 12 Raghavendra G 2017-09-07 02:29:43 EDT
Milind,

Can you take a look into logs and identify the cause for disconnection?

Note You need to log in before you can comment on or make changes to this bug.