Bug 1243786

Summary: DHT-rebalance: IO hangs on Fuse mount while rebalance is happening with a brick down and multiple add-bricks
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Triveni Rao <trao>
Component: distribute
Assignee: Nithya Balachandran <nbalacha>
Status: CLOSED CURRENTRELEASE
QA Contact: storage-qa-internal <storage-qa-internal>
Severity: urgent
Docs Contact:
Priority: high
Version: rhgs-3.1
CC: nbalacha, rcyriac, sashinde, tdesala
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard: Rebalance
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-07-22 04:51:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1243358
Bug Blocks:

Description Triveni Rao 2015-07-16 10:02:16 UTC
Problem statement:
========================
DHT-rebalance: IO hangs on Fuse mount while rebalance is happening with a brick down and multiple add-bricks 

Procedure:
===========
1. Create a 2x2 distributed-replicate volume.
2. Mount the volume over both the FUSE and NFS protocols.
3. Start untarring a kernel source tarball on the FUSE mount.
4. Run lookups on the NFS mount in a loop, e.g. find <file_name>.
5. Add brick pairs continuously, one after the other, e.g. add-brick b1 b1,
add-brick b2 b2, add-brick b3 b3.
6. Bring down one of the bricks from a replica pair.
7. Start rebalance with the force option.
8. Observe the rebalance status, the NFS lookups, and the untar.
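The steps above can be sketched as a shell script. This is a rough, untested reconstruction: the hostnames, brick paths, tarball path, and mount points are placeholders taken from this report, not the exact commands that were run.

```shell
#!/bin/sh
# Repro sketch; assumes a two-node cluster (10.70.47.165 / 10.70.47.163)
# with the brick directories already created.
VOL=good

# 1-2. Create and start a 2x2 distributed-replicate volume, then mount it.
gluster volume create $VOL replica 2 \
    10.70.47.165:/rhs/brick1/g0 10.70.47.163:/rhs/brick1/g0 \
    10.70.47.165:/rhs/brick2/g0 10.70.47.163:/rhs/brick2/g0
gluster volume start $VOL
mount -t glusterfs 10.70.47.165:/$VOL /mnt/fuse
mount -t nfs -o vers=3 10.70.47.165:/$VOL /mnt/nfs

# 3. Untar a kernel tarball on the FUSE mount (background).
( cd /mnt/fuse && tar xf /root/linux-4.1-rc7.tar.xz ) &

# 4. Lookup loop on the NFS mount (background).
while :; do find /mnt/nfs -name '*.c' >/dev/null 2>&1; done &

# 5. Add brick pairs one after the other.
for i in 3 4 5 6; do
    gluster volume add-brick $VOL \
        10.70.47.165:/rhs/brick$i/g0 10.70.47.163:/rhs/brick$i/g0
done

# 6. Kill one brick process of a replica pair
#    (pid taken from 'gluster v status', as done below with kill -9 2720).
# 7. Start rebalance with force, then watch status and the I/O.
gluster volume rebalance $VOL start force
```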

Actual result:
====================

NFS lookups continued without issues.
Untarring hung some time after rebalance started.

Expected result:
=================
I/O should continue without any issues.

Output:
===========



[root@rhsqa14-vm1 ~]# gluster v status
Status of volume: good
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.165:/rhs/brick1/g0           49152     0          Y       2720
Brick 10.70.47.163:/rhs/brick1/g0           49152     0          Y       545
Brick 10.70.47.165:/rhs/brick2/g0           49153     0          Y       2738
Brick 10.70.47.163:/rhs/brick2/g0           49153     0          Y       563
Snapshot Daemon on localhost                49154     0          Y       2919
NFS Server on localhost                     2049      0          Y       2929
Self-heal Daemon on localhost               N/A       N/A        Y       2765
Quota Daemon on localhost                   N/A       N/A        Y       2855
Snapshot Daemon on 10.70.47.163             49154     0          Y       711
NFS Server on 10.70.47.163                  2049      0          Y       720
Self-heal Daemon on 10.70.47.163            N/A       N/A        Y       591
Quota Daemon on 10.70.47.163                N/A       N/A        Y       662

Task Status of Volume good
------------------------------------------------------------------------------
There are no active volume tasks

[root@rhsqa14-vm1 ~]# 
[root@rhsqa14-vm1 ~]# gluster v add-brick good 10.70.47.165:/rhs/brick3/g0 10.70.47.163:/rhs/brick3/g0
volume add-brick: success
[root@rhsqa14-vm1 ~]# gluster v add-brick good 10.70.47.165:/rhs/brick4/g0 10.70.47.163:/rhs/brick4/g0
volume add-brick: success
[root@rhsqa14-vm1 ~]# gluster v add-brick good 10.70.47.165:/rhs/brick5/g0 10.70.47.163:/rhs/brick5/g0
volume add-brick: success
[root@rhsqa14-vm1 ~]# gluster v add-brick good 10.70.47.165:/rhs/brick6/g0 10.70.47.163:/rhs/brick6/g0
volume add-brick: success
[root@rhsqa14-vm1 ~]# gluster v info

Volume Name: good
Type: Distributed-Replicate
Volume ID: f16949f1-9bf8-4926-9828-405c76a084da
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp   
Bricks:
Brick1: 10.70.47.165:/rhs/brick1/g0
Brick2: 10.70.47.163:/rhs/brick1/g0
Brick3: 10.70.47.165:/rhs/brick2/g0
Brick4: 10.70.47.163:/rhs/brick2/g0
Brick5: 10.70.47.165:/rhs/brick3/g0
Brick6: 10.70.47.163:/rhs/brick3/g0
Brick7: 10.70.47.165:/rhs/brick4/g0
Brick8: 10.70.47.163:/rhs/brick4/g0
Brick9: 10.70.47.165:/rhs/brick5/g0
Brick10: 10.70.47.163:/rhs/brick5/g0
Brick11: 10.70.47.165:/rhs/brick6/g0
Brick12: 10.70.47.163:/rhs/brick6/g0
Options Reconfigured: 
features.uss: enable  
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
cluster.min-free-disk: 10
performance.readdir-ahead: on
[root@rhsqa14-vm1 ~]# gluster v status
Status of volume: good
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.165:/rhs/brick1/g0           49152     0          Y       2720
Brick 10.70.47.163:/rhs/brick1/g0           49152     0          Y       545
Brick 10.70.47.165:/rhs/brick2/g0           49153     0          Y       2738
Brick 10.70.47.163:/rhs/brick2/g0           49153     0          Y       563
Brick 10.70.47.165:/rhs/brick3/g0           49155     0          Y       3441
Brick 10.70.47.163:/rhs/brick3/g0           49155     0          Y       1912
Brick 10.70.47.165:/rhs/brick4/g0           49156     0          Y       3577
Brick 10.70.47.163:/rhs/brick4/g0           49156     0          Y       2090
Brick 10.70.47.165:/rhs/brick5/g0           49157     0          Y       3689
Brick 10.70.47.163:/rhs/brick5/g0           49157     0          Y       2210
Brick 10.70.47.165:/rhs/brick6/g0           49158     0          Y       3803
Brick 10.70.47.163:/rhs/brick6/g0           49158     0          Y       2413
Snapshot Daemon on localhost                49154     0          Y       2919
NFS Server on localhost                     2049      0          Y       3823
Self-heal Daemon on localhost               N/A       N/A        Y       3831
Quota Daemon on localhost                   N/A       N/A        Y       3850
Snapshot Daemon on 10.70.47.163             49154     0          Y       711
NFS Server on 10.70.47.163                  2049      0          Y       2437
Self-heal Daemon on 10.70.47.163            N/A       N/A        Y       2445
Quota Daemon on 10.70.47.163                N/A       N/A        Y       2455

Task Status of Volume good
------------------------------------------------------------------------------
There are no active volume tasks

[root@rhsqa14-vm1 ~]# kill -9 2720
[root@rhsqa14-vm1 ~]# 
[root@rhsqa14-vm1 ~]# gluster v status
Status of volume: good
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.165:/rhs/brick1/g0           N/A       N/A        N       N/A
Brick 10.70.47.163:/rhs/brick1/g0           49152     0          Y       545
Brick 10.70.47.165:/rhs/brick2/g0           49153     0          Y       2738
Brick 10.70.47.163:/rhs/brick2/g0           49153     0          Y       563
Brick 10.70.47.165:/rhs/brick3/g0           49155     0          Y       3441
Brick 10.70.47.163:/rhs/brick3/g0           49155     0          Y       1912
Brick 10.70.47.165:/rhs/brick4/g0           49156     0          Y       3577
Brick 10.70.47.163:/rhs/brick4/g0           49156     0          Y       2090
Brick 10.70.47.165:/rhs/brick5/g0           49157     0          Y       3689
Brick 10.70.47.163:/rhs/brick5/g0           49157     0          Y       2210
Brick 10.70.47.165:/rhs/brick6/g0           49158     0          Y       3803
Brick 10.70.47.163:/rhs/brick6/g0           49158     0          Y       2413
Snapshot Daemon on localhost                49154     0          Y       2919
NFS Server on localhost                     2049      0          Y       3823
Self-heal Daemon on localhost               N/A       N/A        Y       3831
Quota Daemon on localhost                   N/A       N/A        Y       3850
Snapshot Daemon on 10.70.47.163             49154     0          Y       711
NFS Server on 10.70.47.163                  2049      0          Y       2437
Self-heal Daemon on 10.70.47.163            N/A       N/A        Y       2445
Quota Daemon on 10.70.47.163                N/A       N/A        Y       2455

Task Status of Volume good
------------------------------------------------------------------------------
There are no active volume tasks

[root@rhsqa14-vm1 ~]# gluster v rebalance good start force
volume rebalance: good: success: Rebalance on good has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 1088d13b-aefb-4ea6-ba0f-87c339fec1bf

[root@rhsqa14-vm1 ~]# gluster v rebalance good status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost               27       644.2KB           109             0             0          in progress               8.00
                            10.70.47.163                0        0Bytes           113             0            60          in progress               8.00
volume rebalance: good: success:
[root@rhsqa14-vm1 ~]# gluster v rebalance good status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost              118         1.0MB           450             0             0          in progress              30.00
                            10.70.47.163                0        0Bytes           369             0           214          in progress              30.00
volume rebalance: good: success:
[root@rhsqa14-vm1 ~]# gluster v rebalance good status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost              293         2.7MB           846             0             0          in progress              74.00
                            10.70.47.163                0        0Bytes           610             0           374          in progress              74.00
volume rebalance: good: success:
[root@rhsqa14-vm1 ~]# gluster v rebalance good status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost              529         3.1MB          1409             0             0          in progress             132.00
                            10.70.47.163                0        0Bytes          1041             0           621          in progress             133.00
volume rebalance: good: success:
[root@rhsqa14-vm1 ~]# gluster v rebalance good status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost              793         3.5MB          2199             0             0          in progress             199.00
                            10.70.47.163                0        0Bytes          1572             0           905          in progress             199.00
volume rebalance: good: success:
[root@rhsqa14-vm1 ~]# gluster v rebalance good status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost             1520         8.5MB          5080             0             0          in progress             372.00
                            10.70.47.163                1        0Bytes          3079             0          1514          in progress             372.00
volume rebalance: good: success:
[root@rhsqa14-vm1 ~]# gluster v rebalance good status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost             1766         9.6MB          6166             0             0          in progress             420.00
                            10.70.47.163                1        0Bytes          3503             0          1692          in progress             420.00
volume rebalance: good: success:
[root@rhsqa14-vm1 ~]# gluster v rebalance good status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost             4043        23.5MB         24452             0             0            completed             905.00
                            10.70.47.163                1        0Bytes          7452             0          2221            completed             713.00
volume rebalance: good: success:
[root@rhsqa14-vm1 ~]# 
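The repeated status invocations above can be automated: the per-node state sits in the second-to-last column of each data row, so it can be pulled out with awk. A minimal sketch; the sample line is copied from the final status output above, and the field positions assume the table layout shown in this report:

```shell
#!/bin/sh
# Extract the rebalance state from one data row of
# "gluster v rebalance <vol> status" output. The status is the
# second-to-last whitespace-separated field; run time is the last.
sample='   localhost   4043   23.5MB   24452   0   0   completed   905.00'
state=$(printf '%s\n' "$sample" | awk '{print $(NF-1)}')
printf '%s\n' "$state"
```

In practice one would wrap this in a loop, re-running the status command every few seconds until every node reports completed or failed.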



linux-4.1-rc7/drivers/net/ethernet/intel/e1000/e1000_osdep.h
linux-4.1-rc7/drivers/net/ethernet/intel/e1000/e1000_param.c
linux-4.1-rc7/drivers/net/ethernet/intel/e1000e/
linux-4.1-rc7/drivers/net/ethernet/intel/e1000e/80003es2lan.c
linux-4.1-rc7/drivers/net/ethernet/intel/e1000e/80003es2lan.h
linux-4.1-rc7/drivers/net/ethernet/intel/e1000e/82571.c
linux-4.1-rc7/drivers/net/ethernet/intel/e1000e/82571.h
linux-4.1-rc7/drivers/net/ethernet/intel/e1000e/Makefile
linux-4.1-rc7/drivers/net/ethernet/intel/e1000e/defines.h
linux-4.1-rc7/drivers/net/ethernet/intel/e1000e/e1000.h
linux-4.1-rc7/drivers/net/ethernet/intel/e1000e/ethtool.c
linux-4.1-rc7/drivers/net/ethernet/intel/e1000e/hw.h
linux-4.1-rc7/drivers/net/ethernet/intel/e1000e/ich8lan.c
linux-4.1-rc7/drivers/net/ethernet/intel/e1000e/ich8lan.h
linux-4.1-rc7/drivers/net/ethernet/intel/e1000e/mac.c
linux-4.1-rc7/drivers/net/ethernet/intel/e1000e/mac.h
linux-4.1-rc7/drivers/net/ethernet/intel/e1000e/manage.c
linux-4.1-rc7/drivers/net/ethernet/intel/e1000e/manage.h
linux-4.1-rc7/drivers/net/ethernet/intel/e1000e/netdev.c
linux-4.1-rc7/drivers/net/ethernet/intel/e1000e/nvm.c
linux-4.1-rc7/drivers/net/ethernet/intel/e1000e/nvm.h
linux-4.1-rc7/drivers/net/ethernet/intel/e1000e/param.c
linux-4.1-rc7/drivers/net/ethernet/intel/e1000e/phy.c
linux-4.1-rc7/drivers/net/ethernet/intel/e1000e/phy.h
linux-4.1-rc7/drivers/net/ethernet/intel/e1000e/ptp.c
linux-4.1-rc7/drivers/net/ethernet/intel/e1000e/regs.h
linux-4.1-rc7/drivers/net/ethernet/intel/fm10k/
linux-4.1-rc7/drivers/net/ethernet/intel/fm10k/Makefile
linux-4.1-rc7/drivers/net/ethernet/intel/fm10k/fm10k.h
linux-4.1-rc7/drivers/net/ethernet/intel/fm10k/fm10k_common.c
linux-4.1-rc7/drivers/net/ethernet/intel/fm10k/fm10k_common.h
linux-4.1-rc7/drivers/net/ethernet/intel/fm10k/fm10k_dcbnl.c
linux-4.1-rc7/drivers/net/ethernet/intel/fm10k/fm10k_debugfs.c
linux-4.1-rc7/drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c
linux-4.1-rc7/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
linux-4.1-rc7/drivers/net/ethernet/intel/fm10k/fm10k_main.c



[root@rhsqa14-vm1 ~]# less /var/log/glusterfs/good-rebalance.log 
[2015-07-16 08:16:17.948316] I [MSGID: 100030] [glusterfsd.c:2301:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.1 (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/good --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=5 --xlator-option *dht.node-uuid=c2746cfb-5132-4289-bb04-d118619630c1 --xlator-option *dht.commit-hash=2906342028 --socket-file /var/run/gluster/gluster-rebalance-f16949f1-9bf8-4926-9828-405c76a084da.sock --pid-file /var/lib/glusterd/vols/good/rebalance/c2746cfb-5132-4289-bb04-d118619630c1.pid -l /var/log/glusterfs/good-rebalance.log)
[2015-07-16 08:16:17.977821] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2015-07-16 08:16:22.976641] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-dht: adding option 'commit-hash' for volume 'good-dht' with value '2906342028'
[2015-07-16 08:16:22.976668] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-dht: adding option 'node-uuid' for volume 'good-dht' with value 'c2746cfb-5132-4289-bb04-d118619630c1'
[2015-07-16 08:16:22.976686] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-dht: adding option 'rebalance-cmd' for volume 'good-dht' with value '5'
[2015-07-16 08:16:22.976699] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-dht: adding option 'readdir-optimize' for volume 'good-dht' with value 'on'
[2015-07-16 08:16:22.976714] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-dht: adding option 'assert-no-child-down' for volume 'good-dht' with value 'yes'
[2015-07-16 08:16:22.976726] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-dht: adding option 'lookup-unhashed' for volume 'good-dht' with value 'yes'
[2015-07-16 08:16:22.976737] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-dht: adding option 'use-readdirp' for volume 'good-dht' with value 'yes'
[2015-07-16 08:16:22.976752] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-replicate-5: adding option 'readdir-failover' for volume 'good-replicate-5' with value 'off'
[2015-07-16 08:16:22.976774] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-replicate-5: adding option 'entry-self-heal' for volume 'good-replicate-5' with value 'off'
[2015-07-16 08:16:22.976788] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-replicate-5: adding option 'metadata-self-heal' for volume 'good-replicate-5' with value 'off'
[2015-07-16 08:16:22.976800] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-replicate-5: adding option 'data-self-heal' for volume 'good-replicate-5' with value 'off'
[2015-07-16 08:16:22.976814] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-replicate-4: adding option 'readdir-failover' for volume 'good-replicate-4' with value 'off'
[2015-07-16 08:16:22.976826] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-replicate-4: adding option 'entry-self-heal' for volume 'good-replicate-4' with value 'off'
[2015-07-16 08:16:22.976837] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-replicate-4: adding option 'metadata-self-heal' for volume 'good-replicate-4' with value 'off'
[2015-07-16 08:16:22.976849] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-replicate-4: adding option 'data-self-heal' for volume 'good-replicate-4' with value 'off'
[2015-07-16 08:16:22.976863] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-replicate-3: adding option 'readdir-failover' for volume 'good-replicate-3' with value 'off'
[2015-07-16 08:16:22.976875] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-replicate-3: adding option 'entry-self-heal' for volume 'good-replicate-3' with value 'off'
[2015-07-16 08:16:22.976893] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-replicate-3: adding option 'metadata-self-heal' for volume 'good-replicate-3' with value 'off'
[2015-07-16 08:16:22.976905] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-replicate-3: adding option 'data-self-heal' for volume 'good-replicate-3' with value 'off'
[2015-07-16 08:16:22.976919] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-replicate-2: adding option 'readdir-failover' for volume 'good-replicate-2' with value 'off'
[2015-07-16 08:16:22.976931] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-replicate-2: adding option 'entry-self-heal' for volume 'good-replicate-2' with value 'off'
[2015-07-16 08:16:22.976943] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-replicate-2: adding option 'metadata-self-heal' for volume 'good-replicate-2' with value 'off'
[2015-07-16 08:16:22.976954] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-replicate-2: adding option 'data-self-heal' for volume 'good-replicate-2' with value 'off'
[2015-07-16 08:16:22.976968] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-replicate-1: adding option 'readdir-failover' for volume 'good-replicate-1' with value 'off'
[2015-07-16 08:16:22.976980] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-replicate-1: adding option 'entry-self-heal' for volume 'good-replicate-1' with value 'off'
[2015-07-16 08:16:22.976992] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-replicate-1: adding option 'metadata-self-heal' for volume 'good-replicate-1' with value 'off'
[2015-07-16 08:16:22.977004] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-replicate-1: adding option 'data-self-heal' for volume 'good-replicate-1' with value 'off'
[2015-07-16 08:16:22.977018] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-replicate-0: adding option 'readdir-failover' for volume 'good-replicate-0' with value 'off'
[2015-07-16 08:16:22.977029] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-replicate-0: adding option 'entry-self-heal' for volume 'good-replicate-0' with value 'off'
[2015-07-16 08:16:22.977041] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-replicate-0: adding option 'metadata-self-heal' for volume 'good-replicate-0' with value 'off'
[2015-07-16 08:16:22.977053] I [MSGID: 101173] [graph.c:271:gf_add_cmdline_options] 0-good-replicate-0: adding option 'data-self-heal' for volume 'good-replicate-0' with value 'off'
[2015-07-16 08:16:22.977826] I [MSGID: 109104] [dht-shared.c:607:dht_init] 0-good-dht: dht_init using commit hash 2906342028
[2015-07-16 08:16:23.002575] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2015-07-16 08:16:23.008072] W [MSGID: 101174] [graph.c:362:_log_if_unknown_option] 0-good-replicate-5: option 'readdir-failover' is not recognized
[2015-07-16 08:16:23.008117] W [MSGID: 101174] [graph.c:362:_log_if_unknown_option] 0-good-replicate-4: option 'readdir-failover' is not recognized
[2015-07-16 08:16:23.008157] W [MSGID: 101174] [graph.c:362:_log_if_unknown_option] 0-good-replicate-3: option 'readdir-failover' is not recognized
[2015-07-16 08:16:23.008186] W [MSGID: 101174] [graph.c:362:_log_if_unknown_option] 0-good-replicate-2: option 'readdir-failover' is not recognized
[2015-07-16 08:16:23.008213] W [MSGID: 101174] [graph.c:362:_log_if_unknown_option] 0-good-replicate-1: option 'readdir-failover' is not recognized
[2015-07-16 08:16:23.008239] W [MSGID: 101174] [graph.c:362:_log_if_unknown_option] 0-good-replicate-0: option 'readdir-failover' is not recognized
[2015-07-16 08:16:23.008386] I [MSGID: 114020] [client.c:2118:notify] 0-good-client-0: parent translators are ready, attempting connect on transport
[2015-07-16 08:16:23.012139] I [MSGID: 114020] [client.c:2118:notify] 0-good-client-1: parent translators are ready, attempting connect on transport
[2015-07-16 08:16:23.012655] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-good-client-0: changing port to 49152 (from 0)
[2015-07-16 08:16:23.015625] I [MSGID: 114020] [client.c:2118:notify] 0-good-client-2: parent translators are ready, attempting connect on transport
[2015-07-16 08:16:23.019025] E [socket.c:2332:socket_connect_finish] 0-good-client-0: connection to 10.70.47.165:49152 failed (Connection refused)
[2015-07-16 08:16:23.022314] I [MSGID: 114020] [client.c:2118:notify] 0-good-client-3: parent translators are ready, attempting connect on transport
[2015-07-16 08:16:23.022574] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-good-client-2: changing port to 49153 (from 0)
[2015-07-16 08:16:23.025647] I [MSGID: 114020] [client.c:2118:notify] 0-good-client-4: parent translators are ready, attempting connect on transport
[2015-07-16 08:16:23.029788] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-good-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
...skipping...
[2015-07-16 08:31:34.094943] I [MSGID: 109018] [dht-common.c:814:dht_revalidate_cbk] 0-good-dht: Mismatching layouts for /linux-4.1-rc7/drivers/net/ethernet/intel/e1000e, gfid = 7fe6d32f-c742-4447-ab25-70a7fa89f72a
[2015-07-16 08:31:34.096168] I [MSGID: 109064] [dht-layout.c:823:dht_layout_dir_mismatch] 0-good-dht: subvol: good-replicate-4; inode layout - 3400826014 - 4294967295 - 1; disk layout - 3578999630 - 4294967295 - 1
[2015-07-16 08:31:34.096259] I [MSGID: 109018] [dht-common.c:814:dht_revalidate_cbk] 0-good-dht: Mismatching layouts for /linux-4.1-rc7/drivers/net/ethernet/intel/e1000e, gfid = 7fe6d32f-c742-4447-ab25-70a7fa89f72a
[2015-07-16 08:31:34.096373] I [MSGID: 109064] [dht-layout.c:823:dht_layout_dir_mismatch] 0-good-dht: subvol: good-replicate-3; inode layout - 2506928916 - 3400826013 - 1; disk layout - 2863199704 - 3578999629 - 1
[2015-07-16 08:31:34.096407] I [MSGID: 109018] [dht-common.c:814:dht_revalidate_cbk] 0-good-dht: Mismatching layouts for /linux-4.1-rc7/drivers/net/ethernet/intel/e1000e, gfid = 7fe6d32f-c742-4447-ab25-70a7fa89f72a
[2015-07-16 08:31:34.096837] I [MSGID: 109064] [dht-layout.c:823:dht_layout_dir_mismatch] 0-good-dht: subvol: good-replicate-5; inode layout - 0 - 893897097 - 1; disk layout - 0 - 715799925 - 1
[2015-07-16 08:31:34.096879] I [MSGID: 109018] [dht-common.c:814:dht_revalidate_cbk] 0-good-dht: Mismatching layouts for /linux-4.1-rc7/drivers/net/ethernet/intel/e1000e, gfid = 7fe6d32f-c742-4447-ab25-70a7fa89f72a
[2015-07-16 08:31:34.103251] I [dht-rebalance.c:2119:gf_defrag_process_dir] 0-good-dht: migrate data called on /linux-4.1-rc7/drivers/net/ethernet/intel/e1000e
[2015-07-16 08:31:34.207424] I [dht-rebalance.c:2328:gf_defrag_process_dir] 0-good-dht: Migration operation on dir /linux-4.1-rc7/drivers/net/ethernet/intel/e1000e took 0.10 secs
[2015-07-16 08:31:34.250153] I [MSGID: 109081] [dht-common.c:3783:dht_setxattr] 0-good-dht: fixing the layout of /linux-4.1-rc7/drivers/net/ethernet/intel/fm10k
[2015-07-16 08:31:34.250204] I [MSGID: 109045] [dht-selfheal.c:1496:dht_fix_layout_of_directory] 0-good-dht: subvolume 0 (good-replicate-0): 20480 chunks
[2015-07-16 08:31:34.250284] I [MSGID: 109045] [dht-selfheal.c:1496:dht_fix_layout_of_directory] 0-good-dht: subvolume 1 (good-replicate-1): 20480 chunks
[2015-07-16 08:31:34.250308] I [MSGID: 109045] [dht-selfheal.c:1496:dht_fix_layout_of_directory] 0-good-dht: subvolume 2 (good-replicate-2): 50914 chunks
[2015-07-16 08:31:34.250331] I [MSGID: 109045] [dht-selfheal.c:1496:dht_fix_layout_of_directory] 0-good-dht: subvolume 3 (good-replicate-3): 50914 chunks
[2015-07-16 08:31:34.250364] I [MSGID: 109045] [dht-selfheal.c:1496:dht_fix_layout_of_directory] 0-good-dht: subvolume 4 (good-replicate-4): 50914 chunks
[2015-07-16 08:31:34.250387] I [MSGID: 109045] [dht-selfheal.c:1496:dht_fix_layout_of_directory] 0-good-dht: subvolume 5 (good-replicate-5): 50914 chunks
[2015-07-16 08:31:34.266859] I [MSGID: 109064] [dht-layout.c:823:dht_layout_dir_mismatch] 0-good-dht: subvol: good-replicate-0; inode layout - 2681691294 - 3041258653 - 1; disk layout - 2147399778 - 2863199703 - 1
[2015-07-16 08:31:34.266907] I [MSGID: 109018] [dht-common.c:814:dht_revalidate_cbk] 0-good-dht: Mismatching layouts for /linux-4.1-rc7/drivers/net/ethernet/intel/fm10k, gfid = 34abab2b-80d2-45e3-a293-d52f65ad4e5f
[2015-07-16 08:31:34.268559] I [MSGID: 109064] [dht-layout.c:823:dht_layout_dir_mismatch] 0-good-dht: subvol: good-replicate-2; inode layout - 3400826014 - 4294967295 - 1; disk layout - 3578999630 - 4294967295 - 1
[2015-07-16 08:31:34.268604] I [MSGID: 109018] [dht-common.c:814:dht_revalidate_cbk] 0-good-dht: Mismatching layouts for /linux-4.1-rc7/drivers/net/ethernet/intel/fm10k, gfid = 34abab2b-80d2-45e3-a293-d52f65ad4e5f
[2015-07-16 08:31:34.268733] I [MSGID: 109064] [dht-layout.c:823:dht_layout_dir_mismatch] 0-good-dht: subvol: good-replicate-1; inode layout - 3041258654 - 3400826013 - 1; disk layout - 2863199704 - 3578999629 - 1
[2015-07-16 08:31:34.268766] I [MSGID: 109018] [dht-common.c:814:dht_revalidate_cbk] 0-good-dht: Mismatching layouts for /linux-4.1-rc7/drivers/net/ethernet/intel/fm10k, gfid = 34abab2b-80d2-45e3-a293-d52f65ad4e5f
[2015-07-16 08:31:34.268860] I [MSGID: 109064] [dht-layout.c:823:dht_layout_dir_mismatch] 0-good-dht: subvol: good-replicate-4; inode layout - 893897098 - 1787794195 - 1; disk layout - 715799926 - 1431599851 - 1
[2015-07-16 08:31:34.268891] I [MSGID: 109018] [dht-common.c:814:dht_revalidate_cbk] 0-good-dht: Mismatching layouts for /linux-4.1-rc7/drivers/net/ethernet/intel/fm10k, gfid = 34abab2b-80d2-45e3-a293-d52f65ad4e5f
[2015-07-16 08:31:34.269633] I [MSGID: 109064] [dht-layout.c:823:dht_layout_dir_mismatch] 0-good-dht: subvol: good-replicate-3; inode layout - 0 - 893897097 - 1; disk layout - 0 - 715799925 - 1
[2015-07-16 08:31:34.269688] I [MSGID: 109018] [dht-common.c:814:dht_revalidate_cbk] 0-good-dht: Mismatching layouts for /linux-4.1-rc7/drivers/net/ethernet/intel/fm10k, gfid = 34abab2b-80d2-45e3-a293-d52f65ad4e5f
[2015-07-16 08:31:34.269851] I [MSGID: 109064] [dht-layout.c:823:dht_layout_dir_mismatch] 0-good-dht: subvol: good-replicate-5; inode layout - 1787794196 - 2681691293 - 1; disk layout - 1431599852 - 2147399777 - 1
[2015-07-16 08:31:34.269896] I [MSGID: 109018] [dht-common.c:814:dht_revalidate_cbk] 0-good-dht: Mismatching layouts for /linux-4.1-rc7/drivers/net/ethernet/intel/fm10k, gfid = 34abab2b-80d2-45e3-a293-d52f65ad4e5f
[2015-07-16 08:31:34.274803] I [dht-rebalance.c:2119:gf_defrag_process_dir] 0-good-dht: migrate data called on /linux-4.1-rc7/drivers/net/ethernet/intel/fm10k
[2015-07-16 08:31:34.319042] I [dht-rebalance.c:2328:gf_defrag_process_dir] 0-good-dht: Migration operation on dir /linux-4.1-rc7/drivers/net/ethernet/intel/fm10k took 0.04 secs
[2015-07-16 08:31:34.416991] I [dht-rebalance.c:2809:gf_defrag_start_crawl] 0-DHT: crawling file-system completed
[2015-07-16 08:31:34.417240] I [dht-rebalance.c:1764:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 3
[2015-07-16 08:31:34.417714] I [dht-rebalance.c:1764:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 4
[2015-07-16 08:31:34.417927] I [MSGID: 109028] [dht-rebalance.c:3029:gf_defrag_status_get] 0-good-dht: Rebalance is completed. Time taken is 905.00 secs
[2015-07-16 08:31:34.417968] I [MSGID: 109028] [dht-rebalance.c:3033:gf_defrag_status_get] 0-good-dht: Files migrated: 4043, size: 24647589, lookups: 24452, failures: 0, skipped: 0
[2015-07-16 08:31:34.418667] W [glusterfsd.c:1219:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x3fbcc07a51) [0x7f92bcd37a51] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xd5) [0x7f92be116075] -->/usr/sbin/glusterfs(cleanup_and_exit+0x71) [0x7f92be115ba1] ) 0-: received signum (15), shutting down


(gdb) info threads
  9 Thread 0x7ffb87423700 (LWP 2440)  0x00007ffb8f200fbd in nanosleep () from /lib64/libpthread.so.0
  8 Thread 0x7ffb86a22700 (LWP 2441)  0x00007ffb8f201535 in sigwait () from /lib64/libpthread.so.0
  7 Thread 0x7ffb86021700 (LWP 2442)  0x00007ffb8f1fda0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  6 Thread 0x7ffb85620700 (LWP 2443)  0x00007ffb8eb730ce in __lll_lock_wait_private () from /lib64/libc.so.6
  5 Thread 0x7ffb833ed700 (LWP 2445)  0x00007ffb8eb730ce in __lll_lock_wait_private () from /lib64/libc.so.6
  4 Thread 0x7ffb80e48700 (LWP 2446)  0x00007ffb8eb63f63 in epoll_wait () from /lib64/libc.so.6
  3 Thread 0x7ffb7aaa3700 (LWP 2449)  0x00007ffb8eb5b9b7 in readv () from /lib64/libc.so.6
  2 Thread 0x7ffb7a0a2700 (LWP 2450)  0x00007ffb8f2007dd in read () from /lib64/libpthread.so.0
* 1 Thread 0x7ffb905bc740 (LWP 2439)  0x00007ffb8f1fa2ad in pthread_join () from /lib64/libpthread.so.0
(gdb) t 6
[Switching to thread 6 (Thread 0x7ffb85620700 (LWP 2443))]#0  0x00007ffb8eb730ce in __lll_lock_wait_private () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffb8eb730ce in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007ffb8eaf7f7e in _L_lock_5746 () from /lib64/libc.so.6
#2  0x00007ffb8eaf3a8b in _int_free () from /lib64/libc.so.6
#3  0x00007ffb9015a578 in synctask_destroy () from /usr/lib64/libglusterfs.so.0
#4  0x00007ffb9015aa90 in syncenv_processor () from /usr/lib64/libglusterfs.so.0
#5  0x00007ffb8f1f9a51 in start_thread () from /lib64/libpthread.so.0
#6  0x00007ffb8eb6396d in clone () from /lib64/libc.so.6
(gdb) t 7
[Switching to thread 7 (Thread 0x7ffb86021700 (LWP 2442))]#0  0x00007ffb8f1fda0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007ffb8f1fda0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ffb90155c8b in syncenv_task () from /usr/lib64/libglusterfs.so.0
#2  0x00007ffb9015aaa0 in syncenv_processor () from /usr/lib64/libglusterfs.so.0
#3  0x00007ffb8f1f9a51 in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffb8eb6396d in clone () from /lib64/libc.so.6
(gdb) t 5
[Switching to thread 5 (Thread 0x7ffb833ed700 (LWP 2445))]#0  0x00007ffb8eb730ce in __lll_lock_wait_private () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffb8eb730ce in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007ffb8eaf829f in _L_lock_9730 () from /lib64/libc.so.6
#2  0x00007ffb8eaf588b in calloc () from /lib64/libc.so.6
#3  0x00007ffb9015669e in ?? () from /usr/lib64/libglusterfs.so.0
#4  0x00007ffb90156cc8 in synctask_create () from /usr/lib64/libglusterfs.so.0
#5  0x00007ffb9015a749 in synctask_new1 () from /usr/lib64/libglusterfs.so.0
#6  0x00007ffb8258ee32 in afr_inode_refresh_done () from /usr/lib64/glusterfs/3.7.1/xlator/cluster/replicate.so
#7  0x00007ffb8258f901 in afr_inode_refresh_subvol_cbk () from /usr/lib64/glusterfs/3.7.1/xlator/cluster/replicate.so
#8  0x00007ffb827cd0c7 in client3_3_lookup_cbk () from /usr/lib64/glusterfs/3.7.1/xlator/protocol/client.so
#9  0x00007ffb8fedd445 in rpc_clnt_handle_reply () from /usr/lib64/libgfrpc.so.0
#10 0x00007ffb8fede8f2 in rpc_clnt_notify () from /usr/lib64/libgfrpc.so.0
#11 0x00007ffb8fed9ad8 in rpc_transport_notify () from /usr/lib64/libgfrpc.so.0
#12 0x00007ffb84a14255 in ?? () from /usr/lib64/glusterfs/3.7.1/rpc-transport/socket.so
#13 0x00007ffb84a15e2d in ?? () from /usr/lib64/glusterfs/3.7.1/rpc-transport/socket.so
#14 0x00007ffb901729b0 in ?? () from /usr/lib64/libglusterfs.so.0
#15 0x00007ffb8f1f9a51 in start_thread () from /lib64/libpthread.so.0
#16 0x00007ffb8eb6396d in clone () from /lib64/libc.so.6
(gdb)

Comment 5 Prasad Desala 2016-07-21 06:30:03 UTC
This issue is not seen in gluster version 3.7.9-10.

Here are the steps that were followed:

1. Created a 2x2 distributed-replicate volume.
2. Mounted the volume using the NFS protocol.
3. Started untarring a kernel package on the NFS mount.
4. Started a lookup loop on the NFS mount using find <file_name>.
5. Added new bricks to the volume continuously, one after the other (say brick-1, brick-2, brick-3).
6. Brought down one of the bricks from a replica pair by killing its process.
7. Started rebalance with the force option.
8. Rebalance ran and no issues were seen with the continuous lookup and untarring.
Rebalance completed successfully.

Followed the same steps with the volume mounted over the FUSE protocol; no issue was seen there either.
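The verification steps above can be sketched as a shell script. This is only an illustrative outline: the hostname, volume name, brick paths, tarball path, and mount point are hypothetical placeholders, not values taken from this report, and the commands must be run against a real gluster cluster.

```shell
#!/bin/bash
# Sketch of the reproduction/verification steps; all names below are
# placeholders (HOST, VOL, brick paths, tarball) chosen for illustration.
set -x

HOST=server1
VOL=testvol

# 1. Create and start a 2x2 distributed-replicate volume.
gluster volume create "$VOL" replica 2 \
    "$HOST:/bricks/b0" "$HOST:/bricks/b1" \
    "$HOST:/bricks/b2" "$HOST:/bricks/b3" force
gluster volume start "$VOL"

# 2. Mount the volume over NFS on a client.
mkdir -p /mnt/nfs
mount -t nfs -o vers=3 "$HOST:/$VOL" /mnt/nfs

# 3-4. Kick off the untar and an endless lookup loop in the background.
( cd /mnt/nfs && tar xf /tmp/linux-4.1-rc7.tar.xz ) &
( while true; do find /mnt/nfs -name Makefile >/dev/null 2>&1; done ) &

# 5. Add new replica pairs one after the other.
gluster volume add-brick "$VOL" replica 2 "$HOST:/bricks/b4" "$HOST:/bricks/b5"
gluster volume add-brick "$VOL" replica 2 "$HOST:/bricks/b6" "$HOST:/bricks/b7"

# 6. Bring down one brick of a replica pair by killing its process
#    (take the PID from the `gluster volume status` output).
gluster volume status "$VOL"
# kill -9 <pid-of-one-brick>

# 7-8. Force a rebalance and watch its status while the IO continues;
#      the untar and the lookup loop should run without hanging.
gluster volume rebalance "$VOL" start force
gluster volume rebalance "$VOL" status
```

The `kill` in step 6 is left as a comment because the brick PID can only be read from the live `gluster volume status` output at run time.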