Description of problem:
Rebalance triggered on a tiered volume by a detach tier operation seems to have hung on all nodes.

Volume Name: superman
Type: Tier
Volume ID: afa1866e-2d0b-424f-8e22-a782f9068b25
Status: Started
Number of Bricks: 20
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 4 x 2 = 8
Brick1: 10.70.35.133:/bricks/brick7/sm1
Brick2: 10.70.35.10:/bricks/brick7/sm1
Brick3: 10.70.35.11:/bricks/brick7/sm1
Brick4: 10.70.35.225:/bricks/brick7/sm1
Brick5: 10.70.35.239:/bricks/brick7/sm1
Brick6: 10.70.37.60:/bricks/brick7/sm1
Brick7: 10.70.37.120:/bricks/brick7/sm1
Brick8: 10.70.37.101:/bricks/brick7/sm1
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (4 + 2) = 12
Brick9: 10.70.37.101:/bricks/brick0/l1
Brick10: 10.70.37.120:/bricks/brick0/l1
Brick11: 10.70.37.60:/bricks/brick0/l1
Brick12: 10.70.35.239:/bricks/brick0/l1
Brick13: 10.70.35.225:/bricks/brick0/l1
Brick14: 10.70.35.11:/bricks/brick0/l1
Brick15: 10.70.35.10:/bricks/brick0/l1
Brick16: 10.70.35.133:/bricks/brick0/l1
Brick17: 10.70.37.101:/bricks/brick1/l1
Brick18: 10.70.37.120:/bricks/brick1/l1
Brick19: 10.70.37.60:/bricks/brick1/l1
Brick20: 10.70.35.239:/bricks/brick1/l1
Options Reconfigured:
cluster.tier-mode: cache
features.ctr-enabled: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on
cluster.enable-shared-storage: enable
nfs-ganesha: disable

Version-Release number of selected component (if applicable):
glusterfs-server-3.7.9-4.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
1. Create a distributed-disperse volume (this later becomes the cold tier of the tiered volume)
2. Create several files and directories from an NFS mount:
   - mkdir -p A{1..1000}/B{1..20}/
   - for i in {1..10000}; do dd if=/dev/urandom of=file-$i bs=1M count=10; done
   - untar a linux kernel package
3. Halfway through step 2, attach the hot tier
4. Enable quota and set limits
5. Allow all operations triggered in step 2 to complete
6. Continuously append to 100 of the files created in step 2:
   while true; do for i in {1..100}; do echo "ee" >> file-$i; done; done
7. Start detach tier and wait for completion (see the CLI sketch below, after Additional info)

Actual results:
Detach tier seems to have hung on all nodes. No log updates are seen.

tail -2 /var/log/glusterfs/superman-rebalance.log
[2016-05-10 13:50:58.518210] I [dht-rebalance.c:2728:gf_defrag_process_dir] 0-superman-tier-dht: Migration operation on dir /A756 took 0.01 secs
[2016-05-10 13:50:58.528485] I [MSGID: 109063] [dht-layout.c:718:dht_layout_normalize] 0-superman-tier-dht: Found anomalies in /A756/B1 (gfid = 15f9d92f-655e-4f7a-8906-cb34013bb20a). Holes=1 overlaps=0

[root@dhcp37-101 ~]# date -u
Tue May 10 16:28:26 UTC 2016

Expected results:
The detach tier operation should complete.

Additional info:
sosreports and statedumps shall be attached from all nodes.
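For convenience, here is a minimal CLI sketch of the setup and detach flow described in Steps to Reproduce. The volume name and brick paths are taken from the volume info above; the mount point and quota limit are illustrative placeholders, and the exact syntax should be checked against the glusterfs 3.7 CLI.

# Create the distributed-disperse volume that later becomes the cold tier
gluster volume create superman disperse-data 4 redundancy 2 \
  10.70.37.101:/bricks/brick0/l1 10.70.37.120:/bricks/brick0/l1 \
  10.70.37.60:/bricks/brick0/l1  10.70.35.239:/bricks/brick0/l1 \
  10.70.35.225:/bricks/brick0/l1 10.70.35.11:/bricks/brick0/l1 \
  10.70.35.10:/bricks/brick0/l1  10.70.35.133:/bricks/brick0/l1 \
  10.70.37.101:/bricks/brick1/l1 10.70.37.120:/bricks/brick1/l1 \
  10.70.37.60:/bricks/brick1/l1  10.70.35.239:/bricks/brick1/l1
gluster volume start superman

# Mount over gluster NFS (NFSv3) and run the workload from step 2 under /mnt/superman
mount -t nfs -o vers=3 10.70.37.101:/superman /mnt/superman

# While the step-2 workload is still in progress, attach the hot tier (step 3)
gluster volume attach-tier superman replica 2 \
  10.70.35.133:/bricks/brick7/sm1 10.70.35.10:/bricks/brick7/sm1 \
  10.70.35.11:/bricks/brick7/sm1  10.70.35.225:/bricks/brick7/sm1 \
  10.70.35.239:/bricks/brick7/sm1 10.70.37.60:/bricks/brick7/sm1 \
  10.70.37.120:/bricks/brick7/sm1 10.70.37.101:/bricks/brick7/sm1

# Enable quota and set limits (step 4); the limit value here is only an example
gluster volume quota superman enable
gluster volume quota superman limit-usage / 100GB

# Start the detach and poll for completion (step 7)
gluster volume detach-tier superman start
gluster volume detach-tier superman status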
Rebalance is hung because a lock is held on 10.70.35.10 for brick /bricks/brick0/l1 in the "dht.layout.heal" domain. That lock is in turn waiting on a lock held by the quota enforcer on node 10.70.35.239 in the disperse domain. Finally, those locks are waiting on a lock held by the NFS process on 10.70.35.239 for client-7, and this lock is a stale disperse lock.

[2016-05-10 09:00:31.975421] W [MSGID: 122033] [ec-common.c:1438:ec_locked] 0-superman-disperse-1: Failed to complete preop lock [Transport endpoint is not connected]

At the time (08:59:26) the lock was granted to the NFS server by client-7, the lock requests on all other clients failed with ENOTCONN. So the assumption here is that the lock failure on the majority of clients caused the request frame to be unwound without unlocking the one lock that had succeeded.
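In case it helps others debugging a similar hang, below is a rough sketch of how such a stale lock can be located from statedumps. The dump directory shown is the usual default (/var/run/gluster) and the dump file names depend on the brick path, PID and timestamp, so adjust as needed; the grep pattern is only illustrative.

# On each node, dump the state of the brick processes and of the gNFS process
gluster volume statedump superman
gluster volume statedump superman nfs

# In the brick statedump for /bricks/brick0/l1, inspect the lock sections: each
# entry lists the lock domain (e.g. dht.layout.heal or the disperse domain), the
# owner/client of the granted lock, and any waiters blocked behind it
grep -E 'lock-dump.domain|inodelk' /var/run/gluster/bricks-brick0-l1.*.dump.* | less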
This issue seems to be similar to bug 1330997. Please feel free to dup this bug to that one. Pranith
*** This bug has been marked as a duplicate of bug 1330997 ***