Bug 1334860

Summary: [Tiering-rebalance]: Rebalance triggered due to detach tier seems to have hung on a tiered vol
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: tier
Version: rhgs-3.1
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Target Milestone: ---
Target Release: ---
Reporter: krishnaram Karthick <kramdoss>
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
QA Contact: Nag Pavan Chilakam <nchilaka>
Docs Contact:
CC: nbalacha, pkarampu, rhs-bugs, rkavunga, storage-qa-internal
Keywords: ZStream
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-05-12 08:30:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: ---
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description krishnaram Karthick 2016-05-10 16:30:55 UTC
Description of problem:
The rebalance triggered on a tiered volume by a detach-tier operation appears to have hung on all nodes.

Volume Name: superman
Type: Tier
Volume ID: afa1866e-2d0b-424f-8e22-a782f9068b25
Status: Started
Number of Bricks: 20
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 4 x 2 = 8
Brick1: 10.70.35.133:/bricks/brick7/sm1
Brick2: 10.70.35.10:/bricks/brick7/sm1
Brick3: 10.70.35.11:/bricks/brick7/sm1
Brick4: 10.70.35.225:/bricks/brick7/sm1
Brick5: 10.70.35.239:/bricks/brick7/sm1
Brick6: 10.70.37.60:/bricks/brick7/sm1
Brick7: 10.70.37.120:/bricks/brick7/sm1
Brick8: 10.70.37.101:/bricks/brick7/sm1
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (4 + 2) = 12
Brick9: 10.70.37.101:/bricks/brick0/l1
Brick10: 10.70.37.120:/bricks/brick0/l1
Brick11: 10.70.37.60:/bricks/brick0/l1
Brick12: 10.70.35.239:/bricks/brick0/l1
Brick13: 10.70.35.225:/bricks/brick0/l1
Brick14: 10.70.35.11:/bricks/brick0/l1
Brick15: 10.70.35.10:/bricks/brick0/l1
Brick16: 10.70.35.133:/bricks/brick0/l1
Brick17: 10.70.37.101:/bricks/brick1/l1
Brick18: 10.70.37.120:/bricks/brick1/l1
Brick19: 10.70.37.60:/bricks/brick1/l1
Brick20: 10.70.35.239:/bricks/brick1/l1
Options Reconfigured:
cluster.tier-mode: cache
features.ctr-enabled: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on
cluster.enable-shared-storage: enable
nfs-ganesha: disable


Version-Release number of selected component (if applicable):
glusterfs-server-3.7.9-4.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
1. Create a distributed-disperse volume (this becomes the cold tier of the tiered volume)
2. create several files and dirs from nfs mount
 - mkdir -p A{1..1000}/B{1..20}/
 - for i in {1..10000}; do dd if=/dev/urandom of=file-$i bs=1M count=10; done
 - untar linux kernel package
3. halfway through step 2, attach the hot tier
4. enable quota and set limits
5. Allow all operations triggered in step-2 to complete
6. continuously append to 100 files created in step-2
while true; do for i in {1..100}; do echo "ee" >> file-$i; done; done
7. start detach tier and wait for completion (see the CLI sketch after this list)
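
A rough CLI sketch of steps 1, 3, 4 and 7, assuming the glusterfs 3.7-era tiering commands; the volume name and brick paths are taken from the volume info above, the quota limit is a placeholder, and these are not necessarily the exact commands that were run:

# Step 1: create the distributed-disperse volume (becomes the cold tier)
gluster volume create superman disperse 6 redundancy 2 \
  10.70.37.101:/bricks/brick0/l1 10.70.37.120:/bricks/brick0/l1 \
  10.70.37.60:/bricks/brick0/l1 10.70.35.239:/bricks/brick0/l1 \
  10.70.35.225:/bricks/brick0/l1 10.70.35.11:/bricks/brick0/l1 \
  10.70.35.10:/bricks/brick0/l1 10.70.35.133:/bricks/brick0/l1 \
  10.70.37.101:/bricks/brick1/l1 10.70.37.120:/bricks/brick1/l1 \
  10.70.37.60:/bricks/brick1/l1 10.70.35.239:/bricks/brick1/l1
gluster volume start superman

# Step 3: attach the 4 x 2 distributed-replicate hot tier
gluster volume attach-tier superman replica 2 \
  10.70.35.133:/bricks/brick7/sm1 10.70.35.10:/bricks/brick7/sm1 \
  10.70.35.11:/bricks/brick7/sm1 10.70.35.225:/bricks/brick7/sm1 \
  10.70.35.239:/bricks/brick7/sm1 10.70.37.60:/bricks/brick7/sm1 \
  10.70.37.120:/bricks/brick7/sm1 10.70.37.101:/bricks/brick7/sm1

# Step 4: enable quota and set a limit (limit value is a placeholder)
gluster volume quota superman enable
gluster volume quota superman limit-usage / 100GB

# Step 7: start detach tier and poll until it reports completed
gluster volume detach-tier superman start
gluster volume detach-tier superman status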

Actual results:
Detach tier seems to have hung on all nodes. No log updates are seen.

tail -2 /var/log/glusterfs/superman-rebalance.log
[2016-05-10 13:50:58.518210] I [dht-rebalance.c:2728:gf_defrag_process_dir] 0-superman-tier-dht: Migration operation on dir /A756 took 0.01 secs
[2016-05-10 13:50:58.528485] I [MSGID: 109063] [dht-layout.c:718:dht_layout_normalize] 0-superman-tier-dht: Found anomalies in /A756/B1 (gfid = 15f9d92f-655e-4f7a-8906-cb34013bb20a). Holes=1 overlaps=0
[root@dhcp37-101 ~]# date -u
Tue May 10 16:28:26 UTC 2016
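
For reference, one way to confirm the hang from the CLI, assuming the 3.7 detach-tier syntax; the status counters and the rebalance log should both stop progressing while the operation stays in progress:

gluster volume detach-tier superman status
tail -f /var/log/glusterfs/superman-rebalance.log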

Expected results:
detach tier operation should complete

Additional info:
sosreports and statedumps from all nodes will be attached.
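
For reference, a sketch of how the statedumps can be generated on each node, assuming the default dump location of /var/run/gluster; the rebalance PID below is a placeholder:

# dump state of all brick processes of the volume
gluster volume statedump superman
# non-brick glusterfs processes (e.g. the rebalance or NFS process)
# dump their state on SIGUSR1
kill -USR1 <rebalance-pid>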

Comment 4 Mohammed Rafi KC 2016-05-11 12:25:40 UTC
Rebalance is hung because a lock is held on 10.70.35.10 for brick /bricks/brick0/l1 in the "dht.layout.heal" domain. That lock is in turn waiting on a lock held by the quota enforcer on node 10.70.35.239 in the disperse domain. Finally, those locks are waiting on a lock held by the NFS process on 10.70.35.239 for client-7; that lock is a stale disperse lock.


[2016-05-10 09:00:31.975421] W [MSGID: 122033] [ec-common.c:1438:ec_locked] 0-superman-disperse-1: Failed to complete preop lock [Transport endpoint is not connected]

At the time (08:59:26) the lock was granted to the NFS server from client-7, all other clients' lock requests failed with ENOTCONN. The assumption here is that the lock failure on the majority of clients caused the request frame to be unwound without unlocking the lock that had succeeded.
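
For context, the lock chain described above is the kind of information recorded in the brick statedumps; a rough way to pull it out, assuming the standard statedump layout under /var/run/gluster (the dump file name below is a placeholder):

# each inode section lists its lock domains (e.g. dht.layout.heal or the
# disperse domain) and every inodelk with its ACTIVE/BLOCKED state, pid,
# owner and client connection-id
grep -E 'lock-dump.domain.domain|inodelk.*\((ACTIVE|BLOCKED)\)' \
    /var/run/gluster/bricks-brick0-l1.*.dump.*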

Comment 5 Pranith Kumar K 2016-05-12 07:00:36 UTC
This issue seems to be similar to bug 1330997. Please feel free to dup this bug to that one.

Pranith

Comment 6 Nithya Balachandran 2016-05-12 08:30:52 UTC

*** This bug has been marked as a duplicate of bug 1330997 ***