Bug 1334860 - [Tiering-rebalance]: Rebalance triggered due to detach tier seems to have hung on a tiered vol
Summary: [Tiering-rebalance]: Rebalance triggered due to detach tier seems to have hung on a tiered vol
Keywords:
Status: CLOSED DUPLICATE of bug 1330997
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: tier
Version: rhgs-3.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Bug Updates Notification Mailing List
QA Contact: Nag Pavan Chilakam
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-05-10 16:30 UTC by krishnaram Karthick
Modified: 2016-09-17 15:45 UTC (History)
5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-12 08:30:52 UTC
Embargoed:


Attachments

Description krishnaram Karthick 2016-05-10 16:30:55 UTC
Description of problem:
The rebalance triggered on a tiered volume by a detach-tier operation seems to have hung on all nodes.

Volume Name: superman
Type: Tier
Volume ID: afa1866e-2d0b-424f-8e22-a782f9068b25
Status: Started
Number of Bricks: 20
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 4 x 2 = 8
Brick1: 10.70.35.133:/bricks/brick7/sm1
Brick2: 10.70.35.10:/bricks/brick7/sm1
Brick3: 10.70.35.11:/bricks/brick7/sm1
Brick4: 10.70.35.225:/bricks/brick7/sm1
Brick5: 10.70.35.239:/bricks/brick7/sm1
Brick6: 10.70.37.60:/bricks/brick7/sm1
Brick7: 10.70.37.120:/bricks/brick7/sm1
Brick8: 10.70.37.101:/bricks/brick7/sm1
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (4 + 2) = 12
Brick9: 10.70.37.101:/bricks/brick0/l1
Brick10: 10.70.37.120:/bricks/brick0/l1
Brick11: 10.70.37.60:/bricks/brick0/l1
Brick12: 10.70.35.239:/bricks/brick0/l1
Brick13: 10.70.35.225:/bricks/brick0/l1
Brick14: 10.70.35.11:/bricks/brick0/l1
Brick15: 10.70.35.10:/bricks/brick0/l1
Brick16: 10.70.35.133:/bricks/brick0/l1
Brick17: 10.70.37.101:/bricks/brick1/l1
Brick18: 10.70.37.120:/bricks/brick1/l1
Brick19: 10.70.37.60:/bricks/brick1/l1
Brick20: 10.70.35.239:/bricks/brick1/l1
Options Reconfigured:
cluster.tier-mode: cache
features.ctr-enabled: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on
cluster.enable-shared-storage: enable
nfs-ganesha: disable


Version-Release number of selected component (if applicable):
glusterfs-server-3.7.9-4.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
1. Create a distributed-disperse volume (this becomes the cold tier)
2. Create several files and directories from an NFS mount:
 - mkdir -p A{1..1000}/B{1..20}/
 - for i in {1..10000}; do dd if=/dev/urandom of=file-$i bs=1M count=10; done
 - untar linux kernel package
3. Halfway through step 2, attach a hot tier
4. Enable quota and set limits
5. Allow all operations triggered in step 2 to complete
6. Continuously append to 100 of the files created in step 2:
while true; do for i in {1..100}; do echo "ee" >> file-$i; done; done
7. Start detach tier and wait for completion (a command sketch follows this list)
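
A minimal sketch of the CLI sequence behind the steps above, assuming the volume name "superman" from this report and placeholder brick lists; exact syntax can vary between glusterfs 3.7.x builds:

# 1. Create the distributed-disperse (cold tier) volume -- brick paths are placeholders
gluster volume create superman disperse-data 4 redundancy 2 \
  <host1>:/bricks/brick0/l1 <host2>:/bricks/brick0/l1 ...
gluster volume start superman

# 3. Attach a replicated hot tier while the client I/O from step 2 is running
gluster volume attach-tier superman replica 2 \
  <host1>:/bricks/brick7/sm1 <host2>:/bricks/brick7/sm1 ...

# 4. Enable quota and set a limit (the limit value here is illustrative)
gluster volume quota superman enable
gluster volume quota superman limit-usage / 1TB

# 7. Start the detach, which triggers the tier rebalance on every node
gluster volume detach-tier superman start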

Actual results:
Detach tier seems to have hung on all nodes. No log updates are seen.

tail -2 /var/log/glusterfs/superman-rebalance.log
[2016-05-10 13:50:58.518210] I [dht-rebalance.c:2728:gf_defrag_process_dir] 0-superman-tier-dht: Migration operation on dir /A756 took 0.01 secs
[2016-05-10 13:50:58.528485] I [MSGID: 109063] [dht-layout.c:718:dht_layout_normalize] 0-superman-tier-dht: Found anomalies in /A756/B1 (gfid = 15f9d92f-655e-4f7a-8906-cb34013bb20a). Holes=1 overlaps=0
[root@dhcp37-101 ~]# date -u
Tue May 10 16:28:26 UTC 2016
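
For reference, status checks of this kind were used while the operation appeared hung (a sketch assuming glusterfs 3.7.x command syntax):

# Per-node progress of the detach-tier driven rebalance
gluster volume detach-tier superman status

# The underlying brick/daemon state can also be cross-checked
gluster volume status superman
ps -ef | grep -i rebalance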

Expected results:
The detach-tier operation should complete.

Additional info:
Sosreports and statedumps from all nodes will be attached.

Comment 4 Mohammed Rafi KC 2016-05-11 12:25:40 UTC
Rebalance is hung because a lock is held on 10.70.35.10 for brick /bricks/brick0/l1 in the "dht.layout.heal" domain. That lock is in turn waiting on a lock held by the quota enforcer on node 10.70.35.239 in the disperse domain. Finally, those locks are waiting on a lock held by the NFS process on 10.70.35.239 for client-7. This is a stale disperse lock.
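
The lock chain above can be traced from brick statedumps; a rough sketch of capturing and grepping them, assuming the default dump directory /var/run/gluster:

# Dump brick state for the volume (run on each node)
gluster volume statedump superman

# List granted/blocked inode locks and their lock domains
grep -E 'lock-dump.domain.domain|inodelk.*(ACTIVE|BLOCKED)' /var/run/gluster/*.dump.*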


[2016-05-10 09:00:31.975421] W [MSGID: 122033] [ec-common.c:1438:ec_locked] 0-superman-disperse-1: Failed to complete preop lock [Transport endpoint is not connected]

At the time (08:59:26) the lock was granted to the NFS server from client-7, all other client lock requests failed with ENOTCONN. The assumption here is that the lock failure on the majority of clients resulted in the request frame being unwound without unlocking the lock that had succeeded.
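
If the stale disperse lock needs to be released without restarting the NFS process, gluster's clear-locks interface is one possible workaround (not verified for this bug); the path below is only illustrative:

# Clear granted inode locks on a specific path (use with care)
gluster volume clear-locks superman /A756/B1 kind granted inode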

Comment 5 Pranith Kumar K 2016-05-12 07:00:36 UTC
This issue seems to be similar to bug 1330997. Please feel free to mark this bug as a duplicate of that one.

Pranith

Comment 6 Nithya Balachandran 2016-05-12 08:30:52 UTC

*** This bug has been marked as a duplicate of bug 1330997 ***

