Bug 1334860 - [Tiering-rebalance]: Rebalance triggered due to detach tier seems to have hung on a tiered vol
Summary: [Tiering-rebalance]: Rebalance triggered due to detach tier seems to have hung on a tiered vol
Keywords:
Status: CLOSED DUPLICATE of bug 1330997
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: tier
Version: rhgs-3.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Bug Updates Notification Mailing List
QA Contact: Nag Pavan Chilakam
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-05-10 16:30 UTC by krishnaram Karthick
Modified: 2016-09-17 15:45 UTC (History)
5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-12 08:30:52 UTC
Embargoed:


Attachments

Description krishnaram Karthick 2016-05-10 16:30:55 UTC
Description of problem:
The rebalance triggered on a tiered volume by a detach-tier operation seems to have hung on all nodes.

Volume Name: superman
Type: Tier
Volume ID: afa1866e-2d0b-424f-8e22-a782f9068b25
Status: Started
Number of Bricks: 20
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 4 x 2 = 8
Brick1: 10.70.35.133:/bricks/brick7/sm1
Brick2: 10.70.35.10:/bricks/brick7/sm1
Brick3: 10.70.35.11:/bricks/brick7/sm1
Brick4: 10.70.35.225:/bricks/brick7/sm1
Brick5: 10.70.35.239:/bricks/brick7/sm1
Brick6: 10.70.37.60:/bricks/brick7/sm1
Brick7: 10.70.37.120:/bricks/brick7/sm1
Brick8: 10.70.37.101:/bricks/brick7/sm1
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (4 + 2) = 12
Brick9: 10.70.37.101:/bricks/brick0/l1
Brick10: 10.70.37.120:/bricks/brick0/l1
Brick11: 10.70.37.60:/bricks/brick0/l1
Brick12: 10.70.35.239:/bricks/brick0/l1
Brick13: 10.70.35.225:/bricks/brick0/l1
Brick14: 10.70.35.11:/bricks/brick0/l1
Brick15: 10.70.35.10:/bricks/brick0/l1
Brick16: 10.70.35.133:/bricks/brick0/l1
Brick17: 10.70.37.101:/bricks/brick1/l1
Brick18: 10.70.37.120:/bricks/brick1/l1
Brick19: 10.70.37.60:/bricks/brick1/l1
Brick20: 10.70.35.239:/bricks/brick1/l1
Options Reconfigured:
cluster.tier-mode: cache
features.ctr-enabled: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on
cluster.enable-shared-storage: enable
nfs-ganesha: disable


Version-Release number of selected component (if applicable):
glusterfs-server-3.7.9-4.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
1. Create a distributed-disperse volume (this becomes the cold tier)
2. Create several files and directories from an NFS mount:
 - mkdir -p A{1..1000}/B{1..20}/
 - for i in {1..10000}; do dd if=/dev/urandom of=file-$i bs=1M count=10; done
 - untar linux kernel package
3. Halfway through step 2, attach a hot tier
4. Enable quota and set limits
5. Allow all operations triggered in step 2 to complete
6. Continuously append to 100 of the files created in step 2:
while true; do for i in {1..100}; do echo "ee" >> file-$i; done; done
7. Start detach tier and wait for completion (a command sketch follows this list)
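
A minimal sketch of the CLI sequence behind the steps above, assuming the volume name "superman" from this report and placeholder brick lists; exact syntax can vary between glusterfs 3.7.x builds:

# 1. Create the distributed-disperse (cold tier) volume -- brick paths are placeholders
gluster volume create superman disperse-data 4 redundancy 2 \
  <host1>:/bricks/brick0/l1 <host2>:/bricks/brick0/l1 ...
gluster volume start superman

# 3. Attach a replicated hot tier while the client I/O from step 2 is running
gluster volume attach-tier superman replica 2 \
  <host1>:/bricks/brick7/sm1 <host2>:/bricks/brick7/sm1 ...

# 4. Enable quota and set a limit (the limit value here is illustrative)
gluster volume quota superman enable
gluster volume quota superman limit-usage / 1TB

# 7. Start the detach, which triggers the tier rebalance on every node
gluster volume detach-tier superman start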

Actual results:
Detach tier seems to have hung on all nodes. No log updates are seen.

tail -2 /var/log/glusterfs/superman-rebalance.log
[2016-05-10 13:50:58.518210] I [dht-rebalance.c:2728:gf_defrag_process_dir] 0-superman-tier-dht: Migration operation on dir /A756 took 0.01 secs
[2016-05-10 13:50:58.528485] I [MSGID: 109063] [dht-layout.c:718:dht_layout_normalize] 0-superman-tier-dht: Found anomalies in /A756/B1 (gfid = 15f9d92f-655e-4f7a-8906-cb34013bb20a). Holes=1 overlaps=0
[root@dhcp37-101 ~]# date -u
Tue May 10 16:28:26 UTC 2016
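
For reference, status checks of this kind were used while the operation appeared hung (a sketch assuming glusterfs 3.7.x command syntax):

# Per-node progress of the detach-tier driven rebalance
gluster volume detach-tier superman status

# The underlying brick/daemon state can also be cross-checked
gluster volume status superman
ps -ef | grep -i rebalance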

Expected results:
The detach-tier operation should complete.

Additional info:
Sosreports and statedumps from all nodes will be attached.

Comment 4 Mohammed Rafi KC 2016-05-11 12:25:40 UTC
Rebalance is hung because a lock is held on 10.70.35.10 for brick /bricks/brick0/l1 in the "dht.layout.heal" domain. That lock is in turn waiting on a lock held by the quota enforcer on node 10.70.35.239 in the disperse domain. Finally, those locks are waiting on a lock held by the NFS process on 10.70.35.239 for client-7. This is a stale disperse lock.
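
The lock chain above can be traced from brick statedumps; a rough sketch of capturing and grepping them, assuming the default dump directory /var/run/gluster:

# Dump brick state for the volume (run on each node)
gluster volume statedump superman

# List granted/blocked inode locks and their lock domains
grep -E 'lock-dump.domain.domain|inodelk.*(ACTIVE|BLOCKED)' /var/run/gluster/*.dump.*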


[2016-05-10 09:00:31.975421] W [MSGID: 122033] [ec-common.c:1438:ec_locked] 0-superman-disperse-1: Failed to complete preop lock [Transport endpoint is not connected]

At the time (08:59:26) the lock was granted to the NFS server from client-7, all other client lock requests failed with ENOTCONN. The assumption here is that the lock failure on the majority of clients resulted in the request frame being unwound without unlocking the lock that had succeeded.
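
If the stale disperse lock needs to be released without restarting the NFS process, gluster's clear-locks interface is one possible workaround (not verified for this bug); the path below is only illustrative:

# Clear granted inode locks on a specific path (use with care)
gluster volume clear-locks superman /A756/B1 kind granted inode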

Comment 5 Pranith Kumar K 2016-05-12 07:00:36 UTC
This issue seems to be similar to bug 1330997. Please feel free to mark this bug as a duplicate of that one.

Pranith

Comment 6 Nithya Balachandran 2016-05-12 08:30:52 UTC

*** This bug has been marked as a duplicate of bug 1330997 ***

