1305205 – NFS mount hangs on a tiered volume

Bug 1305205 - NFS mount hangs on a tiered volume

Summary: NFS mount hangs on a tiered volume

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	gluster-nfs
Sub Component:
Version:	rhgs-3.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Jiffin
QA Contact:	Manisha Saini
Docs Contact:
URL:
Whiteboard:
Depends On:	1306930 1311002 1333645 1347524
Blocks:
TreeView+	depends on / blocked

Reported:	2016-02-06 05:11 UTC by Bhaskarakiran
Modified:	2018-11-19 04:01 UTC (History)
CC List:	9 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-11-19 04:01:58 UTC
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Bhaskarakiran 2016-02-06 05:11:50 UTC

Description of problem:
=======================

on a 16 node setup, using 3 clients to run IO (dd, linux untar and mkdir, one from each client) and within a day one of the client is hitting throttling issue. rpc outstanding requests by default is 16. 

volume info :

[root@rhs-client17 ~]# gluster v info ec_tier
 
Volume Name: ec_tier
Type: Tier
Volume ID: 84855431-e6cf-41e9-9cfc-7a735f2685ed
Status: Started
Number of Bricks: 44
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 4 x 2 = 8
Brick1: 10.70.35.77:/rhs/brick4/ec-ht8
Brick2: 10.70.35.191:/rhs/brick4/ec-ht7
Brick3: 10.70.35.202:/rhs/brick4/ec-ht6
Brick4: 10.70.35.49:/rhs/brick4/ec-ht5
Brick5: 10.70.36.41:/rhs/brick4/ec-ht4
Brick6: 10.70.35.196:/rhs/brick4/ec-ht3
Brick7: 10.70.35.38:/rhs/brick4/ec-ht2
Brick8: dhcp35-153.lab.eng.blr.redhat.com:/rhs/brick4/ec-ht1
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 3 x (8 + 4) = 36
Brick9: dhcp35-153.lab.eng.blr.redhat.com:/rhs/brick1/ec
Brick10: 10.70.35.38:/rhs/brick1/ec
Brick11: 10.70.35.196:/rhs/brick1/ec
Brick12: 10.70.36.41:/rhs/brick1/ec
Brick13: 10.70.35.49:/rhs/brick1/ec
Brick14: 10.70.35.202:/rhs/brick1/ec
Brick15: 10.70.35.191:/rhs/brick1/ec
Brick16: 10.70.35.77:/rhs/brick1/ec
Brick17: 10.70.35.98:/rhs/brick1/ec
Brick18: 10.70.35.132:/rhs/brick1/ec
Brick19: 10.70.35.35:/rhs/brick1/ec
Brick20: 10.70.35.51:/rhs/brick1/ec
Brick21: 10.70.35.138:/rhs/brick1/ec
Brick22: 10.70.35.122:/rhs/brick1/ec
Brick23: 10.70.36.43:/rhs/brick1/ec
Brick24: 10.70.36.42:/rhs/brick1/ec
Brick25: dhcp35-153.lab.eng.blr.redhat.com:/rhs/brick2/ec
Brick26: 10.70.35.38:/rhs/brick2/ec
Brick27: 10.70.35.196:/rhs/brick2/ec
Brick28: 10.70.36.41:/rhs/brick2/ec
Brick29: 10.70.35.49:/rhs/brick2/ec
Brick30: 10.70.35.202:/rhs/brick2/ec
Brick31: 10.70.35.191:/rhs/brick2/ec
Brick32: 10.70.35.77:/rhs/brick2/ec
Brick33: 10.70.35.98:/rhs/brick2/ec
Brick34: 10.70.35.132:/rhs/brick2/ec
Brick35: 10.70.35.35:/rhs/brick2/ec
Brick36: 10.70.35.51:/rhs/brick2/ec
Brick37: 10.70.35.138:/rhs/brick2/ec
Brick38: 10.70.35.122:/rhs/brick2/ec
Brick39: 10.70.36.43:/rhs/brick2/ec
Brick40: 10.70.36.42:/rhs/brick2/ec
Brick41: dhcp35-153.lab.eng.blr.redhat.com:/rhs/brick3/ec
Brick42: 10.70.35.38:/rhs/brick3/ec
Brick43: 10.70.35.196:/rhs/brick3/ec
Brick44: 10.70.36.41:/rhs/brick3/ec
Options Reconfigured:
performance.readdir-ahead: on
features.uss: on
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
features.ctr-enabled: on
cluster.tier-mode: cache
features.barrier: disable
features.bitrot: off
features.scrub: Inactive
nfs.outstanding-rpc-limit: 16
features.scrub-freq: hourly
[root@rhs-client17 ~]# 



Version-Release number of selected component (if applicable):
============================================================
3.7.5-18

How reproducible:
=================
100%

Steps to Reproduce:
1. Create an EC volume (8+4) and attach a dist-rep tier volume
2. NFS mount on the client and run the IO.(dd and linux untar)


Actual results:
==============
client mount hang

Expected results:
================
IO should not hang

Additional info:

Comment 2 Soumya Koduri 2016-02-10 07:46:19 UTC

This issue doesn't seem to exactly relate to throttling but most likely arises only when it is enabled. Have requested QE to confirm that  by reproducing this issue with throttling limit set to 'zero'.

Comment 3 Bhaskarakiran 2016-02-10 10:25:59 UTC

I tried 2 scenarios setting throttling to 0.

1. Serial IO from 2 clients (dd and linux untar)

The memory of nfs process has shot up to 7GB (in a 16GB node) within 30 mins and stays there though there's not much requests from clients. I am seeing NFS server not responding, still trying and OK messages.

2. Parallel IO from 1 client (dd and linux untar)

The memory consumption has gone up to 2GB and is constant at it.

There's no crash seen though in both cases but likely to be hit if the IO is smooth from clients.

Comment 4 Soumya Koduri 2016-02-10 16:54:43 UTC

it confirms that it is not related to throttling issue. Will take pkt trace and statedump of the processes and update further.

Comment 9 Soumya Koduri 2016-03-17 11:06:14 UTC

The work-around for this issue is to re-start gluster-NFS server.

Note You need to log in before you can comment on or make changes to this bug.