Bug 1305205 - NFS mount hangs on a tiered volume [NEEDINFO]
Status: ON_QA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: gluster-nfs
Version: 3.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assigned To: Niels de Vos
QA Contact: Manisha Saini
Keywords: ZStream
Depends On: 1306930 1311002 1333645 1347524
Blocks:
 
Reported: 2016-02-06 00:11 EST by Bhaskarakiran
Modified: 2017-10-11 03:37 EDT
CC: 9 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
ndevos: needinfo? (rhinduja)


Attachments: None
Description Bhaskarakiran 2016-02-06 00:11:50 EST
Description of problem:
=======================

On a 16-node setup, using 3 clients to run IO (dd, linux untar, and mkdir, one from each client), within a day one of the clients hits the throttling issue. The RPC outstanding-requests limit is 16 by default.

Volume info:

[root@rhs-client17 ~]# gluster v info ec_tier
 
Volume Name: ec_tier
Type: Tier
Volume ID: 84855431-e6cf-41e9-9cfc-7a735f2685ed
Status: Started
Number of Bricks: 44
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 4 x 2 = 8
Brick1: 10.70.35.77:/rhs/brick4/ec-ht8
Brick2: 10.70.35.191:/rhs/brick4/ec-ht7
Brick3: 10.70.35.202:/rhs/brick4/ec-ht6
Brick4: 10.70.35.49:/rhs/brick4/ec-ht5
Brick5: 10.70.36.41:/rhs/brick4/ec-ht4
Brick6: 10.70.35.196:/rhs/brick4/ec-ht3
Brick7: 10.70.35.38:/rhs/brick4/ec-ht2
Brick8: dhcp35-153.lab.eng.blr.redhat.com:/rhs/brick4/ec-ht1
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 3 x (8 + 4) = 36
Brick9: dhcp35-153.lab.eng.blr.redhat.com:/rhs/brick1/ec
Brick10: 10.70.35.38:/rhs/brick1/ec
Brick11: 10.70.35.196:/rhs/brick1/ec
Brick12: 10.70.36.41:/rhs/brick1/ec
Brick13: 10.70.35.49:/rhs/brick1/ec
Brick14: 10.70.35.202:/rhs/brick1/ec
Brick15: 10.70.35.191:/rhs/brick1/ec
Brick16: 10.70.35.77:/rhs/brick1/ec
Brick17: 10.70.35.98:/rhs/brick1/ec
Brick18: 10.70.35.132:/rhs/brick1/ec
Brick19: 10.70.35.35:/rhs/brick1/ec
Brick20: 10.70.35.51:/rhs/brick1/ec
Brick21: 10.70.35.138:/rhs/brick1/ec
Brick22: 10.70.35.122:/rhs/brick1/ec
Brick23: 10.70.36.43:/rhs/brick1/ec
Brick24: 10.70.36.42:/rhs/brick1/ec
Brick25: dhcp35-153.lab.eng.blr.redhat.com:/rhs/brick2/ec
Brick26: 10.70.35.38:/rhs/brick2/ec
Brick27: 10.70.35.196:/rhs/brick2/ec
Brick28: 10.70.36.41:/rhs/brick2/ec
Brick29: 10.70.35.49:/rhs/brick2/ec
Brick30: 10.70.35.202:/rhs/brick2/ec
Brick31: 10.70.35.191:/rhs/brick2/ec
Brick32: 10.70.35.77:/rhs/brick2/ec
Brick33: 10.70.35.98:/rhs/brick2/ec
Brick34: 10.70.35.132:/rhs/brick2/ec
Brick35: 10.70.35.35:/rhs/brick2/ec
Brick36: 10.70.35.51:/rhs/brick2/ec
Brick37: 10.70.35.138:/rhs/brick2/ec
Brick38: 10.70.35.122:/rhs/brick2/ec
Brick39: 10.70.36.43:/rhs/brick2/ec
Brick40: 10.70.36.42:/rhs/brick2/ec
Brick41: dhcp35-153.lab.eng.blr.redhat.com:/rhs/brick3/ec
Brick42: 10.70.35.38:/rhs/brick3/ec
Brick43: 10.70.35.196:/rhs/brick3/ec
Brick44: 10.70.36.41:/rhs/brick3/ec
Options Reconfigured:
performance.readdir-ahead: on
features.uss: on
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
features.ctr-enabled: on
cluster.tier-mode: cache
features.barrier: disable
features.bitrot: off
features.scrub: Inactive
nfs.outstanding-rpc-limit: 16
features.scrub-freq: hourly
[root@rhs-client17 ~]# 



Version-Release number of selected component (if applicable):
============================================================
3.7.5-18

How reproducible:
=================
100%

Steps to Reproduce:
1. Create an EC volume (8+4) and attach a distributed-replicate hot tier.
2. NFS-mount the volume on the client and run the IO (dd and linux untar); see the command sketch below.
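
A minimal command sketch for these steps, with hypothetical hostnames and brick paths (the actual run used the 44 bricks listed in the volume info above); the attach-tier syntax assumed here is the one from the glusterfs 3.7 line:

# cold tier: 8+4 dispersed volume across hypothetical bricks
gluster volume create ec_tier disperse 12 redundancy 4 server{1..12}:/rhs/brick1/ec force
gluster volume start ec_tier
# hot tier: attach a distributed-replicate (replica 2) tier
gluster volume attach-tier ec_tier replica 2 server{1..8}:/rhs/brick4/ec-ht
# on the client: gluster-NFS exports NFSv3
mount -t nfs -o vers=3 server1:/ec_tier /mnt/ec_tier
# IO workload (one per client in the original setup)
dd if=/dev/zero of=/mnt/ec_tier/ddfile bs=1M count=10240
tar xf linux.tar.xz -C /mnt/ec_tier/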


Actual results:
==============
The client mount hangs.

Expected results:
================
IO should not hang

Additional info:
Comment 2 Soumya Koduri 2016-02-10 02:46:19 EST
This issue doesn't seem to be directly related to throttling, but it most likely arises only when throttling is enabled. I have requested QE to confirm this by reproducing the issue with the throttling limit set to 'zero'.
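
For reference, a minimal sketch of dropping the throttling limit to 'zero' on this volume (volume name taken from the description above):

gluster volume set ec_tier nfs.outstanding-rpc-limit 0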
Comment 3 Bhaskarakiran 2016-02-10 05:25:59 EST
I tried 2 scenarios setting throttling to 0.

1. Serial IO from 2 clients (dd and linux untar)

The memory of the NFS process shot up to 7GB (on a 16GB node) within 30 minutes and stays there, even though there are not many requests from the clients. I am seeing 'NFS server not responding, still trying' and 'OK' messages.

2. Parallel IO from 1 client (dd and linux untar)

The memory consumption went up to 2GB and stays constant there.

No crash was seen in either case, but one is likely to be hit if the clients keep the IO flowing.
Comment 4 Soumya Koduri 2016-02-10 11:54:43 EST
This confirms that it is not related to the throttling issue. I will take a packet trace and a statedump of the processes and update further.
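
A sketch of how that data could be collected, assuming the default statedump directory; the capture filter is illustrative:

# statedump of the gluster-NFS process for this volume (written under /var/run/gluster by default)
gluster volume statedump ec_tier nfs
# packet trace of NFS traffic on the server
tcpdump -i any -s 0 -w /tmp/gnfs.pcap port 2049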
Comment 9 Soumya Koduri 2016-03-17 07:06:14 EDT
The work-around for this issue is to restart the gluster-NFS server.
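
One possible restart sequence, assuming the default gluster-NFS pid-file location (the path is an assumption and may differ per build):

# stop the gluster-NFS daemon
kill $(cat /var/lib/glusterd/nfs/run/nfs.pid)
# 'start force' respawns any missing daemons for the volume, including gluster-NFS
gluster volume start ec_tier force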
