Bug 1305205

Summary: NFS mount hangs on a tiered volume
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Bhaskarakiran <byarlaga>
Component: gluster-nfs
Assignee: Jiffin <jthottan>
Status: CLOSED CURRENTRELEASE
QA Contact: Manisha Saini <msaini>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.1
CC: nchilaka, ndevos, rcyriac, rhinduja, rhs-bugs, sanandpa, sankarshan, skoduri, storage-qa-internal
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-11-19 04:01:58 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1306930, 1311002, 1333645, 1347524    
Bug Blocks:    

Description Bhaskarakiran 2016-02-06 05:11:50 UTC
Description of problem:
=======================

On a 16-node setup, with 3 clients running IO (dd, linux untar and mkdir, one workload from each client), one of the clients hits the throttling issue within a day. The outstanding RPC request limit for gluster-NFS is 16 by default.
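
A minimal sketch of how that limit is typically inspected and adjusted through the gluster CLI (volume name ec_tier is taken from the output below; the values are illustrative):

# Reconfigured options, including the NFS throttle limit, show up in volume info
gluster volume info ec_tier

# Set the gluster-NFS outstanding-RPC limit explicitly (16 is the value on this setup)
gluster volume set ec_tier nfs.outstanding-rpc-limit 16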

volume info :

[root@rhs-client17 ~]# gluster v info ec_tier
 
Volume Name: ec_tier
Type: Tier
Volume ID: 84855431-e6cf-41e9-9cfc-7a735f2685ed
Status: Started
Number of Bricks: 44
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 4 x 2 = 8
Brick1: 10.70.35.77:/rhs/brick4/ec-ht8
Brick2: 10.70.35.191:/rhs/brick4/ec-ht7
Brick3: 10.70.35.202:/rhs/brick4/ec-ht6
Brick4: 10.70.35.49:/rhs/brick4/ec-ht5
Brick5: 10.70.36.41:/rhs/brick4/ec-ht4
Brick6: 10.70.35.196:/rhs/brick4/ec-ht3
Brick7: 10.70.35.38:/rhs/brick4/ec-ht2
Brick8: dhcp35-153.lab.eng.blr.redhat.com:/rhs/brick4/ec-ht1
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 3 x (8 + 4) = 36
Brick9: dhcp35-153.lab.eng.blr.redhat.com:/rhs/brick1/ec
Brick10: 10.70.35.38:/rhs/brick1/ec
Brick11: 10.70.35.196:/rhs/brick1/ec
Brick12: 10.70.36.41:/rhs/brick1/ec
Brick13: 10.70.35.49:/rhs/brick1/ec
Brick14: 10.70.35.202:/rhs/brick1/ec
Brick15: 10.70.35.191:/rhs/brick1/ec
Brick16: 10.70.35.77:/rhs/brick1/ec
Brick17: 10.70.35.98:/rhs/brick1/ec
Brick18: 10.70.35.132:/rhs/brick1/ec
Brick19: 10.70.35.35:/rhs/brick1/ec
Brick20: 10.70.35.51:/rhs/brick1/ec
Brick21: 10.70.35.138:/rhs/brick1/ec
Brick22: 10.70.35.122:/rhs/brick1/ec
Brick23: 10.70.36.43:/rhs/brick1/ec
Brick24: 10.70.36.42:/rhs/brick1/ec
Brick25: dhcp35-153.lab.eng.blr.redhat.com:/rhs/brick2/ec
Brick26: 10.70.35.38:/rhs/brick2/ec
Brick27: 10.70.35.196:/rhs/brick2/ec
Brick28: 10.70.36.41:/rhs/brick2/ec
Brick29: 10.70.35.49:/rhs/brick2/ec
Brick30: 10.70.35.202:/rhs/brick2/ec
Brick31: 10.70.35.191:/rhs/brick2/ec
Brick32: 10.70.35.77:/rhs/brick2/ec
Brick33: 10.70.35.98:/rhs/brick2/ec
Brick34: 10.70.35.132:/rhs/brick2/ec
Brick35: 10.70.35.35:/rhs/brick2/ec
Brick36: 10.70.35.51:/rhs/brick2/ec
Brick37: 10.70.35.138:/rhs/brick2/ec
Brick38: 10.70.35.122:/rhs/brick2/ec
Brick39: 10.70.36.43:/rhs/brick2/ec
Brick40: 10.70.36.42:/rhs/brick2/ec
Brick41: dhcp35-153.lab.eng.blr.redhat.com:/rhs/brick3/ec
Brick42: 10.70.35.38:/rhs/brick3/ec
Brick43: 10.70.35.196:/rhs/brick3/ec
Brick44: 10.70.36.41:/rhs/brick3/ec
Options Reconfigured:
performance.readdir-ahead: on
features.uss: on
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
features.ctr-enabled: on
cluster.tier-mode: cache
features.barrier: disable
features.bitrot: off
features.scrub: Inactive
nfs.outstanding-rpc-limit: 16
features.scrub-freq: hourly
[root@rhs-client17 ~]# 



Version-Release number of selected component (if applicable):
============================================================
3.7.5-18

How reproducible:
=================
100%

Steps to Reproduce:
1. Create an EC (8+4 disperse) volume and attach a distributed-replicate tier to it.
2. NFS-mount the volume on a client and run IO (dd and linux untar); see the sketch after this list.
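
A hedged sketch of these steps, assuming placeholder server names (server1..server16), placeholder brick paths, and /mnt/ec_tier as the mount point; the real setup uses the 44-brick layout shown above:

# 1. Create a disperse (8+4) volume -- brick list shortened to a single disperse set for illustration
gluster volume create ec_tier disperse-data 8 redundancy 4 server{1..12}:/rhs/brick1/ec
gluster volume start ec_tier

# 2. Attach a distributed-replicate hot tier (2-way replica across 8 bricks)
gluster volume attach-tier ec_tier replica 2 server{1..8}:/rhs/brick4/ec-ht

# 3. NFS-mount on a client and drive IO (gluster-NFS serves NFSv3)
mount -t nfs -o vers=3 server1:/ec_tier /mnt/ec_tier
dd if=/dev/zero of=/mnt/ec_tier/ddfile bs=1M count=10240 &
tar -xf linux-kernel.tar.xz -C /mnt/ec_tier &    # any kernel tarball; name is a placeholder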


Actual results:
==============
The NFS client mount hangs.

Expected results:
================
IO should not hang

Additional info:

Comment 2 Soumya Koduri 2016-02-10 07:46:19 UTC
This issue does not seem to be directly related to throttling, but it most likely arises only when throttling is enabled. I have requested QE to confirm this by reproducing the issue with the throttling limit set to zero.
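
A sketch of the requested configuration, assuming that a value of 0 effectively disables the outstanding-RPC throttle (volume name ec_tier from the description):

# Turn the gluster-NFS RPC throttle off for the test run
gluster volume set ec_tier nfs.outstanding-rpc-limit 0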

Comment 3 Bhaskarakiran 2016-02-10 10:25:59 UTC
I tried two scenarios with the throttling limit set to 0.

1. Serial IO from 2 clients (dd and linux untar)

The memory of the NFS process shot up to 7GB (on a 16GB node) within 30 minutes and stays there, even though there are not many requests from the clients. I am seeing 'NFS server not responding, still trying' and 'OK' messages.

2. Parallel IO from 1 client (dd and linux untar)

The memory consumption went up to 2GB and stays constant there.

No crash was seen in either case, but one is likely to be hit if IO keeps flowing smoothly from the clients.
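
For context, a minimal sketch of how the gluster-NFS process memory can be observed on the server; the pid-file path is the usual glusterd location and is an assumption here:

# Resident/virtual memory of the gluster-NFS process (pid-file path assumed)
NFS_PID=$(cat /var/lib/glusterd/nfs/run/nfs.pid)
ps -o pid,rss,vsz,cmd -p "$NFS_PID"

# Alternatively, locate it by its volfile id
ps aux | grep '[v]olfile-id gluster/nfs'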

Comment 4 Soumya Koduri 2016-02-10 16:54:43 UTC
This confirms that the issue is not related to throttling. I will take a packet trace and a statedump of the processes and update further.
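
A sketch of that data collection, assuming standard tools; the port, volume name and output path are from this setup or are gluster defaults:

# Packet trace of NFS traffic on the server (gluster-NFS serves NFSv3 over port 2049)
tcpdump -i any -s 0 -w /tmp/gnfs.pcap port 2049 &

# Statedump of the gluster-NFS process for the volume (dumps land under /var/run/gluster by default)
gluster volume statedump ec_tier nfs
ls /var/run/gluster/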

Comment 9 Soumya Koduri 2016-03-17 11:06:14 UTC
The workaround for this issue is to restart the gluster-NFS server.
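
A sketch of that workaround, assuming the usual way to respawn gluster-NFS is to kill the NFS process and force-start an already started volume so that glusterd brings the NFS server back up; the pid-file path is an assumption:

# Stop the running gluster-NFS server process
kill "$(cat /var/lib/glusterd/nfs/run/nfs.pid)"

# Force-starting a started volume makes glusterd respawn the NFS server
gluster volume start ec_tier force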