Description of problem:
find commands hang on the client for more than 12 hours when new writes are running in parallel.

Version-Release number of selected component (if applicable):
glusterfs-ganesha-3.12.2-11.el7rhgs.x86_64
nfs-ganesha-2.5.5-7.el7rhgs.x86_64
nfs-ganesha-gluster-2.5.5-7.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.5.5-7.el7rhgs.x86_64

How reproducible:
2/2

Steps to Reproduce:
1. Create a 6-node ganesha cluster.
2. Create a 6*(4+2) Distributed-Disperse volume and export it via ganesha.
3. Mount the volume on 4 clients using 4 different VIPs.
4. Clients 1, 2, and 3: run a dd command in a loop.
5. Client 4: run find in a loop (while true; do find . -mindepth 1 -type f; done).

Actual results:
After roughly 2 hours, find hung on client 4 for more than 12 hours while new writes were running in parallel.

Expected results:
find should not hang while new writes are running in parallel.

Additional info:
Attaching gstack output, tcpdumps, and sosreports shortly.
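A bounded sketch of the workload in the steps above. The paths here are hypothetical stand-ins for the ganesha NFS mount, and each loop runs once instead of indefinitely; the original reproducer runs the dd and find loops forever on separate clients.

```shell
# MNT is a hypothetical local directory standing in for the NFS mount point.
MNT=/tmp/repro_mnt
mkdir -p "$MNT"

# Clients 1-3 in the original each run dd in an endless loop;
# a single small write per "client" is shown here.
for i in 1 2 3; do
    dd if=/dev/zero of="$MNT/client$i.dat" bs=1K count=1 2>/dev/null
done

# Client 4 runs find in an endless loop; one pass is shown here.
find "$MNT" -mindepth 1 -type f
```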
Based on the packet capture, I believe this is not a Ganesha issue. The NFS traffic in that file consists almost entirely of READDIR calls and READDIR replies (with the odd RENEW). The average time between a READDIR and its REPLY is 0.0003 seconds (!), with occasional delays as long as 0.001 seconds. However, there are many delays of up to 2 seconds between a REPLY and the next READDIR, which is the client's fault, not Ganesha's. Something on the client is causing huge delays. This is likely to be Gluster traffic: there are only 915 NFS packets in the trace versus 1.2 million Gluster packets (~12k/second), so the network is spending most of its time on Gluster traffic.
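The distinction drawn above (fast server replies vs. long client-side gaps before the next request) can be checked mechanically. A minimal sketch, assuming the READDIR call/reply timestamps have been exported from the capture into a two-column text file (for example via tshark -T fields); the filename and the sample values are made up for illustration:

```shell
# Each line: <call_timestamp> <reply_timestamp> for one READDIR round trip.
# These sample numbers are invented to mirror the pattern described above.
cat > /tmp/readdir_times.txt <<'EOF'
0.000 0.0003
0.0004 0.0007
2.0007 2.0010
EOF

# Mean server latency (call -> reply) and the largest client-side gap
# (previous reply -> next call). A large gap with a tiny mean latency
# points at the client, not at Ganesha.
awk '{ lat += $2 - $1; n++ }
     prev != "" && $1 - prev > maxgap { maxgap = $1 - prev }
     { prev = $2 }
     END { printf "mean_latency=%.4f max_client_gap=%.4f\n", lat/n, maxgap }' \
    /tmp/readdir_times.txt
```

With the sample data this prints a mean latency of 0.0003s but a 2-second gap between a reply and the next request, matching the pattern seen in the real trace.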
(For the record, you can do either of the 2 workarounds, but don't need to do both.)