Bug 1573089

Summary: [GSS] NFS-Ganesha server becoming unresponsive
Product: Red Hat Gluster Storage
Reporter: nravinas
Component: nfs-ganesha
Assignee: Kaleb KEITHLEY <kkeithle>
Status: CLOSED NOTABUG
QA Contact: Manisha Saini <msaini>
Severity: urgent
Docs Contact:
Priority: urgent
Version: rhgs-3.3
CC: abhishku, amukherj, dang, ffilz, jthottan, kkeithle, mbenjamin, nravinas, pasik, rhs-bugs, sankarshan, storage-qa-internal
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Flags: kkeithle: needinfo? (mbenjamin)
Hardware: All
OS: Linux
Last Closed: 2018-11-19 10:50:14 UTC
Type: Bug

Comment 17 Kaleb KEITHLEY 2018-05-17 14:39:19 UTC
These 
  dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy.  Not sending heartbeat
messages make me suspect that the box is low on available memory and is thrashing.

When this happens again, please collect the resident (RSS) and virtual (VSZ) sizes of the ganesha.nfsd process, along with the overall memory consumption on the machine.
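One possible way to capture those numbers on a Linux host is sketched below. It is only an illustration, not an official Ganesha or RHGS tool: the helper names (`parse_kb_fields`, `find_pids`, `snapshot`) are made up for this example, and it assumes the standard procfs layout (`/proc/<pid>/status` with `VmRSS` and `VmSize`, and `/proc/meminfo` with `MemAvailable`).

```python
import re
from pathlib import Path


def parse_kb_fields(text, wanted):
    """Return {field: kB} for 'Name:   123 kB' lines in procfs-style text."""
    found = {}
    for line in text.splitlines():
        match = re.match(r"(\w+):\s+(\d+)\s*kB", line)
        if match and match.group(1) in wanted:
            found[match.group(1)] = int(match.group(2))
    return found


def find_pids(name="ganesha.nfsd"):
    """Return PIDs whose /proc/<pid>/comm equals the given process name."""
    pids = []
    for comm in Path("/proc").glob("[0-9]*/comm"):
        try:
            if comm.read_text().strip() == name:
                pids.append(int(comm.parent.name))
        except OSError:
            continue  # the process exited while we were scanning
    return pids


def snapshot(name="ganesha.nfsd"):
    """Collect system-wide memory figures plus RSS/virtual size per PID."""
    system_fields = {"MemTotal", "MemAvailable", "SwapTotal", "SwapFree"}
    report = {"system": parse_kb_fields(Path("/proc/meminfo").read_text(),
                                        system_fields)}
    for pid in find_pids(name):
        status = Path(f"/proc/{pid}/status").read_text()
        report[pid] = parse_kb_fields(status, {"VmRSS", "VmSize"})
    return report
```

Running `print(snapshot())` at the moment the heartbeat warning appears (or periodically from a scheduled job) would give the RSS and virtual size in kB next to the machine-wide totals, which should be enough to judge whether the box is short on memory.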

Thanks

Comment 26 Daniel Gryniewicz 2018-06-07 13:38:59 UTC
NFSv4 runs over TCP, so jumbo frames should make no difference at all, unless one of the machines (or one of the switches between them) had a badly configured MTU.  If all the MTUs were the same (1500 is the default for non-jumbo Ethernet), then turning on jumbo frames won't make a difference, since TCP splits any data larger than the MTU across multiple frames.

The most likely scenarios are:

1) One machine has a lower MTU than the other, and when (rare) packets are sent that exceed that machine's MTU, they are silently dropped.  TCP will resend them over and over forever, and they will continue to be dropped.

2) A switch in the network has a lower MTU than the machines, and does not properly fragment large frames.  In this case, it's the switch that drops the frames, and again, TCP will resend forever.

Configuring jumbo frames would fix this indirectly: it forces whoever makes the change to go through the entire network and set all the MTUs to the same new, larger value.
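The two scenarios above can be modelled with a small sketch. It is a simplified illustration and not a diagnostic tool: the function names are hypothetical, the 40-byte overhead assumes IPv4 and TCP headers without options, and the "silent drop" behaviour assumes a hop that neither fragments nor reports the problem back to the sender.

```python
IP_TCP_HEADERS = 40  # 20-byte IPv4 header + 20-byte TCP header, no options


def path_mtu(hop_mtus):
    """The usable MTU of a path is the smallest MTU of any hop on it."""
    return min(hop_mtus)


def max_tcp_payload(mtu):
    """Largest TCP payload that fits in one frame for the given MTU."""
    return mtu - IP_TCP_HEADERS


def dropped_on_path(sender_mtu, hop_mtus):
    """True if a full-sized frame from the sender exceeds some hop's MTU.

    Assumes the undersized hop silently drops the frame, so TCP keeps
    retransmitting the same segment without success.
    """
    return sender_mtu > path_mtu(hop_mtus)


# Sender configured for jumbo frames, one switch left at 1500:
# full-sized segments never get through.
mismatch = dropped_on_path(9000, [9000, 1500, 9000])

# Every hop agrees on 1500: TCP segments to 1460-byte payloads and all is well.
consistent = dropped_on_path(1500, [1500, 1500, 1500])
```

On a live network, one common way to look for such a mismatch from a Linux host is a don't-fragment ping sized to the expected MTU, for example `ping -M do -s 1472 <server>` for a 1500-byte MTU (1472 bytes of payload plus 8 bytes of ICMP header and 20 bytes of IP header). If small pings succeed but this one fails, some hop has a smaller MTU.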

Comment 28 Yaniv Kaul 2018-10-25 11:43:53 UTC
What's the next step here?