Bug 825197

Summary: ping_pong hangs on nfs mount
Product: [Community] GlusterFS
Reporter: Shwetha Panduranga <shwetha.h.panduranga>
Component: nfs
Assignee: Vinayaga Raman <vraman>
Status: CLOSED NOTABUG
Severity: high
Priority: unspecified
Version: 3.3-beta
CC: gluster-bugs, rajesh, rwheeler
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2012-07-30 03:09:00 EDT
Type: Bug

Description Shwetha Panduranga 2012-05-25 06:48:42 EDT
Version-Release number of selected component (if applicable):
------------------------------------------------------------
3.3.0qa43

How reproducible:
-----------------
often

Steps to Reproduce:
-------------------
1. Create a replicate volume with 3 bricks.
2. Create 6 NFS mounts.
3. Start executing "ping_pong file1 7" on each NFS mount.
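
For readers unfamiliar with the test, the locking pattern that ping_pong exercises can be sketched roughly as below. This is a simplified Python approximation, not the ping_pong source: the function name and the bounded `iterations` parameter are made up for illustration (the real tool loops until interrupted), and the assumption is that the "7" in step 3 is the number of one-byte lock offsets. Each step takes a blocking fcntl lock on the next byte and then releases the previous one; on an NFSv3 mount these requests are carried by NLM.

```python
import fcntl
import os


def ping_pong(path, num_locks, iterations):
    """Walk a chain of one-byte locks a fixed number of times.

    Hypothetical helper for illustration; returns the number of
    completed lock/unlock steps.
    """
    # Open read-write: an exclusive fcntl lock needs a writable descriptor.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Start by holding byte 0.
        fcntl.lockf(fd, fcntl.LOCK_EX, 1, 0, os.SEEK_SET)
        i = 0
        done = 0
        for _ in range(iterations):
            nxt = (i + 1) % num_locks
            # Blocking request for the next byte. If another process holds
            # it, this call waits until the lock is granted.
            fcntl.lockf(fd, fcntl.LOCK_EX, 1, nxt, os.SEEK_SET)
            # Release the byte held before.
            fcntl.lockf(fd, fcntl.LOCK_UN, 1, i, os.SEEK_SET)
            i = nxt
            done += 1
        return done
    finally:
        os.close(fd)
```

Running `ping_pong("file1", 7, 1000)` from each mount at the same time would approximate step 3: with several processes contending, most lock requests block and depend on the server granting them later.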
  
Actual results:
---------------
ping_pong hangs on every mount as soon as it is started.

Expected results:
-----------------
ping_pong should run successfully.
Comment 1 Krishna Srinivas 2012-05-26 10:35:57 EDT
There seems to be a memory leak in NLM. The nfs process got killed after a while. In your setup, was the nfs process still alive? Did you check? Is this hang reproducible in your setup without replicate?
Comment 2 Shwetha Panduranga 2012-05-28 01:50:53 EDT
ping_pong on a file also hangs on a plain distribute volume.

Valgrind logs:
--------------
==7014==    Use --log-fd=<number> to select an alternative log fd.
==7014== Warning: invalid file descriptor 1017 in syscall close()
==7014== Warning: invalid file descriptor 1018 in syscall close()
==7006== Warning: invalid file descriptor -1 in syscall close()
==7006== Warning: invalid file descriptor -1 in syscall close()
==7006== Warning: invalid file descriptor -1 in syscall close()
==7006== Thread 7:
==7006== Syscall param write(buf) points to uninitialised byte(s)
==7006==    at 0x36386D846D: ??? (in /lib64/libc-2.12.so)
==7006==    by 0x363870EF0A: writetcp (in /lib64/libc-2.12.so)
==7006==    by 0x363871592D: xdrrec_endofrecord (in /lib64/libc-2.12.so)
==7006==    by 0x363870ECF3: clnttcp_call (in /lib64/libc-2.12.so)
==7006==    by 0x981DF2D: nsm_monitor (nlm4.c:551)
==7006==    by 0x3638A077F0: start_thread (in /lib64/libpthread-2.12.so)
==7006==    by 0xCA266FF: ???
==7006==  Address 0x671acd8 is 88 bytes inside a block of size 8,004 alloc'd
==7006==    at 0x4A05FDE: malloc (vg_replace_malloc.c:236)
==7006==    by 0x36387151CD: xdrrec_create (in /lib64/libc-2.12.so)
==7006==    by 0x363870EA42: clnttcp_create (in /lib64/libc-2.12.so)
==7006==    by 0x363870D953: clnt_create (in /lib64/libc-2.12.so)
==7006==    by 0x981DE6F: nsm_monitor (nlm4.c:543)
==7006==    by 0x3638A077F0: start_thread (in /lib64/libpthread-2.12.so)
==7006==    by 0xCA266FF: ???
==7006==
Comment 3 Krishna Srinivas 2012-05-28 04:00:00 EDT
In your setup, was the nfs process still alive when ping_pong hung?
Comment 4 Rajesh 2012-05-28 06:33:08 EDT
Yes, the nfs process as well as the brick(s) are alive and listening (gdb bt showed them at epoll_wait). Wireshark on one of the clients showed NLM_BLOCKED as the last reply from the server.

I tried the same with 6 mounts on a personal VM, with the local machine as the server, and it worked fine. I suspect a network issue, but ping_pong on fuse mounts contradicts that. Needs further investigation.
Comment 5 Krishna Srinivas 2012-07-30 03:09:00 EDT
ping_pong was being run on a client machine that was behind NAT. For locking to work, the client machine's NLM service needs to be reachable from the server machine's NLM service.
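
A quick way to test this kind of reachability is a plain TCP connect from the server towards the client. The reason it matters: when a lock request gets NLM_BLOCKED (as in Comment 4), the server later has to call back to the client to grant the lock, so it must be able to open a connection in that direction. The sketch below is only an illustration. The helper name and the host name in the usage note are hypothetical, and a TCP connect says nothing about UDP or about whether the RPC service behind the port is healthy.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration only.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or name lookup failed.
        return False
```

Run on the NFS server, `can_connect("client.example", 111)` would check the client's portmapper, and the same call with the nlockmgr port reported by `rpcinfo -p` on the client would check its lock manager. A client behind NAT would typically fail both checks unless the ports are forwarded.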