Bug 667770

Summary: rpc.gssd locks up and hangs nfs mount when idle for long time (ticket expires?)
Product: [Fedora] Fedora
Reporter: Orion Poplawski <orion>
Component: nfs-utils
Assignee: Steve Dickson <steved>
Status: CLOSED CANTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: low
Version: 14
CC: bcodding, ender, jlayton, matt, steved, warren
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2012-03-13 15:47:13 EDT

Description Orion Poplawski 2011-01-06 13:13:52 EST
Description of problem:

I'm just starting to test out NFSv4 with krb5.  I have the following mount:

saga:/ /mnt nfs4 rw,sec=krb5,addr=192.168.0.12,clientaddr=192.168.0.39 0 0

If I leave this idle for a while, access to the mount hangs.  There are no messages from rpc.gssd in /var/log/messages (I'm running with -vvv), but I do get these from rpc.idmapd:

Jan  6 11:11:07 orca rpc.idmapd[15362]: New client: 1d3
Jan  6 11:11:07 orca rpc.idmapd[15362]: Opened /var/lib/nfs/rpc_pipefs//nfs/clnt1d3/idmap
Jan  6 11:11:07 orca rpc.idmapd[15362]: New client: 1d4
Jan  6 11:11:07 orca rpc.idmapd[15362]: Stale client: 1d4
Jan  6 11:11:07 orca rpc.idmapd[15362]: #011-> closed /var/lib/nfs/rpc_pipefs//nfs/clnt1d4/idmap
Jan  6 11:11:07 orca rpc.idmapd[15362]: Stale client: 1d3
Jan  6 11:11:07 orca rpc.idmapd[15362]: #011-> closed /var/lib/nfs/rpc_pipefs//nfs/clnt1d3/idmap
Jan  6 11:11:07 orca rpc.idmapd[15362]: New client: 1d5
Jan  6 11:11:07 orca rpc.idmapd[15362]: New client: 1d6
Jan  6 11:11:07 orca rpc.idmapd[15362]: New client: 1d7

If I restart rpc.gssd, everything comes back.

Version-Release number of selected component (if applicable):
nfs-utils-1.2.3-2.fc14.i686

How reproducible:
Very.
Comment 1 Orion Poplawski 2011-01-10 17:40:39 EST
Back trace of hung process:

#0  0x00ae8416 in __kernel_vsyscall ()
#1  0x004be5d1 in __lll_lock_wait_private () from /lib/libc.so.6
#2  0x0044985c in _L_lock_12621 () from /lib/libc.so.6
#3  0x00447797 in malloc () from /lib/libc.so.6
#4  0x0043a398 in open_memstream () from /lib/libc.so.6
#5  0x004a9ae5 in __vsyslog_chk () from /lib/libc.so.6
#6  0x0017d15f in vsyslog (kind=512, 
    fmt=0x1806cc "dir_notify_handler: sig %d si %p data %p\n", args=0xbfc03938 "%")
    at /usr/include/bits/syslog.h:48
#7  xlog_backend (kind=512, fmt=0x1806cc "dir_notify_handler: sig %d si %p data %p\n", 
    args=0xbfc03938 "%") at xlog.c:150
#8  0x001777d4 in printerr (priority=2, 
    format=0x1806cc "dir_notify_handler: sig %d si %p data %p\n") at err_util.c:64
#9  0x00177c9e in dir_notify_handler (sig=37, si=0xbfc0396c, data=0xbfc039ec)
    at gssd_main_loop.c:66
#10 <signal handler called>
#11 0x00444984 in _int_malloc () from /lib/libc.so.6
#12 0x004477a0 in malloc () from /lib/libc.so.6
#13 0x0046db77 in __alloc_dir () from /lib/libc.so.6
#14 0x0046dc5a in opendir () from /lib/libc.so.6
#15 0x0046e7ef in scandir64@@GLIBC_2.2 () from /lib/libc.so.6
#16 0x00179285 in process_pipedir () at gssd_proc.c:565
#17 update_client_list () at gssd_proc.c:594
#18 0x00177f40 in gssd_run () at gssd_main_loop.c:216
#19 0x00177bf9 in main (argc=2, argv=0xbfc04134) at gssd.c:187
Comment 2 Orion Poplawski 2011-01-10 17:56:30 EST
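It looks like malloc is being called from a signal handler that interrupted another malloc call, which is not allowed: malloc is not async-signal-safe, so the handler blocks forever on the allocator lock that the interrupted call already holds.  I'm not sure what the best way around this is, but it looks like dir_notify_handler cannot call printerr.  I suppose this only occurs when -vv or greater is given.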
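
For illustration, here is a minimal sketch of one common way to keep a handler async-signal-safe: the handler only sets a flag, and the logging happens later from the main loop. This is not the nfs-utils code and not the fix that was eventually applied (which, per the changelog quoted later, was to remove the printerr call). The names notify_handler, poll_dir_changed, install_notify_handler and dir_changed are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>

/* Hypothetical names throughout; this is a sketch, not gssd code. */

/* Set by the handler, cleared by the main loop. A volatile
 * sig_atomic_t may be written safely from a signal handler. */
static volatile sig_atomic_t dir_changed = 0;

/* Async-signal-safe: no malloc, no stdio, no syslog. */
void notify_handler(int sig)
{
    (void)sig;
    dir_changed = 1;
}

/* Called from the main loop, outside signal context, so logging
 * (which may allocate) is safe here. Returns 1 if a notification
 * was pending, 0 otherwise. */
int poll_dir_changed(void)
{
    if (!dir_changed)
        return 0;
    dir_changed = 0;
    fprintf(stderr, "directory change notification received\n");
    return 1;
}

/* Install the handler for the given signal; returns 0 on success. */
int install_notify_handler(int signo)
{
    struct sigaction sa;
    sa.sa_handler = notify_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(signo, &sa, NULL);
}
```

With this shape, a signal that arrives in the middle of a malloc call only writes the flag, so the handler never tries to take the allocator lock that the interrupted call holds. The main loop would call poll_dir_changed() on each iteration and rescan the directory when it returns 1.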
Comment 3 Dirk Cummings 2011-05-28 23:42:58 EDT
What's even more hilarious is that when nfs hangs, your entire gnome session freezes.
Comment 4 Orion Poplawski 2012-03-13 16:01:04 EDT
I thought the solution was to drop the printerr call:

http://article.gmane.org/gmane.linux.nfs/45443
Comment 5 bcodding 2012-06-05 11:05:15 EDT
Why was this closed as CANTFIX?  We just ran into this one in RHEL6.
Comment 6 Ender 2016-03-14 18:29:10 EDT
For reference, this was fixed in 1.2.3-63 for RHEL 6:

* Mon May 18 2015 Steve Dickson <steved@redhat.com> 1.2.3-63
- Removed printerr from gssd (bz 949100)