Bug 152120

Summary: rpc.mountd spins endlessly serving closing sockets.
Product: [Fedora] Fedora
Reporter: rakarra
Component: nfs-utils
Assignee: Steve Dickson <steved>
Status: CLOSED RAWHIDE
QA Contact: Ben Levenson <benl>
Severity: medium
Priority: medium
Version: 5
CC: davem, ianw, lars, mattdm
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2006-07-01 10:24:07 EDT

Description rakarra 2005-03-24 20:17:03 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.7.5) Gecko/20041203 Firefox/1.0

Description of problem:
From time to time some of our desktops have had problems with rpc.mountd freezing -- top shows it using 99% of the cpu, so it's stuck somehow. I ran it through a debugger and found that it's stuck in utils/mountd/svc_run.c, looping constantly through select():

select(1024, [3 4 5 6 7 13 14 15 17 32 33 34 35], NULL, NULL, NULL) = 4 (in [32 33 34 35])
select(1024, [3 4 5 6 7 13 14 15 17 32 33 34 35], NULL, NULL, NULL) = 4 (in [32 33 34 35])
select(1024, [3 4 5 6 7 13 14 15 17 32 33 34 35], NULL, NULL, NULL) = 4 (in [32 33 34 35])
select(1024, [3 4 5 6 7 13 14 15 17 32 33 34 35], NULL, NULL, NULL) = 4 (in [32 33 34 35])
select(1024, [3 4 5 6 7 13 14 15 17 32 33 34 35], NULL, NULL, NULL) = 4 (in [32 33 34 35])
select(1024, [3 4 5 6 <unfinished ...>

The problem sockets seem to be stuck in the CLOSE_WAIT state:

rpc.mount 22635 root   13u  IPv4            2053260                TCP fawkes.pixar.com:817->u3238.pixar.com:1021 (ESTABLISHED)
rpc.mount 22635 root   14u  IPv4            2053261                TCP fawkes.pixar.com:817->u4007.pixar.com:1021 (ESTABLISHED)
rpc.mount 22635 root   15u  IPv4            2053262                TCP fawkes.pixar.com:817->u3933.pixar.com:1022 (ESTABLISHED)
rpc.mount 22635 root   17u  IPv4            2053265                TCP fawkes.pixar.com:817->u3601.pixar.com:1022 (ESTABLISHED)
rpc.mount 22635 root   32u  IPv4            2014327                TCP fawkes.pixar.com:817->u5220.pixar.com:1022 (CLOSE_WAIT)
rpc.mount 22635 root   33u  IPv4            2014349                TCP fawkes.pixar.com:817->u4134.pixar.com:1022 (CLOSE_WAIT)
rpc.mount 22635 root   34u  IPv4            2014370                TCP fawkes.pixar.com:817->u5225.pixar.com:1023 (CLOSE_WAIT)
rpc.mount 22635 root   35u  sock                0,4            2014425 can't identify protocol

So select() always succeeds right away, reporting descriptors 32 through 35 as ready, which is why the process eats up so much CPU time: these sockets are stuck or aren't being serviced properly.
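The spinning can be reproduced in isolation. The following is a minimal sketch, not rpc.mountd's actual code: it assumes a local socket.socketpair() as a stand-in for the TCP connections above, and the function names count_spurious_wakeups and serve_until_eof are made up for illustration. It shows that once the peer has closed, select() keeps reporting the descriptor as readable until the server side reads the end-of-file and closes it.

```python
import select
import socket


def count_spurious_wakeups(iterations=5):
    """Count how often select() reports a peer-closed socket as readable."""
    server_end, client_end = socket.socketpair()
    client_end.close()  # the peer goes away; server_end now has EOF pending
    wakeups = 0
    for _ in range(iterations):
        readable, _, _ = select.select([server_end], [], [], 0.1)
        if server_end in readable:
            # Nothing reads or closes the socket, so it stays "ready".
            wakeups += 1
    server_end.close()
    return wakeups


def serve_until_eof():
    """Service the socket properly: an empty recv() means EOF, so close it."""
    server_end, client_end = socket.socketpair()
    client_end.sendall(b"ping")
    client_end.close()
    watched = [server_end]
    received = []
    while watched:
        readable, _, _ = select.select(watched, [], [], 0.1)
        if not readable:
            break  # timed out with nothing ready
        for sock in readable:
            data = sock.recv(4096)
            if data:
                received.append(data)
            else:
                # EOF: close and stop watching, otherwise select() spins.
                sock.close()
                watched.remove(sock)
    return b"".join(received), len(watched)
```

In the first function every select() call returns with the socket ready, which matches the pattern in the trace above. In the second, the descriptor leaves the watched set as soon as the end-of-file is read, so the loop ends.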

The other machines (the u* machines) don't have any sockets open to fawkes; these are client-side sockets that aren't being destroyed as they should be.

I don't know if this is a kernel bug for letting the sockets linger, or an NFS bug for not closing the sockets in the first place. 

This has happened with multiple versions, most recently with 1.0.6-52.

Any ideas?

Version-Release number of selected component (if applicable):
nfs-utils-1.0.6-52

How reproducible:
Sometimes

Steps to Reproduce:
(See above)

Additional info:

Kernel version of nfs server and client:
2.6.8-1.521 (smp)
Comment 2 rakarra 2005-04-15 16:30:24 EDT
Any idea so far about this?
Comment 3 Steve Dickson 2005-04-16 09:00:24 EDT
No... I really don't. Since this problem does not seem to be
widespread (i.e. this is the only report of such behavior),
I have to wonder if it has something to do with your
network stack. Have there been any modifications to
the stack or network driver?

Also, does this happen with more recent kernels?
Comment 4 rakarra 2006-03-09 20:49:53 EST
Hello!
It's been a while, and with the introduction of kernel 2.6.12 we thought that it
had gone away, but it seems to have lingered on and flared up recently.

The kernels on the workstations involved are 2.6.12-1.1372_FC3 on the SMP x86_64
architecture. The server machines to which rpc.mountd had lingering sockets run
either the same kernel version or 2.6.8-1.521. I'll look into it a little more.
Comment 5 Matthew Miller 2006-06-29 23:07:59 EDT
rakarra -- Can you try this on FC5 or newer? Thanks!
Comment 6 rakarra 2006-06-30 19:38:21 EDT
We'll be updating to FC5 this summer, so we'll see if that fixes it. I haven't
gotten any reports of this in the past month or two; it seems to come and go.
The u**** machines aren't being used as much at the moment, so the lower NFS
traffic would trigger this bug less often.
Comment 7 Steve Dickson 2006-07-01 10:24:07 EDT
For now I'm going to close this bug since there have not
been any reports on 2.6.16 and later kernels.