Bug 222421

Summary:	Memory leak in rpc.idmapd
Product:	Red Hat Enterprise Linux 4	Reporter:	Steve Dickson <steved>
Component:	nfs-utils	Assignee:	Steve Dickson <steved>
Status:	CLOSED DUPLICATE	QA Contact:
Severity:	high	Docs Contact:
Priority:	medium
Version:	4.4	CC:	leroy.vanlogchem
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2007-01-12 12:07:11 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	212547
Bug Blocks:

Description Steve Dickson 2007-01-12 11:25:12 UTC

+++ This bug was initially created as a clone of Bug #212547 +++

+++ This bug was initially created as a clone of Bug #157028 +++

Description of problem:

rpc.idmapd shows monotonic growth on a client accessing O(1000) mountpoints over
NFS:

USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
root     16731  0.0 21.1 441160 438276 ?     Ss   Apr29   2:06 rpc.idmapd

Version-Release number of selected component (if applicable):
nfs-utils-1.0.6-52
autofs-4.1.3-114

How reproducible:
always.

Steps to Reproduce:
1. Have each user's home directory as a separate mount point in autofs
2. Access these home directories over NFS

Actual results:
rpc.idmapd's memory usage grows by 40MB whenever all 1000 mountpoints are
accessed (some fail due to the 800 mountpoint limit). rpc.idmapd never reduces
its memory footprint.

Expected results:
rpc.idmapd should free the memory again, at the very least after automount
umounts the mountpoints.

Additional info:
The NFS server is a dated Tru64 UNIX cluster. Mounts are performed over autofs,
for example:
nukleon:/amd/nukleon/1/home/ag-hamprecht/thimm on /home/thimm type nfs
(rw,nosuid,nodev,intr,proplist,udp,addr=160.45.32.130)

-- Additional comment from Axel.Thimm on 2005-05-20 07:01 EST --
This also occurs under RHEL4/x86_64, and has a larger memory leak per mount
(~50-60kB). I'm moving this to RHEL.

-- Additional comment from mef on 2005-09-30 22:25 EST --
I appear to be suffering from this leak, with automounts to Solaris 8 servers
(aka SunOS 5.8).

I am using RHEL 4 (Update 1) i386 on a pentium 3 processor.

There are only O(10) mounted nfs filesystems, but after 2 days of uptime, the
rpc.idmapd process has grown to 5233 blocks (as reported in the SZ column of
'ps -l'). The size of this process appears to increase with the first read of
directories and files on just one of those mounts (it increased from 5170
blocks in a read traversal of O(8000) directories and files). It does not
appear to grow with repeated traversals of the same partition. However,
when the mount is unmounted through the action of automountd, accumulated
memory is not released.

Last week I had to reboot the box because it had become unusable (rpc.idmapd
had grown to over 25000 blocks over 2 weeks). This box only has 384 MB of
physical ram.


-- Additional comment from poelstra on 2005-10-11 20:42 EST --
QE ACK

-- Additional comment from kanderso on 2005-10-20 16:27 EST --
Devel ACK and move to CanFix for U3.

-- Additional comment from steved on 2006-02-09 21:27 EST --
Is this still a problem? I have not been able to reproduce it in my testing...

-- Additional comment from Axel.Thimm on 2006-02-10 05:34 EST --
We turned off rpc.idmapd, later merged all mount points into one, and finally
even decommissioned the TruCluster in favor of a Linux NFS server, so I cannot
provide any useful feedback anymore. Maybe Michael Forrest still has a setup to
test this.

-- Additional comment from steved on 2006-02-10 07:12 EST --
Ok, for now I'm going to put this bug in the DEFERRED state. If
I come across this problem in my travels, or if other people start
to see this problem again, please feel free to REOPEN the bug...

Thank you for your patience!

-- Additional comment from chad_gatesman on 2006-05-02 13:01 EST --
Please reopen this.

I am seeing this very same problem on our servers.  I am running RHEL ES 4
Update 2 (32-bit).  I have a little over 100 automounted file systems from a
variety of OSes, but most are from Solaris 8 (Sparc).  Here are my package
versions:

autofs-4.1.3-155
am-utils-6.0.9-15.RHEL4
nfs-utils-1.0.6-65.EL4

Let me know if there is anything else I can provide or do to help diagnose and
fix this problem.

-- Additional comment from pm-rhel on 2006-08-18 13:43 EST --
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

-- Additional comment from Kevin_M_Lange on 2006-10-23 08:23 EST --
Can this be escalated to be a fix for RHEL4 as errata or hotfix channel?  We're
seeing memory growth of 2.5GB after 2 days.  

-- Additional comment from jlayton on 2006-10-24 15:46 EST --
This upstream list post looks like it might be relevant:

http://linux-nfs.org/pipermail/nfsv4/2006-August/004917.html

The list post makes it sound like the problem occurs primarily due to NFSv2/3
usage when rpc.idmapd is running.


-- Additional comment from jlayton on 2006-10-25 16:16 EST --
Playing with rpc.idmapd on a client and doing 300 mounts and unmounts. After
this, when I kill rpc.idmapd, valgrind says this:

==6115== 1,499,216 (284,880 direct, 1,214,336 indirect) bytes in 397 blocks are
definitely lost in loss record 15 of 15
==6115==    at 0x400579F: realloc (vg_replace_malloc.c:306)
==6115==    by 0xA39BC1: scandir (in /lib/tls/libc-2.3.4.so)
==6115==    by 0x804AF1A: dirscancb (idmapd.c:308)
==6115==    by 0x804DB14: event_loop (event.c:210)
==6115==    by 0x804DBD8: event_dispatch (event.c:222)
==6115==    by 0x804C493: main (idmapd.c:293)

I'll have a closer look at this code tomorrow...


-- Additional comment from jlayton on 2006-10-26 11:44 EST --
Created an attachment (id=139476)
patch 1

The leak seems to be coming from the "scandir". scandir() allocates an array of
directory entries via malloc. dirscancb() is calling this function, but isn't
freeing the entries and the array when it's complete.
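
A minimal sketch of the cleanup pattern described above, not the attached
patch. scandir() hands back a malloc'd array of pointers to separately
malloc'd struct dirent entries, so the caller has to free each entry and then
the array. The helper name count_dir_entries() is hypothetical and only
illustrates the pattern; it is not a function in idmapd.c.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not idmapd code): count the entries in a directory
 * and release everything scandir() allocated on our behalf.
 * Returns the entry count, or -1 if scandir() fails.
 */
int count_dir_entries(const char *path)
{
	struct dirent **ents = NULL;
	int n, i;

	assert(path != NULL);

	n = scandir(path, &ents, NULL, alphasort);
	if (n < 0)
		return -1;		/* scandir failed, nothing to free */

	for (i = 0; i < n; i++)
		free(ents[i]);		/* each entry is allocated separately */
	free(ents);			/* then the array of pointers itself */

	return n;
}
```

Skipping either free() leaves the memory unreachable once the function
returns, which would be consistent with the "definitely lost" and
"indirect" bytes in the valgrind report above.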

This patch seems like it should fix the problem, but with it, I'm getting a
reproducible segfault in idmapd once the last filesystem is unmounted. This is
*probably* an existing bug that's only evident now that we're freeing
things properly.

The segfault is occurring in this line of code:

	TAILQ_FOREACH(ic, icq, ic_next) {

so it seems like something with the list handling here isn't right.


-- Additional comment from jlayton on 2006-10-26 13:43 EST --
Created an attachment (id=139493)
patch 2

Yes indeed. This line:

	TAILQ_FOREACH(ic, icq, ic_next) {

unrolls into:

	for(ic=icq->tqh_first; ic != NULL; ic=ic->ic_next.tqe_next) {

...and within this loop we are freeing "ic", so the loop increment reads
ic->ic_next after the free. The easiest fix is to not use the TAILQ_FOREACH
macro so we can work around the free. This patch does that and seems to avoid
the segfault.
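
A rough sketch of the workaround, assuming the usual <sys/queue.h> macros: the
next pointer is read before the current element can be freed, so the loop
increment never touches freed memory. The struct layout, the "stale" flag, and
the helpers add_client() and drop_stale() are hypothetical stand-ins and are
not taken from the attached patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Hypothetical stand-in for the idmapd client structure. */
struct client {
	int stale;			/* nonzero: entry should be dropped */
	TAILQ_ENTRY(client) ic_next;
};
TAILQ_HEAD(client_queue, client);

/* Append a new entry to the queue; returns NULL on allocation failure. */
struct client *add_client(struct client_queue *icq, int stale)
{
	struct client *ic = calloc(1, sizeof(*ic));

	if (ic == NULL)
		return NULL;
	ic->stale = stale;
	TAILQ_INSERT_TAIL(icq, ic, ic_next);
	return ic;
}

/* Remove and free every stale entry; returns how many were freed. */
int drop_stale(struct client_queue *icq)
{
	struct client *ic, *next;
	int freed = 0;

	assert(icq != NULL);

	for (ic = TAILQ_FIRST(icq); ic != NULL; ic = next) {
		/* Save the successor before ic can be freed. */
		next = TAILQ_NEXT(ic, ic_next);
		if (ic->stale) {
			TAILQ_REMOVE(icq, ic, ic_next);
			free(ic);
			freed++;
		}
	}
	return freed;
}
```

Some BSD-derived queue.h headers provide a TAILQ_FOREACH_SAFE macro for this
pattern, but it is not available everywhere, so the open-coded loop is the
more portable choice.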


-- Additional comment from jlayton on 2006-10-26 14:18 EST --
I've placed i386, x86_64 and SRPM packages on my people page:

http://people.redhat.com/jlayton/bz157028/

Please test them and post here whether they seem to take care of the problem.
Also please post here if you need packages for other arches for testing.


-- Additional comment from jlayton on 2006-10-26 15:03 EST --
1.0.10 Patches posted to:

nfs.net

Subject: [NFS] [PATCH 1/2] idmapd: plug memory leak in dirscancb
Subject: [NFS] [PATCH 2/2] idmapd: fix use after free in dirscancb cleanup loop

-- Additional comment from jlayton on 2006-10-27 07:52 EST --
Going ahead and opening a RHEL5 bug on this. I've not actually tested RHEL5 to
make sure this bug is there, but this bug exists upstream in the latest
nfs-utils packages so I expect that it does.

I pushed these patches to the nfs list yesterday so I expect they'll make it
upstream soon. If we've frozen the nfs-utils version for RHEL5, however, we
should add these to it.


-- Additional comment from pm-rhel on 2006-10-27 08:00 EST --
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

-- Additional comment from jlayton on 2006-10-27 14:58 EST --
Tried to set devel_ack here, but I don't seem to have the right permissions. We
have the patches to fix this already, though.


-- Additional comment from jturner on 2006-12-01 15:34 EST --
QE ack for RHEL5.

-- Additional comment from jlayton on 2006-12-04 07:34 EST --
Committed in 1.0.9-12...

-- Additional comment from pm-rhel on 2006-12-22 20:46 EST --
A package has been built which should help the problem described in 
this bug report. This report is therefore being closed with a resolution 
of CURRENTRELEASE. You may reopen this bug report if the solution does 
not work for you.

Comment 1 RHEL Program Management 2007-01-12 11:43:27 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 2 Steve Dickson 2007-01-12 12:07:11 UTC


*** This bug has been marked as a duplicate of 157028 ***