Bug 440444

Summary:	LD_LIBRARY_PATH, NFS, and a major slowdown
Product:	Red Hat Enterprise Linux 5	Reporter:	Allyn Tennant <allyn.tennant>
Component:	kernel	Assignee:	Red Hat Kernel Manager <kernel-mgr>
Status:	CLOSED DUPLICATE	QA Contact:	Martin Jenner <mjenner>
Severity:	low	Docs Contact:
Priority:	low
Version:	5.0	CC:	jakub, jan.kratochvil, staubach, steved
Target Milestone:	rc
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2008-05-08 13:09:26 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Allyn Tennant 2008-04-03 15:51:29 UTC

Description of problem:
The following script takes 0.24 sec to run on my recently booted system.
After several days (which included several login/logout cycles) the
run time was over 4 sec.  I did not notice a slow degradation, just
one morning the script is taking a lot longer to run.  In the slow
state I see a lot of network traffic (100 packets per iteration) that
was not there originally.  A logout/login cycle does not clear the
problem.  Only a reboot would restore the original performance (SOP
for certain other OSes but not Linux).

I tried restarting the netfs service.  This never restored performance
and would typically put the system into a even slower state (now taking
6.3 sec to run).

Running wireshark to see what is happening when my script is in slow
mode, I find NFS is doing lots of V3 LOOPUPs.  I noticed it was
transversing (many times) directories that are in LD_LIBRARY_PATH,
which for me lists two directories both of which are under /usr/local
which is NFS mounted.  Only these 2 directories are printed at the
end, so it is not a problem with the variable getting longer.  If
I "unset LD_LIBRARY_PATH" at the beginning of the script the run time
goes from 6.4 sec to 0.26 sec.  Not the original time, but close.

------ test script ------------------------------------------
#!/bin/sh

#unset LD_LIBRARY_PATH

icnt=1
while test $icnt -le 50 ; do
   echo $icnt
   tmp=`echo $icnt | /bin/grep 0`
   icnt=`/usr/bin/expr $icnt + 1`
done

echo $LD_LIBRARY_PATH
-------------------------------------------------------------
Version-Release number of selected component (if applicable):
bash 3.1.17
kernel 2.6.18-53.1.14.el5

How reproducible:
I've achieve 100 percent success with the steps below.

Steps to Reproduce:
1. Make sure that LD_LIBRARY_PATH points to at least one directory
on an NFS mounted disk.  Since the time is due to IO transversing
the directory tree, each directory only has to exist, no need for
it to actually contain libraries.  The more entries in the PATH
and the deeper they are, the more obvious the problem.
You can define LD_LIBRARY_PATH at the top of the script.

2. Stop netfs, try to access a NFS directory to get an error and then
start netfs.  This should cause the script to now be in the slow state. 
(If not, repeat this step a couple of times).
  
Actual results:
If the problem has been triggered then "time" should show the test script
with a runtime measured in seconds (real time).

Expected results:
Test script should always effectively take about a quarter second to run.

Additional info:
/usr/local is where I drop 3rd party software and the initialization
routines for one of those packages automatically added the two
directories to my LD_LIBRARY_PATH.

Comment 1 Jakub Jelinek 2008-04-04 18:31:01 UTC

You can use strace to see what the dynamic linker attempts to open.  Using
LD_LIBRARY_PATH has always been costly, because the dynamic linker needs to try
to open each library it wants to load in each of the LD_LIBRARY_PATH component
directories (and for the first library actually on i?86 in 7 subdirectories of
each such directory, but the non-existence of the subdirectories is afterwards
cached within each process).  It is much faster to stick your /usr/local/lib
or whatever directory you want to use into /etc/ld.so.conf.d/local.conf
and rerun ldconfig, unless you keep changing the libraries in that directory
often.  That said, if you don't change the NFS library directories, each program
with identical LD_LIBRARY_PATH will do the same lookups for the same libraries,
so if it works fast initially and after some time gets much slower, it must be
either NFS client problem, or NFS server problem, or both.

Comment 2 Allyn Tennant 2008-04-04 23:24:03 UTC

I'll try the strace on Monday.

Further testing has shown that LD_LIBRARY_PATH needs to be defined
in .bashrc (or maybe before entering the script) in order to see the
problem.  Just defining it in the test script is not good enough.
(I'd forgotten to unset LD_LIBRARY_PATH when I tested defining
it in the script).

Putting the following in the test script

SAVE="$LD_LIBRARY_PATH"
unset LD_LIBRARY_PATH
LD_LIBRARY_PATH="$SAVE"

Speeds it back up.  If I then comment out the unset command, it slows
down.  (This assumes the problem has already been triggered.)

I would not preclude a caching problem at this time...

With the problem triggered on my NFS client, I restarted the nfs service
on my NFS server and it did not help.  I also would have assumed that
restarting netfs on the client if it was an NFS client problem.

Comment 3 Allyn Tennant 2008-04-08 15:46:49 UTC

I ran strace on my script both when the system was running normally,
and again after the problem had been triggered and it was in the slow
state.  I see the system looking for libtercap.so.2 in 8 locations
in each directory in the LD_LIBRARY_PATH.  It then searches for
libdl.so.2 and libc.so.6 but only one time per directory listed.
Thus the non-existant subdirectories had been pruned.

Ignoring hex addresses and PIDs there appears to be no real difference
between the strace output file.  I.e. both files contained the same
number of lines doing the same things in the same order.

I then tried -ff -o sout to get all child processes (also set the
loop count to 2 for obvious reasons).  Each subprocess, that uses
libraries, goes through the process of looking for the first library
in the 8 locations (times two directories in the path).  There are
two subprocesses created each time through the loop.  One does 18
attempted open's of libraries and 16 stat64 commands on the directories
the other does 16 attempted opens and 16 stat64s.  Thus /usr/local
is hit 66 times per loop iteration.  Again the strace outputs shows
that both the normal and slowed systems were doing exactly the same
things in the same order.

Since the network/NFS is not getting hammered in the normal mode but
does get hit (very roughly 100 packets per loop iteration) it appears
to me that NFS is going to the cache in the normal state not after
the problem has been triggered.  Restarting the netfs service (or
doing a umount/mount by hand) does not clear the problem.

Comment 4 Patrick Gendron 2008-05-08 02:26:42 UTC

I encountered a similar problem with my install and was able to correct it by
updating to beta releases of the kernel and nfs-utils packages, namely:

kernel-2.6.18-84.el5
nfs-utils-1.0.9-32.el5

The extra packets are now gone and scripts run at normal speed.

Comment 5 Steve Dickson 2008-05-08 13:09:26 UTC


*** This bug has been marked as a duplicate of 445551 ***