Description of problem: The following script takes 0.24 sec to run on my recently booted system. After several days (which included several login/logout cycles) the run time was over 4 sec. I did not notice a slow degradation, just one morning the script is taking a lot longer to run. In the slow state I see a lot of network traffic (100 packets per iteration) that was not there originally. A logout/login cycle does not clear the problem. Only a reboot would restore the original performance (SOP for certain other OSes but not Linux). I tried restarting the netfs service. This never restored performance and would typically put the system into a even slower state (now taking 6.3 sec to run). Running wireshark to see what is happening when my script is in slow mode, I find NFS is doing lots of V3 LOOPUPs. I noticed it was transversing (many times) directories that are in LD_LIBRARY_PATH, which for me lists two directories both of which are under /usr/local which is NFS mounted. Only these 2 directories are printed at the end, so it is not a problem with the variable getting longer. If I "unset LD_LIBRARY_PATH" at the beginning of the script the run time goes from 6.4 sec to 0.26 sec. Not the original time, but close. ------ test script ------------------------------------------ #!/bin/sh #unset LD_LIBRARY_PATH icnt=1 while test $icnt -le 50 ; do echo $icnt tmp=`echo $icnt | /bin/grep 0` icnt=`/usr/bin/expr $icnt + 1` done echo $LD_LIBRARY_PATH ------------------------------------------------------------- Version-Release number of selected component (if applicable): bash 3.1.17 kernel 2.6.18-53.1.14.el5 How reproducible: I've achieve 100 percent success with the steps below. Steps to Reproduce: 1. Make sure that LD_LIBRARY_PATH points to at least one directory on an NFS mounted disk. Since the time is due to IO transversing the directory tree, each directory only has to exist, no need for it to actually contain libraries. The more entries in the PATH and the deeper they are, the more obvious the problem. You can define LD_LIBRARY_PATH at the top of the script. 2. Stop netfs, try to access a NFS directory to get an error and then start netfs. This should cause the script to now be in the slow state. (If not, repeat this step a couple of times). Actual results: If the problem has been triggered then "time" should show the test script with a runtime measured in seconds (real time). Expected results: Test script should always effectively take about a quarter second to run. Additional info: /usr/local is where I drop 3rd party software and the initialization routines for one of those packages automatically added the two directories to my LD_LIBRARY_PATH.
You can use strace to see what the dynamic linker attempts to open. Using LD_LIBRARY_PATH has always been costly, because the dynamic linker needs to try to open each library it wants to load in each of the LD_LIBRARY_PATH component directories (and for the first library actually on i?86 in 7 subdirectories of each such directory, but the non-existence of the subdirectories is afterwards cached within each process). It is much faster to stick your /usr/local/lib or whatever directory you want to use into /etc/ld.so.conf.d/local.conf and rerun ldconfig, unless you keep changing the libraries in that directory often. That said, if you don't change the NFS library directories, each program with identical LD_LIBRARY_PATH will do the same lookups for the same libraries, so if it works fast initially and after some time gets much slower, it must be either NFS client problem, or NFS server problem, or both.
I'll try the strace on Monday. Further testing has shown that LD_LIBRARY_PATH needs to be defined in .bashrc (or maybe before entering the script) in order to see the problem. Just defining it in the test script is not good enough. (I'd forgotten to unset LD_LIBRARY_PATH when I tested defining it in the script). Putting the following in the test script SAVE="$LD_LIBRARY_PATH" unset LD_LIBRARY_PATH LD_LIBRARY_PATH="$SAVE" Speeds it back up. If I then comment out the unset command, it slows down. (This assumes the problem has already been triggered.) I would not preclude a caching problem at this time... With the problem triggered on my NFS client, I restarted the nfs service on my NFS server and it did not help. I also would have assumed that restarting netfs on the client if it was an NFS client problem.
I ran strace on my script both when the system was running normally, and again after the problem had been triggered and it was in the slow state. I see the system looking for libtercap.so.2 in 8 locations in each directory in the LD_LIBRARY_PATH. It then searches for libdl.so.2 and libc.so.6 but only one time per directory listed. Thus the non-existant subdirectories had been pruned. Ignoring hex addresses and PIDs there appears to be no real difference between the strace output file. I.e. both files contained the same number of lines doing the same things in the same order. I then tried -ff -o sout to get all child processes (also set the loop count to 2 for obvious reasons). Each subprocess, that uses libraries, goes through the process of looking for the first library in the 8 locations (times two directories in the path). There are two subprocesses created each time through the loop. One does 18 attempted open's of libraries and 16 stat64 commands on the directories the other does 16 attempted opens and 16 stat64s. Thus /usr/local is hit 66 times per loop iteration. Again the strace outputs shows that both the normal and slowed systems were doing exactly the same things in the same order. Since the network/NFS is not getting hammered in the normal mode but does get hit (very roughly 100 packets per loop iteration) it appears to me that NFS is going to the cache in the normal state not after the problem has been triggered. Restarting the netfs service (or doing a umount/mount by hand) does not clear the problem.
I encountered a similar problem with my install and was able to correct it by updating to beta releases of the kernel and nfs-utils packages, namely: kernel-2.6.18-84.el5 nfs-utils-1.0.9-32.el5 The extra packets are now gone and scripts run at normal speed.
*** This bug has been marked as a duplicate of 445551 ***