Bug 193043 (anonmemleak)
| Summary: | memory leak in 1.0.2 | | |
|---|---|---|---|
| Product: | [Retired] 389 | Reporter: | Alex Stuck <stucky101> |
| Component: | Performance | Assignee: | Rich Megginson <rmeggins> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Viktor Ashirov <vashirov> |
| Severity: | high | Priority: | medium |
| Version: | 1.0 | CC: | nhosoi, nkinder |
| Hardware: | i386 | OS: | Linux |
| Doc Type: | Bug Fix | Last Closed: | 2015-12-07 16:58:52 UTC |
| Bug Blocks: | 152373, 208654, 240316 | Attachments: | 129960 (output of logconv.pl), 131385 (hourly snapshots of ns-slapd mem and cpu usage) |
Description
Alex Stuck
2006-05-24 19:26:54 UTC
Created attachment 129960 [details]
output of logconv.pl
oops.. forgot to mention the OS is RHEL4, kernel 2.6.9-11.ELsmp.

What are the clients? Do you have any routers/switches/load balancer boxes that contact the LDAP server? Can you reproduce this with just plain ldapsearch?

LDAP stores posixAccounts, posixGroups, nisNetgroups and regular LDAP groups. Clients are mostly RHEL4 and some RHEL3 via nss_ldap; pam_access is used for authentication here. One RHEL4 box uses mod_auth_ldap from this guy: http://muquit.com/muquit/software/mod_auth_ldap/mod_auth_ldap.html for nagios and some other webapp auth. I remember I compiled this from source because it supports nested groups, which I like: I can have a statement in the config that says "require group cn=nagios,ou=Apps,ou=Groups" without having to be directly in the nagios group - instead the sysadmins group is a member of the nagios group.. well, you know how this works. Cfengine has all its classes stored in LDAP as netgroups; it runs on all clients once an hour. One NFS server uses netgroups to secure NFS via /etc/exports, and /etc/sudoers uses 2 netgroups. On top of that we have 3 F5 BigIP load balancers and 2 AlterPath serial console appliances pointing to LDAP for user auth.

I've just finished installing a 3rd fedora-ds server on a test box - same setup, but no one is pointing to it right now. I imported a backup from the production server to it. So far memory is at a stable 1.1%. I will try to hit it with traffic slowly and see what happens.

I just narrowed it down a bit. My nagios box went down today for half a day and I noticed how memory consumption stopped. I bounced the daemon and cleaned the logfiles, then took these values: after 54s mem% was 1.2% (523520); after 02:09:27 it had only risen to 1.4% (527568), when usually it'd be around 3.4% by now. My nagios box was recovered, and since it's prod I had to let it run again; sure enough, now that nagios had started its thing again, I took another value: after 03:17:44 memory usage had risen to 2.3%. logconv shows this:

49000 (&(objectclass=posixgroup)(memberuid=nagios))
45014 (&(objectclass=posixaccount)(uid=nagios))

What I don't get is the second line. nagios is a local user that the nagios server uses to ssh into every box via an ssh key. Why does LDAP get queried for this uid? And why are there so many queries in such a short time? Secondly, even if the server gets thousands of those queries, it still should be able to handle them without leaking memory like that, right?

I was also able to reproduce this on my test fedora-ds box. It had been running with nothing pointing to it except itself for 2 days and stayed at 1.1% memory usage. As soon as I turned on the nagios check it went up to 1.3%. I will let it run and see if it's going to leak all the way to the end.

Excellent detective work! This helps us considerably. What version of nagios are you running? We can set up nagios here and reproduce the problem.

OK, here are the details.
I have a rhel4 box that runs Nagios 2.0b4 compiled from source.
I used to work with a product called SiteScope, which I didn't like. However, one
thing it did that I thought was cool: it was agentless.
I borrowed that idea for nagios.
I image all my machines and this image has a local user already called nagios.
nagios:x:503:503::/usr/local/nagios/home:/bin/bash
It also has this set of files:
[root@ns1 stucky]# ls -l /usr/local/nagios/home/
total 152
-r-------- 1 nagios nagios 1066 May 27 14:00 acl
-r-x------ 1 nagios nagios 962 Oct 1 2005 acl_agent
-r-x------ 1 nagios nagios 58548 Jul 8 2005 check_disk
-r-x------ 1 nagios nagios 441 Oct 6 2005 check_duplex
-r-x------ 1 nagios nagios 37528 Jul 8 2005 check_load
-r-x------ 1 nagios nagios 686 Oct 12 2005 check_mailq
-r-x------ 1 nagios nagios 4695 Jul 8 2005 check_mem
-r-x------ 1 nagios nagios 19180 Jul 8 2005 check_procs
-r-x------ 1 nagios nagios 854 Jul 8 2005 check_swap
I also moved the authorized_keys2 file away from user control in sshd_config:
AuthorizedKeysFile /etc/ssh/keys/%u/authorized_keys2
so each machine has this in /etc/ssh/keys/nagios/authorized_keys2:
from="{nagiosip}",command="/usr/local/nagios/home/acl_agent",no-port-forwarding,no-X11-forwarding,no-agent-forwarding
ssh-dss ......
I wrote acl_agent as a wrapper to check what is passed to sshd:

#!/usr/bin/perl -w
# This script runs every time the nagios box uses its key to run a command remotely.
# Instead of running whatever command it is given, sshd execs this wrapper to do a
# sanity check: if the command matches one of the ones pre-defined in the acl, the
# script runs it; otherwise it exits 2.
use File::stat;
use User::pwent;

my $cmd = $ENV{"SSH_ORIGINAL_COMMAND"};
my $nagios_home = "/usr/local/nagios/home";
my $acl = "$nagios_home/acl";
my $st = stat($acl) || die "File 'acl' not found or inaccessible...";
my $pw = getpwnam ('nagios') || die "User 'nagios' doesn't exist...";

# 33024 == octal 0100400, i.e. a regular file with mode -r--------
if ($st->mode != 33024 || $st->uid != $pw->uid) {
    print "Check owner/permissions of file 'acl'...";
    exit 2;
}
elsif ($pw->dir ne $nagios_home) {
    print "Check homedir of user 'nagios'...";
    exit 2;
}

# run the requested command only if it matches an acl line verbatim
open (ACL, $acl);
foreach (<ACL>) {
    chomp;
    if ($cmd eq $_) {
        system ($cmd);
        exit ($? >> 8);
    }
}
print "Check acl...";
close ACL;
exit 2;
as you can guess the acl contains the commands nagios is allowed to exec
remotely on this box.
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /home
/usr/local/nagios/home/check_mem -f -w 10 -c 5
/usr/local/nagios/home/check_swap 15 30
and so forth..
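Each of these remote checks is a full ssh login, and every login triggers NSS user/group lookups (getpwnam, initgroups) on the target box; judging by the logconv counts above, nss_ldap was forwarding those lookups to the directory even though nagios is a local user. A minimal sketch for exercising the wrapper by hand the same way a check does (host ns1 taken from the listings above; run from the nagios server as the nagios user):

# one remote check == one ssh login == one burst of nss_ldap traffic on ns1
ssh nagios@ns1 '/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /home'
# anything not listed in the acl is refused by acl_agent
# (prints "Check acl..." and exits 2)
ssh nagios@ns1 '/bin/true'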
One other thing that is probably special about my nagios setup: when I set this
up I wasn't aware that someone had already written a plugin called check_by_ssh,
so I set up my own stuff.
Basically, in checkcommands.cfg I have a local and a remote definition for the
plugins. Here is an example for the disk check:
define command{
        command_name    check_local_disk
        command_line    $USER1$/check_disk -M -w $ARG1$ -c $ARG2$ -p $ARG3$
        }
define command{
        command_name    check_remote_disk
        command_line    /usr/bin/ssh nagios@$HOSTADDRESS$ $USER2$/check_disk -M -w $ARG1$ -c $ARG2$ -p $ARG3$
        }
in resource.cfg I have:
$USER1$=/usr/local/nagios/libexec
$USER2$=/usr/local/nagios/home
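For illustration (a sketch, not from the original report; ns1 is one of the boxes from the listings above): with those macros, a service that calls check_remote_disk!15%!5%!/home expands to a plain ssh invocation, which is why every scheduled check is a fresh ssh login:

/usr/bin/ssh nagios@ns1 /usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /home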
One last thing to do is log on to the nagios box, su - nagios, and then ssh
to each box so that the host fingerprint gets saved on the nagios server side.
After it asks whether you want to save the host fingerprint, you should see
something like this:
[root@nagios stucky]# su - nagios
-bash-3.00$ ssh ns1
Use of uninitialized value in string eq at /usr/local/nagios/home/acl_agent line 29, <ACL> line 20.
(the warning above is printed 20 times, once per line of the acl file - an interactive login sets no SSH_ORIGINAL_COMMAND, so $cmd is undefined)
Check acl...Connection to ns1 closed.
This is all I can think of.
Right now my test fedora-ds box is also checked by nagios but the difference is
that this box points to itself whereas all other boxes point to ldap1 and ldap2.
So we'd assume that memory consumption should rise more slowly here, since nagios
only hits this box for itself every now and then. That seems to be the case, since
mem% has risen from 1.3% yesterday to 2.0%.
My production server, however, is already at 14.6% again.
btw, here is the content of a typical acl file. These are the checks I run on
every machine (about 30 of them):
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /boot
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /var
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /tmp
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /usr
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /usr/local
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /home
/usr/local/nagios/home/check_mem -f -w 10 -c 5
/usr/local/nagios/home/check_swap 15 30
/usr/local/nagios/home/check_load -w 15 10 5 -c 30 25 20
/usr/local/nagios/home/check_procs -w 1:1 -c 1:2 -C syslog-ng
/usr/local/nagios/home/check_procs -w 1:1 -c 1:2 -C master
/usr/local/nagios/home/check_procs -w 1:2 -c 1:3 -C crond
/usr/local/nagios/home/check_procs -w 1:1 -c 1:2 -C MegaServ
/usr/local/nagios/home/check_procs -w 1 -c 2 -s Z
sudo /usr/local/nagios/home/check_duplex
sudo /etc/init.d/raidmon start
/usr/local/nagios/home/check_procs -w 1:1 -c 1:2 -C cfexecd
sudo /etc/init.d/cfexecd start
you can probably run any checks you want - I don't think it matters which ones, as
long as you run about the same number of them.
Also here is my services template from /usr/local/nagios/etc/services.cfg
define service{
        name                            generic-service ; The 'name' of this service template, referenced in other service definitions
        active_checks_enabled           1       ; Active service checks are enabled
        passive_checks_enabled          1       ; Passive service checks are enabled/accepted
        parallelize_check               1       ; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             0       ; We should obsess over this service (if necessary)
        check_freshness                 0       ; Default is to NOT check service 'freshness'
        notifications_enabled           1       ; Service notifications are enabled
        event_handler_enabled           1       ; Service event handler is enabled
        flap_detection_enabled          0       ; Flap detection is disabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1       ; Retain status information across program restarts
        retain_nonstatus_information    1       ; Retain non-status information across program restarts
        is_volatile                     0
        retry_check_interval            1       ; Re-check every minute if state has changed to a non-ok state
        notification_options            w,u,c,r ; Notify for all state changes
        check_period                    24x7
        notification_period             24x7
        notification_interval           720     ; Send notifications every 12 hours
        register                        0       ; DON'T REGISTER THIS DEFINITION - IT'S NOT A REAL SERVICE, JUST A TEMPLATE!
        }
The check interval is mostly one minute but for some services it's 3 or 5.
Hope this gets you started..thx guys
just fyi... ns-slapd on my test server is now at 4.7% memory usage, and ps -eo comm,vsz,etime | grep slapd says: ns-slapd 531352 5-17:32:23 - so it seems to confirm my theory.

Does nagios use SNMP?

no - it just execs the plugins in /usr/local/nagios/home via ssh.

Make sure you have the following entries in /usr/local/nagios/home/acl:
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /boot
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /var
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /tmp
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /usr
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /usr/local
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /home
/usr/local/nagios/home/check_mem -f -w 10 -c 5
/usr/local/nagios/home/check_swap 15 30
/usr/local/nagios/home/check_load -w 15 10 5 -c 30 25 20
/usr/local/nagios/home/check_procs -w 1:1 -c 1:2 -C syslog-ng
/usr/local/nagios/home/check_procs -w 1:1 -c 1:2 -C master
/usr/local/nagios/home/check_procs -w 1:2 -c 1:3 -C crond
/usr/local/nagios/home/check_procs -w 1:1 -c 1:2 -C MegaServ
/usr/local/nagios/home/check_procs -w 1 -c 2 -s Z
sudo /usr/local/nagios/home/check_duplex
sudo /etc/init.d/raidmon start
/usr/local/nagios/home/check_procs -w 1:1 -c 1:2 -C cfexecd
sudo /etc/init.d/cfexecd start
You already have the acl_agent perl script.
All plugins are the regular ones except check_duplex, which I wrote:
#!/usr/bin/perl -w
my %EXIT = (OK => 0,
            CRITICAL => 2);
# list all interfaces, then check each ethN's duplex setting with ethtool
my @output = `/sbin/ip add`;
foreach (@output) {
    if (m/\w+ eth\d.*$/) {
        s/.+ (eth\d.*)/$1/;
        chomp;
        $eth = $_;
        $_ = `/sbin/ethtool $eth`;
        if (m/Duplex: Half/) {
            print "CRITICAL: $eth is on half duplex !";
            exit ($EXIT{CRITICAL});
        }
    }
}
print "OK: all interfaces are on full duplex ";
exit ($EXIT{OK});
It needs to run via sudo without a password, since ethtool requires root
privileges even for read operations. Hence we need this in sudoers:
User_Alias NAGIOS=nagios
Cmnd_Alias RAIDMON=/etc/init.d/raidmon start
Cmnd_Alias CFEXECD=/etc/init.d/cfexecd start
Cmnd_Alias DUPLEX=/usr/local/nagios/home/check_duplex
NAGIOS ALL=NOPASSWD: RAIDMON,CFEXECD,DUPLEX
raidmon and cfexecd are set here so that nagios can restart those services via an
event handler. I don't think you need to bother with that.
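A quick way to sanity-check those sudoers entries (a sketch, not from the original report; sudo -l is standard and lists the invoking user's allowed commands):

# run as root; the output should include the three NOPASSWD commands
# defined via the RAIDMON, CFEXECD and DUPLEX aliases above
su - nagios -c 'sudo -l'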
Here is the section from services.cfg that checks the basic stuff:
define service{
use generic-service
hostgroup_name generic-linux
service_description DISK USAGE: /
check_command check_remote_disk!15%!5%!/
max_check_attempts 5
normal_check_interval 5
contact_groups admins
}
define service{
use generic-service
hostgroup_name generic-linux
service_description DISK USAGE: /BOOT
check_command check_remote_disk!15%!5%!/boot
max_check_attempts 5
normal_check_interval 5
contact_groups admins
}
define service{
use generic-service
hostgroup_name generic-linux
service_description DISK USAGE: /TMP
check_command check_remote_disk!15%!5%!/tmp
max_check_attempts 5
normal_check_interval 5
contact_groups admins
}
define service{
use generic-service
hostgroup_name generic-linux
service_description DISK USAGE: /VAR
check_command check_remote_disk!15%!5%!/var
max_check_attempts 5
normal_check_interval 5
contact_groups admins
}
define service{
use generic-service
hostgroup_name generic-linux
service_description DISK USAGE: /USR
check_command check_remote_disk!15%!5%!/usr
max_check_attempts 5
normal_check_interval 5
contact_groups admins
}
define service{
use generic-service
hostgroup_name generic-linux
service_description DISK USAGE: /USR/LOCAL
check_command check_remote_disk!15%!5%!/usr/local
max_check_attempts 5
normal_check_interval 5
contact_groups admins
}
define service{
use generic-service
hostgroup_name generic-linux
service_description DISK USAGE: /HOME
check_command check_remote_disk!15%!5%!/home
max_check_attempts 5
normal_check_interval 5
contact_groups admins
}
define service{
use generic-service
hostgroup_name generic-linux
service_description MEMORY USAGE
check_command check_remote_mem!10!5
max_check_attempts 5
normal_check_interval 5
contact_groups admins
}
define service{
use generic-service
hostgroup_name generic-linux
service_description NR. OF PROCS: SYSLOG-NG
check_command check_remote_procs!1:1!1:2!syslog-ng
max_check_attempts 2
normal_check_interval 1
contact_groups admins
}
define service{
use generic-service
hostgroup_name generic-linux
service_description NR. OF PROCS: POSTFIX
check_command check_remote_procs!1:1!1:2!master
max_check_attempts 2
normal_check_interval 1
contact_groups admins
}
define service{
use generic-service
hostgroup_name generic-linux
service_description NR. OF PROCS: CRON
check_command check_remote_procs!1:2!1:3!crond
max_check_attempts 2
normal_check_interval 1
contact_groups admins
}
define service{
use generic-service
hostgroup_name generic-linux
service_description NR. OF PROCS: RAID MONITOR
check_command check_remote_procs!1:1!1:2!MegaServ
event_handler raidmon_restart_remote
max_check_attempts 3
normal_check_interval 1
contact_groups admins
}
define service{
use generic-service
hostgroup_name generic-linux
service_description NR. OF ZOMBIES
check_command check_remote_zombies!1!2!Z
max_check_attempts 3
normal_check_interval 3
contact_groups admins
}
define service{
use generic-service
hostgroup_name generic-linux
service_description LOAD AVERAGE
check_command check_remote_load!'15 10 5'!'30 25 20'
max_check_attempts 5
normal_check_interval 5
contact_groups admins
}
define service{
use generic-service
hostgroup_name generic-linux
service_description SWAP
check_command check_remote_swap!15!30
max_check_attempts 5
normal_check_interval 5
contact_groups admins
}
define service{
use generic-service
hostgroup_name generic-linux
service_description DUPLEX SETTINGS
check_command check_remote_duplex
max_check_attempts 2
normal_check_interval 1
contact_groups admins
}
That should really be most of the info you need. Thx guys !!
one more thing:
/etc/ldap.conf on all rhel4 clients looks like this:
URI ldaps://host1.idf.net ldaps://host2.idf.net
bind_timelimit 5
bind_policy soft
pam_lookup_policy yes
BASE dc=idf,dc=net
TLS_CACERTDIR /etc/openldap/cacerts
start_tls no
ssl on
tls_checkpeer yes
TLS_REQCERT demand
and /etc/pam.d/system-auth looks like this:
auth        required      /lib/security/$ISA/pam_env.so
auth        sufficient    /lib/security/$ISA/pam_unix.so likeauth nullok
auth        sufficient    /lib/security/$ISA/pam_ldap.so use_first_pass
auth        required      /lib/security/$ISA/pam_deny.so
account     required      /lib/security/$ISA/pam_access.so
account     required      /lib/security/$ISA/pam_unix.so broken_shadow
account     sufficient    /lib/security/$ISA/pam_succeed_if.so uid < 100 quiet
account     sufficient    /lib/security/$ISA/pam_localuser.so
account     [default=bad success=ok user_unknown=ignore service_err=ignore system_err=ignore authinfo_unavail=ignore] /lib/security/$ISA/pam_ldap.so
account     required      /lib/security/$ISA/pam_permit.so
password    requisite     /lib/security/$ISA/pam_cracklib.so retry=3
password    sufficient    /lib/security/$ISA/pam_unix.so nullok use_authtok md5 shadow
password    sufficient    /lib/security/$ISA/pam_ldap.so use_authtok
password    required      /lib/security/$ISA/pam_deny.so
session     required      /lib/security/$ISA/pam_limits.so
session     required      /lib/security/$ISA/pam_unix.so
session     optional      /lib/security/$ISA/pam_ldap.so
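To reproduce the load with plain ldapsearch, as asked earlier in the thread, a sketch using the ldaps URI and base DN from the ldap.conf above - these replay the two filters logconv flagged and could be looped to simulate the nagios traffic:

# the two searches nss_ldap was issuing for the nagios user
ldapsearch -x -H ldaps://host1.idf.net -b dc=idf,dc=net '(&(objectclass=posixaccount)(uid=nagios))'
ldapsearch -x -H ldaps://host1.idf.net -b dc=idf,dc=net '(&(objectclass=posixgroup)(memberuid=nagios))'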
Could there be any errors in slapd-<id>/logs/errors or on the mod_auth_ldap/nagios side?

no, neither log shows errors. I don't think it has to do with mod_auth_ldap anyway - there is little traffic from that.

We are thinking you might have hit this memory leak bug in NSS 3.11:

> This leak was introduced in NSS 3.11 and has been fixed in NSS 3.11.1. See
> https://bugzilla.mozilla.org/show_bug.cgi?id=336335#c9.

To fix it, you need to replace NSPR 4.6 with 4.6.2 and NSS 3.11 with 3.11.1 in your Fedora Directory Server. The binaries are not available yet on the mozilla site. If you are interested, could you check out the libraries:

$ export CVSROOT=:pserver:anonymous@cvs-mirror.mozilla.org:/cvsroot
$ cvs -z3 co -r NSPR_4_6_2_RTM mozilla/nsprpub
$ cvs -z3 co -r NSS_3_11_1_RTM mozilla/dbm mozilla/security/dbm mozilla/security/coreconf mozilla/security/nss

and build them following the instructions found here? http://directory.fedora.redhat.com/wiki/Building#Mozilla.org_components

Then shut down the Directory Server and copy the built libraries to the server lib directory as follows:

$ cd <mozilla_root>/mozilla/dist/<PlatformInfo_glibc_PTH_DBG_or_OPT.OBJ>/lib
$ cp *.{so,chk} /opt/fedora-ds/bin/slapd/lib

Unfortunately, this will "break" RPM if the files are replaced. So please be careful, keep backups of the files, and run your test. Also, when the new NSPR / NSS libraries are ready to download at the mozilla ftp site, we will announce it on http://directory.fedora.redhat.com/wiki/. Thanks.

Hosoi-san, you hit the spot! I compiled the new libs, replaced them, and after 2 days ns-slapd is at a stable 1.8% memory usage. One interesting thing I noticed is that both my test and prod ldap servers behaved exactly the same way once I changed the libs. The prod server used to start at 1.5% but now goes to 1.7% almost right away. After about 12 hours it went to 1.8%. Then after another 10 hours it went to 1.9%. I started thinking the problem wasn't fixed. However, just about 1 hour later it went back to 1.8%, something that had never happened before. It's been running for 2 days 23 hours at a stable 1.8%. The test server had shown the same behaviour, except that the memory usage was lower due to the fact that it only uses itself as a client. I feel pretty good about this at this point but will keep watching for long-term effects. thx so far guys.

maybe I spoke too soon after all. I looked at my production server today and memory usage changed again. After 4 days and 10 hours it went to 1.9% again, only to go back to 1.8% after 2 hours. Then after a total of 4 days and 13 hours it suddenly went to 2.2%. That's where it's been. At this point I'm not sure whether this is normal behaviour or if the bug is still present in a slightly different form. I'll keep you posted.

That may very well be normal behavior, if it needs to allocate more memory from the OS for some cache. At any rate, it certainly doesn't seem to be the same bug that you originally reported.

Created attachment 131385 [details]
hourly snapshots of ns-slapd mem and cpu usage after NSS and NSPR upgrade
Hmm - what happened between these two entries?

PID   USER  PR  NI  VIRT  RES   SHR  S  %CPU  %MEM  TIME+    COMMAND
10880 ldap  15   0  545m  48m   14m  S   0.0   2.4  1347:41  ns-slapd
ns-slapd 558796 8-01:18:13

PID   USER  PR  NI  VIRT  RES   SHR  S  %CPU  %MEM  TIME+    COMMAND
10880 ldap  16   0  658m  145m  14m  S  61.3   7.2  1354:59  ns-slapd
ns-slapd 673864 8-02:18:12

guys, please look at the latest attachment. It seems to me the new libs don't fix the leak but simply delay it. Also, can you tell me if you think the CPU% values are normal? I get this output by running the following via cron.hourly:

#!/bin/bash
top -bn1 | grep -e ns-slapd -e PID >> /tmp/ns-slapd_mem
ps -eo comm,vsz,etime | grep ns-slapd >> /tmp/ns-slapd_mem
echo -e "\n" >> /tmp/ns-slapd_mem

oh damn, Rich, you're fast. Well, that's exactly it - nothing happened. The server just keeps running and serving the same amount of clients in the same manner. Yet as time passes memory usage goes up again - only much slower now.

This may just be normal caching behavior. Over time, as the number of entries grows, more entries will be added to the cache, and more memory will be required to cache those entries. It should eventually level out at a fairly low percentage of your system RAM.

OK, I just tested something. I usually use phpldapadmin to make simple changes like adding a netgroup etc. I logged onto phpldapadmin, and after about 30 seconds memory usage changed from 7.2% to 7.4%. I admit I use phpldapadmin a lot, and there seems to be a correlation. Are you saying, however, that 7.4% is still an OK value? At what point should I worry again - 50%, 60%?

I know phpldapadmin does a _lot_ of searches, and probably hits a lot of entries that were not in the cache. I really can't say at what point you should start to worry. It's really a function of the number of entries, average size per entry, number of indexes, type of indexes, other database overhead, and replication overhead. That is, it should not be significantly more than the size of the database on disk, which is the size of all of the files under /opt/fedora-ds/slapd-instance/db, minus the __db.XXX files.

ok, ns-slapd uses 673864 k right now, and the files in the db directory have the following sizes:

[root@ldap1 db]# du -h ./*
12K     ./__db.001
5.5M    ./__db.002
548K    ./__db.003
3.8M    ./__db.004
28K     ./__db.005
4.0K    ./DBVERSION
4.0K    ./guardian
8.4M    ./log.0000000029
372K    ./NetscapeRoot
492K    ./userRoot

I assume you mean the sizes of NetscapeRoot and userRoot combined make up the database on disk, but that's only 864 k. That versus 674 MB???

It's not exactly that size - the size in memory is quite a bit larger than the size on disk. You also need to include the size of log.0000000029 in the db size; that adds another 8.4M. That file is the transaction log for your database. slapd also caches lots of other things besides the straight database files, but it's usually some function of the size of the data. The point is that at some point slapd memory usage should level off, but I'm not exactly sure at what point that is.

ok, thanks Rich. I will keep watching and let you know if anything major happens.

Just checking - some other users have reported memory leakage. I just wanted to know if you have had any more problems since replacing nspr/nss.

nope - I had actually forgotten about it since it's been fine. After 41 days 17 hours it's at a stable 2.7%, which is fine with me. thx for checking in though!
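A one-liner for the on-disk size calculation described above (a sketch, not from the original thread; --exclude assumes GNU du, and slapd-instance is a placeholder for the actual instance name):

# database size on disk, excluding the __db.* region files but
# including the transaction log, as noted above
du -ch --exclude='__db.*' /opt/fedora-ds/slapd-instance/db | tail -1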
New NSPR 4.6.2/NSS 3.11.1 binaries for RHEL4 x86_64: http://directory.fedora.redhat.com/download/nspr-4.6.2-nss-3.11.1-RHEL4-x86_64.tar.gz

New NSPR 4.6.2/NSS 3.11.1 binaries for RHEL4 i386 (32 bit): http://directory.fedora.redhat.com/download/nspr-4.6.2-nss-3.11.1-RHEL4-i386.tar.gz

The next version of Fedora DS will use newer versions of NSPR and NSS that fix this problem. Fedora DS 1.0.3 includes NSS 3.11.3.

Verified.