Description of problem:
ns-slapd seems to leak memory when the server is hit with lots of anonymous binds, a significant difference between binds and unbinds, and more abnormal connection codes than cleanly closed connections. Bad BER tags may also have something to do with it.

Version-Release number of selected component (if applicable): 1.0.2

How reproducible: 100%

Steps to Reproduce:
1. Run ldaps.
2. Wait and watch mem% grow.
3. Bounce slapd; memory usage goes back to 1.7%.
4. Everything starts over again.

Actual results (from ps -eo comm,vsz,etime | grep ns-slapd):

time elapsed    mem%    vsz (KB)
00:19            1.7%    523152
01:12:21         2.7%    555908
02:01:58         3.4%    569820
03:08:03         4.2%    588000
08:34:44         8.5%    687532
19:40:37        17.4%    872868
23:50:47        20.8%    945332
1-03:08:53      23.4%    998068
1-19:11:58      36.2%   1265692
1-21:11:53      37.8%   1296752

This goes on until ns-slapd runs out of memory and spits out a malloc error.

Expected results: ns-slapd remains at 1.7%-2% memory usage.

Additional info: checking for current connections shows no more than about 5-10 most of the time. Idle timeout is set to 120 seconds.

dn: cn=config,cn=ldbm database,cn=plugins,cn=config
nsslapd-dbcachesize: 10485760
nsslapd-import-cache-autosize: -1
nsslapd-import-cachesize: 20000000

dn: cn=NetscapeRoot,cn=ldbm database,cn=plugins,cn=config
nsslapd-cachesize: -1
nsslapd-cachememsize: 10485760

dn: cn=userRoot,cn=ldbm database,cn=plugins,cn=config
nsslapd-cachesize: -1
nsslapd-cachememsize: 10485760

The box is a PowerEdge 1850 with 2 GB physical memory and 4 GB swap.
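For scale, the growth in the table above works out to roughly half a gigabyte of VSZ in the first day. A quick sanity check on the numbers (a sketch using the two VSZ values from the table):

```shell
# VSZ growth over roughly the first day, from the table above (values in KB)
start_kb=523152   # at 00:19 elapsed
day1_kb=998068    # at 1-03:08:53 elapsed
echo $(( day1_kb - start_kb ))   # prints 474916, i.e. ~464 MB of growth
```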
Created attachment 129960 [details] output of logconf.pl
Oops, forgot to mention: the OS is RHEL4, kernel 2.6.9-11.ELsmp.
What are the clients? Do you have any routers/switches/load balancer boxes that contact the LDAP server? Can you reproduce this with just plain ldapsearch?
LDAP stores posixAccounts, posixGroups, nisNetgroups, and regular LDAP groups. Clients are mostly RHEL4 and some RHEL3 via nss_ldap; pam_access is used for authentication here.

One RHEL4 box uses mod_auth_ldap from this guy: http://muquit.com/muquit/software/mod_auth_ldap/mod_auth_ldap.html for nagios and some other webapp auth. I remember I compiled this from source because it supports nested groups, which I like, so I can have a statement in the config that says:

require group cn=nagios,ou=Apps,ou=Groups

but I don't have to be directly in the nagios group; instead the sysadmins group is a member of the nagios group. Well, you know how this works.

Cfengine has all its classes stored in LDAP as netgroups. It runs on all clients once an hour. One NFS server uses netgroups to secure NFS via /etc/exports, and /etc/sudoers uses 2 netgroups. On top of that we have 3 F5 BigIP load balancers and 2 AlterPath serial console appliances pointing to LDAP for user auth.

I've just finished installing a 3rd fedora-ds server on a test box with the same setup, but no one is pointing to it right now. I imported a backup from the production server to it. So far memory is at a stable 1.1%. I will try to hit it with traffic slowly and see what happens.
I just narrowed it down a bit. My nagios box went down today for half a day, and I noticed that memory consumption stopped growing. I bounced the daemon and cleaned the logfiles, then took these values:

after 54s, mem% was 1.2% (523520)
after 02:09:27 it had only risen to 1.4% (527568), when usually it'd be around 3.4% by now.

My nagios box was recovered, and since it's prod I had to let it run again. Sure enough, now that nagios started its thing again, I took another value: after 03:17:44, mem usage had risen to 2.3%.

logconv shows this:

49000 (&(objectclass=posixgroup)(memberuid=nagios))
45014 (&(objectclass=posixaccount)(uid=nagios))

What I don't get is the second line. nagios is a local user that the nagios server uses to ssh into every box via an ssh key. Why does LDAP get queried for this uid? And why are there so many queries in such a short time? Secondly, even if the server gets thousands of those queries, it still should be able to handle them without leaking memory like that, right?

I was also able to reproduce this on my test fedora-ds box. It had been running with nothing pointing to it except itself for 2 days and stayed at 1.1% mem usage. As soon as I turned on the nagios check, it went up to 1.3%. I will let it run and see if it's going to leak all the way to the end.
Excellent detective work! This helps us considerably. What version of nagios are you running? We can set up nagios here and reproduce the problem.
OK, here are the details. I have a RHEL4 box that runs Nagios 2.0b4 compiled from source. I used to work with a product called SiteScope, which I didn't like; however, one thing it did was cool, I thought: it was agentless. I borrowed that idea for nagios. I image all my machines, and this image already has a local user called nagios:

nagios:x:503:503::/usr/local/nagios/home:/bin/bash

It also has this set:

[root@ns1 stucky]# ls -l /usr/local/nagios/home/
total 152
-r-------- 1 nagios nagios  1066 May 27 14:00 acl
-r-x------ 1 nagios nagios   962 Oct  1  2005 acl_agent
-r-x------ 1 nagios nagios 58548 Jul  8  2005 check_disk
-r-x------ 1 nagios nagios   441 Oct  6  2005 check_duplex
-r-x------ 1 nagios nagios 37528 Jul  8  2005 check_load
-r-x------ 1 nagios nagios   686 Oct 12  2005 check_mailq
-r-x------ 1 nagios nagios  4695 Jul  8  2005 check_mem
-r-x------ 1 nagios nagios 19180 Jul  8  2005 check_procs
-r-x------ 1 nagios nagios   854 Jul  8  2005 check_swap

I also moved the authorized_keys2 file away from user control in sshd_config:

AuthorizedKeysFile /etc/ssh/keys/%u/authorized_keys2

so each machine has this in /etc/ssh/keys/nagios/authorized_keys2:

from="{nagiosip}",command="/usr/local/nagios/home/acl_agent",no-port-forwarding,no-X11-forwarding,no-agent-forwarding ssh-dss ......

I wrote acl_agent as a wrapper to check what is passed to sshd:

#!/usr/bin/perl -w
# This script runs every time the nagios box uses its key to run a command remotely.
# Instead of running whatever command sshd would exec, this wrapper does a sanity check.
# If the command matches one of the ones pre-defined in the acl, this script runs it;
# otherwise it exits 2.
use File::stat;
use User::pwent;

my $cmd         = $ENV{"SSH_ORIGINAL_COMMAND"};
my $nagios_home = "/usr/local/nagios/home";
my $acl         = "$nagios_home/acl";

my $st = stat($acl)        || die "File 'acl' not found or inaccessible...";
my $pw = getpwnam('nagios') || die "User 'nagios' doesn't exist...";

if ($st->mode != 33024 || $st->uid != $pw->uid) {
    print "Check owner/permissions of file 'acl'...";
    exit 2;
} elsif ($pw->dir ne $nagios_home) {
    print "Check homedir of user 'nagios'...";
    exit 2;
}

open (ACL, $acl);
foreach (<ACL>) {
    chomp;
    if ($cmd eq $_) {
        system ($cmd);
        exit ($? >> 8);
    }
}
print "Check acl...";
close ACL;
exit 2;

As you can guess, the acl contains the commands nagios is allowed to exec remotely on this box:

/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /home
/usr/local/nagios/home/check_mem -f -w 10 -c 5
/usr/local/nagios/home/check_swap 15 30

and so forth. One other thing that is prolly special about my nagios setup is that when I set this up I wasn't aware that someone had already written a plugin called check_by_ssh, so I set up my own stuff. Basically, in checkcommands.cfg I have a local and a remote definition for plugins.
Here is an example for the disk check:

define command{
        command_name    check_local_disk
        command_line    $USER1$/check_disk -M -w $ARG1$ -c $ARG2$ -p $ARG3$
        }

define command{
        command_name    check_remote_disk
        command_line    /usr/bin/ssh nagios@$HOSTADDRESS$ $USER2$/check_disk -M -w $ARG1$ -c $ARG2$ -p $ARG3$
        }

In resource.cfg I have:

$USER1$=/usr/local/nagios/libexec
$USER2$=/usr/local/nagios/home

One last thing to do is log on to the nagios box, su - nagios, and then ssh to the target box so that the host fingerprint gets saved on the nagios server side. After it asks you whether you want to save the host fingerprint, you should see something like this:

[root@nagios stucky]# su - nagios
-bash-3.00$ ssh ns1
Use of uninitialized value in string eq at /usr/local/nagios/home/acl_agent line 29, <ACL> line 20.
[... the same warning repeated, once per acl line ...]
Check acl...Connection to ns1 closed.

(The warnings appear because SSH_ORIGINAL_COMMAND is unset on an interactive login, so $cmd is undef in the string comparison.)

This is all I can think of. Right now my test fedora-ds box is also checked by nagios, but the difference is that this box points to itself, whereas all other boxes point to ldap1 and ldap2. So we'd assume that memory consumption should rise more slowly here, since nagios only hits this box for itself every now and then. That seems to be the case, since mem% has risen from 1.3% yesterday to 2.0%. My production server, however, is already at 14.6% again.
Btw, here is the output of a typical acl file. These are the checks I run on every machine (about 30):

/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /boot
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /var
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /tmp
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /usr
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /usr/local
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /home
/usr/local/nagios/home/check_mem -f -w 10 -c 5
/usr/local/nagios/home/check_swap 15 30
/usr/local/nagios/home/check_load -w 15 10 5 -c 30 25 20
/usr/local/nagios/home/check_procs -w 1:1 -c 1:2 -C syslog-ng
/usr/local/nagios/home/check_procs -w 1:1 -c 1:2 -C master
/usr/local/nagios/home/check_procs -w 1:2 -c 1:3 -C crond
/usr/local/nagios/home/check_procs -w 1:1 -c 1:2 -C MegaServ
/usr/local/nagios/home/check_procs -w 1 -c 2 -s Z
sudo /usr/local/nagios/home/check_duplex
sudo /etc/init.d/raidmon start
/usr/local/nagios/home/check_procs -w 1:1 -c 1:2 -C cfexecd
sudo /etc/init.d/cfexecd start

You can prolly run any checks you want; I don't think it matters which ones, as long as you run about the same amount.
Also, here is my services template from /usr/local/nagios/etc/services.cfg:

define service{
        name                            generic-service ; The 'name' of this service template, referenced in other service definitions
        active_checks_enabled           1       ; Active service checks are enabled
        passive_checks_enabled          1       ; Passive service checks are enabled/accepted
        parallelize_check               1       ; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             0       ; We should obsess over this service (if necessary)
        check_freshness                 0       ; Default is to NOT check service 'freshness'
        notifications_enabled           1       ; Service notifications are enabled
        event_handler_enabled           1       ; Service event handler is enabled
        flap_detection_enabled          0       ; Flap detection is disabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1       ; Retain status information across program restarts
        retain_nonstatus_information    1       ; Retain non-status information across program restarts
        is_volatile                     0
        retry_check_interval            1       ; Re-check every minute if state has changed to non-ok
        notification_options            w,u,c,r ; Notify for all state changes
        check_period                    24x7
        notification_period             24x7
        notification_interval           720     ; Send notifications every 12 hours
        register                        0       ; DON'T REGISTER THIS DEFINITION - IT'S NOT A REAL SERVICE, JUST A TEMPLATE!
        }

The check interval is mostly one minute, but for some services it's 3 or 5. Hope this gets you started. Thx guys!
Just FYI: ns-slapd on my test server is now at 4.7% mem usage, and ps -eo comm,vsz,etime | grep slapd says:

ns-slapd 531352 5-17:32:23

so it seems to confirm my theory.
Does nagios use SNMP?
no - it just execs the plugins in /usr/local/nagios/home via ssh
Make sure you have the following entries in /usr/local/nagios/home/acl:

/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /boot
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /var
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /tmp
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /usr
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /usr/local
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /home
/usr/local/nagios/home/check_mem -f -w 10 -c 5
/usr/local/nagios/home/check_swap 15 30
/usr/local/nagios/home/check_load -w 15 10 5 -c 30 25 20
/usr/local/nagios/home/check_procs -w 1:1 -c 1:2 -C syslog-ng
/usr/local/nagios/home/check_procs -w 1:1 -c 1:2 -C master
/usr/local/nagios/home/check_procs -w 1:2 -c 1:3 -C crond
/usr/local/nagios/home/check_procs -w 1:1 -c 1:2 -C MegaServ
/usr/local/nagios/home/check_procs -w 1 -c 2 -s Z
sudo /usr/local/nagios/home/check_duplex
sudo /etc/init.d/raidmon start
/usr/local/nagios/home/check_procs -w 1:1 -c 1:2 -C cfexecd
sudo /etc/init.d/cfexecd start

You already have the acl_agent perl script. All plugins are the regular ones except check_duplex, which I wrote:

#!/usr/bin/perl -w

my %EXIT = (OK => 0, CRITICAL => 2);
my @output = `/sbin/ip add`;

foreach (@output) {
    if (m/\w+ eth\d.*$/) {
        s/.+ (eth\d.*)/$1/;
        chomp;
        $eth = $_;
        $_ = `/sbin/ethtool $eth`;
        if (m/Duplex: Half/) {
            print "CRITICAL: $eth is on half duplex !";
            exit ($EXIT{CRITICAL});
        }
    }
}
print "OK: all interfaces are on full duplex ";
exit ($EXIT{OK});

It needs to run via sudo without a password, since ethtool requires root privileges even for read ops. Hence we need this in sudoers:

User_Alias NAGIOS=nagios
Cmnd_Alias RAIDMON=/etc/init.d/raidmon start
Cmnd_Alias CFEXECD=/etc/init.d/cfexecd start
Cmnd_Alias DUPLEX=/usr/local/nagios/home/check_duplex
NAGIOS ALL=NOPASSWD: RAIDMON,CFEXECD,DUPLEX

raidmon and cfexecd are set here so that nagios can restart those services via an event handler.
I don't think you need to bother with that. Here is the section from services.cfg that checks the basic stuff:

define service{
        use                     generic-service
        hostgroup_name          generic-linux
        service_description     DISK USAGE: /
        check_command           check_remote_disk!15%!5%!/
        max_check_attempts      5
        normal_check_interval   5
        contact_groups          admins
        }

define service{
        use                     generic-service
        hostgroup_name          generic-linux
        service_description     DISK USAGE: /BOOT
        check_command           check_remote_disk!15%!5%!/boot
        max_check_attempts      5
        normal_check_interval   5
        contact_groups          admins
        }

define service{
        use                     generic-service
        hostgroup_name          generic-linux
        service_description     DISK USAGE: /TMP
        check_command           check_remote_disk!15%!5%!/tmp
        max_check_attempts      5
        normal_check_interval   5
        contact_groups          admins
        }

define service{
        use                     generic-service
        hostgroup_name          generic-linux
        service_description     DISK USAGE: /VAR
        check_command           check_remote_disk!15%!5%!/var
        max_check_attempts      5
        normal_check_interval   5
        contact_groups          admins
        }

define service{
        use                     generic-service
        hostgroup_name          generic-linux
        service_description     DISK USAGE: /USR
        check_command           check_remote_disk!15%!5%!/usr
        max_check_attempts      5
        normal_check_interval   5
        contact_groups          admins
        }

define service{
        use                     generic-service
        hostgroup_name          generic-linux
        service_description     DISK USAGE: /USR/LOCAL
        check_command           check_remote_disk!15%!5%!/usr/local
        max_check_attempts      5
        normal_check_interval   5
        contact_groups          admins
        }

define service{
        use                     generic-service
        hostgroup_name          generic-linux
        service_description     DISK USAGE: /HOME
        check_command           check_remote_disk!15%!5%!/home
        max_check_attempts      5
        normal_check_interval   5
        contact_groups          admins
        }

define service{
        use                     generic-service
        hostgroup_name          generic-linux
        service_description     MEMORY USAGE
        check_command           check_remote_mem!10!5
        max_check_attempts      5
        normal_check_interval   5
        contact_groups          admins
        }

define service{
        use                     generic-service
        hostgroup_name          generic-linux
        service_description     NR. OF PROCS: SYSLOG-NG
        check_command           check_remote_procs!1:1!1:2!syslog-ng
        max_check_attempts      2
        normal_check_interval   1
        contact_groups          admins
        }

define service{
        use                     generic-service
        hostgroup_name          generic-linux
        service_description     NR. OF PROCS: POSTFIX
        check_command           check_remote_procs!1:1!1:2!master
        max_check_attempts      2
        normal_check_interval   1
        contact_groups          admins
        }

define service{
        use                     generic-service
        hostgroup_name          generic-linux
        service_description     NR. OF PROCS: CRON
        check_command           check_remote_procs!1:2!1:3!crond
        max_check_attempts      2
        normal_check_interval   1
        contact_groups          admins
        }

define service{
        use                     generic-service
        hostgroup_name          generic-linux
        service_description     NR. OF PROCS: RAID MONITOR
        check_command           check_remote_procs!1:1!1:2!MegaServ
        event_handler           raidmon_restart_remote
        max_check_attempts      3
        normal_check_interval   1
        contact_groups          admins
        }

define service{
        use                     generic-service
        hostgroup_name          generic-linux
        service_description     NR. OF ZOMBIES
        check_command           check_remote_zombies!1!2!Z
        max_check_attempts      3
        normal_check_interval   3
        contact_groups          admins
        }

define service{
        use                     generic-service
        hostgroup_name          generic-linux
        service_description     LOAD AVERAGE
        check_command           check_remote_load!'15 10 5'!'30 25 20'
        max_check_attempts      5
        normal_check_interval   5
        contact_groups          admins
        }

define service{
        use                     generic-service
        hostgroup_name          generic-linux
        service_description     SWAP
        check_command           check_remote_swap!15!30
        max_check_attempts      5
        normal_check_interval   5
        contact_groups          admins
        }

define service{
        use                     generic-service
        hostgroup_name          generic-linux
        service_description     DUPLEX SETTINGS
        check_command           check_remote_duplex
        max_check_attempts      2
        normal_check_interval   1
        contact_groups          admins
        }

That should really be most of the info you need. Thx guys !!
One more thing: /etc/ldap.conf on all RHEL4 clients looks like this:

URI ldaps://host1.idf.net ldaps://host2.idf.net
bind_timelimit 5
bind_policy soft
pam_lookup_policy yes
BASE dc=idf,dc=net
TLS_CACERTDIR /etc/openldap/cacerts
start_tls no
ssl on
tls_checkpeer yes
TLS_REQCERT demand

and /etc/pam.d/system-auth looks like this:

auth        required      /lib/security/$ISA/pam_env.so
auth        sufficient    /lib/security/$ISA/pam_unix.so likeauth nullok
auth        sufficient    /lib/security/$ISA/pam_ldap.so use_first_pass
auth        required      /lib/security/$ISA/pam_deny.so
account     required      /lib/security/$ISA/pam_access.so
account     required      /lib/security/$ISA/pam_unix.so broken_shadow
account     sufficient    /lib/security/$ISA/pam_succeed_if.so uid < 100 quiet
account     sufficient    /lib/security/$ISA/pam_localuser.so
account     [default=bad success=ok user_unknown=ignore service_err=ignore system_err=ignore authinfo_unavail=ignore] \
            /lib/security/$ISA/pam_ldap.so
account     required      /lib/security/$ISA/pam_permit.so
password    requisite     /lib/security/$ISA/pam_cracklib.so retry=3
password    sufficient    /lib/security/$ISA/pam_unix.so nullok use_authtok md5 shadow
password    sufficient    /lib/security/$ISA/pam_ldap.so use_authtok
password    required      /lib/security/$ISA/pam_deny.so
session     required      /lib/security/$ISA/pam_limits.so
session     required      /lib/security/$ISA/pam_unix.so
session     optional      /lib/security/$ISA/pam_ldap.so
Could there be any errors in slapd-<id>/logs/errors or on the mod_auth_ldap/nagios side?
No, neither log shows errors. I don't think it has to do with mod_auth_ldap anyway; there is little traffic from that.
We are thinking you might have hit this memory leak bug in NSS 3.11:

> This leak was introduced in NSS 3.11 and has been fixed in NSS 3.11.1. See
> https://bugzilla.mozilla.org/show_bug.cgi?id=336335#c9.

To fix it, you need to replace NSPR 4.6 with 4.6.2 and NSS 3.11 with 3.11.1 in your Fedora Directory Server. The binaries are not available yet on the mozilla site. If you are interested, could you check out the libraries:

$ export CVSROOT=:pserver:anonymous.org:/cvsroot
$ cvs -z3 co -r NSPR_4_6_2_RTM mozilla/nsprpub
$ cvs -z3 co -r NSS_3_11_1_RTM mozilla/dbm mozilla/security/dbm mozilla/security/coreconf mozilla/security/nss

and build them following the instructions found here?
http://directory.fedora.redhat.com/wiki/Building#Mozilla.org_components

Then shut down the Directory Server and copy the built libraries to the server lib directory as follows:

$ cd <mozilla_root>/mozilla/dist/<PlatformInfo_glibc_PTH_DBG_or_OPT.OBJ>/lib
$ cp *.{so,chk} /opt/fedora-ds/bin/slapd/lib

Unfortunately, this will "break" RPM if the files are replaced. So please be careful, keep backups of the files, and run your test. Also, when the new NSPR/NSS libraries are ready to download at the mozilla ftp site, we will announce it on http://directory.fedora.redhat.com/wiki/. Thanks.
Hosoi-san, you hit the spot! I compiled the new libs, replaced them, and after 2 days ns-slapd is at a stable 1.8% mem usage.

One interesting thing I noticed is that both my test and prod LDAP servers behaved the exact same way once I changed the libs. The prod server used to start at 1.5% but now goes to 1.7% almost right away. After about 12 hours it went to 1.8%; then after another 10 hours it went to 1.9%. I started thinking the problem wasn't fixed. However, just about 1 hour later it went back to 1.8%, something that had never happened before. It's been running for 2 days 23 hours at a stable 1.8%. The test server had shown the same behaviour, except that the mem usage was lower due to the fact that it only uses itself as a client. I feel pretty good about this at this point but will keep watching for long-term effects. Thx so far, guys.
Maybe I spoke too soon after all. I looked at my production server today, and mem usage changed again. After 4 days and 10 hours it went to 1.9% again, only to go back to 1.8% after 2 hours. Then after a total of 4 days and 13 hours it suddenly went to 2.2%. That's where it's been. At this point I'm not sure whether this is normal behaviour or if the bug is still present in a slightly different form. I'll keep you posted.
That may very well be normal behavior, if it needs to allocate more memory from the OS for some cache. At any rate, it certainly doesn't seem to be the same bug that you originally reported.
Created attachment 131385 [details] hourly snapshots of ns-slapd mem and cpu usage after NSS and NSPR upgrade
Hmm - what happened between these two entries?

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
10880 ldap      15   0  545m  48m  14m S  0.0  2.4  1347:41  ns-slapd
ns-slapd  558796  8-01:18:13

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
10880 ldap      16   0  658m 145m  14m S 61.3  7.2  1354:59  ns-slapd
ns-slapd  673864  8-02:18:12
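The jump between those two hourly samples is easy to quantify (a quick sketch using the two VSZ values above):

```shell
# VSZ jump between the two hourly samples above (values in KB)
echo $(( 673864 - 558796 ))   # prints 115068, i.e. ~112 MB in one hour
```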
Guys, please look at the latest attachment. It seems to me the new libs don't fix the leak but simply delay it. Also, can you tell me if you think the CPU% values are normal? I get this output by running the following via cron.hourly:

#!/bin/bash
top -bn1 | grep -e ns-slapd -e PID >> /tmp/ns-slapd_mem
ps -eo comm,vsz,etime | grep ns-slapd >> /tmp/ns-slapd_mem
echo -e "\n" >> /tmp/ns-slapd_mem
Oh damn, Rich, you're fast. Well, that's exactly it: nothing happened. The server just keeps running and serving the same amount of clients in the same manner. Yet, as time passes, mem usage goes up again, only much slower now.
This may just be normal caching behavior. Over time, as the number of entries grows, more entries will be added to the cache, and more memory will be required to cache those entries. It should eventually level out at a fairly low percentage of your system RAM.
OK, I just tested something. I usually use phpldapadmin to make simple changes like adding a netgroup. I logged onto phpldapadmin, and after about 30 seconds mem usage changed from 7.2% to 7.4%. I admit I use phpldapadmin a lot, and there seems to be a correlation. Are you saying, however, that 7.4% is still an OK value? At what point should I worry again: 50%, 60%?
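As a rough yardstick (my own back-of-the-envelope estimate, not an official threshold): on this 2 GB box, each percentage point of mem% corresponds to about 21 MB of resident memory, so 7.4% is roughly 150 MB:

```shell
# 7.4% of this box's 2 GB of RAM (2 GB = 2097152 KB), in whole KB
echo $(( 2097152 * 74 / 1000 ))   # prints 155189, i.e. ~151 MB resident
```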
I know phpldapadmin does a _lot_ of searches, and probably hits a lot of entries that were not in the cache. I really can't say at what point you should start to worry; it's really a function of the number of entries, average size per entry, number of indexes, type of indexes, other database overhead, and replication overhead. As a rough rule of thumb, though, memory usage should not be significantly more than the size of the database on disk, which is the size of all of the files under /opt/fedora-ds/slapd-instance/db, minus the __db.XXX files.
OK, ns-slapd uses 673864 KB right now, and the files in the db directory have the following sizes:

[root@ldap1 db]# du -h ./*
12K     ./__db.001
5.5M    ./__db.002
548K    ./__db.003
3.8M    ./__db.004
28K     ./__db.005
4.0K    ./DBVERSION
4.0K    ./guardian
8.4M    ./log.0000000029
372K    ./NetscapeRoot
492K    ./userRoot

I assume you mean the size of NetscapeRoot and userRoot combined makes up the database on disk, but that's only 864 KB. That versus roughly 658 MB of process memory???
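To make the comparison concrete, summing just the two backend directories from the du output above (a sketch; du's K figures are rounded):

```shell
# NetscapeRoot (372K) + userRoot (492K) from the du output above, in KB
echo $(( 372 + 492 ))   # prints 864, the 864 KB figure mentioned above
```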
It's not exactly that size; the size in memory is quite a bit larger than the size on disk. You also need to include the size of log.0000000029 in the db size, which adds another 8.4 MB; that file is the transaction log for your database. slapd also caches lots of other things besides the straight database files, but it's usually some function of the size of the data. The point is that at some point slapd memory usage should level off, but I'm not exactly sure at what point that is.
ok thanks Rich. I will keep watching and let you know if anything major happens.
Just checking. Some other users have reported memory leakage. I just wanted to know if you have had any more problems since replacing nspr/nss.
nope - I had actually forgotten about it since it's been fine. After 41 days 17 hours it's at a stable 2.7% which is fine with me. thx for checking in though !
New NSPR 4.6.2/NSS 3.11.1 binaries for RHEL4 x86_64: http://directory.fedora.redhat.com/download/nspr-4.6.2-nss-3.11.1-RHEL4-x86_64.tar.gz
New NSPR 4.6.2/NSS 3.11.1 binaries for RHEL4 i386 (32 bit): http://directory.fedora.redhat.com/download/nspr-4.6.2-nss-3.11.1-RHEL4-i386.tar.gz
The next version of Fedora DS will use newer versions of NSPR and NSS that fix this problem.
Fedora DS 1.0.3 includes nss 3.11.3
Verified.