193043 – (anonmemleak) memory leak in 1.0.2

Summary: memory leak in 1.0.2

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	anonmemleak
Product:	389
Classification:	Retired
Component:	Performance
Sub Component:
Version:	1.0
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Rich Megginson
QA Contact:	Viktor Ashirov
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	152373 fds103trackingbug 240316

Reported:	2006-05-24 19:26 UTC by Alex Stuck
Modified:	2015-12-07 16:58 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2015-12-07 16:58:52 UTC
Embargoed:

Attachments
output of logconf.pl (8.37 KB, text/plain) 2006-05-24 19:26 UTC, Alex Stuck
hourly snapshots of ns-slapd mem and cpu usage after NSS and NSPR upgrade (41.59 KB, text/plain) 2006-06-22 20:21 UTC, Alex Stuck

Description Alex Stuck 2006-05-24 19:26:54 UTC

Description of problem:
ns-slapd seems to leak memory when the server is hit with lots of anonymous
binds, when there is a significant difference between the number of binds and
unbinds, and when more connections end with abnormal connection codes than are
closed cleanly. Bad BER tags may also have something to do with it.


Version-Release number of selected component (if applicable):
1.0.2


How reproducible:
100%

Steps to Reproduce:
1. run ldaps
2. wait and watch mem% grow
3. bounce slapd and memory usage goes back to 1.7%
4. everything starts over again
  

Actual results:

time elapsed            memusage        VSZ in KB (ps -eo comm,vsz,etime | grep ns-slapd)

00:19                   1.7%            523152
01:12:21                2.7%            555908
02:01:58                3.4%            569820
03:08:03                4.2%            588000
08:34:44                8.5%            687532
19:40:37                17.4%           872868
23:50:47                20.8%           945332
1-03:08:53              23.4%           998068
1-19:11:58              36.2%           1265692
1-21:11:53              37.8%           1296752

This goes on till ns-slapd runs out of memory and spits out a malloc error.
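
To put a number on the growth in the table above, the first and last samples can be compared. The following is a minimal POSIX shell sketch, not something taken from the report itself; the function names `etime_to_seconds` and `growth_kb_per_hour` are made up for illustration, and it assumes the `[[dd-]hh:]mm:ss` elapsed-time format that `ps` prints for `etime`.

```sh
# Convert a ps "etime" value ([[dd-]hh:]mm:ss) to seconds.
etime_to_seconds() {
    printf '%s\n' "$1" | awk -F'[-:]' '{
        s = 0
        if (NF == 4)      s = $1 * 86400 + $2 * 3600 + $3 * 60 + $4
        else if (NF == 3) s = $1 * 3600 + $2 * 60 + $3
        else if (NF == 2) s = $1 * 60 + $2
        print s
    }'
}

# Average VSZ growth in KB per hour between two samples.
# Arguments: first_etime first_vsz_kb last_etime last_vsz_kb
growth_kb_per_hour() {
    t0=$(etime_to_seconds "$1")
    t1=$(etime_to_seconds "$3")
    awk -v t0="$t0" -v v0="$2" -v t1="$t1" -v v1="$4" 'BEGIN {
        if (t1 <= t0) exit 1          # avoid dividing by zero
        printf "%.0f\n", (v1 - v0) * 3600 / (t1 - t0)
    }'
}
```

For the first and last rows of the table, `growth_kb_per_hour 00:19 523152 1-21:11:53 1296752` prints 17118, i.e. roughly 17 MB of additional virtual size per hour.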

Expected results:

ns-slapd remains at 1.7% - 2% memory usage.

Additional info: checking for current connections shows no more than about 5 - 10
most of the time.
Idle timeout is set to 120 seconds.

dn: cn=config,cn=ldbm database,cn=plugins,cn=config
nsslapd-dbcachesize: 10485760
nsslapd-import-cache-autosize: -1
nsslapd-import-cachesize: 20000000

dn: cn=NetscapeRoot,cn=ldbm database,cn=plugins,cn=config
nsslapd-cachesize: -1
nsslapd-cachememsize: 10485760

dn: cn=userRoot,cn=ldbm database,cn=plugins,cn=config
nsslapd-cachesize: -1
nsslapd-cachememsize: 10485760

box is PowerEdge 1850 with 2 gig physical memory and 4 gig swap.

Comment 1 Alex Stuck 2006-05-24 19:26:54 UTC

Created attachment 129960
output of logconf.pl

Comment 2 Alex Stuck 2006-05-24 19:30:43 UTC

oops..forgot to mention OS is RHEL4 kernel 2.6.9-11.ELsmp

Comment 3 Rich Megginson 2006-05-24 21:40:16 UTC

What are the clients?  Do you have any routers/switches/load balancer boxes that
contact the LDAP server?  Can you reproduce this with just plain ldapsearch?

Comment 4 Alex Stuck 2006-05-24 23:39:25 UTC

ldap stores posixaccounts, posixgroups, nisnetgroups and regular ldap groups

clients are mostly rhel4 and some rhel3 via nss_ldap. pam_access is used for
authentication here.

One rhel4 box uses mod_auth_ldap from this guy:
http://muquit.com/muquit/software/mod_auth_ldap/mod_auth_ldap.html
for nagios and some other webapps auth. I remember I compiled this from source
because it supports nested groups, which I like, so I can have a statement in the
config that says:
  require group cn=nagios,ou=Apps,ou=Groups
but I don't have to be directly in the nagios group - instead the sysadmins group
is a member of the nagios group..well you know how this works.

Cfengine has all its classes stored in ldap as netgroups. It runs on all clients
once an hour.

One nfs server uses netgroups to secure nfs via /etc/exports

/etc/sudoers uses 2 netgroups

On top of that we have 3 F5 bigip loadbalancers and 2 Alterpath serial console
appliances pointing to ldap for userauth.

I've just finished installing a 3rd fedora-ds server on a test box - same setup
but no-one is pointing to it right now. I imported a backup from the production
server to it. So far memory is at a stable 1.1%.
I will try to hit it with traffic slowly and see what happens.

Comment 5 Alex Stuck 2006-05-27 05:59:10 UTC

I just narrowed it down a bit. My nagios box went down today for half a day and
I noticed how memory consumption stopped. I bounced the daemon and cleaned the
logfiles. Then I took these values :

after 54s mem% was 1.2% (523520)
after 02:09:27 it had only risen to 1.4% (527568) when usually it'd be around
3.4% by now.
My nagios box was recovered and since it's prod I had to let it run again and
sure enough, now that nagios has started its thing again, I took another value:
After 03:17:44 memusage had risen to 2.3%

logconv shows this : 

49000           (&(objectclass=posixgroup)(memberuid=nagios))
45014           (&(objectclass=posixaccount)(uid=nagios))

What I don't get is the second line. nagios is a local user that the nagios
server uses to ssh into every box via an ssh-key. Why does ldap get queried for
this uid ? And why are there sooo many queries in such a short time ? 
Secondly, even if the server gets thousands of those queries it still should be
able to handle them w/o leaking memory like that right ?
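
Counts like the ones logconv reports can be cross-checked straight from the access log. This is a rough sketch, assuming the usual access log layout in which search operations are logged on lines containing ` SRCH ` and `filter="..."`; `top_filters` is a hypothetical helper name, not an existing tool.

```sh
# Read access log lines on stdin and print "count<TAB>filter",
# most frequent search filter first.
top_filters() {
    # Keep only SRCH lines and reduce each to the text inside filter="...".
    sed -n 's/.* SRCH .*filter="\([^"]*\)".*/\1/p' |
        awk '{ n[$0]++ } END { for (f in n) print n[f] "\t" f }' |
        sort -rn
}
```

It could be run as `top_filters < /opt/fedora-ds/slapd-<id>/logs/access | head`, where the log path is an assumption based on the default install layout mentioned elsewhere in this report.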

I was also able to reproduce this on my test fedora-ds box. It had been running
with nothing pointing to it except itself for 2 days and stayed at 1.1%
memusage. As soon as I turned on the nagios check it went up to 1.3%. I will let
it run and see if it's gonna leak all the way to the end.

Comment 6 Rich Megginson 2006-05-27 14:22:39 UTC

Excellent detective work!  This helps us considerably.  What version of nagios
are you running?  We can set up nagios here and reproduce the problem.

Comment 7 Alex Stuck 2006-05-27 21:22:22 UTC

ok here are the details.
I have a rhel4 box that runs Nagios 2.0b4 compiled from source.
I used to work with a product called sitescope which I didn't like. However one
thing it did was cool I thought - it was agentless.
I borrowed that idea for nagios.
I image all my machines and this image has a local user already called nagios.

nagios:x:503:503::/usr/local/nagios/home:/bin/bash

It also has this set :

[root@ns1 stucky]# ls -l /usr/local/nagios/home/
total 152
-r--------  1 nagios nagios  1066 May 27 14:00 acl
-r-x------  1 nagios nagios   962 Oct  1  2005 acl_agent
-r-x------  1 nagios nagios 58548 Jul  8  2005 check_disk
-r-x------  1 nagios nagios   441 Oct  6  2005 check_duplex
-r-x------  1 nagios nagios 37528 Jul  8  2005 check_load
-r-x------  1 nagios nagios   686 Oct 12  2005 check_mailq
-r-x------  1 nagios nagios  4695 Jul  8  2005 check_mem
-r-x------  1 nagios nagios 19180 Jul  8  2005 check_procs
-r-x------  1 nagios nagios   854 Jul  8  2005 check_swap

I also moved the authorized_keys2 file away from user control in sshd_config:

AuthorizedKeysFile      /etc/ssh/keys/%u/authorized_keys2

so each machine has this in /etc/ssh/keys/nagios/authorized_keys2:

from="{nagiosip}",command="/usr/local/nagios/home/acl_agent",no-port-forwarding,no-X11-forwarding,no-agent-forwarding ssh-dss ......

I wrote acl_agent as a wrapper to check what is passed to sshd :

#!/usr/bin/perl -w

# This script runs every time the nagios box uses its key to run a command remotely.
# Instead of running the requested command directly, sshd execs this wrapper as a sanity check.
# If the command matches one of those pre-defined in the acl, this script runs it.
# Otherwise it exits with status 2.

use File::stat;
use User::pwent;

my $cmd = $ENV{"SSH_ORIGINAL_COMMAND"};
my $nagios_home = "/usr/local/nagios/home";
my $acl = "$nagios_home/acl";
my $st = stat($acl) || die "File 'acl' not found or inaccessible...";
my $pw = getpwnam ('nagios') || die "User 'nagios' doesn't exist...";

if ($st->mode != 33024 || $st->uid != $pw->uid) {   # 33024 == 0100400: regular file, mode 0400
   print "Check owner/permissions of file 'acl'...";
   exit 2;
}
elsif ($pw->dir ne $nagios_home) {
   print "Check homedir of user 'nagios'...";
   exit 2;
}
    
open (ACL, $acl) || die "Cannot open file 'acl'...";
foreach (<ACL>) {
    chomp;
    if ($cmd eq $_) {
       system ($cmd);
       exit ($?>>8);
    }
}
print "Check acl...";
close ACL;
exit 2;

as you can guess the acl contains the commands nagios is allowed to exec
remotely on this box.

/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /home
/usr/local/nagios/home/check_mem -f -w 10 -c 5
/usr/local/nagios/home/check_swap 15 30
and so forth..

One other thing that is prolly special about my nagios setup is that when I set
this up I wasn't aware that someone had already written a plugin called
check_by_ssh so I set up my own stuff.

basically in checkcommands.cfg I have a local and a remote definition for
plugins. Here is an example for the disk check:

define command{
        command_name    check_local_disk
        command_line    $USER1$/check_disk -M -w $ARG1$ -c $ARG2$ -p $ARG3$
        }

define command{
        command_name    check_remote_disk
        command_line    /usr/bin/ssh nagios@$HOSTADDRESS$ $USER2$/check_disk -M -w $ARG1$ -c $ARG2$ -p $ARG3$
        }

in resource.cfg I have:

$USER1$=/usr/local/nagios/libexec
$USER2$=/usr/local/nagios/home
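
To make the macro substitution concrete: Nagios splits a service's check_command value on `!` and fills `$ARG1$`, `$ARG2$`, ... in the matching command_line. The shell sketch below only mimics that one substitution step for illustration. `expand_args` is a hypothetical helper, and Nagios itself does considerably more (for example resolving `$USERn$` and `$HOSTADDRESS$`).

```sh
# Arguments: command_line template, check_command value ("name!arg1!arg2...").
# Prints the template with $ARGn$ replaced by the n-th argument.
expand_args() {
    printf '%s\n' "$2" | awk -F'!' -v tpl="$1" '{
        # Field 1 is the command name; fields 2..NF are ARG1..ARGn.
        for (i = 2; i <= NF; i++) {
            gsub("[$]ARG" (i - 1) "[$]", $i, tpl)
        }
        print tpl
    }'
}
```

For example, `expand_args '/usr/bin/ssh nagios@$HOSTADDRESS$ $USER2$/check_disk -M -w $ARG1$ -c $ARG2$ -p $ARG3$' 'check_remote_disk!15%!5%!/home'` leaves the other macros alone and ends in `-w 15% -c 5% -p /home`. Arguments containing `&` or backslashes would need escaping before being passed to `gsub`, which this sketch does not attempt.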

One last thing to do is log on to the nagios box, do a su - nagios and then ssh
to the box so that the host fingerprint gets saved on the nagios server side.
After it asks you if you wanna save the host fingerprint you should see
something like this:

[root@nagios stucky]# su - nagios
-bash-3.00$ ssh ns1
Use of uninitialized value in string eq at /usr/local/nagios/home/acl_agent line 29, <ACL> line 20.
[the same warning is printed 20 times in total]
Check acl...Connection to ns1 closed.
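
The warnings above most likely come from the interactive login itself: with no remote command given, sshd does not set SSH_ORIGINAL_COMMAND, so $cmd is undefined when it is compared against each acl line. Below is a minimal shell sketch of the same allowlist idea that treats a missing command as "not allowed" without producing warnings. `acl_allows` is a hypothetical helper, not part of the setup described here.

```sh
# Succeed only if the requested command line matches an acl entry exactly.
# Arguments: path to the acl file, requested command line (may be empty).
acl_allows() {
    acl_file=$1
    requested=$2
    # An interactive login passes no command; reject it instead of comparing.
    [ -n "$requested" ] || return 1
    # -F: fixed string, -x: whole line must match, -q: no output.
    grep -Fxq -- "$requested" "$acl_file"
}
```

A wrapper could call it as `acl_allows /usr/local/nagios/home/acl "${SSH_ORIGINAL_COMMAND:-}" || exit 2` before running the command.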

This is all I can think of.

Right now my test fedora-ds box is also checked by nagios but the difference is
that this box points to itself whereas all other boxes point to ldap1 and ldap2.
So we'd assume that memory consumption should rise slower here since nagios
only hits this box for itself every now and then. It seems to be the case since
mem% has risen from 1.3% yesterday to 2.0%.

My production server however is already at 14.6% again.

Comment 8 Alex Stuck 2006-05-27 21:35:20 UTC

btw. here is the output of a typical acl file. These are the checks I run on
every machine (about 30):

/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /boot
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /var
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /tmp
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /usr
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /usr/local
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /home
/usr/local/nagios/home/check_mem -f -w 10 -c 5
/usr/local/nagios/home/check_swap 15 30
/usr/local/nagios/home/check_load -w 15 10 5 -c 30 25 20
/usr/local/nagios/home/check_procs -w 1:1 -c 1:2 -C syslog-ng
/usr/local/nagios/home/check_procs -w 1:1 -c 1:2 -C master
/usr/local/nagios/home/check_procs -w 1:2 -c 1:3 -C crond
/usr/local/nagios/home/check_procs -w 1:1 -c 1:2 -C MegaServ
/usr/local/nagios/home/check_procs -w 1 -c 2 -s Z
sudo /usr/local/nagios/home/check_duplex
sudo /etc/init.d/raidmon start
/usr/local/nagios/home/check_procs -w 1:1 -c 1:2 -C cfexecd
sudo /etc/init.d/cfexecd start

you can prolly run any checks you want, I don't think it matters which ones as
long as you run about the same amount.
Also here is my services template from /usr/local/nagios/etc/services.cfg

define service{
               name                            generic-service ; The 'name' of this service template, referenced in other service definitions
               active_checks_enabled           1               ; Active service checks are enabled
               passive_checks_enabled          1               ; Passive service checks are enabled/accepted
               parallelize_check               1               ; Active service checks should be parallelized (disabling this can lead to major performance problems)
               obsess_over_service             0               ; Do not obsess over this service
               check_freshness                 0               ; Default is to NOT check service 'freshness'
               notifications_enabled           1               ; Service notifications are enabled
               event_handler_enabled           1               ; Service event handler is enabled
               flap_detection_enabled          0               ; Flap detection is disabled
               process_perf_data               1               ; Process performance data
               retain_status_information       1               ; Retain status information across program restarts
               retain_nonstatus_information    1               ; Retain non-status information across program restarts
               is_volatile                     0
               retry_check_interval            1               ; Re-check every minute if state has changed to non-ok
               notification_options            w,u,c,r         ; Notify for all state changes
               check_period                    24x7
               notification_period             24x7
               notification_interval           720             ; Send notifications every 12 hours

               register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
              }


The check interval is mostly one minute but for some services it's 3 or 5.

Hope this gets you started..thx guys

Comment 9 Alex Stuck 2006-05-30 17:46:18 UTC

just fyi...ns-slapd on my test server is now 4.7% memusage and

ps -eo comm,vsz,etime | grep slapd says :

ns-slapd       531352  5-17:32:23

so it seems to confirm my theory

Comment 10 Rich Megginson 2006-05-30 21:36:41 UTC

Does nagios use SNMP?

Comment 11 Alex Stuck 2006-05-30 21:43:22 UTC

no - it just execs the plugins in /usr/local/nagios/home via ssh

Comment 12 Alex Stuck 2006-05-31 00:49:52 UTC

Make sure you have the following entries in /usr/local/nagios/home/acl:

/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /boot
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /var
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /tmp
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /usr
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /usr/local
/usr/local/nagios/home/check_disk -M -w 15% -c 5% -p /home
/usr/local/nagios/home/check_mem -f -w 10 -c 5
/usr/local/nagios/home/check_swap 15 30
/usr/local/nagios/home/check_load -w 15 10 5 -c 30 25 20
/usr/local/nagios/home/check_procs -w 1:1 -c 1:2 -C syslog-ng
/usr/local/nagios/home/check_procs -w 1:1 -c 1:2 -C master
/usr/local/nagios/home/check_procs -w 1:2 -c 1:3 -C crond
/usr/local/nagios/home/check_procs -w 1:1 -c 1:2 -C MegaServ
/usr/local/nagios/home/check_procs -w 1 -c 2 -s Z
sudo /usr/local/nagios/home/check_duplex
sudo /etc/init.d/raidmon start
/usr/local/nagios/home/check_procs -w 1:1 -c 1:2 -C cfexecd
sudo /etc/init.d/cfexecd start

You already have the acl_agent perl script

all plugins are regular except check_duplex which I wrote :

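#!/usr/bin/perl -w

# Nagios plugin: report CRITICAL if any ethN interface runs at half duplex.
my %EXIT = (OK       => 0,
            CRITICAL => 2);

# "ip add" is short for "ip address"; interface names end the inet lines.
my @output = `/sbin/ip add`;

foreach (@output) {
   if (m/\w+ eth\d.*$/) {
      s/.+ (eth\d.*)/$1/;          # keep only the interface name
      chomp;
      my $eth = $_;
      $_ = `/sbin/ethtool $eth`;
      if (m/Duplex: Half/) {
           print "CRITICAL: $eth is on half duplex !";
           exit ($EXIT{CRITICAL});
      }
   }
}
print "OK: all interfaces are on full duplex ";
exit ($EXIT{OK});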

It needs to run via sudo w/o password since ethtool requires root priv even for
read ops. Hence we need this in sudoers:

User_Alias      NAGIOS=nagios

Cmnd_Alias      RAIDMON=/etc/init.d/raidmon start
Cmnd_Alias      CFEXECD=/etc/init.d/cfexecd start
Cmnd_Alias      DUPLEX=/usr/local/nagios/home/check_duplex

NAGIOS          ALL=NOPASSWD: RAIDMON,CFEXECD,DUPLEX

raidmon and cfexecd are set here so that nagios can restart those services via an
eventhandler. I don't think you need to bother with that.

Here is the section from services.cfg that checks the basic stuff:

define service{
               use                      generic-service
               hostgroup_name           generic-linux
               service_description      DISK USAGE: /
               check_command            check_remote_disk!15%!5%!/
               max_check_attempts       5
               normal_check_interval    5
               contact_groups           admins
              }

define service{
               use                      generic-service
               hostgroup_name           generic-linux
               service_description      DISK USAGE: /BOOT
               check_command            check_remote_disk!15%!5%!/boot
               max_check_attempts       5
               normal_check_interval    5
               contact_groups           admins
              }


define service{
               use                      generic-service
               hostgroup_name           generic-linux
               service_description      DISK USAGE: /TMP
               check_command            check_remote_disk!15%!5%!/tmp
               max_check_attempts       5
               normal_check_interval    5
               contact_groups           admins
              }

define service{
               use                      generic-service
               hostgroup_name           generic-linux
               service_description      DISK USAGE: /VAR
               check_command            check_remote_disk!15%!5%!/var
               max_check_attempts       5
               normal_check_interval    5
               contact_groups           admins
              }

define service{
               use                      generic-service
               hostgroup_name           generic-linux
               service_description      DISK USAGE: /USR
               check_command            check_remote_disk!15%!5%!/usr
               max_check_attempts       5
               normal_check_interval    5
               contact_groups           admins
              }

define service{
               use                      generic-service
               hostgroup_name           generic-linux
               service_description      DISK USAGE: /USR/LOCAL
               check_command            check_remote_disk!15%!5%!/usr/local
               max_check_attempts       5
               normal_check_interval    5
               contact_groups           admins
              }

define service{
               use                      generic-service
               hostgroup_name           generic-linux
               service_description      DISK USAGE: /HOME
               check_command            check_remote_disk!15%!5%!/home
               max_check_attempts       5
               normal_check_interval    5
               contact_groups           admins
              }

define service{
               use                      generic-service
               hostgroup_name           generic-linux
               service_description      MEMORY USAGE
               check_command            check_remote_mem!10!5
               max_check_attempts       5
               normal_check_interval    5
               contact_groups           admins
              }

define service{
               use                      generic-service
               hostgroup_name           generic-linux
               service_description      NR. OF PROCS: SYSLOG-NG
               check_command            check_remote_procs!1:1!1:2!syslog-ng
               max_check_attempts       2
               normal_check_interval    1
               contact_groups           admins
              }

define service{
               use                      generic-service
               hostgroup_name           generic-linux
               service_description      NR. OF PROCS: POSTFIX
               check_command            check_remote_procs!1:1!1:2!master
               max_check_attempts       2
               normal_check_interval    1
               contact_groups           admins
              }

define service{
               use                      generic-service
               hostgroup_name           generic-linux
               service_description      NR. OF PROCS: CRON
               check_command            check_remote_procs!1:2!1:3!crond
               max_check_attempts       2
               normal_check_interval    1
               contact_groups           admins
              }

define service{
               use                      generic-service
               hostgroup_name           generic-linux
               service_description      NR. OF PROCS: RAID MONITOR
               check_command            check_remote_procs!1:1!1:2!MegaServ
               event_handler            raidmon_restart_remote
               max_check_attempts       3
               normal_check_interval    1
               contact_groups           admins
              }

define service{
               use                      generic-service
               hostgroup_name           generic-linux
               service_description      NR. OF ZOMBIES
               check_command            check_remote_zombies!1!2!Z
               max_check_attempts       3
               normal_check_interval    3
               contact_groups           admins
              }

define service{
               use                      generic-service
               hostgroup_name           generic-linux
               service_description      LOAD AVERAGE
               check_command            check_remote_load!'15 10 5'!'30 25 20'
               max_check_attempts       5
               normal_check_interval    5
               contact_groups           admins
              }

define service{
               use                      generic-service
               hostgroup_name           generic-linux
               service_description      SWAP
               check_command            check_remote_swap!15!30
               max_check_attempts       5
               normal_check_interval    5
               contact_groups           admins
              }

define service{
               use                      generic-service
               hostgroup_name           generic-linux
               service_description      DUPLEX SETTINGS
               check_command            check_remote_duplex
               max_check_attempts       2
               normal_check_interval    1
               contact_groups           admins
              }

That should really be most of the info you need. Thx guys !!

Comment 13 Alex Stuck 2006-05-31 17:40:08 UTC

one more thing:

/etc/ldap.conf on all rhel4 clients looks like this :

URI ldaps://host1.idf.net ldaps://host2.idf.net
bind_timelimit 5                
bind_policy soft                
pam_lookup_policy yes           
BASE dc=idf,dc=net              
TLS_CACERTDIR /etc/openldap/cacerts
start_tls no                    
ssl on                          
tls_checkpeer yes               
TLS_REQCERT demand

and /etc/pam.d/system-auth looks like this:

auth        required      /lib/security/$ISA/pam_env.so
auth        sufficient    /lib/security/$ISA/pam_unix.so likeauth nullok
auth        sufficient    /lib/security/$ISA/pam_ldap.so use_first_pass
auth        required      /lib/security/$ISA/pam_deny.so

account     required      /lib/security/$ISA/pam_access.so
account     required      /lib/security/$ISA/pam_unix.so broken_shadow
account     sufficient    /lib/security/$ISA/pam_succeed_if.so uid < 100 quiet
account     sufficient    /lib/security/$ISA/pam_localuser.so
account     [default=bad success=ok user_unknown=ignore service_err=ignore system_err=ignore authinfo_unavail=ignore] \
                          /lib/security/$ISA/pam_ldap.so
account     required      /lib/security/$ISA/pam_permit.so

password    requisite     /lib/security/$ISA/pam_cracklib.so retry=3
password    sufficient    /lib/security/$ISA/pam_unix.so nullok use_authtok md5 shadow
password    sufficient    /lib/security/$ISA/pam_ldap.so use_authtok
password    required      /lib/security/$ISA/pam_deny.so

session     required      /lib/security/$ISA/pam_limits.so
session     required      /lib/security/$ISA/pam_unix.so
session     optional      /lib/security/$ISA/pam_ldap.so

Comment 14 Noriko Hosoi 2006-05-31 17:51:32 UTC

Could there be any errors in slapd-<id>/logs/errors or on the
mod_auth_ldap/nagios side?

Comment 15 Alex Stuck 2006-05-31 18:48:24 UTC

no, neither log shows errors. I don't think it has to do with mod_auth_ldap
anyway. There is little traffic from that.

Comment 16 Noriko Hosoi 2006-06-01 22:14:38 UTC

We are thinking you might have hit this memory leak bug in NSS 3.11:
> This leak was introduced in NSS 3.11 and has been fixed in NSS 3.11.1.  See
> https://bugzilla.mozilla.org/show_bug.cgi?id=336335#c9.

To fix it, you need to replace NSPR 4.6 with 4.6.2 and NSS 3.11 with 3.11.1 in
your Fedora Directory Server.  The binaries are not available yet on the mozilla
site.  If you are interested, could you check out the libraries
  $ export CVSROOT=:pserver:anonymous.org:/cvsroot
  $ cvs -z3 co -r NSPR_4_6_2_RTM mozilla/nsprpub
  $ cvs -z3 co -r NSS_3_11_1_RTM mozilla/dbm mozilla/security/dbm \
      mozilla/security/coreconf mozilla/security/nss

and build them following the instructions found here?
http://directory.fedora.redhat.com/wiki/Building#Mozilla.org_components

Then, shut down the Directory Server and copy the built libraries to the server
lib directory as follows:
  $ cd <mozilla_root>/mozilla/dist/<PlatformInfo_glibc_PTH_DBG_or_OPT.OBJ>/lib
  $ cp *.{so,chk} /opt/fedora-ds/bin/slapd/lib

Unfortunately, this will "break" RPM if the files are replaced.  So, please be
careful and keep backups of the original files when you run your test.
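
One way to keep such backups is to copy the current libraries aside before overwriting them. This is a minimal sketch that assumes the library directory from the cp command above; `backup_libs` and the backup location are illustrative choices, not part of the official instructions.

```sh
# Copy all *.so and *.chk files from a library directory into a backup directory.
# Arguments: library directory, backup directory (created if missing).
backup_libs() {
    lib_dir=$1
    backup_dir=$2
    mkdir -p "$backup_dir" || return 1
    for f in "$lib_dir"/*.so "$lib_dir"/*.chk; do
        [ -e "$f" ] || continue        # skip patterns that matched nothing
        cp -p "$f" "$backup_dir"/ || return 1
    done
}
```

For example, `backup_libs /opt/fedora-ds/bin/slapd/lib /root/fedora-ds-lib-backup` could be run with the server stopped, and the files copied back from the backup directory to undo the change.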

Also, when the new NSPR / NSS libraries are ready at the mozilla ftp site to
download, we will announce it on http://directory.fedora.redhat.com/wiki/.
Thanks.

Comment 17 Alex Stuck 2006-06-16 22:14:53 UTC

Hosoi-san

You hit the spot ! I compiled the new libs, replaced them and after 2 days
ns-slapd is at a stable 1.8% memusage.

one interesting thing I noticed is that both my test and prod ldap servers
behaved the exact same way once I changed the libs.

The prod server used to start at 1.5% but now goes to 1.7% almost right away.
After about 12 hours it went to 1.8%. Then after another 10 hours it went to 1.9%.
I started thinking the problem isn't fixed. However, just about 1 hour later it
went back to 1.8%, something that had never happened before.
It's been running for 2 days 23 hours at a stable 1.8%.

The test server had shown the same behaviour except that the memusage was lower
due to the fact that it only uses itself as a client.

I feel pretty good about this at this point but will keep watching for long term
effects.

thx so far guys.

Comment 18 Alex Stuck 2006-06-19 17:39:48 UTC

maybe I spoke too soon after all.
I looked at my production server today and memusage changed again.
After 4 days and 10 hours it went to 1.9% again only to go back to 1.8% after 2
hours. Then after a total of 4 days and 13 hours it suddenly went to 2.2%.
That's where it's been.
At this point I'm not sure whether this is normal behaviour or if the bug is
still present in a slightly different form.
I'll keep you posted.

Comment 19 Rich Megginson 2006-06-19 17:42:36 UTC

That may very well be normal behavior, if it needs to allocate more memory from
the OS for some cache.  At any rate, it certainly doesn't seem to be the same
bug that you originally reported.

Comment 20 Alex Stuck 2006-06-22 20:21:01 UTC

Created attachment 131385
hourly snapshots of ns-slapd mem and cpu usage after NSS and NSPR upgrade

Comment 21 Rich Megginson 2006-06-22 20:22:43 UTC

Hmm - what happened between these two entries?
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
10880 ldap      15   0  545m  48m  14m S  0.0  2.4   1347:41 ns-slapd           
ns-slapd         558796 8-01:18:13


  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
10880 ldap      16   0  658m 145m  14m S 61.3  7.2   1354:59 ns-slapd           
ns-slapd         673864 8-02:18:12

Comment 22 Alex Stuck 2006-06-22 20:25:12 UTC

guys please look at the latest attachment. It seems to me the new libs don't fix
the leak but simply delay it. Also can you tell me if you think the CPU% values
are normal ?
I get this output by running the following via cron.hourly :

#!/bin/bash

top -bn1 | grep -e ns-slapd -e PID >> /tmp/ns-slapd_mem
ps -eo comm,vsz,etime | grep ns-slapd >> /tmp/ns-slapd_mem
echo -e "\n" >> /tmp/ns-slapd_mem
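
The snapshot file produced by this script can be scanned for sudden jumps such as the one Rich asked about. This is a rough sketch, assuming the three-column `ps` lines (`ns-slapd`, VSZ in KB, elapsed time) that the script above writes; `vsz_jumps` and the threshold value are hypothetical.

```sh
# Read the snapshot file on stdin and report consecutive ps samples whose
# VSZ grew by more than the given number of KB.
vsz_jumps() {
    awk -v limit="$1" '
        # Only the ps lines have exactly three fields starting with ns-slapd;
        # the top lines start with the PID and are ignored.
        $1 == "ns-slapd" && NF == 3 {
            if (seen && $2 - prev > limit)
                print prev_etime " -> " $3 ": +" ($2 - prev) " KB"
            prev = $2; prev_etime = $3; seen = 1
        }'
}
```

Run as `vsz_jumps 50000 < /tmp/ns-slapd_mem`, the two entries quoted in comment 21 would be reported as `8-01:18:13 -> 8-02:18:12: +115068 KB`.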

Comment 23 Alex Stuck 2006-06-22 20:27:10 UTC

oh damn Rich you're fast. Well that's exactly it - nothing happened. The server
just keeps running and serving the same number of clients in the same manner.
Yet, as time passes memusage goes up again - only much slower now.

Comment 24 Rich Megginson 2006-06-22 20:32:56 UTC

This may just be normal caching behavior.  Over time, as the number of entries
grows, more entries will be added to the cache, and more memory will be required
to cache those entries.  It should eventually level out at a fairly low
percentage of your system RAM.

Comment 25 Alex Stuck 2006-06-22 20:55:11 UTC

OK I just tested something. I usually use phpldapadmin to make simple changes
like adding a netgroup etc. I logged on to phpldapadmin and after about 30
seconds memusage changed from 7.2% to 7.4%. I admit I use phpldapadmin a lot and
there seems to be a correlation. Are you saying, however, that 7.4% is still an ok
value ?
At what point should I worry again: 50% - 60% ?

Comment 26 Rich Megginson 2006-06-22 21:01:56 UTC

I know phpldapadmin does a _lot_ of searches, and probably hits a lot of entries
that were not in the cache.

I really can't say at what point you should start to worry.  It's really a
function of the number of entries, average size per entry, number of indexes,
type of indexes, other database overhead, and replication overhead.  That is, it
should not be significantly more than the size of the database on disk, which is
the size of all of the files under /opt/fedora-ds/slapd-instance/db, minus the
__db.XXX files.
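
That on-disk figure can be computed directly. The sketch below sums the sizes of all regular files under the db directory while skipping the __db.XXX files. `db_bytes_on_disk` is a hypothetical helper, and it reads every file once, which is only reasonable for small databases.

```sh
# Print the total size in bytes of all regular files under a db directory,
# excluding the __db.XXX files.
db_bytes_on_disk() {
    find "$1" -type f ! -name '__db.*' -exec cat {} + |
        wc -c | awk '{ print $1 + 0 }'   # normalise any padding from wc
}
```

For example, `db_bytes_on_disk /opt/fedora-ds/slapd-instance/db` gives a number that can be compared with the VSZ column of `ps` (which is in KB).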

Comment 27 Alex Stuck 2006-06-22 23:19:34 UTC

ok ns-slapd uses 673864 k right now and the files in the db directory have the
following size.

[root@ldap1 db]# du -h ./*
12K     ./__db.001
5.5M    ./__db.002
548K    ./__db.003
3.8M    ./__db.004
28K     ./__db.005
4.0K    ./DBVERSION
4.0K    ./guardian
8.4M    ./log.0000000029
372K    ./NetscapeRoot
492K    ./userRoot

I assume you mean the size of NetscapeRoot and userRoot combined makes up the
database on disk, but that's only 864 k. That versus 673 MB ???

Comment 28 Rich Megginson 2006-06-22 23:54:36 UTC

It's not exactly that size, the size in memory is quite a bit larger than the
size on disk.  You also need to include the size of log.0000000029 in the db
size - that adds another 8.4M.  That file is the transaction log for your database.
slapd also caches lots of things other than just the straight database
files, but it's usually some function of the size of the data.

The point is that at some point slapd memory usage should level off, but I'm not
exactly sure at what point that is.

Comment 29 Alex Stuck 2006-06-23 00:11:40 UTC

ok thanks Rich. I will keep watching and let you know if anything major happens.

Comment 30 Rich Megginson 2006-08-21 20:24:36 UTC

Just checking.  Some other users have reported memory leakage.  I just wanted to
know if you have had any more problems since replacing nspr/nss.

Comment 31 Alex Stuck 2006-08-21 20:31:08 UTC

nope - I had actually forgotten about it since it's been fine.
After 41 days 17 hours it's at a stable 2.7% which is fine with me.
thx for checking in though !

Comment 32 Rich Megginson 2006-08-21 20:54:31 UTC

New NSPR 4.6.2/NSS 3.11.1 binaries for RHEL4 x86_64:
http://directory.fedora.redhat.com/download/nspr-4.6.2-nss-3.11.1-RHEL4-x86_64.tar.gz

Comment 33 Rich Megginson 2006-08-22 18:27:30 UTC

New NSPR 4.6.2/NSS 3.11.1 binaries for RHEL4 i386 (32 bit):
http://directory.fedora.redhat.com/download/nspr-4.6.2-nss-3.11.1-RHEL4-i386.tar.gz

Comment 34 Rich Megginson 2006-10-04 21:22:51 UTC

The next version of Fedora DS will use newer versions of NSPR and NSS that fix
this problem.

Comment 35 Rich Megginson 2006-10-16 19:45:51 UTC

Fedora DS 1.0.3 includes nss 3.11.3

Comment 36 Rich Megginson 2008-01-03 20:55:17 UTC

Verified.
