Bug 668619 - slapd stops responding
Status: CLOSED CURRENTRELEASE
Product: 389
Classification: Community
Component: Directory Server
Version: 1.2.7
Hardware: i686 Linux
Priority: high  Severity: high
Assigned To: Rich Megginson
QA Contact: Viktor Ashirov
Duplicates: 668548
Blocks: 434915 CVE-2011-1067 389_1.2.9 711513
Reported: 2011-01-10 17:20 EST by Jeremy Mates
Modified: 2015-12-07 12:02 EST
CC: 9 users

Doc Type: Bug Fix
Clones: 683250 711513
Last Closed: 2015-12-07 12:02:36 EST


Attachments
replay script, configs, access log, strace results (88.67 KB, application/x-tar)
2011-01-10 17:20 EST, Jeremy Mates
git patch file (master) (2.68 KB, patch)
2011-01-31 13:57 EST, Noriko Hosoi
nhosoi: review+
rmeggins: review+
gdb logs (8.51 KB, application/octet-stream)
2011-02-10 16:05 EST, Jeremy Mates
gdb trace from 1.2.8 a3 segfault (89.54 KB, text/plain)
2011-02-25 20:10 EST, Jeremy Mates

Description Jeremy Mates 2011-01-10 17:20:37 EST
Created attachment 472701 [details]
replay script, configs, access log, strace results

Description of problem:

slapd stops responding to queries; the TCP port remains open, but new
connections are unable to bind or issue LDAP queries. The condition
persists until slapd is shut down and restarted.

Version-Release number of selected component (if applicable):

Stock and up-to-date RHEL system; RPM on clean install from EPEL.

# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 5.5 (Tikanga)
# rpm -qa | grep 389
389-console-1.1.4-1.el5
389-dsgw-1.1.5-1.el5
389-ds-base-1.2.7.5-1.el5
389-admin-1.1.13-1.el5
389-admin-console-1.1.5-1.el5
389-admin-console-doc-1.1.5-1.el5
389-ds-1.2.1-1.el5
389-adminutil-1.1.8-4.el5
389-ds-console-1.2.3-1.el5
389-ds-console-doc-1.2.3-1.el5

How reproducible:

Difficult: it requires multiple processes replaying production TCP
sessions saved from tcpdump to cause the wedge. A single replay process
causes no issue, nor do multiple processes replaying the slapd log via
ldapsearch against the server.

Steps to Reproduce:

1. Clean install of a standalone LDAP server, minor configuration tweaks
   to enable Kerberos support, increase file descriptors, allow
   anonymous binds. (See dirsrv.sysconfig, dse.ldif, attached.)
2. Collect production LDAP traffic (actual client requests against
   the current LDAP servers, which run an older fedora-ds release
   that does not exhibit this bug):
   `tcpdump -vv -c 100000 -s 0 -w ldap-traffic port 389`
3. Run strace against the slapd-* process on the server.
4. Replay production LDAP traffic against the test standalone server
   (via the ldap-tcpdump-replay script, attached). Multiple runs of the
   script appear to be required; a single run does not seem to suffice.
5. Wait for the server to stop responding...

Started strace at 20:45 UTC and the test traffic at around the same
time; the server stopped responding and was automatically restarted by
an every-15-seconds health check connecting via 127.0.0.1, which saw no
problems at ~20:47:30 but found an unresponsive server at ~20:47:45.
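A health check of the kind described can be sketched as follows. This is a minimal sketch; the probe command, the 10-second deadline, and the (commented-out) restart step are illustrative assumptions, not taken from the attached configs:

```shell
#!/bin/sh
# Hypothetical health-check sketch: probe the local slapd with an
# anonymous base search and report (or restart) on failure.
check_slapd() {
    # Run the probe command with a hard deadline; exit 0 means it answered.
    timeout 10 "$@" >/dev/null 2>&1
}

if check_slapd ldapsearch -x -H ldap://127.0.0.1:389 -s base -b '' objectClass
then
    echo "slapd responding"
else
    echo "slapd unresponsive"
    # service dirsrv restart   # site-specific restart step (assumption)
fi
```

The `timeout` wrapper matters here: in the failure mode described above the TCP port stays open, so a plain ldapsearch could hang rather than fail fast.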
Comment 1 Rich Megginson 2011-01-10 17:38:58 EST
This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=668548
Comment 2 Todd Hendricks 2011-01-10 17:41:14 EST
I have experienced this issue as well under production load.  I would also add that while the TCP port does remain open, connections to it are immediately disconnected.

# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 5.5 (Tikanga)

# rpm -qa | grep 389
389-admin-1.1.13-1.el5
389-ds-console-doc-1.2.3-1.el5
389-adminutil-1.1.8-4.el5
389-console-1.1.4-1.el5
389-dsgw-1.1.5-1.el5
389-ds-console-1.2.3-1.el5
389-admin-console-doc-1.1.5-1.el5
389-ds-base-1.2.7.5-1.el5
389-admin-console-1.1.5-1.el5
389-ds-1.2.1-1.el5

access:

[10/Jan/2011:12:04:26 -0600] conn=187 fd=69 slot=69 connection from 172.16.x.x to 172.16.x.x
[10/Jan/2011:12:04:26 -0600] conn=187 op=-1 fd=69 closed - T2
[10/Jan/2011:12:04:26 -0600] conn=188 fd=69 slot=69 connection from 172.16.x.x to 172.16.x.x
[10/Jan/2011:12:04:26 -0600] conn=188 op=-1 fd=69 closed - T2
[10/Jan/2011:12:04:26 -0600] conn=189 fd=69 slot=69 connection from 172.16.x.x to 172.16.x.x
[10/Jan/2011:12:04:26 -0600] conn=189 op=-1 fd=69 closed - T2
[10/Jan/2011:12:04:27 -0600] conn=190 fd=69 slot=69 connection from 172.16.x.x to 172.16.x.x
[10/Jan/2011:12:04:27 -0600] conn=190 op=-1 fd=69 closed - T2
[10/Jan/2011:12:04:27 -0600] conn=191 fd=69 slot=69 connection from 172.16.x.x to 172.16.x.x
[10/Jan/2011:12:04:27 -0600] conn=191 op=-1 fd=69 closed - T2
[10/Jan/2011:12:04:30 -0600] conn=192 fd=69 slot=69 connection from 172.16.x.x to 172.16.x.x
[10/Jan/2011:12:04:30 -0600] conn=192 op=-1 fd=69 closed - T2
[10/Jan/2011:12:04:31 -0600] conn=193 fd=69 slot=69 connection from 172.16.x.x to 172.16.x.x
[10/Jan/2011:12:04:31 -0600] conn=193 op=-1 fd=69 closed - T2

error:

[10/Jan/2011:12:04:26 -0600] - new connection on 69
[10/Jan/2011:12:04:26 -0600] - activity on 69r
[10/Jan/2011:12:04:26 -0600] - new connection on 69
[10/Jan/2011:12:04:26 -0600] - activity on 69r
[10/Jan/2011:12:04:26 -0600] - new connection on 69
[10/Jan/2011:12:04:26 -0600] - activity on 69r
[10/Jan/2011:12:04:27 -0600] - new connection on 69
[10/Jan/2011:12:04:27 -0600] - activity on 69r
[10/Jan/2011:12:04:27 -0600] - new connection on 69
[10/Jan/2011:12:04:27 -0600] - activity on 69r
[10/Jan/2011:12:04:30 -0600] - new connection on 69
[10/Jan/2011:12:04:31 -0600] - new connection on 69
[10/Jan/2011:12:04:31 -0600] - activity on 69r
Comment 3 Noriko Hosoi 2011-01-31 12:24:51 EST
*** Bug 668548 has been marked as a duplicate of this bug. ***
Comment 4 Noriko Hosoi 2011-01-31 13:57:19 EST
Created attachment 476252 [details]
git patch file (master)

Description: The server uses the c_timelimit field in the connection
table element to check for stale Simple Paged Results requests.
The field is supposed to be cleaned up when the stale Simple
Paged Results connection is disconnected, but the cleanup was
sometimes incomplete. This caused subsequent requests that
happened to acquire the same connection table element to time out.
This patch forces a cleanup of c_timelimit every time a connection
table element is acquired and reused. Also, the timeout check is
now done only for Simple Paged Results connections.
Comment 6 Noriko Hosoi 2011-01-31 14:06:19 EST
Reviewed by Rich (Thank you!!!)

Pushed to master.

$ git merge 668619
Updating 57dcdf1..68d53e3
Fast-forward
 ldap/servers/slapd/connection.c   |    2 ++
 ldap/servers/slapd/daemon.c       |    3 ++-
 ldap/servers/slapd/pagedresults.c |    4 ++--
 3 files changed, 6 insertions(+), 3 deletions(-)

$ git push
Counting objects: 15, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (8/8), done.
Writing objects: 100% (8/8), 1.11 KiB, done.
Total 8 (delta 6), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   57dcdf1..68d53e3  master -> master
Comment 7 Diego Woitasen 2011-02-08 06:59:35 EST
I had the same problem: slapd stops responding a few minutes after being started. I've applied the patch, and it has been running for 24 hours without any problem.
Comment 8 Noriko Hosoi 2011-02-08 11:56:09 EST
(In reply to comment #7)
> I had the same problem. Slapd stops responding after a few minutes of started.
> I've applied the patch and it's been running for 24 hours without any problem.

Thank you for your input!  Please let us know if you run into the problem again.
Comment 9 Sam Hartsfield 2011-02-10 13:33:01 EST
I've also tested this on Fedora 13 (I installed 1.2.8.a2), and it seems to be running fine. Previously it was "crashing" very frequently (i.e. not responding, with the T2 messages); RHEL4 clients seemed especially hard on it.
Comment 10 Jeremy Mates 2011-02-10 13:41:41 EST
No hang on RHEL 5.5 with 1.2.8 a2, but instead segmentation faults when tested with replayed production traffic. Gathering gdb output...
Comment 11 Jeremy Mates 2011-02-10 16:05:13 EST
Created attachment 478119 [details]
gdb logs

gdb traces from two crashes of 389-ds-base-1.2.8-0.2.a2.el5 in response to replayed production traffic.
Comment 12 Rich Megginson 2011-02-10 16:21:52 EST
(In reply to comment #11)
> Created attachment 478119 [details]
> gdb logs
> 
> gdb traces from two crashes of 389-ds-base-1.2.8-0.2.a2.el5 in response to
> replayed production traffic.

Thanks.  The stack looks corrupted.  Are you using SASL/GSSAPI Kerberos authentication?  Can you attach the last few lines from your errors and access logs before the crash?
Comment 13 Jeremy Mates 2011-02-10 16:51:54 EST
Yes, the servers are configured for Kerberos auth (via http://directory.fedoraproject.org/wiki/Howto:Kerberos). Our existing fedora-ds-1.0.4-1.RHEL4 production LDAP servers also use Kerberos auth and do not exhibit any problems (they were set up well before my time here, though).

The errors log shows just various peer resets, nothing unique at the time of the crash.

[10/Feb/2011:18:18:01 +0000] - PR_Write(118) Netscape Portable Runtime error -5961 (TCP connection reset by peer.)
[10/Feb/2011:18:18:02 +0000] - PR_Write(79) Netscape Portable Runtime error -5961 (TCP connection reset by peer.)

The access log shows nothing exciting, though it is buffered. Trying again with unbuffered logs and a higher error log level.
Comment 14 Jeremy Mates 2011-02-11 13:26:22 EST
Unbuffered logging appears to avoid the segmentation faults (for an overnight run of test traffic; buffered logging leads to crashes within minutes), though it hobbles performance to the log write speed. If production load is near or beyond that limit, client requests block or time out, which is not ideal. Shutting down the unbuffered-logging ns-slapd took 10 (!) minutes after all test traffic was cut off.
Comment 15 Rich Megginson 2011-02-11 13:36:17 EST
Yeah, that's why we buffer access logging by default.  You can use a named pipe script for the access log instead - http://directory.fedoraproject.org/wiki/Named_Pipe_Log_Script
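The named-pipe approach can be illustrated with a minimal FIFO sketch. The paths here are temporary placeholders and the echo stands in for slapd writing a log line; the real helper script is the one at the wiki link above:

```shell
#!/bin/sh
# FIFO sketch: a reader process continuously drains the pipe, so the
# writer (slapd, in the real setup) is never throttled by disk writes.
PIPE=$(mktemp -u)            # placeholder path for the FIFO
mkfifo "$PIPE"
cat "$PIPE" > "$PIPE.out" &  # reader: drains the FIFO into a regular file
READER=$!
echo "demo access-log line" > "$PIPE"   # stands in for slapd logging
wait "$READER"               # reader exits on EOF when the writer closes
LOGGED=$(cat "$PIPE.out")
rm -f "$PIPE" "$PIPE.out"
echo "$LOGGED"
```

In the real setup, nsslapd-accesslog would point at the FIFO and the reader would buffer, rotate, or filter as needed, which is what the wiki script provides.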
Comment 16 Jeremy Mates 2011-02-25 20:10:58 EST
Created attachment 481105 [details]
gdb trace from 1.2.8 a3 segfault

gdb trace from 1.2.8 a3 segfault added (alpha 3 is harder to segfault than alpha 2).
Comment 17 Rich Megginson 2011-02-25 21:20:40 EST
CentOS 5? 32-bit or 64-bit? How are you generating the load? If you are using some scripts or a load client, is it possible we could get a copy?
Comment 18 Nathan Kinder 2011-03-08 17:13:07 EST
I've opened bug 683250 to track the crashing issue initially reported in
comment #10. Please use that bug instead of this one.
Comment 19 Amita Sharma 2011-05-20 08:40:55 EDT
I am using the steps mentioned in comment 5, and I am facing the same T2 error.

Actually, first I tested the given script against a suffix having only 6-7 entries; no issues were found there.

But when I went ahead and executed the script attached to the bug against a suffix with more than 50,000 entries, it crashed.

Here are a few lines from the access log:
[20/May/2011:15:20:02 +051800] conn=13 op=1 SRCH base="dc=example,dc=com" scope=2 filter="(cn=*)" attrs="distinguishedName"
[20/May/2011:15:20:07 +051800] conn=13 op=1 RESULT err=3 tag=101 nentries=0 etime=8 notes=U,P
[20/May/2011:15:20:00 +051800] conn=94 fd=156 slot=156 connection from 127.0.0.1 to 127.0.0.1
[20/May/2011:15:20:07 +051800] conn=63 op=-1 fd=125 closed - T2
[20/May/2011:15:20:07 +051800] conn=59 op=-1 fd=121 closed - T2
[20/May/2011:15:20:07 +051800] conn=55 op=-1 fd=117 closed - T2
[20/May/2011:15:20:07 +051800] conn=51 op=-1 fd=113 closed - T2
[20/May/2011:15:20:07 +051800] conn=49 op=-1 fd=111 closed - T2
[20/May/2011:15:20:07 +051800] conn=47 op=-1 fd=109 closed - T2
[20/May/2011:15:20:07 +051800] conn=38 op=-1 fd=100 closed - T2
[20/May/2011:15:20:07 +051800] conn=95 fd=68 slot=68 connection from 127.0.0.1 to 127.0.0.1
[20/May/2011:15:20:07 +051800] conn=96 fd=73 slot=73 connection from 127.0.0.1 to 127.0.0.1
[20/May/2011:15:20:00 +051800] conn=47 op=1 RESULT err=0 tag=101 nentries=5 etime=1 notes=U,P
[20/May/2011:15:20:07 +051800] conn=74 op=1 SRCH base="dc=example,dc=com" scope=2 filter="(cn=*)" attrs="distinguishedName"
[20/May/2011:15:20:07 +051800] conn=74 op=1 RESULT err=0 tag=101 nentries=5 etime=0 notes=U,P
[20/May/2011:15:20:07 +051800] conn=73 op=1 SRCH base="dc=example,dc=com" scope=2 filter="(cn=*)" attrs="distinguishedName"
[20/May/2011:15:20:07 +051800] conn=73 op=1 RESULT err=0 tag=101 nentries=5 etime=0 notes=U,P


[20/May/2011:15:20:08 +051800] conn=65 op=2 UNBIND
[20/May/2011:15:20:08 +051800] conn=65 op=2 fd=127 closed - U1
[20/May/2011:15:20:08 +051800] conn=64 op=2 UNBIND
[20/May/2011:15:20:08 +051800] conn=64 op=2 fd=126 closed - U1
[20/May/2011:15:20:08 +051800] conn=62 op=2 UNBIND
[20/May/2011:15:20:08 +051800] conn=62 op=2 fd=124 closed - U1
[20/May/2011:15:20:08 +051800] conn=61 op=3 UNBIND
[20/May/2011:15:20:08 +051800] conn=61 op=3 fd=123 closed - U1
[20/May/2011:15:20:08 +051800] conn=58 op=2 UNBIND
[20/May/2011:15:20:08 +051800] conn=58 op=2 fd=120 closed - U1
[20/May/2011:15:20:07 +051800] conn=97 fd=77 slot=77 connection from 127.0.0.1 to 1


[20/May/2011:15:20:08 +051800] conn=109 fd=125 slot=125 connection from 127.0.0.1 to 127.0.0.1
[20/May/2011:15:20:09 +051800] conn=91 op=-1 fd=153 closed - T2
[20/May/2011:15:20:09 +051800] conn=86 op=-1 fd=148 closed - T2
[20/May/2011:15:20:09 +051800] conn=74 op=-1 fd=136 closed - T2
[20/May/2011:15:20:09 +051800] conn=73 op=-1 fd=135 closed - T2
[20/May/2011:15:20:09 +051800] conn=71 op=-1 fd=133 closed - T2
[20/May/2011:15:20:09 +051800] conn=70 op=-1 fd=132 closed - T2
[20/May/2011:15:20:09 +051800] conn=69 op=-1 fd=131 closed - T2
[20/May/2011:15:20:09 +051800] conn=68 op=-1 fd=130 closed - T2


[20/May/2011:15:20:11 +051800] conn=28 op=1 RESULT err=3 tag=101 nentries=0 etime=12 notes=U,P
[20/May/2011:15:20:11 +051800] conn=14 op=2 UNBIND
[20/May/2011:15:20:11 +051800] conn=14 op=2 fd=76 closed - U1
[20/May/2011:15:20:11 +051800] conn=13 op=2 UNBIND
[20/May/2011:15:20:11 +051800] conn=13 op=2 fd=75 closed - U1
[20/May/2011:15:20:11 +051800] conn=12 op=2 UNBIND
[20/May/2011:15:20:11 +051800] conn=12 op=2 fd=74 closed - U1
[20/May/2011:15:20:11 +051800] conn=10 op=2 UNBIND
[20/May/2011:15:20:11 +051800] conn=10 op=2 fd=72 closed - U1


From the errors log:
[20/May/2011:14:36:24 +051800] - PR_Write(64) Netscape Portable Runtime error -5961 (TCP connection reset by peer.)
[20/May/2011:15:13:50 +051800] - slapd shutting down - signaling operation threads
[20/May/2011:15:13:50 +051800] - slapd shutting down - closing down internal subsystems and plugins
[20/May/2011:15:13:52 +051800] - Waiting for 4 database threads to stop
[20/May/2011:15:13:52 +051800] - All database threads now stopped
[20/May/2011:15:13:53 +051800] - slapd stopped.
[20/May/2011:15:13:56 +051800] - 389-Directory/1.2.8.3 B2011.123.1759 starting up
[20/May/2011:15:13:56 +051800] - slapd started.  Listening on All Interfaces port 389 for LDAP requests
[20/May/2011:15:13:56 +051800] - Listening on All Interfaces port 636 for LDAPS requests


Request you to please guide here.

Thanks and Regards,
Amita
Comment 20 Rich Megginson 2011-05-20 10:18:07 EDT
(In reply to comment #19)
> using the steps mentioned at comment 5.
> and... I am facing the same T2 error.
> 
> Actually,first I tested the given script against a suffix which is having only
> 6-7 entries.. No issues found there.
> 
> But when I went ahead and executed the script attached by you in the bug with
> the suffix having entries more than 50,000.. It crashes..

What makes you think it crashes?  Are you talking about the test program or the directory server?  From the errors log below, the server shuts down cleanly and starts up cleanly - no crash there.

The original bug was that the server would stop responding to new requests after some period of activity.  That is, ldapsearch would fail, but the server would be running.  Are you able to reproduce that problem?

Note that it is ok to get T2 errors - if the client is hung or there is a lot of network congestion, the server can timeout attempting to read/write to the client.  If this is a common occurrence for the customer, we usually recommend increasing the idle timeout.
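For reference, increasing the idle timeout is a one-attribute change on cn=config. A minimal sketch follows; the 7200-second value and the bind DN are illustrative assumptions, not values from this bug:

```shell
#!/bin/sh
# Write an LDIF raising nsslapd-idletimeout (seconds a connection may sit
# idle before the server closes it with T2); 7200 is an example value.
cat > idletimeout.ldif <<'EOF'
dn: cn=config
changetype: modify
replace: nsslapd-idletimeout
nsslapd-idletimeout: 7200
EOF
# Apply it with Directory Manager credentials (commented out here because
# it needs a live server):
# ldapmodify -x -H ldap://127.0.0.1:389 -D "cn=Directory Manager" -W -f idletimeout.ldif
```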

Comment 22 Rich Megginson 2011-05-23 22:08:38 EDT
When you hit the bug, are you able to, in another window, run ldapsearch against the directory server?  If so, then the server is not hung, and you have not reproduced the problem.

I tried running the script.  For one, on my system, I got a lot of errors like this:
/share/scripts/bug668619.sh: fork: retry: Resource temporarily unavailable

For another - the script runs with -E pr=5 - by default, ldapsearch will prompt for the next page, so I have several hundred ldapsearch processes waiting for input on stdin - is this right?  I guess the purpose of the test is to leave several hundred paged searches unfinished?  Otherwise, they might finish before the timeout is hit.
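The pattern of leaving many paged searches unfinished can be sketched as follows. The host, suffix, and stdin trick are illustrative assumptions, and the hypothetical DRY_RUN switch only prints the command so the sketch can be exercised without a live server:

```shell
#!/bin/sh
# Sketch: hold a paged search (-E pr=5) open by giving a prompting
# ldapsearch a stdin pipe that never delivers input, so the client blocks
# after its first page and the server keeps paged-results state for it.
paged_search() {
    set -- ldapsearch -x -H ldap://127.0.0.1:389 \
        -b "dc=example,dc=com" -E pr=5 '(cn=*)' dn
    if [ -n "$DRY_RUN" ]; then
        echo "$@"            # just show the command this sketch would run
    else
        sleep 600 | "$@" &   # parked on the next-page prompt for 10 min
    fi
}

DRY_RUN=1 paged_search
```

Repeating the non-dry-run branch a few hundred times in a loop would approximate the "several hundred ldapsearch processes waiting on stdin" state described above.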
Comment 30 Amita Sharma 2011-07-07 04:58:06 EDT
Hi Rich and Noriko,

I am using the attached script, "script to reproduce the problem", to verify.

1. I am getting the output below after some time:
ldap_result: Can't contact LDAP server (-1)
# extended LDIF
#
# LDAPv3
# base <dc=example,dc=com> with scope subtree
# filter: (cn=*)
# requesting: dn 
# with pagedResults control: size=5
#

ldap_result: Can't contact LDAP server (-1)
# extended LDIF
#
# LDAPv3
# base <dc=example,dc=com> with scope subtree
# filter: (cn=*)
# requesting: dn 
# with pagedResults control: size=5
#

ldap_result: Can't contact LDAP server (-1)
# extended LDIF
#
# LDAPv3
# base <dc=example,dc=com> with scope subtree
# filter: (cn=*)
# requesting: dn 
# with pagedResults control: size=5
#

ldap_result: Can't contact LDAP server (-1)

Also, I was not able to break out of this loop; it was infinite, and
Control+C failed.

2. I am able to execute ldapsearch in another window successfully.

3. The access log shows some T2 closures:
[07/Jul/2011:14:15:22 +051800] conn=142 op=-1 fd=164 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=141 op=-1 fd=163 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=140 op=-1 fd=162 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=139 op=-1 fd=136 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=138 op=-1 fd=161 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=137 op=-1 fd=160 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=136 op=-1 fd=159 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=135 op=-1 fd=158 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=134 op=-1 fd=157 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=133 op=-1 fd=156 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=131 op=-1 fd=154 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=130 op=-1 fd=153 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=129 op=-1 fd=152 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=128 op=-1 fd=151 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=127 op=-1 fd=150 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=126 op=-1 fd=149 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=125 op=-1 fd=148 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=124 op=-1 fd=147 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=123 op=-1 fd=146 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=121 op=-1 fd=145 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=120 op=-1 fd=144 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=119 op=-1 fd=143 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=118 op=-1 fd=142 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=117 op=-1 fd=141 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=116 op=-1 fd=140 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=115 op=-1 fd=139 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=114 op=-1 fd=138 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=113 op=-1 fd=137 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=112 op=-1 fd=112 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=110 op=-1 fd=135 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=109 op=-1 fd=134 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=108 op=-1 fd=133 closed - T2
[07/Jul/2011:14:15:22 +051800] conn=106 op=-1 fd=131 closed - T2
                                                                  
Is this as expected, and can the bug be marked as VERIFIED?

I am testing with the build below:
[root@rheltest ~]# rpm -qa | grep 389
389-console-1.1.7-1.el6.noarch
389-ds-console-doc-1.2.6-1.el6.noarch
389-ds-base-devel-1.2.8.2-1.el6_1.6.x86_64
389-adminutil-1.1.14-1.el6.x86_64
389-admin-console-1.1.7-1.el6.noarch
389-adminutil-devel-1.1.14-1.el6.x86_64
389-ds-base-debuginfo-1.2.8.2-1.el6_1.3.x86_64
389-ds-base-libs-1.2.8.2-1.el6_1.6.x86_64
389-ds-base-1.2.8.2-1.el6_1.6.x86_64
389-ds-1.2.1-2.el6.noarch
389-admin-1.1.20-1.el6.x86_64
389-admin-console-doc-1.1.7-1.el6.noarch
389-ds-console-1.2.6-1.el6.noarch
[root@rheltest ~]# rpm -qa | grep redhat
redhat-ds-admin-9.0.0-0.2.el6.x86_64
redhat-idm-console-9.0.0-1.el6.x86_64
redhat-logos-60.0.14-1.el6.noarch
redhat-menus-14.0.0-3.el6.noarch
redhat-ds-console-9.0.0-1.el6.noarch
redhat-ds-console-doc-9.0.0-1.el6.noarch
redhat-lsb-4.0-3.el6.x86_64
redhat-ds-base-9.0.0-0.17.el6dsrv.x86_64
redhat-admin-console-9.0.0-1.el6.noarch
redhat-admin-console-doc-9.0.0-1.el6.noarch
redhat-bookmarks-6-1.el6.noarch
redhat-lsb-graphics-4.0-3.el6.x86_64
redhat-lsb-printing-4.0-3.el6.x86_64
redhat-ds-9.0.0-1.el6.x86_64
redhat-indexhtml-6-1.el6.noarch
redhat-release-server-6Server-6.1.0.2.el6.x86_64
Comment 31 Rich Megginson 2011-07-07 10:38:07 EDT
(In reply to comment #30)
> Hi Rich and Noriko,
> 
> I am using attached script - "script to reproduce the problem" to verify.
> 
> 1). I am getting below after some time :
> ldap_result: Can't contact LDAP server (-1)
> # extended LDIF
> #
> # LDAPv3
> # base <dc=example,dc=com> with scope subtree
> # filter: (cn=*)
> # requesting: dn 
> # with pagedResults control: size=5
> #
> 
> ldap_result: Can't contact LDAP server (-1)
> # extended LDIF
> #
> # LDAPv3
> # base <dc=example,dc=com> with scope subtree
> # filter: (cn=*)
> # requesting: dn 
> # with pagedResults control: size=5
> #
> 
> ldap_result: Can't contact LDAP server (-1)
> # extended LDIF
> #
> # LDAPv3
> # base <dc=example,dc=com> with scope subtree
> # filter: (cn=*)
> # requesting: dn 
> # with pagedResults control: size=5
> #
> 
> ldap_result: Can't contact LDAP server (-1)
> 
> Also I was not able to come out of this loop, It was infinite.
> Control+C was fail.
> 
> 2. I am able to execute ldapsearch on another window successfully.

Ok.  Then the bug is fixed.  The other things are ok, as long as you can connect to the server from another process.

Comment 32 Amita Sharma 2011-07-08 02:43:04 EDT
Based on comment #31, I am marking the bug as VERIFIED.
