RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 1022500 - Winsync plugin segfault during incremental backoff
Summary: Winsync plugin segfault during incremental backoff
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: 389-ds-base
Version: 6.4
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: rc
: ---
Assignee: Rich Megginson
QA Contact: Sankar Ramalingam
URL:
Whiteboard:
Depends On:
Blocks: 1027047 1061410
TreeView+ depends on / blocked
 
Reported: 2013-10-23 12:25 UTC by Ján Rusnačko
Modified: 2020-09-13 20:49 UTC (History)
5 users (show)

Fixed In Version: 389-ds-base-1.2.11.15-34.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1027047 (view as bug list)
Environment:
Last Closed: 2014-10-14 07:51:53 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Stacktraces (39.94 KB, text/plain)
2013-10-23 12:25 UTC, Ján Rusnačko
no flags Details
Valgrind output from crashed DS (3.17 MB, text/plain)
2013-11-01 09:21 UTC, Ján Rusnačko
no flags Details
Reproducer (3.12 KB, application/x-sh)
2013-11-01 15:07 UTC, Ján Rusnačko
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github 389ds 389-ds-base issues 918 0 None closed Winsync plugin segfault during incremental backoff 2021-02-04 22:36:13 UTC
Red Hat Product Errata RHBA-2014:1385 0 normal SHIPPED_LIVE 389-ds-base bug fix and enhancement update 2014-10-14 01:27:42 UTC

Description Ján Rusnačko 2013-10-23 12:25:49 UTC
Created attachment 815376 [details]
Stacktraces

Description of problem:
Directory server segfaults when synchronization agreement is set up with Active Directory. Crash occurs when DS instance cannot contact AD and goes into incremental backoff.

Version-Release number of selected component (if applicable):
389-ds-base-1.2.11.15-22.el6_4.i686

How reproducible:
always

Steps to Reproduce:
0. (run under gdb)
1. Set up winsync agreement via STARTTLS but dont import AD certificate (DS will go into incremental backoff since certificate cannot be verified and thus connection cannot be established)
2. Wait for a few minutes, then delete agreement
3. Set up new agreement using SSL/TLS

Error log:

[23/Oct/2013:13:39:16 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:39:16 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsync:389): Replication bind with SIMPLE auth failed: LDAP error -11 (Connect error) (TLS error -8179:Peer's Certificate issuer is not recognized.)
[23/Oct/2013:13:39:16 +0200] - slapd started.  Listening on All Interfaces port 389 for LDAP requests
[23/Oct/2013:13:39:16 +0200] - Listening on All Interfaces port 636 for LDAPS requests
[23/Oct/2013:13:39:16 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:39:20 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:39:27 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:39:38 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:40:02 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:40:50 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:42:26 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:44:16 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:44:19 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:44:25 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:44:37 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:45:01 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:45:38 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:45:49 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:47:25 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:49:16 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:49:20 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:49:26 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:49:38 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:50:02 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:50:37 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:50:50 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:52:26 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:54:16 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:54:19 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:54:25 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:54:37 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:55:01 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:55:37 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:55:38 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:55:49 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:57:25 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[23/Oct/2013:13:59:05 +0200] NSMMReplicationPlugin - agmt_delete: begin
[23/Oct/2013:14:00:21 +0200] slapi_ldap_bind - Error: could not send bind request for id [cn=administrator,cn=users,dc=adrelm,dc=com] mech [SIMPLE]: error -1 (Can't contact LDAP server) -8179 (Peer's Certificate issuer is not recognized.) 115 (Operation now in progress)
[23/Oct/2013:14:00:21 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsync:636): Replication bind with SIMPLE auth failed: LDAP error -1 (Can't contact LDAP server) (TLS error -8179:Peer's Certificate issuer is not recognized.)
[23/Oct/2013:14:00:24 +0200] slapi_ldap_bind - Error: could not send bind request for id [cn=administrator,cn=users,dc=adrelm,dc=com] mech [SIMPLE]: error -1 (Can't contact LDAP server) -8179 (Peer's Certificate issuer is not recognized.) 115 (Operation now in progress)
[23/Oct/2013:14:00:30 +0200] slapi_ldap_bind - Error: could not send bind request for id [cn=administrator,cn=users,dc=adrelm,dc=com] mech [SIMPLE]: error -1 (Can't contact LDAP server) -8179 (Peer's Certificate issuer is not recognized.) 115 (Operation now in progress)

Output from GDB:

(gdb) cont
Continuing.
[Thread 0xb20feb70 (LWP 6986) exited]
[New Thread 0xb20feb70 (LWP 10246)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xadaf7b70 (LWP 6993)]
__pthread_mutex_lock (mutex=0x0) at pthread_mutex_lock.c:50
50        unsigned int type = PTHREAD_MUTEX_TYPE (mutex);
(gdb) bt
#0  __pthread_mutex_lock (mutex=0x0) at pthread_mutex_lock.c:50
#1  0x04752e63 in PR_Lock (lock=0x0)
    at ../../../mozilla/nsprpub/pr/src/pthreads/ptsynch.c:174
#2  0x00623dd6 in event_notify (prp=0x89e1508, event=<value optimized out>)
    at ldap/servers/plugins/replication/windows_inc_protocol.c:1069
#3  0x006240a0 in windows_inc_backoff_expired (timer_fire_time=1382529637, arg=0x89e1508)
    at ldap/servers/plugins/replication/windows_inc_protocol.c:1656
#4  0x00c12f11 in eq_call_all (arg=0x0) at ldap/servers/slapd/eventq.c:312
#5  eq_loop (arg=0x0) at ldap/servers/slapd/eventq.c:359
#6  0x04759342 in _pt_root (arg=0x87d4f60)
    at ../../../mozilla/nsprpub/pr/src/pthreads/ptthread.c:156
#7  0x009f4a49 in start_thread (arg=0xadaf7b70) at pthread_create.c:301
#8  0x00930aae in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:133

Comment 2 Noriko Hosoi 2013-10-24 23:10:26 UTC
Hi Ján,

I'm trying to reproduce the problem.  If possible, I'd like to fix the bug in the newest branch with which the problem occurs.

1) You tested using the 6.4 build:
389-ds-base-1.2.11.15-22.el6_4.i686

Are there any reason why you chose the 6.4 version?  This is the build for 6.5.  You don't see the problem with this server?
389-ds-base-1.2.11.15-29.el6

2) In the steps, you did not mention the AD certificate after deleting it in the step 1.  Is it correct?  Or did you add it back between the step 2 and 3?  Also, did you restart the server at some point?
  Steps to Reproduce:
  0. (run under gdb)
  1. Set up winsync agreement via STARTTLS but dont import AD certificate (DS will go into incremental backoff since certificate cannot be verified and thus connection cannot be established)
  2. Wait for a few minutes, then delete agreement
  3. Set up new agreement using SSL/TLS

Comment 3 Ján Rusnačko 2013-10-25 09:51:49 UTC
(In reply to Noriko Hosoi from comment #2)
> Hi Ján,
> 
> I'm trying to reproduce the problem.  If possible, I'd like to fix the bug
> in the newest branch with which the problem occurs.
> 
> 1) You tested using the 6.4 build:
> 389-ds-base-1.2.11.15-22.el6_4.i686
> 
> Are there any reason why you chose the 6.4 version?  
I did upgrade testing, so first I had to set up winsync on RHEL64 and then verify it works after upgrade. I ran into this by accident.
>This is the build for
> 6.5.  You don't see the problem with this server?
> 389-ds-base-1.2.11.15-29.el6
I am able to reproduce on RHEL 6.5
> 
> 2) In the steps, you did not mention the AD certificate after deleting it in
> the step 1.  Is it correct?  Or did you add it back between the step 2 and
> 3?  Also, did you restart the server at some point?
I did not restart server at any point. 
>   Steps to Reproduce:
>   0. (run under gdb)
>   1. Set up winsync agreement via STARTTLS but dont import AD certificate
> (DS will go into incremental backoff since certificate cannot be verified
> and thus connection cannot be established)
>   2. Wait for a few minutes, then delete agreement
>   3. Set up new agreement using SSL/TLS

For RHEL 6.5 I followed these steps:
1) Set up STARTTLS without importing AD certificate - DS will complain about missing cert
2) Wait a bit, then delete agreement
3) Create new agreement with SSL/TLS - DS will complain about missing cert
4) Import AD certificate - replication will start but complain that replica has not been initialized.
5) Wait a bit -> server crashes

This is error log from RHEL 6.5:
[25/Oct/2013:11:10:07 +0200] - 389-Directory/1.2.11.15 B2013.289.33 starting up
[25/Oct/2013:11:10:07 +0200] - slapd started.  Listening on All Interfaces port 389 for LDAP requests
[25/Oct/2013:11:10:07 +0200] - Listening on All Interfaces port 636 for LDAPS requests
[25/Oct/2013:11:10:14 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[25/Oct/2013:11:10:14 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:389): Replication bind with SIMPLE auth failed: LDAP error -11 (Connect error) (TLS error -8179:Peer's Certificate issuer is not recognized.)
[25/Oct/2013:11:10:14 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[25/Oct/2013:11:10:18 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[25/Oct/2013:11:10:24 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[25/Oct/2013:11:10:36 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[25/Oct/2013:11:11:00 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[25/Oct/2013:11:11:48 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[25/Oct/2013:11:13:24 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[25/Oct/2013:11:15:14 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[25/Oct/2013:11:15:18 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[25/Oct/2013:11:15:24 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[25/Oct/2013:11:15:36 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[25/Oct/2013:11:16:00 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[25/Oct/2013:11:16:36 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[25/Oct/2013:11:16:49 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[25/Oct/2013:11:18:24 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[25/Oct/2013:11:19:12 +0200] NSMMReplicationPlugin - agmt_delete: begin
[25/Oct/2013:11:19:57 +0200] slapi_ldap_bind - Error: could not send bind request for id [cn=administrator,cn=users,dc=adrelm,dc=com] mech [SIMPLE]: error -1 (Can't contact LDAP server) -8179 (Peer's Certificate issuer is not recognized.) 115 (Operation now in progress)
[25/Oct/2013:11:19:57 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replication bind with SIMPLE auth failed: LDAP error -1 (Can't contact LDAP server) (TLS error -8179:Peer's Certificate issuer is not recognized.)
[25/Oct/2013:11:20:00 +0200] slapi_ldap_bind - Error: could not send bind request for id [cn=administrator,cn=users,dc=adrelm,dc=com] mech [SIMPLE]: error -1 (Can't contact LDAP server) -8179 (Peer's Certificate issuer is not recognized.) 115 (Operation now in progress)
[25/Oct/2013:11:20:06 +0200] slapi_ldap_bind - Error: could not send bind request for id [cn=administrator,cn=users,dc=adrelm,dc=com] mech [SIMPLE]: error -1 (Can't contact LDAP server) -8179 (Peer's Certificate issuer is not recognized.) 115 (Operation now in progress)
[25/Oct/2013:11:20:18 +0200] slapi_ldap_bind - Error: could not send bind request for id [cn=administrator,cn=users,dc=adrelm,dc=com] mech [SIMPLE]: error -1 (Can't contact LDAP server) -8179 (Peer's Certificate issuer is not recognized.) 115 (Operation now in progress)
[25/Oct/2013:11:20:42 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replication bind with SIMPLE auth resumed
[25/Oct/2013:11:20:42 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized.
[25/Oct/2013:11:20:45 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized.
[25/Oct/2013:11:20:49 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized.
[25/Oct/2013:11:20:53 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized.
[25/Oct/2013:11:20:57 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized.
[25/Oct/2013:11:21:00 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized.
[25/Oct/2013:11:21:03 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized.
[25/Oct/2013:11:21:06 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized.
[25/Oct/2013:11:21:09 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized.
[25/Oct/2013:11:21:12 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized.
[25/Oct/2013:11:21:15 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized.
[25/Oct/2013:11:21:19 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized.
[25/Oct/2013:11:21:23 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized.
[25/Oct/2013:11:21:27 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized.
[25/Oct/2013:11:21:31 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized.
[25/Oct/2013:11:21:34 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized.
[25/Oct/2013:11:34:57 +0200] - 389-Directory/1.2.11.15 B2013.289.33 starting up
[25/Oct/2013:11:34:57 +0200] - Detected Disorderly Shutdown last time Directory Server was running, recovering database.
[25/Oct/2013:11:34:59 +0200] - slapd started.  Listening on All Interfaces port 389 for LDAP requests
[25/Oct/2013:11:34:59 +0200] - Listening on All Interfaces port 636 for LDAPS requests
[25/Oct/2013:11:34:59 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized.
[25/Oct/2013:11:34:59 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized.

Comment 4 Noriko Hosoi 2013-10-25 16:20:29 UTC
Thank you, Ján, for testing with rhel-6.5 bits and providing the detailed steps!

Comment 5 Noriko Hosoi 2013-10-25 21:38:55 UTC
Hi, Ján.  I'm trying to reproduce the problem with rhel-6.5 build, but so far no luck.

I'm wondering if the crash problem occurs on the x86_64 system, as well?  Or is it only on the 32 bit box?
389-ds-base-1.2.11.15-22.el6_4.i686

If okay, could you let me access your test machine/vm?  I'd like to see how prp->lock could get turned to NULL...

Comment 6 Ján Rusnačko 2013-10-29 09:55:06 UTC
(In reply to Noriko Hosoi from comment #5)
> Hi, Ján.  I'm trying to reproduce the problem with rhel-6.5 build, but so
> far no luck.
> 
> I'm wondering if the crash problem occurs on the x86_64 system, as well?  Or
> is it only on the 32 bit box?
> 389-ds-base-1.2.11.15-22.el6_4.i686
> 
> If okay, could you let me access your test machine/vm?  I'd like to see how
> prp->lock could get turned to NULL...
Yes, I am able to reproduce also on 64bit RHEL65.

I have sent you details to machine with VMs via mail.

Comment 7 Noriko Hosoi 2013-10-30 22:58:43 UTC
Hi Ján, there should be some magic I'm missing.  I still cannot reproduce the problem on your VM.  Here's what I did in details...

1. Make sure no WinSync agreement nor AD cert.
2. Add a startTLS agreement
$ ldapmodify -x -h localhost -p 389 -D 'cn=directory manager' -w <pw> -a << EOF
dn: cn=M1toAD,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
objectClass: top
objectClass: nsDSWindowsReplicationAgreement
description: M1toAd
cn: M1toAD
nsds7WindowsReplicaSubtree: ou=people,dc=adrelm,dc=com
nsds7DirectoryReplicaSubtree: ou=People, dc=example,dc=com
nsds7NewWinUserSyncEnabled: on
nsds7NewWinGroupSyncEnabled: on
nsds7WindowsDomain: adrelm.com
nsDS5ReplicaRoot: dc=example,dc=com
nsDS5ReplicaHost: winsyncad.adrelm.com
nsDS5ReplicaPort: 389
nsDS5ReplicaBindDN: cn=administrator,cn=users,dc=adrelm,dc=com
nsDS5ReplicaTransportInfo: startTLS
nsDS5ReplicaBindMethod: SIMPLE
nsDS5ReplicaCredentials: Secret123
EOF
3.Wait a bit, then delete the agreement
$ ldapdelete ...
cn=M1toAD,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
4. Create new agreement with SSL/TLS - DS will complain about missing cert
$ ldapmodify -x -h localhost -p 389 -D 'cn=directory manager' -w <pw> -a << EOF
dn: cn=M1toAD,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
objectClass: top
objectClass: nsDSWindowsReplicationAgreement
description: M1toAd
cn: M1toAD
nsds7WindowsReplicaSubtree: ou=people,dc=adrelm,dc=com
nsds7DirectoryReplicaSubtree: ou=People, dc=example,dc=com
nsds7NewWinUserSyncEnabled: on
nsds7NewWinGroupSyncEnabled: on
nsds7WindowsDomain: adrelm.com
nsDS5ReplicaRoot: dc=example,dc=com
nsDS5ReplicaHost: winsyncad.adrelm.com
nsDS5ReplicaPort: 636
nsDS5ReplicaBindDN: cn=administrator,cn=users,dc=adrelm,dc=com
nsDS5ReplicaTransportInfo: SSL
nsDS5ReplicaBindMethod: SIMPLE
nsDS5ReplicaCredentials: Secret123
5. Import AD certificate - replication will start but complain that replica has not been initialized.
$ certutil -A -d . -n "adrelm-WINSYNCAD-CA" -t "CT,," -a -i /home/jrusnack/certificates/adcert.cer

I've been waiting for about 10 minutes, but the server does not crash...  Please advice...

Comment 8 Ján Rusnačko 2013-10-31 08:55:20 UTC
(In reply to Noriko Hosoi from comment #7)
> Hi Ján, there should be some magic I'm missing.  I still cannot reproduce
> the problem on your VM.  Here's what I did in details...
> 
> 1. Make sure no WinSync agreement nor AD cert.
> 2. Add a startTLS agreement
> $ ldapmodify -x -h localhost -p 389 -D 'cn=directory manager' -w <pw> -a <<
> EOF
> dn: cn=M1toAD,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
> objectClass: top
> objectClass: nsDSWindowsReplicationAgreement
> description: M1toAd
> cn: M1toAD
> nsds7WindowsReplicaSubtree: ou=people,dc=adrelm,dc=com
> nsds7DirectoryReplicaSubtree: ou=People, dc=example,dc=com
> nsds7NewWinUserSyncEnabled: on
> nsds7NewWinGroupSyncEnabled: on
> nsds7WindowsDomain: adrelm.com
> nsDS5ReplicaRoot: dc=example,dc=com
> nsDS5ReplicaHost: winsyncad.adrelm.com
> nsDS5ReplicaPort: 389
> nsDS5ReplicaBindDN: cn=administrator,cn=users,dc=adrelm,dc=com
> nsDS5ReplicaTransportInfo: startTLS
> nsDS5ReplicaBindMethod: SIMPLE
> nsDS5ReplicaCredentials: Secret123
> EOF
> 3.Wait a bit, then delete the agreement
> $ ldapdelete ...
> cn=M1toAD,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
> 4. Create new agreement with SSL/TLS - DS will complain about missing cert
> $ ldapmodify -x -h localhost -p 389 -D 'cn=directory manager' -w <pw> -a <<
> EOF
> dn: cn=M1toAD,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
> objectClass: top
> objectClass: nsDSWindowsReplicationAgreement
> description: M1toAd
> cn: M1toAD
> nsds7WindowsReplicaSubtree: ou=people,dc=adrelm,dc=com
> nsds7DirectoryReplicaSubtree: ou=People, dc=example,dc=com
> nsds7NewWinUserSyncEnabled: on
> nsds7NewWinGroupSyncEnabled: on
> nsds7WindowsDomain: adrelm.com
> nsDS5ReplicaRoot: dc=example,dc=com
> nsDS5ReplicaHost: winsyncad.adrelm.com
> nsDS5ReplicaPort: 636
> nsDS5ReplicaBindDN: cn=administrator,cn=users,dc=adrelm,dc=com
> nsDS5ReplicaTransportInfo: SSL
> nsDS5ReplicaBindMethod: SIMPLE
> nsDS5ReplicaCredentials: Secret123
> 5. Import AD certificate - replication will start but complain that replica
> has not been initialized.
> $ certutil -A -d . -n "adrelm-WINSYNCAD-CA" -t "CT,," -a -i
> /home/jrusnack/certificates/adcert.cer
> 
> I've been waiting for about 10 minutes, but the server does not crash... 
> Please advice...
Steps look OK. The only thing I can think of is that the AD cert is not fully removed. Please note that if you have running M1 and do
1) remove winsync agreement
2) remove AD cert
3) setup new winsync agreement

then M1 will NOT complain that certificate is not recognized. It somehow caches the cert (so connection will be OK) and will complain that replica has not been initialized yet.

So please try your steps again, but after step 1) delete winsync agreement and AD cert also restart M1. That will flush the cert. After step 2), please make sure you see

[25/Oct/2013:11:10:14 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)
[25/Oct/2013:11:10:14 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:389): Replication bind with SIMPLE auth failed: LDAP error -11 (Connect error) (TLS error -8179:Peer's Certificate issuer is not recognized.)
[25/Oct/2013:11:10:14 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success)

in error logs. 

I have logged to the machine I shared with you and reproduced the bug. I left there opened terminals, one is tailing M1 error log, the other one has opened gdb attached to M1 with segfault.

Let me know if this helped.

Comment 9 Noriko Hosoi 2013-11-01 01:36:58 UTC
Hi Ján,

Thanks for the gdb.  From your gdb, I could tell a winsync protocol is in the event queue, which was already freed, and the staled protocol causes the crash.  But I still cannot find why it is left in the queue.  Reading the code, when a protocol is freed, it's removed from the queue (by slapi_eq_cancel).

I'd like to see the valgrind output.  I set it up and ran the steps, but I still cannot see the crash. :(

I see this Invalid read.  It should be fixed, but it's observed in the shutdown phase, so it's not the one we are chasing...
==12197== Thread 8:
==12197== Invalid read of size 8
==12197==    at 0x9E18869: agmt_has_protocol (repl5_agmt.c:2504)
==12197==    by 0x9E26ECC: prot_thread_main (repl5_protocol.c:373)
==12197==    by 0x36AD829995: _pt_root (ptthread.c:204)
==12197==    by 0x36A24079D0: start_thread (pthread_create.c:301)
==12197==    by 0x36A20E8B3C: clone (clone.S:115)
==12197==  Address 0x837cef8 is 104 bytes inside a block of size 2,808 free'd
==12197==    at 0x4A063F0: free (vg_replace_malloc.c:446)
==12197==    by 0x36A4848865: slapi_ch_free (ch_malloc.c:363)
==12197==    by 0x36A4884DD5: object_release (object.c:113)
==12197==    by 0x36A48850AF: objset_remove_obj (objset.c:271)
==12197==    by 0x9E1C228: agmtlist_shutdown (repl5_agmtlist.c:687)
==12197==    by 0x9E238C6: multimaster_stop (repl5_init.c:675)
==12197==    by 0x36A4890949: plugin_call_func (plugin.c:1453)
==12197==    by 0x36A4890A6A: plugin_dependency_closeall (plugin.c:1421)
==12197==    by 0x419023: slapd_daemon (daemon.c:1333)
==12197==    by 0x41F4EE: main (main.c:1259)

Could you please use the valgrind'ed ns-slapd and make the crash happen?  (Or if you could write a simple script to do it, it'd be nice, too...)

To start M1 server via valgrind, run this command line:
/usr/lib64/dirsrv/slapd-M1/start-slapd.val

It logs the output in /tmp/val.out.

Thank you soooo much for your help, Ján!

Comment 10 Ján Rusnačko 2013-11-01 09:21:35 UTC
Created attachment 818210 [details]
Valgrind output from crashed DS

Comment 11 Ján Rusnačko 2013-11-01 09:25:47 UTC
(In reply to Noriko Hosoi from comment #9)
> Hi Ján,
> 
> Thanks for the gdb.  From your gdb, I could tell a winsync protocol is in
> the event queue, which was already freed, and the staled protocol causes the
> crash.  But I still cannot find why it is left in the queue.  Reading the
> code, when a protocol is freed, it's removed from the queue (by
> slapi_eq_cancel).
> 
> I'd like to see the valgrind output.  I set it up and ran the steps, but I
> still cannot see the crash. :(
> 
> I see this Invalid read.  It should be fixed, but it's observed in the
> shutdown phase, so it's not the one we are chasing...
> ==12197== Thread 8:
> ==12197== Invalid read of size 8
> ==12197==    at 0x9E18869: agmt_has_protocol (repl5_agmt.c:2504)
> ==12197==    by 0x9E26ECC: prot_thread_main (repl5_protocol.c:373)
> ==12197==    by 0x36AD829995: _pt_root (ptthread.c:204)
> ==12197==    by 0x36A24079D0: start_thread (pthread_create.c:301)
> ==12197==    by 0x36A20E8B3C: clone (clone.S:115)
> ==12197==  Address 0x837cef8 is 104 bytes inside a block of size 2,808 free'd
> ==12197==    at 0x4A063F0: free (vg_replace_malloc.c:446)
> ==12197==    by 0x36A4848865: slapi_ch_free (ch_malloc.c:363)
> ==12197==    by 0x36A4884DD5: object_release (object.c:113)
> ==12197==    by 0x36A48850AF: objset_remove_obj (objset.c:271)
> ==12197==    by 0x9E1C228: agmtlist_shutdown (repl5_agmtlist.c:687)
> ==12197==    by 0x9E238C6: multimaster_stop (repl5_init.c:675)
> ==12197==    by 0x36A4890949: plugin_call_func (plugin.c:1453)
> ==12197==    by 0x36A4890A6A: plugin_dependency_closeall (plugin.c:1421)
> ==12197==    by 0x419023: slapd_daemon (daemon.c:1333)
> ==12197==    by 0x41F4EE: main (main.c:1259)
> 
> Could you please use the valgrind'ed ns-slapd and make the crash happen? 
> (Or if you could write a simple script to do it, it'd be nice, too...)
> 
> To start M1 server via valgrind, run this command line:
> /usr/lib64/dirsrv/slapd-M1/start-slapd.val
> 
> It logs the output in /tmp/val.out.
> 
> Thank you soooo much for your help, Ján!
I have reproduced crash with valgrind running, even though this time server crashed even sooner - right after creating first winsync agreement. 

However, bad news is I cannot start M1 instance anymore:
[01/Nov/2013:10:16:57 +0100] - 389-Directory/1.2.11.15 B2013.289.33 starting up
[01/Nov/2013:10:16:57 +0100] - Failed to delete old semaphore for stats file (/var/run/dirsrv/slapd-M1.stats). Error 13 (Permission denied).

due to SElinux:
[root@dstet ~]# ausearch -m avc -ts recent
----
time->Fri Nov  1 10:16:57 2013
type=SYSCALL msg=audit(1383297417.613:1866): arch=c000003e syscall=87 success=no exit=-13 a0=7ffff0f625d0 a1=36a4b08130 a2=0 a3=1 items=0 ppid=1 pid=17075 auid=0 uid=500 gid=500 euid=500 suid=500 fsuid=500 egid=500 sgid=500 fsgid=500 tty=(none) ses=274 comm="ns-slapd" exe="/usr/sbin/ns-slapd" subj=unconfined_u:system_r:dirsrv_t:s0 key=(null)
type=AVC msg=audit(1383297417.613:1866): avc:  denied  { unlink } for  pid=17075 comm="ns-slapd" name="sem.slapd-M1.stats" dev=tmpfs ino=76357 scontext=unconfined_u:system_r:dirsrv_t:s0 tcontext=unconfined_u:object_r:user_tmpfs_t:s0 tclass=file

but I have no idea why:
[root@dstet ~]# ll -Z /var/run/dirsrv/slapd-M1.stats
-rw-r--r--. jrusnack jrusnack unconfined_u:object_r:dirsrv_var_run_t:s0 /var/run/dirsrv/slapd-M1.stats

Comment 12 Rich Megginson 2013-11-01 13:23:54 UTC
It was not able to clean up the file in /dev/shm, and now it can't create a new one because the old one still exists.  see if there is a file in /dev/shm called slapd-*

Comment 13 Ján Rusnačko 2013-11-01 13:32:41 UTC
(In reply to Rich Megginson from comment #12)
> It was not able to clean up the file in /dev/shm, and now it can't create a
> new one because the old one still exists.  see if there is a file in
> /dev/shm called slapd-*
Correct, I deleted sem.slapd-M1.stats, but the instance was still broken:

[01/Nov/2013:14:28:15 +0100] - 389-Directory/1.2.11.15 B2013.289.33 starting up
[01/Nov/2013:14:28:15 +0100] - Detected Disorderly Shutdown last time Directory Server was running, recovering database.
[01/Nov/2013:14:28:15 +0100] NSMMReplicationPlugin - changelog program - _cl5NewDBFile: failed to create semaphore /var/lib/dirsrv/slapd-M1/changelogdb/1b09d882-3b1c11e3-ac858fd0-58bbfde7.sema; NSPR error - -5943
[01/Nov/2013:14:28:15 +0100] NSMMReplicationPlugin - changelog program - cl5Open: failed to open changelog
[01/Nov/2013:14:28:15 +0100] NSMMReplicationPlugin - changelog program - changelog5_init: failed to start changelog at /var/lib/dirsrv/slapd-M1/changelogdb
[01/Nov/2013:14:28:15 +0100] - Failed to start object plugin Multimaster Replication Plugin

[root@dstet shm]# ausearch -m avc -ts recent
----
time->Fri Nov  1 14:27:24 2013
type=SYSCALL msg=audit(1383312444.618:2091): arch=c000003e syscall=64 success=no exit=-13 a0=6100f5c6 a1=1 a2=1b6 a3=7fff3e27afc0 items=0 ppid=1 pid=17884 auid=0 uid=500 gid=500 euid=500 suid=500 fsuid=500 egid=500 sgid=500 fsgid=500 tty=(none) ses=309 comm="ns-slapd" exe="/usr/sbin/ns-slapd" subj=unconfined_u:system_r:dirsrv_t:s0 key=(null)
type=AVC msg=audit(1383312444.618:2091): avc:  denied  { unix_read unix_write } for  pid=17884 comm="ns-slapd" key=1627452870  scontext=unconfined_u:system_r:dirsrv_t:s0 tcontext=unconfined_u:unconfined_r:unconfined_execmem_t:s0-s0:c0.c1023 tclass=sem

I set SELinux into permissive mode, started M1 and set enforcing mode back again .. That seems to have solved the issue.

Thank you !

Comment 14 Ján Rusnačko 2013-11-01 15:07:09 UTC
Created attachment 818321 [details]
Reproducer

Hello Noriko,

I am attaching reproducer for the bug, which works reliably for me. Note 300 second sleep - I did not reproduce crash reliably with shorter one.

Also, I found typo in your steps, instead of

nsDS5ReplicaTransportInfo: startTLS

you should have used

nsDS5ReplicaTransportInfo: TLS

If you specify 'startTLS', simple ldap connection to port 389 will be used (bug or not ?). 

Hope this helps. Let me know.

Comment 15 Noriko Hosoi 2013-11-01 16:45:41 UTC
Thanks for the valgrind output, Ján!  This is exactly what I wanted to see...

==16425== Thread 14:
==16425== Invalid read of size 8
==16425==    at 0x9E425B6: event_notify (windows_inc_protocol.c:1070)
==16425==    by 0x9E428D9: windows_inc_backoff_expired (windows_inc_protocol.c:1657)
==16425==    by 0x36A485E106: eq_loop (eventq.c:312)
==16425==    by 0x36AD829995: _pt_root (ptthread.c:204)
==16425==    by 0x36A24079D0: start_thread (pthread_create.c:301)
==16425==    by 0x36A20E8B3C: clone (clone.S:115)
==16425==  Address 0x4d3e778 is 72 bytes inside a block of size 160 free'd
==16425==    at 0x4A063F0: free (vg_replace_malloc.c:446)
==16425==    by 0x36A4848865: slapi_ch_free (ch_malloc.c:363)
==16425==    by 0x9E42DC5: windows_inc_delete (windows_inc_protocol.c:178)
==16425==    by 0x9E2709F: prot_free (repl5_protocol.c:188)
==16425==    by 0x9E1AD94: agmt_stop (repl5_agmt.c:725)
==16425==    by 0x9E1C48B: agmtlist_delete_callback (repl5_agmtlist.c:581)
==16425==    by 0x36A4852337: dse_call_callback (dse.c:2411)
==16425==    by 0x36A48527C7: dse_delete (dse.c:2307)
==16425==    by 0x36A484C522: op_shared_delete (delete.c:364)
==16425==    by 0x36A484C860: do_delete (delete.c:128)
==16425==    by 0x414173: connection_threadmain (connection.c:584)
==16425==    by 0x36AD829995: _pt_root (ptthread.c:204)
==16425==    by 0x36A24079D0: start_thread (pthread_create.c:301)
==16425==    by 0x36A20E8B3C: clone (clone.S:115)

Also, thanks for the reproducer and the bug in my agreement :p.  I also noticed it yesterday, but still it did not solve my issue by fixing it...  Let me use your reproducer.

Comment 16 Noriko Hosoi 2013-11-01 21:04:19 UTC
Upstream ticket:
https://fedorahosted.org/389/ticket/47581

Comment 17 Noriko Hosoi 2013-11-01 22:11:31 UTC
Once again, thank you so much, Ján.  I could not figure out the cause of the problem without your helps -- stacktraces, valgrind, and the reproducer!  The key was the timing of removing an old agreement and adding a new one.  Most likely, I was pausing too long between the 2 events...

Comment 19 Milan Kubík 2014-07-30 15:01:38 UTC
I ran Jan's reproducer on an older build, successfully reproducing the crash. With the latest build, server didn't crash. The sync is working as expected after initialization. Thus verified.

Build: 
389-ds-base-1.2.11.15-38.el6
Windows Server 2008 R2, up to date.

Comment 20 errata-xmlrpc 2014-10-14 07:51:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1385.html


Note You need to log in before you can comment on or make changes to this bug.