| Summary: | Winsync plugin segfault during incremental backoff | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Ján Rusnačko <jrusnack> | ||||||||
| Component: | 389-ds-base | Assignee: | Rich Megginson <rmeggins> | ||||||||
| Status: | CLOSED ERRATA | QA Contact: | Sankar Ramalingam <sramling> | ||||||||
| Severity: | unspecified | Docs Contact: | |||||||||
| Priority: | unspecified | ||||||||||
| Version: | 6.4 | CC: | jgalipea, jrusnack, mkubik, nhosoi, nkinder | ||||||||
| Target Milestone: | rc | ||||||||||
| Target Release: | --- | ||||||||||
| Hardware: | Unspecified | ||||||||||
| OS: | Unspecified | ||||||||||
| Whiteboard: | |||||||||||
| Fixed In Version: | 389-ds-base-1.2.11.15-34.el6 | Doc Type: | Bug Fix | ||||||||
| Doc Text: | Story Points: | --- | |||||||||
| Clone Of: | |||||||||||
| : | 1027047 (view as bug list) | Environment: | |||||||||
| Last Closed: | 2014-10-14 07:51:53 UTC | Type: | Bug | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Bug Depends On: | |||||||||||
| Bug Blocks: | 1027047, 1061410 | ||||||||||
| Attachments: |
|
||||||||||
Hi Ján, I'm trying to reproduce the problem. If possible, I'd like to fix the bug in the newest branch with which the problem occurs. 1) You tested using the 6.4 build: 389-ds-base-1.2.11.15-22.el6_4.i686 Are there any reason why you chose the 6.4 version? This is the build for 6.5. You don't see the problem with this server? 389-ds-base-1.2.11.15-29.el6 2) In the steps, you did not mention the AD certificate after deleting it in the step 1. Is it correct? Or did you add it back between the step 2 and 3? Also, did you restart the server at some point? Steps to Reproduce: 0. (run under gdb) 1. Set up winsync agreement via STARTTLS but dont import AD certificate (DS will go into incremental backoff since certificate cannot be verified and thus connection cannot be established) 2. Wait for a few minutes, then delete agreement 3. Set up new agreement using SSL/TLS (In reply to Noriko Hosoi from comment #2) > Hi Ján, > > I'm trying to reproduce the problem. If possible, I'd like to fix the bug > in the newest branch with which the problem occurs. > > 1) You tested using the 6.4 build: > 389-ds-base-1.2.11.15-22.el6_4.i686 > > Are there any reason why you chose the 6.4 version? I did upgrade testing, so first I had to set up winsync on RHEL64 and then verify it works after upgrade. I ran into this by accident. >This is the build for > 6.5. You don't see the problem with this server? > 389-ds-base-1.2.11.15-29.el6 I am able to reproduce on RHEL 6.5 > > 2) In the steps, you did not mention the AD certificate after deleting it in > the step 1. Is it correct? Or did you add it back between the step 2 and > 3? Also, did you restart the server at some point? I did not restart server at any point. > Steps to Reproduce: > 0. (run under gdb) > 1. Set up winsync agreement via STARTTLS but dont import AD certificate > (DS will go into incremental backoff since certificate cannot be verified > and thus connection cannot be established) > 2. Wait for a few minutes, then delete agreement > 3. Set up new agreement using SSL/TLS For RHEL 6.5 I followed these steps: 1) Set up STARTTLS without importing AD certificate - DS will complain about missing cert 2) Wait a bit, then delete agreement 3) Create new agreement with SSL/TLS - DS will complain about missing cert 4) Import AD certificate - replication will start but complain that replica has not been initialized. 5) Wait a bit -> server crashes This is error log from RHEL 6.5: [25/Oct/2013:11:10:07 +0200] - 389-Directory/1.2.11.15 B2013.289.33 starting up [25/Oct/2013:11:10:07 +0200] - slapd started. Listening on All Interfaces port 389 for LDAP requests [25/Oct/2013:11:10:07 +0200] - Listening on All Interfaces port 636 for LDAPS requests [25/Oct/2013:11:10:14 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [25/Oct/2013:11:10:14 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:389): Replication bind with SIMPLE auth failed: LDAP error -11 (Connect error) (TLS error -8179:Peer's Certificate issuer is not recognized.) [25/Oct/2013:11:10:14 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [25/Oct/2013:11:10:18 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [25/Oct/2013:11:10:24 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [25/Oct/2013:11:10:36 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [25/Oct/2013:11:11:00 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [25/Oct/2013:11:11:48 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [25/Oct/2013:11:13:24 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [25/Oct/2013:11:15:14 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [25/Oct/2013:11:15:18 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [25/Oct/2013:11:15:24 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [25/Oct/2013:11:15:36 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [25/Oct/2013:11:16:00 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [25/Oct/2013:11:16:36 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [25/Oct/2013:11:16:49 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [25/Oct/2013:11:18:24 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [25/Oct/2013:11:19:12 +0200] NSMMReplicationPlugin - agmt_delete: begin [25/Oct/2013:11:19:57 +0200] slapi_ldap_bind - Error: could not send bind request for id [cn=administrator,cn=users,dc=adrelm,dc=com] mech [SIMPLE]: error -1 (Can't contact LDAP server) -8179 (Peer's Certificate issuer is not recognized.) 115 (Operation now in progress) [25/Oct/2013:11:19:57 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replication bind with SIMPLE auth failed: LDAP error -1 (Can't contact LDAP server) (TLS error -8179:Peer's Certificate issuer is not recognized.) [25/Oct/2013:11:20:00 +0200] slapi_ldap_bind - Error: could not send bind request for id [cn=administrator,cn=users,dc=adrelm,dc=com] mech [SIMPLE]: error -1 (Can't contact LDAP server) -8179 (Peer's Certificate issuer is not recognized.) 115 (Operation now in progress) [25/Oct/2013:11:20:06 +0200] slapi_ldap_bind - Error: could not send bind request for id [cn=administrator,cn=users,dc=adrelm,dc=com] mech [SIMPLE]: error -1 (Can't contact LDAP server) -8179 (Peer's Certificate issuer is not recognized.) 115 (Operation now in progress) [25/Oct/2013:11:20:18 +0200] slapi_ldap_bind - Error: could not send bind request for id [cn=administrator,cn=users,dc=adrelm,dc=com] mech [SIMPLE]: error -1 (Can't contact LDAP server) -8179 (Peer's Certificate issuer is not recognized.) 115 (Operation now in progress) [25/Oct/2013:11:20:42 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replication bind with SIMPLE auth resumed [25/Oct/2013:11:20:42 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized. [25/Oct/2013:11:20:45 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized. [25/Oct/2013:11:20:49 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized. [25/Oct/2013:11:20:53 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized. [25/Oct/2013:11:20:57 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized. [25/Oct/2013:11:21:00 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized. [25/Oct/2013:11:21:03 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized. [25/Oct/2013:11:21:06 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized. [25/Oct/2013:11:21:09 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized. [25/Oct/2013:11:21:12 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized. [25/Oct/2013:11:21:15 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized. [25/Oct/2013:11:21:19 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized. [25/Oct/2013:11:21:23 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized. [25/Oct/2013:11:21:27 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized. [25/Oct/2013:11:21:31 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized. [25/Oct/2013:11:21:34 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized. [25/Oct/2013:11:34:57 +0200] - 389-Directory/1.2.11.15 B2013.289.33 starting up [25/Oct/2013:11:34:57 +0200] - Detected Disorderly Shutdown last time Directory Server was running, recovering database. [25/Oct/2013:11:34:59 +0200] - slapd started. Listening on All Interfaces port 389 for LDAP requests [25/Oct/2013:11:34:59 +0200] - Listening on All Interfaces port 636 for LDAPS requests [25/Oct/2013:11:34:59 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized. [25/Oct/2013:11:34:59 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:636): Replica has no update vector. It has never been initialized. Thank you, Ján, for testing with rhel-6.5 bits and providing the detailed steps! Hi, Ján. I'm trying to reproduce the problem with rhel-6.5 build, but so far no luck. I'm wondering if the crash problem occurs on the x86_64 system, as well? Or is it only on the 32 bit box? 389-ds-base-1.2.11.15-22.el6_4.i686 If okay, could you let me access your test machine/vm? I'd like to see how prp->lock could get turned to NULL... (In reply to Noriko Hosoi from comment #5) > Hi, Ján. I'm trying to reproduce the problem with rhel-6.5 build, but so > far no luck. > > I'm wondering if the crash problem occurs on the x86_64 system, as well? Or > is it only on the 32 bit box? > 389-ds-base-1.2.11.15-22.el6_4.i686 > > If okay, could you let me access your test machine/vm? I'd like to see how > prp->lock could get turned to NULL... Yes, I am able to reproduce also on 64bit RHEL65. I have sent you details to machine with VMs via mail. Hi Ján, there should be some magic I'm missing. I still cannot reproduce the problem on your VM. Here's what I did in details... 1. Make sure no WinSync agreement nor AD cert. 2. Add a startTLS agreement $ ldapmodify -x -h localhost -p 389 -D 'cn=directory manager' -w <pw> -a << EOF dn: cn=M1toAD,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config objectClass: top objectClass: nsDSWindowsReplicationAgreement description: M1toAd cn: M1toAD nsds7WindowsReplicaSubtree: ou=people,dc=adrelm,dc=com nsds7DirectoryReplicaSubtree: ou=People, dc=example,dc=com nsds7NewWinUserSyncEnabled: on nsds7NewWinGroupSyncEnabled: on nsds7WindowsDomain: adrelm.com nsDS5ReplicaRoot: dc=example,dc=com nsDS5ReplicaHost: winsyncad.adrelm.com nsDS5ReplicaPort: 389 nsDS5ReplicaBindDN: cn=administrator,cn=users,dc=adrelm,dc=com nsDS5ReplicaTransportInfo: startTLS nsDS5ReplicaBindMethod: SIMPLE nsDS5ReplicaCredentials: Secret123 EOF 3.Wait a bit, then delete the agreement $ ldapdelete ... cn=M1toAD,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config 4. Create new agreement with SSL/TLS - DS will complain about missing cert $ ldapmodify -x -h localhost -p 389 -D 'cn=directory manager' -w <pw> -a << EOF dn: cn=M1toAD,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config objectClass: top objectClass: nsDSWindowsReplicationAgreement description: M1toAd cn: M1toAD nsds7WindowsReplicaSubtree: ou=people,dc=adrelm,dc=com nsds7DirectoryReplicaSubtree: ou=People, dc=example,dc=com nsds7NewWinUserSyncEnabled: on nsds7NewWinGroupSyncEnabled: on nsds7WindowsDomain: adrelm.com nsDS5ReplicaRoot: dc=example,dc=com nsDS5ReplicaHost: winsyncad.adrelm.com nsDS5ReplicaPort: 636 nsDS5ReplicaBindDN: cn=administrator,cn=users,dc=adrelm,dc=com nsDS5ReplicaTransportInfo: SSL nsDS5ReplicaBindMethod: SIMPLE nsDS5ReplicaCredentials: Secret123 5. Import AD certificate - replication will start but complain that replica has not been initialized. $ certutil -A -d . -n "adrelm-WINSYNCAD-CA" -t "CT,," -a -i /home/jrusnack/certificates/adcert.cer I've been waiting for about 10 minutes, but the server does not crash... Please advice... (In reply to Noriko Hosoi from comment #7) > Hi Ján, there should be some magic I'm missing. I still cannot reproduce > the problem on your VM. Here's what I did in details... > > 1. Make sure no WinSync agreement nor AD cert. > 2. Add a startTLS agreement > $ ldapmodify -x -h localhost -p 389 -D 'cn=directory manager' -w <pw> -a << > EOF > dn: cn=M1toAD,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config > objectClass: top > objectClass: nsDSWindowsReplicationAgreement > description: M1toAd > cn: M1toAD > nsds7WindowsReplicaSubtree: ou=people,dc=adrelm,dc=com > nsds7DirectoryReplicaSubtree: ou=People, dc=example,dc=com > nsds7NewWinUserSyncEnabled: on > nsds7NewWinGroupSyncEnabled: on > nsds7WindowsDomain: adrelm.com > nsDS5ReplicaRoot: dc=example,dc=com > nsDS5ReplicaHost: winsyncad.adrelm.com > nsDS5ReplicaPort: 389 > nsDS5ReplicaBindDN: cn=administrator,cn=users,dc=adrelm,dc=com > nsDS5ReplicaTransportInfo: startTLS > nsDS5ReplicaBindMethod: SIMPLE > nsDS5ReplicaCredentials: Secret123 > EOF > 3.Wait a bit, then delete the agreement > $ ldapdelete ... > cn=M1toAD,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config > 4. Create new agreement with SSL/TLS - DS will complain about missing cert > $ ldapmodify -x -h localhost -p 389 -D 'cn=directory manager' -w <pw> -a << > EOF > dn: cn=M1toAD,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config > objectClass: top > objectClass: nsDSWindowsReplicationAgreement > description: M1toAd > cn: M1toAD > nsds7WindowsReplicaSubtree: ou=people,dc=adrelm,dc=com > nsds7DirectoryReplicaSubtree: ou=People, dc=example,dc=com > nsds7NewWinUserSyncEnabled: on > nsds7NewWinGroupSyncEnabled: on > nsds7WindowsDomain: adrelm.com > nsDS5ReplicaRoot: dc=example,dc=com > nsDS5ReplicaHost: winsyncad.adrelm.com > nsDS5ReplicaPort: 636 > nsDS5ReplicaBindDN: cn=administrator,cn=users,dc=adrelm,dc=com > nsDS5ReplicaTransportInfo: SSL > nsDS5ReplicaBindMethod: SIMPLE > nsDS5ReplicaCredentials: Secret123 > 5. Import AD certificate - replication will start but complain that replica > has not been initialized. > $ certutil -A -d . -n "adrelm-WINSYNCAD-CA" -t "CT,," -a -i > /home/jrusnack/certificates/adcert.cer > > I've been waiting for about 10 minutes, but the server does not crash... > Please advice... Steps look OK. The only thing I can think of is that the AD cert is not fully removed. Please note that if you have running M1 and do 1) remove winsync agreement 2) remove AD cert 3) setup new winsync agreement then M1 will NOT complain that certificate is not recognized. It somehow caches the cert (so connection will be OK) and will complain that replica has not been initialized yet. So please try your steps again, but after step 1) delete winsync agreement and AD cert also restart M1. That will flush the cert. After step 2), please make sure you see [25/Oct/2013:11:10:14 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [25/Oct/2013:11:10:14 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsyncad:389): Replication bind with SIMPLE auth failed: LDAP error -11 (Connect error) (TLS error -8179:Peer's Certificate issuer is not recognized.) [25/Oct/2013:11:10:14 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) in error logs. I have logged to the machine I shared with you and reproduced the bug. I left there opened terminals, one is tailing M1 error log, the other one has opened gdb attached to M1 with segfault. Let me know if this helped. Hi Ján, Thanks for the gdb. From your gdb, I could tell a winsync protocol is in the event queue, which was already freed, and the staled protocol causes the crash. But I still cannot find why it is left in the queue. Reading the code, when a protocol is freed, it's removed from the queue (by slapi_eq_cancel). I'd like to see the valgrind output. I set it up and ran the steps, but I still cannot see the crash. :( I see this Invalid read. It should be fixed, but it's observed in the shutdown phase, so it's not the one we are chasing... ==12197== Thread 8: ==12197== Invalid read of size 8 ==12197== at 0x9E18869: agmt_has_protocol (repl5_agmt.c:2504) ==12197== by 0x9E26ECC: prot_thread_main (repl5_protocol.c:373) ==12197== by 0x36AD829995: _pt_root (ptthread.c:204) ==12197== by 0x36A24079D0: start_thread (pthread_create.c:301) ==12197== by 0x36A20E8B3C: clone (clone.S:115) ==12197== Address 0x837cef8 is 104 bytes inside a block of size 2,808 free'd ==12197== at 0x4A063F0: free (vg_replace_malloc.c:446) ==12197== by 0x36A4848865: slapi_ch_free (ch_malloc.c:363) ==12197== by 0x36A4884DD5: object_release (object.c:113) ==12197== by 0x36A48850AF: objset_remove_obj (objset.c:271) ==12197== by 0x9E1C228: agmtlist_shutdown (repl5_agmtlist.c:687) ==12197== by 0x9E238C6: multimaster_stop (repl5_init.c:675) ==12197== by 0x36A4890949: plugin_call_func (plugin.c:1453) ==12197== by 0x36A4890A6A: plugin_dependency_closeall (plugin.c:1421) ==12197== by 0x419023: slapd_daemon (daemon.c:1333) ==12197== by 0x41F4EE: main (main.c:1259) Could you please use the valgrind'ed ns-slapd and make the crash happen? (Or if you could write a simple script to do it, it'd be nice, too...) To start M1 server via valgrind, run this command line: /usr/lib64/dirsrv/slapd-M1/start-slapd.val It logs the output in /tmp/val.out. Thank you soooo much for your help, Ján! Created attachment 818210 [details]
Valgrind output from crashed DS
(In reply to Noriko Hosoi from comment #9) > Hi Ján, > > Thanks for the gdb. From your gdb, I could tell a winsync protocol is in > the event queue, which was already freed, and the staled protocol causes the > crash. But I still cannot find why it is left in the queue. Reading the > code, when a protocol is freed, it's removed from the queue (by > slapi_eq_cancel). > > I'd like to see the valgrind output. I set it up and ran the steps, but I > still cannot see the crash. :( > > I see this Invalid read. It should be fixed, but it's observed in the > shutdown phase, so it's not the one we are chasing... > ==12197== Thread 8: > ==12197== Invalid read of size 8 > ==12197== at 0x9E18869: agmt_has_protocol (repl5_agmt.c:2504) > ==12197== by 0x9E26ECC: prot_thread_main (repl5_protocol.c:373) > ==12197== by 0x36AD829995: _pt_root (ptthread.c:204) > ==12197== by 0x36A24079D0: start_thread (pthread_create.c:301) > ==12197== by 0x36A20E8B3C: clone (clone.S:115) > ==12197== Address 0x837cef8 is 104 bytes inside a block of size 2,808 free'd > ==12197== at 0x4A063F0: free (vg_replace_malloc.c:446) > ==12197== by 0x36A4848865: slapi_ch_free (ch_malloc.c:363) > ==12197== by 0x36A4884DD5: object_release (object.c:113) > ==12197== by 0x36A48850AF: objset_remove_obj (objset.c:271) > ==12197== by 0x9E1C228: agmtlist_shutdown (repl5_agmtlist.c:687) > ==12197== by 0x9E238C6: multimaster_stop (repl5_init.c:675) > ==12197== by 0x36A4890949: plugin_call_func (plugin.c:1453) > ==12197== by 0x36A4890A6A: plugin_dependency_closeall (plugin.c:1421) > ==12197== by 0x419023: slapd_daemon (daemon.c:1333) > ==12197== by 0x41F4EE: main (main.c:1259) > > Could you please use the valgrind'ed ns-slapd and make the crash happen? > (Or if you could write a simple script to do it, it'd be nice, too...) > > To start M1 server via valgrind, run this command line: > /usr/lib64/dirsrv/slapd-M1/start-slapd.val > > It logs the output in /tmp/val.out. > > Thank you soooo much for your help, Ján! I have reproduced crash with valgrind running, even though this time server crashed even sooner - right after creating first winsync agreement. However, bad news is I cannot start M1 instance anymore: [01/Nov/2013:10:16:57 +0100] - 389-Directory/1.2.11.15 B2013.289.33 starting up [01/Nov/2013:10:16:57 +0100] - Failed to delete old semaphore for stats file (/var/run/dirsrv/slapd-M1.stats). Error 13 (Permission denied). due to SElinux: [root@dstet ~]# ausearch -m avc -ts recent ---- time->Fri Nov 1 10:16:57 2013 type=SYSCALL msg=audit(1383297417.613:1866): arch=c000003e syscall=87 success=no exit=-13 a0=7ffff0f625d0 a1=36a4b08130 a2=0 a3=1 items=0 ppid=1 pid=17075 auid=0 uid=500 gid=500 euid=500 suid=500 fsuid=500 egid=500 sgid=500 fsgid=500 tty=(none) ses=274 comm="ns-slapd" exe="/usr/sbin/ns-slapd" subj=unconfined_u:system_r:dirsrv_t:s0 key=(null) type=AVC msg=audit(1383297417.613:1866): avc: denied { unlink } for pid=17075 comm="ns-slapd" name="sem.slapd-M1.stats" dev=tmpfs ino=76357 scontext=unconfined_u:system_r:dirsrv_t:s0 tcontext=unconfined_u:object_r:user_tmpfs_t:s0 tclass=file but I have no idea why: [root@dstet ~]# ll -Z /var/run/dirsrv/slapd-M1.stats -rw-r--r--. jrusnack jrusnack unconfined_u:object_r:dirsrv_var_run_t:s0 /var/run/dirsrv/slapd-M1.stats It was not able to clean up the file in /dev/shm, and now it can't create a new one because the old one still exists. see if there is a file in /dev/shm called slapd-* (In reply to Rich Megginson from comment #12) > It was not able to clean up the file in /dev/shm, and now it can't create a > new one because the old one still exists. see if there is a file in > /dev/shm called slapd-* Correct, I deleted sem.slapd-M1.stats, but the instance was still broken: [01/Nov/2013:14:28:15 +0100] - 389-Directory/1.2.11.15 B2013.289.33 starting up [01/Nov/2013:14:28:15 +0100] - Detected Disorderly Shutdown last time Directory Server was running, recovering database. [01/Nov/2013:14:28:15 +0100] NSMMReplicationPlugin - changelog program - _cl5NewDBFile: failed to create semaphore /var/lib/dirsrv/slapd-M1/changelogdb/1b09d882-3b1c11e3-ac858fd0-58bbfde7.sema; NSPR error - -5943 [01/Nov/2013:14:28:15 +0100] NSMMReplicationPlugin - changelog program - cl5Open: failed to open changelog [01/Nov/2013:14:28:15 +0100] NSMMReplicationPlugin - changelog program - changelog5_init: failed to start changelog at /var/lib/dirsrv/slapd-M1/changelogdb [01/Nov/2013:14:28:15 +0100] - Failed to start object plugin Multimaster Replication Plugin [root@dstet shm]# ausearch -m avc -ts recent ---- time->Fri Nov 1 14:27:24 2013 type=SYSCALL msg=audit(1383312444.618:2091): arch=c000003e syscall=64 success=no exit=-13 a0=6100f5c6 a1=1 a2=1b6 a3=7fff3e27afc0 items=0 ppid=1 pid=17884 auid=0 uid=500 gid=500 euid=500 suid=500 fsuid=500 egid=500 sgid=500 fsgid=500 tty=(none) ses=309 comm="ns-slapd" exe="/usr/sbin/ns-slapd" subj=unconfined_u:system_r:dirsrv_t:s0 key=(null) type=AVC msg=audit(1383312444.618:2091): avc: denied { unix_read unix_write } for pid=17884 comm="ns-slapd" key=1627452870 scontext=unconfined_u:system_r:dirsrv_t:s0 tcontext=unconfined_u:unconfined_r:unconfined_execmem_t:s0-s0:c0.c1023 tclass=sem I set SELinux into permissive mode, started M1 and set enforcing mode back again .. That seems to have solved the issue. Thank you ! Created attachment 818321 [details]
Reproducer
Hello Noriko,
I am attaching reproducer for the bug, which works reliably for me. Note 300 second sleep - I did not reproduce crash reliably with shorter one.
Also, I found typo in your steps, instead of
nsDS5ReplicaTransportInfo: startTLS
you should have used
nsDS5ReplicaTransportInfo: TLS
If you specify 'startTLS', simple ldap connection to port 389 will be used (bug or not ?).
Hope this helps. Let me know.
Thanks for the valgrind output, Ján! This is exactly what I wanted to see... ==16425== Thread 14: ==16425== Invalid read of size 8 ==16425== at 0x9E425B6: event_notify (windows_inc_protocol.c:1070) ==16425== by 0x9E428D9: windows_inc_backoff_expired (windows_inc_protocol.c:1657) ==16425== by 0x36A485E106: eq_loop (eventq.c:312) ==16425== by 0x36AD829995: _pt_root (ptthread.c:204) ==16425== by 0x36A24079D0: start_thread (pthread_create.c:301) ==16425== by 0x36A20E8B3C: clone (clone.S:115) ==16425== Address 0x4d3e778 is 72 bytes inside a block of size 160 free'd ==16425== at 0x4A063F0: free (vg_replace_malloc.c:446) ==16425== by 0x36A4848865: slapi_ch_free (ch_malloc.c:363) ==16425== by 0x9E42DC5: windows_inc_delete (windows_inc_protocol.c:178) ==16425== by 0x9E2709F: prot_free (repl5_protocol.c:188) ==16425== by 0x9E1AD94: agmt_stop (repl5_agmt.c:725) ==16425== by 0x9E1C48B: agmtlist_delete_callback (repl5_agmtlist.c:581) ==16425== by 0x36A4852337: dse_call_callback (dse.c:2411) ==16425== by 0x36A48527C7: dse_delete (dse.c:2307) ==16425== by 0x36A484C522: op_shared_delete (delete.c:364) ==16425== by 0x36A484C860: do_delete (delete.c:128) ==16425== by 0x414173: connection_threadmain (connection.c:584) ==16425== by 0x36AD829995: _pt_root (ptthread.c:204) ==16425== by 0x36A24079D0: start_thread (pthread_create.c:301) ==16425== by 0x36A20E8B3C: clone (clone.S:115) Also, thanks for the reproducer and the bug in my agreement :p. I also noticed it yesterday, but still it did not solve my issue by fixing it... Let me use your reproducer. Upstream ticket: https://fedorahosted.org/389/ticket/47581 Once again, thank you so much, Ján. I could not figure out the cause of the problem without your helps -- stacktraces, valgrind, and the reproducer! The key was the timing of removing an old agreement and adding a new one. Most likely, I was pausing too long between the 2 events... I ran Jan's reproducer on an older build, successfully reproducing the crash. With the latest build, server didn't crash. The sync is working as expected after initialization. Thus verified. Build: 389-ds-base-1.2.11.15-38.el6 Windows Server 2008 R2, up to date. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-1385.html |
Created attachment 815376 [details] Stacktraces Description of problem: Directory server segfaults when synchronization agreement is set up with Active Directory. Crash occurs when DS instance cannot contact AD and goes into incremental backoff. Version-Release number of selected component (if applicable): 389-ds-base-1.2.11.15-22.el6_4.i686 How reproducible: always Steps to Reproduce: 0. (run under gdb) 1. Set up winsync agreement via STARTTLS but dont import AD certificate (DS will go into incremental backoff since certificate cannot be verified and thus connection cannot be established) 2. Wait for a few minutes, then delete agreement 3. Set up new agreement using SSL/TLS Error log: [23/Oct/2013:13:39:16 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:39:16 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsync:389): Replication bind with SIMPLE auth failed: LDAP error -11 (Connect error) (TLS error -8179:Peer's Certificate issuer is not recognized.) [23/Oct/2013:13:39:16 +0200] - slapd started. Listening on All Interfaces port 389 for LDAP requests [23/Oct/2013:13:39:16 +0200] - Listening on All Interfaces port 636 for LDAPS requests [23/Oct/2013:13:39:16 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:39:20 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:39:27 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:39:38 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:40:02 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:40:50 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:42:26 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:44:16 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:44:19 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:44:25 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:44:37 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:45:01 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:45:38 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:45:49 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:47:25 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:49:16 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:49:20 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:49:26 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:49:38 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:50:02 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:50:37 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:50:50 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:52:26 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:54:16 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:54:19 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:54:25 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:54:37 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:55:01 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:55:37 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:55:38 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:55:49 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:57:25 +0200] slapi_ldap_bind - Error: could not send startTLS request: error -11 (Connect error) errno 0 (Success) [23/Oct/2013:13:59:05 +0200] NSMMReplicationPlugin - agmt_delete: begin [23/Oct/2013:14:00:21 +0200] slapi_ldap_bind - Error: could not send bind request for id [cn=administrator,cn=users,dc=adrelm,dc=com] mech [SIMPLE]: error -1 (Can't contact LDAP server) -8179 (Peer's Certificate issuer is not recognized.) 115 (Operation now in progress) [23/Oct/2013:14:00:21 +0200] NSMMReplicationPlugin - agmt="cn=M1toAD" (winsync:636): Replication bind with SIMPLE auth failed: LDAP error -1 (Can't contact LDAP server) (TLS error -8179:Peer's Certificate issuer is not recognized.) [23/Oct/2013:14:00:24 +0200] slapi_ldap_bind - Error: could not send bind request for id [cn=administrator,cn=users,dc=adrelm,dc=com] mech [SIMPLE]: error -1 (Can't contact LDAP server) -8179 (Peer's Certificate issuer is not recognized.) 115 (Operation now in progress) [23/Oct/2013:14:00:30 +0200] slapi_ldap_bind - Error: could not send bind request for id [cn=administrator,cn=users,dc=adrelm,dc=com] mech [SIMPLE]: error -1 (Can't contact LDAP server) -8179 (Peer's Certificate issuer is not recognized.) 115 (Operation now in progress) Output from GDB: (gdb) cont Continuing. [Thread 0xb20feb70 (LWP 6986) exited] [New Thread 0xb20feb70 (LWP 10246)] Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0xadaf7b70 (LWP 6993)] __pthread_mutex_lock (mutex=0x0) at pthread_mutex_lock.c:50 50 unsigned int type = PTHREAD_MUTEX_TYPE (mutex); (gdb) bt #0 __pthread_mutex_lock (mutex=0x0) at pthread_mutex_lock.c:50 #1 0x04752e63 in PR_Lock (lock=0x0) at ../../../mozilla/nsprpub/pr/src/pthreads/ptsynch.c:174 #2 0x00623dd6 in event_notify (prp=0x89e1508, event=<value optimized out>) at ldap/servers/plugins/replication/windows_inc_protocol.c:1069 #3 0x006240a0 in windows_inc_backoff_expired (timer_fire_time=1382529637, arg=0x89e1508) at ldap/servers/plugins/replication/windows_inc_protocol.c:1656 #4 0x00c12f11 in eq_call_all (arg=0x0) at ldap/servers/slapd/eventq.c:312 #5 eq_loop (arg=0x0) at ldap/servers/slapd/eventq.c:359 #6 0x04759342 in _pt_root (arg=0x87d4f60) at ../../../mozilla/nsprpub/pr/src/pthreads/ptthread.c:156 #7 0x009f4a49 in start_thread (arg=0xadaf7b70) at pthread_create.c:301 #8 0x00930aae in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:133