Description of problem: when running reliab15 ns-slapd crashes with assertions and/or segfaults Seems to be related to having HPUX in the mix. Version-Release number of selected component (if applicable): 8.1 daily from Mar-4 How reproducible: always Steps to Reproduce: 1. setup up reliab15. Include at least one HPUX machines as a master. 2. run the test 3. observe the HPUX master Actual results: The HPUX machine exits at some point Additional info: Attached files will be the gdb output from the HPUX machine, and the replica it communicates with. In this setup, there is 4 masters: zeta, baggins, delta, and switch All of these machines are in the dsdev.sjc.redhat.com domain. these machines have been left running, and not cleaned up for observation.
I'm testing on RHEL5 x86_64. I have not been able to reproduce the crash, but I have reproduced the bad sequence of error log messages like this: [06/Mar/2009:10:20:44 -0700] NSMMReplicationPlugin - agmt="cn=meTolocalhost.localdomain1120" (localhost:1120): Consumer failed to replay change (uniqueid (null), CSN (null)): Bad parameter to an ldap routine. Will retry later. I set up 2 way mmr. On m1, I send a bunch of add entry requests. I view the replication status to see that it is sending the ops to m2. Then I shutdown m2. Then I see lots of these. I think I have a fix for this, but no crash so far.
Note: gdb /path/to/ns-slapd /path/to/corefile (gdb) set logging file /tmp/bt.log (gdb) set logging on (gdb) thread apply all bt (gdb) q The full bt will be in /tmp/bt.log
Created attachment 334335 [details] patch
Created attachment 334344 [details] cvs commit log Reviewed by: nhosoi (Thanks!) Fix Description: I could not reproduce the crash, but I think the problem is that the server is not handling the disconnection case correctly. It seems that in the event of disconnection (LDAP_SERVER_DOWN 81 - Can't contact server) the code would continue to read results. repl5_inc_result_threadmain() will call conn_read_result_ex() in a loop. If conn_read_result_ex() detects a disconnection or an unrecoverable error, it will call conn_disconnect to close the connection, and return CONN_NOT_CONNECTED. Once this happens, the code must not use conn->ld any more. However, the code did not differentiate between the not connected case and other errors, so it would keep trying to read results (in case some errors are recoverable, the thread still has to read all of the pending results). The code has been fixed to handle disconnect cases specially. I also added some additional locking to make sure the result and the abort flags were set/read correctly. Finally, I changed the code that waits for results to come in, so that if the connection has been closed, it will just return immediately. Platforms tested: RHEL5 x86_64 Flag Day: no Doc impact: no
We hit an assertion again using debug bits while running reliab15 on HP-UX: Program received signal SIGABRT, Aborted si_code: 0 - SI_UNKNOWN - signal of unknown origin. [Switching to thread 1324 (system thread 5571)] 0xc00000000031fcf0:0 in kill+0x30 () from /usr/lib/hpux64/libc.so.1 (gdb) bt #0 0xc00000000031fcf0:0 in kill+0x30 () from /usr/lib/hpux64/libc.so.1 #1 0xc000000000244c90:0 in raise+0x30 () from /usr/lib/hpux64/libc.so.1 #2 0xc0000000002e26b0:0 in abort+0x190 () from /usr/lib/hpux64/libc.so.1 #3 0xc0000000036cc930:0 in PR_Assert+0x110 () from /opt/dirsrv/libnspr4.so #4 0xc000000003711450:0 in PR_DestroyLock+0x120 () from /opt/dirsrv/libnspr4.so #5 0xc00000000299da70:0 in prldap_mutex_free+0x40 () from /opt/dirsrv/libprldap60.so #6 0xc00000000315fa20:0 in nsldapi_mutex_free_all+0x130 () from /opt/dirsrv/libldap60.so #7 0xc0000000031ac180:0 in ldap_ld_free+0x1b60 () from /opt/dirsrv/libldap60.so #8 0xc0000000031aa600:0 in ldap_unbind+0x160 () from /opt/dirsrv/libldap60.so #9 0xc000000002956c40:0 in slapi_ldap_unbind (ld=0x6000000002992760) at ../ldapserver/ldap/servers/slapd/util.c:855 #10 0xc000000004d9f670:0 in close_connection_internal (conn=0x6000000000b59da0) at ../ldapserver/ldap/servers/plugins/replication/repl5_connection.c:1066 #11 0xc000000004da0990:0 in conn_disconnect (conn=0x6000000000b59da0) at ../ldapserver/ldap/servers/plugins/replication/repl5_connection.c:1082 #12 0xc000000004d9fe70:0 in conn_read_result_ex (conn=0x6000000000b59da0, retoidp=0x0, retdatap=0x0, returned_controls=0x0, message_id=0x9fffffff7e27cfe0, block=0) at ../ldapserver/ldap/servers/plugins/replication/repl5_connection.c:359 #13 0xc000000004da7cf0:0 in repl5_inc_result_threadmain ( param=0x6000000006edcc70) at ../ldapserver/ldap/servers/plugins/replication/repl5_inc_protocol.c:296 #14 0xc00000000372c840:0 in _pt_root+0x3e0 () from /opt/dirsrv/libnspr4.so #15 0xc0000000000b4320:0 in __pthread_bound_body+0x190 () from /usr/lib/hpux64/libpthread.so.1
Created attachment 334743 [details] patch2
Created attachment 334797 [details] cvs commit log Reviewed by: nkinder (Thanks!) Fix Description: My earlier fix was for the case where the result reader thread disconnects. But it looks like there is still a problem if the update sender thread disconnects out from under the reader thread. We need to use conn_connected() to test to see if the connection is connected before we attempt to access conn->ld in the result reader thread. I also improved the error messages so that I could tell if the errors were coming from the update sender thread or the result reader thread. Platforms tested: RHEL5 Flag Day: no Doc impact: no
Still crashing - similar place - but this time it took a lot longer to crash - so looks like on the right track with locking protection
Created attachment 334885 [details] patch3
Created attachment 334887 [details] cvs commit log Reviewed by: nkinder (Thanks!) Fix Description: There was still a small window of time during which the connection could be closed out from under the other thread which was sending/reading result. The solution is to use explicit locking using the conn->lock to protect access to the conn->ld. Since this also affected the total update code, I tested it under similar conditions, and found that it exhibited the same behavior. I added checking to the total update code to check for disconnection and coordinate access in the entry sending/result reading threads. I also fixed a spurious error message about the sasl path. Platforms tested: RHEL5 Flag Day: no Doc impact: no
Changed the summary - the crash/assertions would occur when the supplier received certain errors from the consumer, like server down or timeouts, that caused the supplier to disconnect and attempt to re-establish the connection.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHEA-2009-0455.html