Bug 488866
Summary: | crash/assert in replication when supplier attempts to reconnect to consumer during init or update | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Directory Server | Reporter: | Michael Gregg <mgregg> | ||||||||||||||
Component: | Replication - General | Assignee: | Rich Megginson <rmeggins> | ||||||||||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Chandrasekar Kannan <ckannan> | ||||||||||||||
Severity: | medium | Docs Contact: | |||||||||||||||
Priority: | medium | ||||||||||||||||
Version: | 8.1 | CC: | benl, mgregg, nhosoi, nkinder, rmeggins | ||||||||||||||
Target Milestone: | --- | ||||||||||||||||
Target Release: | --- | ||||||||||||||||
Hardware: | All | ||||||||||||||||
OS: | All | ||||||||||||||||
Whiteboard: | |||||||||||||||||
Fixed In Version: | 8.1 | Doc Type: | Bug Fix | ||||||||||||||
Doc Text: | Story Points: | --- | |||||||||||||||
Clone Of: | Environment: | ||||||||||||||||
Last Closed: | 2009-04-29 23:11:15 UTC | Type: | --- | ||||||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||||||
Documentation: | --- | CRM: | |||||||||||||||
Verified Versions: | Category: | --- | |||||||||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||||||
Embargoed: | |||||||||||||||||
Bug Depends On: | |||||||||||||||||
Bug Blocks: | 249650, 493682 | ||||||||||||||||
Attachments: |
|
Description
Michael Gregg
2009-03-06 01:01:02 UTC
I'm testing on RHEL5 x86_64. I have not been able to reproduce the crash, but I have reproduced the bad sequence of error log messages like this: [06/Mar/2009:10:20:44 -0700] NSMMReplicationPlugin - agmt="cn=meTolocalhost.localdomain1120" (localhost:1120): Consumer failed to replay change (uniqueid (null), CSN (null)): Bad parameter to an ldap routine. Will retry later. I set up 2 way mmr. On m1, I send a bunch of add entry requests. I view the replication status to see that it is sending the ops to m2. Then I shutdown m2. Then I see lots of these. I think I have a fix for this, but no crash so far. Note: gdb /path/to/ns-slapd /path/to/corefile (gdb) set logging file /tmp/bt.log (gdb) set logging on (gdb) thread apply all bt (gdb) q The full bt will be in /tmp/bt.log Created attachment 334335 [details]
patch
Created attachment 334344 [details]
cvs commit log
Reviewed by: nhosoi (Thanks!)
Fix Description: I could not reproduce the crash, but I think the problem is that the server is not handling the disconnection case correctly. It seems that in the event of disconnection (LDAP_SERVER_DOWN 81 - Can't contact server) the code would continue to read results.
repl5_inc_result_threadmain() will call conn_read_result_ex() in a loop. If conn_read_result_ex() detects a disconnection or an unrecoverable error, it will call conn_disconnect to close the connection, and return CONN_NOT_CONNECTED. Once this happens, the code must not use conn->ld any more. However, the code did not differentiate between the not connected case and other errors, so it would keep trying to read results (in case some errors are recoverable, the thread still has to read all of the pending results). The code has been fixed to handle disconnect cases specially. I also added some additional locking to make sure the result and the abort flags were set/read correctly. Finally, I changed the code that waits for results to come in, so that if the connection has been closed, it will just return immediately.
Platforms tested: RHEL5 x86_64
Flag Day: no
Doc impact: no
We hit an assertion again using debug bits while running reliab15 on HP-UX: Program received signal SIGABRT, Aborted si_code: 0 - SI_UNKNOWN - signal of unknown origin. [Switching to thread 1324 (system thread 5571)] 0xc00000000031fcf0:0 in kill+0x30 () from /usr/lib/hpux64/libc.so.1 (gdb) bt #0 0xc00000000031fcf0:0 in kill+0x30 () from /usr/lib/hpux64/libc.so.1 #1 0xc000000000244c90:0 in raise+0x30 () from /usr/lib/hpux64/libc.so.1 #2 0xc0000000002e26b0:0 in abort+0x190 () from /usr/lib/hpux64/libc.so.1 #3 0xc0000000036cc930:0 in PR_Assert+0x110 () from /opt/dirsrv/libnspr4.so #4 0xc000000003711450:0 in PR_DestroyLock+0x120 () from /opt/dirsrv/libnspr4.so #5 0xc00000000299da70:0 in prldap_mutex_free+0x40 () from /opt/dirsrv/libprldap60.so #6 0xc00000000315fa20:0 in nsldapi_mutex_free_all+0x130 () from /opt/dirsrv/libldap60.so #7 0xc0000000031ac180:0 in ldap_ld_free+0x1b60 () from /opt/dirsrv/libldap60.so #8 0xc0000000031aa600:0 in ldap_unbind+0x160 () from /opt/dirsrv/libldap60.so #9 0xc000000002956c40:0 in slapi_ldap_unbind (ld=0x6000000002992760) at ../ldapserver/ldap/servers/slapd/util.c:855 #10 0xc000000004d9f670:0 in close_connection_internal (conn=0x6000000000b59da0) at ../ldapserver/ldap/servers/plugins/replication/repl5_connection.c:1066 #11 0xc000000004da0990:0 in conn_disconnect (conn=0x6000000000b59da0) at ../ldapserver/ldap/servers/plugins/replication/repl5_connection.c:1082 #12 0xc000000004d9fe70:0 in conn_read_result_ex (conn=0x6000000000b59da0, retoidp=0x0, retdatap=0x0, returned_controls=0x0, message_id=0x9fffffff7e27cfe0, block=0) at ../ldapserver/ldap/servers/plugins/replication/repl5_connection.c:359 #13 0xc000000004da7cf0:0 in repl5_inc_result_threadmain ( param=0x6000000006edcc70) at ../ldapserver/ldap/servers/plugins/replication/repl5_inc_protocol.c:296 #14 0xc00000000372c840:0 in _pt_root+0x3e0 () from /opt/dirsrv/libnspr4.so #15 0xc0000000000b4320:0 in __pthread_bound_body+0x190 () from /usr/lib/hpux64/libpthread.so.1 Created attachment 334743 [details]
patch2
Created attachment 334797 [details]
cvs commit log
Reviewed by: nkinder (Thanks!)
Fix Description: My earlier fix was for the case where the result reader thread disconnects. But it looks like there is still a problem if the update sender thread disconnects out from under the reader thread. We need to use conn_connected() to test to see if the connection is connected before we attempt to access conn->ld in the result reader thread. I also improved the error messages so that I could tell if the errors were coming from the update sender thread or the result reader thread.
Platforms tested: RHEL5
Flag Day: no
Doc impact: no
Still crashing - similar place - but this time it took a lot longer to crash - so looks like on the right track with locking protection Created attachment 334885 [details]
patch3
Created attachment 334887 [details]
cvs commit log
Reviewed by: nkinder (Thanks!)
Fix Description: There was still a small window of time during which the connection could be closed out from under the other thread which was sending/reading result. The solution is to use explicit locking using the conn->lock to protect access to the conn->ld. Since this also affected the total update code, I tested it under similar conditions, and found that it exhibited the same behavior. I added checking to the total update code to check for disconnection and coordinate access in the entry sending/result reading threads.
I also fixed a spurious error message about the sasl path.
Platforms tested: RHEL5
Flag Day: no
Doc impact: no
Changed the summary - the crash/assertions would occur when the supplier received certain errors from the consumer, like server down or timeouts, that caused the supplier to disconnect and attempt to re-establish the connection. An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHEA-2009-0455.html |