Bug 488866 - crash/assert in replication when supplier attempts to reconnect to consumer during init or update
Status: CLOSED CURRENTRELEASE
Product: Red Hat Directory Server
Classification: Red Hat
Component: Replication - General
Version: 8.1
Hardware: All
OS: All
Priority: medium
Severity: medium
Assigned To: Rich Megginson
QA Contact: Chandrasekar Kannan
Blocks: 249650 FDS1.2.0
Reported: 2009-03-05 20:01 EST by Michael Gregg
Modified: 2015-01-04 18:36 EST (History)

Fixed In Version: 8.1
Doc Type: Bug Fix
Last Closed: 2009-04-29 19:11:15 EDT


Attachments
patch (6.28 KB, patch), 2009-03-06 13:38 EST, Rich Megginson
cvs commit log (386 bytes, text/plain), 2009-03-06 15:02 EST, Rich Megginson
patch2 (4.81 KB, patch), 2009-03-10 22:16 EDT, Rich Megginson
cvs commit log (460 bytes, text/plain), 2009-03-11 09:42 EDT, Rich Megginson
patch3 (14.08 KB, patch), 2009-03-11 21:40 EDT, Rich Megginson
cvs commit log (623 bytes, text/plain), 2009-03-11 22:17 EDT, Rich Megginson
Description Michael Gregg 2009-03-05 20:01:02 EST
Description of problem:
When running reliab15, ns-slapd crashes with assertions and/or segfaults.
This seems to be related to having HPUX in the mix.

Version-Release number of selected component (if applicable):
8.1 daily from Mar-4

How reproducible:
always

Steps to Reproduce:
1. Set up reliab15, including at least one HPUX machine as a master.
2. Run the test.
3. Observe the HPUX master.
  
Actual results:
The HPUX machine exits at some point

Additional info:
Attached files will be the gdb output from the HPUX machine, and the replica it communicates with.

In this setup, there are four masters:
zeta, baggins, delta, and switch

All of these machines are in the dsdev.sjc.redhat.com domain.

These machines have been left running, and not cleaned up, so they can be observed.
Comment 3 Rich Megginson 2009-03-06 12:20:42 EST
I'm testing on RHEL5 x86_64.  I have not been able to reproduce the crash, but I have reproduced the bad sequence of error log messages like this:
[06/Mar/2009:10:20:44 -0700] NSMMReplicationPlugin - agmt="cn=meTolocalhost.localdomain1120" (localhost:1120): Consumer failed to replay change (uniqueid (null), CSN (null)): Bad parameter to an ldap routine. Will retry later.

I set up two-way MMR.  On m1, I send a bunch of add entry requests.  I view the replication status to see that it is sending the ops to m2.  Then I shut down m2.  Then I see lots of these messages.  I think I have a fix for this, but no crash so far.
Comment 5 Rich Megginson 2009-03-06 13:24:12 EST
Note:
gdb /path/to/ns-slapd /path/to/corefile
(gdb) set logging file /tmp/bt.log
(gdb) set logging on
(gdb) thread apply all bt
(gdb) q

The full bt will be in /tmp/bt.log
Comment 6 Rich Megginson 2009-03-06 13:38:04 EST
Created attachment 334335 [details]
patch
Comment 7 Rich Megginson 2009-03-06 15:02:43 EST
Created attachment 334344 [details]
cvs commit log

Reviewed by: nhosoi (Thanks!)
Fix Description: I could not reproduce the crash, but I think the problem is that the server is not handling the disconnection case correctly.  It seems that in the event of disconnection (LDAP_SERVER_DOWN 81 - Can't contact server) the code would continue to read results.
repl5_inc_result_threadmain() will call conn_read_result_ex() in a loop.  If conn_read_result_ex() detects a disconnection or an unrecoverable error, it will call conn_disconnect to close the connection, and return CONN_NOT_CONNECTED.  Once this happens, the code must not use conn->ld any more.  However, the code did not differentiate between the not connected case and other errors, so it would keep trying to read results (in case some errors are recoverable, the thread still has to read all of the pending results).  The code has been fixed to handle disconnect cases specially.  I also added some additional locking to make sure the result and the abort flags were set/read correctly.  Finally, I changed the code that waits for results to come in, so that if the connection has been closed, it will just return immediately.
Platforms tested: RHEL5 x86_64
Flag Day: no
Doc impact: no
Comment 8 Nathan Kinder 2009-03-10 16:32:03 EDT
We hit an assertion again using debug bits while running reliab15 on HP-UX:

Program received signal SIGABRT, Aborted
  si_code: 0 - SI_UNKNOWN - signal of unknown origin.
[Switching to thread 1324 (system thread 5571)]
0xc00000000031fcf0:0 in kill+0x30 () from /usr/lib/hpux64/libc.so.1
(gdb) bt
#0  0xc00000000031fcf0:0 in kill+0x30 () from /usr/lib/hpux64/libc.so.1
#1  0xc000000000244c90:0 in raise+0x30 () from /usr/lib/hpux64/libc.so.1
#2  0xc0000000002e26b0:0 in abort+0x190 () from /usr/lib/hpux64/libc.so.1
#3  0xc0000000036cc930:0 in PR_Assert+0x110 () from /opt/dirsrv/libnspr4.so
#4  0xc000000003711450:0 in PR_DestroyLock+0x120 ()
   from /opt/dirsrv/libnspr4.so
#5  0xc00000000299da70:0 in prldap_mutex_free+0x40 ()
   from /opt/dirsrv/libprldap60.so
#6  0xc00000000315fa20:0 in nsldapi_mutex_free_all+0x130 ()
   from /opt/dirsrv/libldap60.so
#7  0xc0000000031ac180:0 in ldap_ld_free+0x1b60 ()
   from /opt/dirsrv/libldap60.so
#8  0xc0000000031aa600:0 in ldap_unbind+0x160 () from /opt/dirsrv/libldap60.so
#9  0xc000000002956c40:0 in slapi_ldap_unbind (ld=0x6000000002992760)
    at ../ldapserver/ldap/servers/slapd/util.c:855
#10 0xc000000004d9f670:0 in close_connection_internal (conn=0x6000000000b59da0)
    at ../ldapserver/ldap/servers/plugins/replication/repl5_connection.c:1066
#11 0xc000000004da0990:0 in conn_disconnect (conn=0x6000000000b59da0)
    at ../ldapserver/ldap/servers/plugins/replication/repl5_connection.c:1082
#12 0xc000000004d9fe70:0 in conn_read_result_ex (conn=0x6000000000b59da0,
    retoidp=0x0, retdatap=0x0, returned_controls=0x0,
    message_id=0x9fffffff7e27cfe0, block=0)
    at ../ldapserver/ldap/servers/plugins/replication/repl5_connection.c:359
#13 0xc000000004da7cf0:0 in repl5_inc_result_threadmain (
    param=0x6000000006edcc70)
    at ../ldapserver/ldap/servers/plugins/replication/repl5_inc_protocol.c:296
#14 0xc00000000372c840:0 in _pt_root+0x3e0 () from /opt/dirsrv/libnspr4.so
#15 0xc0000000000b4320:0 in __pthread_bound_body+0x190 ()
   from /usr/lib/hpux64/libpthread.so.1
Comment 9 Rich Megginson 2009-03-10 22:16:12 EDT
Created attachment 334743 [details]
patch2
Comment 10 Rich Megginson 2009-03-11 09:42:07 EDT
Created attachment 334797 [details]
cvs commit log

Reviewed by: nkinder (Thanks!)
Fix Description: My earlier fix was for the case where the result reader thread disconnects.  But it looks like there is still a problem if the update sender thread disconnects out from under the reader thread.  We need to use conn_connected() to test to see if the connection is connected before we attempt to access conn->ld in the result reader thread.  I also improved the error messages so that I could tell if the errors were coming from the update sender thread or the result reader thread.
Platforms tested: RHEL5
Flag Day: no
Doc impact: no
Comment 11 Rich Megginson 2009-03-11 21:40:31 EDT
Still crashing in a similar place, but this time it took a lot longer to crash, so it looks like we are on the right track with the locking protection.
Comment 12 Rich Megginson 2009-03-11 21:40:58 EDT
Created attachment 334885 [details]
patch3
Comment 13 Rich Megginson 2009-03-11 22:17:18 EDT
Created attachment 334887 [details]
cvs commit log

Reviewed by: nkinder (Thanks!)
Fix Description: There was still a small window of time during which the connection could be closed out from under the other thread which was sending/reading result.  The solution is to use explicit locking using the conn->lock to protect access to the conn->ld.  Since this also affected the total update code, I tested it under similar conditions, and found that it exhibited the same behavior.  I added checking to the total update code to check for disconnection and coordinate access in the entry sending/result reading threads.
I also fixed a spurious error message about the sasl path.
Platforms tested: RHEL5
Flag Day: no
Doc impact: no
Comment 14 Rich Megginson 2009-04-02 13:48:11 EDT
Changed the summary - the crash/assertions would occur when the supplier received certain errors from the consumer, like server down or timeouts, that caused the supplier to disconnect and attempt to re-establish the connection.
Comment 17 Chandrasekar Kannan 2009-04-29 19:11:15 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0455.html
