Bug 488866 - crash/assert in replication when supplier attempts to reconnect to consumer during init or update
Summary: crash/assert in replication when supplier attempts to reconnect to consumer during init or update
Alias: None
Product: Red Hat Directory Server
Classification: Red Hat
Component: Replication - General
Version: 8.1
Hardware: All
OS: All
Target Milestone: ---
Assignee: Rich Megginson
QA Contact: Chandrasekar Kannan
Depends On:
Blocks: 249650 FDS1.2.0
Reported: 2009-03-06 01:01 UTC by Michael Gregg
Modified: 2015-01-04 23:36 UTC (History)
5 users

Fixed In Version: 8.1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2009-04-29 23:11:15 UTC
Target Upstream Version:

Attachments (Terms of Use)
patch (6.28 KB, patch)
2009-03-06 18:38 UTC, Rich Megginson
no flags Details | Diff
cvs commit log (386 bytes, text/plain)
2009-03-06 20:02 UTC, Rich Megginson
no flags Details
patch2 (4.81 KB, patch)
2009-03-11 02:16 UTC, Rich Megginson
no flags Details | Diff
cvs commit log (460 bytes, text/plain)
2009-03-11 13:42 UTC, Rich Megginson
no flags Details
patch3 (14.08 KB, patch)
2009-03-12 01:40 UTC, Rich Megginson
no flags Details | Diff
cvs commit log (623 bytes, text/plain)
2009-03-12 02:17 UTC, Rich Megginson
no flags Details

Description Michael Gregg 2009-03-06 01:01:02 UTC
Description of problem:
When running reliab15, ns-slapd crashes with assertions and/or segfaults.
This seems to be related to having HP-UX in the mix.

Version-Release number of selected component (if applicable):
8.1 daily from Mar-4

How reproducible:

Steps to Reproduce:
1. Set up reliab15, including at least one HP-UX machine as a master.
2. Run the test.
3. Observe the HP-UX master.
Actual results:
The ns-slapd process on the HP-UX machine exits at some point.

Additional info:
The attached files are the gdb output from the HP-UX machine and the replica it communicates with.

In this setup, there are four masters:
zeta, baggins, delta, and switch

All of these machines are in the dsdev.sjc.redhat.com domain.

These machines have been left running and not cleaned up, for observation.

Comment 3 Rich Megginson 2009-03-06 17:20:42 UTC
I'm testing on RHEL5 x86_64.  I have not been able to reproduce the crash, but I have reproduced the bad sequence of error log messages like this:
[06/Mar/2009:10:20:44 -0700] NSMMReplicationPlugin - agmt="cn=meTolocalhost.localdomain1120" (localhost:1120): Consumer failed to replay change (uniqueid (null), CSN (null)): Bad parameter to an ldap routine. Will retry later.

I set up 2-way MMR.  On m1, I send a bunch of add entry requests and view the replication status to see that it is sending the ops to m2.  Then I shut down m2, and I see lots of these messages.  I think I have a fix for this, but no crash so far.

Comment 5 Rich Megginson 2009-03-06 18:24:12 UTC
gdb /path/to/ns-slapd /path/to/corefile
(gdb) set logging file /tmp/bt.log
(gdb) set logging on
(gdb) thread apply all bt
(gdb) q

The full bt will be in /tmp/bt.log

Comment 6 Rich Megginson 2009-03-06 18:38:04 UTC
Created attachment 334335 [details]
patch

Comment 7 Rich Megginson 2009-03-06 20:02:43 UTC
Created attachment 334344 [details]
cvs commit log

Reviewed by: nhosoi (Thanks!)
Fix Description: I could not reproduce the crash, but I think the problem is that the server is not handling the disconnection case correctly.  It seems that in the event of disconnection (LDAP_SERVER_DOWN 81 - Can't contact server) the code would continue to read results.
repl5_inc_result_threadmain() calls conn_read_result_ex() in a loop.  If conn_read_result_ex() detects a disconnection or an unrecoverable error, it calls conn_disconnect() to close the connection and returns CONN_NOT_CONNECTED.  Once this happens, the code must not use conn->ld any more.  However, the code did not differentiate between the not-connected case and other errors, so it would keep trying to read results (since some errors are recoverable, the thread still has to read all of the pending results).  The code has been fixed to handle disconnect cases specially.  I also added some additional locking to make sure the result and abort flags were set/read correctly.  Finally, I changed the code that waits for results to come in so that, if the connection has been closed, it will just return immediately.
Platforms tested: RHEL5 x86_64
Flag Day: no
Doc impact: no

Comment 8 Nathan Kinder 2009-03-10 20:32:03 UTC
We hit an assertion again using debug bits while running reliab15 on HP-UX:

Program received signal SIGABRT, Aborted
  si_code: 0 - SI_UNKNOWN - signal of unknown origin.
[Switching to thread 1324 (system thread 5571)]
0xc00000000031fcf0:0 in kill+0x30 () from /usr/lib/hpux64/libc.so.1
(gdb) bt
#0  0xc00000000031fcf0:0 in kill+0x30 () from /usr/lib/hpux64/libc.so.1
#1  0xc000000000244c90:0 in raise+0x30 () from /usr/lib/hpux64/libc.so.1
#2  0xc0000000002e26b0:0 in abort+0x190 () from /usr/lib/hpux64/libc.so.1
#3  0xc0000000036cc930:0 in PR_Assert+0x110 () from /opt/dirsrv/libnspr4.so
#4  0xc000000003711450:0 in PR_DestroyLock+0x120 ()
   from /opt/dirsrv/libnspr4.so
#5  0xc00000000299da70:0 in prldap_mutex_free+0x40 ()
   from /opt/dirsrv/libprldap60.so
#6  0xc00000000315fa20:0 in nsldapi_mutex_free_all+0x130 ()
   from /opt/dirsrv/libldap60.so
#7  0xc0000000031ac180:0 in ldap_ld_free+0x1b60 ()
   from /opt/dirsrv/libldap60.so
#8  0xc0000000031aa600:0 in ldap_unbind+0x160 () from /opt/dirsrv/libldap60.so
#9  0xc000000002956c40:0 in slapi_ldap_unbind (ld=0x6000000002992760)
    at ../ldapserver/ldap/servers/slapd/util.c:855
#10 0xc000000004d9f670:0 in close_connection_internal (conn=0x6000000000b59da0)
    at ../ldapserver/ldap/servers/plugins/replication/repl5_connection.c:1066
#11 0xc000000004da0990:0 in conn_disconnect (conn=0x6000000000b59da0)
    at ../ldapserver/ldap/servers/plugins/replication/repl5_connection.c:1082
#12 0xc000000004d9fe70:0 in conn_read_result_ex (conn=0x6000000000b59da0,
    retoidp=0x0, retdatap=0x0, returned_controls=0x0,
    message_id=0x9fffffff7e27cfe0, block=0)
    at ../ldapserver/ldap/servers/plugins/replication/repl5_connection.c:359
#13 0xc000000004da7cf0:0 in repl5_inc_result_threadmain (
    at ../ldapserver/ldap/servers/plugins/replication/repl5_inc_protocol.c:296
#14 0xc00000000372c840:0 in _pt_root+0x3e0 () from /opt/dirsrv/libnspr4.so
#15 0xc0000000000b4320:0 in __pthread_bound_body+0x190 ()
   from /usr/lib/hpux64/libpthread.so.1

Comment 9 Rich Megginson 2009-03-11 02:16:12 UTC
Created attachment 334743 [details]
patch2

Comment 10 Rich Megginson 2009-03-11 13:42:07 UTC
Created attachment 334797 [details]
cvs commit log

Reviewed by: nkinder (Thanks!)
Fix Description: My earlier fix was for the case where the result reader thread disconnects.  But it looks like there is still a problem if the update sender thread disconnects out from under the reader thread.  We need to use conn_connected() to test to see if the connection is connected before we attempt to access conn->ld in the result reader thread.  I also improved the error messages so that I could tell if the errors were coming from the update sender thread or the result reader thread.
Platforms tested: RHEL5
Flag Day: no
Doc impact: no

Comment 11 Rich Megginson 2009-03-12 01:40:31 UTC
Still crashing in a similar place, but this time it took a lot longer to crash, so it looks like we are on the right track with the locking protection.

Comment 12 Rich Megginson 2009-03-12 01:40:58 UTC
Created attachment 334885 [details]
patch3

Comment 13 Rich Megginson 2009-03-12 02:17:18 UTC
Created attachment 334887 [details]
cvs commit log

Reviewed by: nkinder (Thanks!)
Fix Description: There was still a small window of time during which the connection could be closed out from under the other thread that was sending updates or reading results.  The solution is to use explicit locking, using the conn->lock to protect access to conn->ld.  Since this also affected the total update code, I tested it under similar conditions and found that it exhibited the same behavior.  I added checking to the total update code to check for disconnection and coordinate access in the entry-sending/result-reading threads.
I also fixed a spurious error message about the sasl path.
Platforms tested: RHEL5
Flag Day: no
Doc impact: no

Comment 14 Rich Megginson 2009-04-02 17:48:11 UTC
Changed the summary - the crash/assertions would occur when the supplier received certain errors from the consumer, like server down or timeouts, that caused the supplier to disconnect and attempt to re-establish the connection.

Comment 17 Chandrasekar Kannan 2009-04-29 23:11:15 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

