Bug 488866

Summary: crash/assert in replication when supplier attempts to reconnect to consumer during init or update
Product: Red Hat Directory Server
Component: Replication - General
Version: 8.1
Hardware: All
OS: All
Severity: medium
Priority: medium
Status: CLOSED CURRENTRELEASE
Fixed In Version: 8.1
Doc Type: Bug Fix
Reporter: Michael Gregg <mgregg>
Assignee: Rich Megginson <rmeggins>
QA Contact: Chandrasekar Kannan <ckannan>
CC: benl, mgregg, nhosoi, nkinder, rmeggins
Bug Blocks: 249650, 493682
Last Closed: 2009-04-29 23:11:15 UTC
Attachments:
  patch
  cvs commit log
  patch2
  cvs commit log
  patch3
  cvs commit log

Description Michael Gregg 2009-03-06 01:01:02 UTC
Description of problem:
When running reliab15, ns-slapd crashes with assertions and/or segfaults.
This seems to be related to having HPUX in the mix.

Version-Release number of selected component (if applicable):
8.1 daily from Mar-4

How reproducible:
always

Steps to Reproduce:
1. Set up reliab15, including at least one HPUX machine as a master.
2. Run the test.
3. Observe the HPUX master.
  
Actual results:
The HPUX machine exits at some point.

Additional info:
The attached files are the gdb output from the HPUX machine and from the replica it communicates with.

In this setup, there are 4 masters:
zeta, baggins, delta, and switch

All of these machines are in the dsdev.sjc.redhat.com domain.

These machines have been left running and have not been cleaned up, to allow observation.

Comment 3 Rich Megginson 2009-03-06 17:20:42 UTC
I'm testing on RHEL5 x86_64.  I have not been able to reproduce the crash, but I have reproduced the bad sequence of error log messages like this:
[06/Mar/2009:10:20:44 -0700] NSMMReplicationPlugin - agmt="cn=meTolocalhost.localdomain1120" (localhost:1120): Consumer failed to replay change (uniqueid (null), CSN (null)): Bad parameter to an ldap routine. Will retry later.

I set up 2-way MMR. On m1, I send a bunch of add entry requests and view the replication status to confirm that the ops are being sent to m2. Then I shut down m2, after which I see lots of these messages. I think I have a fix for this, but no crash so far.

Comment 5 Rich Megginson 2009-03-06 18:24:12 UTC
Note:
gdb /path/to/ns-slapd /path/to/corefile
(gdb) set logging file /tmp/bt.log
(gdb) set logging on
(gdb) thread apply all bt
(gdb) q

The full bt will be in /tmp/bt.log

Comment 6 Rich Megginson 2009-03-06 18:38:04 UTC
Created attachment 334335 [details]
patch

Comment 7 Rich Megginson 2009-03-06 20:02:43 UTC
Created attachment 334344 [details]
cvs commit log

Reviewed by: nhosoi (Thanks!)
Fix Description: I could not reproduce the crash, but I believe the problem is that the server was not handling the disconnection case correctly. In the event of a disconnection (LDAP_SERVER_DOWN, 81 - Can't contact server), the code would continue to read results.
repl5_inc_result_threadmain() calls conn_read_result_ex() in a loop. If conn_read_result_ex() detects a disconnection or an unrecoverable error, it calls conn_disconnect() to close the connection and returns CONN_NOT_CONNECTED. Once this happens, the code must not use conn->ld any more. However, the code did not differentiate between the not-connected case and other errors, so it would keep trying to read results (since some errors are recoverable, the thread still has to read all of the pending results). The code has been fixed to handle the disconnect case specially. I also added locking to make sure the result and abort flags are set and read consistently. Finally, I changed the code that waits for results to come in so that it returns immediately if the connection has been closed.
Platforms tested: RHEL5 x86_64
Flag Day: no
Doc impact: no
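
To illustrate the shape of the fix: the following is a compilable toy sketch, not the actual Directory Server code. The enum values, read_result(), and the pthreads lock are stand-ins for the real conn_read_result_ex() return codes and NSPR locking; it only shows the key distinction between draining recoverable errors and bailing out on CONN_NOT_CONNECTED.

#include <pthread.h>
#include <stdio.h>

/* Stand-ins for the real connection result codes. */
typedef enum { CONN_OPERATION_SUCCESS, CONN_OPERATION_FAILED,
               CONN_NOT_CONNECTED } conn_result_t;

static pthread_mutex_t result_lock = PTHREAD_MUTEX_INITIALIZER;
static int abort_flag = 0;                 /* read by the sender thread */
static conn_result_t last_result = CONN_OPERATION_SUCCESS;

/* Fake reader: a few recoverable errors, then the connection drops. */
static conn_result_t read_result(int i) {
    return (i < 3) ? CONN_OPERATION_FAILED : CONN_NOT_CONNECTED;
}

int main(void) {
    for (int i = 0; ; i++) {
        conn_result_t rc = read_result(i);
        if (rc == CONN_NOT_CONNECTED) {
            /* The connection (and its LDAP* handle) is gone: record the
             * result and the abort flag under the lock, then stop reading
             * instead of retrying against a dead handle. */
            pthread_mutex_lock(&result_lock);
            last_result = rc;
            abort_flag = 1;
            pthread_mutex_unlock(&result_lock);
            break;
        }
        /* Other errors may be recoverable: keep draining pending results. */
        printf("result %d: recoverable error, continuing\n", i);
    }
    printf("disconnected: abort=%d last_result=%d\n", abort_flag, (int)last_result);
    return 0;
}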

Comment 8 Nathan Kinder 2009-03-10 20:32:03 UTC
We hit an assertion again using debug bits while running reliab15 on HP-UX:

Program received signal SIGABRT, Aborted
  si_code: 0 - SI_UNKNOWN - signal of unknown origin.
[Switching to thread 1324 (system thread 5571)]
0xc00000000031fcf0:0 in kill+0x30 () from /usr/lib/hpux64/libc.so.1
(gdb) bt
#0  0xc00000000031fcf0:0 in kill+0x30 () from /usr/lib/hpux64/libc.so.1
#1  0xc000000000244c90:0 in raise+0x30 () from /usr/lib/hpux64/libc.so.1
#2  0xc0000000002e26b0:0 in abort+0x190 () from /usr/lib/hpux64/libc.so.1
#3  0xc0000000036cc930:0 in PR_Assert+0x110 () from /opt/dirsrv/libnspr4.so
#4  0xc000000003711450:0 in PR_DestroyLock+0x120 ()
   from /opt/dirsrv/libnspr4.so
#5  0xc00000000299da70:0 in prldap_mutex_free+0x40 ()
   from /opt/dirsrv/libprldap60.so
#6  0xc00000000315fa20:0 in nsldapi_mutex_free_all+0x130 ()
   from /opt/dirsrv/libldap60.so
#7  0xc0000000031ac180:0 in ldap_ld_free+0x1b60 ()
   from /opt/dirsrv/libldap60.so
#8  0xc0000000031aa600:0 in ldap_unbind+0x160 () from /opt/dirsrv/libldap60.so
#9  0xc000000002956c40:0 in slapi_ldap_unbind (ld=0x6000000002992760)
    at ../ldapserver/ldap/servers/slapd/util.c:855
#10 0xc000000004d9f670:0 in close_connection_internal (conn=0x6000000000b59da0)
    at ../ldapserver/ldap/servers/plugins/replication/repl5_connection.c:1066
#11 0xc000000004da0990:0 in conn_disconnect (conn=0x6000000000b59da0)
    at ../ldapserver/ldap/servers/plugins/replication/repl5_connection.c:1082
#12 0xc000000004d9fe70:0 in conn_read_result_ex (conn=0x6000000000b59da0,
    retoidp=0x0, retdatap=0x0, returned_controls=0x0,
    message_id=0x9fffffff7e27cfe0, block=0)
    at ../ldapserver/ldap/servers/plugins/replication/repl5_connection.c:359
#13 0xc000000004da7cf0:0 in repl5_inc_result_threadmain (
    param=0x6000000006edcc70)
    at ../ldapserver/ldap/servers/plugins/replication/repl5_inc_protocol.c:296
#14 0xc00000000372c840:0 in _pt_root+0x3e0 () from /opt/dirsrv/libnspr4.so
#15 0xc0000000000b4320:0 in __pthread_bound_body+0x190 ()
   from /usr/lib/hpux64/libpthread.so.1

Comment 9 Rich Megginson 2009-03-11 02:16:12 UTC
Created attachment 334743 [details]
patch2

Comment 10 Rich Megginson 2009-03-11 13:42:07 UTC
Created attachment 334797 [details]
cvs commit log

Reviewed by: nkinder (Thanks!)
Fix Description: My earlier fix was for the case where the result reader thread disconnects, but it looks like there is still a problem if the update sender thread disconnects out from under the reader thread. We need to use conn_connected() to check whether the connection is still up before we attempt to access conn->ld in the result reader thread. I also improved the error messages so that I can tell whether the errors are coming from the update sender thread or the result reader thread.
Platforms tested: RHEL5
Flag Day: no
Doc impact: no
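
Roughly, the guard looks like the compilable sketch below. The struct and read_one_result() are made up for illustration; only the check-conn_connected()-before-touching-conn->ld idea comes from the actual fix.

#include <stdio.h>

typedef struct {
    void *ld;   /* the LDAP* handle in the real Repl_Connection */
} Repl_Connection;

/* Stand-in for conn_connected(): true only while conn->ld is valid. */
static int conn_connected(Repl_Connection *conn) {
    return conn->ld != NULL;
}

static int read_one_result(Repl_Connection *conn) {
    if (!conn_connected(conn)) {
        /* The sender thread disconnected out from under us; touching
         * conn->ld now would use a freed handle, so bail out instead. */
        fprintf(stderr, "result reader: connection closed, exiting\n");
        return -1;
    }
    /* ...safe to read results from conn->ld here... */
    return 0;
}

int main(void) {
    Repl_Connection conn = { NULL };   /* simulate a torn-down connection */
    return read_one_result(&conn) == -1 ? 0 : 1;
}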

Comment 11 Rich Megginson 2009-03-12 01:40:31 UTC
Still crashing in a similar place, but this time it took a lot longer to crash, so it looks like the locking protection is on the right track.

Comment 12 Rich Megginson 2009-03-12 01:40:58 UTC
Created attachment 334885 [details]
patch3

Comment 13 Rich Megginson 2009-03-12 02:17:18 UTC
Created attachment 334887 [details]
cvs commit log

Reviewed by: nkinder (Thanks!)
Fix Description: There was still a small window of time during which the connection could be closed out from under the other thread that was sending updates or reading results. The solution is explicit locking, using conn->lock to protect access to conn->ld. Since this also affected the total update code, I tested it under similar conditions and found that it exhibited the same behavior, so I added checks to the total update code to detect disconnection and coordinate access between the entry sending and result reading threads.
I also fixed a spurious error message about the sasl path.
Platforms tested: RHEL5
Flag Day: no
Doc impact: no
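
The locking scheme, again as a compilable toy rather than the committed code (pthreads replaces the NSPR lock, and the helper names are illustrative): both the disconnect path and the reader take conn->lock before touching conn->ld, which closes the race window described above.

#include <pthread.h>
#include <stdlib.h>

typedef struct {
    pthread_mutex_t lock;   /* conn->lock in the real code */
    void *ld;               /* the LDAP* handle */
} Repl_Connection;

/* Disconnect path (e.g. the update sender thread): the handle is only
 * freed and NULLed while the lock is held. */
static void conn_disconnect(Repl_Connection *conn) {
    pthread_mutex_lock(&conn->lock);
    free(conn->ld);
    conn->ld = NULL;
    pthread_mutex_unlock(&conn->lock);
}

/* Reader path: takes the same lock before touching conn->ld, so the
 * other thread can no longer free it mid-use. */
static int conn_read_result(Repl_Connection *conn) {
    int rc = -1;
    pthread_mutex_lock(&conn->lock);
    if (conn->ld != NULL) {
        rc = 0;   /* safe to use conn->ld while the lock is held */
    }
    pthread_mutex_unlock(&conn->lock);
    return rc;
}

int main(void) {
    Repl_Connection conn;
    pthread_mutex_init(&conn.lock, NULL);
    conn.ld = malloc(1);
    conn_disconnect(&conn);
    return conn_read_result(&conn) == -1 ? 0 : 1;   /* -1: connection is gone */
}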

Comment 14 Rich Megginson 2009-04-02 17:48:11 UTC
Changed the summary - the crash/assertions would occur when the supplier received certain errors from the consumer, like server down or timeouts, that caused the supplier to disconnect and attempt to re-establish the connection.

Comment 17 Chandrasekar Kannan 2009-04-29 23:11:15 UTC
An advisory has been issued which should address the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0455.html