Bug 487425

Summary:

slapd crashes after changelog is moved

Product:

Red Hat Directory Server

Reporter:

Ulf Weltman <ulf.weltman>

Component:

Replication - General

Assignee:

Rich Megginson <rmeggins>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Viktor Ashirov <vashirov>

Severity:

high

Docs Contact:

Priority:

high

Version:

8.0

CC:

jgalipea, nhosoi, nkinder

Target Milestone:

---

Keywords:

Reopened

Target Release:

---

Hardware:

All

OS:

Other

Whiteboard:

Fixed In Version:

8.1

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2016-05-06 14:28:40 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

434914, 518519

Attachments:

Description	Flags
patch	none
cvs commit log	none
trivial fix proposal for crash after re-enabling changelog	none

Description Ulf Weltman 2009-02-25 21:58:45 UTC

I've duplicated this with RHDS 8.0 on 11.23 IA64:

1) Using console or modify of nsslapd-changelogdir, move the changelog.  Moving a changelog is briefly mentioned here:
http://www.redhat.com/docs/manuals/dir-server/ag/8.0/Managing_Replication-Deleting_the_Changelog.html
2) Re-initialize consumer.
3) Make a change in replicated area, ns-slapd crashes when it tries to changelog it.

Program terminated with signal 11, Segmentation fault.
SEGV_ACCERR - Invalid Permissions for object
#0  0x0 in <unknown_procedure> ()
warning: Attempting to unwind past bad PC 0x0
#1  0xc000000002466780:0 in clcache_open_cursor (txn=0x0,
    buf=0x6000000000650700, cursor=0x9fffffff7f3bce90)
    at ../ldap/servers/plugins/replication/cl5_clcache.c:891
#2  0xc000000002466290:0 in clcache_load_buffer_bulk (buf=0x6000000000650700,
    flag=18) at ../ldap/servers/plugins/replication/cl5_clcache.c:377
#3  0xc0000000024658c0:0 in clcache_load_buffer (buf=0x6000000000650700,
    anchorcsn=0x6000000000a3f5a0, flag=18)
    at ../ldap/servers/plugins/replication/cl5_clcache.c:333
#4  0xc00000000244b860:0 in _cl5PositionCursorForReplay (consumerRID=0,
    consumerRuv=0x6000000000679160, replica=0x60000000003ad290,
    fileObj=0x60000000005bac70, iterator=0x9fffffff7f3bcf10)
    at ../ldap/servers/plugins/replication/cl5_api.c:5699
#5  0xc00000000244c660:0 in cl5CreateReplayIterator (prp=0x6000000000561a20,
    consumerRuv=0x6000000000679160, iterator=0x9fffffff7f3bcf10)
    at ../ldap/servers/plugins/replication/cl5_api.c:1838
#6  0xc0000000024a30e0:0 in send_updates (prp=0x6000000000561a20,
    remote_update_vector=0x6000000000679160,
    num_changes_sent=0x9fffffff7f3bcfb0)
    at ../ldap/servers/plugins/replication/repl5_inc_protocol.c:1627
#7  0xc0000000024a1380:0 in repl5_inc_run (prp=0x6000000000561a20)
    at ../ldap/servers/plugins/replication/repl5_inc_protocol.c:1119
#8  0xc0000000024b3a90:0 in prot_thread_main (arg=0x600000000039a2a0)
    at ../ldap/servers/plugins/replication/repl5_protocol.c:313
#9  0xc000000000bd60d0:0 in _pt_root ()
    at ../../../../../nsprpub/pr/src/pthreads/ptthread.c:220
#10 0xc0000000000d0f40:0 in __pthread_bound_body+0x190 ()
   from /usr/lib/hpux64/libpthread.so.1

Comment 1 Rich Megginson 2009-03-03 18:38:29 UTC

When you change the changelog directory, the code attempts to completely close and delete the old changelog before creating the new one in the new directory.  The problem is that the replication incrememtal protocol code has an open reference to the changelog database (fileObj in the iterator object).  The only way to cleanly change the changelog directory is to make sure incremental update is not happening.  I think the way it should work then is that if you attempt to change the changelog directory, and an update (or init) is in progress, the code should return Unwilling to Perform with a message like "cannot change the changelog directory while an update is in progress - please stop replication first, then change the changelog".  I did a simple test like this:

set up 2 masters
replica init 1 to 2
added a lot of entries to m1
printed the replication status to make sure update was in progress
stopped replication by changing the nsds5replicaupdateschedule using ldapmodify
printed the replication status to make sure update was stopped
changed the changelog directory using ldapmodify
restarted replication by changing the schedule back
did the replica re-init

This worked fine.  However, if I omit the stopped replication step, it fails with the crash.

Comment 2 Rich Megginson 2009-03-04 04:32:05 UTC

No, it doesn't work.  It will still crash as soon as you start firing updates at the master after the replica re-init.  The problem is the clcache (cl5_clcache.c).  This keeps a static list of caches, and uses a per-thread cache as well.  These caches must be cleaned up when the changelog is closed because they contain pointers to the dbenv in s_cl5Desc.dbEnv and the DB* in s_cl5Desc.dbFiles in cl5_api.c.  Unfortunately, there is currently no code to do this.  At this point, it will quite likely take at least a couple more days to fix this, and likely more, and will be quite invasive, requiring some extensive testing (and debugging) of the replication code.  Is it worth it to try to get this fix into 8.1?

Comment 3 Ulf Weltman 2009-03-04 16:22:54 UTC

Doesn't seem worth it to me.  I imagine it's a pretty uncommon procedure and it must be an old bug, while this is the first I hear of it.  Could you add a workaround do the docs?  I don't think it will crash if the instance is restarted between moving the changelog and reinitializing the consumers.

Comment 4 Rich Megginson 2009-03-04 17:26:52 UTC

Created attachment 334019 [details]
patch

Comment 5 Rich Megginson 2009-03-04 18:59:10 UTC

Created attachment 334032 [details]
cvs commit log

Reviewed by: nkinder, nhosoi (Thanks!)
Files: see diff
Branch: HEAD
Fix Description: There are a number of real fixes, mixed in with many changes for debugging and instrumentation.
1) When the update thread gets the changelog iterator, it will use _cl5AddThread to increment the count of threads holding an open handle to the changelog.  When it releases the iterator, or if there were some error acquiring the database handle, it will decrement the thread count.  The way it used to work was that it would increment the thread count when retrieving the DB object, but then would immediately decrement it, meaning it had an open handle to the database, but there was no way for the changelog code to know that (except via the reference count on the DB object itself).
https://bugzilla.redhat.com/attachment.cgi?id=334019&action=diff#ldapserver/ldap/servers/plugins/replication/cl5_api.c_sec4
https://bugzilla.redhat.com/attachment.cgi?id=334019&action=diff#ldapserver/ldap/servers/plugins/replication/cl5_api.c_sec5
https://bugzilla.redhat.com/attachment.cgi?id=334019&action=diff#ldapserver/ldap/servers/plugins/replication/cl5_api.c_sec6
2) Changed the AddThread code to increment the thread count outside of the state lock - this better fits the semantics of the other uses of threadcount which are outside of the lock.
https://bugzilla.redhat.com/attachment.cgi?id=334019&action=diff#ldapserver/ldap/servers/plugins/replication/cl5_api.c_sec12
3) The changelog code that closes the databases was not closing things down in the correct order.  The first thing it must do is wait for all threads with open database handles or otherwise accessing the database to terminate.  Once that is done, it can call _cl5DBClose() to actually close all of the databases.  Otherwise, a race condition could cause a database to be accessed after it has been closed.
https://bugzilla.redhat.com/attachment.cgi?id=334019&action=diff#ldapserver/ldap/servers/plugins/replication/cl5_api.c_sec16
4) Added clcache cleanup code, and made it possible to re-init the clcache.  The clcache was not designed to be dynamically closed and opened.
clcache is init-ed in _cl5Open - https://bugzilla.redhat.com/attachment.cgi?id=334019&action=diff#ldapserver/ldap/servers/plugins/replication/cl5_api.c_sec8
clcache_init is re-entrant - https://bugzilla.redhat.com/attachment.cgi?id=334019&action=diff#ldapserver/ldap/servers/plugins/replication/cl5_clcache.c_sec1
Added more code to clean up the clcache - https://bugzilla.redhat.com/attachment.cgi?id=334019&action=diff#ldapserver/ldap/servers/plugins/replication/cl5_clcache.c_sec4
https://bugzilla.redhat.com/attachment.cgi?id=334019&action=diff#ldapserver/ldap/servers/plugins/replication/cl5_clcache.c_sec5
Delete the clcache in _cl5Delete - https://bugzilla.redhat.com/attachment.cgi?id=334019&action=diff#ldapserver/ldap/servers/plugins/replication/cl5_api.c_sec18
5) The clcache stores the current buffer in a thread private storage area.  If the clcache has been re-initialized, this buffer is also invalid and the clcache code must get a new buffer.
https://bugzilla.redhat.com/attachment.cgi?id=334019&action=diff#ldapserver/ldap/servers/plugins/replication/cl5_clcache.c_sec2
Platforms tested: RHEL5
Flag Day: no
Doc impact: no

Comment 6 Yi Zhang 2009-04-07 22:04:25 UTC

Talk with Nathan, and it seems like we won't be able to verify all changes Rich listed above. So my test is simple follow Ulf's report, and the server are working fine after the changelog change --- Such test is sufficient to prove the over result of fixes underneath the surface.

Unless we (QA) can perform testes at code level, QA will not be able to verify every and each details. For now, this bug is considered as fixed, therefore closed.

Comment 7 Chandrasekar Kannan 2009-04-29 23:10:49 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0455.html

Comment 8 Ulf Weltman 2009-08-05 18:16:29 UTC

Our QA noted a new crash that seems to be related to this fix, caused by enabling, disabling, and then enabling the changelog again.

It seems the global structure _pool will have been freed during the disabling (by the new clcache_destroy() function) and then it's dereferenced during the second enabling.

Program received signal SIGBUS, Bus error
  si_code: 0 - BUS_UNKNOWN - Unknown Error.
[Switching to thread 14 (system thread 164086)]
0x800003ffee4edf1c in clcache_set_config (config=0x800003ffedf71630)
    at ../ldap/servers/plugins/replication/cl5_clcache.c:204
204             _pool->pl_buffer_cnt_max = config->maxChCacheEntries;
(gdb) bt
#0  0x800003ffee4edf1c in clcache_set_config (config=0x800003ffedf71630)
    at ../ldap/servers/plugins/replication/cl5_clcache.c:204
#1  0x800003ffee4f3458 in changelog5_extract_config (entry=0x80000001005ada80,
    config=0x800003ffedf71618)
    at ../ldap/servers/plugins/replication/cl5_config.c:911
#2  0x800003ffee4f0fb4 in changelog5_config_add (pb=0x8000000100936cb8,
    e=0x80000001005ada80, entryAfter=0x0, returncode=0x800003ffedf7125c,
    returntext=0x800003ffedf71260 "", arg=0x0)
    at ../ldap/servers/plugins/replication/cl5_config.c:194
#3  0x800003ffeff2a568 in dse_call_callback (pdse=0x800000010003b988,
    pb=0x8000000100936cb8, operation=16, flags=1,
    entryBefore=0x80000001005ada80, entryAfter=0x0,
    returncode=0x800003ffedf7125c, returntext=0x800003ffedf71260 "")
    at ../ldap/servers/slapd/dse.c:2195
#4  0x800003ffeff29b1c in dse_add (pb=0x8000000100936cb8)
    at ../ldap/servers/slapd/dse.c:2021
#5  0x800003ffeff081c8 in op_shared_add (pb=0x8000000100936cb8)
    at ../ldap/servers/slapd/add.c:624
#6  0x800003ffeff06c44 in do_add (pb=0x8000000100936cb8)
    at ../ldap/servers/slapd/add.c:225
#7  0x400000000001eddc in connection_dispatch_operation (
    conn=0x80000001007a4618, op=0x80000001009382d0, pb=0x8000000100936cb8)
    at ../ldap/servers/slapd/connection.c:487
#8  0x4000000000021134 in connection_threadmain ()
    at ../ldap/servers/slapd/connection.c:2161
#9  0x800003ffefabfe0c in _pt_root+0x124 () from /opt/dirsrv/lib/libnspr4.sl
#10 0x800003ffefe4b064 in __pthread_bound_body+0xcc ()
   from /lib/pa20_64/libpthread.1
#11 0x800003ffefe7d804 in __pthread_bound_start+0x14 ()
   from /lib/pa20_64/libpthread.1

Comment 9 Ulf Weltman 2009-08-17 20:23:03 UTC

Created attachment 357703 [details]
trivial fix proposal for crash after re-enabling changelog

Trivial fix proposal which depends on (doesn't obsolete) patch 334019:
call clcache_set_config() from cl5Open() instead of changelog5_extract_config() to ensure it's only called after the global changelog cache pool has been allocated by _cl5Open().

Comment 10 Rich Megginson 2009-08-25 19:05:19 UTC

commit e8bc100b891e9e9d5a446394058af5f6de2e9b3e
Author: Rich Megginson <rmeggins>
Date:   Tue Aug 25 11:09:52 2009 -0600

    https://bugzilla.redhat.com/show_bug.cgi?id=487425
    Resolves: bug 487425
    Bug Description: slapd crashes after changelog is moved
    Reviewed by: rmeggins
    Fix Description: Call clcache_set_config after the global changelog cache po
    Platforms tested: HPUX 11 (PA-RISC 2.0 64-bit)
    Flag Day: no
    Doc impact: no

Comment 11 Rich Megginson 2009-08-25 19:23:24 UTC

1.2 branch commit
commit 5d5ce8ed4a3e37e49f4cc0fb3da17bb2d248d61a
Author: Rich Megginson <rmeggins>
Date:   Tue Aug 25 11:09:52 2009 -0600

Comment 12 Jenny Severance 2010-05-20 17:27:29 UTC

fix verified - RHEL 4

version:

redhat-ds-base-8.2.0-2010052004.el4dsrv

Changlog location successfully changed on server 1 while large import and update in progress, subsequent consumer re-initialization also successful during the update - no crash of either server