Bug 723937 - replication failing on RUV errors
replication failing on RUV errors
Status: CLOSED CURRENTRELEASE
Product: 389
Classification: Community
Component: Directory Server (Show other bugs)
1.2.9
Unspecified Unspecified
unspecified Severity unspecified
: ---
: ---
Assigned To: Rich Megginson
Viktor Ashirov
: VerifiedUpstream
Depends On:
Blocks: 389_1.2.9 726742
  Show dependency treegraph
 
Reported: 2011-07-21 11:18 EDT by Rob Crittenden
Modified: 2015-12-07 11:35 EST (History)
5 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-12-07 11:35:41 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
errors log on replica (27.98 KB, text/plain)
2011-07-21 11:19 EDT, Rob Crittenden
no flags Details
Patch (3.94 KB, patch)
2011-07-25 17:11 EDT, Nathan Kinder
nkinder: review?
nhosoi: review+
Details | Diff
0001-Bug-723937-replication-failing-on-RUV-errors.patch (11.13 KB, patch)
2011-08-01 19:09 EDT, Rich Megginson
nkinder: review+
Details | Diff
dirsrv errors on a replica (26.02 KB, text/plain)
2011-08-10 05:36 EDT, Martin Kosek
no flags Details
0001-Bug-723937-replication-failing-on-RUV-errors.patch (9.21 KB, patch)
2011-08-10 13:09 EDT, Rich Megginson
nkinder: review+
Details | Diff

  None (edit)
Description Rob Crittenden 2011-07-21 11:18:17 EDT
Description of problem:

Trying to create a replica of an IPA server, the installation fails because an entry cannot be added to the local 389-ds instance due to an error 32.

This is done after an MMR agreement has been set up and initial replication completed. During this replication the replia logs a slew of RUV errors and the server ends up in a strange state.

Version-Release number of selected component (if applicable):

389-ds-base-1.2.8.3-1.fc15.i686
389-ds-base-1.2.9.0-1.fc15.i686

How reproducible:


Steps to Reproduce:
1. On server A, ipa-server-install
2. On server A, ipa-replica-prepare B
3. On server B, ipa-replica-install replica-info-B
Comment 1 Rob Crittenden 2011-07-21 11:19:07 EDT
Created attachment 514222 [details]
errors log on replica
Comment 2 Nathan Kinder 2011-07-21 17:59:06 EDT
I am able to reproduce this outside of IPA.  The issue is reproducible by setting up 2 masters on F15 i686.  The problem does not occur on a F15 x86_64 system.

I believe that the problem is related to these cache errors:

[20/Jul/2011:22:26:55 -0400] - entrycache_clear_int: there are still 29 entries in the entry cache.
[20/Jul/2011:22:26:55 -0400] - dncache_clear_int: there are still 187 dn's in the dn cache. :/
Comment 3 Nathan Kinder 2011-07-22 13:54:48 EDT
This definitely looks to be some sort of issue with the caches, yet there is some timing element to it as well.  It only seems to happen on 32-bit F15 with optimized 389-ds-base binaries (F14 x86_64 was tested as well as F14 i686).

I rebuilt an optimized 389-ds-base with extra cache debugging compiled in.  I encounter this assertion in gdb on the master that is being initialized:

------------------------------------------------------------------
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb27ffb70 (LWP 8938)]
0x009c7aa5 in cache_remove (cache=0x98c24dc, ptr=0xb3a021d8)
    at ldap/servers/slapd/back-ldbm/cache.c:933
933             ASSERT(e->ep_refcnt > 0);
(gdb) bt
#0  0x009c7aa5 in cache_remove (cache=0x98c24dc, ptr=0xb3a021d8)
    at ldap/servers/slapd/back-ldbm/cache.c:933
#1  0x009e99e1 in import_foreman (param=0xb2800870)
    at ldap/servers/slapd/back-ldbm/import-threads.c:2327
#2  0x4da50c22 in ?? () from /lib/libnspr4.so
#3  0x4c9d59fe in start_thread () from /lib/libpthread.so.0
#4  0x4c91c24e in clone () from /lib/libc.so.6
------------------------------------------------------------------

The cache_remove() function expects the reference count to be at least 1.  In import_foreman, we call cache_remove() shortly after calling id2entry_add_ext(), which should add the new entry to the cache and thus set the reference count to 1.  What I'm finding is that id2entry_add_ext() fails to add the new entry to the cache since another entry with the same ID and DN already exists in the cache.  This leaves the reference count at 0 for the new entry, yet we still call cache_remove(), leading to the assertion.  Here's the debug cache logging:

------------------------------------------------------------------
[22/Jul/2011:10:39:01 -0700] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=example,dc=com is going offline; disabling replication
[22/Jul/2011:10:39:02 -0700] - => cache_find_dn (dc=example,dc=com)
[22/Jul/2011:10:39:02 -0700] - <= cache_find_dn (NOT FOUND)
[22/Jul/2011:10:39:02 -0700] - => cache_return
<= cache_return (null entry)
[22/Jul/2011:10:39:02 -0700] - => cache_return
<= cache_return (null entry)
[22/Jul/2011:10:39:02 -0700] - => entrycache_flush
[22/Jul/2011:10:39:02 -0700] - <= entrycache_flush (down to 1 entries, 0 bytes)
[22/Jul/2011:10:39:02 -0700] - entrycache_clear_int: there are still 1 entries in the entry cache.
[22/Jul/2011:10:39:02 -0700] - ID(s) in entry cache:
[22/Jul/2011:10:39:03 -0700] - 1:
[22/Jul/2011:10:39:03 -0700] - => dncache_flush
[22/Jul/2011:10:39:03 -0700] - <= dncache_flush (down to 1 dns, 0 bytes)
[22/Jul/2011:10:39:03 -0700] - dncache_clear_int: there are still 1 dn's in the dn cache. :/
[22/Jul/2011:10:39:03 -0700] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[22/Jul/2011:10:39:03 -0700] - => dncache_add_int( "dc=example,dc=com", 1 )
[22/Jul/2011:10:39:03 -0700] - entry dc=example,dc=com already in id cache!
[22/Jul/2011:10:39:03 -0700] - => dncache_replace (nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff,dc=example,dc=com) -> (dc=example,dc=com)
[22/Jul/2011:10:39:03 -0700] - <= dncache_replace OK,  cache size now 0 cache count now 0
[22/Jul/2011:10:39:03 -0700] - => dncache_return (nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff,dc=example,dc=com) reference count: 1, dn in cache:1
[22/Jul/2011:10:39:04 -0700] - => dncache_return (dc=example,dc=com) reference count: 1, dn in cache:1
[22/Jul/2011:10:39:04 -0700] - => entrycache_add_int( "dc=example,dc=com", 1 )
[22/Jul/2011:10:39:04 -0700] - entry dc=example,dc=com already in id cache!
[22/Jul/2011:10:39:04 -0700] - BAD CACHE ASSERTION at ldap/servers/slapd/back-ldbm/cache.c/933: e->ep_refcnt > 0
------------------------------------------------------------------

The cache is supposed to be cleared before we start importing entries, yet we can see that entries are being left in the cache for some reason.  These leftover entries are causing the assertion (and the RUV errors).
Comment 4 Nathan Kinder 2011-07-22 19:36:34 EDT
Noriko and I got to the bottom of this issue.  It turns out that the slapi_counter_* API is broken on 32-bit F15.  We use this API to maintain the current cache sizes, which are stored as a 64-bit value (even on 32-bit systems).

To support atomic operations on a Slapi_Counter on a 32-bit system, we previously needed to have our own assembly code to get and set specific values in the counter.  Certain platforms did not have the gcc __sync_bool_compare_and_swap() built-in for us to use, which required us to roll our own in assembly code.  Long story short, it appears that gcc on F15 optimizes our assembly code, which breaks it.  The version of gcc on F14 and earlier did not optimize our assembly the same way, which is why it works there.

I believe that we no longer need to have our own assembly code, as __sync_bool_compare_and_swap() should be available on all platforms we currently build for.  I believe it was in the gcc3 timeframe that we needed to add the assembly code.  I'll work up a patch for this next week, but a quick test of just using __sync_bool_compare_and_swap() in slapi_counter_set() and slapi_counter_get() fixes the issue on my F15 i686 system.
Comment 5 Nathan Kinder 2011-07-25 17:11:14 EDT
Created attachment 515140 [details]
Patch
Comment 6 Rich Megginson 2011-07-25 17:37:51 EDT
Comment on attachment 515140 [details]
Patch

Is this just for f15?  If so, is there some logic we can put in the code or in configure to turn on this code when using older versions of gcc?
Comment 7 Nathan Kinder 2011-07-25 17:53:41 EDT
(In reply to comment #6)
> Comment on attachment 515140 [details]
> Patch
> 
> Is this just for f15?  If so, is there some logic we can put in the code or in
> configure to turn on this code when using older versions of gcc?

There was no way of detecting if __sync_bool_compare_and_swap() was available and supported 8 byte parameters in old versions of gcc from what I remember.  The problem is that you can have the __sync_bool_compare_and_swap symbol, but it may only support 32-bit values.  You don't find out that there's a problem until you actually compile, which doesn't help at configure time.

The original issue that required us to write our own get/set in assembly was with gcc3 IIRC.  I've tested using __sync_bool_compare_and_swap() with 8 byte parameters on F14 i386, F15 i386, and RHEL6.1 i386, and it's available and works correctly.
Comment 8 Rich Megginson 2011-07-25 18:08:16 EDT
(In reply to comment #7)
> (In reply to comment #6)
> > Comment on attachment 515140 [details]
> > Patch
> > 
> > Is this just for f15?  If so, is there some logic we can put in the code or in
> > configure to turn on this code when using older versions of gcc?
> 
> There was no way of detecting if __sync_bool_compare_and_swap() was available
> and supported 8 byte parameters in old versions of gcc from what I remember. 
> The problem is that you can have the __sync_bool_compare_and_swap symbol, but
> it may only support 32-bit values.  You don't find out that there's a problem
> until you actually compile, which doesn't help at configure time.

We could write a small test program at configure time - that's the way most of the configure tests for headers and libraries works.

But I'm ok with the way it is, but we may have problems if we decide to support platforms that use older gcc (debian?  suse?  bsd?)

> The original issue that required us to write our own get/set in assembly was
> with gcc3 IIRC.  I've tested using __sync_bool_compare_and_swap() with 8 byte
> parameters on F14 i386, F15 i386, and RHEL6.1 i386, and it's available and
> works correctly.
Comment 9 Nathan Kinder 2011-07-26 10:55:47 EDT
(In reply to comment #8)
> 
> But I'm ok with the way it is, but we may have problems if we decide to support
> platforms that use older gcc (debian?  suse?  bsd?)

We can always change the define for 64-bit atomic operations on those platforms, which will use a mutex in the Slapi_Counter calls instead.  We'll deal with that if it comes up, though I think we're safe on anything fairly modern.

Pushed to master.  Thanks to Rich and Noriko for their reviews!

Counting objects: 11, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 793 bytes, done.
Total 6 (delta 4), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   5efd62b..c5bd907  master -> master
Comment 10 Rich Megginson 2011-08-01 19:09:10 EDT
Created attachment 516222 [details]
0001-Bug-723937-replication-failing-on-RUV-errors.patch
Comment 11 Fedora Update System 2011-08-02 13:36:23 EDT
389-ds-base-1.2.9.1-2.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/389-ds-base-1.2.9.1-2.fc16
Comment 12 Fedora Update System 2011-08-02 13:37:03 EDT
389-ds-base-1.2.9.1-2.fc15 has been submitted as an update for Fedora 15.
https://admin.fedoraproject.org/updates/389-ds-base-1.2.9.1-2.fc15
Comment 13 Fedora Update System 2011-08-02 13:37:56 EDT
389-ds-base-1.2.9.1-2.fc14 has been submitted as an update for Fedora 14.
https://admin.fedoraproject.org/updates/389-ds-base-1.2.9.1-2.fc14
Comment 14 Fedora Update System 2011-08-02 13:38:41 EDT
389-ds-base-1.2.9.1-2.el5 has been submitted as an update for Fedora EPEL 5.
https://admin.fedoraproject.org/updates/389-ds-base-1.2.9.1-2.el5
Comment 15 Martin Kosek 2011-08-10 05:31:14 EDT
Rich, I tested the new 389-ds-base version on my two F-15 32b systems and the installation is still broken (RUV bugs that Rob reported). I will attach my errors file.

RPM version:
389-ds-base-1.2.9.5-1.fc15.i686
Comment 16 Martin Kosek 2011-08-10 05:36:10 EDT
Created attachment 517554 [details]
dirsrv errors on a replica
Comment 17 Rich Megginson 2011-08-10 13:09:57 EDT
Created attachment 517662 [details]
0001-Bug-723937-replication-failing-on-RUV-errors.patch
Comment 18 Rich Megginson 2011-08-10 13:36:15 EDT
To ssh://git.fedorahosted.org/git/389/ds.git
   1e357c1..de75e6d  master -> master
commit de75e6d27839a6a63f0ac7b8a830db5b95813073
Author: Rich Megginson <rmeggins@redhat.com>
Date:   Wed Aug 10 09:46:59 2011 -0600
    Reviewed by: nkinder (Thanks!)
    Branch: master
    Fix Description: I incorrectly assumed I could just revert the patch
    that removed the assembler code, but I needed to tweak the #if
    condition that decides to use the assembler code or not.
    I also added separate configure checks and defines for the 64bit atomic
    op functions and the 64bit atomic cas function since they are different
    on some platforms.
    Platforms tested: RHEL6 x86_64, F15 i686, EL5 i386
    Flag Day: no
    Doc impact: no
Comment 19 Martin Kosek 2011-08-11 07:04:09 EDT
I can confirm that FreeIPA replica installation was OK on a 32-bit system. I used the following version:

389-ds-base-1.2.9.6-1.fc15.i686

Note You need to log in before you can comment on or make changes to this bug.