Bug 723937

Summary: replication failing on RUV errors
Product: [Retired] 389 Reporter: Rob Crittenden <rcritten>
Component: Directory ServerAssignee: Rich Megginson <rmeggins>
Status: CLOSED CURRENTRELEASE QA Contact: Viktor Ashirov <vashirov>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 1.2.9CC: edewata, mkosek, nhosoi, nkinder, rmeggins
Target Milestone: ---Keywords: VerifiedUpstream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-12-07 16:35:41 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 708096, 726742    
Attachments:
Description Flags
errors log on replica
none
Patch
nkinder: review?, nhosoi: review+
0001-Bug-723937-replication-failing-on-RUV-errors.patch
nkinder: review+
dirsrv errors on a replica
none
0001-Bug-723937-replication-failing-on-RUV-errors.patch nkinder: review+

Description Rob Crittenden 2011-07-21 15:18:17 UTC
Description of problem:

Trying to create a replica of an IPA server, the installation fails because an entry cannot be added to the local 389-ds instance due to an error 32.

This is done after an MMR agreement has been set up and initial replication completed. During this replication the replia logs a slew of RUV errors and the server ends up in a strange state.

Version-Release number of selected component (if applicable):

389-ds-base-1.2.8.3-1.fc15.i686
389-ds-base-1.2.9.0-1.fc15.i686

How reproducible:


Steps to Reproduce:
1. On server A, ipa-server-install
2. On server A, ipa-replica-prepare B
3. On server B, ipa-replica-install replica-info-B

Comment 1 Rob Crittenden 2011-07-21 15:19:07 UTC
Created attachment 514222 [details]
errors log on replica

Comment 2 Nathan Kinder 2011-07-21 21:59:06 UTC
I am able to reproduce this outside of IPA.  The issue is reproducible by setting up 2 masters on F15 i686.  The problem does not occur on a F15 x86_64 system.

I believe that the problem is related to these cache errors:

[20/Jul/2011:22:26:55 -0400] - entrycache_clear_int: there are still 29 entries in the entry cache.
[20/Jul/2011:22:26:55 -0400] - dncache_clear_int: there are still 187 dn's in the dn cache. :/

Comment 3 Nathan Kinder 2011-07-22 17:54:48 UTC
This definitely looks to be some sort of issue with the caches, yet there is some timing element to it as well.  It only seems to happen on 32-bit F15 with optimized 389-ds-base binaries (F14 x86_64 was tested as well as F14 i686).

I rebuilt an optimized 389-ds-base with extra cache debugging compiled in.  I encounter this assertion in gdb on the master that is being initialized:

------------------------------------------------------------------
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb27ffb70 (LWP 8938)]
0x009c7aa5 in cache_remove (cache=0x98c24dc, ptr=0xb3a021d8)
    at ldap/servers/slapd/back-ldbm/cache.c:933
933             ASSERT(e->ep_refcnt > 0);
(gdb) bt
#0  0x009c7aa5 in cache_remove (cache=0x98c24dc, ptr=0xb3a021d8)
    at ldap/servers/slapd/back-ldbm/cache.c:933
#1  0x009e99e1 in import_foreman (param=0xb2800870)
    at ldap/servers/slapd/back-ldbm/import-threads.c:2327
#2  0x4da50c22 in ?? () from /lib/libnspr4.so
#3  0x4c9d59fe in start_thread () from /lib/libpthread.so.0
#4  0x4c91c24e in clone () from /lib/libc.so.6
------------------------------------------------------------------

The cache_remove() function expects the reference count to be at least 1.  In import_foreman, we call cache_remove() shortly after calling id2entry_add_ext(), which should add the new entry to the cache and thus set the reference count to 1.  What I'm finding is that id2entry_add_ext() fails to add the new entry to the cache since another entry with the same ID and DN already exists in the cache.  This leaves the reference count at 0 for the new entry, yet we still call cache_remove(), leading to the assertion.  Here's the debug cache logging:

------------------------------------------------------------------
[22/Jul/2011:10:39:01 -0700] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=example,dc=com is going offline; disabling replication
[22/Jul/2011:10:39:02 -0700] - => cache_find_dn (dc=example,dc=com)
[22/Jul/2011:10:39:02 -0700] - <= cache_find_dn (NOT FOUND)
[22/Jul/2011:10:39:02 -0700] - => cache_return
<= cache_return (null entry)
[22/Jul/2011:10:39:02 -0700] - => cache_return
<= cache_return (null entry)
[22/Jul/2011:10:39:02 -0700] - => entrycache_flush
[22/Jul/2011:10:39:02 -0700] - <= entrycache_flush (down to 1 entries, 0 bytes)
[22/Jul/2011:10:39:02 -0700] - entrycache_clear_int: there are still 1 entries in the entry cache.
[22/Jul/2011:10:39:02 -0700] - ID(s) in entry cache:
[22/Jul/2011:10:39:03 -0700] - 1:
[22/Jul/2011:10:39:03 -0700] - => dncache_flush
[22/Jul/2011:10:39:03 -0700] - <= dncache_flush (down to 1 dns, 0 bytes)
[22/Jul/2011:10:39:03 -0700] - dncache_clear_int: there are still 1 dn's in the dn cache. :/
[22/Jul/2011:10:39:03 -0700] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[22/Jul/2011:10:39:03 -0700] - => dncache_add_int( "dc=example,dc=com", 1 )
[22/Jul/2011:10:39:03 -0700] - entry dc=example,dc=com already in id cache!
[22/Jul/2011:10:39:03 -0700] - => dncache_replace (nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff,dc=example,dc=com) -> (dc=example,dc=com)
[22/Jul/2011:10:39:03 -0700] - <= dncache_replace OK,  cache size now 0 cache count now 0
[22/Jul/2011:10:39:03 -0700] - => dncache_return (nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff,dc=example,dc=com) reference count: 1, dn in cache:1
[22/Jul/2011:10:39:04 -0700] - => dncache_return (dc=example,dc=com) reference count: 1, dn in cache:1
[22/Jul/2011:10:39:04 -0700] - => entrycache_add_int( "dc=example,dc=com", 1 )
[22/Jul/2011:10:39:04 -0700] - entry dc=example,dc=com already in id cache!
[22/Jul/2011:10:39:04 -0700] - BAD CACHE ASSERTION at ldap/servers/slapd/back-ldbm/cache.c/933: e->ep_refcnt > 0
------------------------------------------------------------------

The cache is supposed to be cleared before we start importing entries, yet we can see that entries are being left in the cache for some reason.  These leftover entries are causing the assertion (and the RUV errors).

Comment 4 Nathan Kinder 2011-07-22 23:36:34 UTC
Noriko and I got to the bottom of this issue.  It turns out that the slapi_counter_* API is broken on 32-bit F15.  We use this API to maintain the current cache sizes, which are stored as a 64-bit value (even on 32-bit systems).

To support atomic operations on a Slapi_Counter on a 32-bit system, we previously needed to have our own assembly code to get and set specific values in the counter.  Certain platforms did not have the gcc __sync_bool_compare_and_swap() built-in for us to use, which required us to roll our own in assembly code.  Long story short, it appears that gcc on F15 optimizes our assembly code, which breaks it.  The version of gcc on F14 and earlier did not optimize our assembly the same way, which is why it works there.

I believe that we no longer need to have our own assembly code, as __sync_bool_compare_and_swap() should be available on all platforms we currently build for.  I believe it was in the gcc3 timeframe that we needed to add the assembly code.  I'll work up a patch for this next week, but a quick test of just using __sync_bool_compare_and_swap() in slapi_counter_set() and slapi_counter_get() fixes the issue on my F15 i686 system.

Comment 5 Nathan Kinder 2011-07-25 21:11:14 UTC
Created attachment 515140 [details]
Patch

Comment 6 Rich Megginson 2011-07-25 21:37:51 UTC
Comment on attachment 515140 [details]
Patch

Is this just for f15?  If so, is there some logic we can put in the code or in configure to turn on this code when using older versions of gcc?

Comment 7 Nathan Kinder 2011-07-25 21:53:41 UTC
(In reply to comment #6)
> Comment on attachment 515140 [details]
> Patch
> 
> Is this just for f15?  If so, is there some logic we can put in the code or in
> configure to turn on this code when using older versions of gcc?

There was no way of detecting if __sync_bool_compare_and_swap() was available and supported 8 byte parameters in old versions of gcc from what I remember.  The problem is that you can have the __sync_bool_compare_and_swap symbol, but it may only support 32-bit values.  You don't find out that there's a problem until you actually compile, which doesn't help at configure time.

The original issue that required us to write our own get/set in assembly was with gcc3 IIRC.  I've tested using __sync_bool_compare_and_swap() with 8 byte parameters on F14 i386, F15 i386, and RHEL6.1 i386, and it's available and works correctly.

Comment 8 Rich Megginson 2011-07-25 22:08:16 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > Comment on attachment 515140 [details]
> > Patch
> > 
> > Is this just for f15?  If so, is there some logic we can put in the code or in
> > configure to turn on this code when using older versions of gcc?
> 
> There was no way of detecting if __sync_bool_compare_and_swap() was available
> and supported 8 byte parameters in old versions of gcc from what I remember. 
> The problem is that you can have the __sync_bool_compare_and_swap symbol, but
> it may only support 32-bit values.  You don't find out that there's a problem
> until you actually compile, which doesn't help at configure time.

We could write a small test program at configure time - that's the way most of the configure tests for headers and libraries works.

But I'm ok with the way it is, but we may have problems if we decide to support platforms that use older gcc (debian?  suse?  bsd?)

> The original issue that required us to write our own get/set in assembly was
> with gcc3 IIRC.  I've tested using __sync_bool_compare_and_swap() with 8 byte
> parameters on F14 i386, F15 i386, and RHEL6.1 i386, and it's available and
> works correctly.

Comment 9 Nathan Kinder 2011-07-26 14:55:47 UTC
(In reply to comment #8)
> 
> But I'm ok with the way it is, but we may have problems if we decide to support
> platforms that use older gcc (debian?  suse?  bsd?)

We can always change the define for 64-bit atomic operations on those platforms, which will use a mutex in the Slapi_Counter calls instead.  We'll deal with that if it comes up, though I think we're safe on anything fairly modern.

Pushed to master.  Thanks to Rich and Noriko for their reviews!

Counting objects: 11, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 793 bytes, done.
Total 6 (delta 4), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   5efd62b..c5bd907  master -> master

Comment 10 Rich Megginson 2011-08-01 23:09:10 UTC
Created attachment 516222 [details]
0001-Bug-723937-replication-failing-on-RUV-errors.patch

Comment 11 Fedora Update System 2011-08-02 17:36:23 UTC
389-ds-base-1.2.9.1-2.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/389-ds-base-1.2.9.1-2.fc16

Comment 12 Fedora Update System 2011-08-02 17:37:03 UTC
389-ds-base-1.2.9.1-2.fc15 has been submitted as an update for Fedora 15.
https://admin.fedoraproject.org/updates/389-ds-base-1.2.9.1-2.fc15

Comment 13 Fedora Update System 2011-08-02 17:37:56 UTC
389-ds-base-1.2.9.1-2.fc14 has been submitted as an update for Fedora 14.
https://admin.fedoraproject.org/updates/389-ds-base-1.2.9.1-2.fc14

Comment 14 Fedora Update System 2011-08-02 17:38:41 UTC
389-ds-base-1.2.9.1-2.el5 has been submitted as an update for Fedora EPEL 5.
https://admin.fedoraproject.org/updates/389-ds-base-1.2.9.1-2.el5

Comment 15 Martin Kosek 2011-08-10 09:31:14 UTC
Rich, I tested the new 389-ds-base version on my two F-15 32b systems and the installation is still broken (RUV bugs that Rob reported). I will attach my errors file.

RPM version:
389-ds-base-1.2.9.5-1.fc15.i686

Comment 16 Martin Kosek 2011-08-10 09:36:10 UTC
Created attachment 517554 [details]
dirsrv errors on a replica

Comment 17 Rich Megginson 2011-08-10 17:09:57 UTC
Created attachment 517662 [details]
0001-Bug-723937-replication-failing-on-RUV-errors.patch

Comment 18 Rich Megginson 2011-08-10 17:36:15 UTC
To ssh://git.fedorahosted.org/git/389/ds.git
   1e357c1..de75e6d  master -> master
commit de75e6d27839a6a63f0ac7b8a830db5b95813073
Author: Rich Megginson <rmeggins>
Date:   Wed Aug 10 09:46:59 2011 -0600
    Reviewed by: nkinder (Thanks!)
    Branch: master
    Fix Description: I incorrectly assumed I could just revert the patch
    that removed the assembler code, but I needed to tweak the #if
    condition that decides to use the assembler code or not.
    I also added separate configure checks and defines for the 64bit atomic
    op functions and the 64bit atomic cas function since they are different
    on some platforms.
    Platforms tested: RHEL6 x86_64, F15 i686, EL5 i386
    Flag Day: no
    Doc impact: no

Comment 19 Martin Kosek 2011-08-11 11:04:09 UTC
I can confirm that FreeIPA replica installation was OK on a 32-bit system. I used the following version:

389-ds-base-1.2.9.6-1.fc15.i686