Bug 723937
Summary: | replication failing on RUV errors | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | [Retired] 389 | Reporter: | Rob Crittenden <rcritten> | ||||||||||||
Component: | Directory Server | Assignee: | Rich Megginson <rmeggins> | ||||||||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Viktor Ashirov <vashirov> | ||||||||||||
Severity: | unspecified | Docs Contact: | |||||||||||||
Priority: | unspecified | ||||||||||||||
Version: | 1.2.9 | CC: | edewata, mkosek, nhosoi, nkinder, rmeggins | ||||||||||||
Target Milestone: | --- | Keywords: | VerifiedUpstream | ||||||||||||
Target Release: | --- | ||||||||||||||
Hardware: | Unspecified | ||||||||||||||
OS: | Unspecified | ||||||||||||||
Whiteboard: | |||||||||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||||||||
Doc Text: | Story Points: | --- | |||||||||||||
Clone Of: | Environment: | ||||||||||||||
Last Closed: | 2015-12-07 16:35:41 UTC | Type: | --- | ||||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||||
Documentation: | --- | CRM: | |||||||||||||
Verified Versions: | Category: | --- | |||||||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||||
Embargoed: | |||||||||||||||
Bug Depends On: | |||||||||||||||
Bug Blocks: | 708096, 726742 | ||||||||||||||
Attachments: |
|
Description
Rob Crittenden
2011-07-21 15:18:17 UTC
Created attachment 514222 [details]
errors log on replica
I am able to reproduce this outside of IPA. The issue is reproducible by setting up 2 masters on F15 i686. The problem does not occur on a F15 x86_64 system. I believe that the problem is related to these cache errors: [20/Jul/2011:22:26:55 -0400] - entrycache_clear_int: there are still 29 entries in the entry cache. [20/Jul/2011:22:26:55 -0400] - dncache_clear_int: there are still 187 dn's in the dn cache. :/ This definitely looks to be some sort of issue with the caches, yet there is some timing element to it as well. It only seems to happen on 32-bit F15 with optimized 389-ds-base binaries (F14 x86_64 was tested as well as F14 i686). I rebuilt an optimized 389-ds-base with extra cache debugging compiled in. I encounter this assertion in gdb on the master that is being initialized: ------------------------------------------------------------------ Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0xb27ffb70 (LWP 8938)] 0x009c7aa5 in cache_remove (cache=0x98c24dc, ptr=0xb3a021d8) at ldap/servers/slapd/back-ldbm/cache.c:933 933 ASSERT(e->ep_refcnt > 0); (gdb) bt #0 0x009c7aa5 in cache_remove (cache=0x98c24dc, ptr=0xb3a021d8) at ldap/servers/slapd/back-ldbm/cache.c:933 #1 0x009e99e1 in import_foreman (param=0xb2800870) at ldap/servers/slapd/back-ldbm/import-threads.c:2327 #2 0x4da50c22 in ?? () from /lib/libnspr4.so #3 0x4c9d59fe in start_thread () from /lib/libpthread.so.0 #4 0x4c91c24e in clone () from /lib/libc.so.6 ------------------------------------------------------------------ The cache_remove() function expects the reference count to be at least 1. In import_foreman, we call cache_remove() shortly after calling id2entry_add_ext(), which should add the new entry to the cache and thus set the reference count to 1. What I'm finding is that id2entry_add_ext() fails to add the new entry to the cache since another entry with the same ID and DN already exists in the cache. This leaves the reference count at 0 for the new entry, yet we still call cache_remove(), leading to the assertion. Here's the debug cache logging: ------------------------------------------------------------------ [22/Jul/2011:10:39:01 -0700] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=example,dc=com is going offline; disabling replication [22/Jul/2011:10:39:02 -0700] - => cache_find_dn (dc=example,dc=com) [22/Jul/2011:10:39:02 -0700] - <= cache_find_dn (NOT FOUND) [22/Jul/2011:10:39:02 -0700] - => cache_return <= cache_return (null entry) [22/Jul/2011:10:39:02 -0700] - => cache_return <= cache_return (null entry) [22/Jul/2011:10:39:02 -0700] - => entrycache_flush [22/Jul/2011:10:39:02 -0700] - <= entrycache_flush (down to 1 entries, 0 bytes) [22/Jul/2011:10:39:02 -0700] - entrycache_clear_int: there are still 1 entries in the entry cache. [22/Jul/2011:10:39:02 -0700] - ID(s) in entry cache: [22/Jul/2011:10:39:03 -0700] - 1: [22/Jul/2011:10:39:03 -0700] - => dncache_flush [22/Jul/2011:10:39:03 -0700] - <= dncache_flush (down to 1 dns, 0 bytes) [22/Jul/2011:10:39:03 -0700] - dncache_clear_int: there are still 1 dn's in the dn cache. :/ [22/Jul/2011:10:39:03 -0700] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database [22/Jul/2011:10:39:03 -0700] - => dncache_add_int( "dc=example,dc=com", 1 ) [22/Jul/2011:10:39:03 -0700] - entry dc=example,dc=com already in id cache! [22/Jul/2011:10:39:03 -0700] - => dncache_replace (nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff,dc=example,dc=com) -> (dc=example,dc=com) [22/Jul/2011:10:39:03 -0700] - <= dncache_replace OK, cache size now 0 cache count now 0 [22/Jul/2011:10:39:03 -0700] - => dncache_return (nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff,dc=example,dc=com) reference count: 1, dn in cache:1 [22/Jul/2011:10:39:04 -0700] - => dncache_return (dc=example,dc=com) reference count: 1, dn in cache:1 [22/Jul/2011:10:39:04 -0700] - => entrycache_add_int( "dc=example,dc=com", 1 ) [22/Jul/2011:10:39:04 -0700] - entry dc=example,dc=com already in id cache! [22/Jul/2011:10:39:04 -0700] - BAD CACHE ASSERTION at ldap/servers/slapd/back-ldbm/cache.c/933: e->ep_refcnt > 0 ------------------------------------------------------------------ The cache is supposed to be cleared before we start importing entries, yet we can see that entries are being left in the cache for some reason. These leftover entries are causing the assertion (and the RUV errors). Noriko and I got to the bottom of this issue. It turns out that the slapi_counter_* API is broken on 32-bit F15. We use this API to maintain the current cache sizes, which are stored as a 64-bit value (even on 32-bit systems). To support atomic operations on a Slapi_Counter on a 32-bit system, we previously needed to have our own assembly code to get and set specific values in the counter. Certain platforms did not have the gcc __sync_bool_compare_and_swap() built-in for us to use, which required us to roll our own in assembly code. Long story short, it appears that gcc on F15 optimizes our assembly code, which breaks it. The version of gcc on F14 and earlier did not optimize our assembly the same way, which is why it works there. I believe that we no longer need to have our own assembly code, as __sync_bool_compare_and_swap() should be available on all platforms we currently build for. I believe it was in the gcc3 timeframe that we needed to add the assembly code. I'll work up a patch for this next week, but a quick test of just using __sync_bool_compare_and_swap() in slapi_counter_set() and slapi_counter_get() fixes the issue on my F15 i686 system. Created attachment 515140 [details]
Patch
Comment on attachment 515140 [details]
Patch
Is this just for f15? If so, is there some logic we can put in the code or in configure to turn on this code when using older versions of gcc?
(In reply to comment #6) > Comment on attachment 515140 [details] > Patch > > Is this just for f15? If so, is there some logic we can put in the code or in > configure to turn on this code when using older versions of gcc? There was no way of detecting if __sync_bool_compare_and_swap() was available and supported 8 byte parameters in old versions of gcc from what I remember. The problem is that you can have the __sync_bool_compare_and_swap symbol, but it may only support 32-bit values. You don't find out that there's a problem until you actually compile, which doesn't help at configure time. The original issue that required us to write our own get/set in assembly was with gcc3 IIRC. I've tested using __sync_bool_compare_and_swap() with 8 byte parameters on F14 i386, F15 i386, and RHEL6.1 i386, and it's available and works correctly. (In reply to comment #7) > (In reply to comment #6) > > Comment on attachment 515140 [details] > > Patch > > > > Is this just for f15? If so, is there some logic we can put in the code or in > > configure to turn on this code when using older versions of gcc? > > There was no way of detecting if __sync_bool_compare_and_swap() was available > and supported 8 byte parameters in old versions of gcc from what I remember. > The problem is that you can have the __sync_bool_compare_and_swap symbol, but > it may only support 32-bit values. You don't find out that there's a problem > until you actually compile, which doesn't help at configure time. We could write a small test program at configure time - that's the way most of the configure tests for headers and libraries works. But I'm ok with the way it is, but we may have problems if we decide to support platforms that use older gcc (debian? suse? bsd?) > The original issue that required us to write our own get/set in assembly was > with gcc3 IIRC. I've tested using __sync_bool_compare_and_swap() with 8 byte > parameters on F14 i386, F15 i386, and RHEL6.1 i386, and it's available and > works correctly. (In reply to comment #8) > > But I'm ok with the way it is, but we may have problems if we decide to support > platforms that use older gcc (debian? suse? bsd?) We can always change the define for 64-bit atomic operations on those platforms, which will use a mutex in the Slapi_Counter calls instead. We'll deal with that if it comes up, though I think we're safe on anything fairly modern. Pushed to master. Thanks to Rich and Noriko for their reviews! Counting objects: 11, done. Delta compression using up to 2 threads. Compressing objects: 100% (6/6), done. Writing objects: 100% (6/6), 793 bytes, done. Total 6 (delta 4), reused 0 (delta 0) To ssh://git.fedorahosted.org/git/389/ds.git 5efd62b..c5bd907 master -> master Created attachment 516222 [details]
0001-Bug-723937-replication-failing-on-RUV-errors.patch
389-ds-base-1.2.9.1-2.fc16 has been submitted as an update for Fedora 16. https://admin.fedoraproject.org/updates/389-ds-base-1.2.9.1-2.fc16 389-ds-base-1.2.9.1-2.fc15 has been submitted as an update for Fedora 15. https://admin.fedoraproject.org/updates/389-ds-base-1.2.9.1-2.fc15 389-ds-base-1.2.9.1-2.fc14 has been submitted as an update for Fedora 14. https://admin.fedoraproject.org/updates/389-ds-base-1.2.9.1-2.fc14 389-ds-base-1.2.9.1-2.el5 has been submitted as an update for Fedora EPEL 5. https://admin.fedoraproject.org/updates/389-ds-base-1.2.9.1-2.el5 Rich, I tested the new 389-ds-base version on my two F-15 32b systems and the installation is still broken (RUV bugs that Rob reported). I will attach my errors file. RPM version: 389-ds-base-1.2.9.5-1.fc15.i686 Created attachment 517554 [details]
dirsrv errors on a replica
Created attachment 517662 [details]
0001-Bug-723937-replication-failing-on-RUV-errors.patch
To ssh://git.fedorahosted.org/git/389/ds.git 1e357c1..de75e6d master -> master commit de75e6d27839a6a63f0ac7b8a830db5b95813073 Author: Rich Megginson <rmeggins> Date: Wed Aug 10 09:46:59 2011 -0600 Reviewed by: nkinder (Thanks!) Branch: master Fix Description: I incorrectly assumed I could just revert the patch that removed the assembler code, but I needed to tweak the #if condition that decides to use the assembler code or not. I also added separate configure checks and defines for the 64bit atomic op functions and the 64bit atomic cas function since they are different on some platforms. Platforms tested: RHEL6 x86_64, F15 i686, EL5 i386 Flag Day: no Doc impact: no I can confirm that FreeIPA replica installation was OK on a 32-bit system. I used the following version: 389-ds-base-1.2.9.6-1.fc15.i686 |