Description of problem:
Segfault when replicating data in multimaster mode:
* two nodes in MMR
* the nodes sit behind a load balancer

Version-Release number of selected component (if applicable):
1.2.4

How reproducible:
Enable multimaster replication between the two nodes and create/update/delete entries frequently; eventually one node will segfault.

Steps to Reproduce:
1. node A has a replica record specifying to delete entry X
2. node B has no entry named X
3. node B segfaults

Actual results:
Program received signal SIGSEGV, Segmentation fault.
0x000000364fa79140 in strcmp () from /lib64/libc.so.6
called by entry_same_dn

Expected results:
No segfaults.

Additional info:
See http://www.mail-archive.com/389-users@lists.fedoraproject.org/msg00185.html

Stack trace follows. We have already installed 389-ds-base-debuginfo, and the stack trace is:

Program received signal SIGSEGV, Segmentation fault.
0x000000364fa79140 in strcmp () from /lib64/libc.so.6
(gdb) bt
#0  0x000000364fa79140 in strcmp () from /lib64/libc.so.6
#1  0x00002b39f5cea4fc in entry_same_dn (e=<value optimized out>, k=0x2aaab800e860) at ldap/servers/slapd/back-ldbm/cache.c:137
#2  0x00002b39f5ce98d9 in add_hash (ht=0x191b1900, key=0x2aaab800e860, keylen=<value optimized out>, entry=0x2aaab800ae00, alt=0x64035b68) at ldap/servers/slapd/back-ldbm/cache.c:185
#3  0x00002b39f5ce9f27 in cache_add_int (cache=0x19105718, e=0x2aaab800ae00, state=0, alt=0x64035c18) at ldap/servers/slapd/back-ldbm/cache.c:1037
#4  0x00002b39f5cf8273 in id2entry (be=0x191aef70, id=1505303, txn=0x0, err=0x64035d58) at ldap/servers/slapd/back-ldbm/id2entry.c:268
#5  0x00002b39f5d254c0 in uniqueid2entry (be=0x191aef70, uniqueid=<value optimized out>, txn=0x0, err=0x64035d58) at ldap/servers/slapd/back-ldbm/uniqueid2entry.c:86
#6  0x00002b39f5cf7961 in find_entry_internal (pb=0x2aaab8008200, be=0x191aef70, addr=<value optimized out>, lock=1, txn=0x0, really_internal=0) at ldap/servers/slapd/back-ldbm/findentry.c:201
#7  0x00002b39f5d105fc in ldbm_back_delete (pb=0x2aaab8008200) at ldap/servers/slapd/back-ldbm/ldbm_delete.c:140
#8  0x00002b39f1d810d4 in op_shared_delete (pb=0x2aaab8008200) at ldap/servers/slapd/delete.c:318
#9  0x00002b39f1d81413 in do_delete (pb=0x2aaab8008200) at ldap/servers/slapd/delete.c:116
#10 0x0000000000412e79 in connection_threadmain () at ldap/servers/slapd/connection.c:548
#11 0x0000003590827fad in ?? () from /usr/lib64/libnspr4.so
#12 0x00000036506064a7 in start_thread () from /lib64/libpthread.so.0
#13 0x000000364fad3c2d in clone () from /lib64/libc.so.6
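For context on the failure mode: frames #0/#1 suggest strcmp() is called from the entry-cache hash compare with a DN pointer that is NULL or already freed. The following standalone C sketch is only a hypothetical illustration of that class of crash; the struct and function names are invented for the example and are not the actual back-ldbm/cache.c code, and the NULL check shown is just the symptom-level guard, not the committed fix.

#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for a cached entry and its key; not the real
 * back-ldbm structures. */
struct fake_entry {
    const char *dn;   /* may be NULL if the entry was already torn down */
};

/* A compare callback that trusts its inputs: if e->dn is NULL (or points to
 * freed memory), strcmp() dereferences it and the process gets SIGSEGV,
 * like frames #0/#1 in the trace above. */
static int entry_same_dn_unsafe(const struct fake_entry *e, const char *key)
{
    return strcmp(e->dn, key) == 0;
}

/* Defensive variant: treat a missing DN as "not equal" instead of crashing. */
static int entry_same_dn_checked(const struct fake_entry *e, const char *key)
{
    if (e == NULL || e->dn == NULL || key == NULL) {
        return 0;
    }
    return strcmp(e->dn, key) == 0;
}

int main(void)
{
    struct fake_entry broken = { NULL };
    printf("checked compare: %d\n",
           entry_same_dn_checked(&broken, "uid=x,dc=example,dc=com"));
    /* entry_same_dn_unsafe(&broken, "uid=x,dc=example,dc=com"); would segfault */
    (void)entry_same_dn_unsafe;
    return 0;
}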
What do these entries look like? What sorts of operations are you doing?
Further information:
- ran a stress test on 64-bit with about 8 million entries
- during the stress test we had a crash due to a full filesystem
- after that event the issue arose
- we then dropped the databases (db2ldif) and recreated them (ldif2db), but that didn't fix it and the segfault appeared again
- when we found a guilty entry we removed it, but after some hours the issue came back with another entry
- we had to reconfigure the infrastructure as master-slave and disable multimaster; since then, no more segfaults

Could be cache related: db cache size 2 GB, server RAM 32 GB.

Not sure if that is nsslapd-cachememsize or nsslapd-dbcachesize. If it is the latter, the entry cache could be set to the default value, which would mean a tremendous amount of churn on the entry cache as entries are constantly being moved in and out. It could be that the entry churn on the entry cache causes this problem.
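To make the churn argument concrete, here is a small, self-contained C sketch (not 389 code; the cache size, workload size, and the simple FIFO eviction policy are assumptions for illustration): when the working set is vastly larger than the entry cache, nearly every access misses, so entries are continuously loaded from the database and evicted again.

#include <stdio.h>

/* Hypothetical numbers: a cache holding 500 entries while the workload
 * touches 8,000,000 distinct entries. A FIFO ring is enough to show the
 * effect; the real entry cache is more sophisticated. */
#define CACHE_SLOTS 500
#define WORKLOAD    8000000L

int main(void)
{
    long cache[CACHE_SLOTS];
    long i, next = 0, filled = 0, evictions = 0;

    for (i = 0; i < WORKLOAD; i++) {
        /* Each distinct entry forces a load from the database and, once the
         * cache is full, an eviction of some other entry -- constant churn. */
        if (filled == CACHE_SLOTS) {
            evictions++;
        } else {
            filled++;
        }
        cache[next] = i;
        next = (next + 1) % CACHE_SLOTS;
    }

    printf("entries touched: %ld, evictions: %ld\n", WORKLOAD, evictions);
    return 0;
}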
Roberto, is it possible for you to re-test this using the latest version (1.2.6)?

  yum --enablerepo=updates-testing install 389-ds

The code has changed quite significantly since 1.2.4, so the above stack trace is no longer valid. Do you do any search operations in your test? Thanks.
Hi Endi, we use EPEL x86_64, and the latest stable is 1.2.5-1:
http://download.fedora.redhat.com/pub/epel/5/x86_64/repoview/389-ds-base.html
We may evaluate the possibility of upgrading to that release. The error arose during read/write access to the servers. Peace, R.
Created attachment 406561 [details] 0001-Bug-576644-segfault-while-multimaster-replication.patch
To ssh://git.fedorahosted.org/git/389/ds.git
   92ca2bb..c15e10b  Directory_Server_8_2_Branch -> Directory_Server_8_2_Branch
commit c15e10b50189d384436728be9ee17986225882c8
Author: Rich Megginson <rmeggins>
Date:   Wed Apr 14 10:15:53 2010 -0600

   c53b8b3..e50dceb  master -> master
commit e50dceb45a2ddffe749b444fa057d93776f882c9
Author: Rich Megginson <rmeggins>
Date:   Wed Apr 14 10:15:53 2010 -0600

Fixed by: edewata, nhosoi

Fix Description: The delete code stores the actual entry from the entry cache in the pblock as SLAPI_DELETE_BEPREOP_ENTRY so that the be preop plugins can have access to the entry. SLAPI_DELETE_BEPREOP_ENTRY is an alias for SLAPI_ENTRY_PRE_OP, which is used by the front-end delete code. When processing a replicated delete operation, and the entry has already been deleted (converted to a tombstone), we needed to restore the original entry in SLAPI_DELETE_BEPREOP_ENTRY so that the front-end code can free it as SLAPI_ENTRY_PRE_OP instead of freeing the actual entry from the cache.

Platforms tested: RHEL5 x86_64
Flag Day: no
Doc impact: no
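To illustrate the ownership problem the fix description talks about, here is a standalone C sketch with invented struct and function names (it is not the actual slapd code, which uses Slapi_PBlock/Slapi_Entry): because SLAPI_DELETE_BEPREOP_ENTRY and SLAPI_ENTRY_PRE_OP are the same pblock slot, leaving the live cache entry in that slot means the front end frees the cached entry during its own cleanup, and later cache lookups run strcmp() over a dangling DN.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Invented stand-ins for a parameter-block slot and an entry. */
struct entry {
    char *dn;
};

struct pblock {
    struct entry *pre_op_entry;   /* one slot, referenced under two names */
};

static struct entry *entry_new(const char *dn)
{
    struct entry *e = malloc(sizeof(*e));
    e->dn = strdup(dn);
    return e;
}

static void entry_free(struct entry *e)
{
    if (e) {
        free(e->dn);
        free(e);
    }
}

int main(void)
{
    /* The front end places its own copy of the entry into the slot... */
    struct entry *frontend_copy = entry_new("uid=x,dc=example,dc=com");
    /* ...while the entry cache holds the live entry. */
    struct entry *cache_entry = entry_new("uid=x,dc=example,dc=com");

    struct pblock pb = { frontend_copy };

    /* Backend (tombstone case): temporarily exposes the cache entry to the
     * be-preop plugins via the shared slot. */
    struct entry *saved = pb.pre_op_entry;
    pb.pre_op_entry = cache_entry;

    /* The essence of the fix: restore the original front-end entry before
     * returning, so the front end frees its own copy, not the cached one.
     * Without this step the front end would free cache_entry and leave the
     * entry cache holding a dangling dn -- the strcmp() crash above. */
    pb.pre_op_entry = saved;

    /* Front-end cleanup frees whatever sits in its SLAPI_ENTRY_PRE_OP slot. */
    entry_free(pb.pre_op_entry);

    /* The cache entry is still valid and can be compared safely. */
    printf("cache entry dn still valid: %s\n", cache_entry->dn);
    entry_free(cache_entry);
    return 0;
}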
Here are the steps to reproduce this bug consistently:

1. Create 2 DS nodes (e.g. node1 & node2).
2. Configure MMR on both nodes.
3. Open Directory->config->plugins->ldbm database->userRoot. Set nsslapd-cachesize to 500 on both nodes.
4. Restart both nodes.
5. Create a replication agreement on node1.
6. Add 500 entries to node1 and let them replicate to node2 (a bulk add/delete sketch follows this list).

Delete entries from node1 while suspending the replication:
7. Stop node2.
8. Remove the entries from node1.

Delete entries from node2 while suspending the replication:
9. Stop node1 first, then start node2.
10. Remove the entries from node2.

Resume replication:
11. Start node1 again. As soon as node1 tries to synchronize, node2 will crash.

After these steps node2 can be started again, but as soon as node1 tries to synchronize, node2 will crash again.
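A minimal client sketch for steps 6, 8 and 10, written against the OpenLDAP C client library (libldap). The URI, bind DN, password, and suffix are assumptions for illustration; point the URI at whichever node the step targets, or simply use ldapadd/ldapdelete instead.

/* Bulk add/delete helper.
 * Build: cc repro.c -o repro -lldap -llber
 * Run:   ./repro          (add 500 test entries)
 *        ./repro delete   (delete them again)
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ldap.h>

#define NUM_ENTRIES 500

/* Assumed test parameters -- adjust to the actual setup. */
static const char *URI    = "ldap://node1.example.com:389";
static const char *BINDDN = "cn=Directory Manager";
static const char *BINDPW = "Secret123";
static const char *SUFFIX = "dc=example,dc=com";

int main(int argc, char **argv)
{
    int delete_mode = (argc > 1 && strcmp(argv[1], "delete") == 0);
    int version = LDAP_VERSION3;
    struct berval cred;
    LDAP *ld = NULL;
    int rc, i;

    rc = ldap_initialize(&ld, URI);
    if (rc != LDAP_SUCCESS) {
        fprintf(stderr, "ldap_initialize: %s\n", ldap_err2string(rc));
        return 1;
    }
    ldap_set_option(ld, LDAP_OPT_PROTOCOL_VERSION, &version);

    cred.bv_val = (char *)BINDPW;
    cred.bv_len = strlen(BINDPW);
    rc = ldap_sasl_bind_s(ld, BINDDN, LDAP_SASL_SIMPLE, &cred, NULL, NULL, NULL);
    if (rc != LDAP_SUCCESS) {
        fprintf(stderr, "bind: %s\n", ldap_err2string(rc));
        return 1;
    }

    for (i = 0; i < NUM_ENTRIES; i++) {
        char dn[256], uid[64];
        snprintf(uid, sizeof(uid), "test%04d", i);
        snprintf(dn, sizeof(dn), "uid=%s,%s", uid, SUFFIX);

        if (delete_mode) {
            rc = ldap_delete_ext_s(ld, dn, NULL, NULL);
        } else {
            char *oc_vals[]   = { "top", "person", "organizationalPerson",
                                  "inetOrgPerson", NULL };
            char *name_vals[] = { uid, NULL };
            LDAPMod oc, cn, sn, uidm;
            LDAPMod *mods[] = { &oc, &cn, &sn, &uidm, NULL };

            oc.mod_op = LDAP_MOD_ADD;   oc.mod_type = "objectClass"; oc.mod_values = oc_vals;
            cn.mod_op = LDAP_MOD_ADD;   cn.mod_type = "cn";          cn.mod_values = name_vals;
            sn.mod_op = LDAP_MOD_ADD;   sn.mod_type = "sn";          sn.mod_values = name_vals;
            uidm.mod_op = LDAP_MOD_ADD; uidm.mod_type = "uid";       uidm.mod_values = name_vals;

            rc = ldap_add_ext_s(ld, dn, mods, NULL, NULL);
        }
        if (rc != LDAP_SUCCESS) {
            fprintf(stderr, "%s %s: %s\n", delete_mode ? "delete" : "add",
                    dn, ldap_err2string(rc));
        }
    }

    ldap_unbind_ext_s(ld, NULL, NULL);
    return 0;
}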
Verified - RHEL 4, redhat-ds-base-8.2.0-2010051204.el4dsrv

1. Set up 2-way MMR.
2. Set nsslapd-cachesize to 500 for both instances.
3. Restarted both nodes.
4. Added over 500 users and verified replication of the users.
5. Stopped instance 2 and deleted the users from instance 1.
6. Stopped instance 1 and started instance 2.
7. Deleted the users from instance 2.
8. Started instance 1.

No crash of instance 2 or instance 1, no errors in the logs.