Bug 576644
Summary: segfault while multimaster replication (paired node won't find deleted entries)

| Field | Value |
|---|---|
| Product: | [Retired] 389 |
| Component: | Database - General |
| Version: | 1.2.4 |
| Hardware: | x86_64 |
| OS: | Linux |
| Status: | CLOSED CURRENTRELEASE |
| Severity: | high |
| Priority: | high |
| Reporter: | Roberto Polli <rpolli> |
| Assignee: | Rich Megginson <rmeggins> |
| QA Contact: | Viktor Ashirov <vashirov> |
| CC: | andrey.ivanov, edewata, jgalipea, rmeggins |
| Doc Type: | Bug Fix |
| Bug Blocks: | 434914, 543590 |
| Last Closed: | 2015-12-07 17:05:13 UTC |
| Attachments: | 0001-Bug-576644-segfault-while-multimaster-replication.patch (attachment 406561) |
Description
Roberto Polli
2010-03-24 17:24:41 UTC
What do these entries look like? What sorts of operations are you doing?

Further information:

- We ran a stress test on 64-bit hardware with about 8 million entries.
- During the stress test we had a crash because the filesystem filled up; the issue arose after that event.
- We then dropped the databases, exported with db2ldif, and recreated them with ldif2db, but that did not fix it and the segfault appeared again.
- When we found a guilty entry we removed it, but after some hours the issue came back with another entry.
- We had to reconfigure the infrastructure as "master-slave" and disable multimaster; since then, no more segfaults.

Could be cache related: db cache size 2 GB, server RAM 32 GB.

Not sure if that is nsslapd-cachememsize or nsslapd-dbcachesize. If it is the latter, the entry cache could be set to the default value, which would mean a tremendous amount of churn on the entry cache as entries are constantly being moved in and out. It could be that this churn on the entry cache causes the problem.

Roberto, is it possible for you to re-test this using the latest version (1.2.6)? `yum --enablerepo=updates-testing install 389-ds` The code has changed quite significantly since 1.2.4, so the above stack trace is no longer valid. Do you do any search operations in your test? Thanks.

Hi Endi, we use EPEL x86_64, and the latest stable there is 1.2.5-1. http://download.fedora.redhat.com/pub/epel/5/x86_64/repoview/389-ds-base.html We may evaluate the possibility of upgrading to that release. The error arose during read/write access to the servers. Peace, R.

Created attachment 406561 [details]
0001-Bug-576644-segfault-while-multimaster-replication.patch
To ssh://git.fedorahosted.org/git/389/ds.git

92ca2bb..c15e10b Directory_Server_8_2_Branch -> Directory_Server_8_2_Branch
commit c15e10b50189d384436728be9ee17986225882c8
Author: Rich Megginson <rmeggins>
Date: Wed Apr 14 10:15:53 2010 -0600

c53b8b3..e50dceb master -> master
commit e50dceb45a2ddffe749b444fa057d93776f882c9
Author: Rich Megginson <rmeggins>
Date: Wed Apr 14 10:15:53 2010 -0600

Fixed by: edewata, nhosoi

Fix Description: The delete code stores the actual entry from the entry cache in the pblock as SLAPI_DELETE_BEPREOP_ENTRY so that the be preop plugins can have access to the entry. SLAPI_DELETE_BEPREOP_ENTRY is an alias for SLAPI_ENTRY_PRE_OP, which is used by the front-end delete code. When processing a replicated delete operation where the entry has already been deleted (converted to a tombstone), we need to restore the original entry in SLAPI_DELETE_BEPREOP_ENTRY so that the front-end code frees it as SLAPI_ENTRY_PRE_OP instead of freeing the actual entry from the cache. (An illustrative sketch of this pblock access pattern appears at the end of this report.)

Platforms tested: RHEL5 x86_64
Flag Day: no
Doc impact: no

Here are the steps to reproduce this bug consistently:

1. Create 2 DS nodes (e.g. node1 & node2).
2. Configure MMR on both nodes.
3. Open Directory->config->plugins->ldbm database->userRoot and set nsslapd-cachesize to 500 on both nodes.
4. Restart both nodes.
5. Create a replication agreement on node1.
6. Add 500 entries to node1 and let them replicate to node2.

Delete entries from node1 while replication is suspended:

7. Stop node2.
8. Remove the entries from node1.

Delete entries from node2 while replication is suspended:

9. Stop node1 first, then start node2.
10. Remove the entries from node2.

Resume replication:

11. Start node1 again. As soon as node1 tries to synchronize, node2 will crash.

After these steps node2 can be started again, but as soon as node1 tries to synchronize, node2 will crash again.

verified - RHEL 4, redhat-ds-base-8.2.0-2010051204.el4dsrv

1. Set up 2-way MMR.
2. Set nsslapd-cachesize to 500 for both instances.
3. Restarted both nodes.
4. Added over 500 users and verified replication of the users.
5. Stopped instance 2 and deleted the users from instance 1.
6. Stopped instance 1 and started instance 2.
7. Deleted the users from instance 2.
8. Started instance 1.

No crash of instance 2 or instance 1, no errors in the logs.
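As background for the fix description above: a backend preop plugin sees the entry being deleted by reading it out of the pblock. Below is a minimal sketch, assuming the standard 389 DS plugin API from slapi-plugin.h; the function name example_bepreop_delete and the log message are hypothetical illustrations of the access pattern, not part of the actual patch.

```c
/*
 * Minimal sketch (NOT the actual fix): how a be preop delete plugin can
 * access the entry that the delete code stored in the pblock under
 * SLAPI_DELETE_BEPREOP_ENTRY.  The function name and log text below are
 * hypothetical; only the pblock accessor usage is the standard SLAPI API.
 */
#include "slapi-plugin.h"

static int
example_bepreop_delete(Slapi_PBlock *pb)
{
    Slapi_Entry *e = NULL;

    /* The entry stored here by the delete code is the live copy from the
     * entry cache, not a private duplicate; per the fix description it
     * must not be freed as if it were a SLAPI_ENTRY_PRE_OP copy. */
    slapi_pblock_get(pb, SLAPI_DELETE_BEPREOP_ENTRY, &e);
    if (e == NULL) {
        return 0; /* nothing to inspect */
    }

    slapi_log_error(SLAPI_LOG_PLUGIN, "example-bepreop",
                    "about to delete entry: %s\n",
                    slapi_entry_get_dn_const(e));

    return 0; /* 0 tells the server to continue the operation */
}
```

The ownership subtlety the patch addresses is visible here: because SLAPI_DELETE_BEPREOP_ENTRY aliases SLAPI_ENTRY_PRE_OP, the front-end delete code would otherwise free the cached entry itself when tearing down the operation, which is what crashed the consumer on replicated deletes of already-tombstoned entries.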