Bug 1387340

Summary: Aborted operation can leave RUV in incorrect state
Product: Red Hat Enterprise Linux 7 Reporter: German Parente <gparente>
Component: 389-ds-baseAssignee: Noriko Hosoi <nhosoi>
Status: CLOSED ERRATA QA Contact: Viktor Ashirov <vashirov>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 7.4CC: brubisch, mreynolds, msauton, nkinder, rmeggins, tmihinto
Target Milestone: rcKeywords: ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 389-ds-base-1.3.6.1-3.el7 Doc Type: No Doc Update
Doc Text:
See the 7.3.z bug 1416368.
Story Points: ---
Clone Of:
: 1416368 (view as bug list) Environment:
Last Closed: 2017-08-01 21:12:24 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1396494, 1399979, 1416368    

Description German Parente 2016-10-20 16:29:53 UTC
Description of problem: this is a clone of ticket DS 49008

this is Ludwig's investigation and description of the issue:


if a plugin operation succeeded, but the operation itself fails and is aborted the RUV is in an incorrect state (rolled up to the succesful plugin op).

Here is test case:
enable memberof and add two members to a group, the second has no objectclasss allowing memberof, so the complete operation fails with err=65

=====================================
[13/Oct/2016:14:48:21 +0200] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 57ff838e000000640000 into pending list <<<< main MOD
[13/Oct/2016:14:48:21 +0200] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 57ff838e000100640000 into pending list <<<< succesful memberof
[13/Oct/2016:14:48:21 +0200] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 57ff838e000100640000
[13/Oct/2016:14:48:21 +0200] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 57ff838e000200640000 into pending list <<< failing memberof
[13/Oct/2016:14:48:21 +0200] NSMMReplicationPlugin -  csn=57ff838e000200640000 process postop: canceling operation csn
[13/Oct/2016:14:48:22 +0200] NSMMReplicationPlugin - conn=866 op=1 csn=57ff838e000000640000 process postop: canceling operation csn
[13/Oct/2016:14:48:22 +0200] NSMMReplicationPlugin - agmt="cn=to200" (localhost:4945): {replica 100 ldap://localhost:30522} 57f4cb95000000640000 57ff838e000100640000 57ff8295

nsds50ruv: {replicageneration} 57bf05e40000012c0000
nsds50ruv: {replica 100 ldap://localhost:30522} 57f4cb95000000640000 57ff838e000100640000
======================================

And we can see these messages in the errors log:



 This can be seen when having locker not able to resolve deadlocks. The messages that could show we are in presence of this bug are:

==============================
[17/Oct/2016:10:32:07 +0000] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=5804e0db000800030000) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
[17/Oct/2016:10:32:07 +0000] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (5804e0db000800030000); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[17/Oct/2016:10:32:07 +0000] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for <entry dn> (uniqid: 40d752a2-4fbb11e5-93a0d259-b00ed7d3, optype: 32) to changelog csn 5804e0db000800030000
[17/Oct/2016:10:32:07 +0000] - SLAPI_PLUGIN_BE_TXN_POST_DELETE_FN plugin returned error code but did not set SLAPI_RESULT_CODE
[17/Oct/2016:10:32:07 +0000] agmt="<agmt dn>" (<host>:389) - Can't locate CSN 5804e0dc000200060000 in the changelog (DB rc=-30988). If replication stops, the consumer may need to be reinitialized.
=========================================


Version-Release number of selected component (if applicable): will be reproduceable in RHEL7.3 version of 389-ds-base

For the moment, in release 33 + replication hotfix can be seen.


How reproducible: not easy. 


A possible workaround would be to set:

nsslapd-db-deadlock-policy to 6 

or, probably, to increase the nsslapd-db-locks since it seems to be a performance issue.

I will attach the three cases where we have seen this.

Comment 3 Noriko Hosoi 2016-11-07 19:31:26 UTC
*** Bug 1339693 has been marked as a duplicate of this bug. ***

Comment 7 Viktor Ashirov 2017-05-20 19:33:07 UTC
========================================================== test session starts ==========================================================
platform linux2 -- Python 2.7.5, pytest-3.0.7, py-1.4.33, pluggy-0.4.0 -- /usr/bin/python
cachedir: .cache
metadata: {'Python': '2.7.5', 'Platform': 'Linux-3.10.0-663.el7.x86_64-x86_64-with-redhat-7.4-Maipo', 'Packages': {'py': '1.4.33', 'pytest': '3.0.7', 'pluggy': '0.4.0'}, 'Plugins': {'beakerlib': '0.7.1', 'html': '1.14.2', 'cov': '2.5.1', 'metadata': '1.5.0'}}
DS build: 1.3.6.1
389-ds-base: 1.3.6.1-14.el7
nss: 3.28.4-6.el7
nspr: 4.13.1-1.0.el7_3
openldap: 2.4.44-4.el7
svrcore: 4.1.3-2.el7

rootdir: /export/tests, inifile:
plugins: metadata-1.5.0, html-1.14.2, cov-2.5.1, beakerlib-0.7.1
collected 1 items 

tickets/ticket49008_test.py::test_ticket49008 PASSED

======================================================= 1 passed in 74.82 seconds =======================================================

Marking as VERIFIED.

Comment 8 errata-xmlrpc 2017-08-01 21:12:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2086