Bug 975250 - Changelog deadlock replication failures with DNA
Summary: Changelog deadlock replication failures with DNA
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: 389-ds-base
Version: 6.4
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: rc
: ---
Assignee: Rich Megginson
QA Contact: Sankar Ramalingam
URL:
Whiteboard:
: 979168 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2013-06-17 22:30 UTC by Venkat Mahadevan
Modified: 2020-09-13 20:37 UTC (History)
4 users (show)

Fixed In Version: 389-ds-base-1.2.11.15-22.el6
Doc Type: Bug Fix
Doc Text:
Cause: Under certain conditions, with a mix of concurrent search and update and outgoing replication operations, there will be deadlocks in the changelog db, leading to error messages like this: NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (XXXXXXX); db error - -30994 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock This is caused by a deadlock between the changelog readers, writers, and main database writers. Consequence: Update operations will fail with the above error message in the directory server errors log. Fix: A new configuration parameter is introduced: dn: cn=config,cn=ldbm database,cn=plugins,cn=config nsslapd-db-deadlock-policy: 9 With the default policy 9 (DB_LOCK_YOUNGEST), the last locker gets killed when there is a deadlock. In the case that this is the changelog writer, the write will fail, and the entire update will fail. Users who frequently see the above errors in the errors log are advised to change this setting to 6 (DB_LOCK_MINWRITE) will which instead kill the locker that has the fewest write locks (that is, the changelog reader). The changelog reader code has been changed to handle this deadlock condition and retry. The setting can be changed like this: ldapmodify -x -D "cn=directory manager" -W <<EOF dn: cn=config,cn=ldbm database,cn=plugins,cn=config changetype: modify replace: nsslapd-db-deadlock-policy nsslapd-db-deadlock-policy: 6 EOF You may ask why the default is not changed to 6. The answer is that the setting will apply to _all_ threads, so that changing this setting could cause regular search requests to fail, if the directory server is under a heavy update load. In our testing, we did not see this happen, but we cannot guarantee that changing this value to 6 will not impact regular search requests. Result: After changing nsslapd-db-deadlock-policy to 6, updates will succeed and no longer cause errors like the above.
Clone Of:
Environment:
Last Closed: 2013-11-21 21:09:43 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Github 389ds 389-ds-base issues 747 None closed changelog db deadlocks with DNA and replication 2020-11-02 17:47:50 UTC
Red Hat Product Errata RHBA-2013:1653 normal SHIPPED_LIVE 389-ds-base bug fix update 2013-11-20 21:53:19 UTC

Description Venkat Mahadevan 2013-06-17 22:30:23 UTC
Description of problem:
Entry is added to the master server but fails to replicate due to a changelog error caused by a database deadlock.

Version-Release number of selected component (if applicable):
389-ds-base.x86_64            1.2.11.15-14.el6_4   rhel-x86_64-server-6

How reproducible:
Consistently

Steps to Reproduce:
1. Setup a multi-master replication environment (2 masters) and 3 consumers. Replication should be real-time i.e. directories always kept in sync.
2. Enable the DNA plugin on the master servers. Example config is as follows:

master server 1:
dn: cn=Posix IDs,cn=Distributed Numeric Assignment Plugin,cn=plugins,cn=config
changetype: add
objectClass: top
objectClass: extensibleObject
cn: Posix IDs
dnafilter: (|(objectclass=posixAccount)(objectClass=posixGroup))
dnamagicregen: 999
dnamaxvalue: 4294967295
dnanextvalue: 131073
dnascope: dc=dev,dc=id,dc=ubc,dc=ca
dnasharedcfgdn: cn=posix-ids,cn=dna,cn=plugins,cn=configuration,ou=ELDAP,ou=Services,dc=dev,dc=id,dc=ubc,dc=ca
dnathreshold: 1000
dnatype: uidNumber
dnatype: gidNumber

master server 2:
dn: cn=Posix IDs,cn=Distributed Numeric Assignment Plugin,cn=plugins,cn=config
changetype: add
objectClass: top
objectClass: extensibleObject
cn: Posix IDs
dnafilter: (|(objectclass=posixAccount)(objectClass=posixGroup))
dnamagicregen: 999
dnamaxvalue: 0
dnanextvalue: 0
dnascope: dc=dev,dc=id,dc=ubc,dc=ca
dnasharedcfgdn: cn=posix-ids,cn=dna,cn=plugins,cn=configuration,ou=ELDAP,ou=Services,dc=dev,dc=id,dc=ubc,dc=ca
dnathreshold: 1000
dnatype: uidNumber
dnatype: gidNumber

3. Setup a LDAP client (e.g. JMeter LDAP tester) to create and delete entries on one of the master servers. An example entry is:

objectClass	top
objectClass	person
objectClass	organizationalPerson
objectClass	inetOrgPerson
cn	Test Plan
sn	Plan
givenName	Test
userPassword	{SSHA}aBIV4atRWyMZqiWucSiZgYGVEw1bJa7V
objectClass	posixAccount
gidNumber	999
uidNumber	999
uid	${entrydn}
homeDirectory	/home/somedir

${entrydn} can just be a unique uid that is sequentially listed in a text file e.g. jmeter0, jmeter1, jmeter2, etc.

4. Setup the following sequence of actions to run in a loop from the LDAP client:

a. BIND
b. ADD entry.
c. DEL entry.
d. UNBIND.

5. After running for about 5-6 minutes, you should trigger the error. An example of the error log is given below.

Actual results:
1. Entry is added to master server and a return code of 0 is given to the client.
2. The changelog for the entry fails to write, so while the entry is successfully written to the master, it fails to replicate to the other servers in the environment.

Error log:

[17/Jun/2013:15:23:13 -0700] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication update vector for replica dc=id,dc=ubc,dc=ca: LDAP error - 51
[17/Jun/2013:15:23:15 -0700] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=51bf8c50004100010000) failed (rc=-30994 (DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
[17/Jun/2013:15:23:15 -0700] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (51bf8c50004100010000); db error - -30994 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[17/Jun/2013:15:23:15 -0700] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for uid=jmeter741,dc=id,dc=ubc,dc=ca (uniqid: 70b03020-d79c11e2-8c03dfeb-4acc1d05, optype: 16) to changelog csn 51bf8c50004100010000

Access log:

[17/Jun/2013:15:23:10 -0700] conn=119116 fd=90 slot=90 connection from 142.103.1.221 to 10.7.0.51
[17/Jun/2013:15:23:10 -0700] conn=119116 op=0 BIND dn="cn=Directory Manager" method=128 version=3
[17/Jun/2013:15:23:10 -0700] conn=119116 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=directory manager"
[17/Jun/2013:15:23:10 -0700] conn=119116 op=1 ADD dn="uid=jmeter741,dc=id,dc=ubc,dc=ca"
[17/Jun/2013:15:23:15 -0700] conn=119116 op=1 RESULT err=0 tag=105 nentries=0 etime=5 csn=51bf8c50004100010000
[17/Jun/2013:15:23:15 -0700] conn=119116 op=2 DEL dn="uid=jmeter741,dc=id,dc=ubc,dc=ca"
[17/Jun/2013:15:23:15 -0700] conn=119116 op=2 RESULT err=0 tag=107 nentries=0 etime=0 csn=51bf8c55001100010000
[17/Jun/2013:15:23:15 -0700] conn=119116 op=3 UNBIND
[17/Jun/2013:15:23:15 -0700] conn=119116 op=3 fd=90 closed - U1

Expected results:
1. Changelog for all entries is correctly written without deadlocking.
2. The change is replicated to all servers in the environment.

Additional info:
Please see the following discussion thread for more info:

https://lists.fedoraproject.org/pipermail/389-users/2013-June/016014.html

Comment 5 Rich Megginson 2013-06-28 20:48:53 UTC
Upstream ticket:
https://fedorahosted.org/389/ticket/47410

Comment 6 Nathan Kinder 2013-06-28 20:49:55 UTC
*** Bug 979168 has been marked as a duplicate of this bug. ***

Comment 8 Rich Megginson 2013-08-08 19:33:28 UTC
NOTE: doc text is the same as https://bugzilla.redhat.com/show_bug.cgi?id=979169

Comment 9 Sankar Ramalingam 2013-10-17 17:54:32 UTC
1). Configured 2 way MMR with the latest 389-ds-base.
2). Enabled DNA plugin and added DNA entries like this...for M1

dn: cn=Distributed Numeric Assignment Plugin,cn=plugins,cn=config
changetype: modify
replace: nsslapd-pluginEnabled
nsslapd-pluginEnabled: On

dn: cn=Posix IDs,cn=Distributed Numeric Assignment Plugin,cn=plugins,cn=config
changetype: add
objectClass: top
objectClass: extensibleObject
cn: Posix IDs
dnafilter: (|(objectclass=posixAccount)(objectClass=posixGroup))
dnamagicregen: 999
dnamaxvalue: 4294967295
dnanextvalue: 131073
dnascope: dc=passsync,dc=com
dnasharedcfgdn: cn=posix-ids,cn=dna,cn=plugins,cn=configuration,ou=ELDAP,ou=Services,dc=passsync,dc=com
dnathreshold: 1000
dnatype: uidNumber
dnatype: gidNumber

3).Added the following entry to M2...

dn: cn=Distributed Numeric Assignment Plugin,cn=plugins,cn=config
changetype: modify
replace: nsslapd-pluginEnabled
nsslapd-pluginEnabled: On

dn: cn=Posix IDs,cn=Distributed Numeric Assignment Plugin,cn=plugins,cn=config
changetype: add
objectClass: top
objectClass: extensibleObject
cn: Posix IDs
dnafilter: (|(objectclass=posixAccount)(objectClass=posixGroup))
dnamagicregen: 999
dnamaxvalue: 0
dnanextvalue: 0
dnascope: dc=passsync,dc=com
dnasharedcfgdn: cn=posix-ids,cn=dna,cn=plugins,cn=configuration,ou=ELDAP,ou=Services,dc=passsync,dc=com
dnathreshold: 1000
dnatype: uidNumber
dnatype: gidNumber

4). Restarted both M1 and M2.

5). Simultaneous add/delete of user entries to M1.

dn: uid=bug975250new7666,ou=people,dc=passsync,dc=com
telephoneNumber: 989898197666
mail: bug975250new7666@redhat.com">bug975250new7666@redhat.com
uid: bug975250new7666
givenName: bug975250new7666
objectClass: top
objectClass: person
objectClass: organizationalPerson
objectClass: inetorgperson
objectclass: posixAccount
sn: bug975250new7666
cn: bug975250new7666
homeDirectory: /home/bug975250new7666
loginShell: /bin/bash
userPassword: {SSHA}aBIV4atRWyMZqiWucSiZgYGVEw1bJa7V
uidNumber: 999
gidNumber: 999

6) No error messages encountered while running the tests. Hence, marking the bug as Verified.

Checked both M1 and M2 error logs.

Comment 10 errata-xmlrpc 2013-11-21 21:09:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1653.html


Note You need to log in before you can comment on or make changes to this bug.