Bug 975250

Summary:	Changelog deadlock replication failures with DNA
Product:	Red Hat Enterprise Linux 6	Reporter:	Venkat Mahadevan <venkmaha>
Component:	389-ds-base	Assignee:	Rich Megginson <rmeggins>
Status:	CLOSED ERRATA	QA Contact:	Sankar Ramalingam <sramling>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	6.4	CC:	jgalipea, nhosoi, nkinder, rmeggins
Target Milestone:	rc
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	389-ds-base-1.2.11.15-22.el6	Doc Type:	Bug Fix
Doc Text:	Cause: Under certain conditions, with a mix of concurrent search and update and outgoing replication operations, there will be deadlocks in the changelog db, leading to error messages like this: NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (XXXXXXX); db error - -30994 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock This is caused by a deadlock between the changelog readers, writers, and main database writers. Consequence: Update operations will fail with the above error message in the directory server errors log. Fix: A new configuration parameter is introduced: dn: cn=config,cn=ldbm database,cn=plugins,cn=config nsslapd-db-deadlock-policy: 9 With the default policy 9 (DB_LOCK_YOUNGEST), the last locker gets killed when there is a deadlock. In the case that this is the changelog writer, the write will fail, and the entire update will fail. Users who frequently see the above errors in the errors log are advised to change this setting to 6 (DB_LOCK_MINWRITE) will which instead kill the locker that has the fewest write locks (that is, the changelog reader). The changelog reader code has been changed to handle this deadlock condition and retry. The setting can be changed like this: ldapmodify -x -D "cn=directory manager" -W <<EOF dn: cn=config,cn=ldbm database,cn=plugins,cn=config changetype: modify replace: nsslapd-db-deadlock-policy nsslapd-db-deadlock-policy: 6 EOF You may ask why the default is not changed to 6. The answer is that the setting will apply to _all_ threads, so that changing this setting could cause regular search requests to fail, if the directory server is under a heavy update load. In our testing, we did not see this happen, but we cannot guarantee that changing this value to 6 will not impact regular search requests. Result: After changing nsslapd-db-deadlock-policy to 6, updates will succeed and no longer cause errors like the above.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2013-11-21 21:09:43 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Venkat Mahadevan 2013-06-17 22:30:23 UTC

Description of problem:
Entry is added to the master server but fails to replicate due to a changelog error caused by a database deadlock.

Version-Release number of selected component (if applicable):
389-ds-base.x86_64            1.2.11.15-14.el6_4   rhel-x86_64-server-6

How reproducible:
Consistently

Steps to Reproduce:
1. Setup a multi-master replication environment (2 masters) and 3 consumers. Replication should be real-time i.e. directories always kept in sync.
2. Enable the DNA plugin on the master servers. Example config is as follows:

master server 1:
dn: cn=Posix IDs,cn=Distributed Numeric Assignment Plugin,cn=plugins,cn=config
changetype: add
objectClass: top
objectClass: extensibleObject
cn: Posix IDs
dnafilter: (|(objectclass=posixAccount)(objectClass=posixGroup))
dnamagicregen: 999
dnamaxvalue: 4294967295
dnanextvalue: 131073
dnascope: dc=dev,dc=id,dc=ubc,dc=ca
dnasharedcfgdn: cn=posix-ids,cn=dna,cn=plugins,cn=configuration,ou=ELDAP,ou=Services,dc=dev,dc=id,dc=ubc,dc=ca
dnathreshold: 1000
dnatype: uidNumber
dnatype: gidNumber

master server 2:
dn: cn=Posix IDs,cn=Distributed Numeric Assignment Plugin,cn=plugins,cn=config
changetype: add
objectClass: top
objectClass: extensibleObject
cn: Posix IDs
dnafilter: (|(objectclass=posixAccount)(objectClass=posixGroup))
dnamagicregen: 999
dnamaxvalue: 0
dnanextvalue: 0
dnascope: dc=dev,dc=id,dc=ubc,dc=ca
dnasharedcfgdn: cn=posix-ids,cn=dna,cn=plugins,cn=configuration,ou=ELDAP,ou=Services,dc=dev,dc=id,dc=ubc,dc=ca
dnathreshold: 1000
dnatype: uidNumber
dnatype: gidNumber

3. Setup a LDAP client (e.g. JMeter LDAP tester) to create and delete entries on one of the master servers. An example entry is:

objectClass	top
objectClass	person
objectClass	organizationalPerson
objectClass	inetOrgPerson
cn	Test Plan
sn	Plan
givenName	Test
userPassword	{SSHA}aBIV4atRWyMZqiWucSiZgYGVEw1bJa7V
objectClass	posixAccount
gidNumber	999
uidNumber	999
uid	${entrydn}
homeDirectory	/home/somedir

${entrydn} can just be a unique uid that is sequentially listed in a text file e.g. jmeter0, jmeter1, jmeter2, etc.

4. Setup the following sequence of actions to run in a loop from the LDAP client:

a. BIND
b. ADD entry.
c. DEL entry.
d. UNBIND.

5. After running for about 5-6 minutes, you should trigger the error. An example of the error log is given below.

Actual results:
1. Entry is added to master server and a return code of 0 is given to the client.
2. The changelog for the entry fails to write, so while the entry is successfully written to the master, it fails to replicate to the other servers in the environment.

Error log:

[17/Jun/2013:15:23:13 -0700] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication update vector for replica dc=id,dc=ubc,dc=ca: LDAP error - 51
[17/Jun/2013:15:23:15 -0700] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=51bf8c50004100010000) failed (rc=-30994 (DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
[17/Jun/2013:15:23:15 -0700] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (51bf8c50004100010000); db error - -30994 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[17/Jun/2013:15:23:15 -0700] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for uid=jmeter741,dc=id,dc=ubc,dc=ca (uniqid: 70b03020-d79c11e2-8c03dfeb-4acc1d05, optype: 16) to changelog csn 51bf8c50004100010000

Access log:

[17/Jun/2013:15:23:10 -0700] conn=119116 fd=90 slot=90 connection from 142.103.1.221 to 10.7.0.51
[17/Jun/2013:15:23:10 -0700] conn=119116 op=0 BIND dn="cn=Directory Manager" method=128 version=3
[17/Jun/2013:15:23:10 -0700] conn=119116 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=directory manager"
[17/Jun/2013:15:23:10 -0700] conn=119116 op=1 ADD dn="uid=jmeter741,dc=id,dc=ubc,dc=ca"
[17/Jun/2013:15:23:15 -0700] conn=119116 op=1 RESULT err=0 tag=105 nentries=0 etime=5 csn=51bf8c50004100010000
[17/Jun/2013:15:23:15 -0700] conn=119116 op=2 DEL dn="uid=jmeter741,dc=id,dc=ubc,dc=ca"
[17/Jun/2013:15:23:15 -0700] conn=119116 op=2 RESULT err=0 tag=107 nentries=0 etime=0 csn=51bf8c55001100010000
[17/Jun/2013:15:23:15 -0700] conn=119116 op=3 UNBIND
[17/Jun/2013:15:23:15 -0700] conn=119116 op=3 fd=90 closed - U1

Expected results:
1. Changelog for all entries is correctly written without deadlocking.
2. The change is replicated to all servers in the environment.

Additional info:
Please see the following discussion thread for more info:

https://lists.fedoraproject.org/pipermail/389-users/2013-June/016014.html

Comment 5 Rich Megginson 2013-06-28 20:48:53 UTC

Upstream ticket:
https://fedorahosted.org/389/ticket/47410

Comment 6 Nathan Kinder 2013-06-28 20:49:55 UTC

*** Bug 979168 has been marked as a duplicate of this bug. ***

Comment 8 Rich Megginson 2013-08-08 19:33:28 UTC

NOTE: doc text is the same as https://bugzilla.redhat.com/show_bug.cgi?id=979169

Comment 9 Sankar Ramalingam 2013-10-17 17:54:32 UTC

1). Configured 2 way MMR with the latest 389-ds-base.
2). Enabled DNA plugin and added DNA entries like this...for M1

dn: cn=Distributed Numeric Assignment Plugin,cn=plugins,cn=config
changetype: modify
replace: nsslapd-pluginEnabled
nsslapd-pluginEnabled: On

dn: cn=Posix IDs,cn=Distributed Numeric Assignment Plugin,cn=plugins,cn=config
changetype: add
objectClass: top
objectClass: extensibleObject
cn: Posix IDs
dnafilter: (|(objectclass=posixAccount)(objectClass=posixGroup))
dnamagicregen: 999
dnamaxvalue: 4294967295
dnanextvalue: 131073
dnascope: dc=passsync,dc=com
dnasharedcfgdn: cn=posix-ids,cn=dna,cn=plugins,cn=configuration,ou=ELDAP,ou=Services,dc=passsync,dc=com
dnathreshold: 1000
dnatype: uidNumber
dnatype: gidNumber

3).Added the following entry to M2...

dn: cn=Distributed Numeric Assignment Plugin,cn=plugins,cn=config
changetype: modify
replace: nsslapd-pluginEnabled
nsslapd-pluginEnabled: On

dn: cn=Posix IDs,cn=Distributed Numeric Assignment Plugin,cn=plugins,cn=config
changetype: add
objectClass: top
objectClass: extensibleObject
cn: Posix IDs
dnafilter: (|(objectclass=posixAccount)(objectClass=posixGroup))
dnamagicregen: 999
dnamaxvalue: 0
dnanextvalue: 0
dnascope: dc=passsync,dc=com
dnasharedcfgdn: cn=posix-ids,cn=dna,cn=plugins,cn=configuration,ou=ELDAP,ou=Services,dc=passsync,dc=com
dnathreshold: 1000
dnatype: uidNumber
dnatype: gidNumber

4). Restarted both M1 and M2.

5). Simultaneous add/delete of user entries to M1.

dn: uid=bug975250new7666,ou=people,dc=passsync,dc=com
telephoneNumber: 989898197666
mail: bug975250new7666
uid: bug975250new7666
givenName: bug975250new7666
objectClass: top
objectClass: person
objectClass: organizationalPerson
objectClass: inetorgperson
objectclass: posixAccount
sn: bug975250new7666
cn: bug975250new7666
homeDirectory: /home/bug975250new7666
loginShell: /bin/bash
userPassword: {SSHA}aBIV4atRWyMZqiWucSiZgYGVEw1bJa7V
uidNumber: 999
gidNumber: 999

6) No error messages encountered while running the tests. Hence, marking the bug as Verified.

Checked both M1 and M2 error logs.

Comment 10 errata-xmlrpc 2013-11-21 21:09:43 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1653.html