Cause: Under certain conditions, a mix of concurrent search, update, and outgoing replication operations can deadlock in the changelog database, producing error messages like this:
NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (XXXXXXX); db error - -30994 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
This is caused by a deadlock between the changelog readers, writers, and main database writers.
Consequence: Update operations will fail with the above error message in the directory server errors log.
Fix: A new configuration parameter is introduced:
dn: cn=config,cn=ldbm database,cn=plugins,cn=config
nsslapd-db-deadlock-policy: 9
With the default policy, 9 (DB_LOCK_YOUNGEST), the youngest locker is killed when there is a deadlock. If that locker is the changelog writer, the write fails, and the entire update fails with it.
Users who frequently see the above errors in the errors log are advised to change this setting to 6 (DB_LOCK_MINWRITE), which will instead kill the locker that holds the fewest write locks (that is, the changelog reader). The changelog reader code has been changed to handle this deadlock condition and retry. The setting can be changed like this:
ldapmodify -x -D "cn=directory manager" -W <<EOF
dn: cn=config,cn=ldbm database,cn=plugins,cn=config
changetype: modify
replace: nsslapd-db-deadlock-policy
nsslapd-db-deadlock-policy: 6
EOF
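To confirm that the change took effect, the value can be read back from the same entry. The sketch below is not part of the original fix notes: `read_deadlock_policy` is a hypothetical helper name, and the `ldapsearch` invocation shown in the comment assumes the OpenLDAP client tools and the same bind DN as above.

```shell
# Hypothetical helper: print the nsslapd-db-deadlock-policy value found in
# LDIF read from standard input (attribute names are case-insensitive).
read_deadlock_policy() {
    awk -F': ' 'tolower($1) == "nsslapd-db-deadlock-policy" { print $2 }'
}

# Intended use against a live server (not run here):
#   ldapsearch -x -D "cn=directory manager" -W -LLL -s base \
#       -b "cn=config,cn=ldbm database,cn=plugins,cn=config" \
#       nsslapd-db-deadlock-policy | read_deadlock_policy
```

On a server where the modification above was applied, this should report 6. If it prints nothing, the attribute may not be present in the installed version.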
You may ask why the default is not changed to 6. The answer is that the setting will apply to _all_ threads, so that changing this setting could cause regular search requests to fail, if the directory server is under a heavy update load. In our testing, we did not see this happen, but we cannot guarantee that changing this value to 6 will not impact regular search requests.
Result: After changing nsslapd-db-deadlock-policy to 6, updates will succeed and no longer cause errors like the above.
Description Venkat Mahadevan
2013-06-17 22:30:23 UTC
Description of problem:
Entry is added to the master server but fails to replicate due to a changelog error caused by a database deadlock.
Version-Release number of selected component (if applicable):
389-ds-base.x86_64 1.2.11.15-14.el6_4 rhel-x86_64-server-6
How reproducible:
Consistently
Steps to Reproduce:
1. Set up a multi-master replication environment with 2 masters and 3 consumers. Replication should be real-time, i.e. the directories are always kept in sync.
2. Enable the DNA plugin on the master servers. Example config is as follows:
master server 1:
dn: cn=Posix IDs,cn=Distributed Numeric Assignment Plugin,cn=plugins,cn=config
changetype: add
objectClass: top
objectClass: extensibleObject
cn: Posix IDs
dnafilter: (|(objectclass=posixAccount)(objectClass=posixGroup))
dnamagicregen: 999
dnamaxvalue: 4294967295
dnanextvalue: 131073
dnascope: dc=dev,dc=id,dc=ubc,dc=ca
dnasharedcfgdn: cn=posix-ids,cn=dna,cn=plugins,cn=configuration,ou=ELDAP,ou=Services,dc=dev,dc=id,dc=ubc,dc=ca
dnathreshold: 1000
dnatype: uidNumber
dnatype: gidNumber
master server 2:
dn: cn=Posix IDs,cn=Distributed Numeric Assignment Plugin,cn=plugins,cn=config
changetype: add
objectClass: top
objectClass: extensibleObject
cn: Posix IDs
dnafilter: (|(objectclass=posixAccount)(objectClass=posixGroup))
dnamagicregen: 999
dnamaxvalue: 0
dnanextvalue: 0
dnascope: dc=dev,dc=id,dc=ubc,dc=ca
dnasharedcfgdn: cn=posix-ids,cn=dna,cn=plugins,cn=configuration,ou=ELDAP,ou=Services,dc=dev,dc=id,dc=ubc,dc=ca
dnathreshold: 1000
dnatype: uidNumber
dnatype: gidNumber
3. Set up an LDAP client (e.g. JMeter LDAP tester) to create and delete entries on one of the master servers. An example entry is:
objectClass top
objectClass person
objectClass organizationalPerson
objectClass inetOrgPerson
cn Test Plan
sn Plan
givenName Test
userPassword {SSHA}aBIV4atRWyMZqiWucSiZgYGVEw1bJa7V
objectClass posixAccount
gidNumber 999
uidNumber 999
uid ${entrydn}
homeDirectory /home/somedir
${entrydn} can just be a unique uid that is sequentially listed in a text file e.g. jmeter0, jmeter1, jmeter2, etc.
4. Set up the following sequence of actions to run in a loop from the LDAP client:
a. BIND
b. ADD entry.
c. DEL entry.
d. UNBIND.
5. After running for about 5-6 minutes, you should trigger the error. An example of the error log is given below.
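For readers without JMeter, the loop in step 4 can be approximated from a shell. This is a rough sketch, not the reporter's test plan: `make_test_entry` and `run_loop` are hypothetical names, `LDAP_URI` and `BIND_PW` are assumed environment variables, and the default suffix is taken from the DNs that appear in the logs. Because `ldapadd` and `ldapdelete` each open their own connection, every iteration performs BIND/ADD/UNBIND followed by BIND/DEL/UNBIND rather than a single bound session.

```shell
# Emit an LDIF add record for one test user, using the attributes from the
# example entry in step 3. The default suffix is an assumption taken from the logs.
make_test_entry() {
    entry_uid=$1
    entry_suffix=${2:-dc=id,dc=ubc,dc=ca}
    cat <<EOF
dn: uid=$entry_uid,$entry_suffix
objectClass: top
objectClass: person
objectClass: organizationalPerson
objectClass: inetOrgPerson
objectClass: posixAccount
cn: Test Plan
sn: Plan
givenName: Test
userPassword: {SSHA}aBIV4atRWyMZqiWucSiZgYGVEw1bJa7V
gidNumber: 999
uidNumber: 999
uid: $entry_uid
homeDirectory: /home/somedir
EOF
}

# Add and delete jmeter0 .. jmeter(N-1). Needs the OpenLDAP client tools and a
# reachable master; LDAP_URI and BIND_PW are assumed to be set by the caller.
run_loop() {
    count=$1
    i=0
    while [ "$i" -lt "$count" ]; do
        make_test_entry "jmeter$i" |
            ldapadd -x -H "$LDAP_URI" -D "cn=Directory Manager" -w "$BIND_PW"
        ldapdelete -x -H "$LDAP_URI" -D "cn=Directory Manager" -w "$BIND_PW" \
            "uid=jmeter$i,dc=id,dc=ubc,dc=ca"
        i=$((i + 1))
    done
}
```

A call such as `run_loop 1000` would then exercise one master. How long it takes to hit the deadlock will depend on the environment.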
Actual results:
1. Entry is added to master server and a return code of 0 is given to the client.
2. The changelog for the entry fails to write, so while the entry is successfully written to the master, it fails to replicate to the other servers in the environment.
Error log:
[17/Jun/2013:15:23:13 -0700] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication update vector for replica dc=id,dc=ubc,dc=ca: LDAP error - 51
[17/Jun/2013:15:23:15 -0700] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=51bf8c50004100010000) failed (rc=-30994 (DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
[17/Jun/2013:15:23:15 -0700] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (51bf8c50004100010000); db error - -30994 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[17/Jun/2013:15:23:15 -0700] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for uid=jmeter741,dc=id,dc=ubc,dc=ca (uniqid: 70b03020-d79c11e2-8c03dfeb-4acc1d05, optype: 16) to changelog csn 51bf8c50004100010000
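Since the client receives a success code, the errors log is the only place these lost changes show up. As a small sketch (the function name `failed_changelog_csns` is hypothetical, and the pattern is based solely on the message format shown above), the CSNs of changes that never reached the changelog can be pulled out like this:

```shell
# Print the CSN of every change whose changelog write was abandoned,
# reading a directory server errors log from standard input.
# Retry messages are ignored; only the final "failed to write entry" line matches.
failed_changelog_csns() {
    sed -n 's/.*_cl5WriteOperationTxn: failed to write entry with csn (\([^)]*\)).*/\1/p'
}
```

The function reads the errors log on standard input, for example `failed_changelog_csns < errors`, where `errors` stands for the path to the instance's errors log.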
Access log:
[17/Jun/2013:15:23:10 -0700] conn=119116 fd=90 slot=90 connection from 142.103.1.221 to 10.7.0.51
[17/Jun/2013:15:23:10 -0700] conn=119116 op=0 BIND dn="cn=Directory Manager" method=128 version=3
[17/Jun/2013:15:23:10 -0700] conn=119116 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=directory manager"
[17/Jun/2013:15:23:10 -0700] conn=119116 op=1 ADD dn="uid=jmeter741,dc=id,dc=ubc,dc=ca"
[17/Jun/2013:15:23:15 -0700] conn=119116 op=1 RESULT err=0 tag=105 nentries=0 etime=5 csn=51bf8c50004100010000
[17/Jun/2013:15:23:15 -0700] conn=119116 op=2 DEL dn="uid=jmeter741,dc=id,dc=ubc,dc=ca"
[17/Jun/2013:15:23:15 -0700] conn=119116 op=2 RESULT err=0 tag=107 nentries=0 etime=0 csn=51bf8c55001100010000
[17/Jun/2013:15:23:15 -0700] conn=119116 op=3 UNBIND
[17/Jun/2013:15:23:15 -0700] conn=119116 op=3 fd=90 closed - U1
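The two logs are tied together by the CSN: the ADD that returned err=0 above carries the same csn that the errors log reports as failed. A hedged sketch of that lookup follows. `result_for_csn` is a hypothetical name, and the pattern assumes the access log format shown here.

```shell
# Print "conn=N op=M err=E" for each RESULT line in the access log (stdin)
# that carries the given CSN.
result_for_csn() {
    grep -F "csn=$1" |
        sed -n 's/.*\(conn=[0-9]*\) \(op=[0-9]*\) RESULT \(err=[0-9]*\).*/\1 \2 \3/p'
}
```

Feeding it a CSN taken from the errors log shows which client operation was acknowledged as successful even though its change was never written to the changelog.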
Expected results:
1. Changelog for all entries is correctly written without deadlocking.
2. The change is replicated to all servers in the environment.
Additional info:
Please see the following discussion thread for more info:
https://lists.fedoraproject.org/pipermail/389-users/2013-June/016014.html
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHBA-2013-1653.html