Bug 907985
| Summary: | DNA plugin acceptance tests failed with "DB_LOCK_DEADLOCK" error. | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Sankar Ramalingam <sramling> |
| Component: | 389-ds-base | Assignee: | Rich Megginson <rmeggins> |
| Status: | CLOSED ERRATA | QA Contact: | Sankar Ramalingam <sramling> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 6.4 | CC: | ddas, jgalipea, jwest, nhosoi, nkinder, shaines, venkmaha |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | 389-ds-base-1.2.11.15-14.el6_4 | Doc Type: | Bug Fix |
| Doc Text: |
The DNA (distributed numeric assignment) plug-in, under certain conditions, could log error messages with the "DB_LOCK_DEADLOCK" error code when attempting to create an entry with an uidNumber attribute. Now, DNA handles this case properly and errors no longer occur during attempts to create entries with uidNumber attributes.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2013-11-21 21:03:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 911467 | ||
dna_015: Indeed the add operation (cn=Posix User13) is not logged in the changelog. Note: replicated add (cn=Posix User12) from s1 and its verification are both successfully finished by then. So, it looks the Add (cn=Posix User13) is facing the deadlock by itself or with some internal MMR activity?
It might change the timing, but to gather more information, we may want to enable the replication log level with the namePipeLogging?
s2/access
[05/Feb/2013:00:28:42 -0500] conn=17 fd=66 slot=66 connection from 10.16.65.106 to 10.16.65.106
[05/Feb/2013:00:28:42 -0500] conn=17 op=0 BIND dn="cn=directory manager" method=128 version=3
[05/Feb/2013:00:28:42 -0500] conn=17 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=directory manager"
[05/Feb/2013:00:28:42 -0500] conn=17 op=1 ADD dn="cn=Posix User13,ou=People,o=dna.net"
[05/Feb/2013:00:28:47 -0500] conn=17 op=1 RESULT err=0 tag=105 nentries=0 etime=5 csn=5110988b000000020000
[05/Feb/2013:00:28:47 -0500] conn=17 op=2 UNBIND
[05/Feb/2013:00:28:47 -0500] conn=17 op=2 fd=66 closed - U1
s2/errors
[05/Feb/2013:00:26:48 -0500] - slapd started. Listening on All Interfaces port 24204 for LDAP requests
[05/Feb/2013:00:28:46 -0500] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=5110988b000000020000) failed (rc=-30994 (DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
[05/Feb/2013:00:28:46 -0500] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (5110988b000000020000); db error - -30994 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[05/Feb/2013:00:28:46 -0500] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for cn=Posix User13,ou=People,o=dna.net (uniqid: c6eea203-6f5411e2-a861fe36-46df9f7b, optype: 16) to changelog csn 5110988b000000020000
s2/changelog
dbid: 5110988b000100020000
replgen: 1360042122 Tue Feb 5 00:28:42 2013
csn: 5110988b000100020000
uniqueid: c6eea201-6f5411e2-a861fe36-46df9f7b
dn: dnaHostname=dell-pe2800-01.rhts.eng.bos.redhat.com+dnaPortNum=24204,cn=Account UIDs,ou=Ranges,o=d
na.net
operation: modify
dnaRemainingValues: 7
modifiersname: cn=Distributed Numeric Assignment Plugin,cn=plugins,cn=config
modifytimestamp: 20130205052842Z
dbid: 51109895000000010000
replgen: 1360042132 Tue Feb 5 00:28:52 2013
csn: 51109895000000010000
uniqueid: e5ed3401-6f5411e2-aa4cafbf-707e4001
parentuniqueid: 2bf5c803-6f5411e2-ac9df9ea-03a514b3
dn: cn=Posix User510,ou=People,o=dna.net
operation: add
cn: Posix User510
uid: puser510
uidNumber: 510
gidNumber: 501
homeDirectory: /export/puser510
objectClass: top
objectClass: posixAccount
creatorsName: cn=directory manager
modifiersName: cn=directory manager
createTimestamp: 20130205052852Z
modifyTimestamp: 20130205052852Z
nsUniqueId: e5ed3401-6f5411e2-aa4cafbf-707e4001
parentid: 4
entryid: 29
entrydn: cn=posix user510,ou=people,o=dna.net
During the ADD operation, is there no other client activity? Not even a search? What about internal activity? Are there any other internal operations or housekeeping operations going on during the ADD? Something reading the changelog at the same time as the ADD operation is trying to write to the changelog, that held the page the ADD operation was trying to grab, long enough to cause the deadlock and retry failure? (In reply to comment #3) > During the ADD operation, is there no other client activity? Not even a > search? I don't see any other activities in the access log... (even a search) > What about internal activity? Are there any other internal > operations or housekeeping operations going on during the ADD? Something > reading the changelog at the same time as the ADD operation is trying to > write to the changelog, that held the page the ADD operation was trying to > grab, long enough to cause the deadlock and retry failure? I'm also expecting that -- internal activities accessing the changelog. I asked Sankar to enable the replication log level to reveal it. But it could change the timing and hide the deadlock, thus I suggested to use named pipe. I repeatedly ran the dna test on my VM, but I cannot reproduce the deadlock in the changelog, most likely because it only has one CPU. (Sankar's test machine has 4, I think) Sankar, could you please try running DNA acceptance test with this diff?
Index: dna_local_lib.sh
===================================================================
--- dna_local_lib.sh (revision 7361)
+++ dna_local_lib.sh (work copy)
@@ -27,6 +27,9 @@
RC=1
fi
+ # make sure the configuration updates is stored in the db.
+ sleep 35
+
message "====== Stopping Server 1 ====="
$S1/stop-slapd
@@ -39,8 +42,6 @@
message "===== Starting Server 2 ====="
$S2/start-slapd
- sleep 30
-
return $RC
}
@@ -333,6 +334,9 @@
message "DNA configuration deleted from server 2."
fi
+ # make sure the configuration updates is stored in the db.
+ sleep 35
+
return $result
}
delete_SharedCfg()
@@ -363,6 +367,9 @@
return 0
fi
+ # make sure the configuration updates is stored in the db.
+ sleep 35
+
# rm -rf $TET_TMP_DIR/shared.ldif
# return $result
}
Correction: Please keep this "sleep 30", too. @@ -39,8 +42,6 @@ message "===== Starting Server 2 =====" $S2/start-slapd - sleep 30 - return $RC } To include this bug in RHEL-6.4.z release, setting rhel-6.5.0 flag... 6.4.z proposed. Regression. Deadlock issue with existing feature. (In reply to comment #6) > Correction: Please keep this "sleep 30", too. > @@ -39,8 +42,6 @@ > message "===== Starting Server 2 =====" > $S2/start-slapd > > - sleep 30 I didn't observe any change in the test results with the above sleep commands. quickinstall startup 100% (2/2) DNA startup 100% (1/1) DNA run 11% (7/60) > - > return $RC > } Hello, I just wanted to inquire as to whether this bug has been fixed or not in: 389-ds-base.x86_64 1.2.11.15-14.el6_4 @rhel-x86_64-server-6 I am currently experiencing this frequently when using the DNA plugin to assign uidNumbers to posix account entries. I can reproduce it within a few minutes of adding an entry. If I do not use the plugin, I don't experience the bug. The posix account will correctly add to the master server, but will not replicate to any of the replicas. A sample of the errors (consistent with the original reports above): [12/Jun/2013:11:43:24 -0700] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=51b8c148001e02be0000) failed (rc=-30994 (DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock)) [12/Jun/2013:11:43:24 -0700] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (51b8c148001e02be0000); db error - -30994 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock [12/Jun/2013:11:43:24 -0700] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for uid=jmeter429,dc=tst,dc=id,dc=ubc,dc=ca (uniqid: e62c908c-d38f11e2-96fdeacd-f14f05d6, optype: 16) to changelog csn 51b8c148001e02be0000 [12/Jun/2013:11:43:36 -0700] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=51b8c154004002be0000) failed (rc=-30994 (DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock)) [12/Jun/2013:11:43:36 -0700] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for uid=jmeter797,dc=tst,dc=id,dc=ubc,dc=ca (uniqid: e62c9143-d38f11e2-96fdeacd-f14f05d6, optype: 16) to changelog csn 51b8c154004002be0000 I'm just a bit confused as to the status of this bug because the last comment from abienven says it has been fixed, but I am still experiencing the issue with the latest packages from RHEL 6.4. Apologies if I am missing something obvious. Thanks! Yes, this issue was fixed in (In reply to Venkat Mahadevan from comment #11) > Hello, > > I just wanted to inquire as to whether this bug has been fixed or not in: > > 389-ds-base.x86_64 1.2.11.15-14.el6_4 @rhel-x86_64-server-6 > > I'm just a bit confused as to the status of this bug because the last > comment from abienven says it has been fixed, but I am still > experiencing the issue with the latest packages from RHEL 6.4. Apologies if > I am missing something obvious. Thanks! Yes, this fix for this bug is in 1.2.11.15-14.el6_4. Perhaps you are hitting a similar issue. Have you opened up up a support case for this? (In reply to Nathan Kinder from comment #12) > Yes, this fix for this bug is in 1.2.11.15-14.el6_4. Perhaps you are > hitting a similar issue. Have you opened up up a support case for this? Hi Nathan, Thank you for your reply. I have filed a support case with Redhat just today but haven't heard back yet, hopefully soon :-) I am running the version (1.2.11.15-14.el6_4) that should have this bug fixed based on the advisory, but I am running into it frequently so perhaps it still exists in that version as well. The bug does not occur if I don't use the auto-generated uidNumber via the DNA plugin. I am using a JMeter loop test to reproduce the error. cheers, VM > I am using a JMeter loop test to reproduce the error.
Hello Venkat,
Sorry, I'm not familiar with "JMeter loop test". Could you give us more details how we can get and use the test to reproduce the problem in house?
Also, could it be possible for us to see your configuration file (dse.ldif), access log and error log?
Thank you,
--noriko
Hi Noriko, No problem. The JMeter loop test I am running is just a simple LDAP server stress test using JMeter (http://jmeter.apache.org/). Essentially what it does is: 1. Binds to a LDAP server (MMR master). 2. Adds an entry as follows: objectClass top objectClass person objectClass organizationalPerson objectClass inetOrgPerson cn Test Plan sn Plan givenName Test userPassword {SSHA}aBIV4atRWyMZqiWucSiZgYGVEw1bJa7V objectClass posixAccount gidNumber 999 uidNumber 999 uid ${entrydn} homeDirectory /home/somedir The ${entrydn} is just a variable that is read from a text file with a list of uids such as jmeter1, jmeter2, jmeter3, etc. 999 is a magic number that tells the DNA plugin to assign an auto-generated uid/gid number to the entry. 3. The entry is then deleted. 4. The test then restarts at 1. Essentially, this is a test to stress the LDAP server with ADD and DEL operations and to verify if replication is working (we have 2 MMR masters and 3 consumers). I have noticed at times when errors such as the one below occur on a master, replication of an entry to the consumers will fail: [12/Jun/2013:11:43:36 -0700] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=51b8c154004002be0000) failed (rc=-30994 (DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock)) [12/Jun/2013:11:43:36 -0700] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for uid=jmeter797,dc=tst,dc=id,dc=ubc,dc=ca (uniqid: e62c9143-d38f11e2-96fdeacd-f14f05d6, optype: 16) to changelog csn 51b8c154004002be0000 I also repeated the same test, but in my second case, I did not use gidNumber or uidNumber and removed the POSIX attributes; in this case I am assuming the DNA plugin is not triggered and the errors like above do not occur when running the test. I'll send you a message with the XML test files for JMeter. Thanks! cheers, VM Automated acceptance for DNA 100% PASS. Hence, marking the bug as Verified.
############## Result for backend test : DNA run
DNA run elapse time : 00:19:41
DNA run Tests PASS : 100% (71/71)
Verified against build - 389-ds-base-1.2.11.15-24
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1653.html |
Description of problem: DNA plugin acceptance tests failed to create an entry with uidNumber attribute. This failure is observed with dna_015, dna_026, dna_030, dna_031, dna_032 and dna_039. However, this failure is not very consistent with the DNA acceptance tests. The only test which fails every time is dna_039. Version-Release number of selected component (if applicable): 389-ds-base-1.2.11.15-11 How reproducible: Not consistent. Steps to Reproduce: 1. Run DNA acceptance tests with the latest 389-ds-base builds. 2. The number of failed tests would vary from 1 to 15. The test dna_039 fails consistently. 3. The test fails to create users on s2 with uidNumber attribute which is above dnaMaxValue. - DNA plugin on s2 is configured to have dnaMaxvalue: 1015 - DNA plugin on s1 is configured to have dnaMaxvalue: 1025 - User(Posix_User11) on s1 is created with 1021 as uidNumber. Actual results: Posix_User11 failed to sync to s2. Few error messages observed in the error logs: [05/Feb/2013:00:26:48 -0500] - slapd started. Listening on All Interfaces port 24204 for LDAP requests [05/Feb/2013:00:28:46 -0500] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=5110988b000000020000) failed (rc=-30994 (DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock)) [05/Feb/2013:00:28:46 -0500] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (5110988b000000020000); db error - -30994 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock [05/Feb/2013:00:28:46 -0500] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for cn=Posix User13,ou=People,o=dna.net (uniqid: c6eea203-6f5411e2-a861fe36-46df9f7b, optype: 16) to changelog csn 5110988b000000020000 Expected results: DNA plugin is expected work correctly.