Bug 907985

Summary: DNA plugin acceptance tests failed with "DB_LOCK_DEADLOCK" error.
Product: Red Hat Enterprise Linux 6 Reporter: Sankar Ramalingam <sramling>
Component: 389-ds-baseAssignee: Rich Megginson <rmeggins>
Status: CLOSED ERRATA QA Contact: Sankar Ramalingam <sramling>
Severity: high Docs Contact:
Priority: urgent    
Version: 6.4CC: ddas, jgalipea, jwest, nhosoi, nkinder, shaines, venkmaha
Target Milestone: rcKeywords: ZStream
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: 389-ds-base-1.2.11.15-14.el6_4 Doc Type: Bug Fix
Doc Text:
The DNA (distributed numeric assignment) plug-in, under certain conditions, could log error messages with the "DB_LOCK_DEADLOCK" error code when attempting to create an entry with an uidNumber attribute. Now, DNA handles this case properly and errors no longer occur during attempts to create entries with uidNumber attributes.
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-11-21 21:03:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 911467    

Description Sankar Ramalingam 2013-02-05 15:31:40 UTC
Description of problem: 
        DNA plugin acceptance tests failed to create an entry with uidNumber attribute. This failure is observed with dna_015, dna_026, dna_030, dna_031, dna_032 and dna_039. However, this failure is not very consistent with the DNA acceptance tests. The only test which fails every time is dna_039.

Version-Release number of selected component (if applicable): 389-ds-base-1.2.11.15-11

How reproducible: Not consistent.


Steps to Reproduce:
1. Run DNA acceptance tests with the latest 389-ds-base builds.
2. The number of failed tests would vary from 1 to 15. The test dna_039 fails consistently.
3. The test fails to create users on s2 with uidNumber attribute which is above dnaMaxValue.

- DNA plugin on s2 is configured to have dnaMaxvalue: 1015
- DNA plugin on s1 is configured to have dnaMaxvalue: 1025
- User(Posix_User11) on s1 is created with 1021 as uidNumber.

  
Actual results: Posix_User11 failed to sync to s2.

Few error messages observed in the error logs:

[05/Feb/2013:00:26:48 -0500] - slapd started.  Listening on All Interfaces port 24204 for LDAP requests
[05/Feb/2013:00:28:46 -0500] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=5110988b000000020000) failed (rc=-30994 (DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
[05/Feb/2013:00:28:46 -0500] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (5110988b000000020000); db error - -30994 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[05/Feb/2013:00:28:46 -0500] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for cn=Posix User13,ou=People,o=dna.net (uniqid: c6eea203-6f5411e2-a861fe36-46df9f7b, optype: 16) to changelog csn 5110988b000000020000

Expected results:

DNA plugin is expected work correctly.

Comment 2 Noriko Hosoi 2013-02-05 19:35:31 UTC
dna_015: Indeed the add operation (cn=Posix User13) is not logged in the changelog.  Note: replicated add (cn=Posix User12) from s1 and its verification are both successfully finished by then.  So, it looks the Add (cn=Posix User13) is facing the deadlock by itself or with some internal MMR activity?

It might change the timing, but to gather more information, we may want to enable the replication log level with the namePipeLogging?

s2/access
[05/Feb/2013:00:28:42 -0500] conn=17 fd=66 slot=66 connection from 10.16.65.106 to 10.16.65.106
[05/Feb/2013:00:28:42 -0500] conn=17 op=0 BIND dn="cn=directory manager" method=128 version=3
[05/Feb/2013:00:28:42 -0500] conn=17 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=directory manager"
[05/Feb/2013:00:28:42 -0500] conn=17 op=1 ADD dn="cn=Posix User13,ou=People,o=dna.net"
[05/Feb/2013:00:28:47 -0500] conn=17 op=1 RESULT err=0 tag=105 nentries=0 etime=5 csn=5110988b000000020000
[05/Feb/2013:00:28:47 -0500] conn=17 op=2 UNBIND
[05/Feb/2013:00:28:47 -0500] conn=17 op=2 fd=66 closed - U1

s2/errors
[05/Feb/2013:00:26:48 -0500] - slapd started.  Listening on All Interfaces port 24204 for LDAP requests
[05/Feb/2013:00:28:46 -0500] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=5110988b000000020000) failed (rc=-30994 (DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
[05/Feb/2013:00:28:46 -0500] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (5110988b000000020000); db error - -30994 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[05/Feb/2013:00:28:46 -0500] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for cn=Posix User13,ou=People,o=dna.net (uniqid: c6eea203-6f5411e2-a861fe36-46df9f7b, optype: 16) to changelog csn 5110988b000000020000

s2/changelog
dbid: 5110988b000100020000
        replgen: 1360042122 Tue Feb  5 00:28:42 2013
        csn: 5110988b000100020000
        uniqueid: c6eea201-6f5411e2-a861fe36-46df9f7b
        dn: dnaHostname=dell-pe2800-01.rhts.eng.bos.redhat.com+dnaPortNum=24204,cn=Account UIDs,ou=Ranges,o=d
na.net
        operation: modify
                dnaRemainingValues: 7
                modifiersname: cn=Distributed Numeric Assignment Plugin,cn=plugins,cn=config
                modifytimestamp: 20130205052842Z

dbid: 51109895000000010000
        replgen: 1360042132 Tue Feb  5 00:28:52 2013
        csn: 51109895000000010000
        uniqueid: e5ed3401-6f5411e2-aa4cafbf-707e4001
        parentuniqueid: 2bf5c803-6f5411e2-ac9df9ea-03a514b3
        dn: cn=Posix User510,ou=People,o=dna.net
        operation: add
                cn: Posix User510
                uid: puser510
                uidNumber: 510
                gidNumber: 501
                homeDirectory: /export/puser510
                objectClass: top
                objectClass: posixAccount
                creatorsName: cn=directory manager
                modifiersName: cn=directory manager
                createTimestamp: 20130205052852Z
                modifyTimestamp: 20130205052852Z
                nsUniqueId: e5ed3401-6f5411e2-aa4cafbf-707e4001
                parentid: 4
                entryid: 29
                entrydn: cn=posix user510,ou=people,o=dna.net

Comment 3 Rich Megginson 2013-02-05 19:44:11 UTC
During the ADD operation, is there no other client activity?  Not even a search?  What about internal activity?  Are there any other internal operations or housekeeping operations going on during the ADD?  Something reading the changelog at the same time as the ADD operation is trying to write to the changelog, that held the page the ADD operation was trying to grab, long enough to cause the deadlock and retry failure?

Comment 4 Noriko Hosoi 2013-02-05 19:50:02 UTC
(In reply to comment #3)
> During the ADD operation, is there no other client activity?  Not even a
> search?  
I don't see any other activities in the access log... (even a search)

> What about internal activity?  Are there any other internal
> operations or housekeeping operations going on during the ADD?  Something
> reading the changelog at the same time as the ADD operation is trying to
> write to the changelog, that held the page the ADD operation was trying to
> grab, long enough to cause the deadlock and retry failure?
I'm also expecting that -- internal activities accessing the changelog.  I asked Sankar to enable the replication log level to reveal it.  But it could change the timing and hide the deadlock, thus I suggested to use named pipe.

I repeatedly ran the dna test on my VM, but I cannot reproduce the deadlock in the changelog, most likely because it only has one CPU. (Sankar's test machine has 4, I think)

Comment 5 Noriko Hosoi 2013-02-06 18:33:54 UTC
Sankar, could you please try running DNA acceptance test with this diff?

Index: dna_local_lib.sh
===================================================================
--- dna_local_lib.sh	(revision 7361)
+++ dna_local_lib.sh	(work copy)
@@ -27,6 +27,9 @@
     RC=1
   fi
 
+  # make sure the configuration updates is stored in the db.
+  sleep 35
+
   message "====== Stopping Server 1 ====="
   $S1/stop-slapd
 
@@ -39,8 +42,6 @@
   message "===== Starting Server 2 ====="
   $S2/start-slapd
 
-  sleep 30 
-
   return $RC
 }
 
@@ -333,6 +334,9 @@
      message "DNA configuration deleted from server 2."
    fi
 
+   # make sure the configuration updates is stored in the db.
+   sleep 35
+
    return $result
 }
 delete_SharedCfg()
@@ -363,6 +367,9 @@
 	 return 0
    fi
    
+   # make sure the configuration updates is stored in the db.
+   sleep 35
+
 #   rm -rf $TET_TMP_DIR/shared.ldif
 #   return $result
 }

Comment 6 Noriko Hosoi 2013-02-06 19:22:08 UTC
Correction: Please keep this "sleep 30", too.
@@ -39,8 +42,6 @@
   message "===== Starting Server 2 ====="
   $S2/start-slapd
 
-  sleep 30 
-
   return $RC
 }

Comment 7 Noriko Hosoi 2013-02-08 18:34:20 UTC
To include this bug in RHEL-6.4.z release, setting rhel-6.5.0 flag...

Comment 8 Scott Haines 2013-02-08 18:54:16 UTC
6.4.z proposed.  Regression. Deadlock issue with existing feature.

Comment 9 Sankar Ramalingam 2013-02-11 19:13:04 UTC
(In reply to comment #6)
> Correction: Please keep this "sleep 30", too.
> @@ -39,8 +42,6 @@
>    message "===== Starting Server 2 ====="
>    $S2/start-slapd
>  
> -  sleep 30 

I didn't observe any change in the test results with the above sleep commands.

quickinstall startup 	100% (2/2) 	  	 
DNA startup 	100% (1/1) 	  	 
DNA run 	  	11% (7/60)

> -
>    return $RC
>  }

Comment 11 Venkat Mahadevan 2013-06-12 21:36:13 UTC
Hello,

I just wanted to inquire as to whether this bug has been fixed or not in:

389-ds-base.x86_64   1.2.11.15-14.el6_4 @rhel-x86_64-server-6

I am currently experiencing this frequently when using the DNA plugin to assign uidNumbers to posix account entries. I can reproduce it within a few minutes of adding an entry. If I do not use the plugin, I don't experience the bug. The posix account will correctly add to the master server, but will not replicate to any of the replicas. A sample of the errors (consistent with the original reports above): 

[12/Jun/2013:11:43:24 -0700] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=51b8c148001e02be0000) failed (rc=-30994 (DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
[12/Jun/2013:11:43:24 -0700] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (51b8c148001e02be0000); db error - -30994 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[12/Jun/2013:11:43:24 -0700] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for uid=jmeter429,dc=tst,dc=id,dc=ubc,dc=ca (uniqid: e62c908c-d38f11e2-96fdeacd-f14f05d6, optype: 16) to changelog csn 51b8c148001e02be0000
[12/Jun/2013:11:43:36 -0700] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=51b8c154004002be0000) failed (rc=-30994 (DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
[12/Jun/2013:11:43:36 -0700] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for uid=jmeter797,dc=tst,dc=id,dc=ubc,dc=ca (uniqid: e62c9143-d38f11e2-96fdeacd-f14f05d6, optype: 16) to changelog csn 51b8c154004002be0000

I'm just a bit confused as to the status of this bug because the last comment from abienven says it has been fixed, but I am still experiencing the issue with the latest packages from RHEL 6.4. Apologies if I am missing something obvious. Thanks!

Comment 12 Nathan Kinder 2013-06-12 21:59:08 UTC
Yes, this issue was fixed in (In reply to Venkat Mahadevan from comment #11)
> Hello,
> 
> I just wanted to inquire as to whether this bug has been fixed or not in:
> 
> 389-ds-base.x86_64   1.2.11.15-14.el6_4 @rhel-x86_64-server-6
> 
> I'm just a bit confused as to the status of this bug because the last
> comment from abienven says it has been fixed, but I am still
> experiencing the issue with the latest packages from RHEL 6.4. Apologies if
> I am missing something obvious. Thanks!

Yes, this fix for this bug is in 1.2.11.15-14.el6_4.  Perhaps you are hitting a similar issue.  Have you opened up up a support case for this?

Comment 13 Venkat Mahadevan 2013-06-12 23:37:43 UTC
(In reply to Nathan Kinder from comment #12)

> Yes, this fix for this bug is in 1.2.11.15-14.el6_4.  Perhaps you are
> hitting a similar issue.  Have you opened up up a support case for this?

Hi Nathan,

Thank you for your reply. I have filed a support case with Redhat just today but haven't heard back yet, hopefully soon :-)
 
I am running the version (1.2.11.15-14.el6_4) that should have this bug fixed based on the advisory, but I am running into it frequently so perhaps it still exists in that version as well. 
The bug does not occur if I don't use the auto-generated uidNumber via the DNA plugin. I am using a JMeter loop test to reproduce the error. 

cheers,

VM

Comment 14 Noriko Hosoi 2013-06-13 01:15:48 UTC
> I am using a JMeter loop test to reproduce the error.

Hello Venkat,

Sorry, I'm not familiar with "JMeter loop test".  Could you give us more details how we can get and use the test to reproduce the problem in house?

Also, could it be possible for us to see your configuration file (dse.ldif), access log and error log?

Thank you,
--noriko

Comment 15 Venkat Mahadevan 2013-06-13 18:00:03 UTC
Hi Noriko,

No problem. The JMeter loop test I am running is just a simple LDAP server stress test using JMeter (http://jmeter.apache.org/). Essentially what it does is:

1. Binds to a LDAP server (MMR master).
2. Adds an entry as follows:

objectClass	top
objectClass	person
objectClass	organizationalPerson
objectClass	inetOrgPerson
cn	Test Plan
sn	Plan
givenName	Test
userPassword	{SSHA}aBIV4atRWyMZqiWucSiZgYGVEw1bJa7V
objectClass	posixAccount
gidNumber	999
uidNumber	999
uid	${entrydn}
homeDirectory	/home/somedir

The ${entrydn} is just a variable that is read from a text file with a list of uids such as jmeter1, jmeter2, jmeter3, etc. 999 is a magic number that tells the DNA plugin to assign an auto-generated uid/gid number to the entry.

3. The entry is then deleted.
4. The test then restarts at 1.

Essentially, this is a test to stress the LDAP server with ADD and DEL operations and to verify if replication is working (we have 2 MMR masters and 3 consumers). I have noticed at times when errors such as the one below occur on a master, replication of an entry to the consumers will fail:

[12/Jun/2013:11:43:36 -0700] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=51b8c154004002be0000) failed (rc=-30994 (DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
[12/Jun/2013:11:43:36 -0700] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for uid=jmeter797,dc=tst,dc=id,dc=ubc,dc=ca (uniqid: e62c9143-d38f11e2-96fdeacd-f14f05d6, optype: 16) to changelog csn 51b8c154004002be0000

I also repeated the same test, but in my second case, I did not use gidNumber or uidNumber and removed the POSIX attributes; in this case I am assuming the DNA plugin is not triggered and the errors like above do not occur when running the test. I'll send you a message with the XML test files for JMeter. Thanks!

cheers,

VM

Comment 18 Sankar Ramalingam 2013-09-18 11:20:00 UTC
Automated acceptance for DNA 100% PASS. Hence, marking the bug as Verified.

############## Result  for  backend test :   DNA run
    DNA run elapse time : 00:19:41
    DNA run Tests PASS      : 100% (71/71)

Verified against build - 389-ds-base-1.2.11.15-24

Comment 19 errata-xmlrpc 2013-11-21 21:03:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1653.html