Description of problem:

This bug is a "flavor" of https://fedorahosted.org/389/ticket/47788 ("Supplier can skip a failing update, although it should retry").

I have a customer who reproduces this behavior very often. The scenario is two nodes in replication on which one million ADDs and one million DELs take place. Very often, some deletes are not replicated. This happens in pairs: when two deletes are done simultaneously on different entries, one on each node, the replicated operation fails after retrying 50 times in the transaction backend.

Note that both nodes are deleting entries at the same time, and on each node both the client application and the replication user are deleting entries. The issue occurs when two DELs "cross" each other. The transaction backend must be locked by one of them, which causes the replicated operation to fail after 50 retries, and the same happens on each node.

Here is an extract of the access and error logs:

NODE1:

errors:
[20/Dec/2015:05:19:10 -0500] - Retry count exceeded in delete

access:
[20/Dec/2015:05:19:08 -0500] conn=2263844 op=577550 DEL dn="uid=p6k4n3522802,ou=nsPeople,o=nscorp.com,c=US" ==> this DEL fails (it's a replicated op; the bind DN of this conn is the repl. user)
[20/Dec/2015:05:19:09 -0500] conn=2755125 op=1 DEL dn="uid=p6k4n3522763,ou=nsPeople,o=nscorp.com,c=US" ==> this DEL succeeds
[20/Dec/2015:05:19:10 -0500] conn=2263844 op=577550 RESULT err=51 tag=107 nentries=0 etime=2 csn=5676809a001127120000 ==> failing DEL
[20/Dec/2015:05:19:10 -0500] conn=2263844 op=577551 DEL dn="uid=p6k4n3522800,ou=nsPeople,o=nscorp.com,c=US"
[20/Dec/2015:05:19:11 -0500] conn=2755125 op=1 RESULT err=0 tag=107 nentries=0 etime=2 csn=5676809f000027110000 ==> successful DEL

At the same time, on NODE2:

[20/Dec/2015:05:19:08 -0500] conn=2263844 op=577550 DEL dn="uid=p6k4n3522802,ou=nsPeople,o=nscorp.com,c=US" =====> this DEL fails (it's a replicated one from the other node, with the repl. user bind)
[20/Dec/2015:05:19:09 -0500] conn=2755125 op=1 DEL dn="uid=p6k4n3522763,ou=nsPeople,o=nscorp.com,c=US" =====> this DEL succeeds but is not replicated. It's the one failing on the other node.
[20/Dec/2015:05:19:10 -0500] conn=2263844 op=577550 RESULT err=51 tag=107 nentries=0 etime=2 csn=5676809a001127120000 ===> failing RES
[20/Dec/2015:05:19:10 -0500] conn=2263844 op=577551 DEL dn="uid=p6k4n3522800,ou=nsPeople,o=nscorp.com,c=US"
[20/Dec/2015:05:19:11 -0500] conn=2755125 op=1 RESULT err=0 tag=107 nentries=0 etime=2 csn=5676809f000027110000 ==> RES of successful DEL

It seems that, for some reason, the replicated operation that fails with err=51 is not reported to the master, which "thinks" everything is all right and therefore never retries it. As shown above, both DELs are "crossing each other".

Version-Release number of selected component (if applicable):

The customer is reproducing this in 389-ds-base-1.2.11.15-68.el6_7.x86_64
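To pull such failures out of a large access log without reading it by eye, something like the following may help. It is only a sketch in Python, not part of the product or of the customer's tooling: it assumes the access log lines have the shape shown in the extract above (a DEL line and a later RESULT line sharing the same conn= and op= values), and the function name failed_deletes and both regular expressions are made up for illustration.

```python
import re

# Assumed line shapes, taken from the log extract in this report:
#   [...] conn=2263844 op=577550 DEL dn="uid=..."
#   [...] conn=2263844 op=577550 RESULT err=51 tag=107 ... csn=5676809a001127120000
DEL_RE = re.compile(r'conn=(\d+) op=(\d+) DEL dn="([^"]*)"')
RESULT_RE = re.compile(r'conn=(\d+) op=(\d+) RESULT err=(\d+)\b.*?\bcsn=(\w+)')


def failed_deletes(lines, err="51"):
    """Pair each DEL with its RESULT by (conn, op) and return the failures.

    Returns a list of (conn, op, dn, csn) tuples for DEL operations whose
    RESULT carries the given error code. Hypothetical helper, for illustration.
    """
    pending = {}   # (conn, op) -> dn of a DEL still waiting for its RESULT
    failures = []
    for line in lines:
        m = DEL_RE.search(line)
        if m:
            pending[(m.group(1), m.group(2))] = m.group(3)
            continue
        m = RESULT_RE.search(line)
        if m:
            key = (m.group(1), m.group(2))
            dn = pending.pop(key, None)
            # Only report RESULT lines that belong to a DEL we have seen.
            if dn is not None and m.group(3) == err:
                failures.append((key[0], key[1], dn, m.group(4)))
    return failures
```

Passing an open file object for the access log as lines should work, since the function only iterates over it once. RESULT lines without a csn= field are ignored by this sketch.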
Fixed upstream.

Verification steps:
[1] Set up MMR
[2] Add 1 million entries to each replica (total of 2 million entries)
[3] On each replica, delete the 1 million entries that were just added (total of 2 million deletes)
[4] Check the access log for error 51. If the error is found, see if the CSN from that failed operation is replayed shortly after the failure.
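Step [4] can be tedious to do by hand on a log with millions of operations. The Python sketch below is one possible way to automate it, under two assumptions that are mine and not confirmed anywhere in this bug: that a replayed operation appears in the same access log as a RESULT line with the same csn= value, and that err=0 on such a line means the replay succeeded. It does not measure how "shortly" the replay follows the failure. The name unreplayed_csns is hypothetical.

```python
import re

# Assumed RESULT line shape, as seen in the access log extract of this bug:
#   [...] conn=N op=N RESULT err=51 tag=107 nentries=0 etime=2 csn=<csn>
RESULT_RE = re.compile(r'RESULT err=(\d+)\b.*?\bcsn=(\w+)')


def unreplayed_csns(lines):
    """Return CSNs that failed with err=51 and were not seen succeeding later.

    Assumption: a later RESULT with the same csn and err=0 counts as a replay.
    """
    failed = {}  # csn -> the log line of its most recent err=51 failure
    for line in lines:
        m = RESULT_RE.search(line)
        if not m:
            continue
        err, csn = m.group(1), m.group(2)
        if err == "51":
            failed[csn] = line.rstrip("\n")
        elif err == "0":
            # A success after the failure clears the CSN from the list.
            failed.pop(csn, None)
    return list(failed)
```

An empty list means either that no err=51 was logged or that every failed CSN was later applied successfully; a non-empty list points at CSNs worth checking by hand on both replicas.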
Build tested: 389-ds-base-1.2.11.15-74.el6.x86_64

Verification steps:

1) Set up MMR
master1 - 389 - dc=example,dc=com
master2 - 390 - dc=example,dc=com

2) Add 1 million entries to each replica (total of 2 million entries)
ldclt -h localhost -p 389 -D "cn=Directory Manager" -w Secret123 -f cn=MrXXXXXX -b "ou=people,dc=example,dc=com" -e add,person,incr,noloop,commoncounter -r0 -R999999
ldclt -h localhost -p 390 -D "cn=Directory Manager" -w Secret123 -f cn=MrsXXXXXX -b "ou=people,dc=example,dc=com" -e add,person,incr,noloop,commoncounter -r0 -R999999

3) On each replica, delete the 1 million entries that were just added (total of 2 million deletes)
ldclt -h localhost -p 389 -D "cn=Directory Manager" -w Secret123 -b "ou=people,dc=example,dc=com" -e delete -f cn=MrXXXXXX -e incr,noloop,commoncounter -r0 -R999999 -I 32
ldclt -h localhost -p 390 -D "cn=Directory Manager" -w Secret123 -b "ou=people,dc=example,dc=com" -e delete -f cn=MrsXXXXXX -e incr,noloop,commoncounter -r0 -R999999 -I 32

4) Check the access log for error 51. If the error is found, see if the CSN from that failed operation is replayed shortly after the failure.
grep "err=51" /var/log/dirsrv/slapd-master1/access
echo $?
1
grep "err=51" /var/log/dirsrv/slapd-master2/access
echo $?
1

Marking as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0737.html