Bug 1294770

Summary:	Supplier can skip a failing update, although it should retry.
Product:	Red Hat Enterprise Linux 6	Reporter:	German Parente <gparente>
Component:	389-ds-base	Assignee:	Noriko Hosoi <nhosoi>
Status:	CLOSED ERRATA	QA Contact:	Viktor Ashirov <vashirov>
Severity:	unspecified	Docs Contact:	Petr Bokoc <pbokoc>
Priority:	unspecified
Version:	6.8	CC:	jgalipea, mreynolds, nkinder, pbokoc, rmeggins, spichugi, tbordaz, tmihinto
Target Milestone:	rc
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	389-ds-base-1.2.11.15-73.el6	Doc Type:	Bug Fix
Doc Text:	Replication failures no longer result in missing changes after additional updates. Previously, if a replicated update failed on the consumer side, it was never retried due to a bug in the replication asynchronous result thread which caused it to miss the failure before another update was replicated successfully. The second update also updated the consumer Replica Update Vector (RUV), and the first (failed) update was lost. In this release, replication failures cause the connection to close, stopping the replication session and preventing any subsequent updates from updating the consumer RUV, which allows the supplier to retry the operation in the next replication session. No updates are therefore lost.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-05-10 19:22:35 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description German Parente 2015-12-30 08:55:21 UTC

Description of problem:

This bug is a variant of

https://fedorahosted.org/389/ticket/47788

Supplier can skip a failing update, although it should retry.

I have a customer who can reproduce this behavior very often.

The scenario is two nodes in replication, where one million ADDs and one million DELs take place.

The result is that, not always but very often, deletes are not replicated.

This happens in pairs. That is, when two deletes are performed simultaneously on different entries, one on each node, the replicated operation fails after retrying 50 times in the transaction backend. Note that both nodes are deleting entries at the same time, and on each node both the client application and the replication user are deleting entries. The issue arises when two DELs "cross" each other. The transaction backend must be locked by one of them, which causes the replicated operation to fail after 50 retries, and the same happens on each node.

Here is an extract of the access and error logs:

NODE1:

errors:
[20/Dec/2015:05:19:10 -0500] - Retry count exceeded in delete

access:
[20/Dec/2015:05:19:08 -0500] conn=2263844 op=577550 DEL dn="uid=p6k4n3522802,ou=nsPeople,o=nscorp.com,c=US" ==> this DEL fails (it's a replicated op, binddn of this conn is repl. user).
[20/Dec/2015:05:19:09 -0500] conn=2755125 op=1 DEL dn="uid=p6k4n3522763,ou=nsPeople,o=nscorp.com,c=US"  ==> this DEL succeeds.
[20/Dec/2015:05:19:10 -0500] conn=2263844 op=577550 RESULT err=51 tag=107 nentries=0 etime=2 csn=5676809a001127120000          ==> failing DEL
[20/Dec/2015:05:19:10 -0500] conn=2263844 op=577551 DEL dn="uid=p6k4n3522800,ou=nsPeople,o=nscorp.com,c=US"
[20/Dec/2015:05:19:11 -0500] conn=2755125 op=1 RESULT err=0 tag=107 nentries=0 etime=2 csn=5676809f000027110000                    ==> successful DEL



At the same time, in NODE2:


[20/Dec/2015:05:19:08 -0500] conn=2263844 op=577550 DEL dn="uid=p6k4n3522802,ou=nsPeople,o=nscorp.com,c=US"       =====> this DEL fails (it's a replicated one from the other node with repl. user bind)

[20/Dec/2015:05:19:09 -0500] conn=2755125 op=1 DEL dn="uid=p6k4n3522763,ou=nsPeople,o=nscorp.com,c=US"       =====> this DEL succeeds but it's not replicated. It's the one failing in the other node.
[20/Dec/2015:05:19:10 -0500] conn=2263844 op=577550 RESULT err=51 tag=107 nentries=0 etime=2 csn=5676809a001127120000               ===> failing RES


[20/Dec/2015:05:19:10 -0500] conn=2263844 op=577551 DEL dn="uid=p6k4n3522800,ou=nsPeople,o=nscorp.com,c=US"
[20/Dec/2015:05:19:11 -0500] conn=2755125 op=1 RESULT err=0 tag=107 nentries=0 etime=2 csn=5676809f000027110000         ==> RES of successful DEL

It seems that, for some reason, the failure of the replicated operation (err=51) is not reported to the master, which "thinks" everything is all right and therefore never retries it.

As we can see, both DELs are "crossing each other".
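
To find such pairs in a larger access log, the RESULT lines with err=51 can be joined back to their request lines through the conn= and op= values. The following is only a sketch: the function name failed_dels is made up for illustration, and it assumes the access log layout shown in the extracts above (request line with DEL dn="...", RESULT line with err= and csn=). Connection numbers restart from zero when the server restarts, so it should be run on a log covering a single server run.

```shell
#!/bin/sh
# failed_dels (hypothetical helper): print conn, op, csn and dn for every
# DEL whose RESULT line reports err=51.
# Reads the access log files given as arguments, or stdin if none are given.
failed_dels() {
    awk '
    {
        # Collect the key=value tokens of interest from the current line.
        conn = ""; op = ""; err = ""; csn = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^conn=/) conn = $i
            else if ($i ~ /^op=/) op = $i
            else if ($i ~ /^err=/) err = $i
            else if ($i ~ /^csn=/) csn = $i
        }
        key = conn " " op
    }
    / DEL dn=/ {
        # Remember the quoted DN of the request, indexed by conn and op.
        if (match($0, /dn="[^"]*"/)) request[key] = substr($0, RSTART, RLENGTH)
        next
    }
    / RESULT / && err == "err=51" {
        print key, csn, request[key]
    }
    ' "$@"
}
```

For example, failed_dels /var/log/dirsrv/slapd-master1/access would list the failing operations for the instance path used later in this report; the instance name depends on the deployment.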

Version-Release number of selected component (if applicable): customer is reproducing this in 389-ds-base-1.2.11.15-68.el6_7.x86_64

Comment 3 mreynolds 2016-01-18 00:43:29 UTC

Fixed upstream.

Verification steps:

[1]  Set up MMR
[2]  Add 1 million entries to each replica (total of 2 million entries)
[3]  On each replica delete the 1 million entries that were just added (total of 2 million deletes)
[4]  Check the access log for error 51.  If the error is found, check whether the CSN from that failed operation is replayed shortly after the failure.
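
Step [4] could be automated roughly as follows. This is a sketch, not part of the fix: the function name check_replay is hypothetical, and it assumes that a replayed update shows up in the consumer access log as a later RESULT line carrying the same csn= value with err=0.

```shell
#!/bin/sh
# check_replay (hypothetical helper): for every CSN that failed with err=51,
# report whether a later RESULT line with the same CSN returned err=0.
# Reads the access log files given as arguments, or stdin if none are given.
# Exit status is 1 if at least one failed CSN was never seen succeeding.
check_replay() {
    awk '
    / RESULT / {
        err = ""; csn = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^err=/) err = $i
            else if ($i ~ /^csn=/) csn = $i
        }
        if (csn == "") next
        if (err == "err=51") {
            # Keep first-seen order so the report follows the log.
            if (!(csn in state)) order[++n] = csn
            state[csn] = "NOT REPLAYED"
        } else if (err == "err=0" && (csn in state)) {
            state[csn] = "replayed"
        }
    }
    END {
        missing = 0
        for (j = 1; j <= n; j++) {
            print order[j], state[order[j]]
            if (state[order[j]] != "replayed") missing = 1
        }
        exit missing
    }
    ' "$@"
}
```

Called as check_replay on the access log of each replica, it prints nothing and returns 0 when no err=51 is present at all, which is also a passing outcome for this test.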

Comment 14 Simon Pichugin 2016-03-23 10:04:17 UTC

Build tested:
389-ds-base-1.2.11.15-74.el6.x86_64

Verification steps:
1) Set up MMR
master1 - 389 - dc=example,dc=com
master2 - 390 - dc=example,dc=com

2) Add 1 million entries to each replica (total of 2 million entries)
ldclt -h localhost -p 389 -D "cn=Directory Manager" -w Secret123 -f cn=MrXXXXXX -b "ou=people,dc=example,dc=com" -e add,person,incr,noloop,commoncounter -r0 -R999999

ldclt -h localhost -p 390 -D "cn=Directory Manager" -w Secret123 -f cn=MrsXXXXXX -b "ou=people,dc=example,dc=com" -e add,person,incr,noloop,commoncounter -r0 -R999999

3) On each replica delete the 1 million entries that were just added (total of 2 million deletes)
ldclt -h localhost -p 389 -D "cn=Directory Manager" -w Secret123 -b "ou=people,dc=example,dc=com" -e delete -f cn=MrXXXXXX -e incr,noloop,commoncounter -r0 -R999999  -I 32

ldclt -h localhost -p 390 -D "cn=Directory Manager" -w Secret123 -b "ou=people,dc=example,dc=com" -e delete -f cn=MrsXXXXXX -e incr,noloop,commoncounter -r0 -R999999  -I 32

4) Check the access log for error 51.  If the error is found, see if that CSN from that failed operation is replayed shortly after the failure
grep "err=51" /var/log/dirsrv/slapd-master1/access
echo $?
1

grep "err=51" /var/log/dirsrv/slapd-master2/access
echo $?
1

Marking as verified.

Comment 16 errata-xmlrpc 2016-05-10 19:22:35 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0737.html