Bug 1294770 - Supplier can skip a failing update, although it should retry.
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: 389-ds-base
Version: 6.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Noriko Hosoi
QA Contact: Viktor Ashirov
Docs Contact: Petr Bokoc
Reported: 2015-12-30 03:55 EST by German Parente
Modified: 2017-01-02 07:59 EST (History)
CC: 8 users

Fixed In Version: 389-ds-base-1.2.11.15-73.el6
Doc Type: Bug Fix
Doc Text:
Replication failures no longer result in missing changes after additional updates

Previously, if a replicated update failed on the consumer side, it was never retried, due to a bug in the replication asynchronous result thread which caused it to miss the failure before another update was replicated successfully. The second update also updated the consumer Replica Update Vector (RUV), and the first (failed) update was lost. In this release, replication failures cause the connection to close, stopping the replication session and preventing any subsequent updates from updating the consumer RUV, which allows the supplier to retry the operation in the next replication session. No updates are therefore lost.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-05-10 15:22:35 EDT
Type: Bug


Attachments: None
Description German Parente 2015-12-30 03:55:21 EST
Description of problem:

This bug is just a "flavor" of

https://fedorahosted.org/389/ticket/47788

Supplier can skip a failing update, although it should retry.

I have a customer (IHAC) who manages to reproduce this behavior very often.

The scenario is two nodes in replication where one million ADDs and one million DELs take place.

The result is that "sometimes", but very often, DELs are not replicated.

This happens in pairs. That is to say, when two deletes are done simultaneously on different entries on each node, the replicated operation fails after retrying 50 times in the transaction backend. Note that both nodes are deleting entries at the same time, and on each node both the client application and the replication user are deleting entries. The issue arises when two DELs "cross" each other: the transaction backend must be locked by one of them, which causes the replicated operation to fail after 50 retries, and the same happens on each node.

Here is an extract of the access and error logs:

NODE1:

errors:
[20/Dec/2015:05:19:10 -0500] - Retry count exceeded in delete

access:
[20/Dec/2015:05:19:08 -0500] conn=2263844 op=577550 DEL dn="uid=p6k4n3522802,ou=nsPeople,o=nscorp.com,c=US" ==> this DEL fails (it's a replicated op, binddn of this conn is repl. user).
[20/Dec/2015:05:19:09 -0500] conn=2755125 op=1 DEL dn="uid=p6k4n3522763,ou=nsPeople,o=nscorp.com,c=US"  ==> this DEL succeeds.
[20/Dec/2015:05:19:10 -0500] conn=2263844 op=577550 RESULT err=51 tag=107 nentries=0 etime=2 csn=5676809a001127120000          ==> failing DEL
[20/Dec/2015:05:19:10 -0500] conn=2263844 op=577551 DEL dn="uid=p6k4n3522800,ou=nsPeople,o=nscorp.com,c=US"
[20/Dec/2015:05:19:11 -0500] conn=2755125 op=1 RESULT err=0 tag=107 nentries=0 etime=2 csn=5676809f000027110000                    ==> successful DEL



At the same time, in NODE2:


[20/Dec/2015:05:19:08 -0500] conn=2263844 op=577550 DEL dn="uid=p6k4n3522802,ou=nsPeople,o=nscorp.com,c=US"   ==> this DEL fails (it's a replicated one from the other node, bound as the replication user)
[20/Dec/2015:05:19:09 -0500] conn=2755125 op=1 DEL dn="uid=p6k4n3522763,ou=nsPeople,o=nscorp.com,c=US"   ==> this DEL succeeds but is not replicated; it's the one failing on the other node
[20/Dec/2015:05:19:10 -0500] conn=2263844 op=577550 RESULT err=51 tag=107 nentries=0 etime=2 csn=5676809a001127120000   ==> RESULT of the failing DEL
[20/Dec/2015:05:19:10 -0500] conn=2263844 op=577551 DEL dn="uid=p6k4n3522800,ou=nsPeople,o=nscorp.com,c=US"
[20/Dec/2015:05:19:11 -0500] conn=2755125 op=1 RESULT err=0 tag=107 nentries=0 etime=2 csn=5676809f000027110000   ==> RESULT of the successful DEL

It seems that, for some reason, the failure of the replicated operation with err=51 is not reported back to the master, which "thinks" everything is all right and therefore never retries it.

As we can see, both DELs are "crossing" each other.
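The failure mode explained in the Doc Text above (a later successful update advancing the consumer RUV past a missed failure) can be illustrated with a toy sketch. This is a minimal model, not 389-ds-base code; the RUV is reduced to a single integer, the highest CSN applied so far:

```shell
# Toy model of the bug, NOT actual 389-ds-base code: the consumer RUV is
# modeled here as a single integer, the highest CSN successfully applied.
ruv=0

apply() {
  # apply <csn> <ok|fail>: only a successful update advances the RUV.
  if [ "$2" = "ok" ]; then
    ruv=$1
  fi
}

# Old behavior: the asynchronous result thread misses the failure of
# CSN 10, so the session continues and CSN 11 is applied successfully.
apply 10 fail
apply 11 ok
echo "consumer RUV: $ruv"
# The RUV is now 11, so the supplier believes everything up to CSN 11
# was delivered and never resends CSN 10: the update is lost.
```

With the fix, the failure closes the connection before CSN 11 can be applied, so the RUV never moves past the failed CSN and the supplier retries it in the next replication session.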

Version-Release number of selected component (if applicable): the customer is reproducing this in 389-ds-base-1.2.11.15-68.el6_7.x86_64
Comment 3 mreynolds 2016-01-17 19:43:29 EST
Fixed upstream.

Verification steps:

[1]  Set up MMR
[2]  Add 1 million entries to each replica (total of 2 million entries)
[3]  On each replica, delete the 1 million entries that were just added (total of 2 million deletes)
[4]  Check the access log for error 51. If the error is found, check whether the CSN from the failed operation is replayed shortly after the failure.
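Step [4] can be sketched as a shell check. This is a hedged example that uses a small inline sample log (taken from the excerpts in the description) so it is self-contained; on a live system, point LOG at the real access log, e.g. /var/log/dirsrv/slapd-master1/access:

```shell
# Sketch of step [4]: for every err=51 RESULT, extract its CSN and check
# whether the same CSN appears again later in the log (i.e. was replayed).
# A two-line sample log is created here so the script runs standalone.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
[20/Dec/2015:05:19:10 -0500] conn=2263844 op=577550 RESULT err=51 tag=107 nentries=0 etime=2 csn=5676809a001127120000
[20/Dec/2015:05:19:12 -0500] conn=2263845 op=12 RESULT err=0 tag=107 nentries=0 etime=1 csn=5676809a001127120000
EOF

REPORT=$(grep "err=51" "$LOG" | grep -o "csn=[0-9a-f]*" | sort -u |
  while read -r csn; do
    # A CSN that occurs more than once was replayed after the failure.
    if [ "$(grep -c "$csn" "$LOG")" -gt 1 ]; then
      echo "$csn replayed"
    else
      echo "$csn NOT replayed"
    fi
  done)
echo "$REPORT"
rm -f "$LOG"
```

Any CSN reported as "NOT replayed" would indicate the original bug: a failed update that the supplier never retried.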
Comment 14 Simon Pichugin 2016-03-23 06:04:17 EDT
Build tested:
389-ds-base-1.2.11.15-74.el6.x86_64

Verification steps:
1) Set up MMR
master1 - 389 - dc=example,dc=com
master2 - 390 - dc=example,dc=com

2) Add 1 million entries to each replica (total of 2 million entries)
ldclt -h localhost -p 389 -D "cn=Directory Manager" -w Secret123 -f cn=MrXXXXXX -b "ou=people,dc=example,dc=com" -e add,person,incr,noloop,commoncounter -r0 -R999999

ldclt -h localhost -p 390 -D "cn=Directory Manager" -w Secret123 -f cn=MrsXXXXXX -b "ou=people,dc=example,dc=com" -e add,person,incr,noloop,commoncounter -r0 -R999999

3) On each replica, delete the 1 million entries that were just added (total of 2 million deletes)
ldclt -h localhost -p 389 -D "cn=Directory Manager" -w Secret123 -b "ou=people,dc=example,dc=com" -e delete -f cn=MrXXXXXX -e incr,noloop,commoncounter -r0 -R999999  -I 32

ldclt -h localhost -p 390 -D "cn=Directory Manager" -w Secret123 -b "ou=people,dc=example,dc=com" -e delete -f cn=MrsXXXXXX -e incr,noloop,commoncounter -r0 -R999999  -I 32

4) Check the access log for error 51. If the error is found, check whether the CSN from the failed operation is replayed shortly after the failure.
grep "err=51" /var/log/dirsrv/slapd-master1/access
echo $?
1

grep "err=51" /var/log/dirsrv/slapd-master2/access
echo $?
1

(grep exits with status 1 when no lines match, so neither access log contains any err=51 result.)

Marking as verified.
Comment 16 errata-xmlrpc 2016-05-10 15:22:35 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0737.html
