1294770 – Supplier can skip a failing update, although it should retry.


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	389-ds-base
Sub Component:
Version:	6.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	Noriko Hosoi
QA Contact:	Viktor Ashirov
Docs Contact:	Petr Bokoc
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2015-12-30 08:55 UTC by German Parente
Modified:	2020-09-13 21:04 UTC
CC List:	8 users
Fixed In Version:	389-ds-base-1.2.11.15-73.el6
Doc Type:	Bug Fix
Doc Text:	Replication failures no longer result in missing changes after additional updates. Previously, if a replicated update failed on the consumer side, it was never retried due to a bug in the replication asynchronous result thread, which caused it to miss the failure before another update was replicated successfully. The second update also updated the consumer Replica Update Vector (RUV), and the first (failed) update was lost. In this release, replication failures cause the connection to close, stopping the replication session and preventing any subsequent updates from updating the consumer RUV, which allows the supplier to retry the operation in the next replication session. No updates are therefore lost.
Clone Of:
Environment:
Last Closed:	2016-05-10 19:22:35 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	389ds 389-ds-base issues 1119	0	None	None	None	2020-09-13 21:04:25 UTC
Red Hat Product Errata	RHBA-2016:0737	0	normal	SHIPPED_LIVE	389-ds-base bug fix and enhancement update	2016-05-10 22:29:13 UTC

Description German Parente 2015-12-30 08:55:21 UTC

Description of problem:

This bug is just a variant of:

https://fedorahosted.org/389/ticket/47788

Supplier can skip a failing update, although it should retry.

I have a customer who can reproduce this behavior very often.

The scenario is two nodes in replication where one million ADDs and one million DELs take place.

The result is that, not always but very often, deletes are not replicated.

This happens in pairs. That is, when two deletes are performed simultaneously on different entries on each node, the replicated operation fails after retrying 50 times in the transaction backend. Note that both nodes are deleting entries at the same time, and on each node both the client application and the replication user are deleting entries. The issue arises when two DELs "cross" each other. The transaction backend must be locked by one of them, which causes the replicated operation to fail after 50 retries, and the same happens on each node.
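
The bounded retry described above can be pictured with a small Python sketch. This is only an illustration and not the 389-ds-base backend code: the Deadlock exception, the delete_with_retry function and the txn_delete callable are hypothetical names. The only facts taken from this report are the 50 retries and err=51, which is the LDAP "busy" result code.

```python
LDAP_SUCCESS = 0
LDAP_BUSY = 51      # LDAP result code "busy", the err=51 seen in the access log
RETRY_LIMIT = 50    # the report says the operation fails after 50 retries


class Deadlock(Exception):
    """Hypothetical signal that the backend transaction lost a lock conflict."""


def delete_with_retry(txn_delete, retry_limit=RETRY_LIMIT):
    """Call txn_delete() until it succeeds or the retry budget is used up."""
    for _ in range(retry_limit):
        try:
            txn_delete()
            return LDAP_SUCCESS
        except Deadlock:
            continue  # another writer holds the lock; try again
    # Retry count exceeded: the operation is reported as failed (busy).
    return LDAP_BUSY
```

If the competing delete keeps holding the lock for the whole retry budget, the replicated operation ends with err=51, which is the case the supplier is expected to retry later.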

Here is an extract of the access and error logs:

NODE1:

errors:
[20/Dec/2015:05:19:10 -0500] - Retry count exceeded in delete

access:
[20/Dec/2015:05:19:08 -0500] conn=2263844 op=577550 DEL dn="uid=p6k4n3522802,ou=nsPeople,o=nscorp.com,c=US" ==> this DEL fails (it's a replicated op, binddn of this conn is repl. user).
[20/Dec/2015:05:19:09 -0500] conn=2755125 op=1 DEL dn="uid=p6k4n3522763,ou=nsPeople,o=nscorp.com,c=US"  ==> this DEL succeeds.
[20/Dec/2015:05:19:10 -0500] conn=2263844 op=577550 RESULT err=51 tag=107 nentries=0 etime=2 csn=5676809a001127120000          ==> failing DEL
[20/Dec/2015:05:19:10 -0500] conn=2263844 op=577551 DEL dn="uid=p6k4n3522800,ou=nsPeople,o=nscorp.com,c=US"
[20/Dec/2015:05:19:11 -0500] conn=2755125 op=1 RESULT err=0 tag=107 nentries=0 etime=2 csn=5676809f000027110000                    ==> successful DEL



At the same time, in NODE2:


[20/Dec/2015:05:19:08 -0500] conn=2263844 op=577550 DEL dn="uid=p6k4n3522802,ou=nsPeople,o=nscorp.com,c=US"       =====> this DEL fails (it's a replicated one from the other node with repl. user bind)

[20/Dec/2015:05:19:09 -0500] conn=2755125 op=1 DEL dn="uid=p6k4n3522763,ou=nsPeople,o=nscorp.com,c=US"       =====> this DEL succeeds but it's not replicated. It's the one failing in the other node.
[20/Dec/2015:05:19:10 -0500] conn=2263844 op=577550 RESULT err=51 tag=107 nentries=0 etime=2 csn=5676809a001127120000               ===> failing RES


[20/Dec/2015:05:19:10 -0500] conn=2263844 op=577551 DEL dn="uid=p6k4n3522800,ou=nsPeople,o=nscorp.com,c=US"
[20/Dec/2015:05:19:11 -0500] conn=2755125 op=1 RESULT err=0 tag=107 nentries=0 etime=2 csn=5676809f000027110000         ==> RES of successful DEL

It seems that, for some reason, the failure of the replicated operation (err=51) is not reported to the master, which "thinks" everything is all right and therefore never retries it.

As we can see, both DELs are "crossing each other".
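
To make the suspected failure mode concrete, here is a toy Python model of a supplier sending updates to a consumer. It is a deliberately simplified sketch, not the replication protocol: run_sessions, fail_once and close_on_failure are hypothetical names, and the consumer RUV is reduced to a single index. With close_on_failure=False the supplier misses the failure and keeps sending, so a later successful update moves the RUV past the failed one. With close_on_failure=True the session stops at the failure, and the next session resumes from the unchanged RUV.

```python
def run_sessions(updates, fail_once, close_on_failure, max_sessions=5):
    """Replay `updates` (in CSN order) to a toy consumer.

    fail_once: updates that fail on their first attempt only.
    Returns the list of updates the consumer actually applied.
    """
    applied = []
    ruv = -1                       # index of the newest update the consumer has seen
    pending_failures = set(fail_once)
    for _ in range(max_sessions):
        start = ruv + 1            # each session resumes right after the RUV
        if start >= len(updates):
            break                  # consumer looks fully up to date
        for i in range(start, len(updates)):
            update = updates[i]
            if update in pending_failures:
                pending_failures.discard(update)  # transient: would succeed next time
                if close_on_failure:
                    break          # session stops, RUV is left untouched
                continue           # failure goes unnoticed, keep sending
            applied.append(update)
            ruv = i                # a successful update advances the RUV
    return applied


updates = ["csn1", "csn2", "csn3"]
print(run_sessions(updates, {"csn2"}, close_on_failure=False))
print(run_sessions(updates, {"csn2"}, close_on_failure=True))
```

In the first call csn2 is never applied, because csn3 has already advanced the toy RUV. In the second call csn2 is retried in the following session and all three updates are applied.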

Version-Release number of selected component (if applicable): the customer is reproducing this in 389-ds-base-1.2.11.15-68.el6_7.x86_64

Comment 3 mreynolds 2016-01-18 00:43:29 UTC

Fixed upstream.

Verification steps:

[1]  Set up MMR
[2]  Add 1 million entries to each replica (total of 2 million entries)
[3]  On each replica delete the 1 million entries that were just added (total of 2 million deletes)
[4]  Check the access log for error 51.  If the error is found, see if that CSN from that failed operation is replayed shortly after the failure.
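
Step [4] could be automated along these lines. This is a minimal sketch that assumes the RESULT line layout shown in the access log extract in the description. The function name find_unreplayed_csns and the regular expression are illustrative and not part of any 389-ds tooling. It only checks that a CSN that failed with err=51 later shows up with err=0 in the same log, not how soon.

```python
import re

# Matches RESULT lines such as:
# [...] conn=2263844 op=577550 RESULT err=51 tag=107 nentries=0 etime=2 csn=5676809a001127120000
RESULT_RE = re.compile(r"conn=(\d+) op=(\d+) RESULT err=(\d+) .*?csn=([0-9a-fA-F]+)")


def find_unreplayed_csns(lines):
    """Return CSNs that failed with err=51 and were not applied (err=0) afterwards."""
    pending = {}  # csn -> line number of its most recent err=51 result
    for lineno, line in enumerate(lines, start=1):
        match = RESULT_RE.search(line)
        if not match:
            continue
        err, csn = int(match.group(3)), match.group(4)
        if err == 51:
            pending[csn] = lineno
        elif err == 0:
            pending.pop(csn, None)  # the same CSN was replayed successfully
    return sorted(pending)


sample = [
    "[20/Dec/2015:05:19:10 -0500] conn=2263844 op=577550 RESULT err=51 tag=107 nentries=0 etime=2 csn=5676809a001127120000",
    "[20/Dec/2015:05:19:11 -0500] conn=2755125 op=1 RESULT err=0 tag=107 nentries=0 etime=2 csn=5676809f000027110000",
]
print(find_unreplayed_csns(sample))
```

For the two RESULT lines from the description, the failed CSN 5676809a001127120000 is reported because it never reappears with err=0. On a fixed build the list should be empty. To check a real log, pass an open file object for the instance's access log instead of the sample list.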

Comment 14 Simon Pichugin 2016-03-23 10:04:17 UTC

Build tested:
389-ds-base-1.2.11.15-74.el6.x86_64

Verification steps:
1) Set up MMR
master1 - 389 - dc=example,dc=com
master2 - 390 - dc=example,dc=com

2) Add 1 million entries to each replica (total of 2 million entries)
ldclt -h localhost -p 389 -D "cn=Directory Manager" -w Secret123 -f cn=MrXXXXXX -b "ou=people,dc=example,dc=com" -e add,person,incr,noloop,commoncounter -r0 -R999999

ldclt -h localhost -p 390 -D "cn=Directory Manager" -w Secret123 -f cn=MrsXXXXXX -b "ou=people,dc=example,dc=com" -e add,person,incr,noloop,commoncounter -r0 -R999999

3) On each replica delete the 1 million entries that were just added (total of 2 million deletes)
ldclt -h localhost -p 389 -D "cn=Directory Manager" -w Secret123 -b "ou=people,dc=example,dc=com" -e delete -f cn=MrXXXXXX -e incr,noloop,commoncounter -r0 -R999999  -I 32

ldclt -h localhost -p 390 -D "cn=Directory Manager" -w Secret123 -b "ou=people,dc=example,dc=com" -e delete -f cn=MrsXXXXXX -e incr,noloop,commoncounter -r0 -R999999  -I 32

4) Check the access log for error 51.  If the error is found, see if that CSN from that failed operation is replayed shortly after the failure
grep "err=51" /var/log/dirsrv/slapd-master1/access
echo $?
1

grep "err=51" /var/log/dirsrv/slapd-master2/access
echo $?
1

Marking as verified.

Comment 16 errata-xmlrpc 2016-05-10 19:22:35 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0737.html
