1310848 – Supplier can skip a failing update, although it should retry.

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1310848 - Supplier can skip a failing update, although it should retry.

Summary: Supplier can skip a failing update, although it should retry.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	389-ds-base
Sub Component:
Version:	7.3
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Noriko Hosoi
QA Contact:	Viktor Ashirov
Docs Contact:	Petr Bokoc
URL:
Whiteboard:
Duplicates (1):	1339693 (view as bug list)
Depends On:
Blocks:	1351447
TreeView+	depends on / blocked

Reported:	2016-02-22 19:34 UTC by Noriko Hosoi
Modified:	2020-09-13 21:04 UTC (History)
CC List:	8 users (show)
Fixed In Version:	389-ds-base-1.3.5.2-1.el7
Doc Type:	Bug Fix
Doc Text:	Failed replication updates are now retried correctly in the next session If a replica update failed on the consumer side and was followed by another update that succeeded, the consumer's replication status was updated by the successful update, which caused the consumer to seem as if it was up to date. Consequently, the failed update was never retried, leading to data loss. With this update, a replication failure closes the connection and stops the replication session. This prevents further updates from changing the consumer's replication status, and allows the supplier to retry the failed operation in the next session, avoiding data loss.
Clone Of:
Clones:	1351447 (view as bug list)
Environment:
Last Closed:	2016-11-03 20:39:41 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	389ds 389-ds-base issues 1119	0	None	None	None	2020-09-13 21:04:31 UTC
Red Hat Product Errata	RHSA-2016:2594	0	normal	SHIPPED_LIVE	Moderate: 389-ds-base security, bug fix, and enhancement update	2016-11-03 12:11:08 UTC

Description Noriko Hosoi 2016-02-22 19:34:12 UTC

This bug is created as a clone of upstream ticket:
https://fedorahosted.org/389/ticket/47788

although the consumer returns a failure, and this failure should stop (break) replication (ignore_error_and_keep_going() does not ignore unwilling to perform), the operations following the failing one are replicated and the update the RUV. At the end it is like the failing operation is skipped.


Here is the test case scenario (in bold the failing CSN and the CSN following the failing one)


M1:
        MOD entry_18: csn=53592ba7000000010000
        DEL  entry_1: csn=53592ba7000100010000
        MOD entry_18: csn=53592ba8000000010000

                <-- M2
                MOD entry_19: csn=53592ba9000000020000
                MOD  entry_1: csn=53592baa000000020000 (error)
                MOD entry_19: csn=53592baa000100020000
        MOD entry_19: csn=53592bb6000000010000
        MOD entry_19: csn=53592bbb000000010000
        MOD entry_19: csn=53592bc1000000010000
        ...
        MOD entry_19: csn=53592beb000000010000
                <-- M2
                MOD entry_19: csn=53592bf2000000020000
                MOD entry_19: csn=53592bf8000000020000
                ...


M2:
        MOD entry_19: csn=53592ba9000000020000
        MOD  entry_1: csn=53592baa000000020000
        MOD entry_19: csn=53592baa000100020000
                M1 -->
                MOD entry_18: csn=53592ba7000000010000
                DEL  entry_1: csn=53592ba7000100010000
                MOD entry_18: csn=53592ba8000000010000
                MOD entry_19: csn=53592bb6000000010000
                MOD entry_19: csn=53592bbb000000010000
                MOD entry_19: csn=53592bc1000000010000
                ...
        ...


On M1, we can see that the MOD on entry_1 failed but replication of M2 updates is not stopped

    [24/Apr/2014:17:20:13 +0200] conn=5 op=5 MOD dn="cn=new_account19,cn=staged user,dc=example,dc=com"
    [24/Apr/2014:17:20:14 +0200] conn=5 op=5 RESULT err=0 tag=103 nentries=0 etime=1 csn=53592ba9000000020000
    [24/Apr/2014:17:20:14 +0200] conn=5 op=6 MOD dn="cn=new_account1,cn=staged user,dc=example,dc=com"
    [24/Apr/2014:17:20:14 +0200] conn=5 op=6 RESULT err=53 tag=103 nentries=0 etime=0 csn=53592baa000000020000
    [24/Apr/2014:17:20:14 +0200] conn=5 op=7 MOD dn="cn=new_account19,cn=staged user,dc=example,dc=com"
    [24/Apr/2014:17:20:16 +0200] conn=5 op=7 RESULT err=0 tag=103 nentries=0 etime=2 csn=53592baa000100020000
    [24/Apr/2014:17:20:19 +0200] conn=3 op=40 SRCH base="dc=example,dc=com" scope=2 filter="(cn=new_account19)" attrs=ALL
    [24/Apr/2014:17:20:19 +0200] conn=3 op=40 RESULT err=0 tag=101 nentries=1 etime=0
    [24/Apr/2014:17:20:20 +0200] conn=5 op=8 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"
    [24/Apr/2014:17:20:20 +0200] conn=5 op=8 RESULT err=0 tag=120 nentries=0 etime=0
    [24/Apr/2014:17:20:21 +0200] conn=5 op=9 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
    [24/Apr/2014:17:20:21 +0200] conn=3 op=42 MOD dn="cn=new_account19,cn=staged user,dc=example,dc=com"
    [24/Apr/2014:17:20:21 +0200] conn=5 op=9 RESULT err=0 tag=120 nentries=0 etime=1
    [24/Apr/2014:17:20:22 +0200] conn=3 op=42 RESULT err=0 tag=103 nentries=0 etime=1 csn=53592bb6000000010000
    ...
    [24/Apr/2014:17:21:25 +0200] conn=5 op=44 MOD dn="cn=new_account19,cn=staged user,dc=example,dc=com"
    [24/Apr/2014:17:21:25 +0200] conn=5 op=44 RESULT err=0 tag=103 nentries=0 etime=0 csn=53592bf2000000020000
    ...
    [24/Apr/2014:17:21:30 +0200] conn=5 op=48 MOD dn="cn=new_account19,cn=staged user,dc=example,dc=com"
    [24/Apr/2014:17:21:30 +0200] conn=5 op=48 RESULT err=0 tag=103 nentries=0 etime=0 csn=53592bf8000000020000


On M1, we can see that the failing MOD is add to the pending list, but when the next update (53592baa000100020000) is commited, the failing update is no longer in the pending list.

    [24/Apr/2014:17:20:14 +0200] NSMMReplicationPlugin - 53592baa000000020000, not committed
    [24/Apr/2014:17:20:14 +0200] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 53592baa000000020000 into pending list
    [24/Apr/2014:17:20:14 +0200] ldbm_back_modify - Attempt to modify a tombstone entry nsuniqueid=d131a60c-cbc311e3-847ba7c2-9c316dbb,cn=new_account1,cn=staged user,dc=example,dc=com
    [24/Apr/2014:17:20:14 +0200] NSMMReplicationPlugin - conn=5 op=6 csn=53592baa000000020000 process postop: canceling operation csn
    [24/Apr/2014:17:20:15 +0200] NSMMReplicationPlugin - csnplInsert: CSN Pending list content:
    [24/Apr/2014:17:20:15 +0200] NSMMReplicationPlugin - 53592baa000100020000, not committed
    [24/Apr/2014:17:20:15 +0200] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 53592baa000100020000 into pending list
    [24/Apr/2014:17:20:16 +0200] NSMMReplicationPlugin - replica_update_ruv: csn 53592baa000100020000
    [24/Apr/2014:17:20:16 +0200] NSMMReplicationPlugin - csnplCommit: CSN Pending list content:
    [24/Apr/2014:17:20:16 +0200] NSMMReplicationPlugin - 53592baa000100020000, not committed
    [24/Apr/2014:17:20:16 +0200] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 53592baa000100020000
    [24/Apr/2014:17:20:16 +0200] NSMMReplicationPlugin - ruv_update_ruv: rolled up to csn 53592baa000100020000

On M2, we can see that the second replication session find 53592baa000100020000 in the RUV

    [24/Apr/2014:17:20:12 +0200] Consumer RUV:
    [24/Apr/2014:17:20:12 +0200] {replicageneration} 53590540000000010000
    [24/Apr/2014:17:20:12 +0200] {replica 1 ldap://localhost.localdomain:44389} 53590546000000010000 53592ba8000000010000 00000000
    [24/Apr/2014:17:20:12 +0200] {replica 2 ldap://localhost.localdomain:45389}
    ...
    [24/Apr/2014:17:20:13 +0200] agmt="cn=meTo_localhost.localdomain:44389" (localhost:44389) - load=1 rec=2 csn=53592baa000000020000
    [24/Apr/2014:17:20:13 +0200] NSMMReplicationPlugin - agmt="cn=meTo_localhost.localdomain:44389" (localhost:44389): replay_update: Sending modify operation (dn="cn=new_account1,cn=staged user,dc=example,dc=com" csn=53592baa000000020000)
    [24/Apr/2014:17:20:13 +0200] NSMMReplicationPlugin - agmt="cn=meTo_localhost.localdomain:44389" (localhost:44389): replay_update: Consumer successfully sent operation with csn 53592baa000000020000
    [24/Apr/2014:17:20:13 +0200] agmt="cn=meTo_localhost.localdomain:44389" (localhost:44389) - load=1 rec=3 csn=53592baa000100020000
    [24/Apr/2014:17:20:13 +0200] NSMMReplicationPlugin - agmt="cn=meTo_localhost.localdomain:44389" (localhost:44389): replay_update: Sending modify operation (dn="cn=new_account19,cn=staged user,dc=example,dc=com" csn=53592baa000100020000)
    [24/Apr/2014:17:20:13 +0200] NSMMReplicationPlugin - agmt="cn=meTo_localhost.localdomain:44389" (localhost:44389): replay_update: Consumer successfully sent operation with csn 53592baa000100020000
    [24/Apr/2014:17:20:14 +0200] NSMMReplicationPlugin - agmt="cn=meTo_localhost.localdomain:44389" (localhost:44389): Consumer failed to replay change (uniqueid d131a60c-cbc311e3-847ba7c2-9c316dbb, CSN 53592baa000000020000): Server is unwilling to perform (53). Will retry later.
    ...
    [24/Apr/2014:17:20:21 +0200] Consumer RUV:
    [24/Apr/2014:17:20:21 +0200] {replicageneration} 53590540000000010000
    [24/Apr/2014:17:20:21 +0200] {replica 1 ldap://localhost.localdomain:44389} 53590546000000010000 53592ba8000000010000 00000000
    [24/Apr/2014:17:20:21 +0200] {replica 2 ldap://localhost.localdomain:45389} 53592ba9000000020000 53592baa000100020000 00000000


So replication did not stop on 53592baa000000020000, and the update is skipped although the supplier intended to retry it. It is possibly related to the update being removed from the pending list, the next commited update 53592baa000100020000 can go to the RUV

Comment 1 Mike McCune 2016-03-28 23:13:32 UTC

This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions

Comment 9 Kamlesh 2016-08-02 09:34:23 UTC

Bug Verified
Work fine for me no error occure during the adding and deleteing entry in MMR setup

Step1) Set up MMR 

Step 2) Add 1 million user in instance using ldclt
ldclt  -D "cn=Directory Manager" -w test1234 -f cn=XXXXXX -b "ou=people,dc=example,dc=com" -e add,person,incr,noloop,commoncounter -r1 -R1000000 -I 68 -I 32 -H ldap://localhost:389

Step 3) delete same 1 million user 
ldclt  -D "cn=Directory Manager" -w test1234 -f cn=XXXXXX -b "ou=people,dc=example,dc=com" -e delete,person,incr,noloop,commoncounter -r1 -R1000000 -I 68 -I 32 -H ldap://localhost:389

Step 4) Check access log
[root@dhcp210-197 slapd-test1]# pwd
/var/log/dirsrv/slapd-test1
[root@dhcp210-197 slapd-test1]# grep -c err=53 access*
access:0
access.20160727-183248:0
access.20160728-064940:0
access.20160728-192225:0
access.20160801-090001:0
access.20160801-222119:0
access.rotationinfo:0


[root@dhcp210-197 slapd-test]# pwd
/var/log/dirsrv/slapd-test
[root@dhcp210-197 slapd-test]# grep -c err=53 access*
access:0
access.20160727-183209:0
access.20160728-063057:0
access.20160728-190940:0
access.20160801-093749:0
access.20160801-222320:0
access.rotationinfo:0

Comment 11 errata-xmlrpc 2016-11-03 20:39:41 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2594.html

Comment 12 Noriko Hosoi 2016-11-15 16:04:11 UTC

*** Bug 1339693 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.