Bug 1310848

Summary: Supplier can skip a failing update, although it should retry.
Product: Red Hat Enterprise Linux 7
Reporter: Noriko Hosoi <nhosoi>
Component: 389-ds-base
Assignee: Noriko Hosoi <nhosoi>
Status: CLOSED ERRATA
QA Contact: Viktor Ashirov <vashirov>
Severity: urgent
Docs Contact: Petr Bokoc <pbokoc>
Priority: urgent
Version: 7.3
CC: ekeck, gparente, kbanerje, mkolaja, nkinder, pbokoc, rmeggins, tmihinto
Target Milestone: rc
Keywords: ZStream
Hardware: All
OS: Linux
Fixed In Version: 389-ds-base-1.3.5.2-1.el7
Doc Type: Bug Fix
Doc Text:
Failed replication updates are now retried correctly in the next session.
If a replica update failed on the consumer side and was followed by another update that succeeded, the consumer's replication status was updated by the successful update, which caused the consumer to seem as if it was up to date. Consequently, the failed update was never retried, leading to data loss. With this update, a replication failure closes the connection and stops the replication session. This prevents further updates from changing the consumer's replication status, and allows the supplier to retry the failed operation in the next session, avoiding data loss.
Cloned as: 1351447 (view as bug list)
Bug Blocks: 1351447
Last Closed: 2016-11-03 20:39:41 UTC

Description Noriko Hosoi 2016-02-22 19:34:12 UTC
This bug is created as a clone of upstream ticket:
https://fedorahosted.org/389/ticket/47788

Although the consumer returns a failure, and this failure should stop (break) replication (ignore_error_and_keep_going() does not ignore "unwilling to perform"), the operations following the failing one are replicated and update the RUV. In the end, it is as if the failing operation was simply skipped.


Here is the test case scenario (the failing CSN is 53592baa000000020000; the CSN following it is 53592baa000100020000):


M1:
        MOD entry_18: csn=53592ba7000000010000
        DEL  entry_1: csn=53592ba7000100010000
        MOD entry_18: csn=53592ba8000000010000

                <-- M2
                MOD entry_19: csn=53592ba9000000020000
                MOD  entry_1: csn=53592baa000000020000 (error)
                MOD entry_19: csn=53592baa000100020000
        MOD entry_19: csn=53592bb6000000010000
        MOD entry_19: csn=53592bbb000000010000
        MOD entry_19: csn=53592bc1000000010000
        ...
        MOD entry_19: csn=53592beb000000010000
                <-- M2
                MOD entry_19: csn=53592bf2000000020000
                MOD entry_19: csn=53592bf8000000020000
                ...


M2:
        MOD entry_19: csn=53592ba9000000020000
        MOD  entry_1: csn=53592baa000000020000
        MOD entry_19: csn=53592baa000100020000
                M1 -->
                MOD entry_18: csn=53592ba7000000010000
                DEL  entry_1: csn=53592ba7000100010000
                MOD entry_18: csn=53592ba8000000010000
                MOD entry_19: csn=53592bb6000000010000
                MOD entry_19: csn=53592bbb000000010000
                MOD entry_19: csn=53592bc1000000010000
                ...
        ...
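
For reference when reading the CSNs above: a CSN is a 20-character hex string made of a timestamp (8 digits), a sequence number (4), the replica ID (4), and a sub-sequence number (4), so the two CSNs of interest differ only in the sequence number and both originate from replica ID 2 (M2). A minimal decoding sketch (illustration only, not code from 389-ds-base):

    # Illustration only: decode the 20-hex-digit CSNs used in this scenario.
    # Assumed layout: timestamp (8), seqnum (4), replica ID (4), subseqnum (4).
    from datetime import datetime, timezone

    def decode_csn(csn):
        return {
            "time": datetime.fromtimestamp(int(csn[0:8], 16), tz=timezone.utc),
            "seqnum": int(csn[8:12], 16),
            "replica_id": int(csn[12:16], 16),
            "subseqnum": int(csn[16:20], 16),
        }

    # The failing update and the one that follows it both come from replica ID 2:
    print(decode_csn("53592baa000000020000"))
    print(decode_csn("53592baa000100020000"))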


On M1, we can see that the MOD on entry_1 failed (err=53), but replication of the M2 updates was not stopped:

    [24/Apr/2014:17:20:13 +0200] conn=5 op=5 MOD dn="cn=new_account19,cn=staged user,dc=example,dc=com"
    [24/Apr/2014:17:20:14 +0200] conn=5 op=5 RESULT err=0 tag=103 nentries=0 etime=1 csn=53592ba9000000020000
    [24/Apr/2014:17:20:14 +0200] conn=5 op=6 MOD dn="cn=new_account1,cn=staged user,dc=example,dc=com"
    [24/Apr/2014:17:20:14 +0200] conn=5 op=6 RESULT err=53 tag=103 nentries=0 etime=0 csn=53592baa000000020000
    [24/Apr/2014:17:20:14 +0200] conn=5 op=7 MOD dn="cn=new_account19,cn=staged user,dc=example,dc=com"
    [24/Apr/2014:17:20:16 +0200] conn=5 op=7 RESULT err=0 tag=103 nentries=0 etime=2 csn=53592baa000100020000
    [24/Apr/2014:17:20:19 +0200] conn=3 op=40 SRCH base="dc=example,dc=com" scope=2 filter="(cn=new_account19)" attrs=ALL
    [24/Apr/2014:17:20:19 +0200] conn=3 op=40 RESULT err=0 tag=101 nentries=1 etime=0
    [24/Apr/2014:17:20:20 +0200] conn=5 op=8 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"
    [24/Apr/2014:17:20:20 +0200] conn=5 op=8 RESULT err=0 tag=120 nentries=0 etime=0
    [24/Apr/2014:17:20:21 +0200] conn=5 op=9 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
    [24/Apr/2014:17:20:21 +0200] conn=3 op=42 MOD dn="cn=new_account19,cn=staged user,dc=example,dc=com"
    [24/Apr/2014:17:20:21 +0200] conn=5 op=9 RESULT err=0 tag=120 nentries=0 etime=1
    [24/Apr/2014:17:20:22 +0200] conn=3 op=42 RESULT err=0 tag=103 nentries=0 etime=1 csn=53592bb6000000010000
    ...
    [24/Apr/2014:17:21:25 +0200] conn=5 op=44 MOD dn="cn=new_account19,cn=staged user,dc=example,dc=com"
    [24/Apr/2014:17:21:25 +0200] conn=5 op=44 RESULT err=0 tag=103 nentries=0 etime=0 csn=53592bf2000000020000
    ...
    [24/Apr/2014:17:21:30 +0200] conn=5 op=48 MOD dn="cn=new_account19,cn=staged user,dc=example,dc=com"
    [24/Apr/2014:17:21:30 +0200] conn=5 op=48 RESULT err=0 tag=103 nentries=0 etime=0 csn=53592bf8000000020000
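
A quick way to spot the failed replicated operation in an access log is to list the RESULT lines that carry a csn and a non-zero error code; a small sketch (illustration only, the log path is just an example):

    # Illustration only: print replicated RESULT lines whose err code is non-zero,
    # so the failed CSN (err=53, unwilling to perform) stands out.
    import re

    pattern = re.compile(r'RESULT err=(\d+).*csn=(\w{20})')
    with open("/var/log/dirsrv/slapd-M1/access") as log:  # example path
        for line in log:
            m = pattern.search(line)
            if m and m.group(1) != "0":
                print(line.rstrip())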


On M1, we can see that the failing MOD is added to the pending list, but when the next update (53592baa000100020000) is committed, the failing update is no longer in the pending list:

    [24/Apr/2014:17:20:14 +0200] NSMMReplicationPlugin - 53592baa000000020000, not committed
    [24/Apr/2014:17:20:14 +0200] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 53592baa000000020000 into pending list
    [24/Apr/2014:17:20:14 +0200] ldbm_back_modify - Attempt to modify a tombstone entry nsuniqueid=d131a60c-cbc311e3-847ba7c2-9c316dbb,cn=new_account1,cn=staged user,dc=example,dc=com
    [24/Apr/2014:17:20:14 +0200] NSMMReplicationPlugin - conn=5 op=6 csn=53592baa000000020000 process postop: canceling operation csn
    [24/Apr/2014:17:20:15 +0200] NSMMReplicationPlugin - csnplInsert: CSN Pending list content:
    [24/Apr/2014:17:20:15 +0200] NSMMReplicationPlugin - 53592baa000100020000, not committed
    [24/Apr/2014:17:20:15 +0200] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 53592baa000100020000 into pending list
    [24/Apr/2014:17:20:16 +0200] NSMMReplicationPlugin - replica_update_ruv: csn 53592baa000100020000
    [24/Apr/2014:17:20:16 +0200] NSMMReplicationPlugin - csnplCommit: CSN Pending list content:
    [24/Apr/2014:17:20:16 +0200] NSMMReplicationPlugin - 53592baa000100020000, not committed
    [24/Apr/2014:17:20:16 +0200] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 53592baa000100020000
    [24/Apr/2014:17:20:16 +0200] NSMMReplicationPlugin - ruv_update_ruv: rolled up to csn 53592baa000100020000
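
The mechanism at work is the CSN pending list: each replicated CSN is inserted when the operation starts and is either committed or, on failure, canceled in the postop, and the RUV only rolls forward over the leading run of committed CSNs. The toy model below (illustration only, not the actual csnpl code from 389-ds-base) shows why canceling the failed CSN lets the next committed CSN move the RUV past it:

    # Toy model of a CSN pending list (illustration only).
    # The RUV max CSN advances only over committed CSNs with nothing
    # still pending in front of them.
    class PendingList:
        def __init__(self):
            self.entries = {}      # csn -> committed flag
            self.ruv_maxcsn = None

        def insert(self, csn):
            self.entries[csn] = False

        def cancel(self, csn):     # what happens to the failed update
            self.entries.pop(csn, None)

        def commit(self, csn):
            self.entries[csn] = True
            for c in sorted(self.entries):   # fixed-width hex sorts correctly
                if not self.entries[c]:
                    break                    # a pending CSN blocks the roll-up
                self.ruv_maxcsn = c
                del self.entries[c]

    pl = PendingList()
    pl.insert("53592baa000000020000")   # fails on the consumer ...
    pl.cancel("53592baa000000020000")   # ... and is canceled in the postop
    pl.insert("53592baa000100020000")
    pl.commit("53592baa000100020000")
    print(pl.ruv_maxcsn)                # 53592baa000100020000 -> failure is masked

Had the failed CSN stayed in the pending list instead of being canceled, the later commit could not have rolled the RUV past it.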

On M2, we can see that the second replication session finds 53592baa000100020000 in the consumer RUV:

    [24/Apr/2014:17:20:12 +0200] Consumer RUV:
    [24/Apr/2014:17:20:12 +0200] {replicageneration} 53590540000000010000
    [24/Apr/2014:17:20:12 +0200] {replica 1 ldap://localhost.localdomain:44389} 53590546000000010000 53592ba8000000010000 00000000
    [24/Apr/2014:17:20:12 +0200] {replica 2 ldap://localhost.localdomain:45389}
    ...
    [24/Apr/2014:17:20:13 +0200] agmt="cn=meTo_localhost.localdomain:44389" (localhost:44389) - load=1 rec=2 csn=53592baa000000020000
    [24/Apr/2014:17:20:13 +0200] NSMMReplicationPlugin - agmt="cn=meTo_localhost.localdomain:44389" (localhost:44389): replay_update: Sending modify operation (dn="cn=new_account1,cn=staged user,dc=example,dc=com" csn=53592baa000000020000)
    [24/Apr/2014:17:20:13 +0200] NSMMReplicationPlugin - agmt="cn=meTo_localhost.localdomain:44389" (localhost:44389): replay_update: Consumer successfully sent operation with csn 53592baa000000020000
    [24/Apr/2014:17:20:13 +0200] agmt="cn=meTo_localhost.localdomain:44389" (localhost:44389) - load=1 rec=3 csn=53592baa000100020000
    [24/Apr/2014:17:20:13 +0200] NSMMReplicationPlugin - agmt="cn=meTo_localhost.localdomain:44389" (localhost:44389): replay_update: Sending modify operation (dn="cn=new_account19,cn=staged user,dc=example,dc=com" csn=53592baa000100020000)
    [24/Apr/2014:17:20:13 +0200] NSMMReplicationPlugin - agmt="cn=meTo_localhost.localdomain:44389" (localhost:44389): replay_update: Consumer successfully sent operation with csn 53592baa000100020000
    [24/Apr/2014:17:20:14 +0200] NSMMReplicationPlugin - agmt="cn=meTo_localhost.localdomain:44389" (localhost:44389): Consumer failed to replay change (uniqueid d131a60c-cbc311e3-847ba7c2-9c316dbb, CSN 53592baa000000020000): Server is unwilling to perform (53). Will retry later.
    ...
    [24/Apr/2014:17:20:21 +0200] Consumer RUV:
    [24/Apr/2014:17:20:21 +0200] {replicageneration} 53590540000000010000
    [24/Apr/2014:17:20:21 +0200] {replica 1 ldap://localhost.localdomain:44389} 53590546000000010000 53592ba8000000010000 00000000
    [24/Apr/2014:17:20:21 +0200] {replica 2 ldap://localhost.localdomain:45389} 53592ba9000000020000 53592baa000100020000 00000000


So replication did not stop on 53592baa000000020000, and the update is skipped even though the supplier intended to retry it. This is most likely because the failing update is removed from the pending list, which allows the next committed update, 53592baa000100020000, to go into the RUV.
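
On the supplier side, whether a change is replayed in the next session comes down to comparing its CSN with the consumer RUV: anything at or below the consumer's max CSN for that replica ID is treated as already applied. A simplified sketch of the effect (illustration only; the real code walks the changelog against the RUV):

    # Illustration only: fixed-width hex CSNs compare lexicographically, so a
    # plain string comparison shows why the failed update is never re-sent.
    consumer_maxcsn_rid2 = "53592baa000100020000"  # from the consumer RUV above
    failed_csn = "53592baa000000020000"

    if failed_csn <= consumer_maxcsn_rid2:
        print("supplier assumes the consumer already has", failed_csn)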

Comment 1 Mike McCune 2016-03-28 23:13:32 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation; please contact mmccune with any questions.

Comment 9 Kamlesh 2016-08-02 09:34:23 UTC
Bug verified.
Works fine for me; no errors occurred while adding and deleting entries in the MMR setup.

Step 1) Set up MMR (multi-master replication).

Step 2) Add 1 million users to the instance using ldclt:
ldclt  -D "cn=Directory Manager" -w test1234 -f cn=XXXXXX -b "ou=people,dc=example,dc=com" -e add,person,incr,noloop,commoncounter -r1 -R1000000 -I 68 -I 32 -H ldap://localhost:389

Step 3) Delete the same 1 million users:
ldclt  -D "cn=Directory Manager" -w test1234 -f cn=XXXXXX -b "ou=people,dc=example,dc=com" -e delete,person,incr,noloop,commoncounter -r1 -R1000000 -I 68 -I 32 -H ldap://localhost:389

Step 4) Check the access logs on both instances for err=53:
[root@dhcp210-197 slapd-test1]# pwd
/var/log/dirsrv/slapd-test1
[root@dhcp210-197 slapd-test1]# grep -c err=53 access*
access:0
access.20160727-183248:0
access.20160728-064940:0
access.20160728-192225:0
access.20160801-090001:0
access.20160801-222119:0
access.rotationinfo:0


[root@dhcp210-197 slapd-test]# pwd
/var/log/dirsrv/slapd-test
[root@dhcp210-197 slapd-test]# grep -c err=53 access*
access:0
access.20160727-183209:0
access.20160728-063057:0
access.20160728-190940:0
access.20160801-093749:0
access.20160801-222320:0
access.rotationinfo:0
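
Besides grepping for err=53, the agreement status can also be checked over LDAP; a sketch using python-ldap (credentials and port follow the ldclt commands above, adjust for the actual setup):

    # Sketch: query each replication agreement's last update status.
    import ldap

    conn = ldap.initialize("ldap://localhost:389")
    conn.simple_bind_s("cn=Directory Manager", "test1234")
    results = conn.search_s("cn=config", ldap.SCOPE_SUBTREE,
                            "(objectClass=nsds5replicationAgreement)",
                            ["nsds5replicaLastUpdateStatus"])
    for dn, attrs in results:
        print(dn, attrs.get("nsds5replicaLastUpdateStatus"))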

Comment 11 errata-xmlrpc 2016-11-03 20:39:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2594.html

Comment 12 Noriko Hosoi 2016-11-15 16:04:11 UTC
*** Bug 1339693 has been marked as a duplicate of this bug. ***