Bug 1391701 - do not treat missing csn as fatal
Summary: do not treat missing csn as fatal
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: 389-ds-base
Version: 6.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Noriko Hosoi
QA Contact: Viktor Ashirov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-11-03 19:20 UTC by Noriko Hosoi
Modified: 2020-09-13 21:52 UTC
CC List: 4 users

Fixed In Version: 389-ds-base-1.2.11.15-87.el6
Doc Type: Bug Fix
Doc Text:
Fix: This fix removes the automatic choice of an alternative csn when the calculated anchor csn is not found. In that case the agreement no longer goes into a fatal state but retries later. The fix also adds a configuration parameter to the replication agreement that allows picking a "next best" anchor csn when the original is not found, to keep replication going. Result: Even if a master runs into a missing CSN, replication does not fail; it retries.
Clone Of:
Environment:
Last Closed: 2017-03-21 10:23:57 UTC
Target Upstream Version:
Embargoed:
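
The Doc Text above mentions a new per-agreement parameter for picking a "next best" anchor csn. As a minimal illustration only, the python-ldap snippet below shows how such a setting could be enabled on one agreement; the attribute name nsds5ReplicaIgnoreMissingChange, the value "once", and the agreement DN are assumptions taken from the upstream feature, not from this erratum, so verify them against your 389-ds-base version before use.

{{{
#!/usr/bin/env python
# Illustrative sketch only, not taken from this erratum.
# Assumed names: nsds5ReplicaIgnoreMissingChange (the upstream attribute for
# this feature) and the example agreement DN below.
import ldap

AGREEMENT_DN = ('cn=exampleAgreement,cn=replica,'
                'cn="dc=example,dc=com",cn=mapping tree,cn=config')

conn = ldap.initialize("ldap://supplier.example.com:389")
conn.simple_bind_s("cn=Directory Manager", "password")

# "once" asks the supplier to pick the closest available starting csn a single
# time when the calculated anchor csn is missing, instead of stopping.
conn.modify_s(AGREEMENT_DN,
              [(ldap.MOD_REPLACE, "nsds5ReplicaIgnoreMissingChange", [b"once"])])
conn.unbind_s()
}}}

Whether a fallback value is appropriate depends on how much divergence is acceptable before a scheduled reinit; without it, the agreement simply backs off and retries.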


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github 389ds 389-ds-base issues 2079 0 None None None 2020-09-13 21:52:40 UTC
Red Hat Product Errata RHBA-2017:0667 0 normal SHIPPED_LIVE 389-ds-base bug fix update 2017-03-21 12:35:05 UTC

Description Noriko Hosoi 2016-11-03 19:20:08 UTC
This bug is created as a clone of upstream ticket:
https://fedorahosted.org/389/ticket/49020

There have been many tickets and fixes about managing csns in a replication session, and it is not yet fully settled.
There are situations where replication should back off instead of going into a fatal state.
A summary of the problem and status was discussed on a mailing list and is cited here:

{{{
Recently I have been haunted by the problem of whether and when we should pick an alternative start csn in a replication session if the calculated anchor csn cannot be found in the changelog.
This was investigated during the work on ticket #48766 (skipped-changes) and the resulting failures in reliab15, and picked up again with the recent customer problems on 7.2 where a part of that fix was missing.
I was trying to analyze the scenarios which can arise and to finally be able to answer these two core questions:

1] If an anchor csn cannot be found, should we choose an alternative starting csn (and under which conditions), or should the repl session be aborted?
2] If the answer to 1] is abort, should this be a fatal error or a transient one?

I have been moving in circles, but I hope to have a good understanding of the problem now and would really like to get this finally settled, so please read the following, even if it is a bit lengthy, and challenge my arguments.

Let's start by looking back at the state before the "skipped-changes" fix:
- if an anchor csn was not found and the purge RUV in the changelog did not contain csns, an alternative start csn was chosen
- the alternative was the minCSN of the supplierRUV
- after an online init, a start-iteration record was written to the changelog, which corresponded to the minCSN in the ruv after init.

This worked, but while working on the "skipped-changes" problem I noticed that the selection of the alternative csn could lead to a loss of many changes (I have a test case to demonstrate this - TEST #1).
So, under the assumption that we should find an anchor or break, and that the existing mechanism to select an alternative was incorrect, this fallback was removed in the original fix, patch48766-0.

Unfortunately reliab15 failed with this patch, tracked in ticket #48954.
A first analysis showed that the failure was after init, when an anchor csn was looked up for a replica ID which did not have a start-iteration record. A first attempt to fix it was to log start-iteration records for all RIDs with csns contained in the ruv after init (patch48954-1).
This did not completely solve the initial phase of reliab15, and we decided to go back to the method of selecting an alternative start csn, but choosing a better one, as close as possible to the missing csn: patch48954-2.

This resolved the reliab15 problem and is the current state.

In between, Thierry detected that this change also changed behaviour if a replica accepted updates too early after initialization (#48976).
And I found the test case TEST #1, where with the old selection of the alternative many changes can be lost, while with the new method one change is still lost.

So I looked again more closely at the failures we had in reliab15 and noticed that one reason patch48954-1 did not work was that in the process of initializations M1->M2->... the ruv for most replica IDs only contained a purl, but no csn. This could be improved by fixing tickets #48995 and #48999.
With a fix for these tickets the reliab15 scenario worked without the need of choosing an alternative anchor csn.

So I am back to the question of what could be a valid scenario where an anchor csn cannot be found.

Based on the following, it should not happen:
If the ruv used for total init does contain a csn for rid N, a start-iteration record will be created for this csn-N; the server will only receive updates with csn-X > csn-N, so it will have all updates, and if the consumer csn csn-C >= csn-N, the anchor csn csn-C should always be found.
If the ruv used for total init does not contain a csn for rid N, this means the server has not seen any changes for this rid and ALL changes will be received via replication, so no csn for rid N should ever be missing.

But it does. A potential scenario is in TEST #2, and the creation of a keep alive entry is a special case for this scenario, where the ADD is internal, not external.
I have opened ticket #49000 for this.

I think with the fixes for #48995, #48999 and #49000 we should be safe not to use a fallback to an alternative anchor csn.
If an anchor csn can no longer be found, it is because it was purged, or the server was initialized with a RUV greater than the missing CSN (TEST #1), or it accepted an update before it was in sync after an init (ticket #48976).

But in these cases the error is not permanent: if the consumer is updated by another server, the required anchor csn can increase and be found in the changelog, so a missing anchor csn should, in my opinion, not switch to FATAL state but to BACKOFF.


So to summarize, here is my suggestion:
- Implement fixes for #48995, #48999 and #49000
- treat a missing anchor csn as an error
- switch to backoff for a missing anchor csn

There might be one reason to keep the choice of an alternative anchor csn: I have seen customer requests to keep replication going even if there are some inconsistencies; they want to schedule a reinit or restore or whatever at their convenience and not have an environment where replication is completely broken.


For completeness, here are my test scenarios:

TEST #1:

(the disabling and enabling of replication agreements is to enforce a specific timing and replication flow; it could happen like this by itself as well).
Have 3 masters A,B,C in triangle replication.
Total init A-->B
Disable agreement A-->B
Disable agreement C-->B
Add an entry cn=X on A
Total init A-->C
Add entries cn=Y1,.....cn=Y10 on A
Enable agreement C-->B again

Result: with patch 48766, entry cn=X is missing on B
with the version before 48766, cn=X, cn=Y1,...,cn=Y9 are missing

TEST #2:

masters A, B
start a sequence of adds to A, adding cn=E1,cn=E2,.......
while this is running, start a total init A-->B
when the total init is done, the incremental update starts and we see a couple of messages on B:
urp_add: csn=..... entry with cn=Ex already exists

the csns reported in these messages are then NOT in the changelog of B 
}}}
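
The core of the suggestion above is that a missing anchor csn is treated as a transient condition: the agreement backs off and retries instead of going FATAL, and only if the administrator explicitly opts in does the supplier fall back to the closest available csn to keep replication going. The following is a small illustrative model of that decision in Python; it is not the 389-ds-base code, and every name in it is invented for clarity.

{{{
# Toy model of the proposed behaviour (BACKOFF instead of FATAL, optional
# "next best" anchor csn). Not the actual 389-ds-base implementation.
from enum import Enum


class ReplState(Enum):
    SENDING_UPDATES = 1   # anchor (or fallback) found: send the changes
    BACKOFF = 2           # transient error: retry the session later
    FATAL = 3             # old behaviour: stop the agreement for good


class ToyChangelog:
    """A toy changelog: just a sorted list of csns, modelled as integers."""

    def __init__(self, csns):
        self.csns = sorted(csns)

    def find_anchor(self, consumer_maxcsn):
        # To resume, the consumer's maxcsn for a replica ID must still be
        # present in the supplier changelog.
        return consumer_maxcsn if consumer_maxcsn in self.csns else None

    def next_best(self, consumer_maxcsn):
        # "Next best" anchor: the smallest csn still in the changelog that is
        # newer than the missing one (the changes in between are skipped).
        later = [c for c in self.csns if c > consumer_maxcsn]
        return later[0] if later else None


def start_session(changelog, consumer_maxcsn, ignore_missing_change=False):
    anchor = changelog.find_anchor(consumer_maxcsn)
    if anchor is not None:
        return ReplState.SENDING_UPDATES, anchor
    if ignore_missing_change:
        # The administrator chose availability over strict consistency.
        return ReplState.SENDING_UPDATES, changelog.next_best(consumer_maxcsn)
    # The condition can be transient (another supplier may bring the consumer
    # forward), so back off and retry instead of going FATAL.
    return ReplState.BACKOFF, None


if __name__ == "__main__":
    cl = ToyChangelog([105, 106, 107])      # csns 100-104 were purged
    print(start_session(cl, 102))           # -> (ReplState.BACKOFF, None)
    print(start_session(cl, 102, True))     # -> (ReplState.SENDING_UPDATES, 105)
}}}

In this model the pre-fix behaviour would correspond to returning FATAL on the first call; with the change, that session simply retries later, and the fallback in the second call is taken only when it is explicitly requested.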

Comment 6 errata-xmlrpc 2017-03-21 10:23:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0667.html

