This bug is created as a clone of upstream ticket: https://fedorahosted.org/389/ticket/49020

There have been many tickets and fixes about managing CSNs in a replication session, and the topic is not yet fully settled. There are situations where replication should back off instead of going into a fatal state. A summary of the problem and its status was discussed on a mailing list and is cited here:

{{{
Recently I was haunted by the problem of whether and when we should pick an alternative start CSN in a replication session if the calculated anchor CSN cannot be found in the changelog. This was investigated during the work on ticket #48766 (skipped changes) and the resulting failures in reliab15, and picked up again with the recent customer problems on 7.2 where a part of that fix was missing.

I was trying to analyze the scenarios which can arise, to finally be able to answer these two core questions:

1] If an anchor CSN cannot be found, should we choose an alternative starting CSN (and under which conditions), or should the replication session be aborted?
2] If the answer to 1] is abort, should this be a fatal error or a transient one?

I have been moving in circles, but I hope to have a good understanding of the problem now and would really like to get this finally settled, so please read the following, even if it is a bit lengthy, and challenge my arguments.

Let's start by looking back at the state before the "skipped changes" fix:
- if an anchor CSN was not found and the purge RUV in the changelog did not contain CSNs, an alternative start CSN was chosen
- the alternative was the minCSN of the supplier RUV
- after an online init, a start-iteration record was written to the changelog, corresponding to the minCSN in the RUV after init

This worked, but while working on the "skipped changes" problem I noticed that the selection of the alternative CSN could lead to a loss of many changes (I have a test case to demonstrate this - TEST #1).

So, under the assumptions that we should either find an anchor or break, and that the existing mechanism to select an alternative was incorrect, this fallback was removed in the original fix, patch48766-0.

Unfortunately reliab15 failed with this patch, tracked in ticket #48954. A first analysis showed that the failure occurred after init, when an anchor CSN was looked up for a replica ID which did not have a start-iteration record. A first attempt to fix it was to log start-iteration records for all RIDs with CSNs contained in the RUV after init: patch48954-1. This did not completely solve the initial phase of reliab15, and we decided to go back to the method of selecting an alternative start CSN, but choosing a better one, as close as possible to the missing CSN: patch48954-2. This resolved the reliab15 problem and is the current state.

In between, Thierry detected that this change also changed behaviour if a replica accepted updates too early after initialization (#48976). And I found the test case TEST #1, where with the old selection of the alternative many changes can be lost, but with the new method still one change is lost.

So I looked more closely at the failures we had in reliab15 and noticed that one reason patch48954-1 did not work was that in the process of initializations M1->M2->... the RUV for most replica IDs only contained a purl, but no CSN. This could be improved by fixing tickets #48995 and #48999. With a fix for these tickets the reliab15 scenario worked without the need to choose an alternative anchor CSN.

So I am back to the question of what could be a valid scenario where an anchor CSN cannot be found.

From the following it should not happen:
- If the RUV used for total init does contain a CSN for RID N, a start-iteration csn-N will be created for this csn-N; the server will only receive updates csn-X > csn-N, so it will have all updates, and if the consumer CSN csn-C >= csn-N, the anchor CSN csn-C should always be found.
- If the RUV used for total init does not contain a CSN for RID N, the server has not seen any changes for this RID and ALL changes will be received via replication, so no CSN for RID N should ever be missing.

But it does happen. A potential scenario is in TEST #2, and the creation of a keep-alive entry is a special case of this scenario, where the ADD is internal, not external. I have opened ticket #49000 for this.

I think that with the fixes for #48995, #48999 and #49000 we should be safe not to use a fallback to an alternative anchor CSN. If an anchor CSN can no longer be found, it is because it was purged, or the server was initialized with a RUV newer than the missing CSN (TEST #1), or it accepted an update before it was in sync after an init (ticket #48976). But in these cases the error is not permanent: if the consumer is updated by another server, the required anchor CSN can increase and be found in the changelog. So a missing anchor CSN should, in my opinion, not switch the agreement to FATAL state but to BACKOFF.

So to summarize, here is my suggestion:
- implement fixes for #48995, #48999 and #49000
- treat a missing anchor CSN as an error
- switch to backoff for a missing anchor CSN

There might be one reason to keep the choice of an alternative anchor CSN: I have seen customer requests to keep replication going even if there are some inconsistencies; they want to schedule a reinit or a restore at their convenience, and not have an environment where replication is completely broken.

For completeness, here are my test scenarios.

TEST #1 (the disabling and enabling of replication agreements is to enforce a specific timing and replication flow; it could happen like this by itself as well):
- have 3 masters A, B, C in triangle replication
- total init A-->B
- disable agreement A-->B
- disable agreement C-->B
- add an entry cn=X on A
- total init A-->C
- add entries cn=Y1,...,cn=Y10 on A
- enable agreement C-->B again

Result: with patch 48766, entry cn=X is missing on B; with the version before 48766, cn=X and cn=Y1,...,cn=Y9 are missing.

TEST #2:
- masters A, B
- start a sequence of adds to A, adding cn=E1, cn=E2, ...
- while this is running, start a total init A-->B
- when the total init is done, the incremental update starts and we see a couple of messages on B:
  urp_add: csn=..... entry with cn=Ex already exists
- the CSNs reported in these messages are then NOT in the changelog of B
}}}
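To make the suggested backoff handling concrete, here is a minimal Python sketch of the decision argued for in the quoted message. It is an illustration only: the names (Changelog, start_incremental_session) are hypothetical, and the real logic lives in the C code of 389-ds-base.

{{{
# Hypothetical sketch of the suggested anchor-CSN handling; the names are
# illustrative and do not correspond to the actual 389-ds implementation.
OK, BACKOFF = "ok", "backoff"

class Changelog(object):
    """Toy stand-in for the replication changelog."""
    def __init__(self, csns):
        self.csns = set(csns)

    def contains(self, csn):
        return csn in self.csns

def start_incremental_session(changelog, anchor_csn):
    """Return how an incremental session should react to the anchor CSN."""
    if changelog.contains(anchor_csn):
        # Normal case: resume sending the updates that follow anchor_csn.
        return OK
    # The anchor CSN was purged, the consumer was initialized from a RUV
    # newer than the missing CSN (TEST #1), or it accepted an update too
    # early after an init (#48976).  The gap can heal when the consumer
    # is updated by another supplier, so the error is transient: back off
    # and retry later instead of switching the agreement to FATAL.
    return BACKOFF

# The pre-#48766 code would instead pick an alternative start CSN here
# (the minCSN of the supplier RUV), which TEST #1 shows can lose changes.
assert start_incremental_session(Changelog(["csn-1"]), "csn-2") == BACKOFF
}}}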
Do we need to install the hotfix on consumers (and hubs) as well? From the main description I understand that only if consumers are updated by several masters might we not have to install the hotfix on them; if consumers are only updated by one master, we should install the hotfix on consumers too, not just on masters. Is this correct?
The fix affects the behaviour of a replication agreement, so it has to be installed on all servers which actively replicate to others: masters and hubs. On consumers the code is not executed.
From IT: they tested the hotfix in dev and did not see the issue appear. However, they are reluctant to call it "fixed", as this issue is intermittent and only seems to appear in higher environments where there is more traffic. Thanks!
A new configuration parameter, nsds5ReplicaIgnoreMissingChange, has been introduced for the replication agreement. It takes one of three values:
- never | off: treat a missing CSN as fatal
- once | on: ignore a missing CSN once; treat the second missing CSN as fatal
- always: ignore missing CSNs

The default value is once | on. An example of setting the parameter is shown after this comment.

Ludwig: could you please review this comment on the new param? Thanks!
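For illustration, setting the parameter on an existing agreement with python-ldap might look like the sketch below. The host, credentials, and agreement DN are placeholders, not values taken from this bug.

{{{
import ldap

# Placeholder DN and credentials -- substitute your own agreement entry.
AGMT_DN = ('cn=example-agreement,cn=replica,'
           'cn="dc=example,dc=com",cn=mapping tree,cn=config')

conn = ldap.initialize('ldap://supplier.example.com:389')
conn.simple_bind_s('cn=Directory Manager', 'password')

# "once" (the default) tolerates a single missing anchor CSN; "never"
# keeps the fatal behaviour; "always" ignores every missing CSN.
conn.modify_s(AGMT_DN, [(ldap.MOD_REPLACE,
                         'nsds5ReplicaIgnoreMissingChange', [b'once'])])
conn.unbind_s()
}}}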
The needinfo was handled in the knowledge base article.
========================================================== test session starts ==========================================================
platform linux2 -- Python 2.7.5, pytest-3.0.7, py-1.4.33, pluggy-0.4.0 -- /usr/bin/python
cachedir: .cache
metadata: {'Python': '2.7.5', 'Platform': 'Linux-3.10.0-663.el7.x86_64-x86_64-with-redhat-7.4-Maipo', 'Packages': {'py': '1.4.33', 'pytest': '3.0.7', 'pluggy': '0.4.0'}, 'Plugins': {'beakerlib': '0.7.1', 'html': '1.14.2', 'cov': '2.5.1', 'metadata': '1.5.0'}}
DS build: 1.3.6.1
389-ds-base: 1.3.6.1-14.el7
nss: 3.28.4-6.el7
nspr: 4.13.1-1.0.el7_3
openldap: 2.4.44-4.el7
svrcore: 4.1.3-2.el7
rootdir: /export/tests, inifile:
plugins: metadata-1.5.0, html-1.14.2, cov-2.5.1, beakerlib-0.7.1
collected 1 items

tickets/ticket49020_test.py::test_ticket49020 PASSED

======================================================= 1 passed in 58.91 seconds =======================================================

Additionally, the reliab15 test didn't show any issues with replication. Marking as VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2086