This bug is created as a clone of upstream ticket: https://fedorahosted.org/389/ticket/49020

There have been many tickets and fixes about managing CSNs in a replication session, and the topic is not yet fully settled. There are situations where replication should back off instead of going into a fatal state. A summary of the problem and its status was discussed on a mailing list and is cited here:

{{{
Recently I was haunted by the problem of whether and when we should pick an alternative start CSN in a replication session if the calculated anchor CSN cannot be found in the changelog. This was investigated during the work on ticket #48766 (skipped changes) and the resulting failures in reliab15, and picked up again with the recent customer problems on 7.2 where a part of that fix was missing.

I was trying to analyze the scenarios which can arise, to finally be able to answer these two core questions:

1] If an anchor CSN cannot be found, should we choose an alternative starting CSN (and under which conditions), or should the replication session be aborted?
2] If the answer to 1] is abort, should this be a fatal error or a transient one?

I have been moving in circles, but I hope to have a good understanding of the problem now and would really like to get this finally settled, so please read the following, even if it is a bit lengthy, and challenge my arguments.

Let's start by looking back at the state before the "skipped changes" fix:
- if an anchor CSN was not found and the purge RUV in the changelog did not contain CSNs, an alternative start CSN was chosen
- the alternative was the minCSN of the supplier RUV
- after an online init, a start-iteration record was written to the changelog, corresponding to the minCSN in the RUV after init

This worked, but while working on the "skipped changes" problem I noticed that the selection of the alternative CSN could lead to a loss of many changes (I have a test case to demonstrate this - TEST #1).

So, under the assumptions that we should either find an anchor or break, and that the existing mechanism to select an alternative was incorrect, this fallback was removed in the original fix, patch48766-0.

Unfortunately reliab15 failed with this patch, tracked in ticket #48954. A first analysis showed that the failure occurred after init, when an anchor CSN was looked up for a replica ID which did not have a start-iteration record. A first attempt to fix it was to log start-iteration records for all RIDs with CSNs contained in the RUV after init: patch48954-1. This did not completely solve the initial phase of reliab15, and we decided to go back to the method of selecting an alternative start CSN, but choosing a better one, as close as possible to the missing CSN: patch48954-2. This resolved the reliab15 problem and is the current state.

In between, Thierry detected that this change also changed behaviour if a replica accepted updates too early after initialization (#48976). And I found the test case TEST #1, where with the old selection of the alternative many changes can be lost, but with the new method still one change is lost.

So I looked more closely at the failures we had in reliab15 and noticed that one reason patch48954-1 did not work was that in the process of initializations M1->M2->... the RUV for most replica IDs only contained a purl, but no CSN. This could be improved by fixing tickets #48995 and #48999. With a fix for these tickets the reliab15 scenario worked without the need to choose an alternative anchor CSN.

So I am back to the question of what could be a valid scenario where an anchor CSN cannot be found.

From the following it should not happen:
- If the RUV used for total init does contain a CSN for RID N, a start-iteration csn-N will be created for this csn-N; the server will only receive updates csn-X > csn-N, so it will have all updates, and if the consumer CSN csn-C >= csn-N, the anchor CSN csn-C should always be found.
- If the RUV used for total init does not contain a CSN for RID N, the server has not seen any changes for this RID and ALL changes will be received via replication, so no CSN for RID N should ever be missing.

But it does happen. A potential scenario is in TEST #2, and the creation of a keep-alive entry is a special case of this scenario, where the ADD is internal, not external. I have opened ticket #49000 for this.

I think that with the fixes for #48995, #48999 and #49000 we should be safe not to use a fallback to an alternative anchor CSN. If an anchor CSN can no longer be found, it is because it was purged, or the server was initialized with a RUV newer than the missing CSN (TEST #1), or it accepted an update before it was in sync after an init (ticket #48976). But in these cases the error is not permanent: if the consumer is updated by another server, the required anchor CSN can increase and be found in the changelog. So a missing anchor CSN should, in my opinion, not switch the agreement to FATAL state but to BACKOFF.

So to summarize, here is my suggestion:
- implement fixes for #48995, #48999 and #49000
- treat a missing anchor CSN as an error
- switch to backoff for a missing anchor CSN

There might be one reason to keep the choice of an alternative anchor CSN: I have seen customer requests to keep replication going even if there are some inconsistencies; they want to schedule a reinit or a restore at their convenience, and not have an environment where replication is completely broken.

For completeness, here are my test scenarios.

TEST #1 (the disabling and enabling of replication agreements is to enforce a specific timing and replication flow; it could happen like this by itself as well):
- have 3 masters A, B, C in triangle replication
- total init A-->B
- disable agreement A-->B
- disable agreement C-->B
- add an entry cn=X on A
- total init A-->C
- add entries cn=Y1,...,cn=Y10 on A
- enable agreement C-->B again

Result: with patch 48766, entry cn=X is missing on B; with the version before 48766, cn=X and cn=Y1,...,cn=Y9 are missing.

TEST #2:
- masters A, B
- start a sequence of adds to A, adding cn=E1, cn=E2, ...
- while this is running, start a total init A-->B
- when the total init is done, the incremental update starts and we see a couple of messages on B:
  urp_add: csn=..... entry with cn=Ex already exists
- the CSNs reported in these messages are then NOT in the changelog of B
}}}
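To make the suggested backoff handling concrete, here is a minimal Python sketch of the decision argued for in the quoted message. It is an illustration only: the names (Changelog, start_incremental_session) are hypothetical, and the real logic lives in the C code of 389-ds-base.

{{{
# Hypothetical sketch of the suggested anchor-CSN handling; the names are
# illustrative and do not correspond to the actual 389-ds implementation.
OK, BACKOFF = "ok", "backoff"

class Changelog(object):
    """Toy stand-in for the replication changelog."""
    def __init__(self, csns):
        self.csns = set(csns)

    def contains(self, csn):
        return csn in self.csns

def start_incremental_session(changelog, anchor_csn):
    """Return how an incremental session should react to the anchor CSN."""
    if changelog.contains(anchor_csn):
        # Normal case: resume sending the updates that follow anchor_csn.
        return OK
    # The anchor CSN was purged, the consumer was initialized from a RUV
    # newer than the missing CSN (TEST #1), or it accepted an update too
    # early after an init (#48976).  The gap can heal when the consumer
    # is updated by another supplier, so the error is transient: back off
    # and retry later instead of switching the agreement to FATAL.
    return BACKOFF

# The pre-#48766 code would instead pick an alternative start CSN here
# (the minCSN of the supplier RUV), which TEST #1 shows can lose changes.
assert start_incremental_session(Changelog(["csn-1"]), "csn-2") == BACKOFF
}}}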
Do we need to install the hotfix on consumers (and hubs) as well? From the main description I understand that only if consumers are updated by several masters might we not have to install the hotfix on them; if consumers are only updated by one master, we should install the hotfix on consumers too, not just on masters. Is this correct?
The fix affects the behaviour of a replication agreement, so it has to be installed on all servers which actively replicate to others: masters and hubs. On consumers the code is not executed.
From IT: they tested the hotfix in dev and did not see the issue appear. However, they are reluctant to call it "fixed", as this issue is intermittent and only seems to appear in higher environments where there is more traffic. Thanks!
A new configuration parameter, nsds5ReplicaIgnoreMissingChange, has been introduced for the replication agreement. It takes one of three values:
- never | off: treat a missing CSN as fatal
- once | on: ignore a missing CSN once; treat the second missing CSN as fatal
- always: ignore missing CSNs

The default value is once | on. An example of setting the parameter is shown after this comment.

Ludwig: could you please review this comment on the new param? Thanks!
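For illustration, setting the parameter on an existing agreement with python-ldap might look like the sketch below. The host, credentials, and agreement DN are placeholders, not values taken from this bug.

{{{
import ldap

# Placeholder DN and credentials -- substitute your own agreement entry.
AGMT_DN = ('cn=example-agreement,cn=replica,'
           'cn="dc=example,dc=com",cn=mapping tree,cn=config')

conn = ldap.initialize('ldap://supplier.example.com:389')
conn.simple_bind_s('cn=Directory Manager', 'password')

# "once" (the default) tolerates a single missing anchor CSN; "never"
# keeps the fatal behaviour; "always" ignores every missing CSN.
conn.modify_s(AGMT_DN, [(ldap.MOD_REPLACE,
                         'nsds5ReplicaIgnoreMissingChange', [b'once'])])
conn.unbind_s()
}}}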
The needinfo was handled in the knowledge base article.
========================================================== test session starts ==========================================================
platform linux2 -- Python 2.7.5, pytest-3.0.7, py-1.4.33, pluggy-0.4.0 -- /usr/bin/python
cachedir: .cache
metadata: {'Python': '2.7.5', 'Platform': 'Linux-3.10.0-663.el7.x86_64-x86_64-with-redhat-7.4-Maipo', 'Packages': {'py': '1.4.33', 'pytest': '3.0.7', 'pluggy': '0.4.0'}, 'Plugins': {'beakerlib': '0.7.1', 'html': '1.14.2', 'cov': '2.5.1', 'metadata': '1.5.0'}}
DS build: 1.3.6.1
389-ds-base: 1.3.6.1-14.el7
nss: 3.28.4-6.el7
nspr: 4.13.1-1.0.el7_3
openldap: 2.4.44-4.el7
svrcore: 4.1.3-2.el7
rootdir: /export/tests, inifile:
plugins: metadata-1.5.0, html-1.14.2, cov-2.5.1, beakerlib-0.7.1
collected 1 items

tickets/ticket49020_test.py::test_ticket49020 PASSED

======================================================= 1 passed in 58.91 seconds =======================================================

Additionally, the reliab15 test didn't show any issues with replication. Marking as VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2086