Bug 1460070 - 1.3.5.10-20 replication halt - pending list first CSN not committed, pending list increasing
1.3.5.10-20 replication halt - pending list first CSN not committed, pending ...
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: 389-ds-base (Show other bugs)
7.3
All Linux
urgent Severity urgent
: rc
: ---
Assigned To: mreynolds
Viktor Ashirov
: ZStream
Depends On:
Blocks: 1420851 1472344 1476161
  Show dependency treegraph
 
Reported: 2017-06-08 21:01 EDT by Marc Sauton
Modified: 2018-04-10 10:17 EDT (History)
6 users (show)

See Also:
Fixed In Version: 389-ds-base-1.3.7.5-4.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1476161 (view as bug list)
Environment:
Last Closed: 2018-04-10 10:16:50 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:0811 None None None 2018-04-10 10:17 EDT

  None (edit)
Description Marc Sauton 2017-06-08 21:01:51 EDT
Description of problem:

This is about a 389-ds-base-1.3.5.10-20 LDAP replication halt with a pending list first's CSN not committed, and pending list increasing

The issue happens randomly in a User Acceptance Testing environment.
There is no real pattern in terms of timeline.
In one case, it took 15 days to happen
According to the dse.ldif files, the Referential Integrity plugin is disabled on both RHDS instances.

Thierry B. started the investigation, and in the QE sync meeting, it was decided to open a new RHEL-7.5 bz to track and continue the investigation, so I tried to capture some key comments from the customer sf case 01717173 in this bug report, I edited some log entries.


Version-Release number of selected component (if applicable):
RHDS-10
389-ds-base-1.3.5.10-20
RHEL-7.3

How reproducible:
N/A

Steps to Reproduce:
1. N/A
2.
3.

Actual results:
LDAP replication halt

Expected results:


Additional info:

sf 01717173 comment 205
    - logsupplier.7z did not capture the begining of problem

    - System needs to be restarted  so that pending list is cleared and replication resume

    - Then with this level of log, do the monitoring until the problem happen again and then provide the logs
        Looking at the log, you can detect that the problem occurs again with

         --- Problem occurs ----

        [07/Jun/2017:18:41:10.387503690 +0200] NSMMReplicationPlugin - csnplCommitALL: committing all csns for csn 59382ca5000000660000                                                                 
        [07/Jun/2017:18:41:10.388550743 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 59352756000700660000
        [07/Jun/2017:18:41:10.389608033 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 593527ef000000660000
        [07/Jun/2017:18:41:10.390390745 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 593527ef000100660000
        ...
        <many csn, and increasing list >
        ...
        [07/Jun/2017:18:41:10.806065565 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 59382ca5000000660000
        [07/Jun/2017:18:41:10.806715673 +0200] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 59382ca5000000660000


         --- No problem ----
        [07/Jun/2017:18:57:36.898955934 +0200] NSMMReplicationPlugin - csnplCommitALL: committing all csns for csn 59383081000000660000                                                         
        [07/Jun/2017:18:57:36.899544920 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 59383081000000660000
        [07/Jun/2017:18:57:36.900929211 +0200] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 59383081000000660000


--

sf 01717173 comment 198

Backends:

cn: ou=BusinessOrganizations,dc=example,dc=com
nsslapd-backend: BusinessOrganizations

cn: "dc=example,dc=com"
nsslapd-backend: example

cn: ou=customer_people,dc=example,dc=com
nsslapd-backend: customer_people

cn: ou=disable,dc=example,dc=com
nsslapd-backend: disable



# supplier 02 direct update
[05/Jun/2017:11:17:53.914783248 +0200] conn=47084 fd=136 slot=136 connection from 10.64.0.61 to 10.64.0.217                         
[05/Jun/2017:11:17:53.914961066 +0200] conn=47084 op=0 BIND dn="uid=ldapuser,ou=someou,ou=people,dc=example,dc=com" method=128 version=3
[05/Jun/2017:11:17:53.916581311 +0200] conn=47084 op=0 RESULT err=0 tag=97 nentries=0 etime=0.002000 dn="uid=ldapuser,ou=someou,ou=people,dc=example,dc=com"

# ou=customer_people backend
[05/Jun/2017:11:41:43.111177199 +0200] conn=47084 op=753 MOD dn="uid=user2,ou=customer_people,dc=example,dc=com"
[05/Jun/2017:11:41:43.123433195 +0200] conn=47084 op=753 RESULT err=0 tag=103 nentries=0 etime=0.013000 csn=59352756000500660000    

# example
[05/Jun/2017:11:41:43.221039081 +0200] conn=47084 op=756 MOD dn="cn=custportal_maestro,cn=groups,dc=example,dc=com"
[05/Jun/2017:11:41:43.537226332 +0200] conn=47084 op=756 RESULT err=0 tag=103 nentries=0 etime=0.384000 csn=59352757000100660000

# ou=customer_people backend
[05/Jun/2017:11:44:16.376262019 +0200] conn=47084 op=771 ADD dn="uid=user2,ou=customer_people,dc=example,dc=com"
[05/Jun/2017:11:44:16.431234609 +0200] conn=47084 op=771 RESULT err=0 tag=105 nentries=0 etime=0.055000 csn=593527ef000000660000 


#
# First update on customer_people was successfull
# and soon taken into consideration during replication to VRHNSDLDABEN01
#
[05/Jun/2017:11:41:43.111558916 +0200] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 59352756000500660000 into pending list                                                                                                                 
[05/Jun/2017:11:41:43.112361201 +0200] NSMMReplicationPlugin - Purged state information from entry uid=user2,ou=customer_people,dc=example,dc=com up to CSN 592becd6000400660000
[05/Jun/2017:11:41:43.119285001 +0200] NSMMReplicationPlugin - changelog program - _cl5GetDBFileByReplicaName: found DB object 7fbfcdbd0600 for database /var/lib/dirsrv/slapd-someou/changelogdb/c571cf82-2b1411e7-a7b49621-53892c26_48689979000000040000.db
[05/Jun/2017:11:41:43.119952809 +0200] NSMMReplicationPlugin - changelog program - _cl5GetDBFileByReplicaName: found DB object 7fbfcdbd0600 for database /var/lib/dirsrv/slapd-someou/changelogdb/c571cf82-2b1411e7-a7b49621-53892c26_48689979000000040000.db
[05/Jun/2017:11:41:43.120590368 +0200] NSMMReplicationPlugin - csnplCommitALL: committing all csns for csn 59352756000500660000
[05/Jun/2017:11:41:43.121243726 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 59352756000500660000
[05/Jun/2017:11:41:43.121924157 +0200] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 59352756000500660000
[05/Jun/2017:11:41:43.247287550 +0200] NSMMReplicationPlugin - changelog program - _cl5GetDBFile: found DB object 7fbfcdbd0600 for database /var/lib/dirsrv/slapd-someou/changelogdb/c571cf82-2b1411e7-a7b49621-53892c26_48689979000000040000.db
[05/Jun/2017:11:41:43.248544997 +0200] _cl5PositionCursorForReplay (agmt="cn=VRHNSDLDABEN01" (VRHNSDLDABEN01:389)): Consumer RUV:
[05/Jun/2017:11:41:43.249271282 +0200] NSMMReplicationPlugin - agmt="cn=VRHNSDLDABEN01" (VRHNSDLDABEN01:389): {replicageneration} 48689979000000040000
[05/Jun/2017:11:41:43.250040383 +0200] NSMMReplicationPlugin - agmt="cn=VRHNSDLDABEN01" (VRHNSDLDABEN01:389): {replica 102 ldap://supplier02:389} 5719df06000000660000 59352755000000660000 00000000
[05/Jun/2017:11:41:43.250708816 +0200] _cl5PositionCursorForReplay (agmt="cn=VRHNSDLDABEN01" (VRHNSDLDABEN01:389)): Supplier RUV:
[05/Jun/2017:11:41:43.251459821 +0200] NSMMReplicationPlugin - agmt="cn=VRHNSDLDABEN01" (VRHNSDLDABEN01:389): {replicageneration} 48689979000000040000                                                                                                                  
[05/Jun/2017:11:41:43.252345308 +0200] NSMMReplicationPlugin - agmt="cn=VRHNSDLDABEN01" (VRHNSDLDABEN01:389): {replica 102 ldap://supplier02:389} 5719df06000000660000 59352756000500660000 59352756

# This internal update on the same entry as 'conn=47084 op=753' was inserted in the pending list, AFTER 59352756000500660000 was committed
[05/Jun/2017:11:41:43.515252373 +0200] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 59352756000700660000 into pending list                                         
[05/Jun/2017:11:41:43.516471672 +0200] NSMMReplicationPlugin - Purged state information from entry uid=user2,ou=customer_people,dc=example,dc=com up to CSN 592becd6000500660000

#
# second update on customer_people was successfull
[05/Jun/2017:11:44:16.379138694 +0200] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 593527ef000000660000 into pending list
[05/Jun/2017:11:44:16.426047122 +0200] NSMMReplicationPlugin - changelog program - _cl5GetDBFileByReplicaName: found DB object 7fbfcdbd0600 for database /var/lib/dirsrv/slapd-someou/changelogdb/c571cf82-2b1411e7-a7b49621-53892c26_48689979000000040000.db               
[05/Jun/2017:11:44:16.426826618 +0200] NSMMReplicationPlugin - changelog program - _cl5GetDBFileByReplicaName: found DB object 7fbfcdbd0600 for database /var/lib/dirsrv/slapd-someou/changelogdb/c571cf82-2b1411e7-a7b49621-53892c26_48689979000000040000.db
[05/Jun/2017:11:44:16.427577481 +0200] NSMMReplicationPlugin - csnplCommitALL: committing all csns for csn 593527ef000000660000
[05/Jun/2017:11:44:16.428286087 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 59352756000700660000
[05/Jun/2017:11:44:16.428907537 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 593527ef000000660000
[05/Jun/2017:11:44:16.429511023 +0200] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 593527ef000000660000


    - Referential integrity was configured with 0 delay
        dn: cn=referential integrity postoperation,cn=plugins,cn=config
        referint-update-delay: 0



In Conclusion:
        - The problematic csn (59352756000700660000) is smaller than the second update (593527ef000000660000) so it was not related
          to the second update
        - The problematic csn (59352756000700660000) is looking like triggered by the first update (59352756000500660000), but it was inserted well after
          the first update was inserted/committed
        - No changelog or ruv_update_ruv failure, in that period so it is looking like 59352756000700660000 was not a primary csn and is waiting
          for its primary to be committed.
        - This is really looking like 59352756000700660000 was triggered by 59352756000500660000 but for some reason differed.
          Suspecting a mechanism like referential integrity delay.

Next steps:
----------

    - Still investigating data

    - As it seems that the problem occurs quite frequently would it be possible to remove 'referint-update-delay'
      from referential integrity.
      This is JUST A TEST to see if this delay (even if set to 0) could be problematic for the pending list mechanism

--


sf 01717173 comment 192


// Internal

  - We can see the same behavior where pending list is growing
[05/Jun/2017:11:41:43.515252373 +0200] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 59352756000700660000 into pending list
...
[05/Jun/2017:11:44:16.379138694 +0200] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 593527ef000000660000 into pending list
[05/Jun/2017:11:44:16.427577481 +0200] NSMMReplicationPlugin - csnplCommitALL: committing all csns for csn 593527ef000000660000
[05/Jun/2017:11:44:16.428286087 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 59352756000700660000
[05/Jun/2017:11:44:16.428907537 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 593527ef000000660000
[05/Jun/2017:11:44:16.429511023 +0200] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 593527ef000000660000
...
[05/Jun/2017:11:44:16.437928762 +0200] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 593527ef000100660000 into pending list
[05/Jun/2017:11:44:16.548065274 +0200] NSMMReplicationPlugin - csnplCommitALL: committing all csns for csn 593527ef000100660000
[05/Jun/2017:11:44:16.548710768 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 59352756000700660000
[05/Jun/2017:11:44:16.549291727 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 593527ef000000660000
[05/Jun/2017:11:44:16.549948221 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 593527ef000100660000
[05/Jun/2017:11:44:16.550567026 +0200] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 593527ef000100660000
...
[05/Jun/2017:11:44:16.557933493 +0200] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 593527ef000200660000 into pending list
[05/Jun/2017:11:44:16.562207533 +0200] NSMMReplicationPlugin - csnplCommitALL: committing all csns for csn 593527ef000200660000
[05/Jun/2017:11:44:16.562900880 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 59352756000700660000
[05/Jun/2017:11:44:16.563508542 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 593527ef000000660000
[05/Jun/2017:11:44:16.564065181 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 593527ef000100660000
[05/Jun/2017:11:44:16.564652151 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 593527ef000200660000
[05/Jun/2017:11:44:16.565320680 +0200] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 593527ef000200660000
...
[05/Jun/2017:11:44:16.568992234 +0200] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 593527ef000300660000 into pending list
[05/Jun/2017:11:44:16.571981810 +0200] NSMMReplicationPlugin - csnplCommitALL: committing all csns for csn 593527ef000300660000
[05/Jun/2017:11:44:16.572798599 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 59352756000700660000
[05/Jun/2017:11:44:16.573414913 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 593527ef000000660000
[05/Jun/2017:11:44:16.573984644 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 593527ef000100660000
[05/Jun/2017:11:44:16.574555936 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 593527ef000200660000
[05/Jun/2017:11:44:16.575151028 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 593527ef000300660000
[05/Jun/2017:11:44:16.575826794 +0200] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 593527ef000300660000
...
[05/Jun/2017:11:44:16.577159531 +0200] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 593527ef000400660000 into pending list
[05/Jun/2017:11:44:16.583712655 +0200] NSMMReplicationPlugin - csnplCommitALL: committing all csns for csn 593527ef000400660000
[05/Jun/2017:11:44:16.584430973 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 59352756000700660000
[05/Jun/2017:11:44:16.585109104 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 593527ef000000660000
[05/Jun/2017:11:44:16.585914672 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 593527ef000100660000
[05/Jun/2017:11:44:16.586931469 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 593527ef000200660000
[05/Jun/2017:11:44:16.587924339 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 593527ef000300660000
[05/Jun/2017:11:44:16.588546631 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 593527ef000400660000
[05/Jun/2017:11:44:16.589199542 +0200] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 593527ef000400660000

  - Curiously we do not see a message like 
   "write_changelog_and_ruv - Can't add a change for..." that could be the signature of a DB_deadlock
   and the issue #49238

--


sf 01717173 comment 185

Looking at supplier7z log:

    - Each instance shares the same replicaID for each suffix. This is valid but is more difficult to diagnose

    - All replication session from the supplier are on {replicageneration} 5719cf8a000000660000
      (replica "dc=example,dc=com")
    
    - Looking at update on replica "ou=customer_people,dc=example,dc=com"
      More specifically at the update csn=592ef213000000660000 
    [31/May/2017:18:40:51.801495422 +0200] conn=398718 op=2 MOD dn="uid=user1,ou=customer_people,dc=example,dc=com"
    [31/May/2017:18:40:53.201942400 +0200] conn=398718 op=2 RESULT err=0 tag=103 nentries=0 etime=1.401000 csn=592ef213000000660000
    This update will not be replicated because there is no outgoing replication session (I do not know the reason)


    - When this update is committed, it appears that the pending list 
      of replica "ou=customer_people,dc=example,dc=com" already contains more than 1100 CSNs
        [31/May/2017:18:40:51.806799712 +0200] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 592ef213000000660000 into pending list
        [31/May/2017:18:40:52.124625858 +0200] NSMMReplicationPlugin - csnplCommitALL: committing all csns for csn 592ef213000000660000
        [31/May/2017:18:40:52.125580696 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 592406ac000000660000
        ...
        <1102 csnplCommitALL: processing data csn>
        ...
        [31/May/2017:18:40:53.196794613 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 592ef02e000000660000
        [31/May/2017:18:40:53.197622451 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 592ef062000000660000     
        [31/May/2017:18:40:53.198466638 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 592ef213000000660000
        [31/May/2017:18:40:53.199415469 +0200] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 592ef213000000660000
      

    - The situation of long pending lists exist for a long time
        [31/May/2017:09:44:20.556455740 +0200] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 592e7454000000660000 into pending list
        [31/May/2017:09:44:20.560077268 +0200] NSMMReplicationPlugin - csnplCommitALL: committing all csns for csn 592e7454000000660000     
        [31/May/2017:09:44:20.560779238 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 592406ac000000660000
        ...
        [31/May/2017:09:44:21.255133406 +0200] NSMMReplicationPlugin - csnplCommitALL: processing data csn 592e7454000000660000             
[31/May/2017:09:44:21.255756010 +0200] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 592e7454000000660000

            The pending list is increasing with newly committed update but is not rollUp.
            That means that the first csn in the pending list is not committed: 592406ac000000660000
            So rollUp can not find a max_csn to set into the RUV

    
    In conclusion:
        - You should consider to set unique replicaID in the full topology
          for example:
nsDS5ReplicaRoot: dc=example,dc=com
        ReplicaID: 100, 101, 102...
nsDS5ReplicaRoot: ou=BusinessOrganizations,dc=example,dc=com
        ReplicaID: 200, 201, 202...
nsDS5ReplicaRoot: ou=customer_people,dc=example,dc=com
        ReplicaID: 300, 301, 302...
nsDS5ReplicaRoot: ou=disable,dc=example,dc=com
        ReplicaID: 400, 401, 402...

      - Need to understand why there is no outgoing session for replica "ou=customer_people,dc=example,dc=com"
        Checking the RA status ?

      - A specific issue in "ou=customer_people,dc=example,dc=com" pending list
        is that 592406ac000000660000 was left uncommited
        (Could be a side effect of  https://pagure.io/389-ds-base/issue/49238 like Ludwig also pointed out, but the scenario is not crystal clear)
        Would you try to retrieve in error logs if there are DB_DEADLOCK when 592406ac000000660000 was applied
Comment 2 thierry bordaz 2017-06-13 08:54:25 EDT
Upstream ticket https://pagure.io/389-ds-base/issue/49287
Comment 4 Marc Sauton 2017-07-17 12:46:18 EDT
There has been a great collaboration for a patch in upstream ticket
https://pagure.io/389-ds-base/issue/49287
(Thank Ludwig, Thierry, William !)

for a sosreport provide in the salesforce case number 01717173 , the customer's consumer system is with
420-sosreport-EVAIN.01717173-20170607105939-consumer.tar.xz
389-ds-base-1.3.5.10-20.el7_3.x86_64
redhat-release-server-7.2-9.el7.x86_64

the patch
0001-Ticket-49287-v3-extend-csnpl-handling-to-multiple-ba.patch
from
https://pagure.io/389-ds-base/issue/raw/6d855f63e2e6968692eb06332c322aedbdc3e528232e63881344077864be75ec-0001-Ticket-49287-v3-extend-csnpl-handling-to-multiple-ba.patch
seem quite seriosu and "invasive", so:

is it "ok" to provide a hotfix?
I do not know how often this condition happened at the customer's site, but do we want to consider a 7.4.z ?
Comment 5 mreynolds 2017-07-20 11:24:10 EDT
Fixed upstream
Comment 12 errata-xmlrpc 2018-04-10 10:16:50 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0811

Note You need to log in before you can comment on or make changes to this bug.