
Bug 1629055

Summary: ds-replcheck unreliable, showing false positives
Product: Red Hat Enterprise Linux 7 Reporter: Dave <dsimes>
Component: 389-ds-base Assignee: mreynolds
Status: CLOSED ERRATA QA Contact: RHDS QE <ds-qe-bugs>
Severity: unspecified Docs Contact: Marc Muehlfeld <mmuehlfe>
Priority: high    
Version: 7.5 CC: aadhikar, cpelland, dsimes, gparente, jvilicic, lkrispen, mreynolds, nkinder, pasik, rmeggins, spichugi, striker, tbordaz, tmihinto, vashirov
Target Milestone: rc   
Target Release: 7.7   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: 389-ds-base-1.3.9.1-7.el7 Doc Type: Bug Fix
Doc Text:
.The `ds-replcheck` utility no longer incorrectly reports non-matching tombstone entries on replicas
Previously, if an administrator ran the `ds-replcheck` utility on different Directory Server replicas with tombstones present, `ds-replcheck` reported that one of the replicas was missing the tombstone entries. It is expected that tombstone entries do not match on each replica. With this update, `ds-replcheck` no longer searches for tombstone entries. As a result, the utility does not report missing tombstone entries as a problem.
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-08-06 12:58:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Modified version to skip tombstones from missing entries report
none
New ds-replcheck
none
RHEL 7.7 ds-replcheck (python2)
none
ds-replcheck output from RHEL 7.6 version showing additional detail, as opposed to the 7.7 version output posted below
none

Description Dave 2018-09-14 18:13:14 UTC
Description of problem:
This is an IdM server environment with 6 servers across 3 data centers.
Periodic runs of ds-replcheck report issues even though the data on all servers has been verified as correct by running ldapsearch against each server and comparing the results. At this point, ds-replcheck is not a reliable tool the customer can use to verify LDAP replication consistency.

Version-Release number of selected component (if applicable):
kernel-3.10.0-862.9.1.el7.x86_64
ipa-server-4.5.4-10.el7_5.3.x86_64
389-ds-base-1.3.7.5-24.el7_5.x86_64

How reproducible:
The issue shows up on a regular basis; the case has more details on which data is reported as inconsistent.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
this is for a Red Hat Public Sector customer
troubleshooting and additional info/details in case https://access.redhat.com/support/cases/#/case/02150572

Comment 2 mreynolds 2018-09-14 19:01:57 UTC
From what I can tell the issue is that tombstones are showing up in the report as missing entries.  Correct?  This should have been clearly stated in the bug, but pointing to the case notes to get the details is not appropriate IMO.

Anyway, if that is the issue, it is probably a false positive as already stated.  We were actually going to remove tombstones from the missing entry report, but we weren't sure if it was going to be a problem or not.  Well, it's apparently a problem, so we will work on adding a new option to ignore them.

Comment 5 Dave 2018-09-17 19:42:26 UTC
(In reply to mreynolds from comment #2)
> From what I can tell the issue is that tombstones are showing up in the
> report as missing entries.  Correct?  This should have been clearly stated
> in the bug, but pointing to the case notes to get the details is not
> appropriate IMO.
> 
> Anyway, if that is the issue, it is probably a false positive as already
> stated.  We were actually going to remove tombstones from the missing entry
> report, but we weren't sure if it was going to be a problem or not.  Well
> it's apparently a problem so we will work on adding a new option to ignore
> them.

The customer was asked to run it with the following exclude options:

# ds-replcheck -D "cn=directory manager" -W -m ldap://ssc-prd-ipa-099 -r ldap://cl-rhm-0252 -b dc=masked,dc=domain --ignore memberof,idnssoaserial,entryusn,krblastsuccessfulauth,krblastfailedauth,krbloginfailedcount,nsuniqueid

but ds-replcheck is still showing periodic issues

Should excluding nsuniqueid cause tombstone entries to be ignored?
Either way, it did not resolve the issue for the customer.

Comment 6 mreynolds 2018-09-17 19:50:55 UTC
(In reply to Dave from comment #5)
> 
> The customer was asked to run it with the following exclude options:
> 
> # ds-replcheck -D "cn=directory manager" -W -m ldap://ssc-prd-ipa-099 -r
> ldap://cl-rhm-0252 -b dc=masked,dc=domain --ignore
> memberof,idnssoaserial,entryusn,krblastsuccessfulauth,krblastfailedauth,
> krbloginfailedcount,nsuniqueid
> 
> but ds-replcheck is still showing periodic issues
> 
> should excluding nsuniqueid be ignoring tombstone entries? 
> however, this did not resolve the issue for the customer

Yeah, that won't work.  The "exclude/ignore attributes" option only applies when checking the difference between two entries.  It does not impact the "Missing Entries" report.
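To illustrate the distinction, here is a minimal sketch of the two report types. It is a simplified model, not the ds-replcheck source: the function names and the dict-of-lists entry layout are assumptions made for the example.

```python
# Simplified, hypothetical model of the two reports; not the ds-replcheck code.

def normalize(attrs):
    """Lower-case attribute names and sort values so ordering does not matter."""
    return {name.lower(): sorted(values) for name, values in attrs.items()}


def diff_entry(master_attrs, replica_attrs, ignore=()):
    """Return the attribute names that differ, skipping the ignored ones."""
    ignored = {name.lower() for name in ignore}
    m, r = normalize(master_attrs), normalize(replica_attrs)
    return sorted(
        name for name in (set(m) | set(r)) - ignored
        if m.get(name) != r.get(name)
    )


def missing_entries(master, replica):
    """Return DNs present on the master but absent from the replica.

    The ignore list plays no part here: an entry is either present or not.
    """
    return sorted(set(master) - set(replica))


master = {
    "uid=a,dc=example,dc=com": {"cn": ["a"], "entryusn": ["10"]},
    "nsuniqueid=123,uid=b,dc=example,dc=com": {"objectclass": ["nsTombstone"]},
}
replica = {
    "uid=a,dc=example,dc=com": {"cn": ["a"], "entryusn": ["12"]},
}

dn = "uid=a,dc=example,dc=com"
print(diff_entry(master[dn], replica[dn], ignore=["entryUSN"]))
print(missing_entries(master, replica))
```

In this model, ignoring entryusn hides the attribute difference, but the tombstone still appears as a missing entry because the ignore list is never consulted when entries are matched by DN.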

Comment 7 mreynolds 2018-09-17 21:56:34 UTC
Something else I want to mention...

Replication, when under load, is never going to be in-sync at any given moment.  So there are going to be times where the tool reports there are differences (because at that moment there are).  

To help avoid this there is a lag time you can adjust:

https://www.port389.org/docs/389ds/design/repl-diff-tool-design.html#usage

The default is 5 minutes, meaning anything out of sync that's within 5 minutes is ignored, but this might need to be increased depending on the customer's load.  Just an FYI.
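The lag check can be pictured roughly as follows. This is only a sketch under assumptions: the `last_change` field and the function names are hypothetical, and how the real tool derives the time of the last change is not shown in this bug.

```python
import time

DEFAULT_LAG = 300  # seconds; the 5-minute default mentioned above


def is_within_lag(last_change_epoch, lag=DEFAULT_LAG, now=None):
    """Return True if the change is recent enough for a mismatch to be ignored."""
    if now is None:
        now = time.time()
    return (now - last_change_epoch) <= lag


def reportable(differences, lag=DEFAULT_LAG, now=None):
    """Keep only the differences whose last change is older than the lag window."""
    return [
        d for d in differences
        if not is_within_lag(d["last_change"], lag, now)
    ]


# Hypothetical differences: one changed 100 seconds ago, one 400 seconds ago.
now = 1000
differences = [
    {"dn": "uid=recent,dc=example,dc=com", "last_change": 900},
    {"dn": "uid=stale,dc=example,dc=com", "last_change": 600},
]
print([d["dn"] for d in reportable(differences, now=now)])
```

With a larger lag value, more of the recent mismatches fall inside the window and are dropped from the report, which is why a busy deployment may need a higher setting.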

Tomorrow I will prepare a test version of the script (that ignores missing tombstones).  If the customer could test this to make sure it works for them I would appreciate it.

Comment 10 Dave 2018-09-18 14:34:50 UTC
(In reply to mreynolds from comment #7)
> Tomorrow I will prepare a test version of the script (that ignores missing
> tombstones).  If the customer could test this to make sure it works for
> them I would appreciate it.

yes yes, we/they could definitely test! :)

Comment 11 mreynolds 2018-09-18 14:48:05 UTC
Created attachment 1484404 [details]
Modified version to skip tombstones from missing entries report

Comment 12 Dave 2018-09-18 19:45:20 UTC
(In reply to mreynolds from comment #11)
> Created attachment 1484404 [details]
> Modified version to skip tombstones from missing entries report

testing!!

Comment 14 Dave 2018-09-25 13:02:18 UTC
(even set lag to 10 hours - 6000)

Comment 15 mreynolds 2018-09-25 13:14:18 UTC
I'm assuming this is from the "online" mode.  I do see where it is picking it up from, so I need to do another revision.  But, do they see the same missing entries in offline mode (comparing two ldifs from "db2ldif -r")?
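As a rough cross-check of an offline comparison, the DN sets of two LDIF exports can be compared directly. The sketch below is an assumption-laden simplification rather than what ds-replcheck does: it only reads `dn:` lines (including folded and base64-encoded ones), lower-cases DNs for comparison, and ignores attribute values and replication state entirely.

```python
import base64


def read_dns(ldif_text):
    """Return the set of entry DNs (lower-cased) found in LDIF text."""
    # Unfold continuation lines: a line starting with a single space
    # continues the previous line.
    lines = []
    for raw in ldif_text.splitlines():
        if raw.startswith(" ") and lines:
            lines[-1] += raw[1:]
        else:
            lines.append(raw)

    dns = set()
    for line in lines:
        lowered = line.lower()
        if lowered.startswith("dn::"):
            # A double colon marks a base64-encoded value.
            dns.add(base64.b64decode(line[4:].strip()).decode("utf-8").lower())
        elif lowered.startswith("dn:"):
            dns.add(line[3:].strip().lower())
    return dns


def missing_from(master_ldif, replica_ldif):
    """Return DNs present in the master LDIF but absent from the replica LDIF."""
    return sorted(read_dns(master_ldif) - read_dns(replica_ldif))


master_ldif = (
    "dn: uid=a,dc=example,dc=com\ncn: a\n\n"
    "dn: uid=long,ou=People,\n dc=example,dc=com\ncn: long\n"
)
replica_ldif = "dn: uid=a,dc=example,dc=com\ncn: a\n"
print(missing_from(master_ldif, replica_ldif))
```

If a list produced this way disagrees with the tool's offline report for the same two files, that narrows down which side is wrong.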

Comment 17 Dave 2019-01-04 16:33:18 UTC
is there a newer ds-replcheck we can test? the last provided/attached did not solve/fix the customer's issue

Comment 18 mreynolds 2019-01-04 16:37:09 UTC
Created attachment 1518440 [details]
New ds-replcheck

Try this one Dave.  Note: the usage changed quite a bit (it's a lot nicer), so run ds-replcheck --help first and try again...

Comment 20 Akshay Adhikari 2019-05-07 13:54:56 UTC
The version of ds-replcheck in build 389-ds-base-1.3.9.1-5.el7.x86_64 is 1.4, whereas the attached file is version 2.0. Also, the changes in the attachment are not present in /usr/bin/ds-replcheck. Therefore marking it as FAILED_QA.

Comment 22 Dave 2019-05-07 16:04:55 UTC
(In reply to mreynolds from comment #18)
> Created attachment 1518440 [details]
> New ds-replcheck
> 
> Try this one Dave.  Note: the usage changed quite a bit (it's a lot nicer),
> so run ds-replcheck --help first and try again...

/usr/bin/python3 not available on RHEL 7.x (tested on 7.6)

pointing to /usr/bin/python (python2) does not work either, as there are missing modules apparently specific to python3

using rhscl rh-python36-python (rh-python36-runtime) does not work either as there is no ldap module available

does not work for RHEL 7.x, assuming this will also not work for targeted RHEL 7.7 release

unable to test on RHEL 7.x (7.6 attempted)

Comment 23 mreynolds 2019-05-13 14:33:24 UTC
Dave,

looks like these changes were added to RHEL 7.7 (python2 version).  What was the last version the customer tested?

Anyway, I am attaching what is in the latest build for 7.7.  Since RHEL 7.7 is wrapping up really soon it would be great if we could get it verified.

Thanks,
Mark

Comment 24 mreynolds 2019-05-13 14:36:54 UTC
Created attachment 1567987 [details]
RHEL 7.7 ds-replcheck (python2)

Comment 26 Dave 2019-05-13 17:42:23 UTC
(In reply to mreynolds from comment #24)
> Created attachment 1567987 [details]
> RHEL 7.7 ds-replcheck (python2)

This one runs (on 7.6) and looks clean, does not exhibit the previous tombstone issues

Looks good, customer has reviewed the new output

Comment 29 Akshay Adhikari 2019-05-21 09:20:09 UTC
Build Tested: 389-ds-base-1.3.9.1-7.el7.x86_64

1) Create two masters, A and B, but do NOT create replication agreements, etc.

2) Add and Delete a user on Master A

ldapadd -p 39001 -h localhost -D "cn=Directory Manager" -w password << EOF
dn: uid=test-user,ou=People,dc=example,dc=com
changetype: add
uid: test-user
objectClass: top
objectClass: account
objectClass: posixaccount
objectClass: inetOrgPerson
objectClass: person
objectClass: inetUser
objectClass: organizationalPerson
uidNumber: 1001
gidNumber: 1001
sn: surname
homeDirectory: /home/test-user
cn: common name
EOF
adding new entry "uid=test-user,ou=People,dc=example,dc=com"

ldapdelete -p 39001 -h localhost -D "cn=Directory Manager" -w password uid=test-user,ou=People,dc=example,dc=com

4) Run ds-replcheck and verify there are NO complaints about missing entries/tombstones

[root@master ~]# ds-replcheck -v -D "cn=directory manager" -w password -m ldap://`hostname`:39001 -r ldap://`hostname`:39002 -b dc=example,dc=com -l 1
Performing online report...
Connecting to servers...
Validating suffix ...
Gathering Master's RUV...
Gathering Replica's RUV...
Start searching and comparing...
Preparing final report...
================================================================================
         Replication Synchronization Report  (Wed May 15 09:26:29 2019)
================================================================================


Database RUV's
=====================================================

Master RUV:
  {replica 1 ldap://web9.testrelm.test:39001} 5cdc0edf000100010000 5cdc1354000000010000
  {replica 2 ldap://web9.testrelm.test:39002} 5cdc0ee8000100020000 5cdc0ee8000100020000
  {replicageneration} 5cdc0edf000000010000

Replica RUV:
  {replica 1 ldap://web9.testrelm.test:39001} 5cdc0edf000100010000 5cdc0ee5000200010000
  {replica 2 ldap://web9.testrelm.test:39002} 5cdc0ee8000100020000 5cdc0ee8000100020000
  {replicageneration} 5cdc0edf000000010000


Entry Counts
=====================================================

Master:  15
Replica: 14


Tombstones
=====================================================

Master:  1
Replica: 0

Marking it as VERIFIED.
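For readers comparing the RUV lines in the report above, a small sketch of how the 20-character CSN strings break down may help. It assumes the usual 389 Directory Server layout of four hexadecimal fields (timestamp, sequence number, replica ID, sub-sequence number); the helper and field names are made up for illustration.

```python
from collections import namedtuple

CSN = namedtuple("CSN", "timestamp seq rid subseq")


def parse_csn(csn):
    """Split a 20-hex-digit CSN string into its four numeric fields."""
    if len(csn) != 20:
        raise ValueError("expected a 20-character CSN, got %r" % csn)
    return CSN(
        timestamp=int(csn[0:8], 16),   # seconds since the epoch
        seq=int(csn[8:12], 16),        # sequence number within that second
        rid=int(csn[12:16], 16),       # replica ID that originated the change
        subseq=int(csn[16:20], 16),    # sub-sequence number
    )


# Max CSNs for replica 1 as seen by the master and by the replica above.
master_max = parse_csn("5cdc1354000000010000")
replica_max = parse_csn("5cdc0ee5000200010000")

# Seconds by which the replica's view of replica 1 trails the master's.
print(master_max.timestamp - replica_max.timestamp)
```

Since no agreements were created in this test, the replica never receives the add and delete, so its max CSN for replica 1 stays behind the master's.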

Comment 31 Dave 2019-05-28 17:59:37 UTC
Created attachment 1574437 [details]
ds-replcheck output from RHEL 7.6 version showing additional detail, as opposed to the 7.7 version output posted below

Comment 32 Dave 2019-05-28 18:12:06 UTC
(In reply to mreynolds from comment #24)
> Created attachment 1567987 [details]
> RHEL 7.7 ds-replcheck (python2)

running an alternate data check tool, shows there is no data inconsistency, but ds-replcheck IS reporting an issue

we're under the impression that this new ds-replcheck does not resolve the issue for the customer and is not yet usable/reliable

Comment 36 mreynolds 2019-05-28 18:54:50 UTC
Sorry I'm really confused.  From what I can tell from comment 30 everything is running correctly, and the old output verifies that they were previously seeing false positives.  So the tool appears to be running correctly.  Is the problem now that there is no "Result" summary message?

Comment 37 Dave 2019-05-28 19:08:58 UTC
(In reply to mreynolds from comment #36)
> Sorry I'm really confused.  From what I can tell from comment 30 everything
> is running correctly, and the old output verifies that they were previously
> seeing false positives.  So the tool appears to be running correctly.  Is
> the problem now that there is no "Result" summary message?

The new version had been running as expected over the last couple weeks of testing, until this afternoon. We had been looking for this Result section, and could verify it was good when we saw:

No differences between Master and Replica

*Until today, when we did not see this line, noticed there was no Result section at all, 

*and these count issues:

Entry Counts
=====================================================
 
Master:  4608
Replica: 4603


Tombstones
=====================================================
 
Master:  86
Replica: 81


*The previous version would show some data when the counts were not equal, and we saw no data to try and dump/compare between servers.

Then with what you posted about it being a sorting issue that is being falsely reported, this new version seems to still be failing on sorting "multivalued attribute values" ??

We saw 3 (*) issues when it failed to report a "good" status

Comment 38 mreynolds 2019-05-28 19:53:06 UTC
(In reply to Dave from comment #37)
> (In reply to mreynolds from comment #36)
> > Sorry I'm really confused.  From what I can tell from comment 30 everything
> > is running correctly, and the old output verifies that they were previously
> > seeing false positives.  So the tool appears to be running correctly.  Is
> > the problem now that there is no "Result" summary message?
> 
> The new version had been running as expected over the last couple weeks of
> testing, until this afternoon. We had been looking for this Result section,
> and could verify it was good when we saw:
> 
> No differences between Master and Replica
> 
> *Until today, when we did not see this line, noticed there was no Result
> section at all, 
> 
> *and these count issues:
> 
> Entry Counts
> =====================================================
>  
> Master:  4608
> Replica: 4603
> 
> 
> Tombstones
> =====================================================
>  
> Master:  86
> Replica: 81
> 
> 
> *The previous version would show some data when the counts were not equal,
> and we saw no data to try and dump/compare between servers.
> 
> Then with what you posted about it being a sorting issue that is being
> falsely reported, this new version seems to still be failing on sorting
> "multivalued attribute values" ??
> 
> We saw 3 (*) issues when it failed to report a "good" status

These counts are not expected to be equal.  Tombstones can and will vary (they are not expected to ever be in sync), and the entry count is potentially always in flux.  We also ignore missing tombstones for the "missing entry" report.  So if the numbers are off and any missing entries are NOT tombstones it will report on them.  If you don't get a "missing entry" report then the counts don't mean a thing.  Really the counts are just informational (maybe they should be removed if they're causing confusion?)
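A minimal sketch of that behavior, assuming entries are held as dicts keyed by DN with lower-cased attribute names (the helper names are hypothetical, and this is not the code shipped in the tool):

```python
def is_tombstone(attrs):
    """Treat an entry as a tombstone if it carries the nsTombstone objectclass."""
    return any(v.lower() == "nstombstone" for v in attrs.get("objectclass", []))


def missing_entry_report(master, replica):
    """Return DNs on the master that the replica lacks, skipping tombstones."""
    return sorted(
        dn for dn, attrs in master.items()
        if dn not in replica and not is_tombstone(attrs)
    )


master = {
    "uid=a,dc=example,dc=com": {"objectclass": ["top", "account"]},
    "nsuniqueid=123,uid=test-user,dc=example,dc=com": {
        "objectclass": ["top", "account", "nsTombstone"],
    },
}
replica = {
    "uid=a,dc=example,dc=com": {"objectclass": ["top", "account"]},
}

# The raw counts differ (2 vs 1), yet nothing is reported as missing.
print(len(master), len(replica), missing_entry_report(master, replica))
```

The example shows why unequal counts on their own are not a finding: the only entry absent from the replica is a tombstone, so the missing entry report stays empty.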

As for the "Result" line it is dictated in the code as follows:

    if missing_report == "" and len(diff_report) == 0 and m_count == r_count:
        final_report += ('\nResult\n')
        final_report += ('=====================================================\n\n')
        final_report += ('No differences between Master and Replica\n')


In this case m_count (the Master count) is different than r_count (the Replica count) and we don't get our "Result" summary.  There is definitely a flaw with this algorithm since it is okay for m_count and r_count to be different.  The other issue is that we only write this "Result" line under these impossibly pristine conditions.  This needs to be made more robust, with better reporting of the results, and it should always write a Result summary.  So we need a new bug to improve the Result summary.
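One possible shape for a Result section that is always written and does not depend on the counts is sketched below. It is an illustration of the idea only, not the change made for the follow-up bug; the function name and the wording of the "differences found" message are assumptions.

```python
SEPARATOR = "====================================================="


def result_section(missing_report, diff_report):
    """Build a Result section for every run, independent of the entry counts."""
    problems = []
    if missing_report:
        problems.append("missing entries")
    if diff_report:
        problems.append("entry differences")

    if problems:
        # Hypothetical wording for the failure case.
        summary = ("There are replication differences between Master and "
                   "Replica: " + ", ".join(problems))
    else:
        summary = "No differences between Master and Replica"

    return "\nResult\n" + SEPARATOR + "\n\n" + summary + "\n"


# Counts such as 4608 vs 4603 no longer suppress the summary.
print(result_section(missing_report="", diff_report=[]))
```

Dropping the `m_count == r_count` condition means a run with differing counts but no missing entries and no attribute differences still ends with an explicit "No differences" line.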

Comment 39 Dave 2019-05-29 14:23:01 UTC
(In reply to mreynolds from comment #38)

...

> In this case m_count (the Master count) is different than r_count (the Replica
> count) and we don't get our "Result" summary.  There is definitely a flaw
> with this algorithm since it is okay for m_count and r_count to be
> different.  The other issue is that we only write this "Result" line under
> these impossibly pristine conditions.  This needs to be made more robust,
> better reporting of the results, and always write a Result summary.  So we
> need a new bug to improve the Result summary.

cool, sounds good.. the Result section is really what we're looking for to get the status.
Could you mention the new BZ for this missing Result issue, and/or would you like me to open one?

Comment 40 mreynolds 2019-05-29 14:43:44 UTC
(In reply to Dave from comment #39)
> (In reply to mreynolds from comment #38)
> 
> ...
> 
> > In this case m_count (the Master count) is different than r_count (the Replica
> > count) and we don't get our "Result" summary.  There is definitely a flaw
> > with this algorithm since it is okay for m_count and r_count to be
> > different.  The other issue is that we only write this "Result" line under
> > these impossibly pristine conditions.  This needs to be made more robust,
> > better reporting of the results, and always write a Result summary.  So we
> > need a new bug to improve the Result summary.
> 
> cool, sounds good.. the Result section is really what we're looking for to
> get the status.
> Could you mention the new BZ for this missing Result issue, and/or would you
> like me to open one?

Done!

https://bugzilla.redhat.com/show_bug.cgi?id=1715091

Comment 44 errata-xmlrpc 2019-08-06 12:58:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2152

Comment 46 mreynolds 2019-09-06 19:19:27 UTC
Dave,

So the offline report says there are no problems, but the online report does?  Did you verify whether the differences reported in the online report are actually NOT different in the ldif files?  Or, is the offline report incorrectly reporting that there are no differences when there are differences in the ldif files?

There is another bug too, the missing entries report has duplicates.  Not sure how that is possible, but it needs to be fixed.

Either way this is a different issue, so we should open a new bug for it once it is determined which report is wrong and why.

Thanks,
Mark

Comment 50 mreynolds 2019-09-11 17:16:22 UTC
Hmmm, part of the issue could be that the ldif was not generated by db2ldif, but from ldapsearch instead.  The entries would look different, but I'm not sure if that is the issue here or not.  I think the offline mode might not be processing the replication state attributes correctly and finding false inconsistencies.  Any chance I could get the ldifs they used?  

Side note: there should really be another bug opened to address the differences between the online and offline mode, as this bug is closed as the original reported issue was resolved.