Bug 1629055
| Summary: | ds-replcheck unreliable, showing false positives | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Dave <dsimes> |
| Component: | 389-ds-base | Assignee: | mreynolds |
| Status: | CLOSED ERRATA | QA Contact: | RHDS QE <ds-qe-bugs> |
| Severity: | unspecified | Docs Contact: | Marc Muehlfeld <mmuehlfe> |
| Priority: | high | ||
| Version: | 7.5 | CC: | aadhikar, cpelland, dsimes, gparente, jvilicic, lkrispen, mreynolds, nkinder, pasik, rmeggins, spichugi, striker, tbordaz, tmihinto, vashirov |
| Target Milestone: | rc | ||
| Target Release: | 7.7 | ||
| Hardware: | Unspecified | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | 389-ds-base-1.3.9.1-7.el7 | Doc Type: | Bug Fix |
| Doc Text: |
.The `ds-replcheck` utility no longer incorrectly reports non-matching tombstone entries on replicas
Previously, if an administrator ran the `ds-replcheck` utility on different Directory Server replicas with tombstones present, `ds-replcheck` reported that one of the replicas was missing the tombstone entries. It is expected that tombstone entries do not match on each replica. With this update, `ds-replcheck` no longer searches for tombstone entries. As a result, the utility does not report missing tombstone entries as a problem.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2019-08-06 12:58:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Attachments: | |||
|
Description
Dave
2018-09-14 18:13:14 UTC
From what I can tell, the issue is that tombstones are showing up in the report as missing entries. Correct? This should have been clearly stated in the bug; pointing to the case notes to get the details is not appropriate IMO.

Anyway, if that is the issue, it is probably a false positive, as already stated. We were actually going to remove tombstones from the missing entry report, but we weren't sure if it was going to be a problem or not. Well, it's apparently a problem, so we will work on adding a new option to ignore them.

(In reply to mreynolds from comment #2)
> From what I can tell the issue is that tombstones are showing up in the
> report as missing entries. Correct? This should have been clearly stated
> in the bug, but pointing to the case notes to get the details is not
> appropriate IMO.
>
> Anyway, if that is the issue, it is probably a false positive as already
> stated. We were actually going to remove tombstones from the missing entry
> report, but we weren't sure if it was going to be a problem or not. Well
> it's apparently a problem so we will work on adding a new option to ignore
> them.

The customer was asked to run it with the following exclude options:

# ds-replcheck -D "cn=directory manager" -W -m ldap://ssc-prd-ipa-099 -r ldap://cl-rhm-0252 -b dc=masked,dc=domain --ignore memberof,idnssoaserial,entryusn,krblastsuccessfulauth,krblastfailedauth,krbloginfailedcount,nsuniqueid

but ds-replcheck is still showing periodic issues.

Should excluding nsuniqueid be ignoring tombstone entries?
However, this did not resolve the issue for the customer.

(In reply to Dave from comment #5)
> The customer was asked to run it with the following exclude options:
>
> # ds-replcheck -D "cn=directory manager" -W -m ldap://ssc-prd-ipa-099 -r
> ldap://cl-rhm-0252 -b dc=masked,dc=domain --ignore
> memberof,idnssoaserial,entryusn,krblastsuccessfulauth,krblastfailedauth,
> krbloginfailedcount,nsuniqueid
>
> but ds-replcheck is still showing periodic issues
>
> should excluding nsuniqueid be ignoring tombstone entries?
> however, this did not resolve the issue for the customer

Yeah, that won't work. The "exclude/ignore attributes" option only applies when checking the difference between two entries. It does not impact the "Missing Entries" report.

Something else I want to mention: replication, when under load, is never going to be in sync at any given moment. So there are going to be times when the tool reports differences (because at that moment there are). To help avoid this, there is a lag time you can adjust:

https://www.port389.org/docs/389ds/design/repl-diff-tool-design.html#usage

The default is 5 minutes, meaning anything out of sync that's within 5 minutes is ignored, but this might need to be increased depending on the customer's load. Just an FYI.

Tomorrow I will prepare a test version of the script (that ignores missing tombstones). If the customer could test this to make sure it works for them, I would appreciate it.

(In reply to mreynolds from comment #7)
> Tomorrow I will prepare a test version of the script (that ignores missing
> tombstones). If the customer could test this to make sure it works for
> them I would appreciate it.

yes yes, we/they could definitely test! :)

Created attachment 1484404 [details]
Modified version to skip tombstones from missing entries report
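The distinction made above (the ignore list only affects the entry-to-entry comparison, never the "Missing Entries" report) can be illustrated with a toy sketch. This is not the ds-replcheck code: the `compare` and `is_tombstone` helpers and the dictionary snapshot format are hypothetical, and the only assumption borrowed from Directory Server is that tombstones carry the `nsTombstone` object class.

```python
def is_tombstone(attrs):
    """True if the (hypothetical) entry dict carries the nsTombstone object class."""
    return any(v.lower() == "nstombstone" for v in attrs.get("objectclass", []))


def compare(master, replica, ignore=(), skip_tombstones=False):
    """Compare two {dn: {attr: [values]}} snapshots.

    Returns (missing, diffs): DNs present on the master but not the replica,
    and, for entries present on both sides, the attributes whose values differ.
    """
    ignored = {a.lower() for a in ignore}

    # Phase 1: missing entries. The ignore list is never consulted here.
    missing = sorted(
        dn for dn, attrs in master.items()
        if dn not in replica and not (skip_tombstones and is_tombstone(attrs))
    )

    # Phase 2: attribute differences. Only here does the ignore list matter.
    diffs = {}
    for dn in master.keys() & replica.keys():
        changed = sorted(
            attr for attr in master[dn].keys() | replica[dn].keys()
            if attr.lower() not in ignored
            and sorted(master[dn].get(attr, [])) != sorted(replica[dn].get(attr, []))
        )
        if changed:
            diffs[dn] = changed
    return missing, diffs


ALICE = "uid=alice,ou=people,dc=example,dc=com"
TOMB = "nsuniqueid=abc,uid=bob,ou=people,dc=example,dc=com"

master = {
    ALICE: {"cn": ["Alice"], "entryusn": ["10"]},
    TOMB: {"objectclass": ["top", "nsTombstone"], "nsuniqueid": ["abc"]},
}
replica = {ALICE: {"cn": ["Alice"], "entryusn": ["12"]}}

# Ignoring nsuniqueid does not hide the tombstone from the missing report.
missing, diffs = compare(master, replica, ignore=["entryusn", "nsuniqueid"])

# Skipping tombstones explicitly does.
missing_skipped, _ = compare(master, replica, ignore=["entryusn"], skip_tombstones=True)
```

In this sketch `missing` still lists the tombstone DN even though `nsuniqueid` is ignored, while `missing_skipped` is empty, which is the behaviour the modified script is meant to provide.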
(In reply to mreynolds from comment #11)
> Created attachment 1484404 [details]
> Modified version to skip tombstones from missing entries report

testing!! (even set lag to 10 hours - 6000)

I'm assuming this is from the "online" mode. I do see where it is picking it up from, so I need to do another revision. But, do they see the same missing entries in offline mode (comparing two ldifs from "db2ldif -r")?

is there a newer ds-replcheck we can test? the last provided/attached did not solve/fix the customer's issue

Created attachment 1518440 [details]
New ds-replcheck
Try this one, Dave. Note: the usage changed quite a bit (it's a lot nicer), so run ds-replcheck --help first and try again...
The version of ds-replcheck in the build 389-ds-base-1.3.9.1-5.el7.x86_64 is 1.4, whereas in the attached file it is 2.0. Also, the changes could not be seen in /usr/bin/ds-replcheck when compared with the attachment. Therefore marking it as FAILED_QA.

(In reply to mreynolds from comment #18)
> Created attachment 1518440 [details]
> New ds-replcheck
>
> Try this one Dave. Note: the usage changed quite a bit (it's a lot nicer),
> so run ds-replcheck --help first and try again...

/usr/bin/python3 not available on RHEL 7.x (tested on 7.6)

pointing to /usr/bin/python (python2) does not work either, as there are missing modules apparently specific to python3

using rhscl rh-python36-python (rh-python36-runtime) does not work either, as there is no ldap module available

does not work for RHEL 7.x, assuming this will also not work for targeted RHEL 7.7 release

unable to test on RHEL 7.x (7.6 attempted)

Dave, it looks like these changes were added to RHEL 7.7 (python2 version). What was the last version the customer tested? Anyway, I am attaching what is in the latest build for 7.7. Since RHEL 7.7 is wrapping up really soon, it would be great if we could get it verified.

Thanks,
Mark

Created attachment 1567987 [details]
RHEL 7.7 ds-replcheck (python2)
(In reply to mreynolds from comment #24)
> Created attachment 1567987 [details]
> RHEL 7.7 ds-replcheck (python2)

This one runs (on 7.6) and looks clean, does not exhibit the previous tombstone issues

Looks good, customer has reviewed the new output

Build Tested: 389-ds-base-1.3.9.1-7.el7.x86_64
1) Create two masters, A and B, but do NOT create agreements, etc.
2) Add and Delete a user on Master A
ldapadd -p 39001 -h localhost -D "cn=Directory Manager" -w password << EOF
dn: uid=test-user,ou=People,dc=example,dc=com
changetype: add
uid: test-user
objectClass: top
objectClass: account
objectClass: posixaccount
objectClass: inetOrgPerson
objectClass: person
objectClass: inetUser
objectClass: organizationalPerson
uidNumber: 1001
gidNumber: 1001
sn: surname
homeDirectory: /home/test-user
cn: common name
EOF
adding new entry "uid=test-user,ou=People,dc=example,dc=com"
ldapdelete -p 39001 -h localhost -D "cn=Directory Manager" -w password uid=test-user,ou=People,dc=example,dc=com
4) Run ds-replcheck and verify there are NO complaints about missing entries/tombstones
[root@master ~]# ds-replcheck -v -D "cn=directory manager" -w password -m ldap://`hostname`:39001 -r ldap://`hostname`:39002 -b dc=example,dc=com -l 1
Performing online report...
Connecting to servers...
Validating suffix ...
Gathering Master's RUV...
Gathering Replica's RUV...
Start searching and comparing...
Preparing final report...
================================================================================
Replication Synchronization Report (Wed May 15 09:26:29 2019)
================================================================================
Database RUV's
=====================================================
Master RUV:
{replica 1 ldap://web9.testrelm.test:39001} 5cdc0edf000100010000 5cdc1354000000010000
{replica 2 ldap://web9.testrelm.test:39002} 5cdc0ee8000100020000 5cdc0ee8000100020000
{replicageneration} 5cdc0edf000000010000
Replica RUV:
{replica 1 ldap://web9.testrelm.test:39001} 5cdc0edf000100010000 5cdc0ee5000200010000
{replica 2 ldap://web9.testrelm.test:39002} 5cdc0ee8000100020000 5cdc0ee8000100020000
{replicageneration} 5cdc0edf000000010000
Entry Counts
=====================================================
Master: 15
Replica: 14
Tombstones
=====================================================
Master: 1
Replica: 0
Marking it as VERIFIED.
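The RUV lines in the report above can be read by decoding the CSNs they contain. The following is a minimal sketch, assuming the usual 389-ds CSN string layout of 20 hex digits (8 for the timestamp in seconds, 4 for the sequence number, 4 for the replica ID, 4 for the sub-sequence number); the `parse_csn` helper is hypothetical and is not part of ds-replcheck.

```python
def parse_csn(csn):
    """Split a 20-hex-digit CSN string into its four numeric fields."""
    if len(csn) != 20:
        raise ValueError("expected a 20-character CSN, got %r" % csn)
    return {
        "timestamp": int(csn[0:8], 16),   # seconds since the Unix epoch
        "seq": int(csn[8:12], 16),        # sequence number within that second
        "rid": int(csn[12:16], 16),       # replica ID that originated the change
        "subseq": int(csn[16:20], 16),    # sub-sequence number
    }


# Max CSNs for replica 1, copied from the Master RUV and Replica RUV above.
master_max = parse_csn("5cdc1354000000010000")
replica_max = parse_csn("5cdc0ee5000200010000")

# How far (in seconds) the replica's view of replica 1 trails the master's.
lag_seconds = master_max["timestamp"] - replica_max["timestamp"]
print("replica 1 changes seen by the replica trail the master by %d s" % lag_seconds)
```

Both CSNs decode to replica ID 1, and the master's max CSN is newer than the replica's, which matches the test setup: the add and delete were performed on Master A only, so the master holds one tombstone that the replica does not.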
Created attachment 1574437 [details]
ds-replcheck output from RHEL 7.6 version showing additional detail, as opposed to the 7.7 version output posted below
(In reply to mreynolds from comment #24)
> Created attachment 1567987 [details]
> RHEL 7.7 ds-replcheck (python2)

Running an alternate data check tool shows there is no data inconsistency, but ds-replcheck IS reporting an issue.

We're under the impression that this new ds-replcheck does not resolve the issue for the customer and is not yet usable/reliable.

Sorry, I'm really confused. From what I can tell from comment 30, everything is running correctly, and the old output verifies that they were previously seeing false positives. So the tool appears to be running correctly. Is the problem now that there is no "Result" summary message?

(In reply to mreynolds from comment #36)
> Sorry I'm really confused. From what I can tell from comment 30 everything
> is running correctly, and the old output verifies that they were previously
> seeing false positives. So the tool appears to be running correctly. Is
> the problem now that there is no "Result" summary message?

The new version had been running as expected over the last couple weeks of testing, until this afternoon. We had been looking for this Result section, and could verify it was good when we saw:

No differences between Master and Replica

*Until today, when we did not see this line, noticed there was no Result section at all,

*and these count issues:

Entry Counts
=====================================================

Master: 4608
Replica: 4603


Tombstones
=====================================================

Master: 86
Replica: 81

*The previous version would show some data when the counts were not equal, and we saw no data to try and dump/compare between servers.

Then with what you posted about it being a sorting issue that is being falsely reported, this new version seems to still be failing on sorting "multivalued attribute values" ??

We saw 3 (*) issues when it failed to report a "good" status

(In reply to Dave from comment #37)
> (In reply to mreynolds from comment #36)
> > Sorry I'm really confused. From what I can tell from comment 30 everything
> > is running correctly, and the old output verifies that they were previously
> > seeing false positives. So the tool appears to be running correctly. Is
> > the problem now that there is no "Result" summary message?
>
> The new version had been running as expected over the last couple weeks of
> testing, until this afternoon. We had been looking for this Result section,
> and could verify it was good when we saw:
>
> No differences between Master and Replica
>
> *Until today, when we did not see this line, noticed there was no Result
> section at all,
>
> *and these count issues:
>
> Entry Counts
> =====================================================
>
> Master: 4608
> Replica: 4603
>
>
> Tombstones
> =====================================================
>
> Master: 86
> Replica: 81
>
>
> *The previous version would show some data when the counts were not equal,
> and we saw no data to try and dump/compare between servers.
>
> Then with what you posted about it being a sorting issue that is being
> falsely reported, this new version seems to still be failing on sorting
> "multivalued attribute values" ??
>
> We saw 3 (*) issues when it failed to report a "good" status

These counts are not expected to be equal. Tombstones can and will vary (they are not expected to ever be in sync), and the entry count is potentially always in flux. We also ignore missing tombstones for the "missing entry" report. So if the numbers are off and any missing entries are NOT tombstones, it will report on them. If you don't get a "missing entry" report, then the counts don't mean a thing. Really, the counts are just informational (maybe they should be removed if it's causing confusion?)
As for the "Result" line, it is dictated in the code as follows:

    if missing_report == "" and len(diff_report) == 0 and m_count == r_count:
        final_report += ('\nResult\n')
        final_report += ('=====================================================\n\n')
        final_report += ('No differences between Master and Replica\n')

In this case m_count (the Master count) is different than r_count (the Replica count), and we don't get our "Result" summary. There is definitely a flaw with this algorithm, since it is okay for m_count and r_count to be different. The other issue is that we only write this "Result" line under these impossibly pristine conditions. This needs to be made more robust, with better reporting of the results, and it should always write a Result summary. So we need a new bug to improve the Result summary.

(In reply to mreynolds from comment #38)
...
> In this case m_count(the Master count) is different than r_count(the Replica
> count) and we don't get our "Result" summary. There is definitely a flaw
> with this algorithm since it is okay for m_count and r_count to be
> different. The other issue is that we only write this "Result" line under
> these impossibly pristine conditions. This needs to be made more robust,
> better reporting of the results, and always write a Result summary. So we
> need a new bug to improve the Result summary.

cool, sounds good.. the Result section is really what we're looking for to get the status.

Could you mention the new BZ for this missing Result issue, and/or would you like me to open one?

(In reply to Dave from comment #39)
> (In reply to mreynolds from comment #38)
>
> ...
>
> > In this case m_count(the Master count) is different than r_count(the Replica
> > count) and we don't get our "Result" summary. There is definitely a flaw
> > with this algorithm since it is okay for m_count and r_count to be
> > different. The other issue is that we only write this "Result" line under
> > these impossibly pristine conditions. This needs to be made more robust,
> > better reporting of the results, and always write a Result summary. So we
> > need a new bug to improve the Result summary.
>
> cool, sounds good.. the Result section is really what we're looking for to
> get the status.
> Could you mention the new BZ for this missing Result issue, and/or would you
> like me to open one?

Done! https://bugzilla.redhat.com/show_bug.cgi?id=1715091

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2152

Dave,

So the offline report says there are no problems, but the online one does? Did you verify whether the differences reported in the online report are actually NOT different in the ldif files? Or is the offline report incorrectly reporting that there are no differences when there are differences in the ldif files?

There is another bug too: the missing entries report has duplicates. Not sure how that is possible, but it needs to be fixed.

Either way this is a different issue, so we should open a new bug for it once it is determined which report is wrong and why.

Thanks,
Mark

Hmmm, part of the issue could be that the ldif was not generated by db2ldif, but from ldapsearch instead. The entries would look different, but I'm not sure if that is the issue here or not. I think the offline mode might not be processing the replication state attributes correctly and finding false inconsistencies. Any chance I could get the ldifs they used?

Side note: there should really be another bug opened to address the differences between the online and offline mode, as this bug is closed because the originally reported issue was resolved.
|
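The flaw described in comment 38 (the "Result" section is only written when the entry counts also happen to match) can be illustrated with a small sketch. This is not the change made for bug 1715091, whose details are not given in this report; the `build_result` function and the wording of the "differences found" message are hypothetical. The sketch only shows the two properties asked for in the thread: a Result section is always written, and the informational counts are not consulted.

```python
SEPARATOR = "====================================================="


def build_result(missing_report, diff_report):
    """Return a Result section for the final report.

    missing_report is the text of the missing-entries report ("" if none) and
    diff_report is the list of per-entry differences, mirroring the variables
    in the snippet quoted in comment 38. Entry and tombstone counts are
    deliberately ignored because they are not expected to match.
    """
    result = "\nResult\n"
    result += SEPARATOR + "\n\n"
    if missing_report == "" and len(diff_report) == 0:
        result += "No differences between Master and Replica\n"
    else:
        # Hypothetical wording; the real tool may phrase this differently.
        result += "Differences found between Master and Replica\n"
    return result


# Counts such as Master: 4608 / Replica: 4603 no longer suppress the summary.
clean = build_result("", [])
dirty = build_result("uid=test-user,ou=People,dc=example,dc=com\n", [])
print(clean)
print(dirty)
```

With this shape, a run like the one in comment 37 (different counts, no missing-entry or difference report) would still end with "No differences between Master and Replica".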