Description of problem: Migrating content into PressGang will cause multiple duplicate topics if the books have multiple parallel maintenance streams. Version-Release number of selected component (if applicable): How reproducible: 100% Steps to Reproduce: 1. consider two streams of the one book (trunk and branch) 2. import each of them into PressGang via chopbook or similar 3. most or all of the imported topics from the second stream will be exact or near-duplicates of the topics from the first stream. Actual results: Do this across a large suite of books and you will have hundreds, thousands, or tens of thousands of duplicate or near-duplicate topics in the database. Expected results: Exact dupes should be detected and rejected at import time, and topic maps adjusted accordingly. Near dupes should probably be imported as different revisions of the existing content and topic maps adjusted accordingly. Even better if user interaction or a flag during import (or reference to the relevant Revision Histories) could determine the relationship between these revisions. Additional info:
We could use a checksum and a minhash to detect exact and likely duplicates without much processing overhead.
In case it's of any use, algorithms for measuring string similarity are already out there; I only know because I maintain a Perl module for Fedora that implements one. See: http://search.cpan.org/~mlehmann/String-Similarity-1.04/Similarity.pm for details and maybe a head start on what to look for. Publican uses this module to implement string matching in PO files, to replace what we once used GNU gettext for. This is nice because you can tune the closeness of the match. In this specific instance, degree of closeness might determine dupe/new revision/new topic.
At a quick glance we could also use Apache Lucene to accomplish this since it includes matching algorithms, which can also be customised (see http://docs.jboss.org/hibernate/search/3.2/reference/en/html/search-lucene-native.html). http://stackoverflow.com/questions/7657673/how-to-find-similar-documents is a basic example on how this can be achieved and http://www.mirkosertic.de/doku.php/architecturedesign/hibernatesearch is a slightly more complex example. Also thanks Rudi for the links, I know Java has some similar implementations floating around (I've actually used the Apache implementation of the Levenshtein distance algorithm in the csprocessor).
*** Bug 829773 has been marked as a duplicate of this bug. ***
Distance algorithms are not going to scale, as they requires you to compare the incoming topic with every other topic individually. With 20000+ topics now (and the potential for a great many more in the future), this kind of workload will become impractical. I have added a minhash column to the topic table, and implemented an algorithm to calculate the min hash based on the same xml breakdown we use for the translations. You can query for topics with the same minhash using a URL like http://localhost:8080/pressgang-ccms-ui/#SearchResultsAndTopicView;query;minHash=-1251642234
The endpoint http://localhost:8080/pressgang-ccms/rest/1/minhash/get/json can be used to post some XML and get a minhash using the same algorithm that generates the minhash property on the topics when they are saved.
Post to http://localhost:8080/pressgang-ccms/rest/1/minhash/recalculate to regenerate the min hash values for all topics. This is useful when the feature is first rolled out, or if the algorithm changes.
On second thought, this might be best left to a tool like https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering
The dev server is computing the last million or so minhashes, but functionally everything is ok to test.
Ran into an additional detail with this. If the MinHashXORs table hasn't been initialised then you won't be able to save/update any topics. The error message you'll get is: java.lang.IllegalStateException: Did not find a minhash xor int for function 1
I also gave this a quick test since I had to have a quick look into the issue above. The implementation appears to be working correctly (that is a POST to the minhashsimilar endpoint returns similar topics), however it doesn't actually solve the issue originally reported. As per the expected results this should stop topics that are exactly the same from being created, so at this stage you still need to manually check every topic and see if it exists before attempting to import/create new topics, which isn't very different to the current system (the only difference is now you can find similar xml content based on a match ratio instead of looking for a 100% match). So since this doesn't solve the original problem I'm going to set this back to ASSIGNED.
Just adding to my comment above, I'm mainly referring to the create/bulk import functionality in the UI.
I'll have to create another bug for any processes that make use of this functionality, as it may be that it is used by chopbook, or we create an import feature in the ui itself. Once this is live on the prod server I'll talk to Dan about how we move to the next step.
In that case I've cloned this bug to handle the Bulk Import operation in the UI (see BZ#1035547) and I'm changing the name of this RFE to make it more descriptive of what it provides.
I'm going to keep this as ASSIGNED as well because I found the following issues during testing: - Unable to create or edit topics until the MinHashXOR table has been populated. This should either be populated at start up if it's empty, or be changed so that the content can be saved without the table being populated. - A call to the topics minHash query will return the topic itself, since this is supposed to be used to find topics similiar to the topic specified it probably should return itself. - If you enter a value for the similar minHash or topic minHash endpoints that is outside of the 0.6-0.9 change then you get no error and you assume that this just worked, however this isn't the case. In this circumstance I'd expect an error to be thrown otherwise I'd assume that if I specified 0.4 that I'd find all topics that were a 40% match. Also verified that the following works: - Generate the minhash xors using the specified endpoint. - Generate the minhash value for topics using either the full or missing endpoints - Verified that the get minhash endpoint returns the calculated minhash values. - Verified that the minhashsimilar endpoints returns topics that are similar to the XML sent (tested with formatted and non formatted XML). - Changing the comparison ratio (between 0.6 and 0.9) returns different sets of topics with the lower the value the greater the distance between the topics.
The end of the second issue in the comment above should be "topic specified probably shouldn't return itself."
Verified that all the issues in Comment #21 have been fixed.