Bug 1011929 - RFE: Provide a way to detect duplicate topics when importing new content
Product: PressGang CCMS
Classification: Community
Component: CCMS-Core
Hardware: Unspecified, OS: Unspecified
Priority: high, Severity: high
Target Release: 1.3
Assigned To: Matthew Casperson
Duplicates: 829773
Blocks: 1012194 1035547 1035948
Reported: 2013-09-25 08:10 EDT by Ruediger Landmann
Modified: 2014-08-04 18:27 EDT
CC: 4 users

Doc Type: Bug Fix
Clones: 1035547
Last Closed: 2013-12-03 17:16:06 EST
Type: Bug

Description Ruediger Landmann 2013-09-25 08:10:41 EDT
Description of problem:

Migrating content into PressGang will cause multiple duplicate topics if the books have multiple parallel maintenance streams. 

Steps to Reproduce:
1. Consider two parallel streams of the same book (a trunk and a branch).
2. Import each of them into PressGang via chopbook or a similar tool.
3. Observe that most or all of the topics imported from the second stream are exact or near-duplicates of the topics from the first stream.

Actual results:
Do this across a large suite of books and you will have hundreds, thousands, or tens of thousands of duplicate or near-duplicate topics in the database.

Expected results:
Exact duplicates should be detected and rejected at import time, and topic maps adjusted accordingly. Near-duplicates should probably be imported as new revisions of the existing content, again with topic maps adjusted accordingly. Better still, user interaction, a flag during import, or reference to the relevant Revision Histories could determine the relationship between these revisions.

Comment 1 Matthew Casperson 2013-09-25 17:44:02 EDT
We could use a checksum and a minhash to detect exact and likely duplicates without much processing overhead.
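A minimal sketch of that two-level check (class and method names are illustrative, not taken from the PressGang codebase): a checksum catches byte-identical topics, and a minhash signature over the topic's words catches likely near-duplicates.

```java
import java.security.MessageDigest;
import java.util.Arrays;

// Sketch of the two-level duplicate check suggested above.
public class DuplicateCheck {

    // Exact duplicates: compare a checksum of the topic XML.
    static String checksum(String xml) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1")
                .digest(xml.getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    // Likely duplicates: a minhash signature over the topic's words. The
    // xors array supplies one XOR constant per hash function.
    static int[] minhash(String text, int[] xors) {
        int[] sig = new int[xors.length];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (String word : text.toLowerCase().split("\\s+")) {
            int h = word.hashCode();
            for (int i = 0; i < xors.length; i++) {
                int hashed = h ^ xors[i]; // cheap hash family via XOR
                if (hashed < sig[i]) sig[i] = hashed;
            }
        }
        return sig;
    }

    // The fraction of matching signature slots estimates Jaccard similarity.
    static double similarity(int[] a, int[] b) {
        int same = 0;
        for (int i = 0; i < a.length; i++) if (a[i] == b[i]) same++;
        return (double) same / a.length;
    }
}
```

Comparing two signatures is O(number of hash functions), regardless of topic size, which is why this stays cheap at import time.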
Comment 2 Ruediger Landmann 2013-09-25 19:53:14 EDT
In case it's of any use, algorithms for measuring string similarity are already out there; I only know because I maintain a Perl module for Fedora that implements one. See:


for details and maybe a head start on what to look for. Publican uses this module to implement string matching in PO files, to replace what we once used GNU gettext for. 

This is nice because you can tune the closeness of the match. In this specific instance, degree of closeness might determine dupe/new revision/new topic.
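For example, a tunable closeness score can be built on Levenshtein edit distance (a generic sketch, not code from the Perl module mentioned above): 1.0 means identical, and an importer could map score bands to dupe, new revision, or new topic.

```java
// Tunable string similarity via Levenshtein edit distance (two-row DP).
public class StringSimilarity {

    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Normalised closeness in [0, 1]; the caller picks the threshold that
    // separates dupe / new revision / new topic.
    static double ratio(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / longest;
    }
}
```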
Comment 4 Lee Newson 2013-09-25 23:57:25 EDT
At a quick glance, we could also use Apache Lucene to accomplish this, since it includes matching algorithms that can be customised (see http://docs.jboss.org/hibernate/search/3.2/reference/en/html/search-lucene-native.html). http://stackoverflow.com/questions/7657673/how-to-find-similar-documents is a basic example of how this can be achieved, and http://www.mirkosertic.de/doku.php/architecturedesign/hibernatesearch is a slightly more complex example.

Also thanks Rudi for the links, I know Java has some similar implementations floating around (I've actually used the Apache implementation of the Levenshtein distance algorithm in the csprocessor).
Comment 5 Lee Newson 2013-10-09 01:34:40 EDT
*** Bug 829773 has been marked as a duplicate of this bug. ***
Comment 6 Matthew Casperson 2013-10-30 22:23:44 EDT
Distance algorithms are not going to scale, as they require you to compare the incoming topic with every other topic individually. With 20,000+ topics now (and the potential for a great many more in the future), this kind of workload will become impractical.

I have added a minhash column to the topic table and implemented an algorithm to calculate the minhash based on the same XML breakdown we use for the translations.
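The reason a stored minhash signature scales where pairwise distance comparison does not can be sketched with locality-sensitive banding: slots of the signature are grouped into bands and each band is hashed into a bucket, so only topics sharing at least one bucket are ever candidates for comparison. (Illustrative only; the actual PressGang query mechanism may differ.)

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// In-memory index of minhash signatures using banding: a lookup touches only
// the buckets the incoming signature falls into, never the full topic table.
public class MinHashIndex {
    private final Map<String, Set<Integer>> buckets = new HashMap<>();
    private final int bands, rows;

    MinHashIndex(int bands, int rows) { this.bands = bands; this.rows = rows; }

    private String bandKey(int band, int[] sig) {
        int[] slice = Arrays.copyOfRange(sig, band * rows, (band + 1) * rows);
        return band + ":" + Arrays.hashCode(slice);
    }

    void add(int topicId, int[] sig) {
        for (int b = 0; b < bands; b++)
            buckets.computeIfAbsent(bandKey(b, sig), k -> new HashSet<>())
                   .add(topicId);
    }

    // Candidate topics sharing at least one band with the incoming signature.
    Set<Integer> candidates(int[] sig) {
        Set<Integer> out = new HashSet<>();
        for (int b = 0; b < bands; b++) {
            Set<Integer> hit = buckets.get(bandKey(b, sig));
            if (hit != null) out.addAll(hit);
        }
        return out;
    }
}
```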

You can query for topics with the same minhash using a URL like

Comment 7 Matthew Casperson 2013-10-30 22:42:46 EDT
The endpoint


can be used to post some XML and get a minhash using the same algorithm that generates the minhash property on the topics when they are saved.
Comment 8 Matthew Casperson 2013-10-30 23:00:43 EDT
Post to 


to regenerate the minhash values for all topics. This is useful when the feature is first rolled out, or if the algorithm changes.
Comment 9 Matthew Casperson 2013-10-31 04:16:27 EDT
On second thought, this might be best left to a tool like https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering
Comment 14 Matthew Casperson 2013-11-14 20:42:54 EST
The dev server is computing the last million or so minhashes, but functionally everything is ok to test.
Comment 16 Lee Newson 2013-11-20 19:40:44 EST
Ran into an additional issue with this. If the MinHashXORs table hasn't been initialised, then you won't be able to save or update any topics. The error message you'll get is:

java.lang.IllegalStateException: Did not find a minhash xor int for function 1
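A hypothetical reconstruction of that failure mode (not the actual PressGang code): the minhash calculation needs one XOR constant per hash function, and the lookup fails with exactly this exception when the table was never seeded. Seeding the table at startup if it is empty would avoid the error.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Guard that reproduces the failure mode: each minhash function looks up its
// XOR constant, and the lookup fails if the table was never populated.
public class MinHashXors {
    private final Map<Integer, Integer> xors = new HashMap<>();

    int xorFor(int function) {
        Integer value = xors.get(function);
        if (value == null)
            throw new IllegalStateException(
                    "Did not find a minhash xor int for function " + function);
        return value;
    }

    // One possible fix: seed the table at startup when it is empty.
    void seedIfEmpty(int functionCount, long seed) {
        if (!xors.isEmpty()) return;
        Random random = new Random(seed);
        for (int f = 0; f < functionCount; f++) xors.put(f, random.nextInt());
    }
}
```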
Comment 17 Lee Newson 2013-11-20 19:50:30 EST
I also gave this a quick test since I had to look into the issue above. The implementation appears to be working correctly (that is, a POST to the minhashsimilar endpoint returns similar topics); however, it doesn't actually solve the issue originally reported.

As per the expected results, this should stop topics that are exactly the same from being created. At this stage you still need to manually check whether each topic already exists before attempting to import or create new topics, which isn't very different from the current system (the only difference is that you can now find similar XML content based on a match ratio instead of looking for a 100% match). Since this doesn't solve the original problem, I'm setting this back to ASSIGNED.
Comment 18 Lee Newson 2013-11-25 01:37:51 EST
Just adding to my comment above, I'm mainly referring to the create/bulk import functionality in the UI.
Comment 19 Matthew Casperson 2013-11-27 21:30:22 EST
I'll have to create another bug for any processes that make use of this functionality; it may be used by chopbook, or we may create an import feature in the UI itself.

Once this is live on the prod server I'll talk to Dan about how we move to the next step.
Comment 20 Lee Newson 2013-11-27 22:59:15 EST
In that case I've cloned this bug to handle the Bulk Import operation in the UI (see BZ#1035547) and I'm changing the name of this RFE to make it more descriptive of what it provides.
Comment 21 Lee Newson 2013-11-27 23:37:42 EST
I'm going to keep this as ASSIGNED as well because I found the following issues during testing:

- Unable to create or edit topics until the MinHashXOR table has been populated. The table should either be populated at startup if it's empty, or the code changed so that content can be saved without the table being populated.

- A call to the topic's minhash query will return the topic itself; since this is supposed to be used to find topics similar to the topic specified, it probably shouldn't return itself.

- If you enter a value outside the 0.6-0.9 range for the similar minhash or topic minhash endpoints, you get no error, so you assume it just worked; however, this isn't the case. I'd expect an error to be thrown in this circumstance, because otherwise I'd assume that specifying 0.4 would find all topics that were a 40% match.

Also verified that the following works:

- Generating the minhash XORs using the specified endpoint.

- Generating the minhash value for topics using either the full or missing endpoints.

- The get minhash endpoint returns the calculated minhash values.

- The minhashsimilar endpoint returns topics that are similar to the XML sent (tested with both formatted and unformatted XML).

- Changing the comparison ratio (between 0.6 and 0.9) returns different sets of topics; the lower the value, the greater the distance between the topics.
Comment 22 Lee Newson 2013-11-27 23:44:15 EST
The end of the second issue in the comment above should be "topic specified probably shouldn't return itself."
Comment 25 Lee Newson 2013-11-28 19:52:57 EST
Verified that all the issues in Comment #21 have been fixed.
