Bug 1035547 - RFE -- de-dupe imported topics
Summary: RFE -- de-dupe imported topics
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: PressGang CCMS
Classification: Community
Component: Web-UI
Version: 1.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Matthew Casperson
QA Contact:
URL:
Whiteboard:
Duplicates: 1035948
Depends On: 1011929
Blocks: 1012194
 
Reported: 2013-11-28 03:56 UTC by Lee Newson
Modified: 2014-08-04 22:27 UTC
CC List: 4 users

Fixed In Version:
Clone Of: 1011929
Environment:
Last Closed: 2014-03-09 20:08:17 UTC
Embargoed:



Description Lee Newson 2013-11-28 03:56:50 UTC
Cloned to take care of detecting duplicate topics when doing a Bulk Import in the UI.

+++ This bug was initially created as a clone of Bug #1011929 +++

Description of problem:

Migrating content into PressGang will cause multiple duplicate topics if the books have multiple parallel maintenance streams. 

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. consider two streams of the one book (trunk and branch)
2. import each of them into PressGang via chopbook or similar
3. most or all of the imported topics from the second stream will be exact or near-duplicates of the topics from the first stream. 

Actual results:
Do this across a large suite of books and you will have hundreds, thousands, or tens of thousands of duplicate or near-duplicate topics in the database.

Expected results:
Exact dupes should be detected and rejected at import time, and topic maps adjusted accordingly. Near dupes should probably be imported as different revisions of the existing content and topic maps adjusted accordingly. Even better if user interaction or a flag during import (or reference to the relevant Revision Histories) could determine the relationship between these revisions.

Additional info:

Comment 1 Matthew Casperson 2013-11-29 01:00:50 UTC
*** Bug 1035948 has been marked as a duplicate of this bug. ***

Comment 2 Matthew Casperson 2014-01-13 01:39:50 UTC
Going forward I think we should handle duplication on import like so:

1. Exact matches are replaced with the existing topic automatically. We'll need to store a common hash (by common I mean one that can be computed consistently across programming languages, like SHA256 or MD5) against all topics, expose the hash via a search parameter in a REST query, and have any import applications first hash new topics to see if they already exist (see the sketch after this list).
2. If no exact match exists, import the topics as is.
3. Any near matches should be resolved once the topics have been imported through the reporting available in PressGang or DocBuilder.
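
As a rough sketch of point 1 above (findTopicByHash and createTopic are hypothetical stand-ins for whatever REST calls the import application ends up making):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class ImportDeDupe {

    // Hash the topic XML with SHA-256 so the value is easy to reproduce from any language.
    static String contentHash(String xml) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(xml.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Points 1 and 2: reuse an exact match if one exists, otherwise import the topic as is.
    static int importTopic(String xml) throws Exception {
        Integer existingId = findTopicByHash(contentHash(xml));
        return existingId != null ? existingId : createTopic(xml);
    }

    // Hypothetical helpers standing in for the real REST calls.
    static Integer findTopicByHash(String hash) { return null; }
    static int createTopic(String xml) { return -1; }
}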

Editing a topic that is a close match to imported content would require quite a bit of work. The author has to decide whether to use conditionals, entities, or different revisions, or simply keep the two topics separate. It would be very time consuming to make these decisions at import time.

Importing near matches means that they can be dealt with incrementally, without forcing the import process to be completed in one go or requiring the ability to save an import halfway through.

The downside to importing near matches and sorting them out later is that the database will be filled with a lot of potentially unused topics. But purging unused topics is something that we have to do regardless of any additional junk topics, so this is not a deal breaker.

The additional size of the database is something to consider (Rudi estimates that the imported books will double or triple the current database size). We can either ignore this (a 20 GB database is not really unmanageable), compress the database, or rely on purging topics to get the size back down.

Comment 3 Lee Newson 2014-01-13 04:12:47 UTC
What about the use case where a user would want to use the similar topic (ie just different whitespace) in place of the one they are uploading? In this case I feel we should have a "Save or Reject" option for similar topics when uploading them (even if it's just for matches over 90%), as I agree that doing the editing during the upload would be a large effort.

Comment 4 Matthew Casperson 2014-01-13 21:15:46 UTC
"Save or Reject" is probably a good compromise between importing everything and sorting out the duplicates later, and expecting the user to deal with all duplicates during the import. Implementing it would require the import application to be a graphical wizard which displays the rendered view (or a rendered diff) of the duplicates.

Comment 5 Matthew Casperson 2014-01-20 04:59:36 UTC
Topics now have a TopicContentHash field that is set to the SHA-256 hash of the XML. It is updated whenever a topic is created or saved.

A POST to the http://localhost:8080/pressgang-ccms/rest/1/contenthash/recalculatemissing endpoint will recalculate any missing content hashes.
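
For testing, something like the following should trigger the recalculation (assuming the endpoint takes an empty POST body and no extra authentication is needed on the dev server):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RecalculateHashes {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Ask the server to fill in TopicContentHash for any topics missing one.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/pressgang-ccms/rest/1/contenthash/recalculatemissing"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Status: " + response.statusCode());
    }
}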

Comment 6 Matthew Casperson 2014-01-21 05:34:25 UTC
A POST to http://localhost:8080/pressgang-ccms/rest/1/topic/createormatch/json?message=mcasperson%3A+Initial+Topic+Creation&flag=2&userId=89 will attempt to match the new topic's contents to an existing topic. It normalises the XML (so whitespace differences are ignored) and compares the SHA-256 hash of the normalised XML against existing topics.

If a match is found, that topic is returned. If no match is found, a new topic is created.

The return value is a wrapper around the usual RESTTopicV1 object with a matchedExistingTopic property that indicates if the topic was created or matched to an existing topic, like so:

{
    "topic": { /* usual topic details are here */ },
    "matchedExistingTopic": true
}

All that an import tool would have to do to deal with 100% matches is use this endpoint over the usual http://localhost:8080/pressgang-ccms/rest/1/topic/create/json endpoint, and then account for the slightly changed return value.
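
As a rough sketch of what an import client would do with this endpoint (the request body is the test payload from the next comment; a real client would parse the JSON response rather than string-matching the flag):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateOrMatchClient {
    private static final String ENDPOINT =
            "http://localhost:8080/pressgang-ccms/rest/1/topic/createormatch/json"
            + "?message=mcasperson%3A+Initial+Topic+Creation&flag=2&userId=89";

    public static void main(String[] args) throws Exception {
        // The usual topic JSON body; whitespace differences in the XML are ignored
        // by the server when it looks for an existing match.
        String body = "{\"xml\":\"<section><title>Title</title>"
                + "<para>test content</para></section>\", "
                + "\"configuredParameters\":[\"xml\"]}";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(ENDPOINT))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        // The response wraps the usual RESTTopicV1 object, so a naive check is enough here.
        boolean matched = response.body().contains("\"matchedExistingTopic\":true");
        System.out.println(matched ? "Matched an existing topic" : "Created a new topic");
    }
}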

Comment 7 Matthew Casperson 2014-01-21 05:51:19 UTC
To test this, post the following JSON:

{"xml":"<section><title>Title</title><para>test content</para></section>", "configuredParameters":["xml"]}

then post the same XML (insert some spaces, line breaks and tabs between elements if you want) and confirm that an existing topic is returned.

Comment 8 Matthew Casperson 2014-01-21 06:22:49 UTC
This has been deployed to the dev server.

Comment 9 Lee Newson 2014-01-22 01:48:54 UTC
Is SHA-256 really needed for this? Generally we want our requests to be as fast as possible, and to me MD5 should be fine for this task as it's much less CPU intensive.

Some links to benchmarks for different hashing algorithms:
http://www.codinghorror.com/blog/2012/04/speed-hashing.html
http://thepasswordproject.com/oclhashcat_benchmarking
http://atodorov.org/blog/2013/02/05/performance-test-md5-sha1-sha256-sha512/
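
For local numbers rather than the linked benchmarks, a throwaway loop like this gives a rough idea of the relative digest cost on the JVM (not a proper benchmark, as there is no warm-up handling):

import java.security.MessageDigest;
import java.util.Random;

public class DigestTiming {
    public static void main(String[] args) throws Exception {
        byte[] sample = new byte[16 * 1024];   // roughly topic-sized payload
        new Random(42).nextBytes(sample);

        for (String algorithm : new String[] {"MD5", "SHA-1", "SHA-256"}) {
            MessageDigest digest = MessageDigest.getInstance(algorithm);
            long start = System.nanoTime();
            for (int i = 0; i < 10_000; i++) {
                digest.update(sample);
                digest.digest();               // digest() also resets for the next round
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("%-8s 10,000 x 16 KB: %d ms%n", algorithm, elapsedMs);
        }
    }
}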

Comment 10 Matthew Casperson 2014-01-23 04:19:18 UTC
http://en.wikipedia.org/wiki/MD5: "In 2004 it was shown that MD5 is not collision resistant." 

How much this would affect us with our usage is unclear, but it seems prudent not to rely on an algorithm that has a demonstrated weakness that could potentially affect our processes.

Comment 11 Matthew Casperson 2014-01-23 04:30:53 UTC
Images also need a hash to detect duplicates.

Comment 12 Lee Newson 2014-01-23 04:32:56 UTC
(In reply to Matthew Casperson from comment #10)
> http://en.wikipedia.org/wiki/MD5: "In 2004 it was shown that MD5 is not
> collision resistant." 
> 
> How much this would affect us with our usage is unclear, but it seems
> prudent not to rely on an algorithm that has a demonstrated weakness that
> could potentially affect our processes.

Fair enough, in that case we could still use SHA-1 over SHA-256 as it's less resource intensive (granted the difference is much smaller than with MD5).

It's probably worth mentioning that the MD5 issue appears to only relate to signed certificates (see http://www.cvedetails.com/cve/CVE-2004-2761/), but I do agree it's not worth the risk and we should steer clear of it.

Comment 13 Matthew Casperson 2014-01-23 04:34:02 UTC
See https://github.com/pressgang-ccms/chopbook for the current state of chopbook with duplicate topic detection enabled.

Comment 16 Matthew Casperson 2014-01-23 04:54:38 UTC
http://en.wikipedia.org/wiki/SHA-1: "In 2005, cryptanalysts found attacks on SHA-1 suggesting that the algorithm might not be secure enough for ongoing use."

If we increased the size of our database threefold and saved every topic once a day, moving to SHA-1 would save about 8 seconds of total CPU time. It seems like a small price to pay to use an algorithm that appears to have no known flaws.

Comment 18 Matthew Casperson 2014-01-24 01:22:53 UTC
http://localhost:8080/pressgang-ccms/rest/1/image/createormatch/json has been added, which will attempt to match the language image files for an uploaded image with any that already exist. If all language image files match, then the parent image is returned. If any don't match, then the image is created.

http://localhost:8080/pressgang-ccms/rest/1/contenthash/recalculatemissing will calculate hashes for any language image entities that don't have them in addition to generating the hashes for topics.

Comment 19 Matthew Casperson 2014-01-27 22:05:42 UTC
As of this morning a test import of two versions of an old RHEL book resulted in all but 1 topic being reused. Since we now have the logic in place to de-dupe exact matches as well as identify close matches for manual reconciliation after an import I'll mark this as on-qa.

Comment 23 Lee Newson 2014-01-28 01:14:48 UTC
Also marking this as Failed QA, since it doesn't work for the initial RFE condition (the Bulk Import operation in the UI).

Note: I haven't tested chopbook as this specific RFE is about the UI.

Comment 25 Matthew Casperson 2014-02-07 03:00:40 UTC
Code for new UI import tool can be found at https://github.com/mcasperson/PressGangCCMSImport

Comment 26 Matthew Casperson 2014-02-07 03:11:37 UTC
Need to add the ability to extract a preface from an existing book.
Need to add index if present.
Need to remove revision history if extracted as a separate topic.

Comment 27 Matthew Casperson 2014-02-07 06:06:46 UTC
Need to catch <book status="draft"> as draft.

Comment 28 Lee Newson 2014-02-07 06:29:02 UTC
Had a quick glance over this and it looks good so far. I like the wizard-type workflow for uploading the content.

-----------------------------------------------------------------------------

Here are a couple of usability things I thought would improve it:

- The ZIP upload should be able to handle a structure like:
  - <BOOK_NAME>
    - en-US
    - publican.cfg
  as I imagine most people are going to attempt to upload a ZIP laid out that way rather than just the contents of the book dir.

- Users should be able to specify the commit message; either that, or it should be more meaningful than "Initial Topic Creation"

-----------------------------------------------------------------------------

Also a couple of bugs/issues I found (though I'm fairly sure this was an issue in chopbook as well):

- You are able to close the process while it is uploading without any sort of confirmation.
- Uploading a ZIP, going back and changing the ZIP leaves you with the old ZIP until you refresh.
- Multiple publican.cfg files aren't uploaded.
- Articles fail to upload (due to the app looking for the <bookinfo> element, when it should be looking for <articleinfo>).
- <xref>s in Revision Histories aren't resolved to injections.
- Additional files aren't uploaded (ie those in the en-US/files/ directory).
- <preface> blocks aren't uploaded (the default Preface.xml should be ignored).
- Some entities (HOLDER and YEAR primarily) shouldn't be uploaded in the Entities block and instead should use the "Copyright Holder" and "Copyright Year" metadata fields.

-----------------------------------------------------------------------------

Lastly some things that will need to be added at some stage (I'll file an RFE at a later stage):

- The intro topics should be formatted in the new "Initial Text" format, although the parser will currently still parse the old format.
- Support DocBook 5 content.
- Upload other languages/translations, as atm it only works with en-US (this value should be read from the publican.cfg).

Comment 29 Lee Newson 2014-02-07 06:32:08 UTC
Oops ignore the "<preface> blocks aren't uploaded (the default Preface.xml should be ignored)." issue I mentioned. I missed that Matt had already mentioned it :(

Comment 30 Lee Newson 2014-02-07 07:11:28 UTC
Had a look over the source code and found some additional issues:

- The Revision History, Author Group and Abstract Tag ids are hardcoded when they should be coming from the Server Settings (ie: from /rest/1/settings/get/json)
- Custom Legal Notices also aren't getting uploaded.
- The removeXmlPreamble method will remove any DOCTYPE or XML processing instructions from examples, ie the content below will have the XML processing instruction stripped from the CDATA:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE section>
<section>
    <title>Test Topic</title>
    <para>Below is an example of what your pressgang-ds.xml should look like:</para>
    <programlisting><![CDATA[<?xml version="1.0" encoding="UTF-8" ?>
<datasource>
...
</datasource>]]></programlisting>
</section>

- (Possible Issue) The replaceEntitiesInText method's entity regular expression is too broad; at the moment it will perform replacements inside something like this Java source code example (assuming it is in CDATA):

buffer.append("You should do this & this & that.");

Ideally the regex should be something like: /&[#\w\d]*?;/ (assuming you want to catch character references as well). However, since this is replaced back later and shouldn't cause issues with any other processing, it isn't a big problem.


And another change that should be done is that "DTD" should be changed to "Format" to go with the changes for DocBook 5.0 support.

Comment 31 Lee Newson 2014-02-09 10:52:16 UTC
Was looking at the spec today and the /&[#\w\d]*?;/ regex will actually miss a few cases, so something like /&\S+;/ would perhaps be better (as whitespace definitely isn't allowed in entity names).

Comment 32 Lee Newson 2014-02-09 11:37:22 UTC
/&\S+;/ should have been /&\S+?;/ (I missed the non-greedy quantifier)
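
For illustration, a quick check of both patterns using Java's Pattern syntax (backslashes doubled; &BOOK.TITLE; is just a made-up entity name containing a '.'):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityRegexCheck {
    public static void main(String[] args) {
        // Bare ampersands in a code sample must not be picked up as entities.
        String code = "buffer.append(\"You should do this & this & that.\");";
        String entities = "&PRODUCT; &#169; &amp; &BOOK.TITLE;";

        Pattern narrow = Pattern.compile("&[#\\w\\d]*?;"); // earlier suggestion
        Pattern broad = Pattern.compile("&\\S+?;");        // non-greedy suggestion above

        show("narrow / code    ", narrow, code);     // no matches
        show("broad  / code    ", broad, code);      // no matches
        show("narrow / entities", narrow, entities); // misses &BOOK.TITLE;
        show("broad  / entities", broad, entities);  // catches all four
    }

    static void show(String label, Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        StringBuilder hits = new StringBuilder();
        while (m.find()) {
            hits.append(m.group()).append(' ');
        }
        System.out.println(label + ": " + hits);
    }
}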

Comment 33 Lee Newson 2014-02-09 23:05:20 UTC
Revision Histories are extracted incorrectly. They should be wrapped in an abstract (so that the title is included and can be changed); however, at the moment only the <revhistory> element is extracted.

Comment 34 Lee Newson 2014-02-09 23:44:22 UTC
Usability: Items in the publican /tmp directory are shown in the "Select the main XML file" page. These files should be excluded as they shouldn't be used for importing.

Comment 35 Lee Newson 2014-02-10 00:07:06 UTC
Usability: When an error occurs during the actual import process, an error message is displayed; you click OK, but then you are stuck on the progress page without knowing what to do. From the looks of it the process has stopped due to the error (in my case it was a network dropout during the similarity tests). I should be able to restart the process, or at least move away from the page, without having to refresh.

Bug: When an error occurs, half of the uploaded content exists in the system while the other half does not. As such, everything should be uploaded in one request (or as few as possible). To do this we should probably create an endpoint for createOrMatching multiple topics (ie /rest/1/topics/createOrMatch/json), and the update calls should be done in bulk, not individually.

Comment 36 Lee Newson 2014-02-10 01:17:01 UTC
Parts with content are imported incorrectly. ie

<part>
	<title>Test Part</title>
	<partintro>
		<para>
			This is a test of a &lt;partintro&gt; import.
		</para>
		<section>
			<title>Test partintro</title>
			<para/>
		</section>
	</partintro>
        <chapter>
                <title>Configuring Red Hat High Availability Add-On With <application>Conga</application></title>
                ...
        </chapter>
</part>

creates the following topic:

<section>
	<title>Test Part</title>
	<partintro>
		<para>
			This is a test of a &lt;partintro&gt; import.
		</para>
		<section>
			<title>Test partintro</title>
			<para/>
		</section>
	</partintro>
</section>

whereas it should be imported as two topics like so:

<section>
	<title>Test Part</title>
	<para>
		This is a test of a &lt;partintro&gt; import.
	</para>
</section>

<section>
	<title>Test partintro</title>
	<para/>
</section>

and the Content Spec should be:

Part: Test Part
  Initial Text:
    Test Part [1234]
  Test partintro [5678]
  Chapter: Configuring Red Hat High Availability Add-On With <application>Conga</application>
    ...

or for the older syntax:

Part: Test Part [1234]
  Test partintro [5678]
  Chapter: Configuring Red Hat High Availability Add-On With <application>Conga</application>
    ...

Comment 38 Lee Newson 2014-02-11 06:12:49 UTC
Usability: The last step in the wizard should probably be "Start" (or something similar) instead of "Next" to show that once you move on you can't go back.

Comment 44 Matthew Casperson 2014-02-13 06:04:30 UTC
The above fix has highlighted a limitation with injections: https://bugzilla.redhat.com/show_bug.cgi?id=1064693

Comment 45 Matthew Casperson 2014-02-13 06:36:00 UTC
YEAR and HOLDER are now extracted into Copyright Year and Copyright Holder content spec fields.

This has highlighted a limitation with entities: https://bugzilla.redhat.com/show_bug.cgi?id=1064707

Comment 47 Matthew Casperson 2014-02-13 23:20:43 UTC
xrefs to containers without any initial text are now handled by injections to targets.

