1077439 – RFE: Use lucene indexes to do Copy Trans.

Summary: RFE: Use lucene indexes to do Copy Trans.

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Zanata
Classification:	Retired
Component:	Component-CopyTrans
Sub Component:
Version:	3.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	3.4
Assignee:	Alex Eng
QA Contact:	Zanata-QA Mailling List
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1088122

Reported:	2014-03-18 02:47 UTC by Carlos Munoz
Modified:	2014-07-17 06:39 UTC
CC List:	7 users
Fixed In Version:	3.4.0-SNAPSHOT (git-server-3.3.1-244-gcebf76a)
Story Points:	8
Clone Of:
Environment:
Last Closed:	2014-07-17 06:39:36 UTC
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1076995	0	unspecified	CLOSED	Zanata does not copy the most recent translation	2021-02-22 00:41:40 UTC

Internal Links: 1076995

Description Carlos Munoz 2014-03-18 02:47:25 UTC

Since TM Merge already uses the underlying index to find translation matches, Copy Trans should use the same index so that matching happens in a single place. At the moment, having two implementations of what is essentially the same feature is causing us issues, especially when users expect their TM matches to be the same as what Copy Trans has copied.

Comment 1 David Mason 2014-03-19 01:32:02 UTC

This is combining the backend for CopyTrans and TM Merge.

Concerns about how obsolete documents will be handled - CopyTrans doesn't ignore them at the moment, TM Merge does.

The hash column could still be used for 100% matches (it is currently only used by CopyTrans).
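
To illustrate the two points above, here is a minimal sketch of exact (100%) matching by content hash with a switch for obsolete documents. It is not Zanata's code: the TextFlow class, the function names, the choice of MD5, and the sample strings are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TextFlow:
    """Hypothetical stand-in for a source string and its translation."""
    source: str
    translation: str
    obsolete: bool = False  # True when the owning document is obsolete


def content_hash(source: str) -> str:
    # MD5 is an assumption; any stable hash of the source text works here.
    return hashlib.md5(source.encode("utf-8")).hexdigest()


def build_hash_index(flows):
    """Group text flows by the hash of their source content."""
    index = defaultdict(list)
    for flow in flows:
        index[content_hash(flow.source)].append(flow)
    return index


def find_exact_matches(index, source, include_obsolete=False):
    """Return translations whose source text is identical to `source`."""
    candidates = index.get(content_hash(source), [])
    return [
        f.translation
        for f in candidates
        # Compare the text too, so a hash collision cannot yield a false match.
        if f.source == source and (include_obsolete or not f.obsolete)
    ]


flows = [
    TextFlow("Save file", "Datei speichern"),
    TextFlow("Save file", "Datei sichern", obsolete=True),
    TextFlow("Open file", "Datei öffnen"),
]
index = build_hash_index(flows)
print(find_exact_matches(index, "Save file"))
print(find_exact_matches(index, "Save file", include_obsolete=True))
```

With a single shared lookup like this, the obsolete-document behaviour becomes one explicit parameter rather than a difference between two implementations.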

# Testing

Will need to make sure test data is indexed before tests are run.

Could test different aspects separately:

 - test that indexes can be generated properly
 - test that indexes are used properly during searches
 - test that indexes are updated when new data is added
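
The three aspects above could be sketched as separate tests along these lines. The InMemoryIndex class and the reindex helper are hypothetical stand-ins for the real index and test fixture, which would be Hibernate Search/Lucene on the server; this only shows the shape of the tests.

```python
from collections import defaultdict


class InMemoryIndex:
    """Toy inverted index; a hypothetical stand-in for the real search index."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> ids of matching entries

    def add(self, entry_id, text):
        for term in text.lower().split():
            self._postings[term].add(entry_id)

    def search(self, term):
        return set(self._postings.get(term.lower(), set()))


def reindex(entries):
    """Build a fresh index from (id, text) pairs, as a test fixture would."""
    index = InMemoryIndex()
    for entry_id, text in entries:
        index.add(entry_id, text)
    return index


def test_index_is_generated():
    index = reindex([(1, "Save file"), (2, "Open file")])
    assert index.search("file") == {1, 2}


def test_index_is_used_by_search():
    index = reindex([(1, "Save file"), (2, "Open file")])
    assert index.search("save") == {1}
    assert index.search("missing") == set()


def test_index_is_updated_on_new_data():
    index = reindex([(1, "Save file")])
    assert index.search("print") == set()
    index.add(3, "Print file")
    assert index.search("print") == {3}


for test in (test_index_is_generated, test_index_is_used_by_search,
             test_index_is_updated_on_new_data):
    test()
print("all index tests passed")
```

Each test builds its own index first, which reflects the note that test data must be indexed before the tests run.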

Comment 2 Carlos Munoz 2014-03-19 05:02:28 UTC

Secondary: camunoz

Comment 3 Ding-Yi Chen 2014-03-28 05:40:46 UTC

The tests should also cover CJK languages, both Han characters and punctuation, as we have had bugs in Lucene search with CJK before.

Comment 4 Sean Flanigan 2014-03-31 01:16:44 UTC

(In reply to Ding-Yi Chen from comment #3)
> The test should also cover CJK languages, both Han characters and
> punctuation, as we did have bugs on Lucene search with CJK before.

Thanks.  Do you know the bug numbers?  We should make sure we have tests.

Comment 5 Sean Flanigan 2014-04-08 02:17:49 UTC

Development branch is here: https://github.com/zanata/zanata-server/commits/rhbz1077439

Comment 6 Ding-Yi Chen 2014-04-16 02:16:12 UTC

I don't think it was recorded in Bugzilla; it was discovered in Translate Editor search and fixed straight away.

Yet I can come up with some test cases like:
"性": U+6027 CJK UNIFIED IDEOGRAPH-6027, -ity, nature, character
"。": U+3002 IDEOGRAPHIC FULL STOP
"、": U+3001 IDEOGRAPHIC COMMA (Used to separate items in list)
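
As a sanity check on test data like the above, the characters can be verified against their code points and Unicode names before they are used in fixtures. This is only a sketch: CJK_CASES and check_cases are hypothetical names, and the characters are written as escapes so the file encoding cannot alter them.

```python
import unicodedata

# (character, expected code point, expected Unicode name), taken from the list above.
CJK_CASES = [
    ("\u6027", 0x6027, "CJK UNIFIED IDEOGRAPH-6027"),
    ("\u3002", 0x3002, "IDEOGRAPHIC FULL STOP"),
    ("\u3001", 0x3001, "IDEOGRAPHIC COMMA"),
]


def check_cases(cases):
    """Raise AssertionError if any character does not match its metadata."""
    for char, code_point, name in cases:
        assert ord(char) == code_point
        assert unicodedata.name(char) == name
    return len(cases)


print(check_cases(CJK_CASES), "CJK test characters verified")
```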

Comment 7 Sean Flanigan 2014-04-16 02:34:55 UTC

Pull request is here: https://github.com/zanata/zanata-server/pull/418

Although Hibernate Search/Lucene is indexing the translated contents, we won't be using those fields in the CopyTrans query, so those CJK characters shouldn't give us any trouble unless they appear in source contents. (And right now we are querying by contentHash, which doesn't depend on CJK-compatible Lucene Analyzers.)
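
A small sketch of why the analyzer matters for token-based search but not for a hash lookup. The two tokenizers below are deliberately simplified and hypothetical; they are not Lucene's analyzers, and the character-bigram scheme is only an assumed example of CJK-aware tokenization.

```python
import hashlib


def whitespace_tokens(text):
    """Naive tokenizer: splits on whitespace only."""
    return text.split()


def bigram_tokens(text):
    """Simplified CJK-style tokenizer: overlapping pairs of characters."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def content_hash(text):
    # MD5 is an assumption; the point is only that no tokenizer is involved.
    return hashlib.md5(text.encode("utf-8")).hexdigest()


source = "翻译记忆"  # "translation memory", written without spaces
query = "记忆"       # "memory"

# Token search depends on how the source text was split.
print(query in whitespace_tokens(source))
print(query in bigram_tokens(source))

# An exact-match lookup compares hashes of the whole string.
print(content_hash(source) == content_hash("翻译记忆"))
```

The whitespace tokenizer keeps the whole phrase as one token, so the shorter query finds nothing, while the hash comparison never looks at tokens at all.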

Comment 8 Alex Eng 2014-05-01 00:36:25 UTC

Pull request:
https://github.com/zanata/zanata-server/pull/418

We've implemented Lucene search for Copy Trans (the same as TM Merge), but it is disabled at the moment due to performance. This pull request is now mainly a refactoring of the unit tests.

Comment 9 Damian Jansen 2014-05-05 04:32:24 UTC

Verified
