Bug 554481

Summary: Environment var TESSDATA_PREFIX not set; causes gscan2pdf to ignore tesseract language data.
Product: [Fedora] Fedora
Reporter: Daniel Berlin <dan.btown>
Component: gscan2pdf
Assignee: Bernard Johnson <bjohnson>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: low
Version: 12
CC: bjohnson, karlikt, rpandit, vitor.dominor
Keywords: EasyFix, i18n
Hardware: i386
OS: Linux
Fixed In Version: gscan2pdf-0.9.30-2.fc12
Doc Type: Bug Fix
Last Closed: 2010-03-06 03:48:28 UTC

Description Daniel Berlin 2010-01-11 20:08:35 UTC
User-Agent:       Mozilla/5.0 (X11; U; Linux i686; de; rv:1.9.1.6) Gecko/20100107 Fedora/3.5.6-1.fc12 Firefox/3.5.6

Observed with:
tesseract 2.04-1.f12
tesseract-langpacks 2.00-5.f12
gscan2pdf 0.9.27-5.f12

After installing the above-mentioned RPMs, the environment variable TESSDATA_PREFIX remains unset (or set to an empty string).

This causes gscan2pdf not to see the installed tesseract language data in the directory /usr/share/tesseract/tessdata; thus it is not possible to choose from the installed language packs in the gscan2pdf dialogue Tools>OCR.


Reproducible: Always

Steps to Reproduce:
1. Install the stated packages
2. Start gscan2pdf
3. Go to dialogue Tools>OCR.
4. Choose Tesseract from the first selection list.
5. Try to choose a language other than English from the second selection list.
Actual Results:
The second selection list contains only the option "English".

Expected Results:
The second selection list should also contain the names of the other installed language packs (like "French", "German", etc.).

The environment variable TESSDATA_PREFIX should be set to "/usr/share/tesseract" to fix the problem.
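
As a minimal sketch of that workaround, assuming the language data really is installed under /usr/share/tesseract/tessdata as described in this report, the variable can be set for the current shell session before starting gscan2pdf. The existence check is only an illustration and is not something gscan2pdf itself performs.

```shell
# Workaround sketch: point TESSDATA_PREFIX at the parent of the tessdata
# directory. The path is the one named in this report; adjust it if your
# installation differs.
export TESSDATA_PREFIX=/usr/share/tesseract

# Sanity check: the directory derived from the variable should exist.
if [ -d "$TESSDATA_PREFIX/tessdata" ]; then
    echo "language data directory found: $TESSDATA_PREFIX/tessdata"
else
    echo "no tessdata directory under $TESSDATA_PREFIX" >&2
fi

# Then start gscan2pdf from the same shell so that it inherits the variable:
# gscan2pdf
```

The setting only lasts for that shell session; making it permanent (for example in a shell profile) is a local choice and not something this report prescribes.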

The bug reporter is not sure if this should be fixed in the tesseract RPMs or the gscan2pdf RPM of Fedora.

As a possible hint, the author of gscan2pdf gave the following clues (on the gscan2pdf help mailing list) when approached regarding the bug reported above:

"The problem is that gscan2pdf looks in
 
 /usr/share/tessdata
 /usr/local/share/tessdata
 $TESSDATA_PREFIX/tessdata
 /usr/share/tesseract-ocr/tessdata (tesseract-ocr is the Debian
 package name)
 
So if you set TESSDATA_PREFIX to /usr/share/tesseract then you should
be good to go",

and, secondly,

"To be fair to [the Fedora packagers], for tesseract on the command line, you only need
to set TESSDATA_PREFIX if you put the language packages in a different
place than specified when compiling. Fedora would be better off
patching gscan2pdf to look where they put tesseract. Now you have told
me, I'll fix the next release [of gscan2pdf] to look in /usr/share/tesseract as well."
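
The directory search described in the first quote can be approximated in a few lines of shell, which may help to check what gscan2pdf would find on a given system. This is only a sketch: list_tess_langs is a hypothetical helper and not part of gscan2pdf, and it assumes that each tesseract 2.x language pack ships a <lang>.unicharset file in the tessdata directory.

```shell
# Sketch only: mimics the directory search described in the quote above.
# list_tess_langs is a hypothetical helper, not part of gscan2pdf.
# Optional arguments override the default directory list (handy for testing).
list_tess_langs() {
    if [ "$#" -eq 0 ]; then
        set -- /usr/share/tessdata \
               /usr/local/share/tessdata \
               "${TESSDATA_PREFIX:-}/tessdata" \
               /usr/share/tesseract-ocr/tessdata
    fi
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        for f in "$dir"/*.unicharset; do
            [ -e "$f" ] || continue       # the glob matched nothing
            basename "$f" .unicharset     # e.g. "deu" from deu.unicharset
        done
    done
}

list_tess_langs
```

With TESSDATA_PREFIX unset, the third entry collapses to /tessdata, so on Fedora 12 none of the four directories matches /usr/share/tesseract/tessdata and no installed language pack is found there.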

Comment 1 Daniel Berlin 2010-01-11 21:44:31 UTC
(In reply to comment #0)

Regarding the bugzilla ticket at hand, the author of gscan2pdf reported later today on the gscan2pdf help mailing list:

"I have already fixed the development version. As soon as I have fixed
a particularly nasty (different) little bug that has been eluding me
for the last couple of months, I'll release it as 0.9.30."

Comment 2 Daniel Berlin 2010-01-17 15:47:56 UTC
(In reply to comment #0)
Please also note this feature request:

http://code.google.com/p/tesseract-ocr/issues/detail?id=89&can=1

Comment 3 Daniel Berlin 2010-01-17 16:01:47 UTC
(In reply to comment #0)

The bug reporter reclassified this ticket from the tesseract component to the gscan2pdf component because the unexpected behaviour shows when using gscan2pdf, not when using tesseract and the tesseract language packs alone.

Comment 4 Fedora Update System 2010-02-03 06:29:39 UTC
gscan2pdf-0.9.30-1.fc12 has been submitted as an update for Fedora 12.
http://admin.fedoraproject.org/updates/gscan2pdf-0.9.30-1.fc12

Comment 5 Fedora Update System 2010-02-05 01:49:32 UTC
gscan2pdf-0.9.30-1.fc12 has been pushed to the Fedora 12 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
su -c 'yum --enablerepo=updates-testing update gscan2pdf'
You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F12/FEDORA-2010-1506

Comment 6 Fedora Update System 2010-02-12 23:48:43 UTC
gscan2pdf-0.9.30-2.fc12 has been submitted as an update for Fedora 12.
http://admin.fedoraproject.org/updates/gscan2pdf-0.9.30-2.fc12

Comment 7 Fedora Update System 2010-02-16 13:23:04 UTC
gscan2pdf-0.9.30-2.fc12 has been pushed to the Fedora 12 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
su -c 'yum --enablerepo=updates-testing update gscan2pdf'
You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F12/FEDORA-2010-1936

Comment 8 Fedora Update System 2010-03-06 03:48:14 UTC
gscan2pdf-0.9.30-2.fc12 has been pushed to the Fedora 12 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 9 vitor.dominor 2011-08-19 20:38:52 UTC
In addition, it is necessary to change some lines in /usr/bin/gscan2pdf so that it can locate the tesseract 3.00 language files, which now have the extension .traineddata. On lines 11098, 11101 and 11106, the occurrences of unicharset need to be replaced with traineddata. After that, gscan2pdf correctly detects and lists the languages in the OCR dialog.
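
The edit described above could be scripted roughly as follows. This is a sketch under stated assumptions: patch_gscan2pdf is a hypothetical helper, the line numbers are taken from this comment and only apply to the exact gscan2pdf build it refers to, and `sed -i` assumes GNU sed as shipped with Fedora.

```shell
# Sketch only: patch_gscan2pdf is a hypothetical helper. The line numbers come
# from the comment above and are specific to one installed gscan2pdf version.
patch_gscan2pdf() {
    script=${1:-/usr/bin/gscan2pdf}
    # Keep a backup, then rewrite "unicharset" on the three named lines only.
    cp -p "$script" "$script.orig" &&
    sed -i -e '11098s/unicharset/traineddata/g' \
           -e '11101s/unicharset/traineddata/g' \
           -e '11106s/unicharset/traineddata/g' "$script"
}

# Inspect the lines first; run the patch (as root) only if they look right:
# sed -n '11098p;11101p;11106p' /usr/bin/gscan2pdf
# patch_gscan2pdf
```

Any package update of gscan2pdf will overwrite such a local change, so it is at best a stopgap until the fix lands upstream.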

Comment 10 Bernard Johnson 2011-08-19 21:46:58 UTC
(In reply to comment #9)
> In addition, it is necessary to change some lines in /usr/bin/gscan2pdf, so
> that it can locate the tesseract 3.00 language files now with the extension of
> .traineddata. In the lines 11098, 11101 and 11106, the occurrences of
> unicharset need to be replaced with traineddata. After that, gscan2pdf correctly
> detects and lists the languages in the OCR dialog.

You should provide this information to the upstream developer since it's a change in the way gscan2pdf works.

Comment 11 vitor.dominor 2011-08-20 01:58:47 UTC
I didn't know this before, but it looks like someone has already done that: http://sourceforge.net/tracker/?func=detail&aid=3246957&group_id=174140&atid=868098.