User-Agent: Mozilla/5.0 (X11; U; Linux i686; de; rv:1.9.1.6) Gecko/20100107 Fedora/3.5.6-1.fc12 Firefox/3.5.6

Observed with:
tesseract 2.04-1.f12
tesseract-langpacks 2.00-5.f12
gscan2pdf 0.9.27-5.f12

After installing the abovementioned RPMs, the environment variable TESSDATA_PREFIX remains unset (or set to an empty string). This causes gscan2pdf not to see the installed tesseract language data in the directory /usr/share/tesseract/tessdata; thus it is not possible to choose from the installed language packs in the gscan2pdf dialogue Tools>OCR.

Reproducible: Always

Steps to Reproduce:
1. Install the stated packages
2. Start gscan2pdf
3. Go to dialogue Tools>OCR.
4. Choose Tesseract from the first selection list.
5. Try to choose other languages than English from the second selection list.

Actual Results:
2nd selection list only contains option "English"

Expected Results:
2nd selection list should also contain names of other installed language packs (like "French", "German", etc.)

The environment variable TESSDATA_PREFIX should be set to "/usr/share/tesseract" to fix the problem. The bug reporter is not sure if this should be fixed in the tesseract RPMs or the gscan2pdf RPM of Fedora.

As a possible hint, the author of gscan2pdf gave the following clues (on the gscan2pdf help mailing list) when approached regarding the bug reported above:

"The problem is that gscan2pdf looks in
/usr/share/tessdata
/usr/local/share/tessdata
$TESSDATA_PREFIX/tessdata
/usr/share/tesseract-ocr/tessdata (tesseract-ocr is the Debian package name)
So if you set TESSDATA_PREFIX to /usr/share/tesseract then you should be good to go",

and, secondly,

"To be fair to [the Fedora packagers], for tesseract on the command line, you only need to set TESSDATA_PREFIX if you put the language packages in a different place than specified when compiling. Fedora would be better off patching gscan2pdf to look where they put tesseract.
Now you have told me, I'll fix the next release [of gscan2pdf] to look in /usr/share/tesseract as well."
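To illustrate why setting TESSDATA_PREFIX works around the problem, here is a minimal Python sketch of the directory search described in the quoted message. It is not gscan2pdf's actual code (gscan2pdf is a Perl program). The function names, the probing order, and the choice to skip an unset or empty TESSDATA_PREFIX are assumptions made for illustration only.

```python
import os
import posixpath

# Directories named in the gscan2pdf author's message quoted above.
FIXED_DIRS = ["/usr/share/tessdata", "/usr/local/share/tessdata"]
DEBIAN_DIR = "/usr/share/tesseract-ocr/tessdata"


def candidate_tessdata_dirs(environ):
    """Return the directories to probe (hypothetical helper).

    Assumption: the order follows the quoted list, and an unset or
    empty TESSDATA_PREFIX contributes no candidate.
    """
    dirs = list(FIXED_DIRS)
    prefix = environ.get("TESSDATA_PREFIX", "")
    if prefix:
        dirs.append(posixpath.join(prefix, "tessdata"))
    dirs.append(DEBIAN_DIR)
    return dirs


def find_tessdata_dir(environ, isdir=os.path.isdir):
    """Return the first candidate that exists, or None if none does."""
    for directory in candidate_tessdata_dirs(environ):
        if isdir(directory):
            return directory
    return None
```

On a system where the language data lives only in /usr/share/tesseract/tessdata, `find_tessdata_dir({})` finds nothing under these assumptions, while `find_tessdata_dir({"TESSDATA_PREFIX": "/usr/share/tesseract"})` returns that directory, matching the behaviour reported above.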
(In reply to comment #0) Regarding the bugzilla ticket at hand, the author of gscan2pdf reported later today on the gscan2pdf help mailing list: "I have already fixed the development version. As soon as I have fixed a particularly nasty (different) little bug that has been eluding me for the last couple of months, I'll release it as 0.9.30."
(In reply to comment #0) Please also note this feature request: http://code.google.com/p/tesseract-ocr/issues/detail?id=89&can=1
(In reply to comment #0) The bug reporter reclassified this ticket from the tesseract component to the gscan2pdf component because the unexpected behaviour shows when using gscan2pdf, not when using tesseract and the tesseract language packs alone.
gscan2pdf-0.9.30-1.fc12 has been submitted as an update for Fedora 12. http://admin.fedoraproject.org/updates/gscan2pdf-0.9.30-1.fc12
gscan2pdf-0.9.30-1.fc12 has been pushed to the Fedora 12 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update gscan2pdf'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F12/FEDORA-2010-1506
gscan2pdf-0.9.30-2.fc12 has been submitted as an update for Fedora 12. http://admin.fedoraproject.org/updates/gscan2pdf-0.9.30-2.fc12
gscan2pdf-0.9.30-2.fc12 has been pushed to the Fedora 12 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update gscan2pdf'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F12/FEDORA-2010-1936
gscan2pdf-0.9.30-2.fc12 has been pushed to the Fedora 12 stable repository. If problems still persist, please make note of it in this bug report.
In addition, it is necessary to change some lines in /usr/bin/gscan2pdf so that it can locate the tesseract 3.00 language files, which now have the extension .traineddata. On lines 11098, 11101 and 11106, the occurrences of unicharset need to be replaced with traineddata. After that, gscan2pdf correctly detects and lists the languages in the OCR dialog.
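The extension change can be sketched in Python as follows. This is not the code in /usr/bin/gscan2pdf (which is Perl); the function name and the sample file names are hypothetical. Unlike the edit described above, which replaces unicharset with traineddata, this sketch accepts both extensions so that either tesseract layout is recognised; that is a design assumption of the example, not something the upstream fix is known to do.

```python
import posixpath

# ".unicharset" is the extension the comment above says the script matched;
# ".traineddata" is the tesseract 3.00 extension it names.
LANGUAGE_FILE_EXTENSIONS = (".unicharset", ".traineddata")


def language_codes(filenames):
    """Return sorted, de-duplicated language codes from tessdata file names."""
    codes = set()
    for name in filenames:
        base = posixpath.basename(name)
        for ext in LANGUAGE_FILE_EXTENSIONS:
            # Require a non-empty code in front of the extension.
            if base.endswith(ext) and len(base) > len(ext):
                codes.add(base[: -len(ext)])
    return sorted(codes)
```

For example, `language_codes(["eng.unicharset", "deu.traineddata"])` yields both `deu` and `eng`, whereas a matcher that only knows `.unicharset` would list English alone.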
(In reply to comment #9)
> In addition, it is necessary to change some lines in /usr/bin/gscan2pdf, so
> that it can locate the tesseract 3.00 language files now with the extension of
> .traineddata. In the lines 11098, 11101 and 11106, the occurrences of
> unicharset need to be replaced with traineddata. After that, gscanpdf correctly
> detects and lists the languages in the OCR dialog.

You should provide this information to the upstream developer since it's a change in the way gscan2pdf works.
I didn't know this before, but it looks like someone has already done that: http://sourceforge.net/tracker/?func=detail&aid=3246957&group_id=174140&atid=868098