Bug 2124585
| Summary: | Bad encoding since 9.56 | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | daniel.debaerdemaeker <daniel.debaerdemaeker> | ||||||
| Component: | ghostscript | Assignee: | Richard Lescak <rlescak> | ||||||
| Status: | CLOSED EOL | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||||
| Severity: | high | Docs Contact: | |||||||
| Priority: | unspecified | ||||||||
| Version: | 35 | CC: | akhaitovich, mjg, mosvald, rlescak, zdohnal | ||||||
| Target Milestone: | --- | ||||||||
| Target Release: | --- | ||||||||
| Hardware: | x86_64 | ||||||||
| OS: | Linux | ||||||||
| Last Closed: | 2022-12-13 18:08:34 UTC | Type: | Bug | ||||||
Description
daniel.debaerdemaeker
2022-09-06 14:38:02 UTC
Could you try your gs invocations with '-dNEWPDF=false'? This reverts to the legacy interpreter. Also, can you provide a reproducer? As I understand, signature and OCR don't really play a role, but you are combining image and text layers (from OCR) using a specific gs command line to get parsable text.

On a side note: If gs were built with tesseract it could do this directly (though ocrmypdf is reported to do it better). mupdf might be an alternative; it is built against tesseract in Fedora (and comes from Artifex just like gs).

As I understand from ocrmypdf, it has to do with the new PDF interpreter, so they added the switch -dNEWPDF=false (we are using ocrmypdf).

(In reply to daniel.debaerdemaeker from comment #2)
> as i understand from ocrmypdf, it has to do with the new pdf interpreter, so
> they added the switch -dNEWPDF=false
> (we are using ocrmypdf).

... but they test for 9.56.0 only, not for 9.56.0 and higher ...

My recommendation for a quick fix is still the same, as is the request for a reproducer. Upstream is keen to get the new interpreter bug-free, as it is much faster and the old one will be deprecated soon. Maybe ocrmypdf has a pertinent GitHub issue, or they have filed a gs bz already?

So, I've tried out ocrmypdf-13.7.0-1.fc36 with ghostscript-9.56.1-2.fc36.x86_64 on signed (and unsigned) PDFs, with --force-ocr and with --skip-text, and everything works as expected (evince, mupdf, zathura, chrome, firefox-104.0.1-1 as in current F36). Note that this is with the new PDF interpreter!

Upstream uses "-dNEWPDF=false" (to force the old interpreter) only with gs 9.56.0. The commit says "Speculation: Ghostscript 9.56 new PDF interpreter breaks things" (https://github.com/ocrmypdf/OCRmyPDF/commit/84b9d4d021113560948274f35712668381d00ea2), and a side remark in some issue claims that gs 9.56 removes OCR'ed text layers, which does not appear to be the case, at least not generally.
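For illustration, the difference between testing for 9.56.0 only and testing for 9.56.0 and higher can be sketched in Python as below. This is a hypothetical helper, not OCRmyPDF's actual code: the function names and the `exact_only` parameter are assumptions made for the example, and only the switch `-dNEWPDF=false` and the version numbers come from this thread. No upper version bound is considered.

```python
def parse_version(text):
    """Turn a version string such as '9.56.1' into a tuple like (9, 56, 1)."""
    parts = [int(part) for part in text.strip().split(".")]
    # Pad short versions ('9.56') so tuple comparison behaves as expected.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def legacy_interpreter_args(gs_version, exact_only=False):
    """Return extra gs arguments that force the legacy PDF interpreter.

    exact_only=True mimics a check for 9.56.0 only;
    exact_only=False covers 9.56.0 and higher.
    """
    version = parse_version(gs_version)
    first_new_interpreter = (9, 56, 0)
    if exact_only:
        needs_switch = version == first_new_interpreter
    else:
        needs_switch = version >= first_new_interpreter
    return ["-dNEWPDF=false"] if needs_switch else []
```

With the exact-match variant, the reporter's ghostscript 9.56.1 would not get the switch, which matches the remark above that only 9.56.0 is tested for.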
So, I'll set needinfo, because we will have to close this without a clearer report, ideally consisting of:
- the PDF used
- the ocrmypdf command line used
- the output of ocrmypdf run with "-v 2"
- a description of "missing a lot of content"
- the Firefox version used

Thanks!

Created attachment 1910259
original signed pdf

Created attachment 1910261
after OCR with ocrmypdf
ocrmypdf -l nld+fra --optimize 3 --rotate-pages --rotate-pages-threshold 2.0 --pdf-renderer auto --skip-text betaling\ fietsvergoeding-oct2021.pdf betaling\ fietsvergoeding-ocr.pdf
version
ocrmypdf --version
12.7.2
ghostscript-9.56.1-1.fc35.x86_64
If I do it with ghostscript-9.55.0-2.fc35.x86_64, it works fine.
Firefox version: 91.13.0esr
Thanks, this helps quite a bit! I can extract text from ...-ocr.pdf with pdftotext, zathura-pdf-mupdf, google-chrome as expected, including accents and trema. I can also do this with Firefox (firefox-104.0.1-1.fc36) without any problems.

So, either I misunderstand where the encoding problem shows (I suspected accents etc.), or it depends on the Firefox version. To nail this down further, I would ask you to:
- try with Firefox 104 (on a test profile/test account, it can break things for your ESR FF!)
- provide the ...-ocr.pdf from gs 9.55.0 (I could produce one, but better go by your original)

Right now I suspect that gs 9.56 is doing something differently and pdf.js in FF 91 is doing something wrong.

Just in case the need should arise: Are you OK with me forwarding your PDFs to upstream (Artifex is the producer of ghostscript and the ecosystem)? It contains personal data (which might be available publicly already, but nonetheless).

You may forward it to Artifex (it has only my name and electronic signature).
I tested:
- with Firefox 104 64-bit on our Windows environment: not OK
- on Firefox 100 64-bit (Fedora 35, 3 months ago): OK

We don't have the original (produced with gs 9.55.0 in the OP's setup), but I ran ocrmypdf on that input using different gs versions. Indeed, gs 9.55.0 handles the Myriad Pro font differently - apparently, it converts it to Unicode, and gs 9.56.1 leaves it in its original encoding. I don't know if this is "wrong" or just a different choice - after all, the resulting PDF works on F35 and up with Firefox as shipped. (I'm still not sure what "not ok/does work" means, though.)

This message is a reminder that Fedora Linux 35 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora Linux 35 on 2022-12-13. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a 'version' of '35'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, change the 'version' to a later Fedora Linux version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora Linux 35 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora Linux, you are encouraged to change the 'version' to a later version prior to this bug being closed.

Fedora Linux 35 entered end-of-life (EOL) status on 2022-12-13. Fedora Linux 35 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux please feel free to reopen this bug against that version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see the version field.

If you are unable to reopen this bug, please file a new report against an active release.

Thank you for reporting this bug and we are sorry it could not be fixed.
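For anyone retesting on a maintained release: the text-extraction check mentioned in this thread (pdftotext output, including accents and trema) could be scripted roughly as below. This is only a sketch under stated assumptions: the function names are made up for the example, pdftotext from poppler-utils is assumed to be installed, and comparing sets of accented letters is a simplification chosen here, not a method anyone in this bug actually used.

```python
import subprocess
import unicodedata


def extract_text(pdf_path):
    """Run pdftotext on a PDF and return the extracted text ('-' means stdout)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def accented_letters(text):
    """Return the set of letters in text that carry a diacritic."""
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue
        # A letter with a diacritic decomposes into base letter + combining mark.
        decomposed = unicodedata.normalize("NFD", ch)
        if any(unicodedata.combining(c) for c in decomposed):
            found.add(ch)
    return found


def lost_accents(reference_text, candidate_text):
    """Accented letters present in the reference but absent from the candidate."""
    return accented_letters(reference_text) - accented_letters(candidate_text)
```

One possible use would be to call `extract_text` on the OCR output produced with gs 9.55.0 and on the one produced with gs 9.56.1, then pass both results to `lost_accents`. An empty set would suggest the accented characters survive in both, at least as far as pdftotext is concerned; it says nothing about how pdf.js in Firefox handles the file.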