Bug 2124585

Summary: Bad encoding since 9.56
Product: [Fedora] Fedora
Reporter: daniel.debaerdemaeker <daniel.debaerdemaeker>
Component: ghostscript
Assignee: Richard Lescak <rlescak>
Status: CLOSED EOL
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: unspecified
Version: 35
CC: akhaitovich, mjg, mosvald, rlescak, zdohnal
Hardware: x86_64   
OS: Linux   
Last Closed: 2022-12-13 18:08:34 UTC
Type: Bug
Attachments:
- original signed pdf (flags: none)
- after ocr with ocr mypdf (flags: none)

Description daniel.debaerdemaeker 2022-09-06 14:38:02 UTC
Description of problem:
We are using ocrmypdf to OCR documents.
After a dnf update, ghostscript was upgraded from version 9.55 to version 9.56.
PDFs that are signed and afterwards OCRed (they lose the signature in the process) are missing a lot of content when viewed with pdf.js (used in Firefox), but display correctly in Chromium.

Version-Release number of selected component (if applicable):
9.56


How reproducible:
Always

Steps to Reproduce:
1. Sign a PDF
2. OCR the PDF with tesseract 4.1.3-1
3. Create a PDF/A with ghostscript

Actual results:
Information is lost when the PDF is displayed in Firefox.
Information is correct in Chromium.

Expected results:
information correct in both environments

Additional info:

Comment 1 Michael J Gruber 2022-09-06 15:00:07 UTC
Could you try your gs invocations with '-dNEWPDF=false'? This reverts back to the legacy interpreter.

Also, can you provide a reproducer? As I understand, signature and ocr don't really play a role, but you are combining image and text layers (from OCR) using a specific gs command line to get parsable text.

On a side note: If gs were built with tesseract it could do this directly (though ocrmypdf is reported to do it better). mupdf might be an alternative, it is built against tesseract in Fedora (and comes from Artifex just like gs).
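
As a rough illustration of the '-dNEWPDF=false' suggestion above, the following Python sketch assembles a Ghostscript command line for a pdfwrite run with the legacy interpreter forced. The function name `build_gs_pdfa_command` and the file names are hypothetical, and the thread does not show the exact gs invocation that ocrmypdf uses, so the other switches are only a plausible minimal set. A real PDF/A conversion usually needs a PDF/A definition file as well, which is omitted here.

```python
import shlex
from typing import List


def build_gs_pdfa_command(src: str, dst: str, legacy_interpreter: bool = True) -> List[str]:
    """Return an argument list for a Ghostscript pdfwrite run.

    This is an illustrative sketch, not the command line ocrmypdf generates.
    """
    cmd = [
        "gs",
        "-dBATCH",                        # exit after processing the input
        "-dNOPAUSE",                      # do not prompt between pages
        "-sDEVICE=pdfwrite",              # write a PDF
        "-dPDFA=2",                       # request PDF/A-2 output
        "-sColorConversionStrategy=RGB",
    ]
    if legacy_interpreter:
        # Switch from comment 1: fall back to the old PDF interpreter.
        cmd.append("-dNEWPDF=false")
    # Switches must come before the input file.
    cmd += [f"-sOutputFile={dst}", src]
    return cmd


if __name__ == "__main__":
    print(shlex.join(build_gs_pdfa_command("input.pdf", "output.pdf")))
```

Running the same input once with and once without the switch gives two output files to compare, which is the quickest way to tell whether the new interpreter is involved.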

Comment 2 daniel.debaerdemaeker 2022-09-06 15:26:01 UTC
As I understand from ocrmypdf, it has to do with the new PDF interpreter, so they added the switch -dNEWPDF=false
(we are using ocrmypdf).

Comment 3 Michael J Gruber 2022-09-06 15:58:05 UTC
(In reply to daniel.debaerdemaeker from comment #2)
> as i understand from ocrmypdf, it has to do with the new pdf interpreter, so
> they added the switch -dNEWPDF=false
> (we are using ocrmypdf).

... but they test for 9.56.0 only, not for 9.56.0 and higher ...

My recommendation for a quick fix is still the same, as is the request for a reproducer. Upstream is keen to get the new interpreter bug-free, as it is much faster and the old one will be deprecated soon. Maybe ocrmypdf has a relevant GitHub issue, or they have filed a gs bz already?
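
The difference between testing "for 9.56.0 only" and "for 9.56.0 and higher" can be sketched as follows. This is not ocrmypdf's actual code; the function names are hypothetical and the version strings are assumed to be plain dotted numbers such as those printed by `gs --version`.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string like '9.56.1' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def legacy_needed_exact(gs_version: str) -> bool:
    """Workaround applied only to 9.56.0 itself."""
    return parse_version(gs_version) == (9, 56, 0)


def legacy_needed_at_least(gs_version: str) -> bool:
    """Workaround applied to 9.56.0 and every later version."""
    return parse_version(gs_version) >= (9, 56, 0)


if __name__ == "__main__":
    for version in ("9.55.0", "9.56.0", "9.56.1"):
        print(version, legacy_needed_exact(version), legacy_needed_at_least(version))
```

With the exact comparison, 9.56.1 (the version in this report) does not get the workaround. Whether an open-ended "at least" check is the right policy is a separate question: since the old interpreter is expected to be deprecated, a real check would probably need an upper bound too.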

Comment 4 Michael J Gruber 2022-09-07 09:08:58 UTC
So, I've tried out ocrmypdf-13.7.0-1.fc36 with ghostscript-9.56.1-2.fc36.x86_64 on signed (and unsigned) PDFs, with --force-ocr and with --skip-text, and everything works as expected (evince, mupdf, zathura, chrome, firefox-104.0.1-1 as in current F36). Note that this is with the new PDF interpreter!

ocrmypdf upstream uses "-dNEWPDF=false" (to force the old interpreter) only with gs 9.56.0. The commit says "Speculation: Ghostscript 9.56 new PDF interpreter breaks things" (https://github.com/ocrmypdf/OCRmyPDF/commit/84b9d4d021113560948274f35712668381d00ea2), and a side remark in some issue claims that gs 9.56 removes OCRed text layers, which does not appear to be the case, at least not generally.

So, I'll set needinfo, because we will have to close this without a clearer report, ideally consisting of:

- the used PDF
- the used ocrmypdf command line
- output of ocrmypdf run with "-v 2"
- a description of "missing a lot of content"
- the firefox version which is used

Thanks!
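
Part of the information requested above (tool versions) can be gathered with a small script. The sketch below is only one way to do it: the helper names are hypothetical, it assumes the tools are called `gs`, `ocrmypdf` and `firefox` on the PATH, and it assumes each of them prints its version when given `--version`.

```python
import shutil
import subprocess
from typing import Dict, Sequence


def tool_version(argv: Sequence[str]) -> str:
    """Run a '--version' style command and return its first output line."""
    if shutil.which(argv[0]) is None:
        return "not found"
    try:
        result = subprocess.run(list(argv), capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"error: {exc}"
    # Some tools print their version on stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else "no output"


def collect_versions() -> Dict[str, str]:
    """Collect the versions of the tools involved in this report."""
    commands = {
        "ghostscript": ["gs", "--version"],
        "ocrmypdf": ["ocrmypdf", "--version"],
        "firefox": ["firefox", "--version"],
    }
    return {name: tool_version(argv) for name, argv in commands.items()}


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```

The PDF itself, the exact ocrmypdf command line, and the "-v 2" output still have to be attached by hand.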

Comment 5 daniel.debaerdemaeker 2022-09-07 15:59:34 UTC
Created attachment 1910259 [details]
original signed pdf

Comment 6 daniel.debaerdemaeker 2022-09-07 16:08:26 UTC
Created attachment 1910261 [details]
after ocr with ocr mypdf

 ocrmypdf -l nld+fra --optimize 3 --rotate-pages --rotate-pages-threshold 2.0 --pdf-renderer auto --skip-text betaling\ fietsvergoeding-oct2021.pdf betaling\ fietsvergoeding-ocr.pdf

version
 ocrmypdf --version
12.7.2

ghostscript-9.56.1-1.fc35.x86_64


If I do it with ghostscript-9.55.0-2.fc35.x86_64, it works fine.

Firefox version: 91.13.0esr

Comment 7 Michael J Gruber 2022-09-08 08:26:14 UTC
Thanks, this helps quite a bit!

I can extract text from ...-ocr.pdf with pdftotext, zathura-pdf-mupdf, google-chrome as expected, including accents and trema.

I can also do this with Firefox (firefox-104.0.1-1.fc36) without any problems. So, either I misunderstand where the encoding problem shows (I suspected accents etc.), or it depends on the Firefox version. To nail this down further, I would ask you to:

- try with Firefox 104 (on a test profile/test account, it can break things for your ESR FF!)
- provide the ...-ocr.pdf from gs 9.55.0 (I could produce one, but better go by your original)

Right now I suspect that gs 9.56 is doing something differently and pdf.js in FF 91 is doing something wrong.

Just in case the need should arise: Are you OK with me forwarding your PDFs to upstream (Artifex is the producer of ghostscript and the ecosystem)? They contain personal data (which might be available publicly already, but nonetheless).
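
The pdftotext check mentioned at the top of this comment can be scripted. The sketch below assumes `pdftotext` from poppler-utils is installed; the function names are hypothetical. It extracts the text layer and reports which accented letters survive, which is a rough proxy for "including accents and trema".

```python
import subprocess
import unicodedata
from typing import Set


def extract_text(pdf_path: str) -> str:
    """Return the text layer of a PDF as UTF-8 text, using pdftotext."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],  # "-" writes to stdout
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


def accented_letters(text: str) -> Set[str]:
    """Return the letters in text that decompose into base letter + combining mark."""
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        if any(unicodedata.combining(part) for part in decomposed):
            found.add(ch)
    return found


if __name__ == "__main__":
    import sys

    print(sorted(accented_letters(extract_text(sys.argv[1]))))
```

Running it on the outputs produced with gs 9.55.0 and gs 9.56.1 and comparing the two results would show whether the text layer itself differs, independent of any viewer.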

Comment 8 daniel.debaerdemaeker 2022-09-08 09:51:21 UTC
You may forward it to Artifex (it has only my name and electronic signature).
I tested with Firefox 104 64-bit on our Windows environment: not OK.
On Firefox 100 64-bit (Fedora 35, 3 months ago): OK.

Comment 9 Michael J Gruber 2022-09-11 11:10:50 UTC
We don't have the original (produced with gs 9.55.0 in the OP's setup), but I ran ocrmypdf on that input using different gs versions. Indeed, gs 9.55.0 handles the Myriad Pro font differently: apparently, it converts it to Unicode, whereas gs 9.56.1 leaves it in its original encoding.

I don't know if this is "wrong" or just a different choice - after all, the resulting pdf works on F35 up with Firefox as shipped. (I'm still not sure what "not ok/does work" means, though.)
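
One way to look at the font handling described above is to compare the font tables of the two outputs. The thread does not say which tool was used for this. The sketch below assumes `pdffonts` from poppler-utils and its usual column layout (name, type, encoding, emb, sub, uni, object ID); the function names are hypothetical.

```python
import subprocess
from typing import Dict, List


def parse_pdffonts(output: str) -> List[Dict[str, str]]:
    """Parse pdffonts output into one dict per font.

    Rows are read from the right, because the 'type' column may contain
    spaces (for example 'Type 1C' or 'CID TrueType').
    """
    fonts = []
    for line in output.splitlines()[2:]:  # skip the header and the dashed line
        tokens = line.split()
        if len(tokens) < 8:
            continue
        fonts.append(
            {
                "name": tokens[0],
                "type": " ".join(tokens[1:-6]),
                "encoding": tokens[-6],
                "emb": tokens[-5],
                "sub": tokens[-4],
                "uni": tokens[-3],  # 'yes' if the font has a ToUnicode map
            }
        )
    return fonts


def list_fonts(pdf_path: str) -> List[Dict[str, str]]:
    """Run pdffonts on a file and return the parsed font table."""
    result = subprocess.run(["pdffonts", pdf_path], capture_output=True, text=True, check=True)
    return parse_pdffonts(result.stdout)


if __name__ == "__main__":
    import sys

    for path in sys.argv[1:]:
        print(path)
        for font in list_fonts(path):
            print("  ", font["name"], font["type"], font["encoding"], "uni=" + font["uni"])
```

A difference in the encoding or uni columns for the Myriad Pro entries between the two outputs would be consistent with the observation in this comment, though it would not by itself show which behavior is correct.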

Comment 10 Ben Cotton 2022-11-29 19:00:44 UTC
This message is a reminder that Fedora Linux 35 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 35 on 2022-12-13.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '35'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 35 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 11 Ben Cotton 2022-12-13 18:08:34 UTC
Fedora Linux 35 entered end-of-life (EOL) status on 2022-12-13.

Fedora Linux 35 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.