Bug 969925

Summary: poppler source pdf encoding
Product: [Fedora] Fedora
Reporter: Aleksandar Kostadinov <akostadi>
Component: poppler
Assignee: Marek Kašík <mkasik>
Status: CLOSED WONTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 18
CC: mkasik, rdieter
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2014-02-05 21:38:34 UTC
Type: Bug

Description Aleksandar Kostadinov 2013-06-03 04:39:28 UTC
I'm not sure if this is a bug or a feature request. I have a PDF that I'd like to provide as a text version that screen-reading software can read.

The PDF opens and displays just fine, but trying to copy any text results in strange characters.

> $ pdffonts mypdf.pdf
> name                                 type              encoding         emb sub uni object ID
> ------------------------------------ ----------------- ---------------- --- --- --- ---------
> LPHKPA+MSTT31c5f0                    Type 1C           Custom           yes yes no       4  0
> LPHKPA+MSTT31c660                    Type 1C           Custom           yes yes no       5  0
> LPHKPA+MSTT31c66b                    Type 1C           Custom           yes yes no       6  0
> LPHKPA+MSTT31c676                    Type 1C           Custom           yes yes no      13  0
> LPHKPA+MSTT31c68c                    Type 1C           Custom           yes yes no      14  0
> MSTT31c697                           Type 1C           Custom           yes no  no      15  0
> LPHKPA+MSTT31c6a2                    Type 1C           Custom           yes yes no      16  0
> LPHKPA+MSTT31c6ac                    Type 1C           Custom           yes yes no      20  0
> Symbol                               Type 1            Symbol           no  no  no      21  0
> LPHONI+MSTT31c6b7                    Type 1C           Custom           yes yes no      25  0
> LPHONI+MSTT31c6c2                    Type 1C           Custom           yes yes no      26  0
> LPHONI+MSTT31c6d8                    Type 1C           Custom           yes yes no      30  0
> LPHONI+MSTT31c6e4                    Type 1C           Custom           yes yes no      34  0
> LPICMA+MSTT31c5fc                    Type 1C           Custom           yes yes no      70  0
> LPIGKI+MSTT31c6ef                    Type 1C           Custom           yes yes no      74  0
> LPIGKI+MSTT31c6fb                    Type 1C           Custom           yes yes no      75  0
> Times-Bold                           Type 1            Custom           no  no  no      82  0
> LPIGKI+MSTT31c721                    Type 1C           Custom           yes yes no      86  0
> LPIGKI+MSTT31c745                    Type 1C           Custom           yes yes no      87  0
> LPIKJA+MSTT31c750                    Type 1C           Custom           yes yes no     119  0
> LPIKJA+MSTT31c739                    Type 1C           Custom           yes yes no     124  0
> LPIKJA+MSTT31c75c                    Type 1C           Custom           yes yes no     125  0

It seems encoding information is missing from the file. I found out that the actual encoding is the Cyrillic cp1251, but I see no way to extract the text using cp1251 as the source encoding. "pdftotext -listenc" does not list that encoding at all, and in any case it lists output encodings. What I actually need is to specify the input encoding.
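For reference, the "uni" column in the pdffonts listing above reports whether a font carries a ToUnicode map, and it reads "no" for every font here. The following is a minimal Python sketch for picking such fonts out of saved pdffonts output. The function name is hypothetical, and the parsing assumes the column layout shown above, with font names and encoding names that contain no spaces.

```python
def fonts_without_tounicode(pdffonts_output: str) -> list:
    """Return the names of fonts whose 'uni' column is 'no'.

    Assumes the pdffonts column layout:
    name, type, encoding, emb, sub, uni, object ID (two integers).
    """
    names = []
    for line in pdffonts_output.splitlines():
        # Split from the right: the font type ("Type 1C") may contain a
        # space, so only the last six fields are reliably single tokens.
        parts = line.rsplit(None, 6)
        if len(parts) != 7:
            continue
        head, _encoding, _emb, _sub, uni, obj, gen = parts
        # Skip the header and the dashed separator line.
        if not (obj.isdigit() and gen.isdigit()):
            continue
        if uni == "no":
            # The first token of the remaining text is the font name.
            names.append(head.split()[0])
    return names
```

The function takes the text printed by "pdffonts mypdf.pdf" (without the "> " quoting used in this report) and returns the list of font names lacking a ToUnicode map.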

Alternatively, I'd like to extract the text without any transcoding of characters at all. That way I could open it in the original encoding of the text and convert it with iconv to whatever I want. I can't find out how to specify the source encoding or how to disable the transformations.

The best I could achieve was to use "Latin1" as the output encoding and then open the file as cp1251, but a couple of characters, such as quotes and bullet points, are still missing from the output. If I instead output UTF-8 and then convert from UTF-8 to cp1252 with iconv, I get these characters as question marks.
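The workaround described above amounts to re-reading each character's byte value under a different code page. The following is a rough Python sketch of that idea, applied to text extracted as UTF-8. It assumes the garbled characters are cp1251 bytes that were interpreted as Latin-1, and it leaves untouched any character that does not fit that assumption (for example, a bullet that was already mapped to its proper Unicode code point). The function name and defaults are illustrative and are not part of poppler.

```python
def reinterpret(text: str, assumed: str = "latin-1", actual: str = "cp1251") -> str:
    """Re-decode characters that were read under the wrong 8-bit encoding.

    Each character is encoded back to a byte using `assumed` and decoded
    again using `actual`. Characters that cannot make this round trip are
    kept unchanged instead of being replaced by '?'.
    """
    out = []
    for ch in text:
        try:
            out.append(ch.encode(assumed).decode(actual))
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not representable in `assumed`, or undefined in `actual`:
            # presumably already correct, so pass it through as is.
            out.append(ch)
    return "".join(out)
```

Working character by character avoids the question marks that a whole-file conversion produces when a single character falls outside the intermediate encoding. Whether this recovers the quotes and bullet points depends on how they were mapped during extraction, which this sketch cannot determine.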

So it seems to me the most basic thing that would help is to output the text without modifying the encoding in any way. Could such a feature be added?

Comment 1 Fedora End Of Life 2013-12-21 13:50:05 UTC
This message is a reminder that Fedora 18 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 18. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '18'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 18's end of life.

Thank you for reporting this issue and we are sorry that we may not be 
able to fix it before Fedora 18 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version prior to Fedora 18's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 2 Fedora End Of Life 2014-02-05 21:38:34 UTC
Fedora 18 changed to end-of-life (EOL) status on 2014-01-14. Fedora 18 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.