I'm not sure whether this is a bug or a feature request. I have a PDF that I'd like to provide as plain text for screen-reading software to read.
The PDF opens and displays just fine, but copying any text from it produces strange characters.
> $ pdffonts mypdf.pdf
> name                                 type              encoding         emb sub uni object ID
> ------------------------------------ ----------------- ---------------- --- --- --- ---------
> LPHKPA+MSTT31c5f0                    Type 1C           Custom           yes yes no       4  0
> LPHKPA+MSTT31c660                    Type 1C           Custom           yes yes no       5  0
> LPHKPA+MSTT31c66b                    Type 1C           Custom           yes yes no       6  0
> LPHKPA+MSTT31c676                    Type 1C           Custom           yes yes no      13  0
> LPHKPA+MSTT31c68c                    Type 1C           Custom           yes yes no      14  0
> MSTT31c697                           Type 1C           Custom           yes no  no      15  0
> LPHKPA+MSTT31c6a2                    Type 1C           Custom           yes yes no      16  0
> LPHKPA+MSTT31c6ac                    Type 1C           Custom           yes yes no      20  0
> Symbol                               Type 1            Symbol           no  no  no      21  0
> LPHONI+MSTT31c6b7                    Type 1C           Custom           yes yes no      25  0
> LPHONI+MSTT31c6c2                    Type 1C           Custom           yes yes no      26  0
> LPHONI+MSTT31c6d8                    Type 1C           Custom           yes yes no      30  0
> LPHONI+MSTT31c6e4                    Type 1C           Custom           yes yes no      34  0
> LPICMA+MSTT31c5fc                    Type 1C           Custom           yes yes no      70  0
> LPIGKI+MSTT31c6ef                    Type 1C           Custom           yes yes no      74  0
> LPIGKI+MSTT31c6fb                    Type 1C           Custom           yes yes no      75  0
> Times-Bold                           Type 1            Custom           no  no  no      82  0
> LPIGKI+MSTT31c721                    Type 1C           Custom           yes yes no      86  0
> LPIGKI+MSTT31c745                    Type 1C           Custom           yes yes no      87  0
> LPIKJA+MSTT31c750                    Type 1C           Custom           yes yes no     119  0
> LPIKJA+MSTT31c739                    Type 1C           Custom           yes yes no     124  0
> LPIKJA+MSTT31c75c                    Type 1C           Custom           yes yes no     125  0
It seems the encoding information is missing from the file. I found out that the actual encoding is Cyrillic cp1251, but I see no way to extract the text using cp1251 as the source encoding. "pdftotext -listenc" does not list cp1251 at all, and in any case that list covers output encodings. What I actually need is a way to specify the input encoding.
Alternatively, I'd like to extract the text without any transcoding of characters at all. That way I could open it in the text's original encoding and convert it with iconv to whatever I want. I can't find out how to specify the source encoding or how to disable the transformations.
The best I could achieve was to use "Latin1" as the output encoding and then open the file as cp1251, but a few characters, such as quotes and bullet points, are still missing from the output. If I instead output UTF-8 and then convert from UTF-8 to cp1252 with iconv, those characters come out as question marks.
So it seems to me the most basic thing that would help is to output the text without modifying the encoding in any way. Could such a feature be added?
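To illustrate the workaround described above, here is a minimal Python sketch of the re-decoding step. It assumes the extracted text arrives as Unicode in which each original byte was interpreted as cp1252, with Latin-1 as a fallback for the few bytes cp1252 leaves undefined. I have not verified that pdftotext maps these fonts this way. The function name `redecode` is made up for this example and is not part of poppler.

```python
def redecode(text: str, assumed: str = "cp1252", actual: str = "cp1251") -> str:
    """Reinterpret text that was decoded as `assumed` but is really `actual`.

    Works character by character, which is only valid because both
    encodings are single-byte.
    """
    fixed = []
    for ch in text:
        try:
            raw = ch.encode(assumed)
        except UnicodeEncodeError:
            if ord(ch) < 0x100:
                # Byte that cp1252 leaves undefined: fall back to Latin-1,
                # where code point and byte value coincide.
                raw = bytes([ord(ch)])
            else:
                # Not representable in the assumed encoding at all;
                # keep the character unchanged instead of emitting "?".
                fixed.append(ch)
                continue
        # Bytes that cp1251 does not define become U+FFFD.
        fixed.append(raw.decode(actual, errors="replace"))
    return "".join(fixed)


if __name__ == "__main__":
    # cp1251 bytes for a Russian word, misread as cp1252.
    print(redecode("Ïðèâåò"))  # prints: Привет
```

Characters that cannot be mapped back to a byte are passed through untouched here, which is one way to avoid the question marks that iconv substitutes. Whether that is the right choice depends on how the extractor actually mapped the glyphs.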
This message is a reminder that Fedora 18 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 18. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora
'version' of '18'.
Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version prior to Fedora 18's end of life.
Thank you for reporting this issue and we are sorry that we may not be
able to fix it before Fedora 18 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version prior to Fedora 18's end of life.
Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
Fedora 18 changed to end-of-life (EOL) status on 2014-01-14. Fedora 18 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.
If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.
Thank you for reporting this bug and we are sorry it could not be fixed.