Bug 153724 - pdflatex generates pdf file that xpdf and Adobe Acroread cannot search for underscores in
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: tetex
Version: 5
Hardware: i386
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Jindrich Novy
QA Contact: David Lawrence
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2005-04-05 11:24 UTC by James Hunt
Modified: 2013-07-02 23:07 UTC

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2006-09-25 13:14:54 UTC
Type: ---
Embargoed:



Description James Hunt 2005-04-05 11:24:07 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050323 Firefox/1.0.2 Fedora/1.0.2-1.3.1

Description of problem:
If you create a PDF file from a LaTeX source file with 'pdflatex' and search it for a string containing an underscore character ('_'), the search fails. The PDF viewer can be 'xpdf' or Adobe Acroread - the result is the same.

Version-Release number of selected component (if applicable):
tetex-latex-2.0.2-21.3

How reproducible:
Always

Steps to Reproduce:
1. Create a LaTeX source file containing an underscore.
2. Run "pdflatex file.tex" (up to 3 times as necessary).
3. Run "xpdf file.pdf".
4. Search for a string containing an underscore character by pressing 'f' key
   and entering the search string.
5. Press return.


Actual Results:  Nothing - the string was not found.

Expected Results:  xpdf should have found the string and highlighted it.

Additional info:

To recreate the problem, put the 4 lines below in a file called "file.tex", and follow the steps above:

\documentclass[12pt]{article}
\begin{document}
hello\_world.
\end{document}
______________

If you use this document, you can search for "hello" and this will be found. You can search for "world" and this will be found. However, if you search for "hello_world", this will *NOT* be found.

I initially suspected a problem with xpdf. However, I now believe the problem is with the pdflatex command: I downloaded a PDF from http://www.w3.org and searched it for underscores, and these _are_ found by xpdf. This is what I did:

1. curl -O http://www.w3.org/TR/html401/html40.pdf.gz 
2. gunzip html40.pdf.gz
3. xpdf html40.pdf
4. Search for string "section_2" by typing 'f' and then typing "section_2"
   followed by return.
5. The string will be found on page 20 in section "2.1.2 Fragment identifiers".

I then repeated the steps above using Adobe Acroread ("rpm -q acroread" shows "acroread-5.07-2"). Again, the string was found.


Note: "pdfinfo file.pdf" returns:

Creator:        TeX
Producer:       pdfTeX-1.10b
CreationDate:   Tue Apr  5 11:47:00 2005
Tagged:         no
Pages:          1
Encrypted:      no
Page size:      595.276 x 841.89 pts (A4)
File size:      6919 bytes
Optimized:      no
PDF version:    1.4

...whilst "pdfinfo html40.pdf" shows:

Title:          HTML 4.01 Specification
Subject:        
Keywords:       
Author:         
Creator:        html2ps version 1.0 beta2 patched by Arnaud Le Hors 19990806
Producer:       GNU Ghostscript 5.10
CreationDate:   Fri Dec 24 18:35:43 1999
Tagged:         no
Pages:          389
Encrypted:      no
Page size:      612 x 792 pts (letter)
File size:      3009579 bytes
Optimized:      no
PDF version:    1.2

Is the PDF version relevant, I wonder? Is pdflatex not generating correct PDF version 1.4 output?

This bug is a major irritant as I've got some very large PDF documents that have a lot of underscores in them and it's a real pain having to scan them by hand to find the sections I want.

Comment 1 Matthew Miller 2006-07-10 21:58:20 UTC
Fedora Core 3 is now maintained by the Fedora Legacy project for security
updates only. If this problem is a security issue, please reopen and
reassign to the Fedora Legacy product. If it is not a security issue and
hasn't been resolved in the current FC5 updates or in the FC6 test
release, reopen and change the version to match.

Thank you!


Comment 2 James Hunt 2006-07-11 19:47:26 UTC
Yep, it's still a problem. Here are the current versions of my PDF viewers:

xpdf-3.01-12.1
gpdf-2.8.2-4.2
kdegraphics-3.5.3-0.2.fc5 (kpdf)

All 3 PDF readers suffer from the same problem:

- Search for "hello"         - finds it
- Search for "world"         - finds it
- Search for "hello_world"   - doesn't find it
- Search for "_"             - doesn't find it.





Comment 3 Jindrich Novy 2006-09-25 13:14:54 UTC
The problem is that teTeX draws the underscore as a graphic (a rule) rather
than as a character from a font, so it cannot be searched for directly. Note
that even pdftotext outputs a space character in place of the underscore, so
it is not visible to other PDF utilities either.

Comment 4 Pykler 2009-04-09 23:43:03 UTC
I have an update on this issue, quoting Karl Berry from the MacTeX user group:

>   the TeX engine generates a weird graphic rather than using
>   the underscore character

You are correct about that (except I wouldn't call it "weird").  The
standard definition of \_ is
\def\_{\leavevmode \kern.06em \vbox{\hrule width.3em}}

>   (maybe for a good reason).

Yes, the reason is that it would have been crazy for Knuth to waste a
precious slot in the original 1980s fonts (limited to 128 chars) on a
character that could perfectly well be created by a rule.

The answer is, don't use \_.  Instead, put your address in \tt and use
the actual _ character.  In plain TeX:

$<${\tt first\char`\_last}$>$

Then the _ will be pastable (and the output will look better, too).

I'm not sure if you're using LaTeX.  If you are, and you load url or
hyperref, you'll have a command \url that will let you type it without
the extra \char sequence:
 \url{first_last}

(And you'll get better line breaking behavior, too.)  Of course a
personal definition could be made to do the same thing with plain.
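
For LaTeX users, the two suggestions above might look like the sketch below. It reuses the "hello_world" string from the original test file; the url package is assumed to be installed, and the result has not been checked against the teTeX version in this report.

```latex
\documentclass[12pt]{article}
\usepackage{url}  % assumed available; hyperref also provides \url
\begin{document}
% Typewriter font plus \char`\_ selects the font's own underscore glyph
% (character code 95) instead of the rule drawn by \_.
{\ttfamily hello\char`\_world}

% \url accepts a plain _ and sets it from the typewriter font.
\url{hello_world}
\end{document}
```

After running pdflatex on this, a search for "hello_world" in the viewer, or "pdftotext file.pdf -", should show whether the underscore now survives as text.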


Similar things could be done with other fonts that provide an _
character if you want something other than typewriter, but I don't have
recipes at hand.  Almost everything besides Knuth's original cm* fonts
does have an _ character.


If you feel like reposting this to any of the bug systems, feel free.

Hope this helps.

------

As mentioned above, there is a recipe for using a font other than the default CM. This one was contributed by Herb Schulz, also from MacTeX support:

Try using the Latin Modern font with T1 encoding; that font is an updated design of CM with more characters and built-in, rather than constructed, accented characters. To use the Latin Modern font with T1 encoding add the lines

\usepackage{lmodern}
\usepackage[T1]{fontenc}
\usepackage{textcomp}

to your preamble.
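
Applied to the test file from the original description, that preamble would give something like the sketch below. It assumes the lmodern package is installed (it may not ship with older teTeX releases), and it relies on \_ being set from the font's own underscore glyph under T1 encoding rather than drawn as a rule.

```latex
\documentclass[12pt]{article}
\usepackage{lmodern}      % Latin Modern, an updated design of CM
\usepackage[T1]{fontenc}  % T1 encoding has a real underscore slot
\usepackage{textcomp}
\begin{document}
% Same body as the original test file; only the preamble changed.
hello\_world.
\end{document}
```

If this works as expected, searching the resulting PDF for "hello_world" should succeed, and pdftotext should print an underscore instead of a space.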

------

