Description of problem:
Most of the title links displayed by DocSearch contain a "Â". Here is an example of how the titles display in a regular doc search:

  ChapterÂ 3.Â Building Custom Packages
  3.2.2. RHN SSL Maintenance Tool Options
  Chapter 6. Manually Scripting the Configuration Channel Management Guide Index
  3.2.2. Signing packages

It looks like the help docs contain a non-printing character and Lucene is translating this to "Â".

Here is a snippet of the HTML as shown by Firefox "View Source":

  <p id="title"><a href="/rhn/help/release-notes/satellite/index.jsp"><strong>Chapter 3. Building Custom Packages</strong></a></p>

Here is the same HTML fetched with wget and passed through "cat -A":

  <p id="title"><a href="/rhn/help/release-notes/satellite/index.jsp"><strong>ChapterM-BM- 3.M-BM- Building Custom Packages</strong></a></p>

Notice that each "M-BM-" corresponds to where we see "Â" in what Lucene stores. Example of the data from Lucene:

  "ChapterÂ 3.Â Building Custom Packages"

Version-Release number of selected component (if applicable):
Satellite-5.3.0-RHEL5-re20090619.0-i386-embedded-oracle.iso

How reproducible:
Always

Steps to Reproduce:
1. In the English locale, do a doc search for "channel".

Actual results:
ChapterÂ 3.Â Building Custom Packages

Expected results:
Chapter 3. Building Custom Packages

Additional info:
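We figured out that the character in question is part of a non-breaking space (&nbsp;, U+00A0) in UTF-8 encoding. In hex it is c2 a0. The "Â" is the first byte, 0xC2, shown as a character of its own when the two-byte sequence is read with a single-byte encoding.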
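The byte-level behaviour can be reproduced with a short sketch. It assumes the indexer read the UTF-8 page with a single-byte Western encoding such as ISO-8859-1; the bug report does not state which encoding was actually used, so "latin-1" here is only an illustration.

```python
# U+00A0 (non-breaking space) is encoded in UTF-8 as the two bytes 0xC2 0xA0.
nbsp_bytes = "\u00a0".encode("utf-8")
print(nbsp_bytes.hex())

# A title as generated by the docs, with non-breaking spaces after
# "Chapter" and after "3.".
title = "Chapter\u00a03.\u00a0Building Custom Packages"
raw_title = title.encode("utf-8")

# Decoding with the correct charset round-trips cleanly.
good = raw_title.decode("utf-8")

# Assumption: a single-byte decoder. Each byte becomes one character,
# so 0xC2 turns into U+00C2 ("Â") and 0xA0 stays a non-breaking space.
bad = raw_title.decode("latin-1")

print(repr(good))
print(repr(bad))
```

This matches the "cat -A" output in the description, where "M-B" stands for byte 0xC2 and "M- " for byte 0xA0.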
Setting nutch to use utf8 for the default encoding:

  <property>
    <name>parser.character.encoding.default</name>
    <value>utf8</value>
    <description>The character encoding to fall back to when no other information
    is available</description>
  </property>

commit 3ab14eea49d0e1c64e84b1cfe2b5e98f98c8bddd
Refs: rhn-i18n-guides-5.3.0.8-1-1-g3ab14ee
Author:     John Matthews <jmatthew>
AuthorDate: Mon Jul 6 14:32:06 2009 -0400
Commit:     John Matthews <jmatthew>
CommitDate: Mon Jul 6 14:32:06 2009 -0400

    507687 - DocSearch force nutch to use "UTF-8" encoding.
---
 doc-indexes/NUTCH_CONF_TEMPLATE/nutch-site.xml |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/doc-indexes/NUTCH_CONF_TEMPLATE/nutch-site.xml b/doc-indexes/NUTCH_CONF_TEMPLATE/nutch-site.xml
index 70975c2..dcd22a6 100644
--- a/doc-indexes/NUTCH_CONF_TEMPLATE/nutch-site.xml
+++ b/doc-indexes/NUTCH_CONF_TEMPLATE/nutch-site.xml
@@ -55,4 +55,11 @@
   <name>file.content.limit</name>
   <value>-1</value>
 </property>
+<property>
+  <name>parser.character.encoding.default</name>
+  <value>utf8</value>
+  <description>The character encoding to fall back to when no other information
+  is available</description>
+</property>
+
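To illustrate why a fallback encoding matters, here is a minimal sketch of the general idea: use the charset a page declares, and only when none is available fall back to a configured default. This is not nutch's implementation; the function decode_page and its parameters are hypothetical names chosen for the example, and "cp1252" stands in for an arbitrary single-byte default.

```python
from typing import Optional


def decode_page(raw: bytes, declared: Optional[str] = None,
                default: str = "utf-8") -> str:
    """Hypothetical helper: decode with the declared charset if there is
    one, otherwise with the configured fallback."""
    encoding = declared or default
    return raw.decode(encoding, errors="replace")


# A UTF-8 page title that carries no charset information of its own.
page = "Chapter\u00a03.\u00a0Building Custom Packages".encode("utf-8")

# Fallback set to UTF-8: non-breaking spaces survive intact.
with_fix = decode_page(page, default="utf-8")

# Assumed single-byte fallback: every 0xC2 byte shows up as "Â".
without_fix = decode_page(page, default="cp1252")

# A declared charset always wins over the fallback.
declared_wins = decode_page(page, declared="utf-8", default="cp1252")

print(repr(with_fix))
print(repr(without_fix))
```

The fallback only applies when nothing else identifies the encoding, which is consistent with the description text of the property above.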
These are the package versions that have the fix:

Package: satellite-doc-indexes-5.3.17-1.el5sat
Package: satellite-doc-indexes-5.3.17-1.el4sat
Verified. Not seeing "Â" in doc search links now.
Verified in stage -> RELEASE_PENDING
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1434.html