427004 – No RedHat Bugzilla bugs are indexed in Google (or other search engines).

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Bugzilla
Classification:	Community
Component:	Query/Bug List
Sub Component:
Version:	devel
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	David Lawrence
QA Contact:
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	250595
Depends On:
Blocks:

Reported:	2007-12-29 22:05 UTC by Michael De La Rue
Modified:	2018-10-20 02:57 UTC
CC List:	8 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2008-12-31 01:27:26 UTC
Embargoed:

Attachments
Script to generate sitemap files nightly (v3) (6.99 KB, patch) 2008-12-16 22:23 UTC, David Lawrence	nelhawar: review+

Description Michael De La Rue 2007-12-29 22:05:11 UTC

Description of problem:
When searching Google (or, for that matter, other search engines), RedHat bugzilla
bugs don't show up.  This is a great shame, since they are a good source of
knowledge on many topics, and the workarounds (or even solutions) included in the
bugs would often help many people.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Open
http://www.google.com/search?q=site%3Abugzilla.redhat.com+fedora&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a
2. Look at the results: no bugs are listed.
  
Actual results:
no visible bug info

Expected results:
search engines should show the most interesting bugs related to Fedora

Additional info:
There are some privacy implications and public image implications from resolving
this bug.  Whilst it is true that the search function will make it easier to
identify that there are bugs in RedHat's products, the benefits would outweigh
this.  A) RedHat becomes a source of solution information for more people and so
gets extra publicity.  B) It becomes more obvious to all that RedHat is honest
about and communicates about its product flaws.  

Privacy implications for bug posters definitely exist but are somewhat limited; the
bugs are already public, but inclusion in search engines such as Google leads to
data aggregation.  However the remaining issues can be largely mitigated.  

A) the indexed version of the bugs should not include email addresses.  
B) indexing of future bugs could require opt in both from a RedHat staff member
and from the original reporter
C) hostname and other data could be automatically (but imperfectly) removed from
the indexed version.  
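
Mitigations A and C could be sketched roughly as follows. This is only an illustration, assuming simple regular expressions are acceptable; the patterns, the internal-looking hostname suffixes, and the replacement markers are all hypothetical, and (as noted above) such removal is imperfect.

```python
import re

# Hypothetical patterns: real redaction would need tuning and will miss cases.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumption: dotted names ending in a few internal-looking suffixes are hostnames.
HOST_RE = re.compile(
    r"\b[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.(?:localdomain|lan|internal)\b"
)


def redact(text):
    """Return text with email addresses and internal-looking hostnames masked."""
    # Mask emails first so their domain part is not also treated as a hostname.
    text = EMAIL_RE.sub("[email removed]", text)
    return HOST_RE.sub("[host removed]", text)


print(redact("Contact jdoe@example.com about box7.lab.internal"))
```

Only the indexed copy of a bug would pass through such a filter; the bug as stored in Bugzilla would stay unchanged.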

As a data point, some other Linux distributions such as Ubuntu already index
their bugs.

Comment 4 David Lawrence 2008-05-28 19:53:03 UTC

Created attachment 306980 [details]
Script to generate sitemap files nightly (v2)

Incorporated kbaker's suggestions. Added constant MAX_SITEMAP_LINKS, upped it
to 50k which should be fine, and also fixed the use lib line.

Please review
Dave
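
For readers without access to the attachment, the role of MAX_SITEMAP_LINKS can be sketched as below. This is not the attached script: it is a minimal Python illustration, assuming bug pages live at show_bug.cgi?id=N and using made-up file names (sitemap1.xml, ...). The sitemap protocol caps each file at 50,000 URLs, so larger sets are split into several files tied together by a sitemap index.

```python
from xml.sax.saxutils import escape

MAX_SITEMAP_LINKS = 50_000  # name from the comment above; the protocol's per-file cap
BASE_URL = "https://bugzilla.redhat.com"
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
HEADER = '<?xml version="1.0" encoding="UTF-8"?>\n'


def build_sitemaps(bug_ids, size=MAX_SITEMAP_LINKS):
    """Return ({filename: xml_text}, index_xml_text) for the given bug IDs."""
    files = {}
    for n, start in enumerate(range(0, len(bug_ids), size), start=1):
        lines = []
        for bug_id in bug_ids[start:start + size]:
            loc = escape(f"{BASE_URL}/show_bug.cgi?id={bug_id}")
            lines.append(f"  <url><loc>{loc}</loc></url>\n")
        body = "".join(lines)
        # File names are hypothetical.
        files[f"sitemap{n}.xml"] = f'{HEADER}<urlset xmlns="{NS}">\n{body}</urlset>\n'

    entries = []
    for name in files:
        loc = escape(f"{BASE_URL}/{name}")
        entries.append(f"  <sitemap><loc>{loc}</loc></sitemap>\n")
    index = f'{HEADER}<sitemapindex xmlns="{NS}">\n{"".join(entries)}</sitemapindex>\n'
    return files, index


files, index = build_sitemaps(list(range(1, 6)), size=2)
print(sorted(files))
```

The small size in the usage line is only there to make the splitting visible; a nightly job would use the default.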

Comment 6 Christoph Drieschner 2008-06-05 10:08:37 UTC

I've put up background and options at
https://docspace.corp.redhat.com/blogs/knowledge/2008/06/04/indexing-red-hat-bugzilla-part-1-background-and-options
Here is the post:

Indexing Red Hat Bugzilla - Part 1: Background and options
Posted by cdriesch Jun 4, 2008

Interesting discussion on how to make the content of bugzilla.redhat.com
available to public google as well as our new search at redhat.com/search. This
post tries to summarize discussions between Tushar Gandotra from my team, Dan
Fisher (GSS reporting), my manager Lee Mewshaw, Red Hat Bugzilla maintainer Dave
Lawrence and this ticket in the Red Hat Bugzilla itself, and point to caveats
and potential solutions.

Why it's not trivial

Currently, Red Hat Bugzilla disallows crawling, for mainly two reasons:

    * Performance. Blindly allowing the crawling of Bugzilla could create quite
some load, so it's currently disabled in robots.txt
    * Security. Some of the data is private, some public. Additionally, there is
some kind of "security through obscurity", as the interface is mainly a
search; there is no list view (that I'm aware of) that would give you easy
access to all of the entries in sequence (which makes it particularly hard to
crawl, even if it were allowed through robots.txt)

Note that the particularities of the UI make it hard to crawl the site, even if
it were allowed.
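
How a robots.txt rule blocks a well-behaved crawler can be checked with Python's standard urllib.robotparser. The two rule sets below are assumptions for illustration only, not the actual contents of bugzilla.redhat.com/robots.txt.

```python
from urllib.robotparser import RobotFileParser


def crawl_allowed(robots_txt, url, agent="*"):
    """Return True if robots_txt permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical rule sets.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
# Python's parser applies the first matching rule, so Allow must come first.
ALLOW_BUGS = "User-agent: *\nAllow: /show_bug.cgi\nDisallow: /\n"

bug_url = "https://bugzilla.redhat.com/show_bug.cgi?id=427004"
list_url = "https://bugzilla.redhat.com/buglist.cgi"

print(crawl_allowed(BLOCK_ALL, bug_url))
print(crawl_allowed(ALLOW_BUGS, bug_url))
print(crawl_allowed(ALLOW_BUGS, list_url))
```

The second rule set shows one way to open up individual bug pages while keeping the expensive query pages closed, which addresses the performance concern but not the discovery problem described above.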

Two target indexes: google.com and redhat.com/search

(Let google.com stand for any public search engine.) I don't see a reason for
any content to be accessible through the public search at redhat.com but not
through google, or vice versa: if we decide a bug should be visible and searchable
by the public, then why hide it in one of the systems?

Unfortunately, google.com and our own search engine need different treatment
if they can't crawl the content. For public search engines, the sitemap protocol
is the agreed standard for the major players. Our own search engine would either
be pointed directly at a database, or fed by uploading XML files with the content.

Options for creating the search index for our own search: Crawling and Feeding

If you own the search engine, there are two general options to get data into the
search index: You either let the search engine's spider crawl the content, or
you actively feed it through an xml-based upload, or through access to a
database holding the data.

Advantages of crawling

    * It doesn't need any extra infrastructure (like a script to generate and
feed an xml with new content)
    * Security: Since you only put into the index what's publicly available, you
can't accidentally publish private data

Advantages of feeding

    * You have much more fine-grained control over what you enter into the index
at what time/frequency
    * Better options for metadata indexing
    * Crawling rules are hard to impossible to define for complex UIs that don't
offer a simple standard list view into the data, if you want to reach complete
coverage, but at the same time don't want to spam the index with duplicates.
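
The feeding approach, together with the risk it carries of publishing private data, can be sketched as follows. Everything here is hypothetical: the Bug record, the rule that a bug restricted to any group counts as private, and the feed element names, which do not follow any particular search engine's upload format.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Bug:
    bug_id: int
    summary: str
    # Assumption: a non-empty group list means the bug is restricted (private).
    groups: list = field(default_factory=list)


def build_feed(bugs):
    """Return an XML feed (hypothetical format) containing only public bugs."""
    root = ET.Element("feed")
    for bug in bugs:
        if bug.groups:
            continue  # never feed restricted bugs into the index
        record = ET.SubElement(root, "record", id=str(bug.bug_id))
        ET.SubElement(record, "summary").text = bug.summary
    return ET.tostring(root, encoding="unicode")


print(build_feed([Bug(1, "public one"), Bug(2, "secret", ["security"])]))
```

Unlike crawling, nothing here is protected by the site's own access checks, so the privacy filter has to be part of the feed script itself.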

Internal search conclusion
Sitemap files for public crawling

Dave Lawrence and Kevin Baker have already created a sample sitemap.xml file for
telling google what to crawl, and what not to crawl.

Advantages of a sitemap file

    * Enabling visibility where crawling fails: In cases like the current
bugzilla.redhat.com where crawling is disabled or impractical, a sitemap is the
only way of telling the search engine which pages to index.
    * Fine-grained control over what will be found by the public search engines
    * Ability to influence the ranking of the page (I don't know though how
much that really influences the page's ultimate position in the search results)
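
As an illustration of the control a sitemap entry offers, the sketch below builds one entry with the protocol's optional lastmod and priority elements. The priority value is an arbitrary example; the protocol defines it as a hint between 0.0 and 1.0 relative to other pages on the same site, and search engines are free to ignore it.

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def url_entry(loc, last_modified, priority):
    """Build one <url> element with its optional lastmod and priority hints."""
    url = ET.Element("url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "lastmod").text = last_modified.isoformat()
    ET.SubElement(url, "priority").text = f"{priority:.1f}"
    return url


def urlset(entries):
    """Wrap entries in a <urlset> root and serialize it to a string."""
    root = ET.Element("urlset", xmlns=NS)
    root.extend(entries)
    return ET.tostring(root, encoding="unicode")


entry = url_entry(
    "https://bugzilla.redhat.com/show_bug.cgi?id=427004",
    date(2008, 12, 31),
    0.8,  # hypothetical value
)
print(urlset([entry]))
```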

Disadvantages of a sitemap file

    * If the feature request on this site isn't out of date, we cannot re-use
the generated sitemap for our internal search - it will only be useful for
searches from google, yahoo etc. We'll open a support case with google to find
out details or workarounds.
    * The sitemap standard doesn't have a (widely accepted) way to define crawl
delays or other means of giving the crawled system a break in terms of the load
it generates
    * Even though the major players agreed on the sitemap standard, not all
search engines adhere to it.

Sitemap file conclusions

    * A sitemap will be the best solution to get the bugs crawled by public,
outside search engines, if we can't allow crawling
    * A sitemap should start very conservatively (maybe with just a small subset of
bugs to show) for evaluating the performance impact
    * the sitemap will unfortunately not resolve the problem of access by our
own search at redhat.com/search.

Possible third option: Cached lists?

If bugzilla had as part of the UI a serial, simple, and complete listing of bugs
that

    * is very well cached
    * has a correct "last-updated" timestamp (meaning, search engines and
browsers see if they need to recrawl at all)
    * makes sure that only non-private content is shown

...then we might get around much of the work of duplicate solutions (sitemap and
feed) and simply point the internal and external searches at this list through
the robots.txt file, and still survive the load. Maybe some of the effort for
creating the sitemap.xml file can be used to create this static list view into
the public bugs.
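
The "last-updated" point relies on HTTP conditional requests: a client sends If-Modified-Since, and the server answers 304 Not Modified without a body when nothing has changed. A minimal, framework-free sketch of that decision follows; the function name and its return shape are assumptions for illustration.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(last_updated, if_modified_since=None):
    """Return (status, headers) for a page last changed at `last_updated` (UTC)."""
    last_updated = last_updated.replace(microsecond=0)  # HTTP dates have 1 s resolution
    headers = {"Last-Modified": format_datetime(last_updated, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall back to a full response
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            if last_updated <= since:
                return 304, headers  # client copy is current; send no body
    return 200, headers


changed = datetime(2008, 12, 31, 1, 27, 26, tzinfo=timezone.utc)
status, headers = respond(changed)
print(status, headers["Last-Modified"])
print(respond(changed, headers["Last-Modified"])[0])
```

With such a check in front of a cached list, a recrawl of unchanged pages costs little more than a header comparison.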

Tags: bugzilla, collections, crawling, gsa

Comment 7 Jan Iven 2008-10-07 09:10:57 UTC

(FYI, added as a feature request IT#227198)

Comment 8 Noura El hawary 2008-11-27 06:50:28 UTC

Dave, Do you still need me to review this patch?

Comment 9 David Lawrence 2008-12-01 04:08:11 UTC

*** Bug 250595 has been marked as a duplicate of this bug. ***

Comment 13 David Lawrence 2008-12-31 01:27:26 UTC

bugzilla.redhat.com is now generating nightly sitemap files, and Google is indexing them. Closing this bug.
