Description of problem:
When searching Google (or, for that matter, other search engines), Red Hat Bugzilla bugs don't show up. This is a great shame, since they are a good source of knowledge on many topics, and the workarounds (or even solutions) included in the bugs would often help many people.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. http://www.google.com/search?q=site%3Abugzilla.redhat.com+fedora&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a
2. Look: no bugs.

Actual results:
No visible bug info.

Expected results:
Search engines should show the most interesting bugs related to Fedora.

Additional info:
There are some privacy implications and public-image implications from resolving this bug. Whilst it is true that the search function will make it easier to identify that there are bugs in Red Hat's products, the benefits would outweigh this:
A) Red Hat becomes a source of solution information for more people, and so gets extra publicity.
B) It becomes more obvious to all that Red Hat is honest about, and communicates about, its product flaws.

Privacy implications for bug posters definitely exist but are somewhat limited; the bugs are already public, but inclusion in search engines such as Google leads to data aggregation. However, the remaining issues can be largely mitigated:
A) The indexed version of the bugs should not include email addresses.
B) Indexing of future bugs could require opt-in both from a Red Hat staff member and from the original reporter.
C) Hostnames and other data could be automatically (but imperfectly) removed from the indexed version.

As a data point, some other Linux distributions, such as Ubuntu, already index their bugs.
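Mitigation (A) above could be approximated with a simple filter applied to the rendered bug page before handing it to the indexer. A minimal sketch in Python — the regex, function name, and placeholder text are illustrative assumptions, not part of any actual Bugzilla code:

```python
import re

# Rough pattern for common email forms. Intentionally simple: it will miss
# exotic addresses and may over-match, which is acceptable for redaction.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def redact_emails(page_text: str, placeholder: str = "[email hidden]") -> str:
    """Replace anything that looks like an email address in the copy of a
    bug page that gets exposed to search-engine crawlers."""
    return EMAIL_RE.sub(placeholder, page_text)
```

As the comment notes, a regex can only imperfectly recognize addresses, which matches the "automatically (but imperfectly) removed" caveat in point (C).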
Created attachment 306980 [details]
Script to generate sitemap files nightly (v2)

Incorporated kbaker's suggestions. Added a constant MAX_SITEMAP_LINKS and upped it to 50k, which should be fine, and also fixed the "use lib" line.

Please review.

Dave
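The attached script is not shown here, but the chunking logic behind a MAX_SITEMAP_LINKS constant can be sketched: the sitemap protocol caps each sitemap file at 50,000 URLs, so a bug list has to be split across files. A rough Python equivalent, where the URL shape and function name are assumptions rather than the attachment's actual code:

```python
from xml.sax.saxutils import escape

MAX_SITEMAP_LINKS = 50_000  # per-file URL cap from the sitemap protocol

def sitemap_files(bug_ids, base="https://bugzilla.redhat.com/show_bug.cgi?id="):
    """Yield sitemap XML documents, each holding at most
    MAX_SITEMAP_LINKS <url> entries."""
    for start in range(0, len(bug_ids), MAX_SITEMAP_LINKS):
        chunk = bug_ids[start:start + MAX_SITEMAP_LINKS]
        urls = "".join(
            f"<url><loc>{escape(base + str(b))}</loc></url>" for b in chunk
        )
        yield (
            '<?xml version="1.0" encoding="UTF-8"?>'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
            f"{urls}</urlset>"
        )
```

When the number of files grows, the protocol also allows a sitemap index file that lists the individual sitemaps, which keeps the robots.txt pointer stable.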
Put up background and options at https://docspace.corp.redhat.com/blogs/knowledge/2008/06/04/indexing-red-hat-bugzilla-part-1-background-and-options — here it is:

Indexing Red Hat Bugzilla - Part 1: Background and options
Posted by cdriesch Jun 4, 2008

Interesting discussion on how to make the content of bugzilla.redhat.com available to public Google as well as our new search at redhat.com/search. This post tries to summarize discussions between Tushar Gandotra from my team, Dan Fisher (GSS reporting), my manager Lee Mewshaw, Red Hat Bugzilla maintainer Dave Lawrence, and this ticket in Red Hat Bugzilla itself, and to point to caveats and potential solutions.

Why it's not trivial

Currently, Red Hat Bugzilla disallows crawling, for mainly two reasons:
* Performance. Blindly allowing the crawling of Bugzilla could create quite some load, so it's currently disabled in robots.txt.
* Security. Some of the data is private, some public. Additionally, there is a kind of "security through obscurity": the interface is mainly a search, and there is no list view (that I'm aware of) that would give easy access to all of the entries in sequence.

Note that these particularities of the UI make it hard to crawl the site, even if it were allowed through robots.txt.

Two target indexes: google.com and redhat.com/search

(Let google.com stand for any public search engine.) I don't see a reason for any content to be accessible through the public search at redhat.com but not through Google, or vice versa: if we decide a bug is to be visible and searchable by the public, then why hide it in one of the systems? Unfortunately, google.com and our own search engine need different treatment if they can't crawl the content. For public search engines, the sitemap protocol is the agreed standard among the major players.
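For reference, a minimal file under the sitemap protocol looks like the following (the bug ID shown is a placeholder, not a real entry; the file is advertised to crawlers via a `Sitemap:` line in robots.txt):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://bugzilla.redhat.com/show_bug.cgi?id=12345</loc>
    <lastmod>2008-06-04</lastmod>
    <changefreq>weekly</changefreq>
  </url>
</urlset>
```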
Our own search engine would either be pointed directly at a database, or fed via uploading XML files with the content.

Options for creating the search index for our own search: crawling and feeding

If you own the search engine, there are two general options to get data into the search index: you either let the search engine's spider crawl the content, or you actively feed it, through an XML-based upload or through access to a database holding the data.

Advantages of crawling
* It doesn't need any extra infrastructure (like a script to generate and feed XML with new content).
* Security: since you only put into the index what's publicly available, you can't accidentally publish private data.

Advantages of feeding
* You have much more fine-grained control over what you enter into the index, and at what time/frequency.
* Better options for metadata indexing.
* Crawling rules are hard to impossible to define for complex UIs that don't offer a simple standard list view into the data, if you want to reach complete coverage but at the same time don't want to spam the index with duplicates.

Internal search conclusion

Sitemap files for public crawling

Dave Lawrence and Kevin Baker have already created a sample sitemap.xml file for telling Google what to crawl, and what not to crawl.

Advantages of a sitemap file
* Enabling visibility where crawling fails: in cases like the current bugzilla.redhat.com, where crawling is disabled or impractical, a sitemap is the only way of telling the search engine which pages to index.
* Fine-grained control over what will be found by the public search engines.
* Ability to influence the ranking of a page (I don't know, though, how much that really influences the page's ultimate position in the search results).

Disadvantages of a sitemap file
* If the feature requests on this site aren't out of date, we cannot re-use the generated sitemap for our internal search - it will only be useful for searches from Google, Yahoo, etc.
We'll open a support case with Google to find out details or workarounds.
* The sitemap standard doesn't have a (widely accepted) way to define crawl delays or other means of giving the crawled system a break in terms of the load it generates.
* Even though the major players agreed on the sitemap standard, not all search engines adhere to it.

Sitemap file conclusions
* A sitemap will be the best solution to get the bugs crawled by public, outside search engines, if we can't allow crawling.
* A sitemap should start very conservatively (maybe with just a small subset of bugs to show), to evaluate the performance impact.
* The sitemap will unfortunately not resolve the problem of access by our own search at redhat.com/search.

Possible third option: cached lists?

If Bugzilla had, as part of the UI, a serial, simple, and complete listing of bugs that
* is very well cached,
* has a correct "last-updated" timestamp (meaning search engines and browsers can see whether they need to recrawl at all), and
* makes sure that only non-private content is shown,
...then we might get around much of the work of duplicate solutions (sitemap and feed), simply point the internal and external searches at this list through the robots.txt file, and still survive the load. Maybe some of the effort for creating the sitemap.xml file can be reused to create this static list view into the public bugs.

Tags: bugzilla, collections, crawling, gsa
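The "cached lists" option in the quoted post hinges on the list endpoint honoring conditional requests, so crawlers that already have a fresh copy get a cheap 304 instead of a full page. A minimal sketch of that server-side check, assuming a hypothetical list handler (not actual Bugzilla code):

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional

def respond(last_updated: datetime, if_modified_since: Optional[str]):
    """Return (status, headers) for the cached bug list: 304 when the
    crawler's copy is still current, 200 with Last-Modified otherwise."""
    if if_modified_since:
        try:
            cached = parsedate_to_datetime(if_modified_since)
            if last_updated <= cached:
                return 304, {}
        except (TypeError, ValueError):
            pass  # malformed header: fall through to a full response
    return 200, {"Last-Modified": format_datetime(last_updated, usegmt=True)}
```

This is what makes the cached list survivable under crawler load: most recrawl visits turn into header-only exchanges rather than regenerated bug listings.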
(FYI, added as a feature request IT#227198)
Dave, Do you still need me to review this patch?
*** Bug 250595 has been marked as a duplicate of this bug. ***
bugzilla.redhat.com is now generating nightly sitemap files, which Google is indexing. Closing this bug.