Bug 872746 - RHN: Cleaning technical debt by enabling search (GSA) to crawl packages and errata
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Network
Classification: Retired
Component: RHN/Maintenance
Version: MR48 (AMS)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: low
Target Milestone: ---
Assignee: Jared Blashka
QA Contact: Nicole Yancey
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2012-11-02 21:35 UTC by Nicky
Modified: 2013-08-06 00:34 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-04-25 13:43:09 UTC
Embargoed:



Description Nicky 2012-11-02 21:35:57 UTC
Description of problem:
There is a plan to expand the Google Search Appliance (GSA) search capabilities in the Customer Portal so that the GSA crawls the customer's entire browsing experience.  This involves making changes to Package Search and Errata Search in RHN.

Proposed Resolution (to do the above, the following needs to happen):

(1) AMS would need to edit / add meta tags to the Packages and Errata.  This can be done by editing the content itself or via the feed.  The method can be determined by AMS.

(2) Update the feed so that any changes to packages or errata are seen by the GSA in a more timely fashion, allowing the GSA to reindex the changes and ultimately speeding up search.

(3) (This one is not as time sensitive, nor a blocker in accomplishing the goals of this BZ.) Delete the code and eliminate the technical debt in RHN.  Specifically, there is Public Errata (Perl) and Private Errata (Java).  The Public one is a list of packages without links, while the Private one is a list of packages with links.  The suggestion from the RHN Product Owner is to remove the Public Errata altogether.

Timeline:  In the meeting that took place on 11/1/2012, Mike mentioned that we wanted to have this accomplished ASAP.

Contacts:  
Mike Amburn, Product Owner RHN
Nicky Bronson, Sr BA RHN

Comment 1 Nicky 2012-11-13 14:02:27 UTC
US27779 added to AMS backlog.

Comment 2 Jared Blashka 2013-01-08 20:02:28 UTC
Spoke with Mike Amburn.

He wants new pages created that output information on RHN Errata & Packages. 
The pages should output XML formatted like the Errata & Package examples from the GSA feed structure document:
https://docspace.corp.redhat.com/docs/DOC-125957

The pages should allow limiting the search results by supplying constraints on the content's update date (updatedBefore & updatedAfter).
If no dates are supplied the feed returns results for all Errata/Packages.

Page should probably be secured in some fashion to prevent abuse as the underlying query is resource intensive and time consuming to run if unlimited.
Method of securing the page is still undecided.

Comment 3 Nicky 2013-01-10 19:31:31 UTC
Gov Board update 1/10/13 - will be in MR45 which is scheduled for 1st week of Feb release.

Comment 4 Jonathon Turel 2013-01-11 15:49:31 UTC
Code reviewed, good.

Comment 5 Jared Blashka 2013-01-11 17:45:11 UTC
http://git.corp.redhat.com/cgit/rhn/rhn/commit/?id=a02269017bc95b08a5e58256c917c3510c82c89c

Feed pages are
/rhn/gsa/PackageFeed.do
/rhn/gsa/ErrataFeed.do

Each page can have one, neither, or both of the following arguments in the URL:
updatedAfter
updatedBefore

Date format is YYYY-MM-DD for those arguments (ex. .../rhn/gsa/ErrataFeed.do?updatedAfter=2012-10-01&updatedBefore=2012-12-31). Maximum time span that is searchable is one year.

If updatedAfter isn't supplied, page sets updatedAfter=2000-01-01. If updatedBefore isn't supplied, page sets updatedBefore to tomorrow's date.
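A minimal sketch of the default handling described above, in Python for illustration only. The function and constant names are hypothetical and are not taken from the RHN code; how the one-year maximum is enforced is not stated here, so it is left out.

```python
from datetime import date, datetime, timedelta

DATE_FORMAT = "%Y-%m-%d"                  # YYYY-MM-DD, as described above
DEFAULT_UPDATED_AFTER = date(2000, 1, 1)  # default when updatedAfter is missing


def resolve_date_range(updated_after=None, updated_before=None, today=None):
    """Return (after, before) as dates, filling in the documented defaults.

    `today` is injectable only so the behaviour can be checked deterministically.
    """
    today = today or date.today()
    after = (datetime.strptime(updated_after, DATE_FORMAT).date()
             if updated_after else DEFAULT_UPDATED_AFTER)
    # A missing updatedBefore becomes tomorrow's date.
    before = (datetime.strptime(updated_before, DATE_FORMAT).date()
              if updated_before else today + timedelta(days=1))
    return after, before
```

For example, `resolve_date_range("2012-10-01", "2012-12-31")` mirrors the example URL above.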

Deployed on rhn.webdev.

Comment 6 Nicole Yancey 2013-01-14 21:25:02 UTC
Link to test run - https://tcms.engineering.redhat.com/run/54834/

Comment 7 Nicole Yancey 2013-01-15 16:27:42 UTC
verified on am-qa

Comment 8 Jared Blashka 2013-01-22 16:00:36 UTC
Spoke with Ian Hands, who will be creating the crawler for this page. He requested a few changes. 

He'd like to be able to specify the time down to the second for the updatedAfter & updatedBefore parameters.

Rather than limiting the searching to a two month time span, the results should be paginated (&page=0 fetches xml for first n records, &page=1 fetches the next n records after that, etc). When requesting a timespan, the crawler will increment the page parameter until it doesn't get back any results in the xml file.

The results returned by the xml page should be ordered by updated date in ascending order.

A meta field should be added that holds some sort of description of the errata/package (ex. first 120 characters of the description).

The url for the errata/packages in the xml should include the hostname, as GSA will have no idea where the xml was generated.

This won't be going out with MR45 anymore.

Comment 9 Nicky 2013-01-23 18:55:07 UTC
Jared, 
Thank you.  I understand these changes could not be consumed in time to make the QA push.  Therefore this bug was pushed to AMS's next release. Please ensure that you have all the changes needed to prevent further delays.

Overall, I am really glad that the changes were caught in time to prevent issues, though I do need to watch this delay.  

Please let me know if there is anything else I can do to facilitate / ensure that this gets into MR46.

Comment 10 Nicky 2013-01-24 20:57:31 UTC
Gov Board update 1/24/13 - due to the incoming RHEL7 extended QA work scheduled for Feb, AMS will try hard to get this into MR46 (late Feb), however at this time there are no guarantees.  We need to ensure that no further requirements are needed from GSS side, and no further changes are requested of the AMS developer.

Comment 11 Nicole Yancey 2013-01-28 16:23:13 UTC
Move to AMS backlog - US30282

Comment 12 Nicky 2013-02-07 19:40:02 UTC
Gov Board update 2/7/13 - confirmed that this is in sprint.

Comment 13 Ian Page Hands 2013-02-07 19:41:20 UTC
There are a couple of things that might need to change.
For reference have a quick look at the page setup to browse packages : https://access.devgssci.devlab.phx1.redhat.com/search/beta/browse/packages 

1) One thing that appears missing immediately is the ability to filter packages based off of RHEL version. This might be an RFE though.

This is because there is no RHEL version data in the feed metadata. Example:
          <meta name="portal_id" content="749185"/>
          <meta name="portal_title" content="vino-debuginfo-2.28.1-8.el6_3.x86_64.rpm"/>
          <meta name="portal_description" content="This package provides debug information for package vino. Debug information is useful when developing applications that "/>
          <meta name="portal_product" content="vino-debuginfo"/>
          <meta name="portal_product_version" content="2.28.1"/>
          <meta name="portal_architecture" content="x86_64"/>
          <meta name="portal_package_version" content="4.8.0"/>
          <meta name="portal_publication_date" content="2013-01-21T17:30:12"/>
          <meta name="portal_update_date" content="2013-01-21T17:30:11"/>
          <meta name="portal_requires_subscription" content="no"/>
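For illustration, the meta fields above could be read into a simple mapping on the crawler side. This is a sketch only: it assumes the `<meta>` elements sit somewhere inside a `<record>` element, which is a guess about the feed layout, and `meta_to_dict` is a hypothetical helper name. The sample reuses three of the values quoted above.

```python
import xml.etree.ElementTree as ET

# Assumed wrapper: the exact nesting of <meta> inside the feed is not shown here.
SAMPLE_RECORD = """<record>
  <meta name="portal_id" content="749185"/>
  <meta name="portal_product" content="vino-debuginfo"/>
  <meta name="portal_architecture" content="x86_64"/>
</record>"""


def meta_to_dict(record):
    """Collect every <meta name=... content=...> below `record` into a dict."""
    return {meta.get("name"): meta.get("content") for meta in record.iter("meta")}


fields = meta_to_dict(ET.fromstring(SAMPLE_RECORD))
```

Because `iter("meta")` searches at any depth, the sketch does not depend on whether the meta elements are direct children of the record.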


I think the values in portal_product and portal_product_version might be useful at some point, but what is more useful is the parent product/version.

For example in the meta's given above we *could* build a filter on portal_product, but almost every record is going to have a unique product name so the filter would be thousands of entries long... and less useful as a filter.

Instead if this vino-debuginfo had the product of "Red Hat Enterprise Linux" and the version of "5". And another vino-debuginfo record had the "Red Hat Enterprise Linux" and the version of "6", then we could easily build the filter for RHEL 5 and RHEL 6.

Is there any way to relate a record with its "parent" product/version (where parent product is ??in all cases?? the RHEL prod/vers)?
If so I'd like to see portal_product and portal_product_version be the parent. You can continue to provide the info you currently do in a portal_package and portal_package_version or some similar meta field.
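To make the filter argument concrete, here is a small sketch of how filter entries might be counted once records carry the parent product and version. The two records are the hypothetical vino-debuginfo case from this comment, `portal_package` is the field name suggested above rather than an existing one, and `facet_counts` is an illustrative helper.

```python
from collections import Counter

# Hypothetical records, following the RHEL 5 / RHEL 6 example in this comment.
records = [
    {"portal_product": "Red Hat Enterprise Linux", "portal_product_version": "5",
     "portal_package": "vino-debuginfo"},
    {"portal_product": "Red Hat Enterprise Linux", "portal_product_version": "6",
     "portal_package": "vino-debuginfo"},
]


def facet_counts(records, *fields):
    """Count records per distinct combination of the given meta fields."""
    return Counter(tuple(record[field] for field in fields) for record in records)


facets = facet_counts(records, "portal_product", "portal_product_version")
```

Grouping on the parent product/version yields one filter entry per RHEL release, whereas grouping on the package name would yield roughly one entry per package.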

FYI: errata currently behaves this way, where the portal_product is like RHEL, RHEL AS, RHEL WS, etc.
see: https://access.devgssci.devlab.phx1.redhat.com/search/beta/browse/errata




2) I have performed a few full crawls and noticed that the package counts seem lower than expected:
scratch/errata/.state.yml: 7021
scratch/package/.state.yml: 3148

Is this number 3148 (total unique records after the crawl from 1980 to today) right? If it is just an "AM-QA only has a subset of data" thing I understand, and it is probably not much to worry about.

I think I recall crawling one full crawl and seeing a much larger number though.

Comment 14 Jared Blashka 2013-02-13 19:24:28 UTC
Ignore any previous comments detailing functionality.

Refer to https://docspace.corp.redhat.com/docs/DOC-125957 for updated feed format.
In addition to the fields listed in the above doc, the feed includes a <meta name="portal_description"/> field.


Feed pages are:
/rhn/gsa/PackageFeed.do
/rhn/gsa/ErrataFeed.do

Each of these pages can take either, neither, or both of the following parameters:
onOrAfter (restricts returned results to those with a last_modified time on or after the supplied time)
onOrBefore (restricts returned results to those with a last_modified time on or before the supplied time)

These parameters are a timestamp of the following format:
YYYY-MM-DDTHH24:MI:SS (That is literally a "T" character in the timestamp)
This is Year-Month-DayTHour:Minute:Second. Ex: 2010-05-15T15:20:11
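As a sketch, a client written in Python could produce and read this timestamp format with the standard `datetime` module. The helper names are illustrative; the format string is the `strftime` equivalent of the pattern above.

```python
from datetime import datetime

# strftime/strptime equivalent of YYYY-MM-DDTHH24:MI:SS (24-hour clock, literal "T").
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"


def format_timestamp(moment):
    """Render a datetime the way onOrAfter / onOrBefore expect it."""
    return moment.strftime(TIMESTAMP_FORMAT)


def parse_timestamp(text):
    """Parse a feed-style timestamp; raises ValueError on any other format."""
    return datetime.strptime(text, TIMESTAMP_FORMAT)
```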

There is no restriction on the length of time queried by the feed. 
Rather than restricting a time length, the feed will limit the returned results to 200 entries. 
If a page=# parameter is appended to the url, the feed will return the #th set of 200 results. 
Page # starts at 1. Ex. page=1 is 1st 200, page=2 is 2nd 200. 
If there are no results to return, the XML returned will have no <record> elements.
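The paging rules above suggest a crawl loop along the following lines. This is a sketch under assumptions: `fetch_page` is a hypothetical callable that returns the XML text for a given page number (authentication and HTTP are left out), and the `<feed>` root element in the stand-in data is a placeholder, since only the `<record>` element name is given here.

```python
import xml.etree.ElementTree as ET


def crawl(fetch_page, first_page=1):
    """Yield every <record> element, requesting pages until one comes back empty."""
    page = first_page  # pages start at 1
    while True:
        root = ET.fromstring(fetch_page(page))
        records = list(root.iter("record"))
        if not records:
            return  # an XML document with no <record> elements ends the crawl
        yield from records
        page += 1


# Stand-in for the real feed: two pages with records, then empty pages.
fake_pages = {1: "<feed><record/><record/></feed>", 2: "<feed><record/></feed>"}
records = list(crawl(lambda page: fake_pages.get(page, "<feed/>")))
```

Passing the fetcher in as a parameter keeps the loop independent of how the feed is actually requested.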

Records are returned in ascending order of timestamps (Oldest records are first). 
If multiple records share the same last_update timestamp then those records are ordered by their relevant id (package/errata) in ascending order.
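The ordering rule can be expressed as a two-part sort key. The sketch below is illustrative only: the dictionary keys `updated` and `id` and the sample values are hypothetical, and timestamps in the fixed format used by the feed compare correctly as plain strings.

```python
def feed_order(records):
    """Sort by last-update timestamp, then by package/errata id, both ascending."""
    return sorted(records, key=lambda record: (record["updated"], record["id"]))


# Hypothetical sample: two records share a timestamp and are separated by id.
sample = [
    {"id": 3, "updated": "2013-01-21T17:30:11"},
    {"id": 1, "updated": "2013-01-21T17:30:11"},
    {"id": 2, "updated": "2012-12-01T00:00:00"},
]
ordered_ids = [record["id"] for record in feed_order(sample)]
```

A stable total order like this is what lets a crawler page through results without records shifting between pages, provided the underlying data does not change mid-crawl.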

If a package/errata has no associated products (Red Hat etc. etc.) it will not be returned by the feed.


Example feed page in dev:
https://rhn.webdev.redhat.com/rhn/gsa/PackageFeed.do?page=2&onOrBefore=2012-11-30T00:00:00&onOrAfter=2012-10-10T00:00:00
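A URL like the one above could be assembled as follows. This is a sketch: `feed_url` is a hypothetical helper, and keeping the colons unescaped is a choice made here only to match the example (percent-encoded colons should be equivalent to a server that decodes query strings normally).

```python
from datetime import datetime
from urllib.parse import urlencode

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"


def feed_url(base, page=None, on_or_before=None, on_or_after=None):
    """Build a feed URL; every parameter is optional, as described above."""
    params = {}
    if page is not None:
        params["page"] = page
    if on_or_before is not None:
        params["onOrBefore"] = on_or_before.strftime(TIMESTAMP_FORMAT)
    if on_or_after is not None:
        params["onOrAfter"] = on_or_after.strftime(TIMESTAMP_FORMAT)
    query = urlencode(params, safe=":")  # leave ":" readable in timestamps
    return f"{base}?{query}" if query else base


url = feed_url(
    "https://rhn.webdev.redhat.com/rhn/gsa/PackageFeed.do",
    page=2,
    on_or_before=datetime(2012, 11, 30),
    on_or_after=datetime(2012, 10, 10),
)
```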

Deployed on rhn.webdev

Comment 15 Jared Blashka 2013-02-13 19:25:14 UTC
Updated the doc to reflect current feed format

Comment 16 Jared Blashka 2013-02-13 19:46:30 UTC
Must use gsa-doc-crawler login

Comment 17 Nicole Yancey 2013-02-15 14:40:50 UTC
Testing blocked due to change in requirements

Comment 18 Nicole Yancey 2013-02-20 14:52:17 UTC
Link to test run - https://tcms.engineering.redhat.com/run/57217/

Need to update the error message to reflect latest requirements.

Comment 19 Nicole Yancey 2013-02-20 15:41:26 UTC
verified on rhn.webdev.redhat.com

Comment 20 Nicky 2013-02-22 20:03:32 UTC
Gov Board update 2/21/13 - there is still ambiguity around the fine tuning of this bug.  Until then we cannot release it.  A meeting will be held to figure this out.

Comment 21 Nicky 2013-02-26 20:09:13 UTC
Per meeting on 2/26/13 with Vkumar and Jared, this bug is ready to go to QA as soon as QA is available.

Comment 22 Nicky 2013-03-07 21:00:39 UTC
Gov Board Update 3/7/13 - confirmed this is verified and ready to push to prod when environment is available.

Comment 23 Brooks 2013-04-08 19:19:24 UTC
This will be released with the RHN MR48 GSA release. Changing Version to MR48.

Check our release schedule for RHN MR48 release date:
https://docspace.corp.redhat.com/docs/DOC-126420/

Comment 24 Richard Bernleithner 2013-04-08 20:05:49 UTC
Currently scheduled for RHN MR48 on 4/17

Comment 25 Nicole Yancey 2013-04-11 17:44:20 UTC
fail on QA - bad queries

Comment 26 Nicole Yancey 2013-04-16 16:38:42 UTC
fail on qa and stage

Proxy error generated on the following pages:
1) /rhn/gsa/ErrataFeed.do?&onOrBefore=2012-11-30T00:00:00&page=1
2) /rhn/gsa/ErrataFeed.do?page=2&onOrBefore=2012-11-30T00:00:00
3) /rhn/gsa/ErrataFeed.do?&onOrBefore=2012-11-30T00:00:00&page=3
4) /rhn/gsa/ErrataFeed.do?&onOrBefore=2012-11-30T00:00:00
5) /rhn/gsa/ErrataFeed.do?

Comment 27 Nicole Yancey 2013-04-16 17:19:19 UTC
verified on rhn.webdev - https://tcms.engineering.redhat.com/run/60849/

Comment 28 Nicole Yancey 2013-04-18 13:08:32 UTC
verified in qa - https://tcms.engineering.redhat.com/run/60616/

Comment 29 Richard Bernleithner 2013-04-18 17:18:28 UTC
This is scheduled for release on 4/22 according to https://docspace.corp.redhat.com/docs/DOC-139955

Comment 31 Nicole Yancey 2013-04-19 15:23:35 UTC
verified in stage

Comment 32 Jared Blashka 2013-04-19 17:28:45 UTC
Stage DB is running slow, which is the cause of the Proxy error timeouts in the UI.

Comment 33 Richard Bernleithner 2013-04-22 20:44:27 UTC
This is now scheduled for release on 4/24, but won't be made available until architectural review sign-off.

Comment 34 Nicole Yancey 2013-04-25 13:43:09 UTC
Released to Production 4/24

