Description of problem:
Reported by multiple customers with similar symptoms after upgrading to sat 5.3.0 such as the following:
* When clicking on rhn/systems/Overview.do, after about 2 minutes the webserver reports "system temporary unavailable".
* After the upgrade it takes lot of time (~2-5 mins) to list all available System Groups. Via the WebUI: System -> Systemgroups. Using the API displays all groups immediately. The relevant page is rhn/systems/SystemGroupList.do. The satellite has ~60 systems subscribed into ~30 groups.
Attached you can find an oracle statspack and a fresh satellite-debug.
* Due to a huge SQL request taking more than half of the machine's computing power, attempts to display all groups on the satellite 5.3 produce a timeout from the JSP server
This is a very important issue because the URL is an entry point for many other important links like kickstart, system groups etc.
Version-Release number of selected component (if applicable):
Red Hat Network (RHN) Satellite 5.3.0
Steps to Reproduce:
1. Have a satellite with ~11000 systems (or in the thousands) registered and some RHEL base channels sync'd (we should have a customer db-control backup dump soon).
2. Click on https://SATELLITEFQDN/rhn/systems/Overview.do and rhn/systems/SystemGroupList.do
See load on Satellite server go up, either page takes several minutes to load or end in "temporarily unavailable" error due to timeout.
Page loads in reasonable time (in the order of seconds depending on deployment size, at least comparable than earlier versions - <5.3.0 - of satellite), no timeout/temporarily unavailable errors or high loads.
Already being worked with Engineering.
Per IRC with Justin, engineering is working on optimizing the query (the final version may be different than proposed), in the meantime, can continue to use proposed query change from GPS consultant as a temporary workaround (great work! :)).
jsherrill_laptop xixi, so that kinda makes senese, but i'm surprised it made *that* much of a difference
xixi looks like it's mainly "rhnServerNeededPackageCache" to "rhnServerNeededErrataCache"
jsherrill_laptop i wonder if it would be faster to just use rhnServerNeededCache
jsherrill_laptop which is the table that is underneath both those views
xixi so use the table instead of the view
xixi jsherrill_laptop: sure we can try that
jsherrill_laptop yeah, both those views used to be tables
jsherrill_laptop but i abstracted it out into one table (rhnServerNeededCache)
xixi jsherrill_laptop: ok
jsherrill_laptop and then created two views to 'simulate' the old tables
xixi jsherrill_laptop: so do we want to have the other affected customers try a modified version as you mentioned above?
jsherrill_laptop couldn't hurt, and once we have an internal reproducer, we can test the two and see if there's much difference
jsherrill_laptop: yeah we definitely want to try out in-house
(In reply to comment #2)
> jsherrill_laptop i wonder if it would be faster to just use
> jsherrill_laptop which is the table that is underneath both those views
I tried this, and it didn't make any difference for the execution time.
The reason why the rhnServerNeededPackageCache view is slow is the "max" and "group by". This forces Oracle to take all package rows for a server, and hold them in memory as a temporary result (or in the case here, in temporary tablespace, because the default PGA was no large enough).
spacewalk master 5713ce473236d273c2a4c1608ec3f7f144e9b4e8
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.