Bug 883546
Summary: | Provides/Conflicts/Requires Repodata inconsistency due to caching in rhnPackageRepodata | | |
---|---|---|---|
Product: | [Community] Spacewalk | Reporter: | Stephen Herr <sherr> |
Component: | Server | Assignee: | Stephen Herr <sherr> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Red Hat Satellite QA List <satqe-list> |
Severity: | high | Docs Contact: | |
Priority: | high | | |
Version: | 1.9 | CC: | cperry, fdewaley, michele, mkarg, nigjones, pep, pmutha, tlestach, xdmoon |
Target Milestone: | --- | | |
Target Release: | --- | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | 860881 | Environment: | |
Last Closed: | 2013-03-06 18:35:39 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | 860881, 883623 | | |
Bug Blocks: | 917805 | | |
Description
Stephen Herr
2012-12-04 20:44:33 UTC
Committed to Spacewalk master: 165189729784f9134f5bd9d5090df2f9588eff54

Problem #1: Jan brought to my attention that my earlier description cannot be exactly right if the CapabilityIterator is executed before the PackageDto objects are created. The query is executed and the cursor created when we call PreparedStatement.executeQuery() (which I have verified), and the set of rows it will return for the CapabilityIterator is fixed at that moment. However, the actual values in those rows are not read until the ResultSet lazy-loads the fetch group that contains them. If another process has committed changes to some of the same rows (for example, if a package is shared between two channels and the other process commits its changes), then when Oracle reads the rows it has to reconstruct what the data looked like at cursor-creation time. It does this by using the rollback segment to work out what the data used to look like. So far everything is fine, but Oracle may be unable to reconstruct the original data if enough has happened between execution time and read time (http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:275215756923). See the original bug and note the observation that this usually happens under "high load". When this happens, Oracle throws a SQLException. The PackageCapabilityIterator (see line 171) currently catches the SQLException and carries on as if nothing had happened. I have verified that if a SQLException occurs it can generate XML exactly like what we are seeing here.

Problem #2: There is currently a config option, user_db_repodata, that forces the code to ignore any cached XML in the database and generate the repodata XML from scratch. However, it is currently broken and will never work: the PackageCapabilityIterators will not load the capabilities for a package that has stored primary_xml, regardless of what user_db_repodata is set to.
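To make the two failure modes concrete, here is a minimal Java sketch of a capability loader that behaves the way Problems #1 and #2 describe. It is not the Spacewalk PackageCapabilityIterator: the `RowSource` interface and `loadCapabilities` method are hypothetical stand-ins for the JDBC cursor and iterator, and the `ORA-01555` text is only the error Oracle typically reports when it cannot rebuild a consistent read from rollback data.

```java
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BuggyCapabilityLoader {

    /** Hypothetical stand-in for reading capability rows from an open cursor. */
    @FunctionalInterface
    public interface RowSource {
        List<String> readCapabilities(long packageId) throws SQLException;
    }

    /**
     * Mimics the two reported problems. Returns an empty list when the package
     * already has cached primary_xml (Problem #2: the config option is never
     * consulted) or when reading rows throws (Problem #1: the error is swallowed).
     */
    public static List<String> loadCapabilities(long packageId, String cachedPrimaryXml,
                                                RowSource source) {
        if (cachedPrimaryXml != null) {
            // Problem #2: capabilities are skipped regardless of user_db_repodata.
            return Collections.emptyList();
        }
        try {
            return new ArrayList<>(source.readCapabilities(packageId));
        } catch (SQLException e) {
            // Problem #1: the failure is hidden and the package looks capability-free.
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        RowSource failing = id -> {
            throw new SQLException("ORA-01555: snapshot too old");
        };
        RowSource working = id -> List.of("bash", "libc.so.6");

        System.out.println(loadCapabilities(1L, null, failing));         // []
        System.out.println(loadCapabilities(1L, "<package/>", working)); // []
        System.out.println(loadCapabilities(1L, null, working));         // [bash, libc.so.6]
    }
}
```

Both the swallowed exception and the skipped cached package produce the same empty list, which is why the generated XML looks well-formed but lists no Provides/Requires/Conflicts entries.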
This means that if you ever try to use the config option, the code will faithfully regenerate the XML and store it back in the database cache, but every package entry it generates will have exactly 0 capabilities, and the primary.xml generated for the channel will stay broken until you manually delete all the data in the database cache.

Solution #1: The minimal-change way to fix Problem 1 would be to have the PackageCapabilityIterators propagate SQLExceptions, which would force the repodata generation to exit and try again in a few minutes. The next run should succeed unless yet another channel is changing things, in which case it keeps retrying until it eventually succeeds. To fix Problem 2 we could simply remove the constraint that the PackageCapabilityIterators only load capabilities for packages with null primary_xml, as I did in Comment 1. This would let the iterators find the correct information if the config option is set, and the code would then generate correct XML.

These solutions are problematic. The first change forces multiple runs whenever something goes wrong. While that is certainly better than writing bad data, it is not as good as everything just working the first time. The second change forces us to load every capability for every package in a channel every time we regenerate the repodata. Adding just a single package to the channel would require reading literally millions of rows, so that implementation would clearly perform poorly.

Solution #2: We can avoid both of these problems by restructuring how the PackageCapabilityIterators work. Currently the queries select all capabilities for all packages in a channel. We then have to deal with fetch sizes, lazy loading, and Oracle reconstructing the data as it was when we first ran the select, and then we try to match each capability to its package by comparing package_id.
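The current channel-wide approach amounts to merging two streams keyed by package_id. The Java sketch below assumes (this is not stated in the bug) that the capability rows arrive ordered by package_id and that packages are processed in the same order; `CapabilityRow` and `assign` are hypothetical names, and an in-memory iterator stands in for the long-lived JDBC cursor.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ChannelWideCapabilityMerge {

    /** Hypothetical row shape: (package_id, capability name) from one channel-wide query. */
    public record CapabilityRow(long packageId, String name) { }

    /**
     * Walks one package_id-ordered row stream alongside the sorted package list,
     * giving each package the rows whose package_id matches its own.
     */
    public static Map<Long, List<String>> assign(List<Long> sortedPackageIds,
                                                 Iterator<CapabilityRow> sortedRows) {
        Map<Long, List<String>> result = new LinkedHashMap<>();
        CapabilityRow pending = sortedRows.hasNext() ? sortedRows.next() : null;
        for (long pkgId : sortedPackageIds) {
            List<String> caps = new ArrayList<>();
            // Skip rows belonging to packages that sort before this one.
            while (pending != null && pending.packageId() < pkgId) {
                pending = sortedRows.hasNext() ? sortedRows.next() : null;
            }
            // Consume every row that belongs to this package.
            while (pending != null && pending.packageId() == pkgId) {
                caps.add(pending.name());
                pending = sortedRows.hasNext() ? sortedRows.next() : null;
            }
            result.put(pkgId, caps);
        }
        return result;
    }

    public static void main(String[] args) {
        List<CapabilityRow> rows = List.of(
                new CapabilityRow(10, "bash"),
                new CapabilityRow(10, "libc.so.6"),
                new CapabilityRow(20, "python"));
        // {10=[bash, libc.so.6], 15=[], 20=[python]}
        System.out.println(assign(List.of(10L, 15L, 20L), rows.iterator()));
    }
}
```

Because a single cursor feeds every package in the channel, it has to stay open for the whole run, which is exactly the window in which the read-consistency failure from Problem #1 can occur.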
Instead, the query should return only the capabilities for the package we are currently operating on (and we would not care whether the package already has primary_xml). The PackageDto object already knows whether it has primary_xml, so if it does (and we are using the database cache) we may never need to execute any of the package capability queries. This alone would save around a minute at the start of every repodata task. Furthermore, if we do need to generate the XML (either because it has never been generated before or because the config option is set), it would work correctly every time. This way there is no long-running cursor that could run into SQLExceptions: we would simply run the query, fetch all the results immediately, and close the cursor. We would still want to propagate any SQLExceptions that occur, but they should have little or no opportunity to happen. The downside of this approach is that in some cases (such as the initial import of a channel) we would execute the smaller query many times, which may slow down repodata regeneration. However, in my opinion one-time slowness is a small price to pay for fixing these deeper issues, and for giving customers a simple config option they can set once to fix these problems if they ever occur again.

I will proceed to code a fix for Solution #2, and revert the commit in Comment 1.

Committed to Spacewalk master: a5e6dad5668f09a998b497d5d61ce44df5690509

Marking bug as ON_QA since tonight's build of Spacewalk nightly is a release candidate for Spacewalk 1.9.

Spacewalk 1.9 has been released. https://fedorahosted.org/spacewalk/wiki/ReleaseNotes19
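The per-package lookup proposed in Solution #2 can be sketched in Java as follows. This is an illustrative sketch, not the committed fix: `CapabilityQuery`, `PackageInfo`, `primaryXmlFor`, and the `ignoreCache` flag (standing in for the user_db_repodata option) are hypothetical, and the emitted XML is a simplified placeholder rather than the real primary.xml schema. It shows the two properties described in Solution #2: cached XML short-circuits the query entirely, and when a query is needed it covers one package only and any SQLException propagates instead of being swallowed.

```java
import java.sql.SQLException;
import java.util.List;

public class PerPackageCapabilities {

    /** Hypothetical: runs one short query for a single package and closes its cursor. */
    @FunctionalInterface
    public interface CapabilityQuery {
        List<String> forPackage(long packageId) throws SQLException;
    }

    /** Hypothetical stand-in for the PackageDto fields that matter here. */
    public record PackageInfo(long id, String primaryXml) { }

    /**
     * Returns the XML for one package. Cached XML is reused unless ignoreCache is set;
     * otherwise a single per-package query runs. SQLExceptions are not caught, so a
     * failed read aborts the run instead of producing an entry with zero capabilities.
     */
    public static String primaryXmlFor(PackageInfo pkg, boolean ignoreCache,
                                       CapabilityQuery query) throws SQLException {
        if (pkg.primaryXml() != null && !ignoreCache) {
            return pkg.primaryXml(); // no capability query executed at all
        }
        StringBuilder xml = new StringBuilder("<package id=\"").append(pkg.id()).append("\">");
        for (String cap : query.forPackage(pkg.id())) {
            xml.append("<entry name=\"").append(escape(cap)).append("\"/>");
        }
        return xml.append("</package>").toString();
    }

    /** Minimal XML attribute escaping for the simplified output above. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) throws SQLException {
        CapabilityQuery query = id -> List.of("bash", "glibc >= 2.12");
        PackageInfo cached = new PackageInfo(7L, "<package id=\"7\"/>");
        PackageInfo uncached = new PackageInfo(8L, null);

        System.out.println(primaryXmlFor(cached, false, query));   // cached XML reused
        System.out.println(primaryXmlFor(cached, true, query));    // regenerated from one query
        System.out.println(primaryXmlFor(uncached, false, query)); // generated on first use
    }
}
```

In a real JDBC implementation, `forPackage` would open a PreparedStatement, read every row, and close it (for example with try-with-resources) before returning, so no cursor outlives a single package.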