Red Hat Bugzilla – Bug 883546
Provides/Conflicts/Requires Repodata inconsistency due to caching in rhnPackageRepodata
Last modified: 2013-03-06 13:35:39 EST
So, as you know, the yum repodata is regenerated for a channel every time a package gets added / removed from it. This process can be fairly slow, depending on how many packages are in the channel and how powerful the server is.
Somewhere along the way, someone figured out that once we've generated the xml for X package, it's silly to regenerate it every time another package gets added to the channel. After all, X's xml isn't going to change because of that. So we added a cache in the database so that once we've generated xml for a package we should never have to do it again. If the xml exists in the database for X, then we use that, otherwise we generate it and store it in the database and then use it.
The problem here is that somehow some of the xml that got stored in the database was invalid, missing X's provides / requires / conflicts / obsoletes data. This is why it regenerates the repodata incorrectly every time, because it is using an incorrect cache.
So to get an affected Spacewalk's repodata generation functioning properly again, simply delete the cached xml in the database. Sadly there is no other way, you have to do this directly to the database. The only effect this will have is forcing a full regeneration of the package xml data the next time the repodata task runs, which will make the first repodata generation slower, but also will produce a correct cache and repodata file.
You can delete all the cached xml associated with an affected channel by doing something like:
delete from rhnPackageRepodata where package_id in (select cp.package_id from rhnChannelPackage cp, rhnChannel c where c.id = cp.channel_id and c.label = 'rhel-i386-server-6');
delete from rhnPackageRepodata;
Also works fine, just wiping out the entire cache for every channel.
The astute reader will have noticed that this does not explain how invalid xml got stored in the first place. That is an excellent question, and investigation is still ongoing. However I fear that the answer may be extremely transient and difficult to reproduce or track down. Hopefully the root cause will present itself soon and we can get this fixed permanently.
--- Additional comment from Stephen Herr on 2012-12-03 17:42:01 EST ---
It seems to me that every package that displays this issue is used in more than one channel. IE, usually it is a noarch or i386 package that is included in both the i386 and x86_64 versions of a channel. I suspect that this is related to the channel metadata being generated simultaneously for those two channels and clobbering each other.
In particular, see the comment in PackageCapabilityIterator. Results from the provides / requires / obsoletes / depends queries are lazy-loaded 200 results at a time. Also note that if a package has cached xml data it will not be loaded by those queries.
In contrast, it looks to me like the PackageDtos are loaded all at once at the beginning of the repo regeneration task (RepositoryWriter:180). This creates a timing hole where the PackageDto object may not have any cached *_xml data filled out, but by the time the the PackageCapabilityIterator gets to it the other process has already saved the *_xml data.
More specifically I believe it happens like this:
1) Package bind-libs-9.8.2-0.10.rc1.el6_3.3.i686.rpm is added to both the i386 and x86_64 versions of rhel-*-server-6
2) Taskomatic faithfully picks up both jobs and distributes them to two different workers so it can parallelize the work and get done quicker.
3) Both the i386 and x86_64 repomd jobs create a PackageDto object for this version of bind-libs. It has never had repomd xml created for it before, so the primary_xml entry in the Dtos are null
4) One of the processes reaches the package first, let's say the i386 channel. The [provides,requires,conficts,obsoletes]Iterator in PrimaryXmlWriter.addPackagePrcoData() correctly lazy-loads everything. The xml gets correctly written and saved back to rhnPackageRepodata. The i386 channel will end up having correct xml.
5) The other process reaches the package. This time when our Iterators attempt to lazy-load the capabilities, they get nothing because our capability queries "helpfully" do not include info for packages that have primary_xml data already saved. The empty xml tag gets created since there are no capabilites, and the resultant incorrect xml gets saved back to rhnPackageRepodata. The x86_64 channel will end up having incorrect xml.
6) The next time either the i386 or x86_64 channel gets the repodata recreated, it fetches the incorrect xml from the database and writes a bad primary.xml file. The only fix at this point is to delete the bad cached data.
Testing and working on a fix will follow tomorrow, but I'm pretty sure that this is what's happening. It may not be possible to know for sure though until we release a fix for this issue and see if the customer reports go away.
--- Additional comment from Stephen Herr on 2012-12-04 15:23:18 EST ---
Furthermore, I believe this bug will never manifest on a postgresql database, only oracle, because of the difference in the default handling of concurrent transactions.
Postgresql only allows a transaction to see data that was committed before the transaction began, so in this case both sets of CapabilityIterators will return all the capabilities for the packages and this will not happen.
Oracle by default uses a Read Committed isolation level which reduces the number of read locks required but also can lead to non-repeatable reads in a single transaction which is the heart of this issue. http://docs.oracle.com/cd/E11882_01/timesten.112/e21631/concurrent.htm
Committed to Spacewalk master: 165189729784f9134f5bd9d5090df2f9588eff54
Jan brought to my attention that my earlier description can not be exactly right if the CapabilityIterator is executed before the PackageDto objects are created. The query gets executed and the cursor created when we call PreparedStatement.exectueQuery() (which I have verified), and the rows it will return for the CapabilityIterator are set in stone at that time.
However, the actual values of those rows are not read until the ResultSet lazy-loads their group. If another process has committed changes to some of the same rows (like if a package is shared between two channels and the other process commits its changes), then when Oracle goes to read the rows it has to re-construct what the data looked like at the time of the cursor creation. It does this by using the rollback segment to figure out what the data used to look like.
So far everything is fine, but it is possible that Oracle is unable to reconstruct the original data if enough has happened between the cursor execution time and the read time (http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:275215756923). See original bug and notice the observation that this usually happens under "high load". If so, Oracle will throw a SQLException. The PackageCapabilityIterator (see line 171) currently catches the SQLException and continues on as if nothing has happened. I have verified that if a SQLException occurs it can generate xml exactly like what we're seeing here.
There is currently a config option, user_db_repodata, that forces the code to ignore any cached xml in the database and generate the repodata xml from scratch. However, it is currently broken and will never work. The PackageCapabilityIterators will not load the capabilities for a package that has stored primary_xml, regardless of what user_db_repodata is set to. This means that if you ever try to use the config option the code will faithfully regenerate and store the xml back to the database cache, but every package xml it generates will have exactly 0 capabilities and the primary.xml channel that gets generated will be broken until you manually delete all the data in the database cache.
The minimal-change way to fix these errors would be to have the PackageCapabilityIterators throw SQLException errors on up, which would force the repodata generation to exit and try again in a few minuets, to fix Problem 1. The next time the repodata generation should succeed unless there is yet another channel changing things, in which case it keeps trying until it eventually succeeds.
To fix Problem 2 we could simply remove the constraint for the PackageCapabilityIterators to only load capabilities for packages that have null primary_xml, like I did in Comment 1. This would allow the Iterators to find the correct information of the config option is set, and it would proceed to generate correct xml.
These solutions are problematic. The first change forces multiple runs to occur in case something goes wrong. While that is certainly better than writing bad data if something goes wrong, it is not as good as everything just working the first time. The second change forces us to load every capability for every package in a channel every time we regenerate the repodata. Just adding a single package to the channel would require literally millions of rows to be read, and the performance of that implementation is clearly sub-optimal.
We can avoid both of these problems by re-structuring how the PackageCapabilityIterators work. Currently we have the queries select all capabilities for all packages in a channel. Then we have to deal with fetch sizes and lazy-loading and re-constructing the data like it was when we first ran the select, and then we trying to match up the capability with the package by comparing package_id. Instead we should have the query only return capabilities for the package we are currently operating on (and we wouldn't care if the package already has primary_xml).
The PackageDto object already knows whether it has primary_xml or not, so if it does (and we're using the database cache) we may never need to execute any of the package capability queries. This would save around a minute of time right off the bat of any repometadata task. Furthermore, if we need to generate the xml (either because it has never been generated before or because the config option is set) then it would work correctly every time.
This way there is no need to have a long-running cursor open that could potentially run into SQLExceptions. We would simply run the query, get all the results immediately, and close the cursor. We would still want to throw any SQLExceptions that occur on up, but they should never really have an opportunity to happen.
The downside to this approach is that we'd have to execute the smaller query many times in some cases (like the initial import of a channel), which may slow down the reporegeneration process. However, one-time slowness is in my opinion a small price to pay for fixing these deeper issues, as well as giving customers a simple config option they can set once to fix these issues if they ever occur again.
I will proceed to code a fix for Solution #2, and revert the commit in Comment 1.
Committed to Spacewalk master: a5e6dad5668f09a998b497d5d61ce44df5690509
Marking bug as ON_QA since tonight's build of Spacewalk nightly is a release candidate for Spacewalk 1.9.
Spacewalk 1.9 has been released.