Description of problem:
Yum will incorrectly use its own package cache on subsequent operations if an RPM command directly manipulates the RPM database shortly after a yum install operation.
Version-Release number of selected component (if applicable):
How to reproduce:
NOTE - for reasons to be explained below this is much more likely to happen on a filesystem like ext2/ext3 where mtime resolution is limited to 1 second. On modern filesystems like ext4 this race is much less likely to occur.
The following bash script reproduces the problem for me:
while true; do yum -y install nc || break; rpm -e nc; done
It should fail shortly with an error like:
Rpmdb checksum is invalid: pkg checksums: nc-0:1.84-22.el6.x86_64
Expected result: the script should loop forever.
The error is being thrown in yum/rpmsack.py. The check that is failing is within preloadPackageChecksums and is reproduced below:
rpmdbv = self.simpleVersion(main_only=True)
fo = open(self._cachedir + '/pkgtups-checksums')
frpmdbv = fo.readline()
if not frpmdbv or rpmdbv != frpmdbv[:-1]:
rpmdbv here is supposed to be the version of the RPM database from /var/lib/rpm/Packages *but* if you follow the trace you'll eventually find out it takes a shortcut in _get_cached_simpleVersion_main() [the same file]:
# This test is "obvious" and the only thing to come out of:
# ...if anything gets implemented, we should change.
rpmdbvfname = self._cachedir + "/version"
rpmdbfname = self.root + "/var/lib/rpm/Packages"
if os.path.exists(rpmdbvfname) and os.path.exists(rpmdbfname):
    # See if rpmdb has "changed" ...
    nmtime = os.path.getmtime(rpmdbvfname)
    omtime = os.path.getmtime(rpmdbfname)
    if omtime <= nmtime:
        rpmdbv = open(rpmdbvfname).readline()[:-1]
        self._have_cached_rpmdbv_data = rpmdbv
Basically, it compares the mtime of the local cache file with the mtime of the Packages file. If the Packages mtime is less than or equal to the cache mtime, it takes the version from the cache instead of computing it from the real Packages file.
The bug is that if the yum install completes in the *same second* as the rpm -e operation, the '/var/lib/rpm/Packages' file will have the *same* mtime as the cache even though its contents no longer match the cache.
The program then errors out because it looks up every package listed in the cache and cannot find the package that was just removed by the 'rpm -e' command.
The correct fix is to change the '<=' to '<' which ensures the cache *is* more recent than the actual RPM database itself.
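The same-second collision can be shown without yum or rpm. The following is a minimal sketch, not yum's code: the helper `cache_looks_fresh` and the temporary file names are hypothetical stand-ins, and `os.utime` is used to force both files onto the same whole-second mtime, as a one-second-resolution filesystem would.

```python
import operator
import os
import tempfile

def cache_looks_fresh(db_path, cache_path, compare=operator.le):
    """Mimic the shortcut: trust the cache when db mtime `compare` cache mtime."""
    omtime = os.path.getmtime(db_path)     # stand-in for /var/lib/rpm/Packages
    nmtime = os.path.getmtime(cache_path)  # stand-in for yum's "version" cache
    return compare(omtime, nmtime)

def demo():
    with tempfile.TemporaryDirectory() as tmp:
        db = os.path.join(tmp, "Packages")
        cache = os.path.join(tmp, "version")
        for path in (db, cache):
            with open(path, "w") as fo:
                fo.write("placeholder\n")
        # Simulate 1-second mtime resolution: yum wrote the cache and
        # "rpm -e" rewrote the database within the same second.
        same_second = 1000000000
        os.utime(cache, (same_second, same_second))
        os.utime(db, (same_second, same_second))
        with_le = cache_looks_fresh(db, cache, operator.le)
        with_lt = cache_looks_fresh(db, cache, operator.lt)
        return with_le, with_lt

if __name__ == "__main__":
    print(demo())
```

With equal mtimes, the `<=` comparison trusts the cache even though the database was modified after it was written, while `<` does not.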
You can see this in action by adding a sleep after the yum install in the loop:
while true; do yum -y install nc || break; sleep 1; rpm -e nc; done
The above doesn't fail and will loop forever. The sleep 1 means that the Packages file will have a larger mtime after the rpm -e command, thus forcing yum to read the rpmdb version from the database itself.
On modern filesystems like ext4 that have higher mtime resolution this problem should occur less often, since it will be much harder to have both commands finish in the same millisecond (or microsecond).
> The correct fix is to change the '<=' to '<'
ACK, making rpmdb caching a bit more conservative should not hurt. This should also go upstream.
Removing dev. ACK flag, as this patch breaks the caching for the common case.
Why are you using rpm directly?
> On modern filesystems like ext4 that have higher mtime resolution this problem should occur less since it will be much harder to have both commands finish in the same millsecond (or microsecond).
Also, this is a slight understatement: rpm is unlikely to be able to run erase transactions 1,000,000,000x faster than it currently does anytime soon (ext4 has nanosecond resolution).
This problem is being triggered for us by puppet. Specifically the 'remove' package action in puppet uses "rpm -e" while the 'install' package action will use yum directly. That said, I don't see why using RPM directly should be a problem here.
Can you explain how this breaks the common caching case? As long as yum does not write out the caches until the RPM install operation is complete they should have a newer timestamp than the RPM caches. I agree that on filesystems with lower mtime resolution you will have the problem where the timestamp is the same.
From my read, the fallback logic here is also not to invalidate the cache but rather check the version number on the RPM cache against the version number of the yum cache to make sure the databases match. Assuming they match I believe the cache will still be used?
I don't think this is a matter of 'making the caching more conservative'. The caching logic as-is will sometimes accept invalid caches, which seems like a bug. Making sure the yum caches are strictly newer (> instead of >=) is the only way to guarantee the cache is safe without inspecting the RPM database versions.
I agree with you that it's very unlikely to happen under ext4 but there are still a lot of machines (including the ones we're seeing this on) that run ext3.
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unable to address this
request at this time.
Red Hat invites you to ask your support representative to
propose this request, if appropriate, in the next release of
Red Hat Enterprise Linux.
(In reply to greg.2.harris from comment #4)
> This problem is being triggered for us by puppet. Specifically the 'remove'
> package action in puppet uses "rpm -e" while the 'install' package action
> will use yum directly. That said, I don't see why using RPM directly should
> be a problem here.
That's just as broken as the other way around, although maybe less observable.
history and yumdb are the two obvious things that aren't done when you go to rpm directly.
> Can you explain how this breaks the common caching case?
Because the common case is that rpmdb.simpleVersion => _put_cached_simpleVersion_main is called within a second of rpmdb finishing the transaction. So changing the <= to < means we might as well just not bother writing the cache at all.
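To illustrate why neither operator is right on a one-second filesystem, here is a small simulation. It is a sketch with made-up timestamps rather than measurements; `stored_mtime` and `trusts_cache` are hypothetical helpers, and the times are seconds on an imaginary clock.

```python
import math
import operator

def stored_mtime(t):
    """Round a wall-clock time down to the whole second the filesystem stores."""
    return math.floor(t)

def trusts_cache(db_written, cache_written, compare):
    """Apply the mtime shortcut to two write times (hypothetical helper)."""
    omtime = stored_mtime(db_written)     # rpmdb Packages file
    nmtime = stored_mtime(cache_written)  # yum "version" cache file
    return compare(omtime, nmtime)

# Common case: the transaction ends at 100.2 and yum writes the cache at 100.5.
# The cache is valid, but both mtimes are stored as 100.
common_le = trusts_cache(100.2, 100.5, operator.le)
common_lt = trusts_cache(100.2, 100.5, operator.lt)

# Race: yum writes the cache at 100.5, then "rpm -e" changes the db at 100.8.
# The cache is stale, but both mtimes are again stored as 100.
race_le = trusts_cache(100.8, 100.5, operator.le)
race_lt = trusts_cache(100.8, 100.5, operator.lt)

if __name__ == "__main__":
    print("common case:", common_le, common_lt)
    print("race case:  ", race_le, race_lt)
```

In this model `<=` accepts both cases and `<` rejects both, so with one-second timestamps the comparison alone cannot tell a valid cache from a stale one.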
> From my read, the fallback logic here is also not to invalidate the cache
> but rather check the version number on the RPM cache against the version
> number of the yum cache to make sure the databases match. Assuming they
> match I believe the cache will still be used?
There are multiple layers to the caching. Changing this just breaks the /version cache, so yes if we have to regenerate that and it's valid then we'll still be able to use the /conflicts etc. caches that rely on it. But it's still significant. Eg. compare:
yum version nogroups
...the first is directly reading just the /version cache, and the second is regenerating it (because it doesn't cache the groups data). The first is roughly the same speed as python/yum init. ... the second is almost 4x that _if_ the rpmdb is in page cache, and like 12x that otherwise.
> I agree with you that it's very unlikely to happen under ext4 but there are
> still a lot of machines (including the ones we're seeing this on) that run
> ext3.
You also need to have puppet (or something) running rpm directly, during the transaction.
If you want you can have puppet run "yum clean rpmdb", after it alters the rpmdb directly ... which will delete all the cached rpmdb data, which is likely what you want when all that data is going to be bad anyway.
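A wrapper along these lines could apply that suggestion. This is only a sketch under assumptions: `erase_and_clean` is a hypothetical helper, not part of puppet or yum, and it assumes both `rpm` and `yum` are on the PATH and that it runs with sufficient privileges.

```python
import subprocess

def erase_and_clean(package, run=subprocess.check_call):
    """Erase a package with rpm, then drop yum's cached rpmdb data.

    `run` is injectable so the command sequence can be checked without
    touching the real system; by default each command raises on failure.
    """
    commands = [
        ["rpm", "-e", package],      # direct rpmdb modification
        ["yum", "clean", "rpmdb"],   # discard the now-stale cached rpmdb data
    ]
    for argv in commands:
        run(argv)
    return commands

if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    erase_and_clean("nc", run=print)
```

Because `check_call` raises if `rpm -e` fails, the cache is only cleaned after a successful erase.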