Red Hat Bugzilla – Bug 142831
Removal of old package versions inappropriately erases files
Last modified: 2007-11-30 17:07:05 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; rv:1.7.3) Gecko/20041001
Description of problem:
If I have two versions of an RPM installed (say, because an rpm -Uvh
of that RPM did not complete successfully), and I remove the old one,
some subset of files is removed inappropriately, causing the new
package to be incomplete.
I was unable to determine whether this problem is fixed in RPM 4.3 (as
provided in Fedora Core 3) because I could not find a way to install
that version of RPM on a RHEL-3 system.
This problem is dramatically compromising system stability in MIT's
environment. Whenever an update is interrupted and resumed later,
whole swaths of system files go missing, including parts of RPM
itself. It is vitally important that we receive an errata addressing
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Grab copies of the "at" RPM from RHEL-3 and RHEL-3-updates.
rpm -e at
rpm -ivh at-3.1.8-46.i386.rpm
rpm -ivh --force at-3.1.8-48.ent.i386.rpm
rpm -e at-3.1.8-46
rpm -V at
Actual Results:
-bash-2.05b# rpm -e at
-bash-2.05b# rpm -ivh at-3.1.8-46.i386.rpm
warning: at-3.1.8-46.i386.rpm: V3 DSA signature: NOKEY, key ID db42a60e
-bash-2.05b# rpm -ivh --force at-3.1.8-48.ent.i386.rpm
warning: at-3.1.8-48.ent.i386.rpm: V3 DSA signature: NOKEY, key ID
-bash-2.05b# rpm -q at
-bash-2.05b# rpm -V at-3.1.8-48.ent
-bash-2.05b# rpm -e at-3.1.8-46
-bash-2.05b# rpm -q at
-bash-2.05b# rpm -V at
missing d /usr/share/doc/at-3.1.8/ChangeLog
missing d /usr/share/doc/at-3.1.8/Copyright
missing d /usr/share/doc/at-3.1.8/Problems
missing d /usr/share/doc/at-3.1.8/README
missing d /usr/share/doc/at-3.1.8/timespec
Expected Results: The final "rpm -V at" invocation should have run clean.
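The final check in the recipe can be scripted. What follows is a minimal sketch, assuming `rpm -V` reports absent files on lines that begin with the word `missing`, as in the output above. The function name `missing_files` is hypothetical (it is not part of rpm), and it reads verification output on stdin so it can be tried without touching the RPM database. Paths containing spaces are not handled.

```shell
#!/bin/sh
# missing_files: read `rpm -V` output on stdin and print the path of every
# file reported as missing. Hypothetical helper, not part of rpm.
missing_files() {
    # Lines look like: "missing d /usr/share/doc/at-3.1.8/README".
    # The path is taken to be the last whitespace-separated field.
    awk '$1 == "missing" { print $NF }'
}

# Usage (on a system with rpm installed):
#   rpm -V at | missing_files
```

An empty result after the old version is erased would correspond to the "run clean" outcome expected above.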
"Dramatically compromising system stability" seems a bit
extreme if/when the files are in /usr/share/doc imho.
Gah. That was just a random example. The real instances we've been
encountering involve systems losing large chunks of rpm itself,
because of an aborted update which included an upgrade of the rpm
rpms. (Large swaths of other system files also turn up missing, but
the missing rpm files are the biggest deal, because rpm stops working
as a result.)
But that would have been a much more confusing demonstration.
We have on the order of 800 systems that do **NOT** use RHN, but
instead use our own scripts that drive RPM. In the next month we plan
to push out RHEL3 Update 4 to these systems.
It would be REALLY helpful if we got some help resolving this issue.
It would not look good for Red Hat if MIT professors came in to work
with dead systems, and all we could tell them was that "Red Hat didn't
think our bug was very important."
Out of curiosity, why aren't you using RHN/up2date? Can you attach the scripts
If you are seeing this frequently I imagine that there is a more robust way of
scripting it - are you using rpm-python bindings or just calling out to rpm? The
only time that I can really see you getting two packages installed in parallel
(other than with biarch setups) is with prematurely terminated transactions.
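Systems left in that state by a terminated transaction can be found before anything is erased. This is a minimal sketch, assuming one package name per line on stdin; the function name `duplicate_names` is hypothetical. Packages that are legitimately installed more than once (biarch pairs, parallel kernel packages) will also be listed, so the result is a list of candidates to inspect, not a list of broken packages.

```shell
#!/bin/sh
# duplicate_names: read one package name per line on stdin and print each
# name that occurs more than once. Hypothetical helper, not part of rpm.
duplicate_names() {
    # uniq -d prints one copy of each repeated line; it needs sorted input.
    sort | uniq -d
}

# Usage (on a system with rpm installed):
#   rpm -qa --queryformat '%{NAME}\n' | duplicate_names
```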
Our update system predates the existence of Linux, and is multi-platform.
We currently update both Solaris and Linux.
Historically we've run our update system on: Vax Ultrix, IBM AOS, IBM AIX, SGI IRIX, SunOS, etc.
One of these days the industry will catch up to us.
Every so often Red Hat and I, or Sun and I restart conversations about technology transfer,
but then something distracts them away.
Here's a web page we publish for our customers describing some of the user-visible differences:
Here's an ancient web page that does a crappy job of explaining some of the insides of our install/
We use a C program which calls out to rpmlib. In the cases we've seen reported,
users have seen the rpmlib-using process hang in a futex() call (suggesting some
kind of Berkeley DB issue). We have had two reports of that situation so far;
since we have no way of reproducing it, and it happens infrequently, I have no
current expectation of asking Red Hat to fix it. It could even be our bug,
although RPM's history of BDB hangs and the ratio of our code to RPM code
suggests otherwise. http://web.mit.edu/source/athena/etc/rpmupdate contains our
A less baroque scenario would be that the user interrupts the update by hand,
perhaps in one of the places where rpmlib provides no feedback to the
application that it is making progress within the transaction.
At any rate, the issue here is the consequence of such an interrupt being
unnecessarily dire. RPM is not supposed to be so brittle that it eats itself or
other parts of the system upon resuming an interrupted update. I don't think
it's appropriate to change the status of this bug to NEEDINFO, since I have
provided a very clear reproduction recipe for the part of the problem which
needs to be fixed.
Thanks for the additional information, the NEEDINFO was merely set whilst trying
to gather more information about your environment as a whole. Thanks for the
I am disappointed that there's been no forward progress on this issue
in nearly two months. We can point to 11 systems here that required
a re-install because the RPM database got corrupted, and we see
another 6 systems hung in updates that may be caused by this problem.
We prefer to have NO systems requiring re-install or manual
intervention for updates. Can we PLEASE get some resourcing behind
Thanks for using Bugzilla, and we hope you will continue to do so when
Keep in mind, however, that Bugzilla is simply a mechanism to report
bugs, and has no defined SLA. As outlined on the front page of
bugzilla.redhat.com, it is not a support mechanism. If
you are expecting help with problems during a defined time period, it
is more appropriate to contact Red Hat Tech Support. If you do not
have a tech support contract, and want to inquire about options, you
may contact me. My name is Shawn Hunter and I am the MIT account
manager at Red Hat.
MIT will be deploying a satellite server very soon, which could help
with these problems. The administrator of that server has access to
24x7 tech support for it with defined SLAs. Obviously, we will not
be able to help you with Athena as it is not a product we distribute,
but I can point you to the appropriate contacts at MIT. Please feel
free to contact me directly with questions.
So, on 16 March, I opened an escalation with Red Hat to ask that this issue
On 19 April, that escalation was closed with, "Closing this ticket as the
process will be handled from engineering and Bugzilla."
But alas, I see no status change.
I have heard nothing from anyone.
I have moved from being disappointed to being extremely disappointed.
I appreciate that resources are finite, and that there are many different
approaches to prioritization. But after being told, "Follow this process and we
will take your issue seriously" and having the issue dropped for another two
months is really not good.
Let's please at least look closely enough at the problem to understand the scope
in crafting a repair and to make a credible assessment of the likelihood that it
will begin hurting more sites than just MIT!
I've been told that this bug is not receiving due attention because of various
misunderstandings on Red Hat's part. So, here is a little FAQ:
Q. Isn't this bug unimportant because it's very unusual to have two versions of
an RPM installed at once?
A. No. You can have two versions of an RPM installed at once because an "rpm
-Uvh" invocation was interrupted, or because a scriptlet failed. For example,
try installing librsvg2 from the original RHEL 3 on an up-to-date RHEL 3 system:
rpm -Uvh --oldpackage librsvg2-2.2.3-2.i386.rpm
The %post scriptlet fails, and you wind up with two versions of librsvg2
installed. librsvg2 doesn't happen to trigger the bug described in this bug
report; I'm simply illustrating how a user can get into the "two versions of the
same package installed at once" situation quite easily, without using rpm -i
Q. Isn't this bug unimportant because, in the reproduction recipe, only
documentation files went missing?
A. The reproduction recipe was only a simple example, intended to aid in
debugging the problem. When we've seen this problem in the field, much more
important files have gone missing, including files needed to make "rpm" itself
D: fini 040700 2 ( 2, 2) 4096 /var/spool/at/spool skip
D: fini 100600 1 ( 2, 2) 0 /var/spool/at/.SEQ skip
D: fini 040700 3 ( 2, 2) 4096 /var/spool/at skip
D: fini 100644 1 ( 0, 0) 399 /usr/share/man/man8/atrun.8.gz skip
D: fini 100644 1 ( 0, 0) 887 /usr/share/man/man8/atd.8.gz skip
D: fini 100777 1 ( 0, 0) 430 /usr/share/man/man5/at.deny.5.gz skip
D: fini 100644 1 ( 0, 0) 430 /usr/share/man/man5/at.allow.5.gz skip
D: fini 120777 1 ( 0, 0) 7 /usr/share/man/man1/batch.1.gz skip
D: fini 120777 1 ( 0, 0) 7 /usr/share/man/man1/atrm.1.gz skip
D: fini 120777 1 ( 0, 0) 7 /usr/share/man/man1/atq.1.gz skip
D: fini 100644 1 ( 0, 0) 3023 /usr/share/man/man1/at.1.gz skip
D: fini 100644 1 ( 0, 0) 2451 /usr/share/doc/at-3.1.8/timespec
D: fini 100644 1 ( 0, 0) 1854 /usr/share/doc/at-3.1.8/README
D: fini 100644 1 ( 0, 0) 387 /usr/share/doc/at-3.1.8/Problems
D: fini 100644 1 ( 0, 0) 626 /usr/share/doc/at-3.1.8/Copyright
D: fini 100644 1 ( 0, 0) 1925 /usr/share/doc/at-3.1.8/ChangeLog
D: fini 040755 2 ( 0, 0) 4096 /usr/share/doc/at-3.1.8
D: fini 100755 1 ( 0, 0) 67 /usr/sbin/atrun skip
D: fini 100755 1 ( 0, 0) 22808 /usr/sbin/atd skip
D: fini 100755 1 ( 0, 0) 975 /usr/bin/batch skip
D: fini 120777 1 ( 0, 0) 2 /usr/bin/atrm skip
D: fini 120777 1 ( 0, 0) 2 /usr/bin/atq skip
D: fini 104755 1 ( 0, 0) 43740 /usr/bin/at skip
D: fini 100755 1 ( 0, 0) 1176 /etc/rc.d/init.d/atd skip
D: fini 100600 1 ( 0, 0) 1 /etc/at.deny skip
I see the output you've appended to the case, but I don't understand its meaning.
What exactly does that output demonstrate?
I'm adding notes to record results of the investigation here - it's mostly for
my reference. I wouldn't worry about it. However, FYI:
The rpmfi (file information) action is not set to skip on the files that are
erased, I'm tracing back why this might be happening atm.
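The entries in question can be pulled out of a saved debug log mechanically. This is a minimal sketch, assuming log lines of the form pasted earlier in this report ("D: fini", mode, link count, owner pair, size, path, and an optional trailing "skip"). The function name `unskipped_files` and the log file name in the usage comment are hypothetical, and the thread does not say which command produced the log.

```shell
#!/bin/sh
# unskipped_files: read rpm debug output on stdin and print the path from
# every "D: fini" line that does NOT end in "skip", i.e. every entry the
# erase will act on. Hypothetical helper, not part of rpm.
unskipped_files() {
    # When "skip" is absent, the path is the last field on the line.
    awk '$1 == "D:" && $2 == "fini" && $NF != "skip" { print $NF }'
}

# Usage, with a hypothetical saved log file:
#   unskipped_files < erase-debug.log
```

Run over the debug output pasted earlier, this would print the /usr/share/doc/at-3.1.8 directory and the five files in it, which are the same five files that `rpm -V at` reports as missing in the reproduction recipe.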
Thanks very much for the clarification.
I'm also really pleased to see that there may be some actual traction in
identifying an actual problem. Thanks very much for moving this forward!
I've isolated the root cause of this bug and am assessing how to proceed.
That's GREAT news!
Why has the priority of this bug been downgraded?
Is a data destruction bug now considered "normal"?
Back in October, it looked like we were very close to a resolution on this bug.
Quality Engineering Management has reviewed and declined this request. You may appeal this decision by reopening this request.
So, in October 2005, it seemed like we understood what this problem was.
FOUR MONTHS later, I ask for a status report, and hear silence for ANOTHER four months.
Today, we just say, QEM has reviewed and declined the request.
Sorry, but you guys REALLY can do better than that!
You have a bug here where the way in which software is installed can potentially render a system
un-bootable if the race condition happens at the right time.
At one time the bug was on the CUSP of being understood.
You owe it to your users AT LEAST to say more than, "behind closed doors we're killing this
service request with no explanation."
Could you at least clarify why, specifically, you've chosen to shut this bug off?
Is the bug understood, and too difficult to repair?
Did the person who understood the problem leave, and it's just too hard to track down?
Have you decided that this will not affect ordinary users of RPM? If so, what proof do you have
that the condition is truly unique to the way MIT calls rpmlib, and not merely dormant elsewhere?
Have you decided this problem does not exist in Red Hat 4? What proof do you have the problem
is resolved there?
At a time when MIT is looking at other distributions than Red Hat because of a perceived inability of
Red Hat to provide device drivers and bug fixes in a timely manner, this bug serves as an unfavorable
example of Red Hat's abilities.