Bug 142831

Summary:	Removal of old package versions inappropriately erases files
Product:	Red Hat Enterprise Linux 3	Reporter:	Greg Hudson <ghudson>
Component:	rpm	Assignee:	Daniel Riek <riek>
Status:	CLOSED WONTFIX	QA Contact:
Severity:	medium	Docs Contact:
Priority:	medium
Version:	3.0	CC:	amb, djuran, jamisonm, katzj, k.georgiou, laroche, nobody+pnasrat, tao, wdc
Target Milestone:	---
Target Release:	---
Hardware:	i686
OS:	Linux
Whiteboard:	RHEL3U7NAK
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2006-06-16 15:56:42 UTC	Type:	---
Bug Depends On:
Bug Blocks:	170417

Description Greg Hudson 2004-12-14 16:02:13 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; rv:1.7.3) Gecko/20041001
Firefox/0.10.1

Description of problem:
If I have two versions of an RPM installed (say, because an rpm -Uvh
of that RPM did not complete successfully), and I remove the old one,
some subset of files is removed inappropriately, causing the new
package to be incomplete.

I was unable to determine whether this problem is fixed in RPM 4.3 (as
provided in Fedora Core 3) because I could not find a way to install
that version of RPM on a RHEL-3 system.

This problem is dramatically compromising system stability in MIT's
environment.  Whenever an update is interrupted and resumed later,
whole swaths of system files go missing, including parts of RPM
itself.  It is vitally important that we receive an erratum addressing
this problem.


Version-Release number of selected component (if applicable):
rpm-4.2.3-10

How reproducible:
Always

Steps to Reproduce:
Grab copies of the "at" RPM from RHEL-3 and RHEL-3-updates.

rpm -e at
rpm -ivh at-3.1.8-46.i386.rpm 
rpm -ivh --force at-3.1.8-48.ent.i386.rpm 
rpm -e at-3.1.8-46
rpm -V at


Actual Results:
-bash-2.05b# rpm -e at
-bash-2.05b# rpm -ivh at-3.1.8-46.i386.rpm
warning: at-3.1.8-46.i386.rpm: V3 DSA signature: NOKEY, key ID db42a60e
Preparing...               ########################################### [100%]
   1:at                    ########################################### [100%]
-bash-2.05b# rpm -ivh --force at-3.1.8-48.ent.i386.rpm
warning: at-3.1.8-48.ent.i386.rpm: V3 DSA signature: NOKEY, key ID db42a60e
Preparing...               ########################################### [100%]
   1:at                    ########################################### [100%]
-bash-2.05b# rpm -q at
at-3.1.8-46
at-3.1.8-48.ent
-bash-2.05b# rpm -V at-3.1.8-48.ent
-bash-2.05b# rpm -e at-3.1.8-46
-bash-2.05b# rpm -q at
at-3.1.8-48.ent
-bash-2.05b# rpm -V at
missing    /usr/share/doc/at-3.1.8
missing  d /usr/share/doc/at-3.1.8/ChangeLog
missing  d /usr/share/doc/at-3.1.8/Copyright
missing  d /usr/share/doc/at-3.1.8/Problems
missing  d /usr/share/doc/at-3.1.8/README
missing  d /usr/share/doc/at-3.1.8/timespec
-bash-2.05b#


Expected Results:  The final "rpm -V at" invocation should have run clean.
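
A check like the final "rpm -V at" step can be automated when testing many
packages.  The following is only a minimal sketch: it assumes the "missing"
line format shown in the transcript above, and the function name
missing_paths is hypothetical, not part of rpm.

```python
def missing_paths(verify_output):
    """Return the paths that `rpm -V` output reports as missing.

    Assumes lines of the form seen in this report:
        missing    /path
        missing  d /path
    where the single letter (d = documentation) is optional.
    """
    paths = []
    for line in verify_output.splitlines():
        stripped = line.strip()
        slash = stripped.find("/")
        # Only lines that start with "missing" and contain an absolute path count.
        if stripped.startswith("missing") and slash != -1:
            paths.append(stripped[slash:])
    return paths


# Sample taken from the transcript in this report.
sample = """\
missing    /usr/share/doc/at-3.1.8
missing  d /usr/share/doc/at-3.1.8/ChangeLog
"""

for path in missing_paths(sample):
    print(path)
```

An empty result corresponds to the clean run that was expected here.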

Additional info:

Comment 1 Jeff Johnson 2004-12-29 10:03:50 UTC

"Dramatically compromising system stability" seems a bit
extreme if/when the files are in /usr/share/doc imho.

Comment 2 Greg Hudson 2004-12-29 17:39:18 UTC

Gah.  That was just a random example.  The real instances we've been
encountering involve systems losing large chunks of rpm itself,
because of an aborted update which included an upgrade of the rpm
rpms.  (Large swaths of other system files also turn up missing, but
the missing rpm files are the biggest deal, because rpm stops working
as a result.)

But that would have been a much more confusing demonstration.

Comment 3 wdc 2005-01-05 20:22:18 UTC

We have on the order of 800 systems that do **NOT** use RHN, but
instead use our own scripts that drive RPM.  In the next month we plan
to push out RHEL3 Update 4 to these systems.

It would be REALLY helpful if we got some help resolving this issue.
It would not look good for Red Hat if MIT professors came in to work
with dead systems, and all we could tell them was that "Red Hat didn't
think our bug was very important."

Comment 4 Paul Nasrat 2005-01-06 17:17:22 UTC

Out of curiosity, why aren't you using RHN/up2date?  Can you attach the scripts
you use?

If you are seeing this frequently I imagine that there is a more robust way of
scripting it - are you using rpm-python bindings or just calling to rpm?  The
only time that I can really see you getting two packages installed in parallel
(other than with biarch setups) is with prematurely terminated transactions.
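
One way to spot the parallel-install state described above is to group the
installed package list by name and architecture.  The sketch below is an
assumption-laden illustration, not anything rpm ships: it expects the text
produced by

  rpm -qa --qf '%{NAME} %{ARCH} %{VERSION}-%{RELEASE}\n'

and the function name duplicate_installs and the "examplelib" package are
hypothetical.  Packages that are legitimately installed in several versions
(kernels, for instance) would need to be filtered out by the caller.

```python
from collections import defaultdict


def duplicate_installs(query_output):
    """Return {(name, arch): [version-release, ...]} for packages that
    appear more than once with the same name and architecture."""
    installed = defaultdict(list)
    for line in query_output.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        name, arch, version_release = fields
        installed[(name, arch)].append(version_release)
    # Keying on (name, arch) keeps biarch installs from being flagged.
    return {key: vers for key, vers in installed.items() if len(vers) > 1}


# The "at" versions come from this report; "examplelib" is made up to
# show that a biarch pair is not reported.
sample = """\
at i386 3.1.8-46
at i386 3.1.8-48.ent
examplelib i386 1.0-1
examplelib x86_64 1.0-1
"""

for (name, arch), versions in duplicate_installs(sample).items():
    print(name, arch, versions)
```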

Comment 5 wdc 2005-01-06 17:29:25 UTC

Our update system predates the existence of Linux, and is multi-platform.
We currently update both Solaris and Linux.
Historically we've run our update system on:  Vax Ultrix, IBM AOS, IBM AIX, SGI IRIX, SunOS, etc.

One of these days the industry will catch up to us.
Every so often Red Hat and I, or Sun and I restart conversations about technology transfer,
but then something distracts them away.

Here's a web page we publish for our customers describing some of the user-visible differences:
http://web.mit.edu/ist/topics/linux/choosing.html

Here's an ancient web page that does a crappy job of explaining some of the insides of our
install/update system:
http://web.mit.edu/teamhtml/Athena/FY97/install/main/main.html

Comment 6 Greg Hudson 2005-01-06 17:31:37 UTC

We use a C program which calls out to rpmlib.  In the cases we've seen reported,
users have seen the rpmlib-using process hang in a futex() call (suggesting some
kind of Berkeley DB issue).  We have had two reports of that situation so far;
since we have no way of reproducing it, and it happens infrequently, I have no
current expectation of asking Red Hat to fix it.  It could even be our bug,
although RPM's history of BDB hangs and the ratio of our code to RPM code
suggests otherwise.  http://web.mit.edu/source/athena/etc/rpmupdate contains our
code.

A less baroque scenario would be that the user interrupts the update by hand,
perhaps in one of the places where rpmlib provides no feedback to the
application that it is making progress within the transaction.

At any rate, the issue here is the consequence of such an interrupt being
unnecessarily dire.  RPM is not supposed to be so brittle that it eats itself or
other parts of the system upon resuming an interrupted update.  I don't think
it's appropriate to change the status of this bug to NEEDINFO, since I have
provided a very clear reproduction recipe for the part of the problem which
needs to be fixed.

Comment 7 Paul Nasrat 2005-01-06 17:48:14 UTC

Thanks for the additional information, the NEEDINFO was merely set whilst trying
to gather more information about your environment as a whole.  Thanks for the
test case.

Comment 11 wdc 2005-03-01 22:38:42 UTC

I am disappointed that there's been no forward progress on this issue
in nearly two months.  We can point to 11 systems here that required
a re-install because the RPM database got corrupted, and we see
another 6 systems hung in updates that may be caused by this problem.

We prefer to have NO systems requiring re-install or manual
intervention for updates.  Can we PLEASE get some resourcing behind
this issue?

Comment 12 Shawn Hunter 2005-03-02 18:48:42 UTC

MIT-

Thanks for using Bugzilla, and we hope you will continue to do so when
reporting bugs.

Keep in mind, however, that Bugzilla is simply a mechanism to report
bugs, and has no defined SLA.  It is not a support mechanism, as is
outlined on the front page of bugzilla.redhat.com.  If
you are expecting help with problems during a defined time period, it
is more appropriate to contact Red Hat Tech Support.  If you do not
have a tech support contract, and want to inquire about options, you
may contact me.  My name is Shawn Hunter and I am the MIT account
manager at Red Hat.

MIT will be deploying a satellite server very soon, which could help
with these problems.  The administrator of that server has access to
24x7 tech support for it with defined SLAs.  Obviously, we will not
be able to help you with Athena as it is not a product we distribute,
but I can point you to the appropriate contacts at MIT.  Please feel
free to contact me directly with questions.

Thanks!
-Shawn Hunter
shunter
919-754-3744

Comment 14 wdc 2005-05-06 19:32:18 UTC

So, on 16 March, I opened an escalation with Red Hat to ask that this issue
be addressed.

On 19 April, that escalation was closed with, "Closing this ticket as the
process will be handled from engineering and Bugzilla."

But alas, I see no status change.
I have heard nothing from anyone.

I have moved from being disappointed to being extremely disappointed.

I appreciate that resources are finite, and that there are many different
approaches to prioritization.  But being told, "Follow this process and we
will take your issue seriously," and then having the issue dropped for another
two months is really not good.

Let's please at least look closely enough at the problem to understand the scope
in crafting a repair and to make a credible assessment of the likelihood that it
will begin hurting more sites than just MIT!  

-wdc

Comment 17 Greg Hudson 2005-09-19 16:59:20 UTC

I've been told that this bug is not receiving due attention because of various
misunderstandings on Red Hat's part.  So, here is a little FAQ:

Q. Isn't this bug unimportant because it's very unusual to have two versions of
an RPM installed at once?

A. No.  You can have two versions of an RPM installed at once because an "rpm
-Uvh" invocation was interrupted, or because a scriptlet failed.  For example,
try installing librsvg2 from the original RHEL 3 on an up-to-date RHEL 3 system:

  rpm -Uvh --oldpackage librsvg2-2.2.3-2.i386.rpm 

The %post scriptlet fails, and you wind up with two versions of librsvg2
installed.  librsvg2 doesn't happen to trigger the bug described in this bug
report; I'm simply illustrating how a user can get into the "two versions of the
same package installed at once" situation quite easily, without using rpm -i
--force.

Q. Isn't this bug unimportant because, in the reproduction recipe, only
documentation files went missing?

A. The reproduction recipe was only a simple example, intended to aid in
debugging the problem.  When we've seen this problem in the field, much more
important files have gone missing, including files needed to make "rpm" itself
continue functioning.

Comment 20 Paul Nasrat 2005-09-27 17:05:05 UTC

rpmfidebug

D: fini      040700  2 (   2,   2)      4096 /var/spool/at/spool skip
D: fini      100600  1 (   2,   2)         0 /var/spool/at/.SEQ skip
D: fini      040700  3 (   2,   2)      4096 /var/spool/at skip
D: fini      100644  1 (   0,   0)       399 /usr/share/man/man8/atrun.8.gz skip
D: fini      100644  1 (   0,   0)       887 /usr/share/man/man8/atd.8.gz skip
D: fini      100777  1 (   0,   0)       430 /usr/share/man/man5/at.deny.5.gz skip
D: fini      100644  1 (   0,   0)       430 /usr/share/man/man5/at.allow.5.gz skip
D: fini      120777  1 (   0,   0)         7 /usr/share/man/man1/batch.1.gz skip
D: fini      120777  1 (   0,   0)         7 /usr/share/man/man1/atrm.1.gz skip
D: fini      120777  1 (   0,   0)         7 /usr/share/man/man1/atq.1.gz skip
D: fini      100644  1 (   0,   0)      3023 /usr/share/man/man1/at.1.gz skip
D: fini      100644  1 (   0,   0)      2451 /usr/share/doc/at-3.1.8/timespec
D: fini      100644  1 (   0,   0)      1854 /usr/share/doc/at-3.1.8/README
D: fini      100644  1 (   0,   0)       387 /usr/share/doc/at-3.1.8/Problems
D: fini      100644  1 (   0,   0)       626 /usr/share/doc/at-3.1.8/Copyright
D: fini      100644  1 (   0,   0)      1925 /usr/share/doc/at-3.1.8/ChangeLog
D: fini      040755  2 (   0,   0)      4096 /usr/share/doc/at-3.1.8
D: fini      100755  1 (   0,   0)        67 /usr/sbin/atrun skip
D: fini      100755  1 (   0,   0)     22808 /usr/sbin/atd skip
D: fini      100755  1 (   0,   0)       975 /usr/bin/batch skip
D: fini      120777  1 (   0,   0)         2 /usr/bin/atrm skip
D: fini      120777  1 (   0,   0)         2 /usr/bin/atq skip
D: fini      104755  1 (   0,   0)     43740 /usr/bin/at skip
D: fini      100755  1 (   0,   0)      1176 /etc/rc.d/init.d/atd skip
D: fini      100600  1 (   0,   0)         1 /etc/at.deny skip

Comment 22 wdc 2005-09-27 19:38:14 UTC

Hi Paul,

I see the output you've appended to the case, but I don't understand its meaning.
What exactly does that output demonstrate?

-wdc

Comment 23 Paul Nasrat 2005-09-27 19:45:04 UTC

I'm adding notes to record results of the investigation here - it's mostly for
my reference.  I wouldn't worry about it - however fyi

The rpmfi (file information) action is not set to skip on the files that are
erased, I'm tracing back why this might be happening atm.
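
To make the point above easier to see in the debug output, the lines that
lack a trailing "skip" can be filtered out mechanically.  This is a rough
sketch only: the column reading (mode, link count, uid/gid, size, path,
optional action) is an assumption based on the lines in comment 20, and
unskipped_paths is a hypothetical helper name.

```python
import re

# Assumed layout: "D: fini <mode> <nlink> (<uid>, <gid>) <size> <path> [action]"
FINI_LINE = re.compile(
    r"^D: fini\s+(?P<mode>[0-7]{6})\s+\d+\s+\(\s*\d+,\s*\d+\)\s+\d+\s+"
    r"(?P<path>/\S+)(?:\s+(?P<action>\S+))?\s*$"
)


def unskipped_paths(debug_output):
    """Return paths from 'fini' debug lines whose action is not 'skip'."""
    paths = []
    for line in debug_output.splitlines():
        match = FINI_LINE.match(line)
        if match and match.group("action") != "skip":
            paths.append(match.group("path"))
    return paths


# Three lines copied from the debug output in this report.
sample = """\
D: fini      100644  1 (   0,   0)      3023 /usr/share/man/man1/at.1.gz skip
D: fini      100644  1 (   0,   0)      2451 /usr/share/doc/at-3.1.8/timespec
D: fini      040755  2 (   0,   0)      4096 /usr/share/doc/at-3.1.8
"""

for path in unskipped_paths(sample):
    print(path)
```

Run against the full output from comment 20, the unskipped entries are the
same /usr/share/doc/at-3.1.8 paths that "rpm -V at" reports as missing.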

Comment 24 wdc 2005-09-27 19:49:58 UTC

Thanks very much for the clarification.
I'm also really pleased to see that there may be some actual traction in 
identifying an actual problem.  Thanks very much for moving this forward!

-wdc

Comment 26 Paul Nasrat 2005-10-12 14:02:41 UTC

I've isolated the root cause of this bug and am assessing how to proceed.

Comment 27 wdc 2005-10-12 18:06:44 UTC

That's GREAT news!
Thanks!

Comment 34 wdc 2006-02-11 15:10:23 UTC

Why has the priority of this bug been downgraded?
Is a data destruction bug now considered "normal"?

Back in October, it looked like we were very close to a resolution on this bug.
What's happened?

Comment 42 RHEL Program Management 2006-06-16 15:56:42 UTC

Quality Engineering Management has reviewed and declined this request.  You may appeal this decision by reopening this request.

Comment 43 wdc 2006-06-16 16:37:40 UTC

So, in October 2005, it seemed like we understood what this problem was.
FOUR MONTHS later, I ask for a status report, and hear silence for ANOTHER four months.
Today, we just say, QEM has reviewed and declined the request.

Sorry, but you guys REALLY can do better than that!

You have a bug here where the way in which software is installed can potentially render a system
un-bootable if the race condition happens at the right time.

At one time the bug was on the CUSP of being understood.
You owe it to your users AT LEAST to say more than, "behind closed doors we're killing this
service request with no explanation."

PLEASE?

Could you at least clarify why, specifically, you've chosen to shut this bug off?

    Is the bug understood, and too difficult to repair?
    Did the person who understood the problem leave, and it's just too hard to track down?
    Have you decided that this will not affect ordinary users of RPM?  If so, what proof do you have
        that the condition is merely sleeping, not truly unique to the way MIT calls rpmlib?
    Have you decided this problem does not exist in Red Hat 4?  What proof do you have the problem
        is resolved there?
    
At a time when MIT is looking at other distributions than Red Hat because of a perceived inability of 
Red Hat to provide device drivers and bug fixes in a timely manner, this bug serves as an unfavorable
example of Red Hat's abilities.