Bug 142831
| Summary: | Removal of old package versions inappropriately erases files | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 3 | Reporter: | Greg Hudson <ghudson> |
| Component: | rpm | Assignee: | Daniel Riek <riek> |
| Status: | CLOSED WONTFIX | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.0 | CC: | amb, djuran, jamisonm, katzj, k.georgiou, laroche, nobody+pnasrat, tao, wdc |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | i686 | | |
| OS: | Linux | | |
| Whiteboard: | RHEL3U7NAK | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2006-06-16 15:56:42 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 170417 | | |
Description
Greg Hudson 2004-12-14 16:02:13 UTC
"Dramatically compromising system stability" seems a bit extreme if/when the files are in /usr/share/doc imho.

Gah. That was just a random example. The real instances we've been encountering involve systems losing large chunks of rpm itself, because of an aborted update which included an upgrade of the rpm rpms. (Large swaths of other system files also turn up missing, but the missing rpm files are the biggest deal, because rpm stops working as a result.) But that would have been a much more confusing demonstration.

We have on the order of 800 systems that do **NOT** use RHN, but instead use our own scripts that drive RPM. In the next month we plan to push out RHEL3 Update 4 to these systems. It would be REALLY helpful if we got some help resolving this issue. It would not look good for Red Hat if MIT professors came in to work with dead systems, and all we could tell them was that "Red Hat didn't think our bug was very important."

Out of curiosity, why aren't you using RHN/up2date? Can you attach the scripts you use? If you are seeing this frequently I imagine that there is a more robust way of scripting it - are you using the rpm-python bindings or just calling out to rpm? The only time that I can really see you getting two packages installed in parallel (other than with biarch setups) is with prematurely terminated transactions.

Our update system predates the existence of Linux, and is multi-platform. We currently update both Solaris and Linux. Historically we've run our update system on Vax Ultrix, IBM AOS, IBM AIX, SGI IRIX, SunOS, etc. One of these days the industry will catch up to us. Every so often Red Hat and I, or Sun and I, restart conversations about technology transfer, but then something distracts them away.
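The "prematurely terminated transaction" scenario is the crux: an aborted `rpm -Uvh` can leave both the old and new versions of a package recorded as installed. A minimal Python sketch of how a driving script might detect that state, assuming a hypothetical snapshot of name-version-release strings standing in for `rpm -qa` output (this does not use the real rpm API):

```python
from collections import Counter

# Hypothetical post-abort snapshot: the new librsvg2 was laid down,
# but the old one was never erased, so both versions are recorded.
installed = [
    "librsvg2-2.2.3-2",
    "librsvg2-2.4.0-1",
    "at-3.1.8-42",
]

def name_of(nvr):
    # Drop the trailing version and release fields of name-version-release.
    return nvr.rsplit("-", 2)[0]

counts = Counter(name_of(p) for p in installed)
duplicates = sorted(name for name, n in counts.items() if n > 1)
print(duplicates)
```

A real check would query the rpmdb (and would also have to allow legitimate biarch duplicates), but the point stands: duplicates arise from interruption alone, with no `rpm -i --force` involved.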
Here's a web page we publish for our customers describing some of the user-visible differences: http://web.mit.edu/ist/topics/linux/choosing.html

Here's an ancient web page that does a crappy job of explaining some of the insides of our install/update system: http://web.mit.edu/teamhtml/Athena/FY97/install/main/main.html

We use a C program which calls out to rpmlib. In the cases we've seen reported, users have seen the rpmlib-using process hang in a futex() call (suggesting some kind of Berkeley DB issue). We have had two reports of that situation so far; since we have no way of reproducing it, and it happens infrequently, I have no current expectation of asking Red Hat to fix it. It could even be our bug, although RPM's history of BDB hangs and the ratio of our code to RPM code suggests otherwise. http://web.mit.edu/source/athena/etc/rpmupdate contains our code.

A less baroque scenario would be that the user interrupts the update by hand, perhaps in one of the places where rpmlib provides no feedback to the application that it is making progress within the transaction. At any rate, the issue here is that the consequences of such an interrupt are unnecessarily dire. RPM is not supposed to be so brittle that it eats itself or other parts of the system upon resuming an interrupted update.

I don't think it's appropriate to change the status of this bug to NEEDINFO, since I have provided a very clear reproduction recipe for the part of the problem which needs to be fixed.

Thanks for the additional information; the NEEDINFO was merely set whilst trying to gather more information about your environment as a whole. Thanks for the test case.

I am disappointed that there's been no forward progress on this issue in nearly two months. We can point to 11 systems here that required a re-install because the RPM database got corrupted, and we see another 6 systems hung in updates that may be caused by this problem.
We prefer to have NO systems requiring re-install or manual intervention for updates. Can we PLEASE get some resourcing behind this issue?

MIT - Thanks for using Bugzilla, and we hope you will continue to do so when reporting bugs. Keep in mind, however, that Bugzilla is simply a mechanism to report bugs, and has no defined SLA. It is not a support mechanism, as is outlined on the front page of bugzilla.redhat.com. If you are expecting help with problems during a defined time period, it is more appropriate to contact Red Hat Tech Support. If you do not have a tech support contract, and want to inquire about options, you may contact me. My name is Shawn Hunter and I am the MIT account manager at Red Hat. MIT will be deploying a satellite server very soon, which could help with these problems. The administrator of that server has access to 24x7 tech support for it with defined SLAs. Obviously, we will not be able to help you with Athena as it is not a product we distribute, but I can point you to the appropriate contacts at MIT. Please feel free to contact me directly with questions. Thanks!

-Shawn Hunter, shunter, 919-754-3744

So, on 16 March, I opened an escalation with Red Hat to ask that this issue be addressed. On 19 April, that escalation was closed with, "Closing this ticket as the process will be handled from engineering and Bugzilla." But alas, I see no status change. I have heard nothing from anyone. I have moved from being disappointed to being extremely disappointed. I appreciate that resources are finite, and that there are many different approaches to prioritization. But being told, "Follow this process and we will take your issue seriously," and then having the issue dropped for another two months is really not good. Let's please at least look closely enough at the problem to understand the scope of a repair and to make a credible assessment of the likelihood that it will begin hurting more sites than just MIT!
-wdc

I've been told that this bug is not receiving due attention because of various misunderstandings on Red Hat's part. So, here is a little FAQ:

Q. Isn't this bug unimportant because it's very unusual to have two versions of an RPM installed at once?

A. No. You can have two versions of an RPM installed at once because an "rpm -Uvh" invocation was interrupted, or because a scriptlet failed. For example, try installing librsvg2 from the original RHEL 3 on an up-to-date RHEL 3 system:

    rpm -Uvh --oldpackage librsvg2-2.2.3-2.i386.rpm

The %post scriptlet fails, and you wind up with two versions of librsvg2 installed. librsvg2 doesn't happen to trigger the bug described in this bug report; I'm simply illustrating how a user can get into the "two versions of the same package installed at once" situation quite easily, without using rpm -i --force.

Q. Isn't this bug unimportant because, in the reproduction recipe, only documentation files went missing?

A. The reproduction recipe was only a simple example, intended to aid in debugging the problem. When we've seen this problem in the field, much more important files have gone missing, including files needed to make "rpm" itself continue functioning.
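The invariant this bug violates can be sketched in a few lines: when erasing the old version of a package, a file must only be unlinked if no surviving installed package still owns that path. A minimal Python sketch of that refcount-style check, using a toy dictionary model of the rpmdb (hypothetical structure, not rpmlib):

```python
# Toy model: each installed package owns a set of absolute file paths.
# Both versions of "at" own the same paths, as after an aborted upgrade.
old = {"name": "at", "evr": "3.1.8-42",
       "files": {"/usr/bin/at", "/usr/share/doc/at-3.1.8/README"}}
new = {"name": "at", "evr": "3.1.8-60",
       "files": {"/usr/bin/at", "/usr/share/doc/at-3.1.8/README"}}
installed = [old, new]

def erase(pkg, db):
    """Remove pkg from db; unlink only files no surviving package owns."""
    remaining = [p for p in db if p is not pkg]
    removed, skipped = [], []
    for path in sorted(pkg["files"]):
        if any(path in p["files"] for p in remaining):
            skipped.append(path)   # still owned by the newer install: skip
        else:
            removed.append(path)   # truly orphaned: safe to unlink
    return removed, skipped

removed, skipped = erase(old, installed)
# With identical file lists, nothing should be unlinked at all.
```

The reported behavior corresponds to the `skipped` branch not firing for some shared files, so removing the old version deletes paths the new version still needs.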
rpmfi debug:

```
D: fini 040700 2 ( 2, 2)  4096 /var/spool/at/spool skip
D: fini 100600 1 ( 2, 2)     0 /var/spool/at/.SEQ skip
D: fini 040700 3 ( 2, 2)  4096 /var/spool/at skip
D: fini 100644 1 ( 0, 0)   399 /usr/share/man/man8/atrun.8.gz skip
D: fini 100644 1 ( 0, 0)   887 /usr/share/man/man8/atd.8.gz skip
D: fini 100777 1 ( 0, 0)   430 /usr/share/man/man5/at.deny.5.gz skip
D: fini 100644 1 ( 0, 0)   430 /usr/share/man/man5/at.allow.5.gz skip
D: fini 120777 1 ( 0, 0)     7 /usr/share/man/man1/batch.1.gz skip
D: fini 120777 1 ( 0, 0)     7 /usr/share/man/man1/atrm.1.gz skip
D: fini 120777 1 ( 0, 0)     7 /usr/share/man/man1/atq.1.gz skip
D: fini 100644 1 ( 0, 0)  3023 /usr/share/man/man1/at.1.gz skip
D: fini 100644 1 ( 0, 0)  2451 /usr/share/doc/at-3.1.8/timespec
D: fini 100644 1 ( 0, 0)  1854 /usr/share/doc/at-3.1.8/README
D: fini 100644 1 ( 0, 0)   387 /usr/share/doc/at-3.1.8/Problems
D: fini 100644 1 ( 0, 0)   626 /usr/share/doc/at-3.1.8/Copyright
D: fini 100644 1 ( 0, 0)  1925 /usr/share/doc/at-3.1.8/ChangeLog
D: fini 040755 2 ( 0, 0)  4096 /usr/share/doc/at-3.1.8
D: fini 100755 1 ( 0, 0)    67 /usr/sbin/atrun skip
D: fini 100755 1 ( 0, 0) 22808 /usr/sbin/atd skip
D: fini 100755 1 ( 0, 0)   975 /usr/bin/batch skip
D: fini 120777 1 ( 0, 0)     2 /usr/bin/atrm skip
D: fini 120777 1 ( 0, 0)     2 /usr/bin/atq skip
D: fini 104755 1 ( 0, 0) 43740 /usr/bin/at skip
D: fini 100755 1 ( 0, 0)  1176 /etc/rc.d/init.d/atd skip
D: fini 100600 1 ( 0, 0)     1 /etc/at.deny skip
```

Hi Paul, I see the output you've appended to the case, but I don't understand its meaning. What exactly does that output demonstrate?

-wdc

I'm adding notes to record results of the investigation here - it's mostly for my reference. I wouldn't worry about it - however, fyi: the rpmfi (file information) action is not set to skip on the files that are erased. I'm tracing back why this might be happening atm.

Thanks very much for the clarification. I'm also really pleased to see that there may be some actual traction in identifying an actual problem.
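To make the trace above easier to read: entries with a trailing `skip` are the ones the erase pass leaves alone; entries without it (here, everything under /usr/share/doc/at-3.1.8) are the files actually unlinked. A small Python sketch that extracts the doomed paths, assuming the line format shown (path as the last field, optionally followed by `skip`):

```python
# A few representative lines from the rpmfi debug trace (hypothetical sample).
trace = """\
D: fini 100644 1 ( 0, 0) 399 /usr/share/man/man8/atrun.8.gz skip
D: fini 100644 1 ( 0, 0) 2451 /usr/share/doc/at-3.1.8/timespec
D: fini 100644 1 ( 0, 0) 1854 /usr/share/doc/at-3.1.8/README
D: fini 040755 2 ( 0, 0) 4096 /usr/share/doc/at-3.1.8
D: fini 100755 1 ( 0, 0) 67 /usr/sbin/atrun skip
"""

def erased_paths(text):
    """Return paths of entries not marked 'skip' (i.e., actually unlinked)."""
    out = []
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[-1] != "skip":
            out.append(fields[-1])  # path is the last field when not skipped
    return out

print(erased_paths(trace))
```

This matches Paul's note: the bug is that the rpmfi action fails to get set to `skip` on files the newer package version still owns, so they show up in the unlinked set.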
Thanks very much for moving this forward!

-wdc

I've isolated the root cause of this bug and am assessing how to proceed.

That's GREAT news! Thanks!

Why has the priority of this bug been downgraded? Is a data destruction bug now considered "normal"? Back in October, it looked like we were very close to a resolution on this bug. What's happened?

Quality Engineering Management has reviewed and declined this request. You may appeal this decision by reopening this request.

So, in October 2005, it seemed like we understood what this problem was. FOUR MONTHS later, I ask for a status report, and hear silence for ANOTHER four months. Today, we just say, "QEM has reviewed and declined the request." Sorry, but you guys REALLY can do better than that! You have a bug here where the way in which software is installed can potentially render a system un-bootable if the race condition happens at the right time. At one time the bug was on the CUSP of being understood. You owe it to your users AT LEAST to say more than, "behind closed doors we're killing this service request with no explanation." PLEASE? Could you at least clarify why, specifically, you've chosen to shut this bug off? Is the bug understood, and too difficult to repair? Did the person who understood the problem leave, and it's just too hard to track down? Have you decided that this will not affect ordinary users of RPM? If so, what proof do you have that the condition is merely sleeping, not truly unique to the way MIT calls rpmlib? Have you decided this problem does not exist in Red Hat 4? What proof do you have that the problem is resolved there?

At a time when MIT is looking at other distributions than Red Hat because of a perceived inability of Red Hat to provide device drivers and bug fixes in a timely manner, this bug serves as an unfavorable example of Red Hat's abilities.