Bug 89738

Summary: rpm -e causes (or reveals??) RPM database corruption under some circumstances
Product: [Retired] Red Hat Linux
Reporter: Barry K. Nathan <barryn>
Component: rpm
Assignee: Jeff Johnson <jbj>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike McLean <mikem>
Severity: high
Priority: medium
Version: 9
CC: mitr
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
URL: http://math.uci.edu/~bnathan/.vlr/
Doc Type: Bug Fix
Last Closed: 2005-02-07 23:40:10 UTC

Description Barry K. Nathan 2003-04-27 07:52:12 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4a) Gecko/20030313

Description of problem:
With RPM test-4.1.1 or test-4.2 (under RH 8.0 or 9), it is possible for the RPM
database to be in a state such that:
(a) rpmdb_verify finds no problems
(b) removal of a package using "rpm -e" appears to succeed
(c) immediately after (b), rpmdb_verify complains of corruption

The only remotely feasible way of reproducing this requires doing tremendously
stupid stuff as root, but there's still a bug here: either rpmdb_verify should
be screaming at (a), or rpm -e should not be producing corruption at (b).


To try to save you the trouble of reproducing the whole procedure described
below, I'm going to put some copies of /var/lib/rpm (from a Red Hat 9 system
running test-4.2) here:
http://math.uci.edu/~bnathan/.vlr/

vlr3 was produced using step 4 of this bug but is actually related to bug 89736,
not this bug; I'll describe it in a comment to that bug. vlr4 is after step 4
(i.e. after running a stupid command as root, but rpmdb_verify thinks there's no
corruption). vlr5 is after "rpm -e --noscripts kernel-utils" -- after that one
command, rpmdb_verify sees corruption, even though rpm -e showed no errors.



Version-Release number of selected component (if applicable):
rpm-4.2-1

How reproducible:
Sometimes

Steps to Reproduce:
1. Make a copy of /var/lib/rpm. You might need it. Especially if the bug fails
to reproduce. (It fails to reproduce more often than not, unfortunately.) rpm
--rebuilddb probably works as a reasonable substitute for a backup copy if
necessary, though.

2. Log in as root.

3. If you are running Red Hat 9, export LD_ASSUME_KERNEL=2.2.5 . I absolutely
cannot make this database corruption happen without that! (On Red Hat 8.0 the
corruption happens without this.)

4. Run the following bit of insanity. Kids, do not try this at home!

for z in `seq 1 2`; do rpm -e --noscripts --allmatches kernel-utils & rpm -ivh kernel-utils-2.4-8.29.i386.rpm & done

(If you have trouble reproducing the bug, try changing the 2 in `seq 1 2` to
something lower or higher. Needless to say, change the kernel-utils filename as
needed.)

5. On RH 9, you can unset LD_ASSUME_KERNEL now if you wish. In my testing it
seemed to make no difference in the outcome of the following steps.

6. Run rpm -q kernel-utils. If it's not installed, or if RPM segfaults, restore
your backup of /var/lib/rpm and repeat steps 3-4. You may need to do this 5 or 6
times. (BTW, it's possible that step 4 can reveal other bugs, but I currently
don't plan to report those because I'm assuming those are simply root's fault.
If my assumption is wrong, let me know.)

7. Run rpmdb_verify. If rpmdb_verify shows corruption, restore your /var/lib/rpm
backup and repeat steps 3-6. If rpmdb_verify thinks everything is OK, you may
want to make another copy of /var/lib/rpm now (to keep yourself from having to
repeat step 4 more times than absolutely necessary).

8. rpm -e --noscripts kernel-utils (if you want, try without --noscripts first
and try again with --noscripts if it fails)

9. Run rpmdb_verify again. If you now have corruption, you have reproduced the
bug. If not, optionally restore your original /var/lib/rpm backup, and go back
to step 3.
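
Steps 7-9 are easy to get wrong by hand after several retries, so here is a rough sketch of them as shell functions. It assumes rpmdb_verify exits non-zero (or at least prints something) when it finds corruption. The names RPMDB, VERIFY, ERASE, db_is_clean and check_erase are made up for this sketch and are not part of rpm.

```shell
#!/bin/sh
# Sketch of steps 7-9. RPMDB, VERIFY, ERASE, db_is_clean and check_erase are
# illustrative names, not part of rpm; override the variables as needed.
RPMDB=${RPMDB:-/var/lib/rpm}
VERIFY=${VERIFY:-/usr/lib/rpm/rpmdb_verify}
ERASE=${ERASE:-rpm -e --noscripts}

# Steps 7 and 9: the database counts as clean only if the verifier
# exits 0 and prints nothing.
db_is_clean() {
    out=$($VERIFY "$RPMDB"/[A-Z]* 2>&1) || return 1
    [ -z "$out" ]
}

# Prints one word describing the outcome for package $1.
check_erase() {
    if ! db_is_clean; then
        echo dirty-before        # step 7 failed: restore the backup, retry
        return 0
    fi
    if ! $ERASE "$1"; then       # step 8
        echo erase-failed
        return 0
    fi
    if db_is_clean; then         # step 9
        echo not-reproduced
    else
        echo reproduced
    fi
}
```

Running "check_erase kernel-utils" as root after step 6 should print "reproduced" when the bug has been hit. The sketch does not take the extra copy of /var/lib/rpm suggested in step 7; do that separately before calling it.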

Actual Results:  Step 4 can be very noisy. Here's some output from steps 6 and
later, from one run on Red Hat 9 with test-4.2:

[root@localhost root]# rpm -q kernel-utils
kernel-utils-2.4-8.29
[root@localhost root]# /usr/lib/rpm/rpmdb_verify /var/lib/rpm/[A-Z]*
[root@localhost root]# rpm -e --noscripts kernel-utils
[root@localhost root]# /usr/lib/rpm/rpmdb_verify /var/lib/rpm/[A-Z]*
db_verify: Page 5427: overflow page of invalid type 0
db_verify: DB->verify: /var/lib/rpm/Packages: DB_VERIFY_BAD: Database
verification failed
[root@localhost root]# 


Expected Results:  I expected either (a) some kind of complaint from the first
rpmdb_verify or (b) no corruption from rpm -e.

Comment 1 Jeff Johnson 2003-04-29 17:16:57 UTC
Just to confirm:

   This happens on Red Hat 9 with LD_ASSUME_KERNEL=2.2.5?

I believe I know what's happening if so. There's a window
between the database opened O_RDONLY and O_RDWR that intense
erase/install concurrent access can exercise.

Thanks muchly for the QA work!

Comment 2 Barry K. Nathan 2003-04-29 19:15:51 UTC
Yes, it (the corruption with simultaneous installs/erases) happens on Red Hat 9
with LD_ASSUME_KERNEL=2.2.5.


However, the more interesting aspect of this bug IMO is that you can have a
database that rpmdb_verify sees nothing wrong with -- and then, after rpm -e
(with or without LD_ASSUME_KERNEL), rpmdb_verify suddenly sees problems, even
though rpm -e showed no error messages. The easiest way to reproduce this is:

1. Move your existing copy of /var/lib/rpm elsewhere.
2. Make a new directory /var/lib/rpm.
3. Extract http://math.uci.edu/~bnathan/.vlr/vlr4.tar.bz2 (or .gz) into
/var/lib/rpm.
4. Run rpmdb_verify; no error messages.
5. rpm -e --justdb kernel-utils; no error messages.
6. Run rpmdb_verify again; error messages appear now.
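
Steps 1-3 above, plus putting the real database back afterwards, might look roughly like the sketch below. It assumes the .gz variant of the archive has already been downloaded and that it unpacks its files directly (no top-level directory); I have not checked the archive layout. The function names swap_in_test_db and restore_db and the RPMDB variable are made up for this sketch.

```shell
#!/bin/sh
# Sketch only: swap_in_test_db, restore_db and RPMDB are illustrative names.
RPMDB=${RPMDB:-/var/lib/rpm}

# Steps 1-3: move the live database aside, create an empty directory,
# and unpack the test database into it. Prints where the original went.
swap_in_test_db() {
    tarball=$1
    saved="$RPMDB.orig.$$"
    mv "$RPMDB" "$saved" || return 1
    mkdir "$RPMDB" || return 1
    tar -xzf "$tarball" -C "$RPMDB" || return 1
    echo "$saved"
}

# Undo: discard the test database and move the original back.
restore_db() {
    saved=$1
    [ -n "$RPMDB" ] && [ -d "$saved" ] || return 1
    rm -rf "$RPMDB" && mv "$saved" "$RPMDB"
}
```

Typical use would be saved=$(swap_in_test_db vlr4.tar.gz), then steps 4-6 by hand, then restore_db "$saved".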

Comment 3 Jeff Johnson 2003-04-29 20:20:18 UTC
Reproduced:

# /usr/lib/rpm/rpmdb_verify Packages
# rpm -e --justdb --noscripts --notriggers kernel-utils
# /usr/lib/rpm/rpmdb_verify Packages
db_verify: Page 5427: overflow page of invalid type 0
db_verify: DB->verify: Packages: DB_VERIFY_BAD: Database verification failed

And not cache related:

# rm __db*
rm: remove regular file `__db.001'? y
rm: remove regular file `__db.002'? y
rm: remove regular file `__db.003'? y
# /usr/lib/rpm/rpmdb_verify Packages
# rpm -e --justdb --noscripts --notriggers kernel-utils
# /usr/lib/rpm/rpmdb_verify Packages
db_verify: Page 5427: overflow page of invalid type 0
db_verify: DB->verify: Packages: DB_VERIFY_BAD: Database verification failed

Fix is pretty simple however:

# mv Packages Packages-ORIG
# /usr/lib/rpm/rpmdb_dump Packages-ORIG | /usr/lib/rpm/rpmdb_load Packages
# /usr/lib/rpm/rpmdb_verify Packages

I'll try to take a look at an strace, but I suspect that there's a
bug, probably because of the use of a hash to store headers; large
parts of the header are kept mostly in overflow pages.
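
For anyone who wants to script the dump-and-reload repair from this comment, a minimal sketch follows. It uses only the three commands already shown (rpmdb_dump, rpmdb_load, rpmdb_verify); the function name rebuild_db_file and the DUMP, LOAD and VERIFY variables are made up here. Note that in plain sh the pipeline status is that of the load step only, so a failing dump can go unnoticed.

```shell
#!/bin/sh
# Sketch only: rebuild_db_file, DUMP, LOAD and VERIFY are illustrative names.
DUMP=${DUMP:-/usr/lib/rpm/rpmdb_dump}
LOAD=${LOAD:-/usr/lib/rpm/rpmdb_load}
VERIFY=${VERIFY:-/usr/lib/rpm/rpmdb_verify}

# Rebuild one database file by dumping the old copy and loading it into
# a fresh file, then verify the result. The old copy is kept as FILE-ORIG.
rebuild_db_file() {
    file=$1
    mv "$file" "$file-ORIG" || return 1
    if $DUMP "$file-ORIG" | $LOAD "$file"; then
        $VERIFY "$file"
    else
        # Load failed: put the original file back where it was.
        rm -f "$file"
        mv "$file-ORIG" "$file"
        return 1
    fi
}
```

Run from /var/lib/rpm as root, "rebuild_db_file Packages" corresponds to the three commands above.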


Comment 4 Jeff Johnson 2003-04-29 20:35:28 UTC
Here's what I'm talking about:

# /usr/lib/rpm/rpmdb_stat -d Packages
61561   Hash magic number.
8       Hash version number.
Flags:
4096    Underlying database page size.
0       Specified fill factor.
679     Number of keys in the database.
1       Number of data items in the database.
3       Number of hash buckets.
1024    Number of bytes free on bucket pages (92% ff).
5427    Number of overflow pages.
1395134 Number of bytes free in overflow pages (94% ff).
1       Number of bucket overflow pages.
1004    Number of bytes free in bucket overflow pages (75% ff).
0       Number of duplicate pages.
0       Number of bytes free in duplicate pages (0% ff).
0       Number of pages on the free list.

Lots and lots of data on overflow pages. But, yes, there's a bug here too.
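
To put numbers on "lots and lots", the figures from the rpmdb_stat output above can be plugged into some shell arithmetic. The per-key average is only a rough estimate: it assumes the overflow pages hold nothing but header data, which the output does not actually say.

```shell
#!/bin/sh
# Values copied from the rpmdb_stat -d output above.
page_size=4096
overflow_pages=5427
overflow_free=1395134
keys=679

# Total and used bytes in overflow pages.
overflow_total=$((page_size * overflow_pages))
overflow_used=$((overflow_total - overflow_free))

# Fill factor, rounded to the nearest percent.
fill_pct=$(( (overflow_used * 100 + overflow_total / 2) / overflow_total ))

# Rough average of overflow bytes per key (integer division).
avg_per_key=$((overflow_used / keys))

echo "overflow bytes total: $overflow_total"
echo "overflow bytes used:  $overflow_used"
echo "fill factor:          ${fill_pct}%"
echo "avg bytes per key:    $avg_per_key"
```

That works out to about 22 MB of overflow pages, roughly 30 KB per key, and a fill factor that agrees with the 94% reported by rpmdb_stat.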


Comment 5 Jeff Johnson 2005-02-07 23:40:10 UTC
After 2+ years of thrashing this problem around,
it turns out that indeed, rpm as built since RHL 9
works *only* on NPTL systems because, well, that's
how rpm is built.

There's some hackery in RHEL packages to work around
the problem for those who *must* run with LD_ASSUME_KERNEL,
but there plain and simply ain't no reason to try to fix
this problem in FC since NPTL is in kernel-2.6.x

The precise reproducer is (and was) gratefully received.

You're also more than welcome on <rpm-devel.duke.edu>
even if you prefer lurking ;-)