From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4a) Gecko/20030313

Description of problem:
With RPM test-4.1.1 or test-4.2 (under RH 8.0 or 9), it is possible for the RPM database to be in a state such that:
(a) rpmdb_verify finds no problems
(b) removal of a package using "rpm -e" appears to succeed
(c) immediately after (b), rpmdb_verify complains of corruption

The only remotely feasible way of reproducing this requires doing tremendously stupid stuff as root, but there's still a bug here: either rpmdb_verify should be screaming at (a), or rpm -e should not be producing corruption at (b).

To try to save you the trouble of reproducing the whole procedure described below, I'm going to put some copies of /var/lib/rpm (from a Red Hat 9 system running test-4.2) here:
http://math.uci.edu/~bnathan/.vlr/

vlr3 was produced using step 4 of this bug but is actually related to bug 89736, not this bug; I'll describe it in a comment to that bug.
vlr4 is after step 4 (e.g. after running a stupid command as root, but rpmdb_verify thinks there's no corruption).
vlr5 is after "rpm -e --noscripts kernel-utils" -- after that one command, rpmdb_verify sees corruption, even though rpm -e showed no errors.

Version-Release number of selected component (if applicable):
rpm-4.2-1

How reproducible:
Sometimes

Steps to Reproduce:
1. Make a copy of /var/lib/rpm. You might need it. Especially if the bug fails to reproduce. (It fails to reproduce more often than not, unfortunately.) rpm --rebuilddb probably works as a reasonable substitute for a backup copy if necessary, though.
2. Log in as root.
3. If you are running Red Hat 9, export LD_ASSUME_KERNEL=2.2.5 . I absolutely cannot make this database corruption happen without that! (On Red Hat 8.0 the corruption happens without this.)
4. Run the following bit of insanity. Kids, do not try this at home!
for z in `seq 1 2`;do rpm -e --noscripts --allmatches kernel-utils & rpm -ivh kernel-utils-2.4-8.29.i386.rpm & done

(If you have trouble reproducing the bug, try changing the 2 in `seq 1 2` to something lower or higher. Needless to say, change the kernel-utils filename as needed.)
5. On RH 9, you can unset LD_ASSUME_KERNEL now if you wish. In my testing it seemed to make no difference in the outcome of the following steps.
6. Run rpm -q kernel-utils. If it's not installed, or if RPM segfaults, restore your backup of /var/lib/rpm and repeat steps 3-4. You may need to do this 5 or 6 times. (BTW, it's possible that step 4 can reveal other bugs, but I currently don't plan to report those because I'm assuming those are simply root's fault. If my assumption is wrong, let me know.)
7. Run rpmdb_verify. If rpmdb_verify shows corruption, restore your /var/lib/rpm backup and repeat steps 3-6. If rpmdb_verify thinks everything is OK, you may want to make another copy of /var/lib/rpm now (to keep yourself from having to repeat step 4 more times than absolutely necessary).
8. rpm -e --noscripts kernel-utils (if you want, try without --noscripts first and try again with --noscripts if it fails)
9. Run rpmdb_verify again. If you now have corruption, you have reproduced the bug. If not, optionally restore your original /var/lib/rpm backup, and go back to step 3.

Actual Results:
Step 4 can be very noisy.
Here's some output from steps 6 and later, from one run on Red Hat 9 with test-4.2:

[root@localhost root]# rpm -q kernel-utils
kernel-utils-2.4-8.29
[root@localhost root]# /usr/lib/rpm/rpmdb_verify /var/lib/rpm/[A-Z]*
[root@localhost root]# rpm -e --noscripts kernel-utils
[root@localhost root]# /usr/lib/rpm/rpmdb_verify /var/lib/rpm/[A-Z]*
db_verify: Page 5427: overflow page of invalid type 0
db_verify: DB->verify: /var/lib/rpm/Packages: DB_VERIFY_BAD: Database verification failed
[root@localhost root]#

Expected Results:
I expected either (a) some kind of complaint from the first rpmdb_verify or (b) no corruption from rpm -e.
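Steps 1, 6 and 7 all depend on being able to snapshot /var/lib/rpm and put it back between attempts. A minimal sketch of that, assuming GNU cp is available; the function names snapshot_rpmdb and restore_rpmdb are hypothetical and not part of rpm:

```shell
#!/bin/sh
# Sketch only: snapshot_rpmdb and restore_rpmdb are hypothetical helper
# names, not rpm commands. Run as root against a database you can afford
# to lose.

# snapshot_rpmdb DBDIR BACKUPDIR
# Copy everything in DBDIR (including any __db.* files) into BACKUPDIR.
snapshot_rpmdb() {
    mkdir -p "$2" && cp -a "$1/." "$2/"
}

# restore_rpmdb DBDIR BACKUPDIR
# Throw away the current DBDIR and replace it with the snapshot, so files
# created since the snapshot do not survive the restore.
restore_rpmdb() {
    [ -n "$1" ] && [ -d "$2" ] || return 1
    rm -rf "$1" && mkdir -p "$1" && cp -a "$2/." "$1/"
}
```

For example, snapshot_rpmdb /var/lib/rpm /root/rpmdb.bak before step 4, and restore_rpmdb /var/lib/rpm /root/rpmdb.bak whenever step 6 or 7 says to start over. The backup path is only an illustration.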
Just to confirm: This happens on Red Hat 9 with LD_ASSUME_KERNEL=2.2.5? I believe I know what's happening if so. There's a window between the database being opened O_RDONLY and O_RDWR that intense concurrent erase/install access can exercise. Thanks muchly for the QA work!
Yes, it (the corruption with simultaneous installs/erases) happens on Red Hat 9 with LD_ASSUME_KERNEL=2.2.5.

However, the more interesting aspect of this bug IMO is that you can have a database that rpmdb_verify sees nothing wrong with -- and then, after rpm -e (with or without LD_ASSUME_KERNEL), rpmdb_verify suddenly sees problems, even though rpm -e showed no error messages.

The easiest way to reproduce this is:
1. Move your existing copy of /var/lib/rpm elsewhere.
2. Make a new directory /var/lib/rpm.
3. Extract http://math.uci.edu/~bnathan/.vlr/vlr4.tar.bz2 (or .gz) into /var/lib/rpm.
4. Run rpmdb_verify; no error messages.
5. rpm -e --justdb kernel-utils; no error messages.
6. Run rpmdb_verify again; error messages appear now.
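The verify / erase / verify sequence in steps 4-6 can be wrapped in a small script. This is only a sketch: the function name check_erase_corrupts is hypothetical, and it points rpm at a scratch directory with --dbpath instead of swapping out /var/lib/rpm, which is a variation on the steps above rather than what was actually run. It assumes the vlr4 files have already been extracted into that directory.

```shell
#!/bin/sh
# Sketch only: check_erase_corrupts is a hypothetical helper. RPM and
# RPMDB_VERIFY can be overridden from the environment.
RPM=${RPM:-rpm}
RPMDB_VERIFY=${RPMDB_VERIFY:-/usr/lib/rpm/rpmdb_verify}

# check_erase_corrupts DBDIR PACKAGE
# Prints one of: already-corrupt, erase-failed, not-reproduced, reproduced.
check_erase_corrupts() {
    dbdir=$1
    pkg=$2
    # Step 4: the database must verify cleanly before the erase.
    if ! "$RPMDB_VERIFY" "$dbdir/Packages" >/dev/null 2>&1; then
        echo "already-corrupt"
        return 2
    fi
    # Step 5: erase from the database only; it must report success.
    if ! "$RPM" --dbpath "$dbdir" -e --justdb "$pkg" >/dev/null 2>&1; then
        echo "erase-failed"
        return 3
    fi
    # Step 6: the bug is reproduced if verification fails only now.
    if "$RPMDB_VERIFY" "$dbdir/Packages" >/dev/null 2>&1; then
        echo "not-reproduced"
        return 1
    fi
    echo "reproduced"
    return 0
}
```

Something like check_erase_corrupts /tmp/vlr4 kernel-utils would then print a single word for the outcome; /tmp/vlr4 is just an example location.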
Reproduced:

# /usr/lib/rpm/rpmdb_verify Packages
# rpm -e --justdb --noscripts --notriggers kernel-utils
# /usr/lib/rpm/rpmdb_verify Packages
db_verify: Page 5427: overflow page of invalid type 0
db_verify: DB->verify: Packages: DB_VERIFY_BAD: Database verification failed

And not cache related:

# rm __db*
rm: remove regular file `__db.001'? y
rm: remove regular file `__db.002'? y
rm: remove regular file `__db.003'? y
# /usr/lib/rpm/rpmdb_verify Packages
# rpm -e --justdb --noscripts --notriggers kernel-utils
# /usr/lib/rpm/rpmdb_verify Packages
db_verify: Page 5427: overflow page of invalid type 0
db_verify: DB->verify: Packages: DB_VERIFY_BAD: Database verification failed

Fix is pretty simple however:

# mv Packages Packages-ORIG
# /usr/lib/rpm/rpmdb_dump Packages-ORIG | /usr/lib/rpm/rpmdb_load Packages
# /usr/lib/rpm/rpmdb_verify Packages

I'll try to take a look at an strace, but I suspect that there's a bug, probably because of the use of a hash to store headers; large parts of the header are kept mostly in overflow pages.
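The dump/load fix above can be made a little more defensive by verifying the rebuilt file and putting the original back if anything goes wrong. A rough sketch under that assumption; rebuild_packages is a hypothetical name, and the rollback is an addition that was not part of the commands actually run:

```shell
#!/bin/sh
# Sketch only: rebuild_packages is a hypothetical helper wrapping the
# mv / rpmdb_dump | rpmdb_load / rpmdb_verify sequence. Tool paths can be
# overridden from the environment.
RPMDB_DUMP=${RPMDB_DUMP:-/usr/lib/rpm/rpmdb_dump}
RPMDB_LOAD=${RPMDB_LOAD:-/usr/lib/rpm/rpmdb_load}
RPMDB_VERIFY=${RPMDB_VERIFY:-/usr/lib/rpm/rpmdb_verify}

# rebuild_packages DBDIR
rebuild_packages() {
    dbdir=$1
    orig="$dbdir/Packages-ORIG"
    # Never clobber an earlier saved copy.
    if [ -e "$orig" ]; then
        echo "refusing to overwrite $orig" >&2
        return 1
    fi
    mv "$dbdir/Packages" "$orig" || return 1
    # The pipeline's status is rpmdb_load's; a failed dump is caught only
    # indirectly, by the verify that follows.
    if "$RPMDB_DUMP" "$orig" | "$RPMDB_LOAD" "$dbdir/Packages" &&
       "$RPMDB_VERIFY" "$dbdir/Packages"; then
        return 0
    fi
    # Roll back: discard the partial rebuild and restore the original.
    rm -f "$dbdir/Packages"
    mv "$orig" "$dbdir/Packages"
    return 1
}
```

Packages-ORIG is left in place on success, so the pre-rebuild file is still around for comparison.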
Here's what I'm talking about:

# /usr/lib/rpm/rpmdb_stat -d Packages
61561    Hash magic number.
8        Hash version number.
Flags:
4096     Underlying database page size.
0        Specified fill factor.
679      Number of keys in the database.
1        Number of data items in the database.
3        Number of hash buckets.
1024     Number of bytes free on bucket pages (92% ff).
5427     Number of overflow pages.
1395134  Number of bytes free in overflow pages (94% ff).
1        Number of bucket overflow pages.
1004     Number of bytes free in bucket overflow pages (75% ff).
0        Number of duplicate pages.
0        Number of bytes free in duplicate pages (0% ff).
0        Number of pages on the free list.

Lots and lots of data on overflow pages. But, yes, there's a bug here too.
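As a sanity check on those numbers, the "ff" percentages are consistent with (pages x page size - free bytes) / (pages x page size). That formula is an assumption inferred from the figures above, not taken from the Berkeley DB source, and treating the 3 hash buckets as 3 bucket pages is likewise a guess. fill_pct is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch only: fill_pct is a hypothetical helper, and the formula is an
# assumption that happens to match the rpmdb_stat figures above.

# fill_pct PAGES PAGESIZE FREE_BYTES
# Prints the percentage of bytes in use, rounded to the nearest integer.
fill_pct() {
    total=$(( $1 * $2 ))
    used=$(( total - $3 ))
    echo $(( (used * 100 + total / 2) / total ))
}

fill_pct 3 4096 1024        # bucket pages          -> 92
fill_pct 5427 4096 1395134  # overflow pages        -> 94
fill_pct 1 4096 1004        # bucket overflow pages -> 75
```

The overflow pages account for 5427 x 4096 = 22228992 bytes, against 3 x 4096 = 12288 bytes of bucket pages, which is the "lots and lots of data on overflow pages" point in numbers.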
After 2+ years of thrashing this problem around, it turns out that indeed, rpm as built since RHL 9 works *only* on NPTL systems because, well, that's how rpm is built.

There's some hackery in RHEL packages to work around the problem for those who *must* run with LD_ASSUME_KERNEL, but there plain and simply ain't no reason to try to fix this problem in FC, since NPTL is in kernel-2.6.x.

The precise reproducer is (and was) gratefully received. You're also more than welcome on <rpm-devel.duke.edu> even if you prefer lurking ;-)