Bug 98248
Summary: | rpm -V segment faults even after repeated DB rebuilds | ||
---|---|---|---|
Product: | [Retired] Red Hat Linux | Reporter: | Robert M. Riches Jr. <rm.riches> |
Component: | rpm | Assignee: | Jeff Johnson <jbj> |
Status: | CLOSED NOTABUG | QA Contact: | Mike McLean <mikem> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 8.0 | ||
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2003-10-10 14:32:26 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Attachments: |
Description
Robert M. Riches Jr.
2003-06-29 01:20:24 UTC
Created attachment 92667 [details]
original script
(I tried to send a compressed tar file of /var/lib/rpm,
but it exceeded a size limit and was rejected.)
Created attachment 92668 [details]
gzipped core file
Created attachment 92669 [details]
output of gdb's where command
Created attachment 92670 [details]
first script output, list of packages
Created attachment 92671 [details]
second output of script, verification
Created attachment 92672 [details]
new script with workaround that is 80% effective
Can you attach a pointer (i.e. URL, attachments won't work) to your database?

I'd be happy to send the database. However, I'm on a dynamic IP address and don't have access to a static-IP web server. I'm behind a NAT device, and configuring it for external access would likely be error prone. I've never set up Apache and would prefer not to risk making a security mistake in a rush. Is there a way I could FTP it to you? Thanks.

Try mail to jbj. No uploads here, sorry.

I sent it, and it appears to have made it, as far as I can see. It's gzipped, but even then the uuencoded message is 33MB. It takes a while for SMTP to slog through that much stuff, even on DSL. Please let me know if more help from me is needed. Thanks.

OK, the rpmdb arrived; all signatures verify that the contents are the same as when signed. db_verify Packages shows that the database is intact. So this isn't a database problem.

Sanity check: if using LDAP passwords, then you need to run nscd to avoid a segfault that affects statically linked binaries like rpm:

    /sbin/service nscd restart

You're running some mixture of Red Hat 8 and 9. If you are going to use kernel/glibc from Red Hat 9, then you probably should upgrade to rpm-4.2-1; otherwise, if Red Hat 8 kernel/glibc, then rpm-4.1.1-1, both available from ftp://ftp.rpm.org/pub/rpm/dist/rpm-4.x.y

So are you still segfaulting?

To the best of my knowledge, I'm not using LDAP passwords, so I guess I'm probably not using them. To the best of my knowledge, I'm using _strictly_ RHL 8.0 plus ALSA from alsa.org sources. I installed RHL 8.0 and have used up2date every week or two to keep it updated. I'm using kernel-2.4.20-18.8 and glibc-2.3.2-4.80.6, both apparently from up2date, as they're both sitting in /var/spool/up2date. Why do you say I'm using a mix of 8 and 9?

Because you have kernel-2.4.20-x.y installed. (For all I know that may be the latest errata for RHL8, but 2.4.20 was first in RHL9.) RHL8 is fine; you might want to upgrade to rpm-4.1.1 from ftp://ftp.rpm.org/pub/rpm/dist/rpm-4.1.x after this segfault gets cleared up.

Hmmm, this segfault was with up2date? That's what the core says. Can you produce a simpler test, like verifying a single package? Is it all modes, install only, query only, or verify only? If I can narrow the scope some, I can probably ...

Kernel 2.4.20-18.8 was the second 2.4.20 kernel released via RHN for RHL 8.0. It appears the final "8" means it's for RHL 8.0, while a final "9" would mean a kernel intended for RHL 9.

I'm not sure how to generate a smaller test case. When I run the script, the same 10 packages see segmentation faults each time. Of those 10, the same 8 always pass on the second try (after a sleep of about 10 seconds), and the same 2 always fail repeatedly. However, in previous testing any manual runs outside the script work fine with _no_ segment faults (either one at a time or all 10 in one command). I rewrote the script in Perl and get the same results as the C-shell script. I suspect it's a very subtle memory corruption or race condition. Would more core files be of any use to you? I can send up to 18 of them, 10 from the first attempt of the 10 problematic packages and 4 each for retries on the 2 hard cases. Any suggestions for making a smaller test case? Oh, it's only on verify that I have seen the segmentation fault issue. I have seen rpm hang while doing installs/upgrades (as have other people, too).

Hangs are fixed in rpm-4.1.1. Core files are of little use without symbols.
If you could eyeball the stack backtraces to see whether there's a common traceback, or several different problems, that's probably all that's of use. Hmmm, intermittent non-reproducible problems don't smell like software, which usually fails reproducibly. Are you prelinking shared libraries by chance? What file systems are involved? Can you try to reproduce with other kernels?

Created attachment 92757 [details]
tarball of many tracebacks, three runs, two kernels
I have attached stack backtraces from a bunch of core files, all neatly packaged in a tarball. Sorry if I sent more than you wanted; hopefully it is easy to disregard the excess ones. The files are named traceback.run{1,2,3}.core.*, with run1 and run2 using kernel 2.4.20-18.8 and run3 using kernel 2.4.18-27.8.0. I see a whole lot of similarity in the tracebacks. The 'file /usr/lib/rpm/rpmv' command says it is not stripped. Being as the binary I have came from a binary RPM, with luck you (or your developers) will have access to that executable with symbols and could find the line(s) of code that are hitting the problem. I hope this will help find and fix the apparent bug.

Actually, the seg faults are reproducible. It just takes a little setup to create the right environment to reproduce them. While it is a little rare to find software bugs that are difficult to reproduce, it does happen that some software problems require strange environmental conditions to tickle the bug. I'll spare you the war stories about such cases I saw in the 17 years I worked in processor design at Intel Corp.--unless you want to hear the war stories. :-)

One hypothesis is that something is being accessed beyond the valid end of the stack. Perhaps function A is calling function B. Function B has some local variables allocated on the stack. Function B returns or stores a pointer to one of those local variables. After function B returns, function A dereferences the pointer and reads something from the stack. In the absence of interrupts, page faults, other asynchronous events, or any intervening calls that overwrite the area of memory that was function B's stack frame, function A would read what function B had left there. However, if there were any kind of interrupt, such as a page fault, I/O handling, or such during the intervening time, function A could get something different from what function B had intended. If this corrupted datum read by function A were used as an array index or a pointer, then we could easily see a segmentation violation. [This pattern is illustrated in the C sketch following attachment 92777 below.] Other possible causes of semi-deterministic behavior are any heuristics used by malloc() or heuristics used by a garbage collector. I have a war story on the latter one.

Am I prelinking shared libraries? I don't know of anything I am knowingly doing that would be classified as doing that. As far as I'm aware, I'm using the 'rpm' command straight out of the box. File systems: /, /boot, /tmp are ext2 (don't yet trust journaling) over IDE. I have checked system log files and don't see anything indicating RAM or disk errors. Other file systems are served over NFS from another RHL 8.0 box: /home, /usr/local, and /var/spool/mail. Network errors and such are extremely low--nothing noteworthy.

Sure, that's lots of ways to screw up subtly. However, I have run "LD_ASSUME_KERNEL=2.4.0 valgrind -v rpm -Va" and do not see anything unusual. OTOH, that doesn't mean valgrind catches all problems. OK, let's try to narrow this down some. Are you running these verifies concurrently? Are you running as root? If not, please do so; root <-> root locking is in place, nonroot cannot create locks yet.

Thanks for working on this. Yes, this is subtle. Of my three systems, only one shows the symptoms: "one" was updated via up2date and has not seen the symptom; "two" was updated via autorpm and saw the symptom until I rebuilt the DB twice; "three" was updated via autorpm and sees the symptom even after several DB rebuilds. "three" was created as a clone of "two" and has been updated in lock step.
Both "two" and "three" get /usr/local over NFS, "two" has 100mb ethernet, while "three" has 10mb ethernet. I'm guessing network speed affects timing which makes the difference. Answers to your questions: I'm not running the 'rpm -V $pkg' concurrently. The first attachment, 92667, is the original script. A new script, which I will attach, gets the same symptoms. This latter script is what produced the core files from which the backtraces came. In the original script, the key is the following: foreach pkg (`cat $qfile`) echo 'vvvvvvvv '$pkg' vvvvvvvv' >> $vfile csh -c "rpm -V $pkg" >>& $vfile echo '^^^^^^^^ '$pkg' ^^^^^^^^' >> $vfile end I am running as root. I would have permission denials if I ran 'rpm -V' as non-root. How else can I help get this resolved? Created attachment 92777 [details]
new script gets same symptoms
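
For illustration only: a minimal, hypothetical C sketch of the use-after-return pattern hypothesized a few comments above (function B handing function A a pointer into B's now-dead stack frame). This is not code from rpm, and the function names are invented.

```c
#include <stdio.h>

/* function_b returns the address of a local variable; that storage is
 * invalid the moment function_b returns (undefined behavior). */
static int *function_b(void)
{
    int local = 42;      /* lives in function_b's stack frame */
    return &local;       /* dangling pointer once we return */
}

static void function_a(void)
{
    int *p = function_b();

    /* If nothing has scribbled on that stack slot yet, this may still
     * read 42.  After a signal, page fault, or any intervening call,
     * it can read garbage instead; garbage later used as an array
     * index or pointer is an easy way to get a segmentation fault. */
    printf("value read from dead stack frame: %d\n", *p);
}

int main(void)
{
    function_a();
    return 0;
}
```

Whether such a read crashes depends on what happens to occupy the old frame, which lines up with the semi-deterministic behavior described above: the same packages failing inside the script but not when run by hand.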
Hmmm, adding "--qf '%{name}-%{version}-%{release}'" to rpm -qa will make your perl script immune to Start time: Mon Jul 7 14:31:33 EDT 2003 sh: -c: line 1: syntax error near unexpected token `(' sh: -c: line 1: `rpm -V 4Suite(0:0.11.1-11).i386' ERROR: unknown status from 'rpm -V': 512 Also, your program is pig slow (because of exec's) on an already slow --verify. If all you want is customizable pre/post per-pkg markers, this is like 15 lines of C added to -Va. 'Tain't hard, might even be useful. I've run your perl script with rpm-4.2.1-0.11 and (duh) don't see segfaults, so I doubt I'm gonna be able to reproduce locally. I'll be happy to try and pin down wassup, but (be forewarned) the last time I chased a problem like this it turned out to be bugg NFS. You might try latest valgrind-1.9.5 to convince yourself that this isn't an rpm problem. Here's what I see when running valgrind (one minor rpm problem, several documented performance tweaks in berkeley db): ==32054== ERROR SUMMARY: 1075 errors from 3 contexts (suppressed: 6 from 2) ==32054== ==32054== 2 errors in context 1 of 3: ==32054== Invalid read of size 4 ==32054== at 0x4027F00D: (within /usr/lib/librpm-4.2.so) ==32054== by 0x4027FE88: rpmalAllFileSatisfiesDepend (in /usr/lib/librpm-4.2.so) ==32054== by 0x40280304: rpmalAllSatisfiesDepend (in /usr/lib/librpm-4.2.so) ==32054== by 0x4028034F: rpmalSatisfiesDepend (in /usr/lib/librpm-4.2.so) ==32054== Address 0x419281B0 is 4 bytes after a block of size 16 alloc'd ==32054== at 0x40162788: malloc (vg_clientfuncs.c:100) ==32054== by 0x40162C78: realloc (vg_clientfuncs.c:265) ==32054== by 0x4027F589: rpmalAdd (in /usr/lib/librpm-4.2.so) ==32054== by 0x40267151: rpmtsAddInstallElement (in /usr/lib/librpm-4.2.so) ==32054== ==32054== 74 errors in context 2 of 3: ==32054== Syscall param pwrite(buf) contains uninitialised or unaddressable byte(s) ==32054== at 0x4052B894: __GI___pwrite64 (in /lib/libc-2.3.2.so) ==32054== by 0x4041EF41: __pwrite64 (vg_libpthread.c:2502) ==32054== by 0x4035E538: __os_io_rpmdb (in /usr/lib/librpmdb-4.2.so) ==32054== by 0x403568FF: (within /usr/lib/librpmdb-4.2.so) ==32054== Address 0x41CB2F2E is 1261322 bytes inside a block of size 1318912 alloc'd ==32054== at 0x40162788: malloc (vg_clientfuncs.c:100) ==32054== by 0x4035CEF7: __os_malloc_rpmdb (in /usr/lib/librpmdb-4.2.so) ==32054== by 0x4035E21A: __os_r_attach_rpmdb (in /usr/lib/librpmdb-4.2.so) ==32054== by 0x40330176: __db_r_attach_rpmdb (in /usr/lib/librpmdb-4.2.so) ==32054== ==32054== 999 errors in context 3 of 3: ==32054== Conditional jump or move depends on uninitialised value(s) ==32054== at 0x40358709: __memp_fopen_int_rpmdb (in /usr/lib/librpmdb-4.2.so) ==32054== by 0x40358080: (within /usr/lib/librpmdb-4.2.so) ==32054== by 0x40306ECF: __db_dbenv_setup_rpmdb (in /usr/lib/librpmdb-4.2.so) ==32054== by 0x40318E2F: __db_dbopen_rpmdb (in /usr/lib/librpmdb-4.2.so) --32054-- --32054-- supp: 2 _dl_relocate_object*/dl_open_worker/_dl_catch_error*(Cond) --32054-- supp: 4 __pthread_mutex_unlock/_IO_funlockfile ==32054== ==32054== IN SUMMARY: 1075 errors from 3 contexts (suppressed: 6 from 2) ==32054== ==32054== malloc/free: in use at exit: 159 bytes in 12 blocks. ==32054== malloc/free: 1024677 allocs, 1024665 frees, 1950588649 bytes allocated. ==32054== --32054-- TT/TC: 0 tc sectors discarded. --32054-- 13349 chainings, 0 unchainings. --32054-- translate: new 16157 (257093 -> 3240959; ratio 126:10) --32054-- discard 112 (1421 -> 18060; ratio 127:10). 
--32054-- dispatch: 7916650000 jumps (bb entries), of which 437328244 (5%) were unchained. --32054-- 158494/7422051 major/minor sched events. 1073287 tt_fast misses. --32054-- reg-alloc: 2816 t-req-spill, 601672+23605 orig+spill uis, 82803 total-reg-r. --32054-- sanity: 158380 cheap, 6336 expensive checks. --32054-- ccalls: 58421 C calls, 55% saves+restores avoided (190638 bytes) --32054-- 80294 args, avg 0.88 setup instrs each (18720 bytes) --32054-- 0% clear the stack (175263 bytes) --32054-- 23166 retvals, 31% of reg-reg movs avoided (13938 bytes) Thanks for the format hint. I'll keep it in mind in case I ever see the parentheses from 'rpm -qa'. The two things I get from the script are the per-package markers and deterministic order of packages. That way, I can compare a run from last week with this week to make sure nothing was corrupted in between. I can also compare pre-update vs. post-update to see whether the update created any bad things. If a future version of RPM would do those things faster than the script, that would be great! What kind of bug in NFS would cause this RPM to give a segmentation fault? Being as the NFS involved is between two RHL 8.0 machines, I guess such a finding would have to result in another bug report against NFS. Thanks. So far so good as far as not having any more segmentation faults. However, I just had another hang when upgrading six packages (openssh*, wu-ftp). Jeff Johnson said hang are fixed in 4.1.1. So, when will this fix get production release for RHL 8.0? The reason I'm not jumping in to switch to a non-production release is I got chewed out by Mike Harris a year or three ago for using a non-production version of XFree86 he had suggested I use. They're back... :-( This time, updates were to the kdebase, openssh, and sendmail groups. Now, the problem shows up on a different machine. The problematic machine hung while doing rpm -Uvh. I killed it, removed the lock files, and did rpm --rebuilddb six times. Then, I checked for any of the old packages still around, the ones that should have been removed when the newer versions were installed. I found all the old ones were still listed as present. I did rpm -e on them. It hung once, so I killed it, did rpm --rebuilddb five times. Doing rpm -e on the remaining older ones succeeded. Now, rpm -V gets segment faults and dumps core. Do you want any information? At this point, I plan to do another series of several rpm --rebuilddb and see what happens. Again, this is _not_ on the same machine as had the problem last time. This is on a different machine, one that was fine (after getting rid of the hangs) last time. Someone mentioned a version of RPM that solves the hangs. Is there any chance this will ever be released to production for 8.0? Thanks. Hmmm... After 7 more repetitions of rpm --rebuilddb, the segmentation faults are now gone again. So, when is this new-and-improved rpm version going to be released to production for RHL 8.0? Being as I can't duplicate the problem, even if there was interest in trying to isolate the root cause, please go ahead and mark it back to notabug, closed. NOTABUG For the benefit of anyone who later looks at this report, both the hangs and segment faults continue in RHL 9. (It appears Jeff Johnson won't be swayed by facts, so it isn't worth changing the report state.) On my first machine, one that never saw a hang or segment fault (at least from rpm) with RHL 8.0, there were some segment faults after doing the big up2date session after installing RHL 9. 
After setting LD_ASSUME_KERNEL to 2.4.19 (thanks to Shadowman for that workaround), rpm --rebuilddb five times solved them.

On my second machine, using rpm (via the autorpm perl script) to update about 178 packages, rpm hung hard after about 165. This left somewhere around 100 of the old packages still listed as installed. Doing 'rpm -e' on the stale packages (fortunately, not glibc or anything else vital) and then 'rpm -i --replacepackage' (sp?) on the affected new packages got the system to a good state. Of course, there were 'rpm --rebuilddb' sessions along the way.

On my third machine, using rpm to manually update about 10 packages at a time after a clean RHL 9 install resulted in segment faults when doing 'rpm -V'. Four iterations of 'rpm --rebuilddb' solved the segment faults.

Three different machines, and they all show the segment faults. I'd say that means they _do_ exist, even in RHL 9's version of RPM. Just in case anyone cares to look into this issue, I'm saving core files as I get them.

Just for the record: I'm not saying that the segfaults don't exist. And I have no clue if or when later versions of rpm will be available for RHL 8.0. Note that RHL 8.0 support by Red Hat ended 1/1/04.

I'm aware 8.0 has expired. That's why I installed RHL 9. Is there anything that can be done to fix the hangs and/or segment faults in RHL 9's RPM? Shadowman said to set LD_ASSUME_KERNEL to 2.4.19 for doing a --rebuilddb. Is it likely this would help installation and/or verification?

Just to set the record straight on the accusation that this was a hardware problem, I have not seen a _SINGLE_ segmentation fault or hang from rpm on _ANY_ of the Mandrake/Mandriva releases I have used since Red Hat 9 expired. This is on the exact same _THREE_ machines that all had the same type of segmentation faults and such on Red Hat 8.0 and 9.
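
As a footnote for later readers: below is a minimal, hypothetical C sketch of the kind of wrapper discussed above, combining per-package markers, a deterministic package order, and the --qf format suggested by Jeff Johnson. It is not the attached csh or Perl script, and not the "15 lines of C added to -Va"; it simply shells out to rpm once per package, so it is just as slow as the scripts.

```c
/* Hypothetical wrapper: verify each installed package separately,
 * bracketing its output with markers so successive runs can be diffed. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* Plain name-version-release (no epoch decoration), sorted so the
     * ordering is deterministic and comparable from run to run. */
    FILE *list = popen("rpm -qa --qf '%{name}-%{version}-%{release}\\n' | sort", "r");
    if (list == NULL) {
        perror("popen");
        return 1;
    }

    char pkg[256];
    while (fgets(pkg, sizeof pkg, list) != NULL) {
        pkg[strcspn(pkg, "\n")] = '\0';      /* strip trailing newline */

        printf("vvvvvvvv %s vvvvvvvv\n", pkg);
        fflush(stdout);                      /* keep markers ordered around rpm's own output */

        char cmd[512];
        snprintf(cmd, sizeof cmd, "rpm -V %s", pkg);
        int status = system(cmd);            /* nonzero: differences reported, or rpm crashed */
        if (status != 0)
            printf("rpm -V wait status: %d\n", status);

        printf("^^^^^^^^ %s ^^^^^^^^\n", pkg);
    }

    pclose(list);
    return 0;
}
```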