Bug 98248

Summary: rpm -V segment faults even after repeated DB rebuilds
Product: [Retired] Red Hat Linux
Reporter: Robert M. Riches Jr. <rm.riches>
Component: rpm
Assignee: Jeff Johnson <jbj>
Status: CLOSED NOTABUG
QA Contact: Mike McLean <mikem>
Severity: medium
Priority: medium
Version: 8.0
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Last Closed: 2003-10-10 14:32:26 UTC
Attachments:
    original script
    gzipped core file
    output of gdb's where command
    first script output, list of packages
    second output of script, verification
    new script with workaround that is 80% effective
    tarball of many tracebacks, three runs, two kernels
    new script gets same symptoms

Description Robert M. Riches Jr. 2003-06-29 01:20:24 UTC
Description of problem:
Installing the XFree86 and other updates that became available
between June 21 and June 28 apparently corrupted the RPM
database on two of my three systems.  One system was updated
via up2date and had no problems.  The other two were updated
by an "rpm -Uvh ..." command (from autorpm).  The classic
long-time hang of the rpm command happened, so I killed (-9)
it, removed the lock files, did "rpm --rebuilddb", and then used
rpm to remove the old packages that should have been
superseded.  These two systems had subsequent segmentation
faults when doing "rpm -V" on each package.  On one system,
doing 'rpm --rebuilddb' twice gave successful verification of
all packages.  On the other system, rebuilding several times
has not solved the problem.

The strange thing is when I run a shell script to loop through
the packages and verify each, the same 10 packages segment fault
every time.  However, if I manually do "rpm -V" on those 10
packages, there is no segment fault.  This tells me there may
be something dirty left behind in lock files or elsewhere in
the DB by the repeated runs in the loop.

Version-Release number of selected component (if applicable):
4.1-1.06

How reproducible:
100% on the one system.

Steps to Reproduce:
1. In a shell script loop, run "rpm -V" on each package.
    
Actual results:
10 packages will produce segmentation faults.  However, they
will verify successfully if done manually.

Expected results:
Verification should not segment fault.

Additional info:
I'll be attaching a tar file of my RPM DB, my script, a core
file, the output of gdb's 'where' command on the core file,
and the output files from my script.

Comment 1 Robert M. Riches Jr. 2003-06-29 01:55:38 UTC
Created attachment 92667 [details]
original script

(I tried to send a compressed tar file of /var/lib/rpm,
but it exceeded a size limit and was rejected.)

Comment 2 Robert M. Riches Jr. 2003-06-29 01:57:10 UTC
Created attachment 92668 [details]
gzipped core file

Comment 3 Robert M. Riches Jr. 2003-06-29 01:57:55 UTC
Created attachment 92669 [details]
output of gdb's where command

Comment 4 Robert M. Riches Jr. 2003-06-29 01:58:34 UTC
Created attachment 92670 [details]
first script output, list of packages

Comment 5 Robert M. Riches Jr. 2003-06-29 01:59:27 UTC
Created attachment 92671 [details]
second output of script, verification

Comment 6 Robert M. Riches Jr. 2003-06-29 02:00:40 UTC
Created attachment 92672 [details]
new script with workaround that is 80% effective

Comment 7 Jeff Johnson 2003-07-03 15:24:53 UTC
Can you attach a pointer (i.e. URL, attachments won't work)
to your database?

Comment 8 Robert M. Riches Jr. 2003-07-03 15:35:59 UTC
I'd be happy to send the database.  However, I'm on dynamic IP
address and don't have access to a static IP web server.  I'm
behind a NAT device, and configuring it for external access
would likely be error prone.  I've never set up Apache and
would prefer to not risk making a security mistake in a rush.
Is there a way I could FTP it to you?  Thanks.


Comment 9 Jeff Johnson 2003-07-03 17:27:05 UTC
Try mail to jbj. No uploads here, sorry.

Comment 10 Robert M. Riches Jr. 2003-07-03 18:21:17 UTC
I sent it, and it appears to have made it,
as far as I can see.  It's gzipped, but
even then the uuencoded message is 33MB.
It takes a while for SMTP to slog through
that much stuff, even on DSL.  Please
inform if more help from me is needed.
Thanks.


Comment 11 Jeff Johnson 2003-07-03 18:55:36 UTC
OK, rpmdb arrived, all signatures verify that contents
are same as when signed.

db_verify Packages shows that the database is intact.

So this isn't a database problem.

Sanity check: If using LDAP passwords, then you need
to run nscd to avoid a segfault that affects statically linked
binaries like rpm:
    /sbin/service nscd restart
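
A quick way to check whether LDAP is in the lookup path at all is to
grep nsswitch.conf (a sketch, assuming the stock RHL layout):

    # If none of these lines mention "ldap", the nscd workaround above
    # should not be needed.
    grep -E '^(passwd|shadow|group):' /etc/nsswitch.conf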

You're running some mixture of Red Hat 8 and 9. If you
are going to use kernel/glibc from Red Hat 9, then you
probably should upgrade to rpm-4.2-1; otherwise, if Red Hat
8 kernel/glibc, then rpm-4.1.1-1, both available from
    ftp://ftp.rpm.org/pub/rpm/dist/rpm-4.x.y

So, are you still segfaulting?

Comment 12 Robert M. Riches Jr. 2003-07-03 19:10:24 UTC
To the best of my knowledge, I'm not using LDAP passwords,
so that probably isn't the issue.

To the best of my knowledge, I'm using _strictly_ RHL 8.0
plus ALSA from alsa.org sources.  I installed RHL 8.0 and
have used up2date every week or two to keep it updated.
I'm using kernel-2.4.20-18.8 and glibc-2.3.2-4.80.6, both
apparently from up2date, as they're both sitting in
/var/spool/up2date.

Why do you say I'm using a mix of 8 and 9?


Comment 13 Jeff Johnson 2003-07-03 19:21:47 UTC
Because you have kernel-2.4.20-x.y installed. (For all I know
that may be the latest errata for RHL8, but 2.4.20 first shipped in
RHL9.)

RHL8 is fine, you might want to upgrade to rpm-4.1.1 from
    ftp://ftp.rpm.org/pub/rpm/dist/rpm-4.1.x
after this segfault gets cleared up.

Hmmm, this segfault was with up2date? That's what the core says.

Can you produce a simpler test, like verifying a single package?

Is it all modes, install only, query only, or verify only?

If I can narrow the scope some, I can probably

Comment 14 Robert M. Riches Jr. 2003-07-03 19:46:43 UTC
Kernel 2.4.20-18.8 was the second 2.4.20 kernel
released via RHN for RHL 8.0.  It appears the
final "8" means it's for RHL 8.0, while a final
"9" would mean a kernel intended for RHL 9.

I'm not sure how to generate a smaller test case.
When I run the script, the same 10 packages see
segmentation faults each time.  Of those 10, the
same 8 always pass on the second try (after a
sleep of about 10 seconds), and the same 2 always
fail repeatedly.  However, in previous testing
any manual runs outside the script work fine with
_no_ segment faults (either one at a time or all
10 in one command).  I rewrote the script in Perl
and get the same results as the C-shell script.

I suspect it's a very subtle memory corruption or
race condition.  Would more core files be of any
use to you?  I can send up to 18 of them, 10 from
the first attempt of the 10 problematic packages
and 4 each for retries on the 2 hard cases.

Any suggestions to making a smaller test case?
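
For what it's worth, a reduced test might be as simple as looping
"rpm -V" on one of the failing packages (a sketch in sh; the package
name is a stand-in for one of the ten that fail inside the script):

    #!/bin/sh
    # Sketch: verify one known-bad package repeatedly and record the
    # exit status of each attempt.
    pkg=some-failing-package      # hypothetical name
    for i in 1 2 3 4 5 6 7 8 9 10; do
        rpm -V "$pkg"
        echo "iteration $i: exit status $?"
    done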

Oh, it's only on verify that I have seen the
segmentation fault issue.  I have seen rpm hang
while doing installs/upgrades (as have other
people, too).


Comment 15 Jeff Johnson 2003-07-03 19:53:56 UTC
Hangs are fixed in rpm-4.1.1.

Core files are of little use without symbols.  If you
could eyeball the stack backtraces to see whether there's
a common traceback, or several different problems, that's probably all
that's of use.

Hmmm, intermittent non-reproducible problems don't smell
like software, which usually fails reproducibly.

Are you prelinking shared libraries by chance?

What files systems are involved?

Can you try to reproduce with other kernels?

Comment 16 Robert M. Riches Jr. 2003-07-04 18:52:14 UTC
Created attachment 92757 [details]
tarball of many tracebacks, three runs, two kernels

Comment 17 Robert M. Riches Jr. 2003-07-04 18:56:05 UTC
I have attached stack backtraces from a bunch of core files,
all neatly packaged in a tarball.  Sorry if I sent more than
you wanted; hopefully it is easy to disregard the excess
ones.  The files are named traceback.run{1,2,3}.core.* with
run1 and run2 using kernel 2.4.20-18.8 and run3 using kernel
2.4.18-27.8.0.

I see a whole lot of similarity in the tracebacks.  The
'file /usr/lib/rpm/rpmv' command says it is not stripped.
Being as the binary I have came from a binary RPM, with
luck you (or your developers) will have access to that
executable with symbols and could find the line(s) of code
that are hitting the problem.  I hope this will help find
and fix the apparent bug.

Actually, the seg faults are reproducible.  It just takes a
little setup to create the right environment to reproduce
them.  While it is a little rare to find software bugs that
are difficult to reproduce, it does happen that there are some
software problems that require strange environmental issues
to tickle the bug.  I'll spare you the war stories about
such cases I saw in the 17 years I worked in processor
design at Intel Corp.--unless you want to hear the war
stories.  :-)

One hypothesis is something is being accessed beyond the
valid end of the stack.  Perhaps function A is calling
function B.  Function B has some local variables allocated
on the stack.  Function B returns or stores a pointer to one
of those local variables.  After function B returns,
function A dereferences the pointer and reads something from
the stack.  In the absence of interrupts, page faults, other
asynchronous events, or any intervening calls that overwrite
the area of memory that was function B's stack frame,
function A would read what function B had left there.
However, if there were any kind of interrupt, such as a page
fault, I/O handling, or such during the intervening time,
function A could get something different from what function
B had intended.  If this corrupted datum read by function A
were used as an array index or a pointer, then we could
easily see a segmentation violation.

Other possible causes of semi-deterministic behavior are any
heuristics used by malloc() or heuristics used by a garbage
collector.  I have a war story on the latter one.

Am I prelinking shared libraries?  I don't know of anything
I am knowingly doing that would be classified as doing that.
As far as I'm aware, I'm using the 'rpm' command straight
out of the box.

File systems: /, /boot, /tmp are ext2 (don't yet trust
journaling) over IDE.  I have checked system log files and
don't see anything indicating RAM or disk errors.  Other
file systems served over NFS from another RHL 8.0 box:
/home, /usr/local, and /var/spool/mail.  Network errors and
such are extremely low--nothing noteworthy.


Comment 18 Jeff Johnson 2003-07-07 16:08:13 UTC
Sure, those are lots of ways to screw up subtly.

However, I have run "LD_ASSUME_KERNEL=2.4.0 valgrind -v rpm -Va" and
do not see anything unusual. OTOH, that doesn't mean valgrind catches all problems.

OK, let's try to narrow this down some.

Are you running these verifies concurrently?

Are you running as root? If not, please do so; root <-> root locking is
in place, nonroot cannot create locks yet.

Comment 19 Robert M. Riches Jr. 2003-07-07 16:38:46 UTC
Thanks for working on this.  Yes, this is subtle.  Of my three systems,
only one shows the symptoms: "one" was updated via up2date and has not
seen the symptom; "two" was updated via autorpm and saw the symptom until
I rebuilt the DB twice; "three" was updated via autorpm and sees the
symptom even after several DB rebuilds.  "three" was created as a clone
of "two" and has been updated in lock step.  Both "two" and "three" get
/usr/local over NFS; "two" has 100 Mb Ethernet, while "three" has 10 Mb
Ethernet.  I'm guessing network speed affects timing, which makes the
difference.

Answers to your questions: I'm not running the 'rpm -V $pkg' concurrently.
The first attachment, 92667, is the original script.  A new script,
which I will attach, gets the same symptoms.  This latter script is what
produced the core files from which the backtraces came.  In the original
script, the key is the following:

foreach pkg (`cat $qfile`)
  echo 'vvvvvvvv '$pkg' vvvvvvvv' >> $vfile
  csh -c "rpm -V $pkg" >>& $vfile
  echo '^^^^^^^^ '$pkg' ^^^^^^^^' >> $vfile
end

I am running as root.  I would have permission denials if I ran
'rpm -V' as non-root.

How else can I help get this resolved?



Comment 20 Robert M. Riches Jr. 2003-07-07 16:41:44 UTC
Created attachment 92777 [details]
new script gets same symptoms

Comment 21 Jeff Johnson 2003-07-07 19:02:04 UTC
Hmmm, adding "--qf '%{name}-%{version}-%{release}'" to rpm -qa
will make your perl script immune to errors like these:
Start time: Mon Jul  7 14:31:33 EDT 2003
sh: -c: line 1: syntax error near unexpected token `('
sh: -c: line 1: `rpm -V 4Suite(0:0.11.1-11).i386'
ERROR: unknown status from 'rpm -V': 512
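
In sh, the loop with that query format would look roughly like this (a
sketch; the file names are placeholders, and the names emitted by this
--qf contain no parentheses, so no shell quoting problems arise):

    # Sketch: build the package list with an explicit query format,
    # then verify each package with per-package markers.
    qfile=/tmp/rpm-qa.list
    vfile=/tmp/rpm-verify.out
    rpm -qa --qf '%{name}-%{version}-%{release}\n' | sort > "$qfile"
    while read pkg; do
        echo "vvvvvvvv $pkg vvvvvvvv" >> "$vfile"
        rpm -V "$pkg" >> "$vfile" 2>&1
        echo "^^^^^^^^ $pkg ^^^^^^^^" >> "$vfile"
    done < "$qfile"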


Also, your program is pig slow (because of exec's) on an already
slow --verify. If all you want is customizable pre/post per-pkg
markers, this is like 15 lines of C added to -Va. 'Tain't hard, might
even be useful. 

I've run your perl script with rpm-4.2.1-0.11 and (duh) don't
see segfaults, so I doubt I'm gonna be able to reproduce locally.

I'll be happy to try and pin down wassup, but (be forewarned) the
last time I chased a problem like this it turned out to be buggy NFS.

You might try latest valgrind-1.9.5 to convince yourself that this
isn't an rpm problem.

Here's what I see when running valgrind (one minor rpm problem, several
documented performance tweaks in Berkeley DB):

==32054== ERROR SUMMARY: 1075 errors from 3 contexts (suppressed: 6 from 2)
==32054== 
==32054== 2 errors in context 1 of 3:
==32054== Invalid read of size 4
==32054==    at 0x4027F00D: (within /usr/lib/librpm-4.2.so)
==32054==    by 0x4027FE88: rpmalAllFileSatisfiesDepend (in /usr/lib/librpm-4.2.so)
==32054==    by 0x40280304: rpmalAllSatisfiesDepend (in /usr/lib/librpm-4.2.so)
==32054==    by 0x4028034F: rpmalSatisfiesDepend (in /usr/lib/librpm-4.2.so)
==32054==    Address 0x419281B0 is 4 bytes after a block of size 16 alloc'd
==32054==    at 0x40162788: malloc (vg_clientfuncs.c:100)
==32054==    by 0x40162C78: realloc (vg_clientfuncs.c:265)
==32054==    by 0x4027F589: rpmalAdd (in /usr/lib/librpm-4.2.so)
==32054==    by 0x40267151: rpmtsAddInstallElement (in /usr/lib/librpm-4.2.so)
==32054== 
==32054== 74 errors in context 2 of 3:
==32054== Syscall param pwrite(buf) contains uninitialised or unaddressable byte(s)
==32054==    at 0x4052B894: __GI___pwrite64 (in /lib/libc-2.3.2.so)
==32054==    by 0x4041EF41: __pwrite64 (vg_libpthread.c:2502)
==32054==    by 0x4035E538: __os_io_rpmdb (in /usr/lib/librpmdb-4.2.so)
==32054==    by 0x403568FF: (within /usr/lib/librpmdb-4.2.so)
==32054==    Address 0x41CB2F2E is 1261322 bytes inside a block of size 1318912 alloc'd
==32054==    at 0x40162788: malloc (vg_clientfuncs.c:100)
==32054==    by 0x4035CEF7: __os_malloc_rpmdb (in /usr/lib/librpmdb-4.2.so)
==32054==    by 0x4035E21A: __os_r_attach_rpmdb (in /usr/lib/librpmdb-4.2.so)
==32054==    by 0x40330176: __db_r_attach_rpmdb (in /usr/lib/librpmdb-4.2.so)
==32054== 
==32054== 999 errors in context 3 of 3:
==32054== Conditional jump or move depends on uninitialised value(s)
==32054==    at 0x40358709: __memp_fopen_int_rpmdb (in /usr/lib/librpmdb-4.2.so)
==32054==    by 0x40358080: (within /usr/lib/librpmdb-4.2.so)
==32054==    by 0x40306ECF: __db_dbenv_setup_rpmdb (in /usr/lib/librpmdb-4.2.so)
==32054==    by 0x40318E2F: __db_dbopen_rpmdb (in /usr/lib/librpmdb-4.2.so)
--32054-- 
--32054-- supp:    2 _dl_relocate_object*/dl_open_worker/_dl_catch_error*(Cond)
--32054-- supp:    4 __pthread_mutex_unlock/_IO_funlockfile
==32054== 
==32054== IN SUMMARY: 1075 errors from 3 contexts (suppressed: 6 from 2)
==32054== 
==32054== malloc/free: in use at exit: 159 bytes in 12 blocks.
==32054== malloc/free: 1024677 allocs, 1024665 frees, 1950588649 bytes allocated.
==32054== 
--32054--     TT/TC: 0 tc sectors discarded.
--32054--            13349 chainings, 0 unchainings.
--32054-- translate: new     16157 (257093 -> 3240959; ratio 126:10)
--32054--            discard 112 (1421 -> 18060; ratio 127:10).
--32054--  dispatch: 7916650000 jumps (bb entries), of which 437328244 (5%) were unchained.
--32054--            158494/7422051 major/minor sched events.  1073287 tt_fast misses.
--32054-- reg-alloc: 2816 t-req-spill, 601672+23605 orig+spill uis, 82803 total-reg-r.
--32054--    sanity: 158380 cheap, 6336 expensive checks.
--32054--    ccalls: 58421 C calls, 55% saves+restores avoided (190638 bytes)
--32054--            80294 args, avg 0.88 setup instrs each (18720 bytes)
--32054--            0% clear the stack (175263 bytes)
--32054--            23166 retvals, 31% of reg-reg movs avoided (13938 bytes)


Comment 22 Robert M. Riches Jr. 2003-07-07 19:20:14 UTC
Thanks for the format hint.  I'll keep it in mind in case I
ever see the parentheses from 'rpm -qa'.

The two things I get from the script are the per-package
markers and deterministic order of packages.  That way, I can
compare a run from last week with this week to make sure nothing
was corrupted in between.  I can also compare pre-update vs.
post-update to see whether the update created any bad things.  If
a future version of RPM would do those things faster than the
script, that would be great!
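
With deterministic order and the per-package markers, comparing two runs
is just a diff of the two output files (a sketch; the file names are
hypothetical):

    # Sketch: compare last week's verification output with this week's.
    diff verify.2003-06-30.out verify.2003-07-07.out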

What kind of bug in NFS would cause this RPM to give a segmentation
fault?  Being as the NFS involved is between two RHL 8.0 machines,
I guess such a finding would have to result in another bug report
against NFS.

Thanks.


Comment 23 Robert M. Riches Jr. 2003-08-02 22:17:36 UTC
So far so good as far as not having any more segmentation
faults.  However, I just had another hang when upgrading
six packages (openssh*, wu-ftp).

Jeff Johnson said hangs are fixed in 4.1.1.  So, when will
this fix get a production release for RHL 8.0?

The reason I'm not jumping in to switch to a non-production
release is I got chewed out by Mike Harris a year or three
ago for using a non-production version of XFree86 he had
suggested I use.


Comment 24 Robert M. Riches Jr. 2003-09-20 22:28:30 UTC
They're back...  :-(

This time, updates were to the kdebase, openssh, and sendmail groups.
Now, the problem shows up on a different machine.  The problematic
machine hung while doing rpm -Uvh.  I killed it, removed the lock files,
and did rpm --rebuilddb six times.  Then, I checked for any of the old
packages still around, the ones that should have been removed when the
newer versions were installed.  I found all the old ones were still
listed as present.  I did rpm -e on them.  It hung once, so I killed it,
did rpm --rebuilddb five times.  Doing rpm -e on the remaining older
ones succeeded.  Now, rpm -V gets segment faults and dumps core.
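
For reference, the recovery sequence described above amounts to roughly
the following (a sketch; it assumes the stale BerkeleyDB environment
files are /var/lib/rpm/__db.00*, and the PID and package names are
placeholders):

    # Sketch of the recovery steps after a hung rpm.
    kill -9 12345                      # PID of the hung rpm (placeholder)
    rm -f /var/lib/rpm/__db.00*        # remove stale lock/environment files
    rpm --rebuilddb                    # repeated several times in practice
    rpm -qa | sort                     # look for superseded old versions
    rpm -e old-package-1.0-1           # remove each stale old version (placeholder)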

Do you want any information?

At this point, I plan to do another series of several rpm --rebuilddb
and see what happens.

Again, this is _not_ on the same machine as had the problem last time.
This is on a different machine, one that was fine (after getting rid
of the hangs) last time.

Someone mentioned a version of RPM that solves the hangs.  Is there any
chance this will ever be released to production for 8.0?

Thanks.

Comment 25 Robert M. Riches Jr. 2003-09-20 23:41:15 UTC
Hmmm...

After 7 more repetitions of rpm --rebuilddb, the segmentation faults
are now gone again.

So, when is this new-and-improved rpm version going to be released
to production for RHL 8.0?

Being as I can't duplicate the problem, even if there were interest in
trying to isolate the root cause, please go ahead and mark it back to
NOTABUG, closed.


Comment 26 Jeff Johnson 2003-10-10 14:32:26 UTC
NOTABUG

Comment 27 Robert M. Riches Jr. 2004-02-11 20:07:18 UTC
For the benefit of anyone who later looks at this report,
both the hangs and segment faults continue in RHL 9.  (It
appears Jeff Johnson won't be swayed by facts, so it isn't
worth changing the report state.)

On my first machine, one that never saw a hang or segment
fault (at least from rpm) with RHL 8.0, there were some
segment faults after doing the big up2date session after
installing RHL 9.  After setting LD_ASSUME_KERNEL to 2.4.19
(thanks to Shadowman for that workaround), rpm --rebuilddb
five times solved them.
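
In sh syntax, that workaround is roughly the following (a sketch; as
documented at the time, LD_ASSUME_KERNEL=2.4.19 makes glibc fall back
from NPTL to the older LinuxThreads on RHL 9):

    # Sketch: rebuild the database with the old threading behavior,
    # then re-run the verification.
    LD_ASSUME_KERNEL=2.4.19 rpm --rebuilddb
    rpm -Va > verify.out 2>&1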

On my second machine, using rpm (via the autorpm perl
script) to update about 178 packages, rpm hung hard after
about 165.  This left somewhere around 100 of the old
packages still listed as installed.  Doing 'rpm -e' on
the stale packages (fortunately, not glibc or anything else
vital) and then 'rpm -i --replacepackage' (sp?) on the
affected new packages got the system to a good state.  Of
course, there were 'rpm --rebuilddb' sessions along the way.

On my third machine, using rpm to manually update about 10
packages at a time after a clean RHL 9 install resulted in
segment faults when doing 'rpm -V'.  Four iterations of
'rpm --rebuilddb' solved the segment faults.

Three different machines, and they all show the segment
faults.  I'd say that means they _do_ exist, even in RHL 9's
version of RPM.  Just in case anyone cares to look into this
issue, I'm saving core files as I get them.


Comment 28 Jeff Johnson 2004-02-11 20:14:06 UTC
Just for the record:

    I'm not saying that the segfaults don't exist.

And I have no clue if or when later versions of rpm
will be available for RHL 8.0. Note that RHL 8.0 support
by Red Hat ended 1/1/04.

Comment 29 Robert M. Riches Jr. 2004-02-11 20:45:16 UTC
I'm aware 8.0 has expired.  That's why I installed
RHL 9.  Is there anything that can be done to fix
the hangs and/or segment faults in RHL 9's RPM?

Shadowman said to set LD_ASSUME_KERNEL to 2.4.19
for doing a --rebuilddb.  Is it likely this would
help installation and/or verification?


Comment 30 Robert M. Riches Jr. 2009-01-10 20:13:09 UTC
Just to set the record straight on the accusation that this was a hardware problem, I have not seen a _SINGLE_ segmentation fault or hang from rpm on _ANY_ of the Mandrake/Mandriva releases I have used since Red Hat 9 expired.  This is on the exact same _THREE_ machines that all had the same type of segmentation faults and such on Red Hat 8.0 and 9.