The following has been reported by IBM LTC: LTC Bug#3765 - RPM problems running under LinuxThreads

Hardware Environment: Dual processor (550MHz), PIII, SCSI hard drives, 1GB RAM.
Software Environment: RHEL 3.0 Beta 1, WS edition

RPM behavior on RHEL 3.0 is bad under LinuxThreads mode. Easy reproduction can be shown using rpmbuild.

Steps to Reproduce:
1. Make a directory rpm_bug.
2. Create a dummy RPM spec named DummyRpm-1.0.spec, such as:

Name: DummyRpm
Version: 1.0
Release: 0
Summary: Simple empty RPM testcase, Doesn't have any files.
License: IBM 2002
Group: System Environment/Daemons
AutoReqProv: Yes
%description
%files

3. Create a script:

#!/bin/sh
rc=0
while [ "$rc" = "0" ]
do
  rpmbuild --define '_rpmdir ./' --define '_rpmfilename 1315054.tmp' -bb $PWD/DummyRpm-1.0.spec
  rc=$?
done

4. export LD_ASSUME_KERNEL=2.4.0
5. Run the script.

Actual Results: Script/rpmbuild crashes after some random number of runs.
Expected Results: Should run until user-interrupted.

Additional Information: Running 'rpm -i' on a package has varying results. Sometimes it works, sometimes it crashes, sometimes it corrupts the system RPM DB. This is a significant concern for several reasons. Any product which requires LinuxThreads to work, and calls RPM under the covers, will have issues with this. I can name two products directly affected: ISMP (InstallShield MultiPlatform), which uses both rpm and rpmbuild to do software registration, and Caching Proxy (from WebSphere Edge), whose installer uses rpm to install the various packages the user selects. Also, any application bundled in RPM format that needs LinuxThreads to run and, as part of the post-install, executes part of itself to configure or whatnot, will have problems (as currently the only way to set LD_ASSUME_KERNEL for existing install rpms is to set it on the command line before calling RPM). So, this will likely affect a good many existing products that run on RHEL 2.1 today and are expected to be able to run on RHEL 3.0 in LinuxThreads mode.
The above procedure (slightly modified) runs fine for me. If LD_ASSUME_KERNEL is involved, then this is going to be a glibc and/or kernel, not rpm, issue. Reopen (and reassign to kernel/glibc) with the version/release of kernel/glibc that you are testing against.
------- Additional Comment #6 From Jared P. Jurkiewicz 2003-08-20 15:48 ------- I can still produce the failure using export LD_ASSUME_KERNEL values of 2.2.5 and 2.4.0. [root@arathorn test]# uname -a Linux arathorn.raleigh.ibm.com 2.4.21-1.1931.2.393.entsmp #1 SMP Wed Aug 13 21:51:41 EDT 2003 i686 i686 i386 GNU/Linux [root@arathorn test]# export LD_ASSUME_KERNEL=2.4.0 [root@arathorn test]# ./repr.sh Processing files: DummyRpm-1.0-0 Checking for unpackaged file(s): /usr/lib/rpm/check-files %{buildroot} Wrote: ./1315054.tmp Processing files: DummyRpm-1.0-0 Checking for unpackaged file(s): /usr/lib/rpm/check-files %{buildroot} Wrote: ./1315054.tmp Processing files: DummyRpm-1.0-0 Checking for unpackaged file(s): /usr/lib/rpm/check-files %{buildroot} Wrote: ./1315054.tmp Processing files: DummyRpm-1.0-0 Checking for unpackaged file(s): /usr/lib/rpm/check-files %{buildroot} Wrote: ./1315054.tmp Processing files: DummyRpm-1.0-0 Checking for unpackaged file(s): /usr/lib/rpm/check-files %{buildroot} Wrote: ./1315054.tmp ./repr.sh: line 9: 5405 Segmentation fault (core dumped) rpmbuild -- define '_rpmdir ./' --define '_rpmfilename 1315054.tmp' -bb $PWD/DummyRpm- 1.0.spec [root@arathorn test]# Also, the 2.2.5 export also fails. GDB examination of core: [root@arathorn test]# gdb /usr/bin/rpmbuild core.5301 GNU gdb Red Hat Linux (5.3.90-0.20030710.14rh) Copyright 2003 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "i386-redhat-linux-gnu"...(no debugging symbols found)...Using host libthread_db library "/lib/libthread_db.so.1". Core was generated by `rpmbuild --define _rpmdir ./ --define _rpmfilename 1315054.tmp -bb /root/test/D'. Program terminated with signal 11, Segmentation fault. Reading symbols from /usr/lib/librpmbuild-4.2.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/librpmbuild-4.2.so Reading symbols from /usr/lib/librpm-4.2.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/librpm-4.2.so Reading symbols from /usr/lib/librpmdb-4.2.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/librpmdb-4.2.so Reading symbols from /usr/lib/librpmio-4.2.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/librpmio-4.2.so Reading symbols from /usr/lib/libpopt.so.0...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libpopt.so.0 Reading symbols from /usr/lib/libelf.so.1...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libelf.so.1 Reading symbols from /usr/lib/libbeecrypt.so.6...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libbeecrypt.so.6 Reading symbols from /lib/librt.so.1...(no debugging symbols found)...done. Loaded symbols for /lib/librt.so.1 Reading symbols from /lib/libpthread.so.0...(no debugging symbols found)...done. Loaded symbols for /lib/libpthread.so.0 Reading symbols from /usr/lib/libbz2.so.1...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libbz2.so.1 Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done. Loaded symbols for /lib/libc.so.6 Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done. Loaded symbols for /lib/ld-linux.so.2 Reading symbols from /lib/libnss_files.so.2...(no debugging symbols found)...done. 
Loaded symbols for /lib/libnss_files.so.2 #0 0x00337d3f in domd5 () from /usr/lib/librpmdb-4.2.so (gdb) info stack #0 0x00337d3f in domd5 () from /usr/lib/librpmdb-4.2.so #1 0x00e8c6d7 in rpmAddSignature () from /usr/lib/librpm-4.2.so #2 0x00775866 in writeRPM () from /usr/lib/librpmbuild-4.2.so #3 0x00776443 in packageBinaries () from /usr/lib/librpmbuild-4.2.so #4 0x0076cd6c in buildSpec () from /usr/lib/librpmbuild-4.2.so #5 0x0804a2b1 in ?? () #6 0x083a4418 in ?? () #7 0x083a5ea8 in ?? () #8 0x0000009f in ?? () (gdb) and in case they want to know: [root@arathorn test]# cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 7 model name : Pentium III (Katmai) stepping : 3 cpu MHz : 549.067 cache size : 512 KB physical id : 0 siblings : 1 runqueue : 0 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse bogomips : 1094.45 processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 7 model name : Pentium III (Katmai) stepping : 3 cpu MHz : 549.067 cache size : 512 KB physical id : 0 siblings : 1 runqueue : 1 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse bogomips : 1097.72 And: [root@arathorn test]# free total used free shared buffers cached Mem: 1028548 1008200 20348 0 132596 598684 -/+ buffers/cache: 276920 751628 Swap: 634528 208 634320 Incidentally: 2.4.1 works. The problem seems to lie in the non-floating stack based glibc. <floating stack setup glibc, LinuxThreads based> [root@arathorn test]# export LD_ASSUME_KERNEL=2.4.1 [root@arathorn test]# ldd /usr/bin/rpmbuild librpmbuild-4.2.so => /usr/lib/librpmbuild-4.2.so (0x008fd000) librpm-4.2.so => /usr/lib/librpm-4.2.so (0x00855000) librpmdb-4.2.so => /usr/lib/librpmdb-4.2.so (0x00ccd000) librpmio-4.2.so => /usr/lib/librpmio-4.2.so (0x00175000) libpopt.so.0 => /usr/lib/libpopt.so.0 (0x00fc4000) libelf.so.1 => /usr/lib/libelf.so.1 (0x00681000) libbeecrypt.so.6 => /usr/lib/libbeecrypt.so.6 (0x00655000) librt.so.1 => /lib/i686/librt.so.1 (0x00111000) libpthread.so.0 => /lib/i686/libpthread.so.0 (0x001b3000) libbz2.so.1 => /usr/lib/libbz2.so.1 (0x00645000) libc.so.6 => /lib/i686/libc.so.6 (0x00283000) /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x0026c000) <Non-floating stack setup, LinuxThreads based> [root@arathorn test]# export LD_ASSUME_KERNEL=2.4.0 [root@arathorn test]# ldd /usr/bin/rpmbuild librpmbuild-4.2.so => /usr/lib/librpmbuild-4.2.so (0x00d52000) librpm-4.2.so => /usr/lib/librpm-4.2.so (0x00636000) librpmdb-4.2.so => /usr/lib/librpmdb-4.2.so (0x00711000) librpmio-4.2.so => /usr/lib/librpmio-4.2.so (0x00bda000) libpopt.so.0 => /usr/lib/libpopt.so.0 (0x00111000) libelf.so.1 => /usr/lib/libelf.so.1 (0x00e22000) libbeecrypt.so.6 => /usr/lib/libbeecrypt.so.6 (0x002e0000) librt.so.1 => /lib/librt.so.1 (0x009b9000) libpthread.so.0 => /lib/libpthread.so.0 (0x00119000) libbz2.so.1 => /usr/lib/libbz2.so.1 (0x008a2000) libc.so.6 => /lib/libc.so.6 (0x0016b000) /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x0087f000) [root@arathorn test]# contents of repr.sh: #!/bin/sh rc=0 while [ "$rc" = "0" ] do rpmbuild --define '_rpmdir ./' --define '_rpmfilename 1315054.tmp' -bb $PWD/DummyRpm-1.0.spec rc=$? 
done Contents of DummyRpm-1.0.spec: Name: DummyRpm Version: 1.0 Release: 0 Summary: Simple empty RPM testcase, Doesn't have any files. License: IBM 2002 Group: System Environment/Daemons AutoReqProv: Yes %description %files ------- Additional Comment #7 From Khoa D. Huynh 2003-08-20 15:55 ------- Jared - did you try LD_ASSUME_KERNEL=2.4.1 ? ------- Additional Comment #8 From Jared P. Jurkiewicz 2003-08-20 16:00 ------- Look under the last section of my comment (Incidentally). I note 2.4.1 does work, and that narrows down the problem area to the pthreads non-floating-stack glibc, which is what loads when you use 2.2.5 or 2.4.0. 2.4.1 gives you the pthreads floating-stack glibc. We do have older programs that need the non-floating-stack version, which could run into problems here (and, incidentally, are supported on RHEL 2.1). WAS 4.0.1, for example, needs it. JDK 1.3.0 would not work without that parameter being set (2.2.5). So, the base install for V4 will potentially have problems on RHEL 3.0 (and I'm pretty sure someone will ask us to support that). ------- Additional Comment #9 From Jared P. Jurkiewicz 2003-08-20 16:07 ------- Addendum: In terms of V4, we picked up support for RHEL 2.1 in 4.0.5, which is JDK 1.3.1 based. However, the customer has to install 4.0.1 first (the full install image) before the update to 4.0.5 can be done. Which means we need rpm (which is invoked to install IHS and GsKit) to not have the potential for destroying the OS. Of course, all this is moot if we don't have to support RHEL 3.0 with WAS v4. But, to be on the safe side, we should probably try and get this addressed. ------- Additional Comment #10 From Jared P. Jurkiewicz 2003-08-20 16:08 ------- Addendum to addendum: destroying the RPM repository of the OS, rather, which then makes RPM updates and such hard to do, as dependencies won't be found, etc.
This is not an rpm problem afaict. Reopen and assign to glibc if you wish.
------ Additional Comments From khoa.com 2003-21-08 17:25 ------- Glen/Greg - can you reopen the bug in RH Bugzilla and assign it to glibc ? Thanks.
------ Additional Comments From jaredj.com 2003-25-08 11:52 ------- Just found out this morning ... we need LD_ASSUME_KERNEL=2.4.0 (and 2.2.5) to work in order to get WAS V5.0.0 to install the baseline WAS. JDK 1.3.1 will segfault under LD_ASSUME_KERNEL=2.4.1 or greater. So, rpm having issues under the older glibc is dangerous, as rpm is called to install things like embedded MQ, GsKit, et cetera. -- Jared Jurkiewicz WebSphere AppServer Development
From what I can see, this is just that rpm, rpmbuild and maybe other rpm programs eat too much stack. linuxthreads non-FLOATING_STACKS (i.e. LD_ASSUME_KERNEL <= 2.4.0) on IA-32 limits the stack to 2MB (that's the size of the stack slots assigned to each thread), including the initial thread (this is nothing new, it has been like this since the beginning). If I ulimit -s 2048, I can get it to segfault with all of LD_ASSUME_KERNEL 2.2.5, 2.4.1 and without LD_ASSUME_KERNEL (i.e. lt, ltfs, nptl). But, when debugging rpmbuild, it seems to eat something like 0x46000 bytes of stack (difference between $esp in __libc_start_main and at the point of segfault). To me this looks as if kernel stack randomization eats from RLIMIT_STACK (this would explain why the segfaults aren't reproducible in every run).
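To make the numbers above easy to check, here is a small probe (my illustration, not part of the original report) that prints RLIMIT_STACK and then burns stack in 4KB chunks until the process takes a SIGSEGV; the last line printed approximates the usable initial-thread stack. Comparing runs with LD_ASSUME_KERNEL=2.4.0, 2.4.1, and unset, and under ulimit -s 2048, should show the limits described above. The file name and chunk size are arbitrary, and linking -lpthread is assumed so the threading library selected by LD_ASSUME_KERNEL is actually loaded (as it is for rpmbuild).

/* stackprobe.c - compile with: gcc -O0 -o stackprobe stackprobe.c -lpthread */
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

static char *base;                       /* address of a local in main() */

static void eat(void)
{
    volatile char pad[4096];             /* consume roughly 4KB of stack per call */

    pad[0] = 1;
    pad[sizeof(pad) - 1] = 1;
    fprintf(stderr, "used ~%ld bytes\n", (long)(base - (char *)&pad[0]));
    eat();                               /* recurse until the stack runs out */
    pad[0] = 2;                          /* keeps the call above from being tail-call optimized */
}

int main(void)
{
    char anchor;
    struct rlimit rl;

    base = &anchor;
    if (getrlimit(RLIMIT_STACK, &rl) == 0)
        fprintf(stderr, "RLIMIT_STACK soft limit: %ld\n", (long)rl.rlim_cur);
    eat();
    return 0;
}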
------ Additional Comments From jaredj.com 2003-10-09 11:03 ------- In addition, on PPC architectures, if I set LD_ASSUME_KERNEL=2.4.19 I see RPM DB corruption there as well. I've had to reload my machine a couple of times because of it. This setting is needed because we have programs that get installed under the covers via RPM that require LinuxThreads to properly install (they run post-setup scripts and such, calling executables that require LinuxThreads behaviors). While a workaround could be done by backing up the RPM DB, installing the program, then restoring the DB, I don't think general customers will find that acceptable. A secondary bug can be opened on PPC, but it looks like this is more of a general RPM issue or somesuch.
is RPM embedded/nested underneath ? Why not clear the environment variable before you do this ?
------ Additional Comments From jaredj.com 2003-12-09 14:16 ------- Because the RPM itself invokes scripts that require LinuxThreads, and we can't regenerate the RPM to do checks and sets of the env variables. Also, the ISMP PPKs for Linux do RPM registration, which generates dummy rpms to insert and install, and those too get this variable set when you have to set it for the JVM.
LD_ASSUME_KERNEL=$VALUE, where $VALUE is less than or equal to 2.4.19, will disable ALL locking in RPM. Any concurrent database access will corrupt the database. NPTL is needed for the concurrent locking mechanism, which is needed to do things like invoke rpm from %post scripts. If a script in %post is calling a program (such as a JVM) which requires LinuxThreads, it should set the LD_ASSUME_KERNEL environment variable in the %post script itself. Applications that require LD_ASSUME_KERNEL values to turn on LinuxThreads support, and that invoke rpm as a child process, should clear LD_ASSUME_KERNEL from the environment before exec()ing rpm.
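A minimal sketch of that last point (illustrative code, not anything shipped with rpm or the installers discussed here): the wrapper keeps LD_ASSUME_KERNEL in its own environment for the JVM it needs, but strips the variable from the child environment before exec()ing rpm, so rpm itself runs under NPTL. The rpm arguments and package name are hypothetical.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Fork, drop LD_ASSUME_KERNEL in the child only, then exec rpm. */
static int run_rpm_without_ld_assume(char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        unsetenv("LD_ASSUME_KERNEL");   /* child only: rpm sees the NPTL glibc */
        execvp("rpm", argv);
        perror("execvp rpm");
        _exit(127);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

int main(void)
{
    /* Hypothetical package, for illustration only. */
    char *const argv[] = { "rpm", "-i", "DummyRpm-1.0-0.i386.rpm", NULL };
    return run_rpm_without_ld_assume(argv);
}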
------ Additional Comments From jaredj.com 2003-12-09 15:58 ------- If that's the case and you cannot fix it, then you will have serious issues from companies other than IBM. There are programs people may want to install that require LinuxThreads to do post-install configuration work, and whose RPM/install images cannot be recreated. So, the post-install scripts can't be updated. This, in fact, will break any program using ISMP with a JDK that requires LinuxThreads. Their calls to rpm do not unset that variable, and therefore applications that would install on RHEL 2.1 will now corrupt the DB of RHEL 3 when they're installed.
On the bright side, the number of apps that don't work with NPTL is pretty low, since most applications seem to have depended on POSIX behavior rather than implementation details. I realize that's not helping this specific case, though.
Having said that, the 2MB stack issue is something that will be fixed by the kernel; the recursion issue is obviously beyond the kernel's scope.
Either add LD_ASSUME_KERNEL to the %post scriptlet or use a version of rpm compiled w/o --enable-posixmutexes. I see no other solution for rpm, hence WONTFIX.
corruption should only occur if you have multiple database accesses. db4 is only designed to support one locking mechanism. In our case, we select posix mutexes. db4 depends on the PTHREAD_PROCESS_SHARED, which is only implemented in NPTL. So rpm's locking mechanism only works with NPTL. Note: /usr/lib/rpm/rpmi is a statically linked NPTL application that could be used, but if a newer version of glibc is installed the NSS modules will be incompatible with the interfaces used in the static binary. This makes it unsuitable as a solution.
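For reference, a small standalone check (my sketch, not rpm's or db4's code) of the NPTL dependency described above: db4's posix-mutex locking relies on mutexes that work across processes, which is what PTHREAD_PROCESS_SHARED requests, and asking for it fails under LinuxThreads while succeeding under NPTL.

#include <pthread.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t mutex;
    int rc;

    pthread_mutexattr_init(&attr);
    /* Cross-process mutexes are what db4's posix-mutex locking needs. */
    rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    if (rc != 0) {
        /* Expected under LinuxThreads (e.g. LD_ASSUME_KERNEL=2.2.5 or 2.4.0). */
        printf("process-shared mutexes unavailable: %s\n", strerror(rc));
        return 1;
    }
    pthread_mutex_init(&mutex, &attr);
    printf("process-shared mutexes available (NPTL)\n");
    pthread_mutex_destroy(&mutex);
    pthread_mutexattr_destroy(&attr);
    return 0;
}

Compile with gcc -o pshared pshared.c -lpthread and run with and without LD_ASSUME_KERNEL set to compare the two threading libraries.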
------ Additional Comments From jaredj.com 2003-18-09 11:52 ------- main() -> do a getenv("LD_ASSUME_KERNEL"); if not null, note that, store off whether it selects LinuxThreads or not, do something like a putenv("SET_LINUXTHREADS_SCRIPTS_<PROCESS ID OF THIS RPM PROCESS>=TRUE"), then clear out the LD_ASSUME_KERNEL variable. Then do an execvp() of the same command line that was used to invoke RPM, with the flag now set to pass LinuxThreads settings to scripts (but, without LD_ASSUME_KERNEL, it should use NPTL for RPM itself). When it does the script calls, it looks to see if the flag says to set LinuxThreads, sets that in the environment if it is supposed to be set, then execvp's the script. I'll try to provide a code example of what I mean (not an RPM patch, but a simple example) to see if this is feasible.
------ Additional Comments From jaredj.com 2003-18-09 14:19 ------- Here's the quick code example to better explain what I mean. It's by no means robust or well written (or bug free, for that matter!). But, it shows the idea:

test_varswap.c:

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>   /* for strstr()/memset() */

extern char ** environ;

/**
 * Quick function to just replace the LD_ASSUME_KERNEL entry with another special one
 * so we can 'pass through' the LD_ASSUME to underlying programs we fork, but we depend on
 * it not being set.
 */
char ** replaceLDAssume(char ** envArray, char* replacementVal)
{
    int arraycount, pos;
    char ** newEnv;

    newEnv = NULL;
    arraycount = 0;
    if (environ != NULL)
    {
        /**
         * Figure out how big the env is, then malloc an array of the same size.
         */
        while (environ[arraycount] != NULL)
        {
            arraycount++;
        }
        newEnv = (char**)malloc((arraycount + 1) * sizeof(char*));
        memset(newEnv, 0, (arraycount + 1) * sizeof(char*));

        /**
         * Clone and replace.
         */
        for (pos = 0; pos < arraycount; pos++)
        {
            if (strstr(environ[pos], "LD_ASSUME_KERNEL") == NULL)
            {
                newEnv[pos] = environ[pos];
            }
            else
            {
                newEnv[pos] = replacementVal;
            }
        }
    }
    else
    {
        newEnv = environ;
    }
    return newEnv;
}

void print_array(char** env)
{
    int arraycount, pos;

    arraycount = 0;
    if (env != NULL)
    {
        while (environ[arraycount] != NULL)
        {
            printf("%s ", env[arraycount]);
            arraycount++;
        }
    }
}

int main(int argc, char** argv)
{
    char ** newEnv;
    char swapmodeVar[256];
    char compatVar[256];
    char** newargs;
    int count;
    pid_t pid;
    char* script[1];

    script[0] = "./testscript.sh";
    count = 0;
    if (getenv("LD_ASSUME_KERNEL") != NULL)
    {
        printf("LD_ASSUME_KERNEL detected. Removal from env initiated ");
        /**
         * Generate passthrough variable! We save the pid as part of the var, so it can be isolated
         * to one process, this one. Only this one will recognise it on lookup.
         */
        snprintf(swapmodeVar, 255, "%s_%d=%s", "PASSTHRU_LD_ASSUME", getpid(), getenv("LD_ASSUME_KERNEL"));
        swapmodeVar[255] =
there is no way that a change like this can go in for RC/GM
Other than WebSphere 4.0 and 5.0 (not sure which pieces of WS), what other IBM software is broken by invoking rpm under the covers like this ? Any Tivoli applications ?
------ Additional Comments From jaredj.com 2003-19-09 11:23 ------- Can it be fixed by the first service pack for RHEL 3.0, then? Even if it's not fixed in the GM/RC, it doesn't mean the fix shouldn't be investigated. It would mean we could at least do something by early next year for our customers on older releases, who were on say, RedHat 7.2 on s390.
------ Additional Comments From greg_kelleher.com 2003-26-09 14:58 ------- RH is going to try to fix this in RHEL 3 Update 1.
Greg, We are going to investigate it for the update.
------ Additional Comments From khoa.com 2003-15-10 19:52 ------- The corresponding RH bug (RH Bug 101603) is closed as WONTFIX. We need to re-open that RH bug report and need some status update. Thanks.
------ Additional Comments From jaredj.com 2003-16-10 11:55 ------- This is a significant concern for several reasons. Any product which requires LinuxThreads to work, and calls RPM under the covers, will have issues with this. I can name two products directly affected: ISMP (InstallShield MultiPlatform), which uses both rpm and rpmbuild to do software registration, and Caching Proxy (from WebSphere Edge), whose installer uses rpm to install the various packages the user selects. Also, any application bundled in RPM format that needs LinuxThreads to run and, as part of the post-install, executes part of itself to configure or whatnot, will have problems (as currently the only way to set LD_ASSUME_KERNEL for existing install rpms is to set it on the command line before calling RPM). So, this will likely affect a good many existing products that run on RHEL 2.1 today and are expected to be able to run on RHEL 3.0 in LinuxThreads mode. The reason this is important to the business is customer migration/legacy support scenarios. There are cases where a customer may try to install an older application or whatnot, and to do so sets LinuxThreads (to, say, invoke an older JVM which under the covers installs rpms or what have you). When this happens the customer is running in a state where they can corrupt the operating system repository. If that gets corrupted, the RHN update code won't function right, nor will RPM dependency resolution and so forth. The concern really is for legacy applications and the end user experience. If the OS DB gets corrupted, that's bad for both the vendor providing the application and for RedHat. From a WebSphere AppServer perspective: while the next revision of WebSphere (currently in development) should not be affected, all versions in the field are. I've been working with management here to push that we won't support anything older than our upcoming next release, but that could potentially be a hard sell for platforms other than x86 hardware. We have a chunk of customers on 390 hardware and RedHat 7.2 that I've heard want 5.0 support, and potentially 4.0 support on RHEL 3 (since they were running applications in production on RedHat 7.2 and do not want to move to the next version of WebSphere when it releases). With RPM acting the way it is, that makes the support extremely dangerous and potentially very expensive.
Business case added at Bob Johnson's request; IBM SWG would like this considered for Update 1.
Jared, Also another question - I need to know exactly what the execution paths from WAS to rpm are. As long as the WAS installer is the only thing on the system doing rpm db commits there is no problem. If WAS is calling rpm recursively, there's no way that will work without concurrent database access, which requires NPTL.
------ Additional Comments From jaredj.com 2003-05-11 12:29 ------- Bob, I have no idea, as some of the rpms come from other products, and I don't know what they do internally. That's why I'd like to see RPM behave properly; the patch I sent, which forces RPM into NPTL mode always, should take care of that. My patch should also handle RPM calls inside RPMs, or whatnot: it tags the passthrough variables with the PID of the RPM process, and the sub-processes (the scripts) use the ppid to find that, restore it, and so on. But if RPM is invoked inside an rpm script, it'll detect the LinuxThreads setting, store it back out, set a new variable in the env with a new pid, and so on. With the way it's passed in my patch, it can't really inherit when it shouldn't, from what I can tell.
I have not been able to crash rpm with any LD_ASSUME_KERNEL variable value with the test case provided. Your patch does not handle rpm calls inside rpms.
Rather, rpm calls in the %post of a package may work with your approach. But this is all trying to handle a hypothetical case where you MAY have two writers to the rpm database. If your packages don't call rpm in %post, and you don't invoke two package installs in parallel, you're OK. I don't think that the segfault that was reported initially actually happens with gold code, and if that was the basis for the concern I think you need to reassess the severity.
------ Additional Comments From jaredj.com 2003-05-11 13:39 ------- Concurrent RPM installs, one under LinuxThreads, will happen at some point. It's a very real problem. If this isn't fixed, older versions of WAS will simply not be supported on RHEL 3.
This does not make sense. It has nothing to do with the proper operation of WAS on RHEL 3. WAS will install and function properly on RHEL 3.
------ Additional Comments From jaredj.com 2003-05-11 13:58 ------- I can give an easy example of such a situation. Someone is, say, running an old WAS install (5.0), which requires LinuxThreads to start the 1.3.1 JDK. During the install, they decide to run the up2date program and pick up the latest patches too. That kicks off. WAS hits the section where it does MQ/GsKit/registration code and starts up RPM to install those parts. At the same time, up2date engages installs of new RPMs for patches. At the end of the install (perhaps after a system reboot), everything initially will seem okay. Later, the user runs up2date to get more patches (say a week or month later) and starts getting errors due to RPM DB corruption (packages missing, and so on). I think people generally assume they can run a few processes in parallel like that on Linux. It's not Windows, where things tend to get flaky if you do more than one install at a time. :) Or it could be that users have kicked off a large blanket install of a lot of programs, one of which requires LinuxThreads, and the installs are running in parallel (such as using make's parallel execution ability to automate something). Kaboom. Or, what if a user had LinuxThreads set in a window because they were running an older program which needed it, then goes to run an RPM install. At the same time, another user logged in decides to install something (yes, two people installing stuff on the same box isn't good from a security/management point of view, but I'm certain it goes on). Again, likely corruption. Not all these cases are hypothetical. As a general user, before the problem was identified, I lost my machine a few times due to the RPM corruption that happened. If it happened to me, it'll happen to others. I'm really not trying to be a pain here. I feel this is a very real, and very serious, problem. This is a legacy support scenario. I spent time digging through the RPM code to come up with a possible fix (and I'm no expert on RPM code; that was the first time I'd ever looked at it), to help provide possible solutions that wouldn't be invasive. I don't think the concept in my patch is invasive at all, though I cannot claim it to be perfect as I do not know the RPM code well, and I don't regularly program in C (I like C, but due to my job I don't get any sort of extended time to keep up to date and keep my knowledge active with it), just now and then. I wrote it as a very quick example. If there are problems with it, I'd appreciate hearing what they are, so I can learn and better understand the scenarios my patch doesn't cover, and better analyze and understand things in the future.
------ Additional Comments From jaredj.com 2003-05-11 14:22 ------- I'm hoping to get this fixed to save both IBM and RedHat customer support calls. It looks bad to install a program and, given a certain scenario of possibilities, have the machine end up corrupted such that the updater program won't work, RPM dependency resolution fails, and so on. I *truly* feel it would benefit both our companies to make sure RPM tolerates this sort of scenario safely. Neither of us would want a high-paying, large customer coming to us upset that their machine is messed up just because they installed an application and happened to do another install in parallel, or somesuch.
there is a best-effort amount of locking that is attempted between NPTL rpm and non-NPTL rpm. There could be some races, but there is a fcntl lock on the /var/lib/rpm/Packages file that is set when we detect that there is no NPTL available. I'm going to need some reproducible test case that demonstrates corruption before we're going to be able to make progress with this. Again, I've never gotten rpm to segfault as you demonstrate with the rpmbuild-in-a-loop test case. In that mode rpm is only doing read actions on the database to satisfy build dependencies. I think that there was some other sort of instability on your system that resulted in random corruption everywhere.
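To make the fallback concrete, here is a rough sketch (mine, not rpm's actual source) of the kind of whole-file fcntl() advisory lock described above on /var/lib/rpm/Packages; the flags and error handling are illustrative only.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    struct flock fl;
    int fd = open("/var/lib/rpm/Packages", O_RDWR);

    if (fd < 0) {
        perror("open /var/lib/rpm/Packages");
        return 1;
    }

    fl.l_type   = F_WRLCK;   /* exclusive (write) lock */
    fl.l_whence = SEEK_SET;
    fl.l_start  = 0;
    fl.l_len    = 0;         /* 0 means: lock the whole file */

    if (fcntl(fd, F_SETLKW, &fl) < 0) {   /* block until the lock is granted */
        perror("fcntl F_SETLKW");
        close(fd);
        return 1;
    }

    /* ... database work would happen here, protected from other fcntl users ... */

    fl.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &fl);
    close(fd);
    return 0;
}

Note that an advisory lock like this only coordinates with other processes that also take the fcntl lock; a writer relying solely on db4's posix mutexes would not see it, which is consistent with the races being described as "best effort" above.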
Jared, Have you been able to repro this on a system with a clean, fresh install ?
------ Additional Comments From jaredj.com 2003-05-11 16:21 ------- But, to quote you from a previous comment in the bug: "corruption should only occur if you have multiple database accesses. db4 is only designed to support one locking mechanism. In our case, we select posix mutexes. db4 depends on the PTHREAD_PROCESS_SHARED, which is only implemented in NPTL. So rpm's locking mechanism only works with NPTL." You said database locking only works when NPTL is enabled. That implies that running in LinuxThreads and calling RPM could potentially cause multiple database accesses and corrupt the DB. So, er, which is it? The whole reason I spent the time to write that patch was concern that RPM always needed to be invoked under NPTL, based on that statement. If it will perform some sort of proper locking regardless of whether rpm instances running under NPTL and under LinuxThreads run concurrently, then I'd say the problem is dealt with. But that's not what I was told previously in this bug.
Right, db4 can only have one type of locking and I was only looking at the db4 side of things. I missed where rpm does its own fcntl lock for NPTL-less operation.
------ Additional Comments From jaredj.com 2003-05-11 16:23 ------- Bob, I've not run it on the release candidates and attempted to do concurrent RPM access during an install that has LinuxThreads enabled. After being told by RedHat that it was a problem with RPM itself and its locking, I made sure never to invoke RPM under LinuxThreads so I wouldn't have to keep rebuilding my machine after it got chewed up. I don't have unlimited time to continually rebuild boxes.
You can back up your database, and often corruptions in the databases other than the Packages file can be repaired by rpm --rebuilddb.
------ Additional Comments From jaredj.com 2003-05-11 16:52 ------- --rebuilddb didn't fix it when I lost it the last few times. I had to continually re-install when the database was wrecked. Interestingly, on the GM RHEL 3.0, the 2.2.5 (non-floating) glibc doesn't crash the rpmbuild test, so I guess the stack overwrite that was happening there must've gotten fixed at some point. So, perhaps it was just a glibc stack-overwrite issue that was eating the DB beforehand, or perhaps not. I'll see what I can dig up when I have time. And as a request, when you re-evaluate something and it changes an earlier observation (such as the db4/rpm interaction), could you please post an update to the bugzilla so it gets back to me. Otherwise I'm working on half/incomplete information, and my decisions and advice to some managers here, and even one of the directors, are based on that. I'd really appreciate it.
I'll do my best, and apologize for the incorrect information I came up with the first time.
test comment
Jared, Any feedback ? If this was a glibc issue that has now been fixed, that is great news; if not, we have more work to do.
------- Additional Comment #54 From Jared P. Jurkiewicz 2003-11-06 13:27 ------- Thanks. I'm trying to get you more information and I'm glad to see that the 2.2.5 export glibc seems to be working better now. But, I have bad news to report. :-( I broke the box again today, doing something fairly trivial. I installed WAS V5.1 (with LinuxThreads set before the install is run), and in another window I just had a simple shell script running in a loop, force-installing a dummy RPM into the DB every 2 seconds or so (so I'd get a concurrent-access sort of situation). That said, the MQ install script we call under the covers in the Java installer (which does RPM installs) reported a ton of failures. Namely: wmsetup: 06Nov03 12:19:35 ================================================================================ ================== wmsetup: 06Nov03 12:19:35 Date: Thu Nov 6 12:19:35 EST 2003 wmsetup: 06Nov03 12:19:35 ================================================================================ ================== wmsetup: 06Nov03 12:19:35 Hostname: arathorn.raleigh.ibm.com wmsetup: 06Nov03 12:19:35 Operating System: Linux wmsetup: 06Nov03 12:19:35 User: uid=0(root) gid=0(root) groups=0(root),1(bin),2 (daemon),3(sys),4(adm),6(disk),10(wheel),501(mqm),502(mqbrkrs) wmsetup: 06Nov03 12:19:35 wmsetup version: 1.22 wmsetup: 06Nov03 12:19:35 wsmfuncs.common version: 1.60 wmsetup: 06Nov03 12:19:35 wsmfuncs.Linux version: 1.47 wmsetup: 06Nov03 12:19:35 Command line is: /opt/wasinst/messaging/wmsetup install /opt/WebSphere/AppServer/logs/mq_install.log wmsetup: 06Nov03 12:19:35 Function is install wmsetup: 06Nov03 12:19:35 Checking pre-requisites ... wmsetup: 06Nov03 12:19:35 Getting OS level ... wmsetup: 06Nov03 12:19:35 Check_oslevel return 0 wmsetup: 06Nov03 12:19:35 Checking kernel ... wmsetup: 06Nov03 12:19:35 ... OK wmsetup: 06Nov03 12:19:35 Checking for group mqm ... wmsetup: 06Nov03 12:19:35 Check_group returning 0 wmsetup: 06Nov03 12:19:35 Checking for user mqm ... wmsetup: 06Nov03 12:19:35 ... RC 0 from Check_user wmsetup: 06Nov03 12:19:35 Checking for group mqbrkrs ... wmsetup: 06Nov03 12:19:35 Check_group returning 0 wmsetup: 06Nov03 12:19:35 Check_root mqm wmsetup: 06Nov03 12:19:35 Checking for group "mqm" ... wmsetup: 06Nov03 12:19:35 Checking if user "root" is in group "mqm" wmsetup: 06Nov03 12:19:35 ... RC 0 from Check_root wmsetup: 06Nov03 12:19:35 Check_root mqbrkrs wmsetup: 06Nov03 12:19:35 Checking for group "mqbrkrs" ... wmsetup: 06Nov03 12:19:35 Checking if user "root" is in group "mqbrkrs" wmsetup: 06Nov03 12:19:35 ... RC 0 from Check_root wmsetup: 06Nov03 12:19:35 Checking for installed MQSeriesJava ... wmsetup: 06Nov03 12:19:35 package MQSeriesJava is not installed wmsetup: 06Nov03 12:19:35 Checking for installed MQSeriesRuntime ... wmsetup: 06Nov03 12:19:35 package MQSeriesRuntime is not installed wmsetup: 06Nov03 12:19:35 Checking for installed MQSeriesJava-5.2.2 ... wmsetup: 06Nov03 12:19:36 package MQSeriesJava-5.2.2 is not installed wmsetup: 06Nov03 12:19:36 Checking for installed MQSeriesJava ... wmsetup: 06Nov03 12:19:36 package MQSeriesJava is not installed wmsetup: 06Nov03 12:19:36 Checking for installed wemps-runtime ... wmsetup: 06Nov03 12:19:36 package wemps-runtime is not installed wmsetup: 06Nov03 12:19:36 Return code 0 from Check_prereqs wmsetup: 06Nov03 12:19:36 Install mqjava ... wmsetup: 06Nov03 12:19:36 IsClient entered wmsetup: 06Nov03 12:19:39 IsClient exit RC = 4 wmsetup: 06Nov03 12:19:39 installing component mqjava ... 
wmsetup: 06Nov03 12:19:39 MQSeriesJava-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 12:19:41 Checking for previously installed packages ... wmsetup: 06Nov03 12:19:41 Check for previously installed packages complete wmsetup: 06Nov03 12:19:41 Installing MQSeriesJava-5.3.0-1.i386.rpm rpmdb: PANIC: Invalid argument rpmdb: fatal region error detected; run recovery rpmdb: fatal region error detected; run recovery rpmdb: fatal region error detected; run recovery rpmdb: fatal region error detected; run recovery rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from dbcursor->c_put: DB_RUNRECOVERY: Fatal error, run database recovery rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->sync: DB_RUNRECOVERY: Fatal error, run database recovery rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from dbcursor->c_close: DB_RUNRECOVERY: Fatal error, run database recovery rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->sync: DB_RUNRECOVERY: Fatal error, run database recovery rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->cursor: DB_RUNRECOVERY: Fatal error, run database recovery rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "MQSeriesJava" records from Name index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->sync: DB_RUNRECOVERY: Fatal error, run database recovery rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->cursor: DB_RUNRECOVERY: Fatal error, run database recovery rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "Cleanup" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "DefaultConfiguration" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "IVTRun" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "IVTSetup" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "IVTTidy" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "JMSAdmin" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "JMSAdmin.config" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "JmsPostcardSample.ini" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: 
error(-30982) getting "MQJMS_PSQ.mqsc" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "PSIVTRun" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "PSReportDump.class" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "formatLog" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "postcard" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "postcard.ini" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "runjms" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "com.ibm.mq.jar" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "com.ibm.mqbind.jar" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "com.ibm.mqjms.jar" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "connector.jar" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "fscontext.jar" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "jms.jar" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "jndi.jar" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "jta.jar" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "ldap.jar" records from Basenames index ... and it goes on and on like that. Looks like something in the best attemp locking isn't doing so well, or somesuch. It failed really badly. I'm glad I tgz'ed the RPM /var/lib/rpm directory ahead of time. 
I think this can be reproduced easily via a couple scripts, running in loops to install rpms, one in LinuxThreads mode, the other in NPTL. I'll try to set that up, but I wanted to get you this feedback so you could see what I was getting reported. I restored my RPM DB, and ran both installers in NPTL mode (The WAS install, and the trivial script/dummy rpm) and it installed fine. Here's the output from MQ's install script (which just calls RPM) wmsetup: 06Nov03 13:02:17 ================================================================================ ================== wmsetup: 06Nov03 13:02:17 Date: Thu Nov 6 13:02:17 EST 2003 wmsetup: 06Nov03 13:02:17 ================================================================================ ================== wmsetup: 06Nov03 13:02:17 Hostname: arathorn.raleigh.ibm.com wmsetup: 06Nov03 13:02:17 Operating System: Linux wmsetup: 06Nov03 13:02:17 User: uid=0(root) gid=0(root) groups=0(root),1(bin),2 (daemon),3(sys),4(adm),6(disk),10(wheel),501(mqm),502(mqbrkrs) wmsetup: 06Nov03 13:02:18 wmsetup version: 1.22 wmsetup: 06Nov03 13:02:18 wsmfuncs.common version: 1.60 wmsetup: 06Nov03 13:02:18 wsmfuncs.Linux version: 1.47 wmsetup: 06Nov03 13:02:18 Command line is: /opt/wasinst/messaging/wmsetup install /opt/WebSphere/AppServer/logs/mq_install.log wmsetup: 06Nov03 13:02:18 Function is install wmsetup: 06Nov03 13:02:18 Checking pre-requisites ... wmsetup: 06Nov03 13:02:18 Getting OS level ... wmsetup: 06Nov03 13:02:18 Check_oslevel return 0 wmsetup: 06Nov03 13:02:18 Checking kernel ... wmsetup: 06Nov03 13:02:18 ... OK wmsetup: 06Nov03 13:02:18 Checking for group mqm ... wmsetup: 06Nov03 13:02:18 Check_group returning 0 wmsetup: 06Nov03 13:02:18 Checking for user mqm ... wmsetup: 06Nov03 13:02:18 ... RC 0 from Check_user wmsetup: 06Nov03 13:02:18 Checking for group mqbrkrs ... wmsetup: 06Nov03 13:02:18 Check_group returning 0 wmsetup: 06Nov03 13:02:18 Check_root mqm wmsetup: 06Nov03 13:02:18 Checking for group "mqm" ... wmsetup: 06Nov03 13:02:18 Checking if user "root" is in group "mqm" wmsetup: 06Nov03 13:02:18 ... RC 0 from Check_root wmsetup: 06Nov03 13:02:18 Check_root mqbrkrs wmsetup: 06Nov03 13:02:18 Checking for group "mqbrkrs" ... wmsetup: 06Nov03 13:02:18 Checking if user "root" is in group "mqbrkrs" wmsetup: 06Nov03 13:02:18 ... RC 0 from Check_root wmsetup: 06Nov03 13:02:18 Checking for installed MQSeriesJava ... wmsetup: 06Nov03 13:02:18 package MQSeriesJava is not installed wmsetup: 06Nov03 13:02:18 Checking for installed MQSeriesRuntime ... wmsetup: 06Nov03 13:02:18 package MQSeriesRuntime is not installed wmsetup: 06Nov03 13:02:18 Checking for installed MQSeriesJava-5.2.2 ... wmsetup: 06Nov03 13:02:19 package MQSeriesJava-5.2.2 is not installed wmsetup: 06Nov03 13:02:19 Checking for installed MQSeriesJava ... wmsetup: 06Nov03 13:02:19 package MQSeriesJava is not installed wmsetup: 06Nov03 13:02:19 Checking for installed wemps-runtime ... wmsetup: 06Nov03 13:02:19 package wemps-runtime is not installed wmsetup: 06Nov03 13:02:19 Return code 0 from Check_prereqs wmsetup: 06Nov03 13:02:19 Install mqjava ... wmsetup: 06Nov03 13:02:19 IsClient entered wmsetup: 06Nov03 13:02:22 IsClient exit RC = 4 wmsetup: 06Nov03 13:02:22 installing component mqjava ... wmsetup: 06Nov03 13:02:22 MQSeriesJava-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:25 Checking for previously installed packages ... 
wmsetup: 06Nov03 13:02:25 Check for previously installed packages complete wmsetup: 06Nov03 13:02:25 Installing MQSeriesJava-5.3.0-1.i386.rpm wmsetup: 06Nov03 13:02:28 ... Return code 0 from rpm -i wmsetup: 06Nov03 13:02:28 Install mqm ... wmsetup: 06Nov03 13:02:28 IsClient entered wmsetup: 06Nov03 13:02:33 This is a currently a WAS client-only installation wmsetup: 06Nov03 13:02:33 IsClient exit RC = 0 wmsetup: 06Nov03 13:02:33 installing component mqm ... wmsetup: 06Nov03 13:02:33 MQSeriesClient-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:33 MQSeriesMsg_Zh_CN-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:33 MQSeriesMsg_Zh_TW-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:33 MQSeriesMsg_de-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:33 MQSeriesMsg_es-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:33 MQSeriesMsg_fr-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:33 MQSeriesMsg_it-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:33 MQSeriesMsg_ja-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:33 MQSeriesMsg_ko-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:33 MQSeriesMsg_pt-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:33 MQSeriesServer-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:33 MQSeriesSDK-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:33 MQSeriesRuntime-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:35 Checking for previously installed packages ... wmsetup: 06Nov03 13:02:36 Check for previously installed packages complete wmsetup: 06Nov03 13:02:36 Installing MQSeriesRuntime-5.3.0-1.i386.rpm MQSeriesSDK-5.3.0-1.i386.rpm MQSeriesServer-5.3.0-1.i386.rpm MQSeriesMsg_pt- 5.3.0-1.i386.rpm MQSeriesMsg_ko-5.3.0-1.i386.rpm MQSeriesMsg_ja-5.3.0- 1.i386.rpm MQSeriesMsg_it-5.3.0-1.i386.rpm MQSeriesMsg_fr-5.3.0-1.i386.rpm MQSeriesMsg_es-5.3.0-1.i386.rpm MQSeriesMsg_de-5.3.0-1.i386.rpm MQSeriesMsg_Zh_TW-5.3.0-1.i386.rpm MQSeriesMsg_Zh_CN-5.3.0-1.i386.rpm MQSeriesClient-5.3.0-1.i386.rpm wmsetup: 06Nov03 13:03:15 ... Return code 0 from rpm -i wmsetup: 06Nov03 13:03:15 Setting capacity units /opt/mqm/bin/setmqcap: relocation error: /opt/mqm/lib/libmqmr_r.so: symbol errno, version GLIBC_2.0 not defined in file libc.so.6 with link time reference wmsetup: 06Nov03 13:03:15 Install mqm_csd ... wmsetup: 06Nov03 13:03:15 IsClient entered wmsetup: 06Nov03 13:03:21 This is not currently a WAS client-only installation wmsetup: 06Nov03 13:03:21 IsClient exit RC = 4 wmsetup: 06Nov03 13:03:21 installing component mqm_csd ... wmsetup: 06Nov03 13:03:21 MQSeriesClient-U486878-5.3.0-4.i386.rpm Found wmsetup: 06Nov03 13:03:21 MQSeriesRuntime-U486878-5.3.0-4.i386.rpm Found wmsetup: 06Nov03 13:03:21 MQSeriesSDK-U486878-5.3.0-4.i386.rpm Found wmsetup: 06Nov03 13:03:21 MQSeriesServer-U486878-5.3.0-4.i386.rpm Found wmsetup: 06Nov03 13:03:21 MQSeriesJava-U486878-5.3.0-4.i386.rpm Found wmsetup: 06Nov03 13:03:23 Determining which CSD packages are applicable ... wmsetup: 06Nov03 13:03:23 Starting with list: MQSeriesJava-U486878-5.3.0- 4.i386.rpm MQSeriesServer-U486878-5.3.0-4.i386.rpm MQSeriesSDK-U486878-5.3.0- 4.i386.rpm MQSeriesRuntime-U486878-5.3.0-4.i386.rpm MQSeriesClient-U486878- 5.3.0-4.i386.rpm wmsetup: 06Nov03 13:03:23 Ending with list: MQSeriesJava-U486878-5.3.0- 4.i386.rpm MQSeriesServer-U486878-5.3.0-4.i386.rpm MQSeriesSDK-U486878-5.3.0- 4.i386.rpm MQSeriesRuntime-U486878-5.3.0-4.i386.rpm MQSeriesClient-U486878- 5.3.0-4.i386.rpm wmsetup: 06Nov03 13:03:23 Checking for previously installed packages ... 
wmsetup: 06Nov03 13:03:24 Check for previously installed packages complete wmsetup: 06Nov03 13:03:24 Installing MQSeriesJava-U486878-5.3.0-4.i386.rpm MQSeriesServer-U486878-5.3.0-4.i386.rpm MQSeriesSDK-U486878-5.3.0-4.i386.rpm MQSeriesRuntime-U486878-5.3.0-4.i386.rpm MQSeriesClient-U486878-5.3.0-4.i386.rpm wmsetup: 06Nov03 13:08:45 ... Return code 0 from rpm -i wmsetup: 06Nov03 13:08:45 Install wemps ... wmsetup: 06Nov03 13:08:45 IsClient entered wmsetup: 06Nov03 13:08:52 This is not currently a WAS client-only installation wmsetup: 06Nov03 13:08:52 IsClient exit RC = 4 wmsetup: 06Nov03 13:08:52 installing component wemps ... wmsetup: 06Nov03 13:08:52 wemps-runtime-2.1.0-0.i386.rpm Found wmsetup: 06Nov03 13:08:52 wemps-msg-De_DE-2.1.0-0.i386.rpm Found wmsetup: 06Nov03 13:08:52 wemps-msg-Es_ES-2.1.0-0.i386.rpm Found wmsetup: 06Nov03 13:08:52 wemps-msg-Fr_FR-2.1.0-0.i386.rpm Found wmsetup: 06Nov03 13:08:52 wemps-msg-It_IT-2.1.0-0.i386.rpm Found wmsetup: 06Nov03 13:08:52 wemps-msg-Ja_JP-2.1.0-0.i386.rpm Found wmsetup: 06Nov03 13:08:52 wemps-msg-Ko_KR-2.1.0-0.i386.rpm Found wmsetup: 06Nov03 13:08:52 wemps-msg-Pt_BR-2.1.0-0.i386.rpm Found wmsetup: 06Nov03 13:08:52 wemps-msg-Zh_CN-2.1.0-0.i386.rpm Found wmsetup: 06Nov03 13:08:52 wemps-msg-Zh_TW-2.1.0-0.i386.rpm Found wmsetup: 06Nov03 13:08:55 Checking for previously installed packages ... wmsetup: 06Nov03 13:08:55 Check for previously installed packages complete wmsetup: 06Nov03 13:08:55 Installing wemps-msg-Zh_TW-2.1.0-0.i386.rpm wemps-msg- Zh_CN-2.1.0-0.i386.rpm wemps-msg-Pt_BR-2.1.0-0.i386.rpm wemps-msg-Ko_KR-2.1.0- 0.i386.rpm wemps-msg-Ja_JP-2.1.0-0.i386.rpm wemps-msg-It_IT-2.1.0-0.i386.rpm wemps-msg-Fr_FR-2.1.0-0.i386.rpm wemps-msg-Es_ES-2.1.0-0.i386.rpm wemps-msg- De_DE-2.1.0-0.i386.rpm wemps-runtime-2.1.0-0.i386.rpm Preinstall Phase Executing Please Wait... Preinstall Phase Finished Postinstall Phase Executing Please Wait... Postinstall Phase Finished wmsetup: 06Nov03 13:09:20 ... Return code 0 from rpm -i wmsetup: 06Nov03 13:09:20 Return code 0 from Install_wsm wmsetup: 06Nov03 13:09:20 Exiting - return code 0 wmsetup: 06Nov03 13:09:20 ============================================================
------ Additional Comments From khoa.com 2003-06-11 15:50 ------- Add Salina to the CC list, so she can see all the updates. Thanks.
------ Additional Comments From jinge.com 2003-06-11 23:41 ------- I also saw a similar problem when I installed CSM after setting LD_ASSUME_KERNEL=2.2.5. installms: Exit code 1 from command: /bin/rpm -U /csminstall/Linux/RedHatEL-ES/csm/1.3.2/packages/csm.dsh-1.3.2.10-2.i386.rpm 2>&1 Error message from cmd: rpmdb: PANIC: Invalid argument rpmdb: fatal region error detected; run recovery rpmdb: fatal region error detected; run recovery rpmdb: fatal region error detected; run recovery rpmdb: fatal region error detected; run recovery rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from dbcursor->c_put: DB_RUNRECOVERY: Fatal error, run database recovery rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->sync: DB_RUNRECOVERY: Fatal error, run database recovery rpmdb: fatal region error detected; run recovery .............. After I removed __db.00* from /var/lib/rpm, rpm returned to normal. But the problem occurs again after several more installs. Can someone add me and keshav.com to the CC list so we can get updates? Thanks.
test to debug BugMail tool for Mark Wisner
------ Additional Comments From markwiz.com 2003-07-11 07:32 ------- added keshav.com to the cc list
I did a test: (I = Matt Wilson)

mkdir -p /tmp/test/var/lib/rpm
cp /usr/lib/rpmdb/i386-redhat-linux/redhat/* /tmp/test/var/lib/rpm/

This is a full and very large database.

window 1:
while :; do
  rpm -r /tmp/test -Uvh redhat-release-3-1.noarch.rpm --force --nodeps || break
done

window 2:
while :; do
  LD_ASSUME_KERNEL=2.4.19 rpm -r /tmp/test -Uvh redhat-release-3-1.noarch.rpm --force --nodeps || break
done

Eventually I did see some database environment errors:

rpmdb: fatal region error detected; run recovery
error: db4 error(-30982) from dbenv->close: DB_RUNRECOVERY: Fatal error, run database recovery

But the database is still fully intact, rpm -qa on the database still shows 1392 package entries, etc.
Per Jared, Matt can you try 2.2.5 ?
------ Additional Comments From keshav.com 2003-14-11 13:22 ------- Hi - Any progress on this defect? Is this going to make it into the first update? This is a blocking defect for CSM and we would really like to have this in Update 1.
------ Additional Comments From jaredj.com 2003-17-11 09:41 ------- I'm not certain if I've been able to corrupt the DB again, but I have been able to hang it several times using the example RedHat provided regarding a simple script install in two windows. I couldn't even do an rpm -qa unless I went in and deleted the __db.00? files. So, as it stands I don't think we have anything useful we can tell customers. In one case it throws errors to the screen and messes up installs. In another, it locks the DB up completely. Then there's still the third possibility of DB corruption. Also an interesting side note: after I killed the installs of the same rpm (in two windows, in a loop), and unlocked the DB by removing the __db files, I did an rpm -qa. It reported back that there were 679 packages in the DB. Just before the run, I had queried and it reported 677. Since it was installing the same package, shouldn't it have only reported 678? I don't think RPM will put in duplicate entries, will it?
test to debug running BugMail tool for Mark Wisner
Mark and Glen, Can you open up a test bug for your work and not use this one ? Every time you touch the bug you spawn emails to everyone that is on the cc list and all who have posted. With this being a hot bug we all go take a look and see...."test comment"....Thanks.
How exactly is rpm invoked by the installation tools? By direct path (/bin/rpm) or by rpm in $PATH?
*** Bug 110554 has been marked as a duplicate of this bug. ***
------ Additional Comments From jaredj.com 2003-21-11 09:03 ------- Depends on the exec'er. I know ISMP references full pathname, whereas the MQ series install scripts invoke it via $PATH. So, in the case of the WAS installer, both ways are used.
------ Additional Comments From jaredj.com 2003-21-11 13:57 ------- Regarding the note above about the RPM patch and oddity there. I suspect you'd see the same if you ran without the patched rpm and just ran the test RPM scripts, both in NPTL mode. I've not verified this, however. I haven't had time to oldpackage back to the non-patched RPM.
This was emailed, but I will add it here: ------- Additional Comment #69 From Jared P. Jurkiewicz 2003-11-21 13:56 ------- A few more interesting side notes. I've gotten RPM to hang (the one running under NPTL) when there were dual LinuxThreads and NPTL RPM calls accessing at the same time. This seemed very easy to trigger with the -vvv option set in the script that was installing the RPM. In any event, here is a snippet from the logs: NPTL running RPM call: ========================================== ... bunch of calls D: sanity checking 1 elements D: computing 0 file fingerprints Preparing packages for installation... D: computing file dispositions D: ========== +++ DummyRpm-1.0-0 i386-linux 0x0 D: Expected size: 1169 = lead(96)+sigs(180)+pad(4)+data(889) D: Actual size: 1169 D: install: DummyRpm-1.0-0 has 0 files, test = 0 D: opening db index /var/lib/rpm/Name create mode=0x42 D: read h# 1006 Header SHA1 digest: OK (8103c2a4c7954dcc5eb9a97827d653781e1b9a99) DummyRpm-1.0-0 D: --- h# 1006 DummyRpm-1.0-0 D: removing "DummyRpm" from Name index. D: opening db index /var/lib/rpm/Group create mode=0x42 D: removing "System Environment/Daemons" from Group index. D: opening db index /var/lib/rpm/Requirename create mode=0x42 D: removing 2 entries from Requirename index. D: opening db index /var/lib/rpm/Providename create mode=0x42 D: removing "DummyRpm" from Providename index. D: opening db index /var/lib/rpm/Requireversion create mode=0x42 D: removing 2 entries from Requireversion index. D: opening db index /var/lib/rpm/Provideversion create mode=0x42 D: removing "1.0-0" from Provideversion index. D: opening db index /var/lib/rpm/Installtid create mode=0x42 D: removing 1 entries from Installtid index. D: opening db index /var/lib/rpm/Sigmd5 create mode=0x42 D: removing 1 entries from Sigmd5 index. D: opening db index /var/lib/rpm/Sha1header create mode=0x42 D: removing "8103c2a4c7954dcc5eb9a97827d653781e1b9a99" from Sha1header index. D: +++ h# 1007 Header SHA1 digest: OK (8103c2a4c7954dcc5eb9a97827d653781e1b9a99) At this point, it's hung and just stays there. 
LinuxThreads running RPM call:
===================================
D: ============== 1315054.tmp
D: Expected size: 1169 = lead(96)+sigs(180)+pad(4)+data(889)
D: Actual size: 1169
D: 1315054.tmp: MD5 digest: OK (ecc0cece7925db82e0602eca45259c2f)
D: added binary package [0]
D: found 0 source and 1 binary packages
D: unshared posix mutexes found(38), adding DB_PRIVATE, using fcntl lock
D: opening db environment /var/lib/rpm/Packages create:cdb:mpool:private
D: opening db index /var/lib/rpm/Packages rdonly mode=0x0
D: locked db index /var/lib/rpm/Packages
D: ========== +++ DummyRpm-1.0-0 i386-linux 0x0
D: opening db index /var/lib/rpm/Depends create mode=0x0
D: Requires: rpmlib(PayloadFilesHavePrefix) <= 4.0-1 YES (rpmlib provides)
D: Requires: rpmlib(CompressedFileNames) <= 3.0.4-1 YES (rpmlib provides)
D: closed db index /var/lib/rpm/Depends
D: closed db index /var/lib/rpm/Packages
D: closed db environment /var/lib/rpm/Packages
D: ========== recording tsort relations
D: ========== tsorting packages (order, #predecessors, #succesors, tree, depth)
D: 0 0 0 0 0 +DummyRpm-1.0-0
D: installing binary packages
D: unshared posix mutexes found(38), adding DB_PRIVATE, using fcntl lock
D: opening db environment /var/lib/rpm/Packages create:cdb:mpool:private
D: opening db index /var/lib/rpm/Packages create mode=0x42
D: mounted filesystems:
D: i dev bsize bavail iavail mount point
D: 0 0x0812 4096 777024 944553 /
D: 1 0x0002 1024 0 -1 /proc
D: 2 0x0008 1024 0 -1 /proc/bus/usb
D: 3 0x0811 1024 48735 20030 /boot
D: 4 0x0007 1024 0 -1 /dev/pts
D: 5 0x0009 4096 128566 128565 /dev/shm
D: sanity checking 1 elements
D: computing 0 file fingerprints
Preparing packages for installation...
D: computing file dispositions
D: ========== +++ DummyRpm-1.0-0 i386-linux 0x0
D: Expected size: 1169 = lead(96)+sigs(180)+pad(4)+data(889)
D: Actual size: 1169
D: install: DummyRpm-1.0-0 has 0 files, test = 0
D: opening db index /var/lib/rpm/Name create mode=0x42
D: read h# 1006 Header SHA1 digest: OK (8103c2a4c7954dcc5eb9a97827d653781e1b9a99) DummyRpm-1.0-0
D: --- h# 1006 DummyRpm-1.0-0
D: removing "DummyRpm" from Name index.
D: opening db index /var/lib/rpm/Group create mode=0x42
D: removing "System Environment/Daemons" from Group index.
D: opening db index /var/lib/rpm/Requirename create mode=0x42
D: removing 2 entries from Requirename index.
D: opening db index /var/lib/rpm/Providename create mode=0x42
D: removing "DummyRpm" from Providename index.

And it just keeps going indefinitely. Very strange.
=========================================================
Another thing I tried was rebuilding RPM itself with the patch that forces NPTL. That certainly got around the RPM hang, but after some random number of loops through the script, it started spitting out the fatal db environment errors:

D: ============== 1315054.tmp
D: Expected size: 1169 = lead(96)+sigs(180)+pad(4)+data(889)
D: Actual size: 1169
D: 1315054.tmp: MD5 digest: OK (ecc0cece7925db82e0602eca45259c2f)
D: added binary package [0]
D: found 0 source and 1 binary packages
D: opening db environment /var/lib/rpm/Packages joinenv
rpmdb: fatal region error detected; run recovery
error: db4 error(-30982) from dbenv->open: DB_RUNRECOVERY: Fatal error, run database recovery
D: opening db index /var/lib/rpm/Packages rdonly mode=0x0
error: cannot open Packages index using db3 - (-30982)
error: cannot open Packages database in /var/lib/rpm
D: ========== recording tsort relations
D: ========== tsorting packages (order, #predecessors, #succesors, tree, depth)
D: 0 0 0 0 0 +DummyRpm-1.0-0
D: ============== 1315054.tmp
D: Expected size: 1169 = lead(96)+sigs(180)+pad(4)+data(889)
D: Actual size: 1169
=============================================

Since these logs are huge, I'm going to mail them directly to Mark, along with the patched build of RPM, in case he would like to look at it as well.

------- Additional Comment #70 From Jared P. Jurkiewicz 2003-11-21 13:57 ------- Regarding the note above about the RPM patch and the oddity there: I suspect you'd see the same if you ran without the patched rpm and just ran the test RPM scripts, both in NPTL mode. I've not verified this, however. I haven't had time to oldpackage back to the non-patched RPM.
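A sketch of the two-window setup described in comment #69 above (the package path is a placeholder for whatever test RPM is used; the original runs drove rpm through the installer rather than invoking it directly like this):

PKG=./DummyRpm-1.0-0.i386.rpm   # placeholder path, not from the report

# Terminal 1: NPTL (LD_ASSUME_KERNEL not set), reinstall the same package in a loop
while true; do
    rpm -Uvh --replacepkgs -vvv "$PKG" || break
done

# Terminal 2: LinuxThreads, same loop against the same /var/lib/rpm
export LD_ASSUME_KERNEL=2.2.5
while true; do
    rpm -Uvh --replacepkgs -vvv "$PKG" || break
done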
------ Additional Comments From jaredj.com 2003-12-03 13:51 ------- More experimentation results. I completely reloaded my box with RHEL 3 AS and installed all current RHN updates on it. I backed up my RPM DB, just in case. I swapped the threading model to LinuxThreads, non-floating stack (LD_ASSUME_KERNEL=2.2.5). I used the WAS V5.1 installer (because I now can't get the V5.0 installer to invoke at all; the JDK segfaults no matter what glibc is used, which is not good. It did work in the past using the 2.2.5 export). Before the install, RPM reported there were 697 packages in the DB. After the install, RPM reported there were 147 packages in the DB. This is bad... I'll be mailing the RPM package lists from before and after the install to Mark and Bob.
------ Additional Comments From jaredj.com 2003-12-03 13:52 ------- Correction, it was 142 packages in the DB after the install.
------ Additional Comments From jaredj.com 2003-12-08 16:22 ------- Sent a trial CD to RedHat so they can easily reproduce in house. A couple of things that are useful to know: The CD is of V5.1, which will run the install in NPTL mode. It's the only image I had readily on hand that would work. To see the problem, set LinuxThreads before the run: export LD_ASSUME_KERNEL=2.2.5 will usually show the problem very readily. V5.0 requires LinuxThreads (for the 1.3.1 JVM to work, etc.), so this demonstrates the same RPM error V5.0 hits on RHEL 3 when run this way. There is a file on the root of the CD (install.sp) that sets debug on for ISMP, which will show you all the RPM commands it's calling out to. This will normally end up in the install log at the end of the install in $WAS_HOME/logs/log.txt. You can also display it to the console while it's running by invoking the install as: ./install -is:javaconsole I have verified the trial install also triggers the failure. It took two installs to hit it, but I had the RPM DB report 697 entries before the install and 179 entries afterward.
[root@msw root]# mount /mnt/cdrom
[root@msw root]# cd /mnt/cdrom
[root@msw root]# LD_ASSUME_KERNEL=2.2.5 ./install
InstallShield Wizard
Initializing InstallShield Wizard...
Searching for Java(tm) Virtual Machine...
........
[root@msw root]# rpm --dbpath /root/rpm-backup -qa | wc -l
534
[root@msw root]# rpm -qa | wc -l
534
[root@msw root]# /usr/lib/rpm/rpmdb_verify /var/lib/rpm/[A-Z]*
[root@msw root]#
This was because rpmbuild wasn't installed. ISMP builds "fake" packages with rpmbuild to "install" on the system. After I install the rpm-build package, I see what you're seeing, but this is due to the db environment:

[root@msw WAS]# rpm -qa | wc -l
107

But:

[root@msw WAS]# /usr/lib/rpm/rpmdb_verify /var/lib/rpm/[A-Z]*
The database is intact
[root@msw WAS]# rm /var/lib/rpm/__db*
[root@msw rpm]# rpm -qa | wc -l
575

Before installing:

[root@msw rpm]# rpm --dbpath /root/rpm-backup-2/ -qa | wc -l
548
[root@msw tmp]# rpm --dbpath /root/rpm-backup-2/ -qa | sort > before
[root@msw tmp]# rpm -qa | sort > after
[root@msw tmp]# diff -u before after
--- before 2003-12-09 12:42:38.000000000 -0500
+++ after 2003-12-09 12:42:59.000000000 -0500
@@ -507,6 +508,33 @@
 Wnn6-SDK-1.0-25
 Wnn6-SDK-devel-1.0-25
 words-2-21
+WSBAC1AA51-5.1-0
+WSBAS1AA51-5.1-0
+WSBAU1AA51-5.1-0
+WSBCO1AA51-5.1-0
+WSBCO5AA51-5.1-0
+WSBDM1AA51-5.1-0
+WSBDT1AA51-5.1-0
+WSBES1AA-5.0-0
+WSBGK2AA51-5.1-0
+WSBIH1AA51-1.3-28
+WSBJA1AA51-5.1-0
+WSBJD5AA51-1.3-1
+WSBJD7AA51-1.3-1
+WSBJD9AA51-1.3-1
+WSBLA1AA51-5.1-0
+WSBMQ1AA-5.0-0
+WSBMQ2AA-5.0-0
+WSBMQ3AA-5.0-0
+WSBMQ4AA-5.0-0
+WSBMS3AA-5.0-0
+WSBMS6AA-5.0-0
+WSBPL1AA51-5.1-0
+WSBPS1AA51-5.1-0
+WSBSM1AA51-5.1-0
+WSBSR1AA51-5.1-0
+WSBSR4AA51-5.1-0
+WSBTV1AA51-5.1-0
 wvdial-1.53-11
 Xaw3d-1.5-18
 xchat-2.0.4-3.EL
------ Additional Comments From jaredj.com 2003-12-12 13:54 ------- Received a patch from RedHat that conceptually does the same thing as the patch I submitted to them previously (detect LD_ASSUME_KERNEL; if found, store it off, reset to NPTL, and re-exec. The APIs used are BSD-originated ones, versus the pure POSIX ones I used for the environment manipulation, but it's the same idea. Po-ta-to, po-tah-to.) It's enabled by setting RPM_FORCE_NPTL=1 in the environment, so you have to forcibly enable it rather than the switch always being on. It does correct the corruption problem I saw. (Before the install the RPM DB shows 698 packages; after, 768; deinstall goes back to 698; reinstall, up to 768; and so on. I repeated the install/deinstall procedure several times and verified it consistently registers cleanly.) So, it looks like this should resolve it, and it gives us a way to flip the switch to protect the RPM DB when RPM executables get called under the covers by a bunch of things. And since we have to document setting LD_ASSUME_KERNEL anyway for releases older than 5.1 to even invoke the install, it can be documented to set both values.
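For installers that drive rpm from shell scripts, the same store-off/reset idea can be approximated without a patched rpm by stripping LD_ASSUME_KERNEL from just the rpm child process. A minimal sketch, assuming a hypothetical wrapper script name:

#!/bin/sh
# rpm-nptl-wrapper.sh (hypothetical name): run rpm with LD_ASSUME_KERNEL removed
# from its environment, so the rpm child process uses NPTL even when the calling
# installer itself needs LinuxThreads. The caller's environment is untouched,
# since the change only affects this one child process.
exec env -u LD_ASSUME_KERNEL /bin/rpm "$@"

An installer could call such a wrapper in place of /bin/rpm; unlike the RPM_FORCE_NPTL patch, it only helps the rpm processes the installer launches itself.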
----- Additional Comments From khoa.com 2004-03-26 10:41 ------- Closing this bug with Jared's agreement.
AFAICT, this bug is fixed.
----- Additional Comments From khoa.com 2004-11-11 10:27 EDT ------- Per the call with WebSphere this morning, this bug has come back in RHEL3 U3. So we need to reopen this. Thanks.
Reopened.
----- Additional Comments From salina.com 2004-11-11 11:08 EDT ------- Stacy, Please reopen problem at RH. See Khoa's last comment. Thanks.
IBM: we need some more details. RHEL 3 U3 shipped in September.
The requirement for NPTL was taken out in U3, but the functionality did not change, and WAS signed off on that test, IIRC.
----- Additional Comments From markwiz.com 2004-11-12 12:01 EDT ------- This is a RHEL4 test that should also work for RHEL3 U4. We have a simpler reproducer for you. You'll need a clean install of RHEL 4, and the installers that were provided to you. In order to replicate the bug *without* re-installing RHEL 4, I advise you to tar up /var/lib/rpm prior to performing the steps listed (see the backup/restore sketch after this comment).
1) adduser mqm
2) adduser mqbrkrs
3) vigr ... add root to groups mqm and mqbrkrs in /etc/group and /etc/gshadow
4) cd /opt
5) tar -xzf WASTrialInstaller.tgz
6) ulimit -s 2048
7) export LC_CTYPE=$LANG
8) cd WASTrialInst/instdir/messaging
9) ./wmsetup install /tmp/mqlog
10) cd ..
11) mkdir ptfs
12) cd ptfs
13) tar -xzf was51_fp1_linux.tar.gz
14) cd fixpacks
15) jar xf was51_fp1_linux.jar
16) cd ptfs/was51_fp1_linux/components/external.mq/CSD/
17) chmod -R +x
18) export LD_ASSUME_KERNEL=2.4.19
19) ./wmservice install /tmp/mqupgradelog
20) unset LD_ASSUME_KERNEL
At this point, issue an rpm -qa > installed_packages. This *should* produce a DB_PAGE_NOTFOUND error.
Removal procedure:
1) rm /var/lib/rpm/__db*. This is a precaution, to prevent the corrupted caches from "infecting" the rpm db.
2) cd /opt/WASInst
3) rpm -qa --last > some_text_file
4) Edit some_text_file so that the only packages in the file are those containing the words MQSeries or wemps.
5) rpm -ev --noscripts `cat some_text_file`
6) cd /opt
7) rm -rf mqm wemps
8) cd /var
9) rm -rf mqm wemps
10) cd /usr/bin
11) symlinks -d .
This will completely clean your system, but will leave your rpm db in a state that will not permit you to replicate the issue. In order to replicate the issue, cd to /var/lib/rpm, rm __db*, and then untar the backup of /var/lib/rpm over the current set of files.
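A minimal sketch of the advised backup and restore around the reproducer, assuming no rpm processes are running; the archive path is just an example, not from the report:

#!/bin/sh
# Back up /var/lib/rpm before running the reproducer, restore it afterwards.
BACKUP=/root/rpmdb-backup.tar.gz        # example location

# Backup
tar -czf "$BACKUP" -C /var/lib rpm

# ... run the reproducer steps above ...

# Restore: drop the (possibly corrupted) db environment files first,
# then untar the saved copy over /var/lib/rpm.
rm -f /var/lib/rpm/__db.00?
tar -xzf "$BACKUP" -C /var/lib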
----- Additional Comments From jaredj.com 2004-11-12 17:11 EDT ------- Reproduced on my X86 2-way machine. This is incredibly strange.

./wmservice mqlog.txt
[root@arathorn CSD]# ./wmservice install /tmp/mqsetup.log
[root@arathorn CSD]# rpm -qa > somefile
[root@arathorn CSD]# echo $LD_ASSUME_KERNEL
2.4.19
[root@arathorn CSD]# unset LD_ASSUME_KERNEL
[root@arathorn CSD]# rpm -qa > somefile2
error: db4 error(-30988) from dbcursor->c_get: DB_PAGE_NOTFOUND: Requested page not found
[root@arathorn CSD]# export LD_ASSUME_KERNEL=2.4.19
[root@arathorn CSD]# rpm -qa > somefile3
[root@arathorn CSD]#

It barfs now ONLY when LD_ASSUME_KERNEL is unset after an install during which LD_ASSUME_KERNEL was set. This is from steps 18-20. If I reset LD_ASSUME_KERNEL, no DB_PAGE errors. If I unset it now, PAGE errors. And here is an interesting Bugzilla entry Victor just pointed me to: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=91933

More things I should point out that I find suspect (but I may just be reading it wrong...). If you pass in --disable-posixmutexes ... how does this code in rpmdb/db3.c know about it? It looks like if it can compute a posix mutex, it'll use it regardless. Here's the code I'm wondering about:

/*
 * Avoid incompatible DB_CREATE/DB_RDONLY flags on DBENV->open.
 */
if (dbi->dbi_use_dbenv) {
#if HAVE_LIBPTHREAD
    if (rpmdb->db_dbenv == NULL) {
        /* Set DB_PRIVATE if posix mutexes are not shared. */
        xx = db3_pthread_nptl();
        if (xx) {
            dbi->dbi_eflags |= DB_PRIVATE;
            rpmMessage(RPMMESS_DEBUG, _("unshared posix mutexes found(%d), adding DB_PRIVATE, using fcntl lock "), xx);
        }
    }
#endif
    if (access(dbhome, W_OK) == -1) {
        /* dbhome is unwritable, don't attempt DB_CREATE on ...

Then the function:

/**
 * Check that posix mutexes are shared.
 * @return 0 == shared.
 */
static int db3_pthread_nptl(void)
        /*@*/
{
    pthread_mutex_t mutex;
    pthread_mutexattr_t mutexattr, *mutexattrp = NULL;
    pthread_cond_t cond;
    pthread_condattr_t condattr, *condattrp = NULL;
    int ret = 0;

    ret = pthread_mutexattr_init(&mutexattr);
    if (ret == 0) {
        ret = pthread_mutexattr_setpshared(&mutexattr, PTHREAD_PROCESS_SHARED);
        mutexattrp = &mutexattr;
    }

    if (ret == 0)
        ret = pthread_mutex_init(&mutex, mutexattrp);
    ...

So, how is DB_PRIVATE always getting set when the disable was passed? I don't see how it could, given that code. So, this implies to me that --disable-posixmutexes is broken from a configure point of view; it's *not* disabling it all the way. If I'm wrong in my analysis of that db3.c file, please let me know.
----- Additional Comments From jaredj.com 2004-11-12 17:13 EDT ------- Note that the RPM code snippet included is from the RPM source packages of RHEL 4, Beta 2. I would think RHEL 3 is probably similar.
Yes, RHEL3 is very similar. What you miss is that shared posix mutexes are only available through the test above if/when NPTL is functional and enabled. Setting LD_ASSUME_KERNEL makes shared posix mutexes unavailable. The test is exactly what db4 does while checking flags. Setting DB_PRIVATE is not at all the right thing to do, because there is no locking whatsoever with DB_PRIVATE set. The intent was to have something functional when a non-Red Hat (and hence non-NPTL) 2.4 kernel was used. This is a rather small corner case for the RHEL/FC product. Backing up and designing for a goal other than "Make it work." is perhaps better than mucking about with various locking schemes. I can easily instrument a WAS-specific pathway in rpm that takes out a dirt-simple fcntl exclusive lock on Packages and avoids the complexities involved with concurrent access locking and NPTL, if you wish. AFAIK, that is not an acceptable solution to IBM.
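The outcome of that check is visible in rpm's own -vvv debug output (the "unshared posix mutexes found ... adding DB_PRIVATE, using fcntl lock" line in the earlier logs). A quick way to compare the two modes, assuming the debug line is emitted when the db environment is opened, as in the logs above:

# Under LinuxThreads: the DB_PRIVATE/fcntl path is expected.
LD_ASSUME_KERNEL=2.2.5 rpm -vvv -qa > lt.log 2>&1
grep "posix mutexes" lt.log

# Under NPTL (LD_ASSUME_KERNEL not set): no such message is expected,
# since shared posix mutexes are available.
rpm -vvv -qa > nptl.log 2>&1
grep "posix mutexes" nptl.log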
----- Additional Comments From jaredj.com 2004-11-15 17:06 EDT ------- Note: There is an error in one of the messages my test script emits: "Deleting the /var/lib/__db.00?" should be "Deleting the /var/lib/rpm/__db.00?". The script does delete the 001, 002, and 003 files individually in /var/lib/rpm; only the text is in error.
Created attachment 106824 [details] "simple_failure_test.tar.gz"
----- Additional Comments From jaredj.com 2004-11-15 17:03 EDT ------- Dirt simple reproduction testcase. Here's a dirt simple reproduction. I've been able to remove WAS completely from the equation and still reproduce failures. This is the run from RHEL3, U4 Beta. Note the RPMs included in this testcase are dirt simple, and the binaries were built with RPM v4 from RHEL 3 and RHEL 4, so it removes the concern that the MQ rpms being built with RPM v3 is the cause of the failure. My run results:

./test_install
Backing up /var/lib/rpm to: /backup_rpm.tar.gz ...
Backup completed
Setting LD_ASSUME_KERNEL to 2.4.19
Set LD_ASSUME_KERNEL to: 2.4.19
Installing test_passthrough RPM
Pre script... LD_ASSUME_KERNEL = 2.4.19
Post script... LD_ASSUME_KERNEL = 2.4.19
Completed the install of the passthrough test RPM.
Running 'rpm -qa' and piping to /dev/null. Should be no errors on stderr.
rpm-qa run complete.
Total # of RPMS in RPM DB: 718
Unsetting LD_ASSUME_KERNEL...
LD_ASSUME_KERNEL is now:
Running 'rpm -qa' and piping to /dev/null. There will likely be an error on stderr regarding DB_PAGE_NOTFOUND
error: db4 error(-30988) from dbcursor->c_get: DB_PAGE_NOTFOUND: Requested page not found
rpm-qa run complete. Should be a failure.
Now, installing another rpm without LD_ASSUME_KERNEL set. Note that a failure DOES NOT EMIT! It will 'install' fine, and intruduce corruption into the RPMDB from the caches.
Pre script... LD_ASSUME_KERNEL not set!
Post script... LD_ASSUME_KERNEL not set!
Install complete.
Re-running package count. You WILL see an error here. But now it's too late, RPMDB is corrupted.
error: rpmdbNextIterator: skipping h# 718 blob size(1568): BAD, 8 + 16 * il(41) + dl(916)
New RPM package count is: 718
Deleting the /var/lib/__db.00? files to see if it is only cache issue corruption...
Deletion done.
Re-running package count. You WILL see an error here. But now it's too late, RPMDB is corrupted.
error: rpmdbNextIterator: skipping h# 718 blob size(1568): BAD, 8 + 16 * il(41) + dl(916)
New RPM package count is: 718
Note that even with the deletion of the /var/lib/__db.00? files, corruption is permanent. Futher calls to rpm produce failures.
Testrun complete. Please restore /backup_rpm.tar.gz over /var/lib/rpm to repair RPMDB corruption that occured.
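The essential sequence that output shows can be sketched in a few lines of shell; this is a reconstruction from the output above rather than the attached script, and the test package path is a placeholder:

#!/bin/sh
# Sketch of the failure sequence shown above (not the attached test_install script).
TESTPKG=./test_passthrough-1.0-0.noarch.rpm     # placeholder for a trivial test RPM

tar -czf /backup_rpm.tar.gz -C /var/lib rpm     # back up the RPM DB first

export LD_ASSUME_KERNEL=2.4.19                  # LinuxThreads mode
rpm -Uvh "$TESTPKG"                             # install under LinuxThreads
rpm -qa > /dev/null                             # still fine here

unset LD_ASSUME_KERNEL                          # switch back to NPTL
rpm -qa > /dev/null                             # expect DB_PAGE_NOTFOUND on stderr

rpm -Uvh --replacepkgs "$TESTPKG"               # "succeeds", but corrupts the DB
rpm -qa | wc -l                                 # expect rpmdbNextIterator errors now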
----- Additional Comments From jaredj.com 2004-11-15 15:32 EDT ------- Note my last comment was for RHEL 3, U4 beta. It fails there.
This should be fixed in rpm-4.2.3-16_nonptl headed for RHEL3U5.
---- Additional Comments From khoa.com 2005-02-20 17:03 EST ------- Jared - please verify this when U5 is available and close this bug report if possible. Thanks.
changed:
           What    |Removed    |Added
----------------------------------------------------------------------------
           Status  |ACCEPTED   |CLOSED
           Impact  |------     |Functionality

------- Additional Comments From vjo.com 2005-03-21 23:21 EST ------- Tested on RHEL 3 U5 beta 1 - issue has been resolved.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2005-147.html
This bug is still around for me on RHEL 4 -- I ran into a repeatable corrupt rpmdb problem only when "LD_ASSUME_KERNEL=2.4.19" is set. This is a production machine and I can't do much testing, but when LD_ASSUME_KERNEL was set in /etc/profile the rpmdb would get corrupted every time I loaded an rpm or ran up2date. After recovery I removed the LD_ASSUME_KERNEL setting and everything went fine. Putting it back in again resulted in corruption. I'm on rpm-4.3.3-7_nonptl and kernel-smp-2.6.9-5.0.5.EL and all the stuff needed for Oracle 9i.
Regarding comment #98: Please describe exactly the error message you are seeing. Are you seeing: error: db4 error(-30988) from dbcursor->c_get: DB_PAGE_NOTFOUND: Requested page not found Do you get this purely when using rpm, or only when using both rpm and up2date? Does LD_ASSUME_KERNEL=2.4.0 work correctly for you (old LinuxThreads)?
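A sketch of how those questions could be answered without risking the production database, using a throwaway copy of /var/lib/rpm via --dbpath (illustrative only, not from the report):

#!/bin/sh
# Work against a copy of the DB so the production /var/lib/rpm is untouched.
cp -a /var/lib/rpm /root/rpmdb-test

# Is LD_ASSUME_KERNEL being set globally?
grep LD_ASSUME_KERNEL /etc/profile

# Does plain rpm (no up2date involved) show the db4 error under each setting?
LD_ASSUME_KERNEL=2.4.19 rpm --dbpath /root/rpmdb-test -qa > /dev/null
LD_ASSUME_KERNEL=2.4.0  rpm --dbpath /root/rpmdb-test -qa > /dev/null
rpm --dbpath /root/rpmdb-test -qa > /dev/null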