The following has been reported by IBM LTC: LTC Bug#3765 - RPM problems running under LinuxThreads

Hardware Environment: Dual processor (550MHz), PIII, SCSI hard drives, 1GB RAM.
Software Environment: RHEL 3.0 Beta 1, WS edition

RPM behavior on RHEL 3.0 is bad under LinuxThreads mode. Easy reproduction can be shown using rpmbuild.

Steps to Reproduce:
1. Make a directory rpm_bug.
2. Create a dummy RPM spec named DummyRpm-1.0.spec, such as:

Name: DummyRpm
Version: 1.0
Release: 0
Summary: Simple empty RPM testcase, Doesn't have any files.
License: IBM 2002
Group: System Environment/Daemons
AutoReqProv: Yes
%description
%files

3. Create a script:

#!/bin/sh
rc=0
while [ "$rc" = "0" ]
do
  rpmbuild --define '_rpmdir ./' --define '_rpmfilename 1315054.tmp' -bb $PWD/DummyRpm-1.0.spec
  rc=$?
done

4. export LD_ASSUME_KERNEL=2.4.0
5. Run the script.

Actual Results: Script/rpmbuild crashes after some random number of runs.
Expected Results: Should run until user-interrupted.

Additional Information: Running 'rpm -i' on a package has varying results. Sometimes it works, sometimes it crashes, sometimes it corrupts the system RPM DB. This is a significant concern for several reasons. Any product which requires LinuxThreads to work, and calls RPM under the covers, will have issues with this. I can name two products directly affected: ISMP (InstallShield MultiPlatform), which uses both rpm and rpmbuild to do software registration, and Caching Proxy (from WebSphere Edge), whose installer uses rpm to install the various packages the user selects. Also, any application bundled in RPM format that needs LinuxThreads to run and, as part of the post-install, executes part of itself to configure or whatnot, will have problems (as currently the only way to set LD_ASSUME_KERNEL for existing install rpms is to set it on the command line before calling RPM). So, this will likely affect a good many existing products that run on RHEL 2.1 today and are expected to be able to run on RHEL 3.0 in LinuxThreads mode.
The above procedure (slightly modified) runs fine for me. If LD_ASSUME_KERNEL is involved, then this is going to be a glibc and/or kernel, not rpm, issue. Reopen (and reassign to kernel/glibc) with the version/release of kernel/glibc that you are testing against.
------- Additional Comment #6 From Jared P. Jurkiewicz 2003-08-20 15:48 ------- I can still produce the failure using export LD_ASSUME_KERNEL values of 2.2.5 and 2.4.0. [root@arathorn test]# uname -a Linux arathorn.raleigh.ibm.com 2.4.21-1.1931.2.393.entsmp #1 SMP Wed Aug 13 21:51:41 EDT 2003 i686 i686 i386 GNU/Linux [root@arathorn test]# export LD_ASSUME_KERNEL=2.4.0 [root@arathorn test]# ./repr.sh Processing files: DummyRpm-1.0-0 Checking for unpackaged file(s): /usr/lib/rpm/check-files %{buildroot} Wrote: ./1315054.tmp Processing files: DummyRpm-1.0-0 Checking for unpackaged file(s): /usr/lib/rpm/check-files %{buildroot} Wrote: ./1315054.tmp Processing files: DummyRpm-1.0-0 Checking for unpackaged file(s): /usr/lib/rpm/check-files %{buildroot} Wrote: ./1315054.tmp Processing files: DummyRpm-1.0-0 Checking for unpackaged file(s): /usr/lib/rpm/check-files %{buildroot} Wrote: ./1315054.tmp Processing files: DummyRpm-1.0-0 Checking for unpackaged file(s): /usr/lib/rpm/check-files %{buildroot} Wrote: ./1315054.tmp ./repr.sh: line 9: 5405 Segmentation fault (core dumped) rpmbuild -- define '_rpmdir ./' --define '_rpmfilename 1315054.tmp' -bb $PWD/DummyRpm- 1.0.spec [root@arathorn test]# Also, the 2.2.5 export also fails. GDB examination of core: [root@arathorn test]# gdb /usr/bin/rpmbuild core.5301 GNU gdb Red Hat Linux (5.3.90-0.20030710.14rh) Copyright 2003 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "i386-redhat-linux-gnu"...(no debugging symbols found)...Using host libthread_db library "/lib/libthread_db.so.1". Core was generated by `rpmbuild --define _rpmdir ./ --define _rpmfilename 1315054.tmp -bb /root/test/D'. Program terminated with signal 11, Segmentation fault. Reading symbols from /usr/lib/librpmbuild-4.2.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/librpmbuild-4.2.so Reading symbols from /usr/lib/librpm-4.2.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/librpm-4.2.so Reading symbols from /usr/lib/librpmdb-4.2.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/librpmdb-4.2.so Reading symbols from /usr/lib/librpmio-4.2.so...(no debugging symbols found)...done. Loaded symbols for /usr/lib/librpmio-4.2.so Reading symbols from /usr/lib/libpopt.so.0...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libpopt.so.0 Reading symbols from /usr/lib/libelf.so.1...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libelf.so.1 Reading symbols from /usr/lib/libbeecrypt.so.6...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libbeecrypt.so.6 Reading symbols from /lib/librt.so.1...(no debugging symbols found)...done. Loaded symbols for /lib/librt.so.1 Reading symbols from /lib/libpthread.so.0...(no debugging symbols found)...done. Loaded symbols for /lib/libpthread.so.0 Reading symbols from /usr/lib/libbz2.so.1...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libbz2.so.1 Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done. Loaded symbols for /lib/libc.so.6 Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done. Loaded symbols for /lib/ld-linux.so.2 Reading symbols from /lib/libnss_files.so.2...(no debugging symbols found)...done. 
Loaded symbols for /lib/libnss_files.so.2 #0 0x00337d3f in domd5 () from /usr/lib/librpmdb-4.2.so (gdb) info stack #0 0x00337d3f in domd5 () from /usr/lib/librpmdb-4.2.so #1 0x00e8c6d7 in rpmAddSignature () from /usr/lib/librpm-4.2.so #2 0x00775866 in writeRPM () from /usr/lib/librpmbuild-4.2.so #3 0x00776443 in packageBinaries () from /usr/lib/librpmbuild-4.2.so #4 0x0076cd6c in buildSpec () from /usr/lib/librpmbuild-4.2.so #5 0x0804a2b1 in ?? () #6 0x083a4418 in ?? () #7 0x083a5ea8 in ?? () #8 0x0000009f in ?? () (gdb) and in case they want to know: [root@arathorn test]# cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 7 model name : Pentium III (Katmai) stepping : 3 cpu MHz : 549.067 cache size : 512 KB physical id : 0 siblings : 1 runqueue : 0 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse bogomips : 1094.45 processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 7 model name : Pentium III (Katmai) stepping : 3 cpu MHz : 549.067 cache size : 512 KB physical id : 0 siblings : 1 runqueue : 1 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse bogomips : 1097.72 And: [root@arathorn test]# free total used free shared buffers cached Mem: 1028548 1008200 20348 0 132596 598684 -/+ buffers/cache: 276920 751628 Swap: 634528 208 634320 Incidentally: 2.4.1 works. The problem seems to lie in the non-floating stack based glibc. <floating stack setup glibc, LinuxThreads based> [root@arathorn test]# export LD_ASSUME_KERNEL=2.4.1 [root@arathorn test]# ldd /usr/bin/rpmbuild librpmbuild-4.2.so => /usr/lib/librpmbuild-4.2.so (0x008fd000) librpm-4.2.so => /usr/lib/librpm-4.2.so (0x00855000) librpmdb-4.2.so => /usr/lib/librpmdb-4.2.so (0x00ccd000) librpmio-4.2.so => /usr/lib/librpmio-4.2.so (0x00175000) libpopt.so.0 => /usr/lib/libpopt.so.0 (0x00fc4000) libelf.so.1 => /usr/lib/libelf.so.1 (0x00681000) libbeecrypt.so.6 => /usr/lib/libbeecrypt.so.6 (0x00655000) librt.so.1 => /lib/i686/librt.so.1 (0x00111000) libpthread.so.0 => /lib/i686/libpthread.so.0 (0x001b3000) libbz2.so.1 => /usr/lib/libbz2.so.1 (0x00645000) libc.so.6 => /lib/i686/libc.so.6 (0x00283000) /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x0026c000) <Non-floating stack setup, LinuxThreads based> [root@arathorn test]# export LD_ASSUME_KERNEL=2.4.0 [root@arathorn test]# ldd /usr/bin/rpmbuild librpmbuild-4.2.so => /usr/lib/librpmbuild-4.2.so (0x00d52000) librpm-4.2.so => /usr/lib/librpm-4.2.so (0x00636000) librpmdb-4.2.so => /usr/lib/librpmdb-4.2.so (0x00711000) librpmio-4.2.so => /usr/lib/librpmio-4.2.so (0x00bda000) libpopt.so.0 => /usr/lib/libpopt.so.0 (0x00111000) libelf.so.1 => /usr/lib/libelf.so.1 (0x00e22000) libbeecrypt.so.6 => /usr/lib/libbeecrypt.so.6 (0x002e0000) librt.so.1 => /lib/librt.so.1 (0x009b9000) libpthread.so.0 => /lib/libpthread.so.0 (0x00119000) libbz2.so.1 => /usr/lib/libbz2.so.1 (0x008a2000) libc.so.6 => /lib/libc.so.6 (0x0016b000) /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x0087f000) [root@arathorn test]# contents of repr.sh: #!/bin/sh rc=0 while [ "$rc" = "0" ] do rpmbuild --define '_rpmdir ./' --define '_rpmfilename 1315054.tmp' -bb $PWD/DummyRpm-1.0.spec rc=$? 
done Contents of DummyRpm-1.0.spec: Name: DummyRpm Version: 1.0 Release: 0 Summary: Simple empty RPM testcase, Doesn't have any files. License: IBM 2002 Group: System Environment/Daemons AutoReqProv: Yes %description %files ------- Additional Comment #7 From Khoa D. Huynh 2003-08-20 15:55 ------- Jared - did you try LD_ASSUME_KERNEL=2.4.1 ? ------- Additional Comment #8 From Jared P. Jurkiewicz 2003-08-20 16:00 ------- Look under the last section of my comment (Incidentally). I note 2.4.1 does work, and that narrows down the problem area to the pthreads non-floating-stack glibc, which is what loads when you use 2.2.5 or 2.4.0. 2.4.1 gives you the pthreads floating-stack glibc. We do have older programs that need the non-floating-stack version, which could run into problems here (and, incidentally, are supported on RHEL 2.1). WAS 4.0.1, for example, needs it. JDK 1.3.0 would not work without that parameter being set (2.2.5). So, the base install for V4 will potentially have problems on RHEL 3.0 (and I'm pretty sure someone will ask us to support that). ------- Additional Comment #9 From Jared P. Jurkiewicz 2003-08-20 16:07 ------- Addendum: In terms of V4, we picked up support for RHEL 2.1 in 4.0.5, which is JDK 1.3.1 based. However, the customer has to install 4.0.1 first (the full install image) before the update to 4.0.5 can be done. Which means we need rpm (which is invoked to install IHS and GsKit) to not have the potential for destroying the OS. Of course, all this is moot if we don't have to support RHEL 3.0 with WAS v4. But, to be on the safe side, we should probably try and get this addressed. ------- Additional Comment #10 From Jared P. Jurkiewicz 2003-08-20 16:08 ------- Addendum to addendum: destroying the RPM repository of the OS, rather, which then makes RPM updates and such hard to do, as dependencies won't be found, etc.
This is not an rpm problem afaict. Reopen and assign to glibc if you wish.
------ Additional Comments From khoa.com 2003-21-08 17:25 ------- Glen/Greg - can you reopen the bug in RH Bugzilla and assign it to glibc ? Thanks.
------ Additional Comments From jaredj.com 2003-25-08 11:52 ------- Just found out this morning ... we need LD_ASSUME_KERNEL=2.4.0 (and 2.2.5) to work in order to get WAS V5.0.0 to install the baseline WAS. JDK 1.3.1 will segfault under LD_ASSUME_KERNEL=2.4.1 or greater. So, rpm having issues under the older glibc is dangerous, as rpm is called to install things like embedded MQ, GsKit, et cetera. -- Jared Jurkiewicz WebSphere AppServer Development
From what I can see, this is just that rpm, rpmbuild and maybe other rpm programs eat too much stack. linuxthreads non-FLOATING_STACKS (i.e. LD_ASSUME_KERNEL <= 2.4.0) on IA-32 limits the stack to 2MB (that's the size of the stack slots assigned to each thread), including the initial thread (this is nothing new, it has been like this since the beginning). If I ulimit -s 2048, I can get it to segfault with all of LD_ASSUME_KERNEL 2.2.5, 2.4.1 and without LD_ASSUME_KERNEL (i.e. lt, ltfs, nptl). But, when debugging rpmbuild, it seems to eat something like 0x46000 bytes of stack (difference between $esp in __libc_start_main and at the point of segfault). To me this looks as if kernel stack randomization eats from RLIMIT_STACK (this would explain why the segfaults aren't reproducible in every run).
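To make the numbers above easy to check, here is a small probe (my illustration, not part of the original report) that prints RLIMIT_STACK and then burns stack in 4KB chunks until the process takes a SIGSEGV; the last line printed approximates the usable initial-thread stack. Comparing runs with LD_ASSUME_KERNEL=2.4.0, 2.4.1, and unset, and under ulimit -s 2048, should show the limits described above. The file name and chunk size are arbitrary, and linking -lpthread is assumed so the threading library selected by LD_ASSUME_KERNEL is actually loaded (as it is for rpmbuild).

/* stackprobe.c - compile with: gcc -O0 -o stackprobe stackprobe.c -lpthread */
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

static char *base;                       /* address of a local in main() */

static void eat(void)
{
    volatile char pad[4096];             /* consume roughly 4KB of stack per call */

    pad[0] = 1;
    pad[sizeof(pad) - 1] = 1;
    fprintf(stderr, "used ~%ld bytes\n", (long)(base - (char *)&pad[0]));
    eat();                               /* recurse until the stack runs out */
    pad[0] = 2;                          /* keeps the call above from being tail-call optimized */
}

int main(void)
{
    char anchor;
    struct rlimit rl;

    base = &anchor;
    if (getrlimit(RLIMIT_STACK, &rl) == 0)
        fprintf(stderr, "RLIMIT_STACK soft limit: %ld\n", (long)rl.rlim_cur);
    eat();
    return 0;
}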
------ Additional Comments From jaredj.com 2003-10-09 11:03 ------- In addition, on PPC architectures, if I set LD_ASSUME_KERNEL=2.4.19 I see RPM DB corruption there as well. I've had to reload my machine a couple of times because of it. This setting is needed because we have programs that get installed under the covers via RPM that require LinuxThreads to properly install (they run post-setup scripts and such, calling executables that require LinuxThreads behaviors). While a workaround could be done by backing up the RPM DB, installing the program, then restoring the DB, I don't think general customers will find that acceptable. A secondary bug can be opened on PPC, but it looks like this is more of a general RPM issue or somesuch.
is RPM embedded/nested underneath ? Why not clear the environment variable before you do this ?
------ Additional Comments From jaredj.com 2003-12-09 14:16 ------- Because the RPM itself invokes scripts that require LinuxThreads, and we can't regenerate the RPM to do checks and sets of the env variables. Also, the ISMP PPKs for Linux do RPM registration, which generates dummy rpms to insert and install, and those too get this variable set when you have to set it for the JVM.
LD_ASSUME_KERNEL=$VALUE, where $VALUE is less than or equal to 2.4.19, will disable ALL locking in RPM. Any concurrent database access will corrupt the database. NPTL is needed for the concurrent locking mechanism, which is needed to do things like invoke rpm from %post scripts. If a script in %post is calling a program (such as a JVM) which requires LinuxThreads, it should set the LD_ASSUME_KERNEL environment variable in the %post script itself. Applications that require LD_ASSUME_KERNEL values to turn on LinuxThreads support, and that invoke rpm as a child process, should clear LD_ASSUME_KERNEL from the environment before exec()ing rpm.
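A minimal sketch of that last point (illustrative code, not anything shipped with rpm or the installers discussed here): the wrapper keeps LD_ASSUME_KERNEL in its own environment for the JVM it needs, but strips the variable from the child environment before exec()ing rpm, so rpm itself runs under NPTL. The rpm arguments and package name are hypothetical.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Fork, drop LD_ASSUME_KERNEL in the child only, then exec rpm. */
static int run_rpm_without_ld_assume(char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        unsetenv("LD_ASSUME_KERNEL");   /* child only: rpm sees the NPTL glibc */
        execvp("rpm", argv);
        perror("execvp rpm");
        _exit(127);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

int main(void)
{
    /* Hypothetical package, for illustration only. */
    char *const argv[] = { "rpm", "-i", "DummyRpm-1.0-0.i386.rpm", NULL };
    return run_rpm_without_ld_assume(argv);
}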
------ Additional Comments From jaredj.com 2003-12-09 15:58 ------- If that's the case and you cannot fix it, then you will have serious issues from companies other than IBM. There are programs people may want to install that require LinuxThreads to do post-install configuration work, and whose RPM/install images cannot be recreated. So, the post-install scripts can't be updated. This, in fact, will break any program using ISMP with a JDK that requires LinuxThreads. Their calls to rpm do not unset that variable, and therefore applications that would install on RHEL 2.1 will now corrupt the DB of RHEL 3 when they're installed.
On the bright side, the number of apps that don't work with NPTL is pretty low, since most applications seem to have depended on POSIX behavior rather than implementation details. I realize that's not helping this specific case, though.
Having said that, the 2MB stack issue is something that will be fixed by the kernel; the recursion issue is obviously beyond the kernel's scope.
Either add LD_ASSUME_KERNEL to the %post scriptlet or use a version of rpm compiled w/o --enable-posixmutexes. I see no other solution for rpm, hence WONTFIX.
corruption should only occur if you have multiple database accesses. db4 is only designed to support one locking mechanism. In our case, we select posix mutexes. db4 depends on the PTHREAD_PROCESS_SHARED, which is only implemented in NPTL. So rpm's locking mechanism only works with NPTL. Note: /usr/lib/rpm/rpmi is a statically linked NPTL application that could be used, but if a newer version of glibc is installed the NSS modules will be incompatible with the interfaces used in the static binary. This makes it unsuitable as a solution.
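For reference, a small standalone check (my sketch, not rpm's or db4's code) of the NPTL dependency described above: db4's posix-mutex locking relies on mutexes that work across processes, which is what PTHREAD_PROCESS_SHARED requests, and asking for it fails under LinuxThreads while succeeding under NPTL.

#include <pthread.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t mutex;
    int rc;

    pthread_mutexattr_init(&attr);
    /* Cross-process mutexes are what db4's posix-mutex locking needs. */
    rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    if (rc != 0) {
        /* Expected under LinuxThreads (e.g. LD_ASSUME_KERNEL=2.2.5 or 2.4.0). */
        printf("process-shared mutexes unavailable: %s\n", strerror(rc));
        return 1;
    }
    pthread_mutex_init(&mutex, &attr);
    printf("process-shared mutexes available (NPTL)\n");
    pthread_mutex_destroy(&mutex);
    pthread_mutexattr_destroy(&attr);
    return 0;
}

Compile with gcc -o pshared pshared.c -lpthread and run with and without LD_ASSUME_KERNEL set to compare the two threading libraries.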
------ Additional Comments From jaredj.com 2003-18-09 11:52 ------- main() -> do a getenv("LD_ASSUME_KERNEL"); if not null, note that, store off whether it selects LinuxThreads or not, do something like a putenv("SET_LINUXTHREADS_SCRIPTS_<PROCESS ID OF THIS RPM PROCESS>=TRUE"), then clear out the LD_ASSUME_KERNEL variable. Then do an execvp() of the same command line that was used to invoke RPM, with the flag now set to pass LinuxThreads settings to scripts (but, without LD_ASSUME_KERNEL, it should use NPTL for RPM itself). When it does the script calls, it looks to see if the flag says to set LinuxThreads, sets that in the environment if it is supposed to be set, then execvp's the script. I'll try to provide a code example of what I mean (not an RPM patch, but a simple example) to see if this is feasible.
------ Additional Comments From jaredj.com 2003-18-09 14:19 ------- Here's the quick code example to better explain what I mean. It's by no means robust or well written (or bug free, for that matter!). But, it shows the idea:

test_varswap.c:

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>   /* for strstr()/memset() */

extern char ** environ;

/**
 * Quick function to just replace the LD_ASSUME_KERNEL entry with another special one
 * so we can 'pass through' the LD_ASSUME to underlying programs we fork, but we depend on
 * it not being set.
 */
char ** replaceLDAssume(char ** envArray, char* replacementVal)
{
    int arraycount, pos;
    char ** newEnv;

    newEnv = NULL;
    arraycount = 0;
    if (environ != NULL)
    {
        /**
         * Figure out how big the env is, then malloc an array of the same size.
         */
        while (environ[arraycount] != NULL)
        {
            arraycount++;
        }
        newEnv = (char**)malloc((arraycount + 1) * sizeof(char*));
        memset(newEnv, 0, (arraycount + 1) * sizeof(char*));

        /**
         * Clone and replace.
         */
        for (pos = 0; pos < arraycount; pos++)
        {
            if (strstr(environ[pos], "LD_ASSUME_KERNEL") == NULL)
            {
                newEnv[pos] = environ[pos];
            }
            else
            {
                newEnv[pos] = replacementVal;
            }
        }
    }
    else
    {
        newEnv = environ;
    }
    return newEnv;
}

void print_array(char** env)
{
    int arraycount, pos;

    arraycount = 0;
    if (env != NULL)
    {
        while (environ[arraycount] != NULL)
        {
            printf("%s ", env[arraycount]);
            arraycount++;
        }
    }
}

int main(int argc, char** argv)
{
    char ** newEnv;
    char swapmodeVar[256];
    char compatVar[256];
    char** newargs;
    int count;
    pid_t pid;
    char* script[1];

    script[0] = "./testscript.sh";
    count = 0;
    if (getenv("LD_ASSUME_KERNEL") != NULL)
    {
        printf("LD_ASSUME_KERNEL detected. Removal from env initiated ");
        /**
         * Generate passthrough variable! We save the pid as part of the var, so it can be isolated
         * to one process, this one. Only this one will recognise it on lookup.
         */
        snprintf(swapmodeVar, 255, "%s_%d=%s", "PASSTHRU_LD_ASSUME", getpid(), getenv("LD_ASSUME_KERNEL"));
        swapmodeVar[255] =
there is no way that a change like this can go in for RC/GM
Other than WebSphere 4.0 and 5.0 (not sure which pieces of WS), what other IBM software is broken by invoking rpm under the covers like this ? Any Tivoli applications ?
------ Additional Comments From jaredj.com 2003-19-09 11:23 ------- Can it be fixed by the first service pack for RHEL 3.0, then? Even if it's not fixed in the GM/RC, it doesn't mean the fix shouldn't be investigated. It would mean we could at least do something by early next year for our customers on older releases, who were on say, RedHat 7.2 on s390.
------ Additional Comments From greg_kelleher.com 2003-26-09 14:58 ------- RH is going to try to fix this in RHEL 3 Update 1.
Greg, We are going to investigate it for the update.
------ Additional Comments From khoa.com 2003-15-10 19:52 ------- The corresponding RH bug (RH Bug 101603) is closed as WONTFIX. We need to re-open that RH bug report and need some status update. Thanks.
------ Additional Comments From jaredj.com 2003-16-10 11:55 ------- This is a significant concern for several reasons. Any product which requires LinuxThreads to work, and calls RPM under the covers, will have issues with this. I can name two products directly affected: ISMP (InstallShield MultiPlatform), which uses both rpm and rpmbuild to do software registration, and Caching Proxy (from WebSphere Edge), whose installer uses rpm to install the various packages the user selects. Also, any application bundled in RPM format that needs LinuxThreads to run and, as part of the post-install, executes part of itself to configure or whatnot, will have problems (as currently the only way to set LD_ASSUME_KERNEL for existing install rpms is to set it on the command line before calling RPM). So, this will likely affect a good many existing products that run on RHEL 2.1 today and are expected to be able to run on RHEL 3.0 in LinuxThreads mode. The reason this is important to the business is customer migration/legacy support scenarios. There are cases where a customer may try to install an older application or whatnot, and to do so sets LinuxThreads (to, say, invoke an older JVM which under the covers installs rpms or what have you). When this happens the customer is running in a state where they can corrupt the operating system repository. If that gets corrupted, the RHN update code won't function right, nor will RPM dependency resolution and so forth. The concern really is for legacy applications and the end user experience. If the OS DB gets corrupted, that's bad for both the vendor providing the application and for RedHat. From a WebSphere AppServer perspective: while the next revision of WebSphere (currently in development) should not be affected, all versions in the field are. I've been working with management here to push that we won't support anything older than our upcoming next release, but that could potentially be a hard sell for platforms other than x86 hardware. We have a chunk of customers on 390 hardware and RedHat 7.2 that I've heard want 5.0 support, and potentially 4.0 support on RHEL 3 (since they were running applications in production on RedHat 7.2 and do not want to move to the next version of WebSphere when it releases). With RPM acting the way it is, that makes the support extremely dangerous and potentially very expensive.
Business case added at Bob Johnson's request; IBM SWG would like this considered for Update 1.
Jared, Also another question - I need to know exactly what the execution paths from WAS to rpm are. As long as the WAS installer is the only thing on the system doing rpm db commits there is no problem. If WAS is calling rpm recursively, there's no way that will work without concurrent database access, which requires NPTL.
------ Additional Comments From jaredj.com 2003-05-11 12:29 ------- Bob, I have no idea, as some of the rpms come from other products, and I don't know what they do internally. That's why I'd like to see RPM behave properly; the patch I sent, which forces RPM into NPTL mode always, should take care of that. My patch should also handle RPM calls inside RPMs, or whatnot: it tags the passthrough variables with the PID of the RPM process, and the sub-processes (the scripts) use the ppid to find that, restore it, and so on. But if RPM is invoked inside an rpm script, it'll detect the LinuxThreads setting, store it back out, set a new variable in the env with a new pid, and so on. With the way it's passed in my patch, it can't really inherit when it shouldn't, from what I can tell.
I have not been able to crash rpm with any LD_ASSUME_KERNEL variable value with the test case provided. Your patch does not handle rpm calls inside rpms.
Rather, rpm calls in the %post of a package may work with your approach. But this is all trying to handle a hypothetical case where you MAY have two writers to the rpm database. If your packages don't call rpm in %post, and you don't invoke two package installs in parallel, you're OK. I don't think that the segfault that was reported initially actually happens with gold code, and if that was the basis for the concern I think you need to reassess the severity.
------ Additional Comments From jaredj.com 2003-05-11 13:39 ------- Concurrent RPM installs, one under LinuxThreads, will happen at some point. It's a very real problem. If this isn't fixed, older versions of WAS will simply not be supported on RHEL 3.
This does not make sense. It has nothing to do with the proper operation of WAS on RHEL 3. WAS will install and function properly on RHEL 3.
------ Additional Comments From jaredj.com 2003-05-11 13:58 ------- I can give an easy example of such a situation. Someone is, say, running an old WAS install (5.0), which requires LinuxThreads to start the 1.3.1 JDK. During the install, they decide to run the up2date program and pick up the latest patches too. That kicks off. WAS hits the section where it does MQ/GsKit/registration code and starts up RPM to install those parts. At the same time, up2date engages installs of new RPMs for patches. At the end of the install (perhaps after a system reboot), everything initially will seem okay. Later, the user runs up2date to get more patches (say a week or month later) and starts getting errors due to RPM DB corruption (packages missing, and so on). I think people generally assume they can run a few processes in parallel like that on Linux. It's not Windows, where things tend to get flaky if you do more than one install at a time. :) Or it could be that users have kicked off a large blanket install of a lot of programs, one of which requires LinuxThreads, and the installs are running in parallel (such as using make's parallel execution ability to automate something). Kaboom. Or, what if a user had LinuxThreads set in a window because they were running an older program which needed it, then goes to run an RPM install. At the same time, another user logged in decides to install something (yes, two people installing stuff on the same box isn't good from a security/management point of view, but I'm certain it goes on). Again, likely corruption. Not all these cases are hypothetical. As a general user, before the problem was identified, I lost my machine a few times due to the RPM corruption that happened. If it happened to me, it'll happen to others. I'm really not trying to be a pain here. I feel this is a very real, and very serious, problem. This is a legacy support scenario. I spent time digging through the RPM code to come up with a possible fix (and I'm no expert on RPM code; that was the first time I'd ever looked at it), to help provide possible solutions that wouldn't be invasive. I don't think the concept in my patch is invasive at all, though I cannot claim it to be perfect as I do not know the RPM code well, and I don't regularly program in C (I like C, but due to my job I don't get any sort of extended time to keep up to date and keep my knowledge active with it), just now and then. I wrote it as a very quick example. If there are problems with it, I'd appreciate hearing what they are, so I can learn and better understand the scenarios my patch doesn't cover, and better analyze and understand things in the future.
------ Additional Comments From jaredj.com 2003-05-11 14:22 ------- I'm hoping to get this fixed to save both IBM and RedHat customer support calls. It looks bad to install a program and, given a certain scenario of possibilities, have the machine end up corrupted such that the updater program won't work, RPM dependency resolution fails, and so on. I *truly* feel it would benefit both our companies to make sure RPM tolerates this sort of scenario safely. Neither of us would want a high-paying, large customer coming to us upset that their machine is messed up just because they installed an application and happened to do another install in parallel, or somesuch.
there is a best-effort amount of locking that is attempted between NPTL rpm and non-NPTL rpm. There could be some races, but there is a fcntl lock on the /var/lib/rpm/Packages file that is set when we detect that there is no NPTL available. I'm going to need some reproducible test case that demonstrates corruption before we're going to be able to make progress with this. Again, I've never gotten rpm to segfault as you demonstrate with the rpmbuild-in-a-loop test case. In that mode rpm is only doing read actions on the database to satisfy build dependencies. I think that there was some other sort of instability on your system that resulted in random corruption everywhere.
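To make the fallback concrete, here is a rough sketch (mine, not rpm's actual source) of the kind of whole-file fcntl() advisory lock described above on /var/lib/rpm/Packages; the flags and error handling are illustrative only.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    struct flock fl;
    int fd = open("/var/lib/rpm/Packages", O_RDWR);

    if (fd < 0) {
        perror("open /var/lib/rpm/Packages");
        return 1;
    }

    fl.l_type   = F_WRLCK;   /* exclusive (write) lock */
    fl.l_whence = SEEK_SET;
    fl.l_start  = 0;
    fl.l_len    = 0;         /* 0 means: lock the whole file */

    if (fcntl(fd, F_SETLKW, &fl) < 0) {   /* block until the lock is granted */
        perror("fcntl F_SETLKW");
        close(fd);
        return 1;
    }

    /* ... database work would happen here, protected from other fcntl users ... */

    fl.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &fl);
    close(fd);
    return 0;
}

Note that an advisory lock like this only coordinates with other processes that also take the fcntl lock; a writer relying solely on db4's posix mutexes would not see it, which is consistent with the races being described as "best effort" above.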
Jared, Have you been able to repro this on a system with a clean, fresh install ?
------ Additional Comments From jaredj.com 2003-05-11 16:21 ------- But, to quote you from a previous comment in the bug: "corruption should only occur if you have multiple database accesses. db4 is only designed to support one locking mechanism. In our case, we select posix mutexes. db4 depends on the PTHREAD_PROCESS_SHARED, which is only implemented in NPTL. So rpm's locking mechanism only works with NPTL." You said database locking only works when NPTL is enabled. That implies that running in LinuxThreads and calling RPM could potentially cause multiple database accesses and corrupt the DB. So, er, which is it? The whole reason I spent the time to write that patch was concern that RPM always needed to be invoked under NPTL, based on that statement. If it will perform some sort of proper locking regardless of whether rpm instances running under NPTL and under LinuxThreads run concurrently, then I'd say the problem is dealt with. But that's not what I was told previously in this bug.
Right, db4 can only have one type of locking and I was only looking at the db4 side of things. I missed where rpm does its own fcntl lock for NPTL-less operation.
------ Additional Comments From jaredj.com 2003-05-11 16:23 ------- Bob, I've not run it on the release candidates and attempted to do concurrent RPM access during an install that has LinuxThreads enabled. After being told by RedHat that it was a problem with RPM itself and its locking, I made sure never to invoke RPM under LinuxThreads so I wouldn't have to keep rebuilding my machine after it got chewed up. I don't have unlimited time to continually rebuild boxes.
You can back up your database, and often corruptions in the databases other than the Packages file can be repaired by rpm --rebuilddb.
------ Additional Comments From jaredj.com 2003-05-11 16:52 ------- --rebuilddb didn't fix it when I lost it the last few times. I had to continually re-install when the database was wrecked. Interestingly, on the GM RHEL 3.0, the 2.2.5 (non-floating) glibc doesn't crash the rpmbuild test, so I guess the stack overwrite that was happening there must've gotten fixed at some point. So, perhaps it was just a glibc stack-overwrite issue that was eating the DB beforehand, or perhaps not. I'll see what I can dig up when I have time. And as a request, when you re-evaluate something and it changes an earlier observation (such as the db4/rpm interaction), could you please post an update to the bugzilla so it gets back to me. Otherwise I'm working on half/incomplete information, and my decisions and advice to some managers here, and even one of the directors, are based on that. I'd really appreciate it.
I'll do my best, and apologize for the incorrect information I came up with the first time.
test comment
Jared, Any feedback ? If this was a glibc issue that has now been fixed, that is great news; if not, we have more work to do.
------- Additional Comment #54 From Jared P. Jurkiewicz 2003-11-06 13:27 ------- Thanks. I'm trying to get you more information and I'm glad to see that the 2.2.5 export glibc seems to be working better now. But, I have bad news to report. :-( I broke the box again today, doing something fairly trivial. I installed WAS V5.1 (with LinuxThreads set before the install is run), and in another window I just had a simple shell script running in a loop, force-installing a dummy RPM into the DB every 2 seconds or so (so I'd get a concurrent-access sort of situation). That said, the MQ install script we call under the covers in the Java installer (which does RPM installs) reported a ton of failures. Namely: wmsetup: 06Nov03 12:19:35 ================================================================================ ================== wmsetup: 06Nov03 12:19:35 Date: Thu Nov 6 12:19:35 EST 2003 wmsetup: 06Nov03 12:19:35 ================================================================================ ================== wmsetup: 06Nov03 12:19:35 Hostname: arathorn.raleigh.ibm.com wmsetup: 06Nov03 12:19:35 Operating System: Linux wmsetup: 06Nov03 12:19:35 User: uid=0(root) gid=0(root) groups=0(root),1(bin),2 (daemon),3(sys),4(adm),6(disk),10(wheel),501(mqm),502(mqbrkrs) wmsetup: 06Nov03 12:19:35 wmsetup version: 1.22 wmsetup: 06Nov03 12:19:35 wsmfuncs.common version: 1.60 wmsetup: 06Nov03 12:19:35 wsmfuncs.Linux version: 1.47 wmsetup: 06Nov03 12:19:35 Command line is: /opt/wasinst/messaging/wmsetup install /opt/WebSphere/AppServer/logs/mq_install.log wmsetup: 06Nov03 12:19:35 Function is install wmsetup: 06Nov03 12:19:35 Checking pre-requisites ... wmsetup: 06Nov03 12:19:35 Getting OS level ... wmsetup: 06Nov03 12:19:35 Check_oslevel return 0 wmsetup: 06Nov03 12:19:35 Checking kernel ... wmsetup: 06Nov03 12:19:35 ... OK wmsetup: 06Nov03 12:19:35 Checking for group mqm ... wmsetup: 06Nov03 12:19:35 Check_group returning 0 wmsetup: 06Nov03 12:19:35 Checking for user mqm ... wmsetup: 06Nov03 12:19:35 ... RC 0 from Check_user wmsetup: 06Nov03 12:19:35 Checking for group mqbrkrs ... wmsetup: 06Nov03 12:19:35 Check_group returning 0 wmsetup: 06Nov03 12:19:35 Check_root mqm wmsetup: 06Nov03 12:19:35 Checking for group "mqm" ... wmsetup: 06Nov03 12:19:35 Checking if user "root" is in group "mqm" wmsetup: 06Nov03 12:19:35 ... RC 0 from Check_root wmsetup: 06Nov03 12:19:35 Check_root mqbrkrs wmsetup: 06Nov03 12:19:35 Checking for group "mqbrkrs" ... wmsetup: 06Nov03 12:19:35 Checking if user "root" is in group "mqbrkrs" wmsetup: 06Nov03 12:19:35 ... RC 0 from Check_root wmsetup: 06Nov03 12:19:35 Checking for installed MQSeriesJava ... wmsetup: 06Nov03 12:19:35 package MQSeriesJava is not installed wmsetup: 06Nov03 12:19:35 Checking for installed MQSeriesRuntime ... wmsetup: 06Nov03 12:19:35 package MQSeriesRuntime is not installed wmsetup: 06Nov03 12:19:35 Checking for installed MQSeriesJava-5.2.2 ... wmsetup: 06Nov03 12:19:36 package MQSeriesJava-5.2.2 is not installed wmsetup: 06Nov03 12:19:36 Checking for installed MQSeriesJava ... wmsetup: 06Nov03 12:19:36 package MQSeriesJava is not installed wmsetup: 06Nov03 12:19:36 Checking for installed wemps-runtime ... wmsetup: 06Nov03 12:19:36 package wemps-runtime is not installed wmsetup: 06Nov03 12:19:36 Return code 0 from Check_prereqs wmsetup: 06Nov03 12:19:36 Install mqjava ... wmsetup: 06Nov03 12:19:36 IsClient entered wmsetup: 06Nov03 12:19:39 IsClient exit RC = 4 wmsetup: 06Nov03 12:19:39 installing component mqjava ... 
wmsetup: 06Nov03 12:19:39 MQSeriesJava-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 12:19:41 Checking for previously installed packages ... wmsetup: 06Nov03 12:19:41 Check for previously installed packages complete wmsetup: 06Nov03 12:19:41 Installing MQSeriesJava-5.3.0-1.i386.rpm rpmdb: PANIC: Invalid argument rpmdb: fatal region error detected; run recovery rpmdb: fatal region error detected; run recovery rpmdb: fatal region error detected; run recovery rpmdb: fatal region error detected; run recovery rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from dbcursor->c_put: DB_RUNRECOVERY: Fatal error, run database recovery rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->sync: DB_RUNRECOVERY: Fatal error, run database recovery rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from dbcursor->c_close: DB_RUNRECOVERY: Fatal error, run database recovery rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->sync: DB_RUNRECOVERY: Fatal error, run database recovery rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->cursor: DB_RUNRECOVERY: Fatal error, run database recovery rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "MQSeriesJava" records from Name index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->sync: DB_RUNRECOVERY: Fatal error, run database recovery rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->cursor: DB_RUNRECOVERY: Fatal error, run database recovery rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "Cleanup" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "DefaultConfiguration" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "IVTRun" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "IVTSetup" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "IVTTidy" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "JMSAdmin" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "JMSAdmin.config" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "JmsPostcardSample.ini" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: 
error(-30982) getting "MQJMS_PSQ.mqsc" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "PSIVTRun" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "PSReportDump.class" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "formatLog" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "postcard" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "postcard.ini" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "runjms" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "com.ibm.mq.jar" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "com.ibm.mqbind.jar" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "com.ibm.mqjms.jar" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "connector.jar" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "fscontext.jar" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "jms.jar" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "jndi.jar" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "jta.jar" records from Basenames index rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->get: DB_RUNRECOVERY: Fatal error, run database recovery error: error(-30982) getting "ldap.jar" records from Basenames index ... and it goes on and on like that. Looks like something in the best attemp locking isn't doing so well, or somesuch. It failed really badly. I'm glad I tgz'ed the RPM /var/lib/rpm directory ahead of time. 
I think this can be reproduced easily via a couple scripts, running in loops to install rpms, one in LinuxThreads mode, the other in NPTL. I'll try to set that up, but I wanted to get you this feedback so you could see what I was getting reported. I restored my RPM DB, and ran both installers in NPTL mode (The WAS install, and the trivial script/dummy rpm) and it installed fine. Here's the output from MQ's install script (which just calls RPM) wmsetup: 06Nov03 13:02:17 ================================================================================ ================== wmsetup: 06Nov03 13:02:17 Date: Thu Nov 6 13:02:17 EST 2003 wmsetup: 06Nov03 13:02:17 ================================================================================ ================== wmsetup: 06Nov03 13:02:17 Hostname: arathorn.raleigh.ibm.com wmsetup: 06Nov03 13:02:17 Operating System: Linux wmsetup: 06Nov03 13:02:17 User: uid=0(root) gid=0(root) groups=0(root),1(bin),2 (daemon),3(sys),4(adm),6(disk),10(wheel),501(mqm),502(mqbrkrs) wmsetup: 06Nov03 13:02:18 wmsetup version: 1.22 wmsetup: 06Nov03 13:02:18 wsmfuncs.common version: 1.60 wmsetup: 06Nov03 13:02:18 wsmfuncs.Linux version: 1.47 wmsetup: 06Nov03 13:02:18 Command line is: /opt/wasinst/messaging/wmsetup install /opt/WebSphere/AppServer/logs/mq_install.log wmsetup: 06Nov03 13:02:18 Function is install wmsetup: 06Nov03 13:02:18 Checking pre-requisites ... wmsetup: 06Nov03 13:02:18 Getting OS level ... wmsetup: 06Nov03 13:02:18 Check_oslevel return 0 wmsetup: 06Nov03 13:02:18 Checking kernel ... wmsetup: 06Nov03 13:02:18 ... OK wmsetup: 06Nov03 13:02:18 Checking for group mqm ... wmsetup: 06Nov03 13:02:18 Check_group returning 0 wmsetup: 06Nov03 13:02:18 Checking for user mqm ... wmsetup: 06Nov03 13:02:18 ... RC 0 from Check_user wmsetup: 06Nov03 13:02:18 Checking for group mqbrkrs ... wmsetup: 06Nov03 13:02:18 Check_group returning 0 wmsetup: 06Nov03 13:02:18 Check_root mqm wmsetup: 06Nov03 13:02:18 Checking for group "mqm" ... wmsetup: 06Nov03 13:02:18 Checking if user "root" is in group "mqm" wmsetup: 06Nov03 13:02:18 ... RC 0 from Check_root wmsetup: 06Nov03 13:02:18 Check_root mqbrkrs wmsetup: 06Nov03 13:02:18 Checking for group "mqbrkrs" ... wmsetup: 06Nov03 13:02:18 Checking if user "root" is in group "mqbrkrs" wmsetup: 06Nov03 13:02:18 ... RC 0 from Check_root wmsetup: 06Nov03 13:02:18 Checking for installed MQSeriesJava ... wmsetup: 06Nov03 13:02:18 package MQSeriesJava is not installed wmsetup: 06Nov03 13:02:18 Checking for installed MQSeriesRuntime ... wmsetup: 06Nov03 13:02:18 package MQSeriesRuntime is not installed wmsetup: 06Nov03 13:02:18 Checking for installed MQSeriesJava-5.2.2 ... wmsetup: 06Nov03 13:02:19 package MQSeriesJava-5.2.2 is not installed wmsetup: 06Nov03 13:02:19 Checking for installed MQSeriesJava ... wmsetup: 06Nov03 13:02:19 package MQSeriesJava is not installed wmsetup: 06Nov03 13:02:19 Checking for installed wemps-runtime ... wmsetup: 06Nov03 13:02:19 package wemps-runtime is not installed wmsetup: 06Nov03 13:02:19 Return code 0 from Check_prereqs wmsetup: 06Nov03 13:02:19 Install mqjava ... wmsetup: 06Nov03 13:02:19 IsClient entered wmsetup: 06Nov03 13:02:22 IsClient exit RC = 4 wmsetup: 06Nov03 13:02:22 installing component mqjava ... wmsetup: 06Nov03 13:02:22 MQSeriesJava-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:25 Checking for previously installed packages ... 
wmsetup: 06Nov03 13:02:25 Check for previously installed packages complete wmsetup: 06Nov03 13:02:25 Installing MQSeriesJava-5.3.0-1.i386.rpm wmsetup: 06Nov03 13:02:28 ... Return code 0 from rpm -i wmsetup: 06Nov03 13:02:28 Install mqm ... wmsetup: 06Nov03 13:02:28 IsClient entered wmsetup: 06Nov03 13:02:33 This is a currently a WAS client-only installation wmsetup: 06Nov03 13:02:33 IsClient exit RC = 0 wmsetup: 06Nov03 13:02:33 installing component mqm ... wmsetup: 06Nov03 13:02:33 MQSeriesClient-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:33 MQSeriesMsg_Zh_CN-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:33 MQSeriesMsg_Zh_TW-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:33 MQSeriesMsg_de-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:33 MQSeriesMsg_es-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:33 MQSeriesMsg_fr-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:33 MQSeriesMsg_it-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:33 MQSeriesMsg_ja-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:33 MQSeriesMsg_ko-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:33 MQSeriesMsg_pt-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:33 MQSeriesServer-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:33 MQSeriesSDK-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:33 MQSeriesRuntime-5.3.0-1.i386.rpm Found wmsetup: 06Nov03 13:02:35 Checking for previously installed packages ... wmsetup: 06Nov03 13:02:36 Check for previously installed packages complete wmsetup: 06Nov03 13:02:36 Installing MQSeriesRuntime-5.3.0-1.i386.rpm MQSeriesSDK-5.3.0-1.i386.rpm MQSeriesServer-5.3.0-1.i386.rpm MQSeriesMsg_pt- 5.3.0-1.i386.rpm MQSeriesMsg_ko-5.3.0-1.i386.rpm MQSeriesMsg_ja-5.3.0- 1.i386.rpm MQSeriesMsg_it-5.3.0-1.i386.rpm MQSeriesMsg_fr-5.3.0-1.i386.rpm MQSeriesMsg_es-5.3.0-1.i386.rpm MQSeriesMsg_de-5.3.0-1.i386.rpm MQSeriesMsg_Zh_TW-5.3.0-1.i386.rpm MQSeriesMsg_Zh_CN-5.3.0-1.i386.rpm MQSeriesClient-5.3.0-1.i386.rpm wmsetup: 06Nov03 13:03:15 ... Return code 0 from rpm -i wmsetup: 06Nov03 13:03:15 Setting capacity units /opt/mqm/bin/setmqcap: relocation error: /opt/mqm/lib/libmqmr_r.so: symbol errno, version GLIBC_2.0 not defined in file libc.so.6 with link time reference wmsetup: 06Nov03 13:03:15 Install mqm_csd ... wmsetup: 06Nov03 13:03:15 IsClient entered wmsetup: 06Nov03 13:03:21 This is not currently a WAS client-only installation wmsetup: 06Nov03 13:03:21 IsClient exit RC = 4 wmsetup: 06Nov03 13:03:21 installing component mqm_csd ... wmsetup: 06Nov03 13:03:21 MQSeriesClient-U486878-5.3.0-4.i386.rpm Found wmsetup: 06Nov03 13:03:21 MQSeriesRuntime-U486878-5.3.0-4.i386.rpm Found wmsetup: 06Nov03 13:03:21 MQSeriesSDK-U486878-5.3.0-4.i386.rpm Found wmsetup: 06Nov03 13:03:21 MQSeriesServer-U486878-5.3.0-4.i386.rpm Found wmsetup: 06Nov03 13:03:21 MQSeriesJava-U486878-5.3.0-4.i386.rpm Found wmsetup: 06Nov03 13:03:23 Determining which CSD packages are applicable ... wmsetup: 06Nov03 13:03:23 Starting with list: MQSeriesJava-U486878-5.3.0- 4.i386.rpm MQSeriesServer-U486878-5.3.0-4.i386.rpm MQSeriesSDK-U486878-5.3.0- 4.i386.rpm MQSeriesRuntime-U486878-5.3.0-4.i386.rpm MQSeriesClient-U486878- 5.3.0-4.i386.rpm wmsetup: 06Nov03 13:03:23 Ending with list: MQSeriesJava-U486878-5.3.0- 4.i386.rpm MQSeriesServer-U486878-5.3.0-4.i386.rpm MQSeriesSDK-U486878-5.3.0- 4.i386.rpm MQSeriesRuntime-U486878-5.3.0-4.i386.rpm MQSeriesClient-U486878- 5.3.0-4.i386.rpm wmsetup: 06Nov03 13:03:23 Checking for previously installed packages ... 
wmsetup: 06Nov03 13:03:24 Check for previously installed packages complete wmsetup: 06Nov03 13:03:24 Installing MQSeriesJava-U486878-5.3.0-4.i386.rpm MQSeriesServer-U486878-5.3.0-4.i386.rpm MQSeriesSDK-U486878-5.3.0-4.i386.rpm MQSeriesRuntime-U486878-5.3.0-4.i386.rpm MQSeriesClient-U486878-5.3.0-4.i386.rpm wmsetup: 06Nov03 13:08:45 ... Return code 0 from rpm -i wmsetup: 06Nov03 13:08:45 Install wemps ... wmsetup: 06Nov03 13:08:45 IsClient entered wmsetup: 06Nov03 13:08:52 This is not currently a WAS client-only installation wmsetup: 06Nov03 13:08:52 IsClient exit RC = 4 wmsetup: 06Nov03 13:08:52 installing component wemps ... wmsetup: 06Nov03 13:08:52 wemps-runtime-2.1.0-0.i386.rpm Found wmsetup: 06Nov03 13:08:52 wemps-msg-De_DE-2.1.0-0.i386.rpm Found wmsetup: 06Nov03 13:08:52 wemps-msg-Es_ES-2.1.0-0.i386.rpm Found wmsetup: 06Nov03 13:08:52 wemps-msg-Fr_FR-2.1.0-0.i386.rpm Found wmsetup: 06Nov03 13:08:52 wemps-msg-It_IT-2.1.0-0.i386.rpm Found wmsetup: 06Nov03 13:08:52 wemps-msg-Ja_JP-2.1.0-0.i386.rpm Found wmsetup: 06Nov03 13:08:52 wemps-msg-Ko_KR-2.1.0-0.i386.rpm Found wmsetup: 06Nov03 13:08:52 wemps-msg-Pt_BR-2.1.0-0.i386.rpm Found wmsetup: 06Nov03 13:08:52 wemps-msg-Zh_CN-2.1.0-0.i386.rpm Found wmsetup: 06Nov03 13:08:52 wemps-msg-Zh_TW-2.1.0-0.i386.rpm Found wmsetup: 06Nov03 13:08:55 Checking for previously installed packages ... wmsetup: 06Nov03 13:08:55 Check for previously installed packages complete wmsetup: 06Nov03 13:08:55 Installing wemps-msg-Zh_TW-2.1.0-0.i386.rpm wemps-msg- Zh_CN-2.1.0-0.i386.rpm wemps-msg-Pt_BR-2.1.0-0.i386.rpm wemps-msg-Ko_KR-2.1.0- 0.i386.rpm wemps-msg-Ja_JP-2.1.0-0.i386.rpm wemps-msg-It_IT-2.1.0-0.i386.rpm wemps-msg-Fr_FR-2.1.0-0.i386.rpm wemps-msg-Es_ES-2.1.0-0.i386.rpm wemps-msg- De_DE-2.1.0-0.i386.rpm wemps-runtime-2.1.0-0.i386.rpm Preinstall Phase Executing Please Wait... Preinstall Phase Finished Postinstall Phase Executing Please Wait... Postinstall Phase Finished wmsetup: 06Nov03 13:09:20 ... Return code 0 from rpm -i wmsetup: 06Nov03 13:09:20 Return code 0 from Install_wsm wmsetup: 06Nov03 13:09:20 Exiting - return code 0 wmsetup: 06Nov03 13:09:20 ============================================================
------ Additional Comments From khoa.com 2003-06-11 15:50 ------- Add Salina to the CC list, so she can see all the updates. Thanks.
------ Additional Comments From jinge.com 2003-06-11 23:41 ------- I also saw a similar problem when I installed CSM after setting LD_ASSUME_KERNEL=2.2.5. installms: Exit code 1 from command: /bin/rpm -U /csminstall/Linux/RedHatEL-ES/csm/1.3.2/packages/csm.dsh-1.3.2.10-2.i386.rpm 2>&1 Error message from cmd: rpmdb: PANIC: Invalid argument rpmdb: fatal region error detected; run recovery rpmdb: fatal region error detected; run recovery rpmdb: fatal region error detected; run recovery rpmdb: fatal region error detected; run recovery rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from dbcursor->c_put: DB_RUNRECOVERY: Fatal error, run database recovery rpmdb: fatal region error detected; run recovery error: db4 error(-30982) from db->sync: DB_RUNRECOVERY: Fatal error, run database recovery rpmdb: fatal region error detected; run recovery .............. After I removed __db.00* from /var/lib/rpm, rpm returned to normal. But the problem occurs again after several more installs. Can someone add me and keshav.com to the CC list so we can get updates? Thanks.
test to debug BugMail tool for Mark Wisner
------ Additional Comments From markwiz.com 2003-07-11 07:32 ------- added keshav.com to the cc list
I did a test: (I = Matt Wilson)

mkdir -p /tmp/test/var/lib/rpm
cp /usr/lib/rpmdb/i386-redhat-linux/redhat/* /tmp/test/var/lib/rpm/

This is a full and very large database.

window 1:
while :; do
  rpm -r /tmp/test -Uvh redhat-release-3-1.noarch.rpm --force --nodeps || break
done

window 2:
while :; do
  LD_ASSUME_KERNEL=2.4.19 rpm -r /tmp/test -Uvh redhat-release-3-1.noarch.rpm --force --nodeps || break
done

Eventually I did see some database environment errors:

rpmdb: fatal region error detected; run recovery
error: db4 error(-30982) from dbenv->close: DB_RUNRECOVERY: Fatal error, run database recovery

But the database is still fully intact, rpm -qa on the database still shows 1392 package entries, etc.
Per Jared, Matt can you try 2.2.5 ?
------ Additional Comments From keshav.com 2003-14-11 13:22 ------- Hi - Any progress on this defect? Is this going to make it into the first update? This is a blocking defect for CSM and we would really like to have this in Update 1.
------ Additional Comments From jaredj.com 2003-17-11 09:41 ------- I'm not certain if I've been able to corrupt the DB again, but I have been able to hang it several times using the example RedHat provided regarding a simple script install in two windows. I couldn't even do an rpm -qa unless I went in and deleted the __db.00? files. So, as it stands I don't think we have anything useful we can tell customers. In one case it throws errors to the screen and messes up installs. In another, it locks the DB up completely. Then there's still the third possibility of DB corruption. Also an interesting side note: after I killed the installs of the same rpm (in two windows, in a loop), and unlocked the DB by removing the __db files, I did an rpm -qa. It reported back that there were 679 packages in the DB. Just before the run, I had queried and it reported 677. Since it was installing the same package, shouldn't it have only reported 678? I don't think RPM will put in duplicate entries, will it?
test to debug running BugMail tool for Mark Wisner
Mark and Glen, Can you open up a test bug for your work and not use this one ? Every time you touch the bug you spawn emails to everyone that is on the cc list and all who have posted. With this being a hot bug we all go take a look and see...."test comment"....Thanks.
How exactly is rpm invoked by the installation tools? By direct path (/bin/rpm) or by rpm in $PATH?
*** Bug 110554 has been marked as a duplicate of this bug. ***
------ Additional Comments From jaredj.com 2003-21-11 09:03 ------- Depends on the exec'er. I know ISMP references full pathname, whereas the MQ series install scripts invoke it via $PATH. So, in the case of the WAS installer, both ways are used.
------ Additional Comments From jaredj.com 2003-21-11 13:57 ------- Regarding the note above about the RPM patch and oddity there. I suspect you'd see the same if you ran without the patched rpm and just ran the test RPM scripts, both in NPTL mode. I've not verified this, however. I haven't had time to oldpackage back to the non-patched RPM.
This was emailed, but I will add it here: ------- Additional Comment #69 From Jared P. Jurkiewicz 2003-11-21 13:56 ------- A few more interesting side notes. I've gotten RPM to hang (the one running under NPTL) when there were dual LinuxThreads and NPTL RPM calls accessing at the same time. This seemed very easy to trigger with the -vvv option set in the script that was installing the RPM. In any event, here is a snippet from the logs: NPTL running RPM call: ========================================== ... bunch of calls D: sanity checking 1 elements D: computing 0 file fingerprints Preparing packages for installation... D: computing file dispositions D: ========== +++ DummyRpm-1.0-0 i386-linux 0x0 D: Expected size: 1169 = lead(96)+sigs(180)+pad(4)+data(889) D: Actual size: 1169 D: install: DummyRpm-1.0-0 has 0 files, test = 0 D: opening db index /var/lib/rpm/Name create mode=0x42 D: read h# 1006 Header SHA1 digest: OK (8103c2a4c7954dcc5eb9a97827d653781e1b9a99) DummyRpm-1.0-0 D: --- h# 1006 DummyRpm-1.0-0 D: removing "DummyRpm" from Name index. D: opening db index /var/lib/rpm/Group create mode=0x42 D: removing "System Environment/Daemons" from Group index. D: opening db index /var/lib/rpm/Requirename create mode=0x42 D: removing 2 entries from Requirename index. D: opening db index /var/lib/rpm/Providename create mode=0x42 D: removing "DummyRpm" from Providename index. D: opening db index /var/lib/rpm/Requireversion create mode=0x42 D: removing 2 entries from Requireversion index. D: opening db index /var/lib/rpm/Provideversion create mode=0x42 D: removing "1.0-0" from Provideversion index. D: opening db index /var/lib/rpm/Installtid create mode=0x42 D: removing 1 entries from Installtid index. D: opening db index /var/lib/rpm/Sigmd5 create mode=0x42 D: removing 1 entries from Sigmd5 index. D: opening db index /var/lib/rpm/Sha1header create mode=0x42 D: removing "8103c2a4c7954dcc5eb9a97827d653781e1b9a99" from Sha1header index. D: +++ h# 1007 Header SHA1 digest: OK (8103c2a4c7954dcc5eb9a97827d653781e1b9a99) At this point, it's hung and just stays there. 
LinuxThreads running RPM call:
===================================
D: ============== 1315054.tmp
D: Expected size: 1169 = lead(96)+sigs(180)+pad(4)+data(889)
D: Actual size: 1169
D: 1315054.tmp: MD5 digest: OK (ecc0cece7925db82e0602eca45259c2f)
D: added binary package [0]
D: found 0 source and 1 binary packages
D: unshared posix mutexes found(38), adding DB_PRIVATE, using fcntl lock
D: opening db environment /var/lib/rpm/Packages create:cdb:mpool:private
D: opening db index /var/lib/rpm/Packages rdonly mode=0x0
D: locked db index /var/lib/rpm/Packages
D: ========== +++ DummyRpm-1.0-0 i386-linux 0x0
D: opening db index /var/lib/rpm/Depends create mode=0x0
D: Requires: rpmlib(PayloadFilesHavePrefix) <= 4.0-1 YES (rpmlib provides)
D: Requires: rpmlib(CompressedFileNames) <= 3.0.4-1 YES (rpmlib provides)
D: closed db index /var/lib/rpm/Depends
D: closed db index /var/lib/rpm/Packages
D: closed db environment /var/lib/rpm/Packages
D: ========== recording tsort relations
D: ========== tsorting packages (order, #predecessors, #succesors, tree, depth)
D: 0 0 0 0 0 +DummyRpm-1.0-0
D: installing binary packages
D: unshared posix mutexes found(38), adding DB_PRIVATE, using fcntl lock
D: opening db environment /var/lib/rpm/Packages create:cdb:mpool:private
D: opening db index /var/lib/rpm/Packages create mode=0x42
D: mounted filesystems:
D: i dev bsize bavail iavail mount point
D: 0 0x0812 4096 777024 944553 /
D: 1 0x0002 1024 0 -1 /proc
D: 2 0x0008 1024 0 -1 /proc/bus/usb
D: 3 0x0811 1024 48735 20030 /boot
D: 4 0x0007 1024 0 -1 /dev/pts
D: 5 0x0009 4096 128566 128565 /dev/shm
D: sanity checking 1 elements
D: computing 0 file fingerprints
Preparing packages for installation...
D: computing file dispositions
D: ========== +++ DummyRpm-1.0-0 i386-linux 0x0
D: Expected size: 1169 = lead(96)+sigs(180)+pad(4)+data(889)
D: Actual size: 1169
D: install: DummyRpm-1.0-0 has 0 files, test = 0
D: opening db index /var/lib/rpm/Name create mode=0x42
D: read h# 1006 Header SHA1 digest: OK (8103c2a4c7954dcc5eb9a97827d653781e1b9a99) DummyRpm-1.0-0
D: --- h# 1006 DummyRpm-1.0-0
D: removing "DummyRpm" from Name index.
D: opening db index /var/lib/rpm/Group create mode=0x42
D: removing "System Environment/Daemons" from Group index.
D: opening db index /var/lib/rpm/Requirename create mode=0x42
D: removing 2 entries from Requirename index.
D: opening db index /var/lib/rpm/Providename create mode=0x42
D: removing "DummyRpm" from Providename index.

And it just keeps going indefinitely. Very strange.
=========================================================
Another thing I tried was rebuilding RPM itself with the patch that forces NPTL. That certainly got around the RPM hang, but after some random number of loops through the script, it started spitting out the fatal db environment errors:

D: ============== 1315054.tmp
D: Expected size: 1169 = lead(96)+sigs(180)+pad(4)+data(889)
D: Actual size: 1169
D: 1315054.tmp: MD5 digest: OK (ecc0cece7925db82e0602eca45259c2f)
D: added binary package [0]
D: found 0 source and 1 binary packages
D: opening db environment /var/lib/rpm/Packages joinenv
rpmdb: fatal region error detected; run recovery
error: db4 error(-30982) from dbenv->open: DB_RUNRECOVERY: Fatal error, run database recovery
D: opening db index /var/lib/rpm/Packages rdonly mode=0x0
error: cannot open Packages index using db3 - (-30982)
error: cannot open Packages database in /var/lib/rpm
D: ========== recording tsort relations
D: ========== tsorting packages (order, #predecessors, #succesors, tree, depth)
D: 0 0 0 0 0 +DummyRpm-1.0-0
D: ============== 1315054.tmp
D: Expected size: 1169 = lead(96)+sigs(180)+pad(4)+data(889)
D: Actual size: 1169
=============================================

Since these logs are huge, I'm going to mail them directly to Mark, along with the patched build of RPM, in case he would like to look at it as well.

------- Additional Comment #70 From Jared P. Jurkiewicz 2003-11-21 13:57 ------- Regarding the note above about the RPM patch and the oddity there: I suspect you'd see the same if you ran without the patched rpm and just ran the test RPM scripts, both in NPTL mode. I've not verified this, however. I haven't had time to oldpackage back to the non-patched RPM.
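A sketch of the two-window setup described in comment #69 above (the package path is a placeholder for whatever test RPM is used; the original runs drove rpm through the installer rather than invoking it directly like this):

PKG=./DummyRpm-1.0-0.i386.rpm   # placeholder path, not from the report

# Terminal 1: NPTL (LD_ASSUME_KERNEL not set), reinstall the same package in a loop
while true; do
    rpm -Uvh --replacepkgs -vvv "$PKG" || break
done

# Terminal 2: LinuxThreads, same loop against the same /var/lib/rpm
export LD_ASSUME_KERNEL=2.2.5
while true; do
    rpm -Uvh --replacepkgs -vvv "$PKG" || break
done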
------ Additional Comments From jaredj.com 2003-12-03 13:51 ------- More experimentation results. I completely reloaded my box with RHEL 3 AS and installed all current RHN updates on it. I backed up my RPM DB, just in case. I swapped the threading model to LinuxThreads, non-floating stack (LD_ASSUME_KERNEL=2.2.5). I used the WAS V5.1 installer (because I now can't get the V5.0 installer to invoke at all; the JDK segfaults no matter what glibc is used, which is not good. It did work in the past using the 2.2.5 export). Before the install, RPM reported there were 697 packages in the DB. After the install, RPM reported there were 147 packages in the DB. This is bad... I'll be mailing the RPM package lists from before and after the install to Mark and Bob.
------ Additional Comments From jaredj.com 2003-12-03 13:52 ------- Correction, it was 142 packages in the DB after the install.
------ Additional Comments From jaredj.com 2003-12-08 16:22 ------- Sent a trial CD to RedHat so they can easily reproduce in house. A couple of things that are useful to know: The CD is of V5.1, which will run the install in NPTL mode. It's the only image I had readily on hand that would work. To see the problem, set LinuxThreads before the run: export LD_ASSUME_KERNEL=2.2.5 will usually show the problem very readily. V5.0 requires LinuxThreads (for the 1.3.1 JVM to work, etc.), so this demonstrates the same RPM error V5.0 hits on RHEL 3 when run this way. There is a file on the root of the CD (install.sp) that sets debug on for ISMP, which will show you all the RPM commands it's calling out to. This will normally end up in the install log at the end of the install in $WAS_HOME/logs/log.txt. You can also display it to the console while it's running by invoking the install as: ./install -is:javaconsole I have verified the trial install also triggers the failure. It took two installs to hit it, but I had the RPM DB report 697 entries before the install and 179 entries afterward.
[root@msw root]# mount /mnt/cdrom
[root@msw root]# cd /mnt/cdrom
[root@msw root]# LD_ASSUME_KERNEL=2.2.5 ./install
InstallShield Wizard
Initializing InstallShield Wizard...
Searching for Java(tm) Virtual Machine...
........
[root@msw root]# rpm --dbpath /root/rpm-backup -qa | wc -l
534
[root@msw root]# rpm -qa | wc -l
534
[root@msw root]# /usr/lib/rpm/rpmdb_verify /var/lib/rpm/[A-Z]*
[root@msw root]#
This was because rpmbuild wasn't installed. ISMP builds "fake" packages with rpmbuild to "install" on the system. After I install the rpm-build package, I see what you're seeing, but this is due to the db environment:

[root@msw WAS]# rpm -qa | wc -l
107

But:

[root@msw WAS]# /usr/lib/rpm/rpmdb_verify /var/lib/rpm/[A-Z]*
The database is intact
[root@msw WAS]# rm /var/lib/rpm/__db*
[root@msw rpm]# rpm -qa | wc -l
575

Before installing:

[root@msw rpm]# rpm --dbpath /root/rpm-backup-2/ -qa | wc -l
548
[root@msw tmp]# rpm --dbpath /root/rpm-backup-2/ -qa | sort > before
[root@msw tmp]# rpm -qa | sort > after
[root@msw tmp]# diff -u before after
--- before 2003-12-09 12:42:38.000000000 -0500
+++ after 2003-12-09 12:42:59.000000000 -0500
@@ -507,6 +508,33 @@
 Wnn6-SDK-1.0-25
 Wnn6-SDK-devel-1.0-25
 words-2-21
+WSBAC1AA51-5.1-0
+WSBAS1AA51-5.1-0
+WSBAU1AA51-5.1-0
+WSBCO1AA51-5.1-0
+WSBCO5AA51-5.1-0
+WSBDM1AA51-5.1-0
+WSBDT1AA51-5.1-0
+WSBES1AA-5.0-0
+WSBGK2AA51-5.1-0
+WSBIH1AA51-1.3-28
+WSBJA1AA51-5.1-0
+WSBJD5AA51-1.3-1
+WSBJD7AA51-1.3-1
+WSBJD9AA51-1.3-1
+WSBLA1AA51-5.1-0
+WSBMQ1AA-5.0-0
+WSBMQ2AA-5.0-0
+WSBMQ3AA-5.0-0
+WSBMQ4AA-5.0-0
+WSBMS3AA-5.0-0
+WSBMS6AA-5.0-0
+WSBPL1AA51-5.1-0
+WSBPS1AA51-5.1-0
+WSBSM1AA51-5.1-0
+WSBSR1AA51-5.1-0
+WSBSR4AA51-5.1-0
+WSBTV1AA51-5.1-0
 wvdial-1.53-11
 Xaw3d-1.5-18
 xchat-2.0.4-3.EL
------ Additional Comments From jaredj.com 2003-12-12 13:54 ------- Received a patch from RedHat that conceptually does the same thing as the patch I submitted to them previously (detect LD_ASSUME_KERNEL; if found, store it off, reset to NPTL, and re-exec. The APIs used are BSD-originated ones, versus the pure POSIX ones I used for the environment manipulation, but it's the same idea. Po-ta-to, po-tah-to.) It's enabled by setting RPM_FORCE_NPTL=1 in the environment, so you have to forcibly enable it rather than the switch always being on. It does correct the corruption problem I saw. (Before the install the RPM DB shows 698 packages; after, 768; deinstall goes back to 698; reinstall, up to 768; and so on. I repeated the install/deinstall procedure several times and verified it consistently registers cleanly.) So, it looks like this should resolve it, and it gives us a way to flip the switch to protect the RPM DB when RPM executables get called under the covers by a bunch of things. And since we have to document setting LD_ASSUME_KERNEL anyway for releases older than 5.1 to even invoke the install, it can be documented to set both values.
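For installers that drive rpm from shell scripts, the same store-off/reset idea can be approximated without a patched rpm by stripping LD_ASSUME_KERNEL from just the rpm child process. A minimal sketch, assuming a hypothetical wrapper script name:

#!/bin/sh
# rpm-nptl-wrapper.sh (hypothetical name): run rpm with LD_ASSUME_KERNEL removed
# from its environment, so the rpm child process uses NPTL even when the calling
# installer itself needs LinuxThreads. The caller's environment is untouched,
# since the change only affects this one child process.
exec env -u LD_ASSUME_KERNEL /bin/rpm "$@"

An installer could call such a wrapper in place of /bin/rpm; unlike the RPM_FORCE_NPTL patch, it only helps the rpm processes the installer launches itself.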
----- Additional Comments From khoa.com 2004-03-26 10:41 ------- Closing this bug with Jared's agreement.
AFAICT, this bug is fixed.
----- Additional Comments From khoa.com 2004-11-11 10:27 EDT ------- Per the call with WebSphere this morning, this bug has come back in RHEL3 U3. So we need to reopen this. Thanks.
Reopened.
----- Additional Comments From salina.com 2004-11-11 11:08 EDT ------- Stacy, Please reopen problem at RH. See Khoa's last comment. Thanks.
IBM: we need some more details. RHEL 3 U3 shipped in September.
The requirement for NPTL was taken out in U3, but the functionality did not change, and WAS signed off on that test, IIRC.
----- Additional Comments From markwiz.com 2004-11-12 12:01 EDT ------- This is a RHEL4 test that should also work for RHEL3 U4. We have a simpler reproducer for you. You'll need a clean install of RHEL 4, and the installers that were provided to you. In order to replicate the bug *without* re-installing RHEL 4, I advise you to tar up /var/lib/rpm prior to performing the steps listed (see the backup/restore sketch after this comment).
1) adduser mqm
2) adduser mqbrkrs
3) vigr ... add root to groups mqm and mqbrkrs in /etc/group and /etc/gshadow
4) cd /opt
5) tar -xzf WASTrialInstaller.tgz
6) ulimit -s 2048
7) export LC_CTYPE=$LANG
8) cd WASTrialInst/instdir/messaging
9) ./wmsetup install /tmp/mqlog
10) cd ..
11) mkdir ptfs
12) cd ptfs
13) tar -xzf was51_fp1_linux.tar.gz
14) cd fixpacks
15) jar xf was51_fp1_linux.jar
16) cd ptfs/was51_fp1_linux/components/external.mq/CSD/
17) chmod -R +x
18) export LD_ASSUME_KERNEL=2.4.19
19) ./wmservice install /tmp/mqupgradelog
20) unset LD_ASSUME_KERNEL
At this point, issue an rpm -qa > installed_packages. This *should* produce a DB_PAGE_NOTFOUND error.
Removal procedure:
1) rm /var/lib/rpm/__db*. This is a precaution, to prevent the corrupted caches from "infecting" the rpm db.
2) cd /opt/WASInst
3) rpm -qa --last > some_text_file
4) Edit some_text_file so that the only packages in the file are those containing the words MQSeries or wemps.
5) rpm -ev --noscripts `cat some_text_file`
6) cd /opt
7) rm -rf mqm wemps
8) cd /var
9) rm -rf mqm wemps
10) cd /usr/bin
11) symlinks -d .
This will completely clean your system, but will leave your rpm db in a state that will not permit you to replicate the issue. In order to replicate the issue, cd to /var/lib/rpm, rm __db*, and then untar the backup of /var/lib/rpm over the current set of files.
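A minimal sketch of the advised backup and restore around the reproducer, assuming no rpm processes are running; the archive path is just an example, not from the report:

#!/bin/sh
# Back up /var/lib/rpm before running the reproducer, restore it afterwards.
BACKUP=/root/rpmdb-backup.tar.gz        # example location

# Backup
tar -czf "$BACKUP" -C /var/lib rpm

# ... run the reproducer steps above ...

# Restore: drop the (possibly corrupted) db environment files first,
# then untar the saved copy over /var/lib/rpm.
rm -f /var/lib/rpm/__db.00?
tar -xzf "$BACKUP" -C /var/lib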
----- Additional Comments From jaredj.com 2004-11-12 17:11 EDT ------- Reproduced on my X86 2-way machine. This is incredibly strange.

./wmservice mqlog.txt
[root@arathorn CSD]# ./wmservice install /tmp/mqsetup.log
[root@arathorn CSD]# rpm -qa > somefile
[root@arathorn CSD]# echo $LD_ASSUME_KERNEL
2.4.19
[root@arathorn CSD]# unset LD_ASSUME_KERNEL
[root@arathorn CSD]# rpm -qa > somefile2
error: db4 error(-30988) from dbcursor->c_get: DB_PAGE_NOTFOUND: Requested page not found
[root@arathorn CSD]# export LD_ASSUME_KERNEL=2.4.19
[root@arathorn CSD]# rpm -qa > somefile3
[root@arathorn CSD]#

It barfs now ONLY when LD_ASSUME_KERNEL is unset after an install during which LD_ASSUME_KERNEL was set. This is from steps 18-20. If I reset LD_ASSUME_KERNEL, no DB_PAGE errors. If I unset it now, PAGE errors. And here is an interesting Bugzilla entry Victor just pointed me to: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=91933

More things I should point out that I find suspect (but I may just be reading it wrong...). If you pass in --disable-posixmutexes ... how does this code in rpmdb/db3.c know about it? It looks like if it can compute a posix mutex, it'll use it regardless. Here's the code I'm wondering about:

/*
 * Avoid incompatible DB_CREATE/DB_RDONLY flags on DBENV->open.
 */
if (dbi->dbi_use_dbenv) {
#if HAVE_LIBPTHREAD
    if (rpmdb->db_dbenv == NULL) {
        /* Set DB_PRIVATE if posix mutexes are not shared. */
        xx = db3_pthread_nptl();
        if (xx) {
            dbi->dbi_eflags |= DB_PRIVATE;
            rpmMessage(RPMMESS_DEBUG, _("unshared posix mutexes found(%d), adding DB_PRIVATE, using fcntl lock "), xx);
        }
    }
#endif
    if (access(dbhome, W_OK) == -1) {
        /* dbhome is unwritable, don't attempt DB_CREATE on ...

Then the function:

/**
 * Check that posix mutexes are shared.
 * @return 0 == shared.
 */
static int db3_pthread_nptl(void)
        /*@*/
{
    pthread_mutex_t mutex;
    pthread_mutexattr_t mutexattr, *mutexattrp = NULL;
    pthread_cond_t cond;
    pthread_condattr_t condattr, *condattrp = NULL;
    int ret = 0;

    ret = pthread_mutexattr_init(&mutexattr);
    if (ret == 0) {
        ret = pthread_mutexattr_setpshared(&mutexattr, PTHREAD_PROCESS_SHARED);
        mutexattrp = &mutexattr;
    }

    if (ret == 0)
        ret = pthread_mutex_init(&mutex, mutexattrp);
    ...

So, how is DB_PRIVATE always getting set when the disable was passed? I don't see how it could, given that code. So, this implies to me that --disable-posixmutexes is broken from a configure point of view; it's *not* disabling it all the way. If I'm wrong in my analysis of that db3.c file, please let me know.
----- Additional Comments From jaredj.com 2004-11-12 17:13 EDT ------- Note that the RPM code snippet included is from the RPM source packages of RHEL 4, Beta 2. I would think RHEL 3 is probably similar.
Yes, RHEL3 is very similar. What you miss is that shared posix mutexes are only available through the test above if/when NPTL is functional and enabled. Setting LD_ASSUME_KERNEL makes shared posix mutexes unavailable. The test is exactly what db4 does while checking flags. Setting DB_PRIVATE is not at all the right thing to do, because there is no locking whatsoever with DB_PRIVATE set. The intent was to have something functional when a non-Red Hat (and hence non-NPTL) 2.4 kernel was used. This is a rather small corner case for the RHEL/FC product. Backing up and designing for a goal other than "Make it work." is perhaps better than mucking about with various locking schemes. I can easily instrument a WAS-specific pathway in rpm that takes out a dirt-simple fcntl exclusive lock on Packages and avoids the complexities involved with concurrent access locking and NPTL, if you wish. AFAIK, that is not an acceptable solution to IBM.
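The outcome of that check is visible in rpm's own -vvv debug output (the "unshared posix mutexes found ... adding DB_PRIVATE, using fcntl lock" line in the earlier logs). A quick way to compare the two modes, assuming the debug line is emitted when the db environment is opened, as in the logs above:

# Under LinuxThreads: the DB_PRIVATE/fcntl path is expected.
LD_ASSUME_KERNEL=2.2.5 rpm -vvv -qa > lt.log 2>&1
grep "posix mutexes" lt.log

# Under NPTL (LD_ASSUME_KERNEL not set): no such message is expected,
# since shared posix mutexes are available.
rpm -vvv -qa > nptl.log 2>&1
grep "posix mutexes" nptl.log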
----- Additional Comments From jaredj.com 2004-11-15 17:06 EDT ------- Note: There is an error in one of the messages my test script emits: "Deleting the /var/lib/__db.00?" should be "Deleting the /var/lib/rpm/__db.00?". The script does delete the 001, 002, and 003 files individually in /var/lib/rpm; only the text is in error.
Created attachment 106824 [details] "simple_failure_test.tar.gz"
----- Additional Comments From jaredj.com 2004-11-15 17:03 EDT ------- Dirt simple reproduction testcase. Here's a dirt simple reproduction. I've been able to remove WAS completely from the equation and still reproduce failures. This is the run from RHEL3, U4 Beta. Note the RPMs included in this testcase are dirt simple, and the binaries were built with RPM v4 from RHEL 3 and RHEL 4, so it removes the concern that the MQ rpms being built with RPM v3 is the cause of the failure. My run results:

./test_install
Backing up /var/lib/rpm to: /backup_rpm.tar.gz ...
Backup completed
Setting LD_ASSUME_KERNEL to 2.4.19
Set LD_ASSUME_KERNEL to: 2.4.19
Installing test_passthrough RPM
Pre script... LD_ASSUME_KERNEL = 2.4.19
Post script... LD_ASSUME_KERNEL = 2.4.19
Completed the install of the passthrough test RPM.
Running 'rpm -qa' and piping to /dev/null. Should be no errors on stderr.
rpm-qa run complete.
Total # of RPMS in RPM DB: 718
Unsetting LD_ASSUME_KERNEL...
LD_ASSUME_KERNEL is now:
Running 'rpm -qa' and piping to /dev/null. There will likely be an error on stderr regarding DB_PAGE_NOTFOUND
error: db4 error(-30988) from dbcursor->c_get: DB_PAGE_NOTFOUND: Requested page not found
rpm-qa run complete. Should be a failure.
Now, installing another rpm without LD_ASSUME_KERNEL set. Note that a failure DOES NOT EMIT! It will 'install' fine, and intruduce corruption into the RPMDB from the caches.
Pre script... LD_ASSUME_KERNEL not set!
Post script... LD_ASSUME_KERNEL not set!
Install complete.
Re-running package count. You WILL see an error here. But now it's too late, RPMDB is corrupted.
error: rpmdbNextIterator: skipping h# 718 blob size(1568): BAD, 8 + 16 * il(41) + dl(916)
New RPM package count is: 718
Deleting the /var/lib/__db.00? files to see if it is only cache issue corruption...
Deletion done.
Re-running package count. You WILL see an error here. But now it's too late, RPMDB is corrupted.
error: rpmdbNextIterator: skipping h# 718 blob size(1568): BAD, 8 + 16 * il(41) + dl(916)
New RPM package count is: 718
Note that even with the deletion of the /var/lib/__db.00? files, corruption is permanent. Futher calls to rpm produce failures.
Testrun complete. Please restore /backup_rpm.tar.gz over /var/lib/rpm to repair RPMDB corruption that occured.
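The essential sequence that output shows can be sketched in a few lines of shell; this is a reconstruction from the output above rather than the attached script, and the test package path is a placeholder:

#!/bin/sh
# Sketch of the failure sequence shown above (not the attached test_install script).
TESTPKG=./test_passthrough-1.0-0.noarch.rpm     # placeholder for a trivial test RPM

tar -czf /backup_rpm.tar.gz -C /var/lib rpm     # back up the RPM DB first

export LD_ASSUME_KERNEL=2.4.19                  # LinuxThreads mode
rpm -Uvh "$TESTPKG"                             # install under LinuxThreads
rpm -qa > /dev/null                             # still fine here

unset LD_ASSUME_KERNEL                          # switch back to NPTL
rpm -qa > /dev/null                             # expect DB_PAGE_NOTFOUND on stderr

rpm -Uvh --replacepkgs "$TESTPKG"               # "succeeds", but corrupts the DB
rpm -qa | wc -l                                 # expect rpmdbNextIterator errors now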
----- Additional Comments From jaredj.com 2004-11-15 15:32 EDT ------- Note my last comment was for RHEL 3, U4 beta. It fails there.
This should be fixed in rpm-4.2.3-16_nonptl headed for RHEL3U5.
---- Additional Comments From khoa.com 2005-02-20 17:03 EST ------- Jared - please verify this when U5 is available and close this bug report if possible. Thanks.
changed:
           What    |Removed    |Added
----------------------------------------------------------------------------
           Status  |ACCEPTED   |CLOSED
           Impact  |------     |Functionality

------- Additional Comments From vjo.com 2005-03-21 23:21 EST ------- Tested on RHEL 3 U5 beta 1 - issue has been resolved.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2005-147.html
This bug is still around for me on RHEL 4 -- I ran into a repeatable corrupt rpmdb problem only when "LD_ASSUME_KERNEL=2.4.19" is set. This is a production machine and I can't do much testing, but when LD_ASSUME_KERNEL was set in /etc/profile the rpmdb would get corrupted every time I loaded an rpm or ran up2date. After recovery I removed the LD_ASSUME_KERNEL setting and everything went fine. Putting it back in again resulted in corruption. I'm on rpm-4.3.3-7_nonptl and kernel-smp-2.6.9-5.0.5.EL and all the stuff needed for Oracle 9i.
Regarding comment #98: Please describe exactly the error message you are seeing. Are you seeing: error: db4 error(-30988) from dbcursor->c_get: DB_PAGE_NOTFOUND: Requested page not found Do you get this purely when using rpm, or only when using both rpm and up2date? Does LD_ASSUME_KERNEL=2.4.0 work correctly for you (old LinuxThreads)?
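A sketch of how those questions could be answered without risking the production database, using a throwaway copy of /var/lib/rpm via --dbpath (illustrative only, not from the report):

#!/bin/sh
# Work against a copy of the DB so the production /var/lib/rpm is untouched.
cp -a /var/lib/rpm /root/rpmdb-test

# Is LD_ASSUME_KERNEL being set globally?
grep LD_ASSUME_KERNEL /etc/profile

# Does plain rpm (no up2date involved) show the db4 error under each setting?
LD_ASSUME_KERNEL=2.4.19 rpm --dbpath /root/rpmdb-test -qa > /dev/null
LD_ASSUME_KERNEL=2.4.0  rpm --dbpath /root/rpmdb-test -qa > /dev/null
rpm --dbpath /root/rpmdb-test -qa > /dev/null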