Description of problem:
Multiple install issues which could be kernel/glibc related.
Install hang during the installation of the jadetex package on x86. Filed a separate bug # 127334.
Install hang during install of fileroller. An awk process was taking 100% of user space CPU.
Installations finished after killing kpsewhich (multiple times) and awk. After reboot the system failed to find root since the installer neglected to create the initrd images.
Version-Release number of selected component (if applicable): rel-eng/RHEL3-U3-re0705.2/i386/i386-as
How reproducible: unsure
Steps to Reproduce:
1. Install above distro
Actual results: Failed installs
Expected results:
Additional info:
Created attachment 101666 [details] Anaconda Install.log Riddled with segfaults during install
most likely either kernel or glibc. These are the only significant changes since the last installable i386 tree.
The only kernel change since the last distro is the exec-shield update, so I'm assigning this bug to Ingo. In the meantime, is there any way we could set up an install to use the "noexec=off" kernel boot-up option (in /etc/grub.conf) to see if this makes a difference? If it doesn't, then could we try an install with the latest set of U3 packages except for the substitution of the kernel package with the prior one (version 2.4.21-15.19.EL)? Thanks. -ernie
with noexec=off, i still get the textmode traceback. With graphic install, i can't get X to start with noexec=off. John
Some of the other errors that Bill and I saw on my white box were of this form:
awk: error while loading shared libraries: libm.so.6: ELF load command alignment not page-aligned
awk: error while loading shared libraries: libm.so.6: cannot stat shared object: Error 9
I'm adding Jakub into the mix here, as the only other package that changed besides the kernel was glibc.
Some more findings: I took a U2 machine and upgraded the glibc to 2.3.2-95.22 (leaving the kernel at -15.EL) and all was well. So I upgraded the kernel to -16.EL and rebooted and got the attached panic. I'm about to try backing down glibc on the box and leaving the -16.EL kernel, but I suspect that's going to be "just fine" as that's what's running on the laptop I'm typing at right now and am not seeing any issues.
Created attachment 101682 [details] Kernel panic when booting -16.ELsmp kernel
OK, I lied. Getting the exact same panic even with dropping back to glibc 95.20.
Created attachment 101683 [details] Kernel panic from re0705.2 install on ix86 Kernel panic which pops up during installation. Happens right after USB initialization, so it might be the same thing. Not really sure.
The rpm core dump I was given has:
#0 0x00a6aee6 in fpLookupList (cache=0x87082e8, dirNames=0x871e9b0, baseNames=0x87087e0, dirIndexes=0x871e080, fileCount=2, fpList=0x871e090) at fprint.c:250
250 fpList[i] = doLookup(cache, dirNames[dirIndexes[i]], baseNames[i],
(gdb) l
245 if (i > 0 && dirIndexes[i - 1] == dirIndexes[i]) {
246 fpList[i].entry = fpList[i - 1].entry;
247 fpList[i].subDir = fpList[i - 1].subDir;
248 fpList[i].baseName = baseNames[i];
249 } else {
250 fpList[i] = doLookup(cache, dirNames[dirIndexes[i]], baseNames[i],
251 1);
252 }
253 }
254 }
(gdb) printf "%d %x\n", i, dirIndexes[i]
0 7273752f
(gdb) bt
#0 0x00a6aee6 in fpLookupList (cache=0x87082e8, dirNames=0x871e9b0, baseNames=0x87087e0, dirIndexes=0x871e080, fileCount=2, fpList=0x871e090) at fprint.c:250
#1 0x00a79957 in rpmdbFindFpList (db=0x49040000, fpList=0x86d2c98, matchList=0x86fbc90, numItems=2562) at rpmdb.c:3346
#2 0x008326e9 in rpmtsRun (ts=0x866d8a8, okProbs=0x0, ignoreSet=1224998912) at transaction.c:1195
#3 0x00821633 in rpmInstall (ts=0x866d8a8, ia=0x848ee0, fileArgv=0x86659e8) at rpminstall.c:679
#4 0x0804b5c8 in main (argc=5, argv=0xbfffa854) at rpmqv.c:781
#5 0x0021b79d in __libc_start_main () from /lib/tls/libc.so.6
#6 0x0804a7e1 in _start ()
(gdb) x/1s dirIndexes
0x871e080: "/usr"
(gdb) up
#1 0x00a79957 in rpmdbFindFpList (db=0x49040000, fpList=0x86d2c98, matchList=0x86fbc90, numItems=2562) at rpmdb.c:3346
3346 fpLookupList(fpc, dirNames, baseNames, dirIndexes, num, fps);
(gdb) x/1s dirIndexes
0x871e080: "/usr"
(gdb) down
#0 0x00a6aee6 in fpLookupList (cache=0x87082e8, dirNames=0x871e9b0, baseNames=0x87087e0, dirIndexes=0x871e080, fileCount=2, fpList=0x871e090) at fprint.c:250
250 fpList[i] = doLookup(cache, dirNames[dirIndexes[i]], baseNames[i],
(gdb) p baseNames[i]
$5 = 0x49040000 <Address 0x49040000 out of bounds>
(gdb) p dirNames[0]
$6 = 0x871feb7 "/usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi/Filter/Util/"
(gdb) p dirNames[1]
$7 = 0x871fefd "/usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi/Filter/"
(gdb) p dirNames[10]
$8 = 0x30 <Address 0x30 out of bounds>
(gdb) p dirNames[2]
$9 = 0x871ff3e "/usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi/auto/Filter/Util/Call/"
(gdb) p dirNames[3]
$10 = 0x871ff8e "/usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi/auto/Filter/Util/Exec/"
(gdb) p dirNames[4]
$11 = 0x871ffde "/usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi/auto/Filter/decrypt/"
(gdb) p dirNames[5]
$12 = 0x872002c "/usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi/auto/Filter/tee/"
(gdb) p dirNames[6]
$13 = 0x8720076 "/usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi/"
(gdb) p dirNames[7]
$14 = 0x87200b0 "/usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi/"
(gdb) p dirNames[8]
$15 = 0x87200ea "/usr/share/man/man3/"
(gdb) p dirNames[9]
$16 = 0x0
(gdb) p cache
$17 = 0x87082e8
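A side note on the gdb session above: the value 7273752f printed for dirIndexes[0] is not a plausible array index. Read as four bytes in i386 (little-endian) memory order it spells "/usr", which matches what x/1s dirIndexes shows. This suggests the index array holds string data rather than indexes. A minimal sketch of that decoding, where the helper name decode_le_word is made up for illustration:

```python
import struct

def decode_le_word(value: int) -> bytes:
    """Return the 4 bytes of a 32-bit value in little-endian (i386) memory order."""
    return struct.pack("<I", value)

# Value that gdb printed for dirIndexes[0] in the session above.
word = 0x7273752F
print(decode_le_word(word))
```

This only shows what the memory contains; it does not say how the string got there.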
Jay Turner's panic described in comments 6-8 is a different problem, which seems to be USB related. Jay, please file a different bug on that so that we can use this bug to track the random segfault problem. Thanks. -ernie
the print-fatal-signals=1 /etc/grub.conf kernel boot option will also produce traces of the segfaults, in the kernel log.
I was able to resurrect a system with rel-eng/RHEL3-U3-re0705.1/i386/i386-as installed on it which has kernel 2.4.21-16.EL. I reinstalled most of the critical rpms that reported segfaulting in install.log. Now I'm running cerberus on the system to see how it holds up.
**** Test in progress ****
Wed Jul 7 16:09:36 EDT 2004: SAVE-STATE success: on 1/1 after 1m0s
Wed Jul 7 16:09:35 EDT 2004: TTCP success: on 1/384 after 59s
Wed Jul 7 16:10:27 EDT 2004: CRASHME success: on 1/64 after 1m51s
Wed Jul 7 16:10:41 EDT 2004: TTCP success: on 2/384 after 2m5s
Wed Jul 7 16:11:13 EDT 2004: CRASHME success: on 2/64 after 2m37s
Wed Jul 7 16:11:35 EDT 2004: TTCP success: on 3/384 after 2m59s
Wed Jul 7 16:11:49 EDT 2004: FIFOS_MMAP success: on 1/32 after 3m13s
Wed Jul 7 16:12:01 EDT 2004: CRASHME success: on 3/64 after 3m25s
Wed Jul 7 16:12:40 EDT 2004: TTCP success: on 4/384 after 4m4s
Wed Jul 7 16:12:53 EDT 2004: CRASHME success: on 4/64 after 4m17s
Very early still but no segfaults yet... I'm going to log in from home later and see if I can install the -BOOT kernel and see if that shows the segfaults.
Created attachment 101696 [details] log of installer crash
the previous attachment is a serial console capture of the bootup into text install of RHEL3-U3-re0705.2-i386-as-disc1-ftp.iso, using print-fatal-signals=1. It shows one instance of gzip segfaulting on a stack EIP address (huh?). Python also throws an exception, the python /proc/<PID>/maps file is in the file too. In particular it shows:
b75f4000-b75fa000 rw-p 00001000 00:00 0
bfff3000-bfffe000 rwxp ffff4000 00:00 0
bfffe000-c0000000 rw-p fffff000 00:00 0
which AFAIK means that glibc used mprotect() to mark the stack executable. (but did so only from the point where glibc lives, so the topmost two pages (aux/arg data) are still non-executable.)
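For anyone reading along, a minimal sketch of how the three maps lines quoted above can be checked mechanically. It parses the /proc/<PID>/maps format and lists anonymous mappings that carry the x permission bit. The names Mapping and parse_maps are invented for this example, and the input is just the excerpt from the comment, not the full attachment.

```python
from typing import List, NamedTuple

class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    path: str

def parse_maps(text: str) -> List[Mapping]:
    """Parse lines in /proc/<PID>/maps format into Mapping tuples."""
    mappings = []
    for line in text.strip().splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # not a maps line
        start_s, end_s = fields[0].split("-")
        path = fields[5].strip() if len(fields) > 5 else ""
        mappings.append(Mapping(int(start_s, 16), int(end_s, 16), fields[1], path))
    return mappings

# Excerpt quoted in the comment above.
MAPS = """\
b75f4000-b75fa000 rw-p 00001000 00:00 0
bfff3000-bfffe000 rwxp ffff4000 00:00 0
bfffe000-c0000000 rw-p fffff000 00:00 0
"""

# Anonymous (no backing file) mappings that are executable.
exec_anon = [m for m in parse_maps(MAPS) if "x" in m.perms and not m.path]
for m in exec_anon:
    size_kib = (m.end - m.start) // 1024
    print(f"{m.start:08x}-{m.end:08x} {m.perms} {size_kib} KiB")
```

On this excerpt the only hit is the bfff3000-bfffe000 region; the two pages above it up to c0000000 stay rw-p, which is the split described in the comment.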
Created attachment 101699 [details] kernel log with exec-shield=0 This attachment is a serial log of the installer incidents with exec-shield=0 specified on the kernel command line, plus anaconda's maps file. there seem to be more segfaults in this mode, and these too are all on the stack.
i've recreated the disc1 ISO of re0705.2 using the vmlinuz from re0704.0, and this one boots up fine and i don't get the anaconda exception (and segfaults). so it's the -16.EL kernel. The 2.4.21-15.19.EL vmlinuz works. since booting with exec-shield=0 doesn't solve the problem, it must be some indirect effect of the exec-shield patch. using the -16.EL BOOT kernel on two testboxes didn't trigger any problems. So it must be some interaction between the installer environment and the exec-shield patch.
JohnF wrote the following in a message to ship-list:
> Mike McLean is spinning a special test tree now that backs the
> kernel down to 2.4.21-15.19.EL which was in the 0704.0 tree that
> installed fine. This should help provide some more data on whether
> this is kernel or glibc that's affecting the environment.
At TimB's request, I have just kicked off a Beehive build that uses the 2.4.21-16.EL kernel (with the new exec-shield update) but disables exec shield protection by default internally to the kernel. This test kernel is being built in 3.0E-scratch and will have kernel version 2.4.21-16.EL.ernie.exec_shield_off. If the test distro that John describes (using -15.19.EL) installs cleanly, then we would like to try the exact same test built with the -16.EL.ernie.exec_shield_off kernel (i.e., with the rest of the distro being the latest set of U3 packages outside of the kernel). I'll attach the patch that represents the differences between the -16.EL kernel and the -16.EL.ernie.exec_shield_off kernel in the following comment.
Created attachment 101702 [details] patch for -16.EL.ernie.exec_shield_off kernel
So, after a mid-air collision between Ingo and me updating this bug, I see that he has already gathered the test result that we wanted from the special test distro that MikeM is creating. Ingo, could you please try this again with the -16.EL.ernie.exec_shield_off when it is ready? (Or, if you don't feel like waiting for Beehive, you could apply the patch above to your own -16.EL test kernel and drop it in place.)
i created another ISO too, using 2.4.21-15.19.EL_ingoBOOT - which was the last pre-merge beehive build i did. this kernel boots fine and the install doesn't produce a single segfault! So i believe it's some change between my 'final' exec-shield patch and the one that got into -16.EL.
Created attachment 101705 [details] change from -15.19.EL_ingo to -16.EL Here are the changes between Ingo's 2.4.21-15.19.EL_ingo kernel in 3.0E-scratch and the -16.EL kernel in 3.0E-kernel. -ernie
So ... the only real difference was the HT-stack-jitter code that was added at the last minute. Ernie, let's zap it... I can try any resulting BOOT vmlinuz within 10 minutes.
I've just kicked off a new Beehive test build with the HT-stack-jitter change removed (from the -16.EL base). This kernel does have exec shield enabled by default, and it is being built in 3.0E-scratch as kernel version 2.4.21-16.EL.ernie.no_sp_jitter. I'll attach the patch in the following comment. -ernie
Created attachment 101706 [details] patch from -16.EL to -16.EL.ernie.no_sp_jitter test kernel
good news: -16.EL.ernie.no_sp_jitter is a winner it seems. I created a custom boot.iso using the vmlinuz from no_sp_jitter-BOOT and it booted up fine and the installer didn't show any of the segfaults and exceptions that the -16.EL kernel does. I could continue past the initial stage of installation and do an installation without any problems. (To double-check things i also created a boot.iso using the same method but this time taking -16.EL-BOOT's vmlinuz - the install showed the same instabilities as before.)
Can anyone understand why though? The arch_align_stack subtracts 0 up to 32K from sp, the HT jitter subtracts 0 up to 32K in addition to that (arch_align_stack is done twice, once in setup_arg_pages and once in create_elf_tables) Still, I believe many earlier kernels subtracted 0 up to 2M and it (mostly) worked - there were just issues with LinuxThreads which sets RLIMIT_STACK to 2M and so it could have basically no stack at all. But 64K is far smaller than 2M. BTW, I would sleep much easier in the night if U3 had PROT_GROWSDOWN, then ld.so doesn't have to try many mprotects and split the stack VMA into sub-vmas. I'll attach the patch.
Created attachment 101712 [details] linux-2.4.21-prot-growsdown.patch
Why it fails is a mystery to me too. the prot-growsdown.patch looks good to me - but we are awfully late in the game.
Just to check i've built kernel-BOOT-2.4.21-16.EL_ingo.i386.rpm, which is 16.EL + the prot-growsdown.patch. i created a boot.iso from it - this boot.iso throws an exception too, it's a similar gunzip segfault as we had before:
gunzip/67: potentially unexpected fatal signal 11.
userspace code at bffecf49: 52 8d bb 7c 03 ff ff 8d b3 88 03 ff ff 89 8b 34
Pid/TGid: 67/67, comm: gunzip
EIP: 0023:[<bffecf49>] CPU: 0
EIP is at (2.4.21-16.EL_ingoBOOT)
ESP: 002b:bfffdf80 EFLAGS: 00010282 Not tainted
EAX: 080a7000 EBX: bfffecc4 ECX: bfff8668 EDX: bfffe070
ESI: bfffe114 EDI: bffffd9e EBP: bfffe0e0 DS: 002b ES: 002b FS: 0000 GS: 0000
CR0: 8005003b CR2: bfffdf7c CR3: 34f72000 CR4: 000006d0
Call Trace:
Created attachment 101714 [details] print maps and ulimits at fault time I've attached print-maps.patch which enhances print-fatal-signals to also print out the current memory-maps layout of the faulting process to the kernel log (output is identical to that of /proc/<PID>/maps). The patch also prints out the current ulimits of the process. I'm currently building a boot kernel with this patch applied, and will re-run the tests, which will hopefully result in more debugging info.
ok, the real bug is that in the unlimited RLIMIT_STACK case we map ld.so _just below_ the stack. This is horribly broken. Ernie?
Created attachment 101715 [details] boot log and detailed trace of the installer failure
here's the full bootlog, including the maps file and the rlimit values. In particular:
bffec000-bfffe000 r-xp 00000000 07:00 62859480 /mnt/runtime/lib/ld-2.3.2.so
bfffe000-bffff000 rw-p 00011000 07:00 62859480 /mnt/runtime/lib/ld-2.3.2.so
bffff000-c0000000 rwxp 00000000 00:00 0
STACK: cur: ffffffff [max: ffffffff]
so we end up having a 4K stack only ...
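To make the "4K stack only" arithmetic explicit, here is a small sketch that takes the three maps lines quoted above and computes the size of the topmost mapping and the gap to the mapping directly below it. It assumes the topmost mapping in the excerpt is the stack, as the comment states. The helper name parse_ranges is hypothetical.

```python
# Excerpt quoted in the comment above.
MAPS = """\
bffec000-bfffe000 r-xp 00000000 07:00 62859480 /mnt/runtime/lib/ld-2.3.2.so
bfffe000-bffff000 rw-p 00011000 07:00 62859480 /mnt/runtime/lib/ld-2.3.2.so
bffff000-c0000000 rwxp 00000000 00:00 0
"""

def parse_ranges(text):
    """Return (start, end, name) tuples sorted by start address."""
    ranges = []
    for line in text.strip().splitlines():
        fields = line.split()
        start, end = (int(x, 16) for x in fields[0].split("-"))
        name = fields[5] if len(fields) > 5 else "[anon]"
        ranges.append((start, end, name))
    return sorted(ranges)

ranges = parse_ranges(MAPS)
# Assumption: the highest mapping is the stack, the one below is its neighbour.
stack_start, stack_end, _ = ranges[-1]
below_start, below_end, below_name = ranges[-2]

stack_size = stack_end - stack_start   # bytes currently mapped for the stack
gap = stack_start - below_end          # unmapped bytes between neighbour and stack

print(f"stack size: {stack_size} bytes")
print(f"gap to {below_name}: {gap} bytes")
```

The stack mapping is a single 4096-byte page, and ld.so's writable segment ends exactly where the stack begins, so there is no unmapped space between the two.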
narrowed it down a bit further: to reproduce the bug do this on any -16.EL-ish kernel:
ulimit -s unlimited
i386 /bin/ls
i.e. the last-minute arch_get_unmapped_area() changes that were intended to add the robustness of taroon's mapping method ended up borking this case.
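The reproduction above can also be scripted. The following is a rough sketch rather than a tested reproducer for the -16.EL kernel. It raises the soft RLIMIT_STACK of a child process to the hard limit before exec, which is the same as 'ulimit -s unlimited' only when the hard limit is itself unlimited. The function names are made up for this example and the i386 personality wrapper is left out.

```python
import resource
import subprocess
import sys

def raise_stack_limit():
    """Runs in the child before exec: raise the soft stack limit to the hard limit."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))

def run_with_max_stack(argv):
    """Run argv with the soft RLIMIT_STACK raised as far as the hard limit allows."""
    return subprocess.run(argv, preexec_fn=raise_stack_limit,
                          capture_output=True, text=True)

# Harmless demo command; substitute the binary under test.
result = run_with_max_stack([sys.executable, "-c", "print('ok')"])
print(result.returncode, result.stdout.strip())
```

On Linux a negative returncode from subprocess.run means the child was killed by a signal (-11 for SIGSEGV), which is what an affected kernel would be expected to show for binaries that hit this layout. To follow the comment more closely, pass ["i386", "/bin/ls"] as argv on a machine where the i386 wrapper is installed.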
another detail: stage2.img's /bin/gunzip binary doesn't have PT_GNU_STACK, that's why we fall back to the 'stock' taroon layout. gunzip didn't crash in the installer shell because that shell has a default 10 MB RLIMIT_STACK ulimit. Change it to 'ulimit -s unlimited' and the crashes can be reproduced.
Created attachment 101718 [details] fix topdown mmap allocator when ulimit -s unlimited did yet another review of the ulimit magic taroon does and found the bug this time. Oneliner patch attached.
The reason why the installer failed in such colorful ways was the following: mmap() allocated ld.so's mapping just below the current stack, essentially merging the two memory areas - leaving no guard area between the top of mmaps and the stack. When the stack underflowed it slowly overwrote ld.so's bss/data areas ... resulting in all sorts of weird ld.so failures and crashes ...
I think the print_maps part of the print-maps-2.4.21-CVS-A4 patch would be very useful to have in future kernels, maybe enabled with print-fatal-signals=2 or something like that. Especially when the address space is randomized, userspace backtrace and/or register dump is often not very useful without maps.
I have built a boot kernel with topdown-fix.patch applied and have recreated the boot.iso image, and the installer doesn't segfault anymore. Ernie, if you agree with topdown-fix.patch i'd suggest building an official -17.EL kernel.
I have also followed up an observation mentioned earlier in the thread: "with noexec=off, i still get the textmode traceback. With graphic install, i can't get X to start with noexec=off." i have checked with this latest boot.iso that both the graphical and the text install work fine, with and without noexec=off. the reason for the earlier failures is well understood as well: with noexec=off we get more mappings at the end of the address space, and the kernel bug that created a too-small stack bit harder. More early applications (such as ddcdetect) would segfault, causing the graphical install to fail.
A fix for this problem has just been committed to the RHEL3 U3 patch pool this afternoon (in kernel version 2.4.21-17.EL).
Closing this one out, as we've not had any installation issues with the recent kernels (at least none that I know about).
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-433.html