Bug 127341 - general install flakiness
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i386 Linux
Priority: medium, Severity: medium
Assigned To: Ingo Molnar
Blocks: 116727
Reported: 2004-07-06 15:55 EDT by Bill Peck
Modified: 2007-11-30 17:07 EST
Doc Type: Bug Fix
Last Closed: 2004-08-19 11:35:34 EDT

Attachments
Anaconda Install.log (63.96 KB, text/plain)
2004-07-06 16:41 EDT, Bill Peck
Kernel panic when booting -16.ELsmp kernel (7.95 KB, text/plain)
2004-07-07 09:49 EDT, Jay Turner
Kernel panic from re0705.2 install on ix86 (1.08 KB, text/plain)
2004-07-07 10:32 EDT, Jay Turner
log of installer crash (26.16 KB, text/plain)
2004-07-07 17:49 EDT, Ingo Molnar
kernel log with exec-shield=0 (25.59 KB, text/plain)
2004-07-07 18:06 EDT, Ingo Molnar
patch for -16.EL.ernie.exec_shield_off kernel (1.36 KB, patch)
2004-07-07 20:24 EDT, Ernie Petrides
change from -15.19.EL_ingo to -16.EL (1.62 KB, patch)
2004-07-07 20:57 EDT, Ernie Petrides
patch from -16.EL to -16.EL.ernie.no_sp_jitter test kernel (858 bytes, patch)
2004-07-07 21:37 EDT, Ernie Petrides
linux-2.4.21-prot-growsdown.patch (1.35 KB, patch)
2004-07-08 03:13 EDT, Jakub Jelinek
print maps and ulimits at fault time (3.05 KB, patch)
2004-07-08 05:05 EDT, Ingo Molnar
boot log and detailed trace of the installer failure (7.04 KB, text/plain)
2004-07-08 05:33 EDT, Ingo Molnar
fix topdown mmap allocator when ulimit -s unlimited (408 bytes, patch)
2004-07-08 06:13 EDT, Ingo Molnar

Description Bill Peck 2004-07-06 15:55:54 EDT
Description of problem:
Multiple install issues which could be kernel/glibc related.

Install hang during the installation of the jadetex package on x86. Filed
a separate bug, #127334.
Install hang during install of fileroller. An awk process was taking 100%
of user-space CPU.

Installations finished after killing kpsewhich (multiple times) and
awk.  After reboot the system failed to find root, since the installer
neglected to create the initrd images.

Version-Release number of selected component (if applicable):
rel-eng/RHEL3-U3-re0705.2/i386/i386-as

How reproducible:
unsure

Steps to Reproduce:
1. Install above distro
  
Actual results:
Failed installs

Expected results:


Additional info:
Comment 1 Bill Peck 2004-07-06 16:41:03 EDT
Created attachment 101666 [details]
Anaconda Install.log

Riddled with segfaults during install
Comment 2 Mike McLean 2004-07-06 16:45:16 EDT
most likely either kernel or glibc.  These are the only significant
changes since the last installable i386 tree.
Comment 3 Ernie Petrides 2004-07-06 17:00:22 EDT
The only kernel change since the last distro is the exec-shield
update, so I'm assigning this bug to Ingo.  In the meantime, is
there any way we could set up an install to use the "noexec=off"
kernel boot-up option (in /etc/grub.conf) to see if this makes
a difference?  If it doesn't, then could we try an install with
the latest set of U3 packages except for the substitution of the
kernel package with the prior one (version 2.4.21-15.19.EL)?

Thanks.  -ernie
Comment 4 John Flanagan 2004-07-07 08:48:39 EDT
with noexec=off, i still get the textmode traceback.  With graphic
install, i can't get X to start with noexec=off.

John
Comment 5 John Flanagan 2004-07-07 08:52:21 EDT
Some of the other errors that Bill and I saw on my white box were of
this form:

awk: error while loading shared libraries: libm.so.6: ELF load command
alignment not page-aligned

awk: error while loading shared libraries: libm.so.6: cannot stat
shared object: Error 9

I'm adding Jakub into the mix here, as the only other package that
changed besides the kernel was glibc.

Comment 6 Jay Turner 2004-07-07 09:48:35 EDT
Some more findings:

I took a U2 machine and upgraded the glibc to 2.3.2-95.22 (leaving the
kernel at -15.EL) and all was well.  So I upgraded the kernel to
-16.EL and rebooted and got the attached panic.  I'm about to try
backing down glibc on the box and leaving the -16.EL kernel, but I
suspect that's going to be "just fine" as that's what's running on the
laptop I'm typing at right now and am not seeing any issues.
Comment 7 Jay Turner 2004-07-07 09:49:25 EDT
Created attachment 101682 [details]
Kernel panic when booting -16.ELsmp kernel
Comment 8 Jay Turner 2004-07-07 09:58:44 EDT
OK, I lied.  Getting the exact same panic even with dropping back to
glibc 95.20.
Comment 9 Jay Turner 2004-07-07 10:32:37 EDT
Created attachment 101683 [details]
Kernel panic from re0705.2 install on ix86

Kernel panic which pops up during installation.  Happens right after USB
initialization, so it might be the same thing.  Not really sure.
Comment 10 Jakub Jelinek 2004-07-07 11:42:06 EDT
The rpm core dump I was given has:

#0  0x00a6aee6 in fpLookupList (cache=0x87082e8, dirNames=0x871e9b0, baseNames=0x87087e0, dirIndexes=0x871e080, fileCount=2,
    fpList=0x871e090) at fprint.c:250
250                 fpList[i] = doLookup(cache, dirNames[dirIndexes[i]], baseNames[i],
(gdb) l
245             if (i > 0 && dirIndexes[i - 1] == dirIndexes[i]) {
246                 fpList[i].entry = fpList[i - 1].entry;
247                 fpList[i].subDir = fpList[i - 1].subDir;
248                 fpList[i].baseName = baseNames[i];
249             } else {
250                 fpList[i] = doLookup(cache, dirNames[dirIndexes[i]], baseNames[i],
251                                      1);
252             }
253         }
254     }
(gdb) printf "%d %x\n", i, dirIndexes[i]
0 7273752f
(gdb) bt
#0  0x00a6aee6 in fpLookupList (cache=0x87082e8, dirNames=0x871e9b0, baseNames=0x87087e0, dirIndexes=0x871e080, fileCount=2,
    fpList=0x871e090) at fprint.c:250
#1  0x00a79957 in rpmdbFindFpList (db=0x49040000, fpList=0x86d2c98, matchList=0x86fbc90, numItems=2562) at rpmdb.c:3346
#2  0x008326e9 in rpmtsRun (ts=0x866d8a8, okProbs=0x0, ignoreSet=1224998912) at transaction.c:1195
#3  0x00821633 in rpmInstall (ts=0x866d8a8, ia=0x848ee0, fileArgv=0x86659e8) at rpminstall.c:679
#4  0x0804b5c8 in main (argc=5, argv=0xbfffa854) at rpmqv.c:781
#5  0x0021b79d in __libc_start_main () from /lib/tls/libc.so.6
#6  0x0804a7e1 in _start ()
(gdb) x/1s dirIndexes
0x871e080:       "/usr"
(gdb) up
#1  0x00a79957 in rpmdbFindFpList (db=0x49040000, fpList=0x86d2c98, matchList=0x86fbc90, numItems=2562) at rpmdb.c:3346
3346            fpLookupList(fpc, dirNames, baseNames, dirIndexes, num, fps);
(gdb) x/1s dirIndexes
0x871e080:       "/usr"
(gdb) down
#0  0x00a6aee6 in fpLookupList (cache=0x87082e8, dirNames=0x871e9b0, baseNames=0x87087e0, dirIndexes=0x871e080, fileCount=2,
    fpList=0x871e090) at fprint.c:250
250                 fpList[i] = doLookup(cache, dirNames[dirIndexes[i]], baseNames[i],
(gdb) p baseNames[i]
$5 = 0x49040000 <Address 0x49040000 out of bounds>
(gdb) p dirNames[0]
$6 = 0x871feb7 "/usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi/Filter/Util/"
(gdb) p dirNames[1]
$7 = 0x871fefd "/usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi/Filter/"
(gdb) p dirNames[10]
$8 = 0x30 <Address 0x30 out of bounds>
(gdb) p dirNames[2]
$9 = 0x871ff3e "/usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi/auto/Filter/Util/Call/"
(gdb) p dirNames[3]
$10 = 0x871ff8e "/usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi/auto/Filter/Util/Exec/"
(gdb) p dirNames[4]
$11 = 0x871ffde "/usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi/auto/Filter/decrypt/"
(gdb) p dirNames[5]
$12 = 0x872002c "/usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi/auto/Filter/tee/"
(gdb) p dirNames[6]
$13 = 0x8720076 "/usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi/"
(gdb) p dirNames[7]
$14 = 0x87200b0 "/usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi/"
(gdb) p dirNames[8]
$15 = 0x87200ea "/usr/share/man/man3/"
(gdb) p dirNames[9]
$16 = 0x0
(gdb) p cache
$17 = 0x87082e8
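
One detail of the session above is easier to see once the bogus index value is decoded. gdb prints dirIndexes[0] as 0x7273752f, and x/1s on the same address shows "/usr". The following is a minimal sketch, assuming the little-endian byte order of the i386 target; the helper name is made up for illustration and is not part of rpm.

```python
def int32_as_ascii(value):
    """Reinterpret a 32-bit integer as 4 little-endian bytes (i386 byte order)."""
    return value.to_bytes(4, "little").decode("ascii")


# dirIndexes[0] as printed by gdb in the session above
print(int32_as_ascii(0x7273752F))  # prints "/usr"
```

So the array that should hold small integer indexes appears to hold path text instead, which would be consistent with the out-of-bounds lookup at fprint.c:250.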
Comment 11 Ernie Petrides 2004-07-07 15:23:01 EDT
The panic Jay Turner describes in comments 6-8 is a different
problem, which seems to be USB related.  Jay, please file a
different bug on that so that we can use this bug to track
the random segfault problem.

Thanks.  -ernie
Comment 12 Ingo Molnar 2004-07-07 16:03:04 EDT
the print-fatal-signals=1 /etc/grub.conf kernel boot option will also
produce traces of the segfaults, in the kernel log.
Comment 13 Bill Peck 2004-07-07 16:44:38 EDT
I was able to resurrect a system with
rel-eng/RHEL3-U3-re0705.1/i386/i386-as installed on it which has
kernel 2.4.21-16.EL.

I reinstalled most of the critical rpms that reported segfaulting in
install.log.

Now I'm running cerberus on the system to see how it holds up.
**** Test in progress ****
Wed Jul  7 16:09:36 EDT 2004: SAVE-STATE success: on 1/1 after 1m0s
Wed Jul  7 16:09:35 EDT 2004: TTCP success: on 1/384 after 59s
Wed Jul  7 16:10:27 EDT 2004: CRASHME success: on 1/64 after 1m51s
Wed Jul  7 16:10:41 EDT 2004: TTCP success: on 2/384 after 2m5s
Wed Jul  7 16:11:13 EDT 2004: CRASHME success: on 2/64 after 2m37s
Wed Jul  7 16:11:35 EDT 2004: TTCP success: on 3/384 after 2m59s
Wed Jul  7 16:11:49 EDT 2004: FIFOS_MMAP success: on 1/32 after 3m13s
Wed Jul  7 16:12:01 EDT 2004: CRASHME success: on 3/64 after 3m25s
Wed Jul  7 16:12:40 EDT 2004: TTCP success: on 4/384 after 4m4s
Wed Jul  7 16:12:53 EDT 2004: CRASHME success: on 4/64 after 4m17s

Very early still but no segfaults yet...

I'm going to log in from home later and see if I can install the -BOOT
kernel and see if that shows the segfaults.
Comment 14 Ingo Molnar 2004-07-07 17:49:50 EDT
Created attachment 101696 [details]
log of installer crash
Comment 15 Ingo Molnar 2004-07-07 17:53:26 EDT
the previous attachment is a serial console capture of the bootup into
text install of RHEL3-U3-re0705.2-i386-as-disc1-ftp.iso, using
print-fatal-signals=1.

It shows one instance of gzip segfaulting on a stack EIP address
(huh?). Python also throws an exception, the python /proc/<PID>/maps
file is in the file too. In particular it shows:

b75f4000-b75fa000 rw-p 00001000 00:00 0
bfff3000-bfffe000 rwxp ffff4000 00:00 0
bfffe000-c0000000 rw-p fffff000 00:00 0

which AFAIK means that glibc used mprotect() to mark the stack
executable. (but did so only from the point where glibc lives, so the
topmost two pages (aux/arg data) are still non-executable.)
Comment 16 Ingo Molnar 2004-07-07 18:06:01 EDT
Created attachment 101699 [details]
kernel log with exec-shield=0

This attachment is a serial log of the installer incidents with exec-shield=0
specified on the kernel command line, plus anaconda's maps file.

there seem to be more segfaults in this mode, and these too are all on the
stack.
Comment 17 Ingo Molnar 2004-07-07 20:17:44 EDT
i've recreated the disc1 ISO of re0705.2 using the vmlinuz from
re0704.0, and this one boots up fine and i dont get the anaconda
exception (and segfaults).

so it's the -16.EL kernel. The 2.4.21-15.19.EL vmlinuz works.

since booting with exec-shield=0 doesnt solve the problem, it must be
some indirect effect of the exec-shield patch.

using the -16.EL BOOT kernel on two testboxes didnt trigger any
problems. So it must be some interaction between the installer
environment and the exec-shield patch.
Comment 18 Ernie Petrides 2004-07-07 20:22:44 EDT
JohnF wrote the following in a message to ship-list:

> Mike McLean is spinning a special test tree now that backs the
> kernel down to 2.4.21-15.19.EL which was in the 0704.0 tree that
> installed fine.  This should help provide some more data on whether
> this is kernel or glibc that's affecting the environment.

At TimB's request, I have just kicked off a Beehive build that
uses the 2.4.21-16.EL kernel (with the new exec-shield update) but
disables exec shield protection by default internally to the kernel.
This test kernel is being built in 3.0E-scratch and will have kernel
version 2.4.21-16.EL.ernie.exec_shield_off.

If the test distro that John describes (using -15.19.EL) installs
cleanly, then we would like to try the exact same test built with
the -16.EL.ernie.exec_shield_off kernel (i.e., with the rest of the
distro being the latest set of U3 packages outside of the kernel).

I'll attach the patch that represents the differences between the
-16.EL kernel and the -16.EL.ernie.exec_shield_off kernel in the
following comment.
Comment 19 Ernie Petrides 2004-07-07 20:24:00 EDT
Created attachment 101702 [details]
patch for -16.EL.ernie.exec_shield_off kernel
Comment 20 Ernie Petrides 2004-07-07 20:28:02 EDT
So, after a mid-air collision between Ingo and me updating this
bug, I see that he has already gathered the test result that we
wanted from the special test distro that MikeM is creating.

Ingo, could you please try this again with the
-16.EL.ernie.exec_shield_off when it is ready?
(Or, if you don't feel like waiting for Beehive,
you could apply the patch above to your own -16.EL
test kernel and drop it in place.)
Comment 21 Ingo Molnar 2004-07-07 20:36:13 EDT
i created another ISO too, using 2.4.21-15.19.EL_ingoBOOT - which was
the last pre-merge beehive build i did.

this kernel boots fine and the install doesnt produce a single segfault!

So i believe it's some change from my 'final' exec-shield patch and
the one that got into -16.EL.
Comment 22 Ernie Petrides 2004-07-07 20:57:26 EDT
Created attachment 101705 [details]
change from -15.19.EL_ingo to -16.EL

Here are the changes between Ingo's 2.4.21-15.19.EL_ingo kernel
in 3.0E-scratch and the -16.EL kernel in 3.0E-kernel.  -ernie
Comment 23 Ingo Molnar 2004-07-07 21:05:19 EDT
So ... the only real difference was the HT-stack-jitter code that was
added in the last minute.

Ernie, lets zap it... I can try any resulting BOOT vmlinuz within 10
minutes.
Comment 24 Ernie Petrides 2004-07-07 21:36:47 EDT
I've just kicked off a new Beehive test build with the HT-stack-jitter
change removed (from the -16.EL base).  This kernel does have exec
shield enabled by default, and it is being built in 3.0E-scratch as
kernel version 2.4.21-16.EL.ernie.no_sp_jitter.

I'll attach the patch in the following comment.  -ernie
Comment 25 Ernie Petrides 2004-07-07 21:37:44 EDT
Created attachment 101706 [details]
patch from -16.EL to -16.EL.ernie.no_sp_jitter test kernel
Comment 26 Ingo Molnar 2004-07-07 22:18:40 EDT
good news: -16.EL.ernie.no_sp_jitter is a winner it seems.

I created a custom boot.iso using the vmlinuz from no_sp_jitter-BOOT
and it booted up fine and the installer didnt show any of the
segfaults and exceptions that the -16.EL kernel does. I could continue
past the initial stage of installation and do an installation without
any problems.

(To double-check things i also created a boot.iso using the same
method but this time taking -16.EL-BOOT's vmlinuz - the install showed
the same instabilities as before.)
Comment 27 Jakub Jelinek 2004-07-08 03:12:07 EDT
Can anyone understand why though?
The arch_align_stack subtracts 0 up to 32K from sp, the HT jitter
subtracts 0 up to 32K in addition to that (arch_align_stack is done twice, once in setup_arg_pages and once in create_elf_tables)
Still, I believe many earlier kernels subtracted 0 up to 2M and it
(mostly) worked - there were just issues with LinuxThreads which sets
RLIMIT_STACK to 2M and so it could have basically no stack at all.
But 64K is far smaller than 2M.

BTW, I would sleep much easier in the night if U3 had PROT_GROWSDOWN,
then ld.so doesn't have to try many mprotects and split the stack VMA
into sub-vmas.
I'll attach the patch.
Comment 28 Jakub Jelinek 2004-07-08 03:13:54 EDT
Created attachment 101712 [details]
linux-2.4.21-prot-growsdown.patch
Comment 29 Ingo Molnar 2004-07-08 03:31:03 EDT
Why it fails is a mystery to me too.

the prot-growsdown.patch looks good to me - but we are awfully late in
the game.
Comment 30 Ingo Molnar 2004-07-08 04:58:00 EDT
Just to check i've built kernel-BOOT-2.4.21-16.EL_ingo.i386.rpm, which
is 16.EL + the prot-growsdown.patch.

i created a boot.iso from it - this boot.iso throws an exception too,
it's a similar gunzip segfault as we had before:

gunzip/67: potentially unexpected fatal signal 11.
userspace code at bffecf49: 52 8d bb 7c 03 ff ff 8d b3 88 03 ff ff 89 8b 34
Pid/TGid: 67/67, comm:               gunzip
EIP: 0023:[<bffecf49>] CPU: 0
EIP is at  (2.4.21-16.EL_ingoBOOT)
 ESP: 002b:bfffdf80 EFLAGS: 00010282    Not tainted
EAX: 080a7000 EBX: bfffecc4 ECX: bfff8668 EDX: bfffe070
ESI: bfffe114 EDI: bffffd9e EBP: bfffe0e0 DS: 002b ES: 002b FS: 0000 GS: 0000
CR0: 8005003b CR2: bfffdf7c CR3: 34f72000 CR4: 000006d0
Call Trace:
Comment 31 Ingo Molnar 2004-07-08 05:05:16 EDT
Created attachment 101714 [details]
print maps and ulimits at fault time

I've attached print-maps.patch which enhances print-fatal-signals to also print
out the current memory-maps layout of the faulting process to the kernel log
(output is identical to that of /proc/<PID>/maps).
The patch also prints out the current ulimits of the process. I'm currently
building a boot kernel with this patch applied, and will re-run the tests,
which will hopefully result in more debugging info.
Comment 32 Ingo Molnar 2004-07-08 05:30:38 EDT
ok, the real bug is that in the unlimited RLIMIT_STACK case we map ld.so
_just below_ the stack. This is horribly broken. Ernie?
Comment 33 Ingo Molnar 2004-07-08 05:33:13 EDT
Created attachment 101715 [details]
boot log and detailed trace of the installer failure

here's the full bootlog, including the maps file and the rlimit values. In
particular:

bffec000-bfffe000 r-xp 00000000 07:00 62859480	 /mnt/runtime/lib/ld-2.3.2.so
bfffe000-bffff000 rw-p 00011000 07:00 62859480	 /mnt/runtime/lib/ld-2.3.2.so
bffff000-c0000000 rwxp 00000000 00:00 0

   STACK: cur: ffffffff  [max: ffffffff]

so we end up having a 4K stack only ...
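
The "4K stack" reading can be checked directly from the three maps lines quoted above. The following is a minimal sketch, assuming the topmost mapping is the stack and the mapping right below it is the nearest neighbour; the function names are made up for illustration.

```python
MAPS = """\
bffec000-bfffe000 r-xp 00000000 07:00 62859480 /mnt/runtime/lib/ld-2.3.2.so
bfffe000-bffff000 rw-p 00011000 07:00 62859480 /mnt/runtime/lib/ld-2.3.2.so
bffff000-c0000000 rwxp 00000000 00:00 0
"""


def parse_maps(text):
    """Return (start, end, perms, path) tuples sorted by start address."""
    vmas = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        start, end = (int(x, 16) for x in fields[0].split("-"))
        path = fields[5] if len(fields) > 5 else ""
        vmas.append((start, end, fields[1], path))
    return sorted(vmas)


def stack_headroom(vmas):
    """Size of the topmost VMA (assumed to be the stack) and the gap below it."""
    stack, below = vmas[-1], vmas[-2]
    return stack[1] - stack[0], stack[0] - below[1]


size, gap = stack_headroom(parse_maps(MAPS))
print(f"stack VMA: {size} bytes, gap below: {gap} bytes")
```

For the lines quoted here this reports a 4096-byte stack with no gap at all before the writable ld.so mapping.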
Comment 34 Ingo Molnar 2004-07-08 05:38:34 EDT
narrowed it down a bit further: to reproduce the bug do this on any
-16.EL-ish kernel:

  ulimit -s unlimited
  i386 /bin/ls

i.e. the last-minute arch_get_unmapped_area() changes that were
intended to add the robustness of taroon's mapping method ended up
borking this case.
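
For scripted testing, the two shell commands above can be approximated from Python. This is only a sketch: it raises the soft stack limit of the child process to the current hard limit, which matches 'ulimit -s unlimited' only when the hard limit is itself unlimited. The function names are made up for illustration, and the i386 wrapper is assumed to be installed.

```python
import resource
import subprocess


def raised_stack_limits():
    """Return (soft, hard) with the soft stack limit raised to the hard limit."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return (hard, hard)


def run_with_raised_stack(cmd):
    """Run cmd in a child whose RLIMIT_STACK soft limit equals the hard limit."""
    limits = raised_stack_limits()

    def set_limit():
        # Runs in the child between fork and exec, so the parent is unaffected.
        resource.setrlimit(resource.RLIMIT_STACK, limits)

    return subprocess.run(cmd, preexec_fn=set_limit).returncode


# Usage, mirroring the commands above:
#   run_with_raised_stack(["i386", "/bin/ls"])
```

On an affected kernel the child would be expected to segfault, which subprocess reports as a negative return code.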
Comment 35 Ingo Molnar 2004-07-08 05:44:18 EDT
another detail: stage2.img's /bin/gunzip binary doesnt have
PT_GNU_STACK, that's why we fall back to the 'stock' taroon layout.

gunzip didnt crash in the installer shell because that shell has a
default 10 MB RLIMIT_STACK ulimit. Change it to 'ulimit -s unlimited'
and the crashes can be reproduced.
Comment 36 Ingo Molnar 2004-07-08 06:13:47 EDT
Created attachment 101718 [details]
fix topdown mmap allocator when ulimit -s unlimited

did yet another review of the ulimit magic taroon does and found the bug this
time. Oneliner patch attached.
Comment 37 Ingo Molnar 2004-07-08 06:44:52 EDT
The reason why the installer failed in such colorful ways was the
following: mmap() allocated ld.so's mapping just below the current
stack, essentially merging the two memory areas - leaving no guard
area between the top of mmaps and the stack.

When the stack underflowed it slowly overwrote ld.so's bss/data areas
... resulting in all sorts of weird ld.so failures and crashes ...
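
The one-line fix itself lives in the attachment and is not reproduced in this thread. As a rough illustration of what a guard area between the top of the mmap region and the stack can look like, the sketch below follows the gap clamping used by the later mainline flexible mmap layout on i386 (a 128 MB minimum gap, a maximum of 5/6 of TASK_SIZE). Whether the RHEL3 code resembles this is an assumption, and none of the constants are taken from this bug.

```python
TASK_SIZE = 0xC0000000            # top of user address space on i386
PAGE_SIZE = 4096
MIN_GAP = 128 * 1024 * 1024       # assumed minimum room reserved for the stack
MAX_GAP = TASK_SIZE // 6 * 5      # assumed cap, so mmap space never vanishes
RLIM_INFINITY = 0xFFFFFFFF        # "unlimited", as shown in the STACK line of comment 33


def mmap_base(stack_rlimit):
    """Highest address handed to a top-down mmap allocator (illustrative only)."""
    gap = min(max(stack_rlimit, MIN_GAP), MAX_GAP)
    return TASK_SIZE - (gap & ~(PAGE_SIZE - 1))


print(hex(mmap_base(10 * 1024 * 1024)))  # 10 MB limit, as in the installer shell
print(hex(mmap_base(RLIM_INFINITY)))     # unlimited stack
```

The point of the clamp is that even an unlimited stack rlimit yields a base well below the stack, so a growing stack hits unmapped memory instead of running into ld.so's data.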
Comment 38 Jakub Jelinek 2004-07-08 07:26:24 EDT
I think the print_maps part of the print-maps-2.4.21-CVS-A4 patch
would be very useful to have in future kernels, maybe enabled with
print-fatal-signals=2 or something like that.
Especially when the address space is randomized, userspace backtrace
and/or register dump is often not very useful without maps.
Comment 39 Ingo Molnar 2004-07-08 07:56:46 EDT
I have built a boot kernel with topdown-fix.patch applied and have
recreated the boot.iso image, and the installer doesnt segfault anymore.

Ernie, if you agree with topdown-fix.patch i'd suggest to build an
official -17.EL kernel.
Comment 40 Ingo Molnar 2004-07-08 08:10:03 EDT
I have also followed up an observation mentioned earlier in the thread:

"with noexec=off, i still get the textmode traceback.  With graphic
install, i can't get X to start with noexec=off."

i have checked with this latest boot.iso that both the graphical and
the text install works fine, with and without noexec=off.

the reason for the earlier failures is well understood as well: with
noexec=off we get more mappings at the end of the address space and
the kernel bug that created a too small stack bit harder. More early
applications (such as ddcdetect) would segfault causing the graphical
install to fail.
Comment 41 Ernie Petrides 2004-07-08 18:56:12 EDT
A fix for this problem has just been committed to the RHEL3 U3
patch pool this afternoon (in kernel version 2.4.21-17.EL).
Comment 42 Jay Turner 2004-08-19 11:35:34 EDT
Closing this one out, as we've not had any installation issues with
the recent kernels (at least none that I know about).
Comment 43 John Flanagan 2004-09-02 00:31:54 EDT
An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-433.html
