Bug 104243 - raid1 install hangs installing kernel
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 2.1
Classification: Red Hat
Component: kernel
Version: 2.1
Hardware: All Linux
Priority: medium
Severity: high
Assigned To: Jason Baron
Brian Brock
Dell requests fix for U3.
Duplicates: 107728 109047
Depends On:
Blocks: 107565
Reported: 2003-09-11 14:07 EDT by James Laska
Modified: 2013-03-06 00:56 EST
Doc Type: Bug Fix
Last Closed: 2004-02-12 15:55:38 EST


Attachments:
kickstart file that reproduces the hangup (5.38 KB, text/plain)
2003-09-11 14:08 EDT, James Laska
dell ks.cfg (3.54 KB, text/plain)
2003-12-10 16:41 EST, Gary Lerhaupt
syslog from stalled install with sysrq info (27.46 KB, text/plain)
2003-12-16 15:20 EST, Gary Lerhaupt
Syslog from system with no QLogic 2342 (11.57 KB, text/plain)
2003-12-17 12:03 EST, Gary Lerhaupt

Description James Laska 2003-09-11 14:07:17 EDT
Anaconda is hanging during the install of the kernel.  The install is a raid1 on
2 ide drives whose configs can be seen below (or in the ks.cfg attached):

raid /boot --level 1 --fstype ext3 --device 0 raid.00 raid.01
raid /var --level 1 --fstype ext3 --device 2 raid.20 raid.21
raid swap --level 1 --fstype ext3 --device 5 raid.30 raid.31
raid swap --level 1 --fstype ext3 --device 6 raid.32 raid.33
raid /tmp --level 1 --fstype ext3 --device 3 raid.10 raid.11
raid / --level 1 --fstype ext3 --device 1 raid.40 raid.41

Further examination shows that the hang is occurring while running the --scripts
of kernel-2.4.9-e.24.i686.rpm.

postinstall scriptlet (using /bin/sh):
cd /boot
ln -sf vmlinuz-2.4.9-e.24 vmlinuz
ln -sf System.map-2.4.9-e.24 System.map
ln -sf module-info-2.4.9-e.24 module-info
[ -x /usr/sbin/module_upgrade ] && /usr/sbin/module_upgrade
[ -x /sbin/mkkerneldoth ] && /sbin/mkkerneldoth
if [ -x /sbin/new-kernel-pkg ] ; then
        /sbin/new-kernel-pkg --mkinitrd --depmod --install 2.4.9-e.24
fi

The process tables show that /sbin/new-kernel-pkg is calling /sbin/mkinitrd
which calls 'tar cf - .'.  The tar command is the command that is hanging.  If
you kill the tar process, the installation continues successfully (however I
gather the initrd is hosed -- the kickstart %post works around this by running
up2date).  I was able to strace the 'tar cf - .' process and see that it was
failing on a "SIGPIPE (Broken Pipe) write(1,....) = ESIGPIPE"

fd 1 is stdout, correct?  Using this, it seems that the offending line of code in
/sbin/mkinitrd is at line #434:

434: (cd $MNTIMAGE; tar cf - .) | (cd $MNTPOINT; tar xf -)
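
For reference, the pipe on line 434 can be exercised outside the installer with a small wrapper that reports the exit status of each side separately. This is only a sketch under stated assumptions: copy_tree is a hypothetical helper, not part of mkinitrd, and it assumes mktemp is available. A plain pipeline only returns the status of the reader, so the writer's status is saved to a temporary file.

```shell
#!/bin/sh
# Hypothetical helper (not part of mkinitrd): copy a directory tree through
# a tar pipe, as mkinitrd line 434 does, and fail if either side fails.
copy_tree() {
    src=$1
    dst=$2
    status_file=$(mktemp) || return 1

    # Left side: archive the tree and record tar's exit status in a file,
    # because the pipeline itself only reports the reader's status.
    { (cd "$src" && tar cf - .); echo $? > "$status_file"; } |
        (cd "$dst" && tar xf -)
    reader=$?

    writer=$(cat "$status_file")
    rm -f "$status_file"

    if [ "$writer" -ne 0 ] || [ "$reader" -ne 0 ]; then
        echo "copy_tree: writer exited $writer, reader exited $reader" >&2
        return 1
    fi
    return 0
}
```

Calling copy_tree "$MNTIMAGE" "$MNTPOINT" would then show which side failed. A writer status above 128 usually means that tar was killed by a signal such as SIGPIPE.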

Further examination of the tar process shows that its $HOME is /tmp (on the
installed system, e.g. /mnt/sysimage/tmp).  Considering that /tmp is a raid
volume, this might be an issue.  When /tmp is removed from the kickstart file as
a raid volume, the hang does not occur.

So I'm not sure who to assign this one to (kernel, mkinitrd, tar, raid)? 
According to nate@redhat.com this bug is easily reproducible on a Dell650 with 2
ide drives and somewhat random on another system with scsi storage.

Thoughts, comments?
Comment 3 Jeremy Katz 2003-10-14 16:26:22 EDT
This isn't mkinitrd's fault... it's just running tar to a pipe and then the tar
process is getting SIGPIPEs in an endless loop.
Comment 4 Jeremy Katz 2003-10-22 12:46:51 EDT
*** Bug 107728 has been marked as a duplicate of this bug. ***
Comment 5 Narsi Subramanian 2003-10-24 10:14:35 EDT
An update from SUN

- Original Problem:
  We are attempting to install Redhat AS 2.1 QU2 on a Sun Fire V60x machine 
  (dual Xeon 2.8GHz, 1Gb RAM, 2 e1000 network interfaces) over the network
  (Kickstart). Everything works up to the point where the kernel RPM is unpacked,
  about 1m 10s after starting to install RPMS. The RPM is unpacked (the 
  progress bar goes to 100%), then the installation hangs. We believe this is 
  related to the dual network interfaces. We have encountered this on other 
  systems, and the workaround has always been to remove the extra (up to 4 extra 
  interfaces on some machines) network interfaces, and this has always caused
  installation to run smoothly. However, the workaround is not possible here, 
  since the built-in interfaces cannot be disabled individually. Anaconda is 
  anaconda-7.2-68_ELAS.

- New Update:
  By dropping this new kernel RPM (2.4.9-e.27.6) into the network RPM 
  repository and regenerating the hdlist file, I am now able to install 
  without any problems. As an added bonus, we can now skip the postinstall 
  kernel update.
  However, the original problem is still there, and as far as I can see,
  plain-vanilla non-hacked RHAS 2.1 QU 2 can't be installed by Kickstart
  on a V60x.

Comment 6 Jeremy Katz 2003-11-04 16:09:48 EST
*** Bug 73414 has been marked as a duplicate of this bug. ***
Comment 7 Jeremy Katz 2003-11-04 16:10:45 EST
*** Bug 109047 has been marked as a duplicate of this bug. ***
Comment 8 Larry Troan 2003-11-06 09:52:58 EST
Bug 109047 was opened against Dell Issue Tracker 28914 which is a sev
1 and was DUP'd to this bug. 

Dell requests this be fixed for U3, though it came in after the MUSTFIX
deadline and so is not marked as a Blocker bug. 
Updating this Bug severity to HIGH to reflect sev 1.
Comment 9 Gary Lerhaupt 2003-12-10 16:40:46 EST
I can reproduce this issue 100% of the time on RHEL2.1 U3 beta on a 
Dell PowerEdge 6650 with a QLogic 2342 card inserted, using the ks.cfg 
file (linux ks=floppy install) that I will attach.
Comment 10 Gary Lerhaupt 2003-12-10 16:41:52 EST
Created attachment 96456
dell ks.cfg
Comment 12 Jay Turner 2003-12-12 10:55:44 EST
Gary, we don't have any 2342 cards in either Centennial or Westford
(at least none that we can find) but I've tried an install with a 2312 card
and am not seeing any problems.  Can you provide some more details
about how the card is configured (point-to-point, loop)?  Also, I'm
assuming that you're putting all of the partitions on the connected array?
Comment 15 Larry Troan 2003-12-15 10:06:41 EST
FROM BUGZILLA 109047 (marked as DUP of this Bug). 
------- Additional Comment #5 From Gary Lerhaupt on 2003-11-04 16:39 -------

A couple housekeeping questions. 

First, where it says "Resolved" above, I assume this means that this 
specific bugzilla is resolved as a duplicate, not that the 
underlying issue is resolved (since the underlying issue is still in 
the new state).

Secondly, is 104243 marked as a MUSTFIX for Q3?


------- Additional Comment #6 From Gary Lerhaupt on 2003-11-06 12:37 -------

Updating the severity.  Any feedback on my questions above?


------- Additional Comment #7 From Larry Troan on 2003-12-15 10:03 -------

"Resolved" does mean DUP of 104243 -- not underlying problem resolved.

Bug 104243 is not a MUSTFIX for U3. Neither was 109047 as it came in
beyond the Engineering cutoff... Sue Denham is discussing with Dale at
Dell about the criticality of getting this resolved and into Update 1.

This problem is being tracked by Bug 104243.
Comment 16 Gary Lerhaupt 2003-12-15 10:18:05 EST
There are no cables attached to the 2342 during installation.  All 
partitions are on the local disk.  Apparently this has been 
replicated on U3 beta 2.
Comment 17 Gary Lerhaupt 2003-12-15 14:39:23 EST
What change was made in U3 beta 2 that was thought to address this?
Comment 18 Gary Lerhaupt 2003-12-15 14:59:17 EST
I just found this in /mnt/sysimage/tmp/install.log

Installing kernel
tar: error while loading shared libraries: libredhat-kernel.so.1: 
cannot open shared object file: No such file or directory

---
On an installed system, libredhat-kernel.so.1 is a symlink to 
libredhat-kernel.so.1.0.1.  Neither of these appear to exist 
in /mnt/sysimage/lib.  
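
A check along these lines could confirm whether the stub library is present under the install root at the moment the kernel scriptlet runs. It is a minimal sketch: check_libredhat_kernel is a hypothetical helper, and the default root of /mnt/sysimage is taken from the paths quoted above.

```shell
#!/bin/sh
# Hypothetical check: report whether the libredhat-kernel stub is present
# under a given root (defaults to /mnt/sysimage, the install-time chroot).
check_libredhat_kernel() {
    root=${1:-/mnt/sysimage}
    lib="$root/lib/libredhat-kernel.so.1"

    if [ -e "$lib" ]; then
        # -e follows symlinks, so this branch means the target exists too.
        echo "present: $lib"
    elif [ -L "$lib" ]; then
        # The symlink exists but points at a missing file.
        echo "dangling symlink: $lib"
        return 1
    else
        echo "missing: $lib"
        return 1
    fi
}
```

Running check_libredhat_kernel with no arguments from the installer shell (assuming one is available on a virtual console) would separate "library never installed" from "symlink present but target missing".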

Thoughts?
Comment 19 Tom Coughlan 2003-12-15 15:11:18 EST
Gary,

Can you take a look at /var/log/messages after Anaconda gets up and
running?  Is there any sign of a problem when the QLA2342 driver is
loaded?  Differences when the QLA2342 is not present?

Tom
Comment 20 Gary Lerhaupt 2003-12-15 15:23:29 EST
/var/state/xkb/syslog seems normal.  It reports two LIPs and two 
disconnected cables as I would expect and continues on its way.  What 
are your thoughts on the tar error above?
Comment 21 Gary Lerhaupt 2003-12-15 15:36:27 EST
Please see this bugzilla:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=67213

I wonder if the lack of libredhat-kernel, which Adrian suggests might 
break up2date, is possibly breaking mkinitrd/tar during install.
Comment 22 Tim Burke 2003-12-15 16:30:12 EST
Adding in a piece of trivia gleaned from a conversation w/ Dale:

They had no storage whatsoever connected to the qlogic card.  They
weren't installing to it. Rather they were installing to the baseboard
disks which weren't on the QLA2342.
Comment 25 Jay Turner 2003-12-16 13:38:26 EST
Received the card this morning, threw it in a 6650, and booted up the
install (using the kickstart provided by Dell, with minor tweaks to
work in our environment) and I'm not seeing any hangs.  We're going to
keep poking around here, as well as try to find some of the machines
which IS was reporting problems on and see if we can replicate.
Comment 26 Gary Lerhaupt 2003-12-16 15:20:08 EST
Created attachment 96569
syslog from stalled install with sysrq info
Comment 27 Tom Coughlan 2003-12-16 16:42:35 EST
Gary,

Can you give us the same syslog information from a system that does
not have the QLA2342 card installed?  Then we can see what is
different in a system that works.

Thanks.

Tom
Comment 28 Gary Lerhaupt 2003-12-16 17:14:11 EST
Hmm.  At what point during the install do you want me to run the 
sysrq on this system without the 2342 installed?  This seems 
arbitrary to me.

Comment 29 Gary Lerhaupt 2003-12-16 17:46:00 EST
In answer to Sue's question to Dale:

This issue first showed up in RHEL2.1 Update 2.
Comment 30 Jason Baron 2003-12-16 19:34:09 EST
I think Tom was looking for the syslog from the install, not the sysrq data.
Comment 32 Jason Baron 2003-12-17 10:44:36 EST
Regarding the tar error from comment 18, it is something that certainly
needs to be fixed, but I'm not yet convinced it is the root of the
installer hang. I tried installing kernels on a running system with the
libredhat symlink removed and I get the same errors from tar, but the
installation finishes... The initrd is hosed and needs to be re-made,
but I don't see any sort of hang.
Comment 33 Larry Woodman 2003-12-17 11:12:55 EST
Can you get us "top" and "vmstat 1" output when the system is 
hung up in tar so we can verify the system is looping in the kernel? 
Also, please get us several "AltSysrq P" outputs so we can see where
in the kernel it is looping.

Thanks, Larry
Comment 34 Gary Lerhaupt 2003-12-17 12:03:28 EST
Created attachment 96586
Syslog from system with no QLogic 2342
Comment 35 Gary Lerhaupt 2003-12-17 13:35:02 EST
Below are a couple of sysrq-p's.  There is no top or vmstat available 
during install.

<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0023:[<4013f285>] CPU: 0EIP is at  
<4> ESP: 002b:bffeb77c EFLAGS: 00000282    Not tainted
<4>EAX: ffffffe0 EBX: 40029ff8 ECX: 080720e0 EDX: 00001800
<4>ESI: 00001800 EDI: 080720e0 EBP: bffeb7a8 DS: 002b ES: 002b
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: 
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0023:[<4013f284>] CPU: 0EIP is at  
<4> ESP: 002b:bffeb778 EFLAGS: 00000282    Not tainted
<4>EAX: ffffffe0 EBX: 00000001 ECX: 080720e0 EDX: 00001800
<4>ESI: 00001800 EDI: 080720e0 EBP: bffeb7a8 DS: 002b ES: 002b
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: 
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0010:[<c011e1e6>] CPU: 0EIP is at  
<4> EFLAGS: 00000286    Not tainted
<4>EAX: 00000001 EBX: dcc3e000 ECX: 0000000d EDX: 00000000
<4>ESI: 0000000d EDI: 00000286 EBP: 00000000 DS: 0018 ES: 0018
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: [<c013df1d>]  (0xdcc3ff50)
<4>[<c0135b99>]  (0xdcc3ff7c)
<4>[<c0108458>]  (0xdcc3ff94)
<4>[<c010a954>]  (0xdcc3ffa8)
<4>[<c0107003>]  (0xdcc3ffc0)
<4>
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0010:[<c0106fd2>] CPU: 0EIP is at  
<4> EFLAGS: 00000282    Not tainted
<4>EAX: 00000004 EBX: 00000001 ECX: 080720e0 EDX: 00001800
<4>ESI: 00001800 EDI: 080720e0 EBP: bffeb7a8 DS: 002b ES: 002b
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: 
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0010:[<c0106fd0>] CPU: 0EIP is at  
<4> EFLAGS: 00000282    Not tainted
<4>EAX: 00000004 EBX: 00000001 ECX: 080720e0 EDX: 00001800
<4>ESI: 00001800 EDI: 080720e0 EBP: bffeb7a8 DS: 002b ES: 002b
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: 
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0010:[<c0106fd0>] CPU: 0EIP is at  
<4> EFLAGS: 00000282    Not tainted
<4>EAX: 00000004 EBX: 00000001 ECX: 080720e0 EDX: 00001800
<4>ESI: 00001800 EDI: 080720e0 EBP: bffeb7a8 DS: 002b ES: 002b
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: 
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0023:[<4013f284>] CPU: 0EIP is at  
<4> ESP: 002b:bffeb778 EFLAGS: 00000282    Not tainted
<4>EAX: ffffffe0 EBX: 00000001 ECX: 080720e0 EDX: 00001800
<4>ESI: 00001800 EDI: 080720e0 EBP: bffeb7a8 DS: 002b ES: 002b
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: 
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0010:[<c011e1e6>] CPU: 0EIP is at  
<4> EFLAGS: 00000286    Not tainted
<4>EAX: 00000001 EBX: dcc3e000 ECX: 0000000d EDX: 00000000
<4>ESI: 0000000d EDI: 00000286 EBP: 00000000 DS: 0018 ES: 0018
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: [<c013df1d>]  (0xdcc3ff50)
<4>[<c0135b99>]  (0xdcc3ff7c)
<4>[<c011a24b>]  (0xdcc3ff94)
<4>[<c0108458>]  (0xdcc3ffac)
<4>[<c0107003>]  (0xdcc3ffc0)
<4>
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0023:[<4013f284>] CPU: 0EIP is at  
<4> ESP: 002b:bffeb778 EFLAGS: 00000282    Not tainted
<4>EAX: ffffffe0 EBX: 00000001 ECX: 080720e0 EDX: 00001800
<4>ESI: 00001800 EDI: 080720e0 EBP: bffeb7a8 DS: 002b ES: 002b
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: 
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0010:[<c011e1e6>] CPU: 0EIP is at  
<4> EFLAGS: 00000286    Not tainted
<4>EAX: 00000001 EBX: dcc3e000 ECX: 0000000d EDX: 00000000
<4>ESI: 0000000d EDI: 00000286 EBP: 00000000 DS: 0018 ES: 0018
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: [<c013df1d>]  (0xdcc3ff50)
<4>[<c0135b99>]  (0xdcc3ff7c)
<4>[<c0108458>]  (0xdcc3ffa0)
<4>[<c0107003>]  (0xdcc3ffc0)
<4>
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0023:[<4013f284>] CPU: 0EIP is at  
<4> ESP: 002b:bffeb778 EFLAGS: 00000282    Not tainted
<4>EAX: ffffffe0 EBX: 00000001 ECX: 080720e0 EDX: 00001800
<4>ESI: 00001800 EDI: 080720e0 EBP: bffeb7a8 DS: 002b ES: 002b
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: 
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0023:[<4013f284>] CPU: 0EIP is at  
<4> ESP: 002b:bffeb778 EFLAGS: 00000282    Not tainted
<4>EAX: ffffffe0 EBX: 00000001 ECX: 080720e0 EDX: 00001800
<4>ESI: 00001800 EDI: 080720e0 EBP: bffeb7a8 DS: 002b ES: 002b
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: 
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0010:[<c011e1e6>] CPU: 0EIP is at  
<4> EFLAGS: 00000286    Not tainted
<4>EAX: 00000001 EBX: dcc3e000 ECX: 0000000d EDX: 00000000
<4>ESI: 0000000d EDI: 00000286 EBP: 00000000 DS: 0018 ES: 0018
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: [<c013df1d>]  (0xdcc3ff50)
<4>[<c0135b99>]  (0xdcc3ff7c)
<4>[<c0108458>]  (0xdcc3ffa0)
<4>[<c0107003>]  (0xdcc3ffc0)
<4>
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0010:[<c011e1e8>] CPU: 0EIP is at  
<4> EFLAGS: 00000286    Not tainted
<4>EAX: 00000000 EBX: dcc3e000 ECX: 0000000d EDX: 00000000
<4>ESI: 0000000d EDI: 00000286 EBP: 00000000 DS: 0018 ES: 0018
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: [<c013df1d>]  (0xdcc3ff50)
<4>[<c0135b99>]  (0xdcc3ff7c)
<4>[<c011a24b>]  (0xdcc3ff94)
<4>[<c0108458>]  (0xdcc3ffac)
<4>[<c0107003>]  (0xdcc3ffc0)
<4>
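
To make a dump like the one above easier to read, the EIP samples can be tallied by code segment and address. The sketch below is not from the original report: summarize_eip is a hypothetical helper that reads syslog text on stdin. On i386 Linux the 0023 selector is normally the user code segment and 0010 the kernel code segment, so the tally shows how often tar was sampled in user space versus in the kernel.

```shell
#!/bin/sh
# Hypothetical helper: read sysrq "Show Regs" output on stdin and print
# "count segment address" for each distinct EIP, most frequent first.
summarize_eip() {
    # Pull "SEG:[<ADDR>]" out of lines such as
    #   <4>EIP: 0023:[<4013f284>] CPU: 0EIP is at
    sed -n 's/.*EIP: \([0-9a-f]\{4\}\):\[<\([0-9a-f]*\)>].*/\1 \2/p' |
        sort | uniq -c | sort -rn
}
```

For example, summarize_eip < syslog.txt (the file name is a placeholder for a saved copy of the log) groups repeated samples so that a tight loop between a handful of addresses stands out.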
Comment 36 Jeremy Katz 2003-12-17 17:01:08 EST
I've put an updates.img up at
http://people.redhat.com/~katzj/tarhang.img.  It contains a workaround
that I think should solve the hang you're seeing.  Instructions for use:
1) Download and copy to a floppy disk
2) Boot with 'linux updates'
3) Provide the floppy when prompted
4) See what happens :)

Feedback would be much appreciated.
Comment 37 Gary Lerhaupt 2003-12-18 11:35:04 EST
Using the above updates disk along with my kickstart floppy, the 
issue seems to have been cleared up.  This appears to be a suitable 
workaround for now but I need to understand the impacts of setting 
LD_ASSUME_KERNEL=2.2.5 and do more testing to ensure this is proper.
Comment 38 Larry Troan 2004-01-06 14:11:45 EST
FROM ISSUE TRACKER
Event posted 12-29-2003 12:28pm by glerhaupt with duration of 0.00
I cannot reproduce the issue in RHEL 2.1 U3 RC 1.  It appears fixed.

Can you comment on what the exact fix was?  By the way, it appears in
addition to needing a QLA2342 to reproduce this issue, you also need a
PERC3 in conjunction with it.
Comment 39 Jeremy Katz 2004-01-06 14:18:46 EST
Basically what was happening was that the kernel was being installed
before libredhat-kernel.  The librt in the i686 glibc depends on the
existence of the libredhat-kernel stub for some AIO functions.  When
we shipped RHEL2.1 originally, the version of tar included did not
depend on librt.  Later, an errata version of tar began linking
against librt to provide sub-second resolution on timestamps.  

Setting LD_ASSUME_KERNEL makes it so that the i386 glibc is used for
any scriptlet processing and thus not the librt that depends on
libredhat-kernel's functionality.  This is only a workaround for U3. 
In the future, we're going to go back to a fixed version of tar which
doesn't have this requirement on librt. 
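
The dependency described above can be checked on an installed system by looking at what the tar binary links against. The following is a minimal sketch: links_against is a hypothetical helper that reads ldd-style output on stdin, and the /bin/tar path in the usage note is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: read ldd-style output on stdin and succeed if any
# line names the given library (for example "librt").
links_against() {
    # ldd lines look like "<tab>librt.so.1 => /lib/i686/librt.so.1 (0x...)".
    grep -q "^[[:space:]]*$1\.so"
}
```

Usage would be something like: ldd /bin/tar | links_against librt && echo "tar needs librt". If the errata tar reports librt and the original tar does not, that matches the explanation in this comment.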
Comment 49 Jason Baron 2004-02-12 15:55:38 EST
This appears fixed to me at this point. I'm closing.
