Bug 104243

Summary: raid1 install hangs installing kernel
Product: Red Hat Enterprise Linux 2.1
Reporter: James Laska <jlaska>
Component: kernel
Assignee: Jason Baron <jbaron>
Status: CLOSED ERRATA
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 2.1
CC: bruno.verkist, coughlan, dale_kaisner, damorep, gary_lerhaupt, jakub, jturner, katzj, knoel, ltroan, mgalgoci, nate, tao, yngve.svendsen
Hardware: All   
OS: Linux   
Whiteboard: Dell requests fix for U3.
Doc Type: Bug Fix
Last Closed: 2004-02-12 20:55:38 UTC
Bug Blocks: 107565    
Attachments:
Description                                   Flags
kickstart file that reproduces the hangup     none
dell ks.cfg                                   none
syslog from stalled install with sysrq info   none
Syslog from system with no QLogic 2342        none

Description James Laska 2003-09-11 18:07:17 UTC
Anaconda is hanging during the install of the kernel.  The install is a raid1 on
2 ide drives whose configs can be seen below (or in the ks.cfg attached):

raid /boot --level 1 --fstype ext3 --device 0 raid.00 raid.01
raid /var --level 1 --fstype ext3 --device 2 raid.20 raid.21
raid swap --level 1 --fstype ext3 --device 5 raid.30 raid.31
raid swap --level 1 --fstype ext3 --device 6 raid.32 raid.33
raid /tmp --level 1 --fstype ext3 --device 3 raid.10 raid.11
raid / --level 1 --fstype ext3 --device 1 raid.40 raid.41

Further examination shows that the hang is occurring while running the --scripts
of kernel-2.4.9-e.24.i686.rpm.

postinstall scriptlet (using /bin/sh):
cd /boot
ln -sf vmlinuz-2.4.9-e.24 vmlinuz
ln -sf System.map-2.4.9-e.24 System.map
ln -sf module-info-2.4.9-e.24 module-info
[ -x /usr/sbin/module_upgrade ] && /usr/sbin/module_upgrade
[ -x /sbin/mkkerneldoth ] && /sbin/mkkerneldoth
if [ -x /sbin/new-kernel-pkg ] ; then
        /sbin/new-kernel-pkg --mkinitrd --depmod --install 2.4.9-e.24
fi

The process tables show that /sbin/new-kernel-pkg is calling /sbin/mkinitrd
which calls 'tar cf - .'.  The tar command is the command that is hanging.  If
you kill the tar process, the installation continues successfully (however I
gather the initrd is hosed -- the kickstart %post works around this by running
up2date).  I was able to strace the 'tar cf - .' process and see that it was
failing on a "SIGPIPE (Broken Pipe) write(1,....) = ESIGPIPE"

fd 1 is stdout, correct?  If so, the offending line of code in
/sbin/mkinitrd appears to be line #434:

434: (cd $MNTIMAGE; tar cf - .) | (cd $MNTPOINT; tar xf -)

Further examination of the tar process shows that its $HOME is /tmp on the
installed system (i.e. /mnt/sysimage/tmp).  Since /tmp is a raid
volume, this might be an issue.  When /tmp is removed from the kickstart file as
a raid volume, the hang does not occur.
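
The pipeline on line 434 can be tried outside the installer. The following is a minimal sketch of that idiom, not the mkinitrd code itself: copy_tree is a hypothetical helper name, and the directory check in front of the pipeline is an addition made for this example.

```shell
# copy_tree SRC DST: hypothetical helper wrapping the tar-over-pipe idiom.
copy_tree() {
    # Fail early instead of letting either tar start without its directory.
    [ -d "$1" ] && [ -d "$2" ] || return 1
    # The writer archives SRC to stdout (fd 1); the reader unpacks from
    # stdin inside DST. The pipeline's exit status is the reader's.
    (cd "$1" && tar cf - .) | (cd "$2" && tar xf -)
}
```

Called as copy_tree "$MNTIMAGE" "$MNTPOINT", this performs the same copy as line 434. If the reading tar cannot start or exits early, the writing tar is left writing to a pipe with no reader, which is the situation the strace output describes.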

So I'm not sure who to assign this one to (kernel, mkinitrd, tar, raid)? 
According to nate, this bug is easily reproducible on a Dell650 with 2
ide drives and somewhat random on another system with scsi storage.

Thoughts, comments?

Comment 3 Jeremy Katz 2003-10-14 20:26:22 UTC
This isn't mkinitrd's fault... it's just running tar to a pipe, and the tar
process is then getting SIGPIPEs in an endless loop.
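
For comparison, this is how a writer normally behaves when its reader goes away: it receives SIGPIPE once and dies, rather than looping. The sketch below is only an illustration of that normal case and does not reproduce the hang; writer_status is a hypothetical helper name, and "yes" and "head" stand in for the two tar processes.

```shell
# writer_status: hypothetical helper that prints the exit status of a
# writer whose reader closes the pipe early.
writer_status() {
    {
        # "yes" writes forever; "head" exits after one line and closes the
        # pipe. fd 3 carries the writer's exit status out of the pipeline.
        { rc=0; yes 2>/dev/null || rc=$?; echo "$rc" >&3; } |
            head -n 1 >/dev/null
    } 3>&1
}
```

With SIGPIPE at its default disposition the printed status is typically 141 (128 plus signal number 13). A writer that ignores or handles the signal instead sees its write() fail and has to decide for itself whether to give up.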

Comment 4 Jeremy Katz 2003-10-22 16:46:51 UTC
*** Bug 107728 has been marked as a duplicate of this bug. ***

Comment 5 Narsi Subramanian 2003-10-24 14:14:35 UTC
An update from SUN

- Original Problem:
  We are attempting to install Redhat AS 2.1 QU2 on a Sun Fire V60x machine 
  (dual Xeon 2.8GHz, 1GB RAM, 2 e1000 network interfaces) over the network
  (Kickstart). Everything works up to the point where the kernel RPM is unpacked,
  about 1m 10s after starting to install RPMS. The RPM is unpacked (the 
  progress bar goes to 100%), then the installation hangs. We believe this is 
  related to the dual network interfaces. We have encountered this on other 
  systems, and the workaround has always been to remove the extra (up to 4 extra 
  interfaces on some machines) network interfaces, and this has always caused
  installation to run smoothly. However, the workaround is not possible here, 
  since the built-in interfaces cannot be disabled individually. Anaconda is 
  anaconda-7.2-68_ELAS.

- New Update:
  By dropping this new kernel RPM (2.4.9-e.27.6) into the network RPM 
  repository and regenerating the hdlist file, I am now able to install 
  without any problems. As an added bonus, we can now skip the postinstall 
  kernel update.
  However, the original problem is still there, and as far as I can see,
  plain-vanilla non-hacked RHAS 2.1 QU 2 can't be installed by Kickstart
  on a V60x.



Comment 6 Jeremy Katz 2003-11-04 21:09:48 UTC
*** Bug 73414 has been marked as a duplicate of this bug. ***

Comment 7 Jeremy Katz 2003-11-04 21:10:45 UTC
*** Bug 109047 has been marked as a duplicate of this bug. ***

Comment 8 Larry Troan 2003-11-06 14:52:58 UTC
Bug 109047 was opened against Dell Issue Tracker 28914, which is a sev
1, and was DUP'd to this bug. 

Dell requests this be fixed for U3, though it came in after the MUSTFIX
deadline and so is not marked as a Blocker bug. 
Updating this Bug severity to HIGH to reflect sev 1.

Comment 9 Gary Lerhaupt 2003-12-10 21:40:46 UTC
I can reproduce this issue 100% of the time on RHEL2.1 U3 beta on a 
Dell PowerEdge 6650, QLogic 2342 card inserted and the ks.cfg file 
(linux ks=floppy install) that I will attach.

Comment 10 Gary Lerhaupt 2003-12-10 21:41:52 UTC
Created attachment 96456 [details]
dell ks.cfg

Comment 12 Jay Turner 2003-12-12 15:55:44 UTC
Gary, we don't have any 2342 cards in either Centennial or Westford
(at least none that we can find), but I've tried an install with a 2312 card
and am not seeing any problems.  Can you provide some more details
about how the card is configured (point-to-point, loop)?  Also, I'm
assuming that you're putting all of the partitions on the connected array?

Comment 15 Larry Troan 2003-12-15 15:06:41 UTC
FROM BUGZILLA 109047 (marked as DUP of this Bug). 
------- Additional Comment #5 From Gary Lerhaupt on 2003-11-04 16:39 -------

A couple housekeeping questions. 

First, where it says "Resolved" above, I assume this means that this 
specific bugzilla is resolved as a duplicate, not that the 
underlying issue is resolved (since the underlying issue is still in 
the new state).

Secondly, is 104243 marked as a MUSTFIX for Q3?


------- Additional Comment #6 From Gary Lerhaupt on 2003-11-06 12:37 -------

Updating the severity.  Any feedback on my questions above?


------- Additional Comment #7 From Larry Troan on 2003-12-15 10:03 -------

"Resolved" does mean DUP of 104243 -- not underlying problem resolved.

Bug 104243 is not a MUSTFIX for U3. Neither was 109047, as it came in
beyond the Engineering cutoff... Sue Denham is discussing with Dale at
Dell the criticality of getting this resolved and into Update 1.

This problem is being tracked by Bug 104243.

Comment 16 Gary Lerhaupt 2003-12-15 15:18:05 UTC
There are no cables attached to the 2342 during installation.  All 
partitions are on the local disk.  Apparently this has been 
replicated on U3 beta 2.

Comment 17 Gary Lerhaupt 2003-12-15 19:39:23 UTC
What change was made in U3 beta 2 that was thought to address this?

Comment 18 Gary Lerhaupt 2003-12-15 19:59:17 UTC
I just found this in /mnt/sysimage/tmp/install.log

Installing kernel
tar: error while loading shared libraries: libredhat-kernel.so.1: 
cannot open shared object file: No such file or directory

---
On an installed system, libredhat-kernel.so.1 is a symlink to 
libredhat-kernel.so.1.0.1.  Neither of these appear to exist 
in /mnt/sysimage/lib.  
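
A quick way to repeat this check on another machine is sketched below. It is an illustration only: check_lib is a hypothetical helper name, and the root directory is passed in so the same check works against /mnt/sysimage during an install or / on an installed system.

```shell
# check_lib ROOT: hypothetical helper reporting the state of
# ROOT/lib/libredhat-kernel.so.1.
check_lib() {
    lib="$1/lib/libredhat-kernel.so.1"
    if [ -e "$lib" ]; then
        # -e follows symlinks, so the link target exists as well.
        echo "present"
    elif [ -L "$lib" ]; then
        # The symlink exists but its target does not.
        echo "dangling symlink"
    else
        echo "missing"
    fi
}
```

Running check_lib /mnt/sysimage at the point where the install stalls would show whether the library has been installed yet.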

Thoughts?

Comment 19 Tom Coughlan 2003-12-15 20:11:18 UTC
Gary,

Can you take a look at /var/log/messages after Anaconda gets up and
running?  Is there any sign of a problem when the QLA2342 driver is
loaded?  Differences when the QLA2342 is not present?

Tom

Comment 20 Gary Lerhaupt 2003-12-15 20:23:29 UTC
/var/state/xkb/syslog seems normal.  It reports two LIPs and two 
disconnected cables as I would expect and continues on its way.  What 
are your thoughts on the tar error above?

Comment 21 Gary Lerhaupt 2003-12-15 20:36:27 UTC
Please see this bugzilla:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=67213

I wonder if the lack of libredhat-kernel, which Adrian suggests might 
break up2date, is possibly breaking mkinitrd/tar during install.

Comment 22 Tim Burke 2003-12-15 21:30:12 UTC
Adding in a piece of trivia gleaned from a conversation w/ Dale:

They had no storage whatsoever connected to the qlogic card.  They
weren't installing to it. Rather they were installing to the baseboard
disks which weren't on the QLA2342.


Comment 25 Jay Turner 2003-12-16 18:38:26 UTC
Received the card this morning, threw it in a 6650, and booted up the
install (using the kickstart provided by Dell, with minor tweaks to
work in our environment) and I'm not seeing any hangs.  We're going to
keep poking around here, as well as try to find some of the machines
which IS was reporting problems on and see if we can replicate.

Comment 26 Gary Lerhaupt 2003-12-16 20:20:08 UTC
Created attachment 96569 [details]
syslog from stalled install with sysrq info

Comment 27 Tom Coughlan 2003-12-16 21:42:35 UTC
Gary,

Can you give us the same syslog information from a system that does
not have the QLA2342 card installed?  Then we can see what is
different in a system that works.

Thanks.

Tom


Comment 28 Gary Lerhaupt 2003-12-16 22:14:11 UTC
Hmm.  At what point during the install do you want me to run the 
sysrq on this system without the 2342 installed?  This seems 
arbitrary to me.



Comment 29 Gary Lerhaupt 2003-12-16 22:46:00 UTC
In answer to Sue's question to Dale:

This issue first showed up in RHEL2.1 Update 2.

Comment 30 Jason Baron 2003-12-17 00:34:09 UTC
I think Tom was looking for the syslog from the install, not the sysrq data.

Comment 32 Jason Baron 2003-12-17 15:44:36 UTC
Regarding the tar error from comment 18: it is something that certainly
needs to be fixed, but I'm not yet convinced it is the root of the
installer hang. I tried installing kernels on a running system with the
libredhat symlink removed and I get the same errors from tar, but the
installation finishes... The initrd is hosed and needs to be re-made,
but I don't see any sort of hang.

Comment 33 Larry Woodman 2003-12-17 16:12:55 UTC
Can you get us a "top" and "vmstat 1" output when the system is 
hung up in tar so we can verify the system is looping in the kernel? 
Also, please get us several "AltSysrq P" outputs so we can see where
in the kernel it is looping.

Thanks, Larry

Comment 34 Gary Lerhaupt 2003-12-17 17:03:28 UTC
Created attachment 96586 [details]
Syslog from system with no QLogic 2342

Comment 35 Gary Lerhaupt 2003-12-17 18:35:02 UTC
Below are a couple of sysrq-p's.  There is no top or vmstat available 
during install.

<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0023:[<4013f285>] CPU: 0EIP is at  
<4> ESP: 002b:bffeb77c EFLAGS: 00000282    Not tainted
<4>EAX: ffffffe0 EBX: 40029ff8 ECX: 080720e0 EDX: 00001800
<4>ESI: 00001800 EDI: 080720e0 EBP: bffeb7a8 DS: 002b ES: 002b
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: 
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0023:[<4013f284>] CPU: 0EIP is at  
<4> ESP: 002b:bffeb778 EFLAGS: 00000282    Not tainted
<4>EAX: ffffffe0 EBX: 00000001 ECX: 080720e0 EDX: 00001800
<4>ESI: 00001800 EDI: 080720e0 EBP: bffeb7a8 DS: 002b ES: 002b
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: 
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0010:[<c011e1e6>] CPU: 0EIP is at  
<4> EFLAGS: 00000286    Not tainted
<4>EAX: 00000001 EBX: dcc3e000 ECX: 0000000d EDX: 00000000
<4>ESI: 0000000d EDI: 00000286 EBP: 00000000 DS: 0018 ES: 0018
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: [<c013df1d>]  (0xdcc3ff50)
<4>[<c0135b99>]  (0xdcc3ff7c)
<4>[<c0108458>]  (0xdcc3ff94)
<4>[<c010a954>]  (0xdcc3ffa8)
<4>[<c0107003>]  (0xdcc3ffc0)
<4>
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0010:[<c0106fd2>] CPU: 0EIP is at  
<4> EFLAGS: 00000282    Not tainted
<4>EAX: 00000004 EBX: 00000001 ECX: 080720e0 EDX: 00001800
<4>ESI: 00001800 EDI: 080720e0 EBP: bffeb7a8 DS: 002b ES: 002b
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: 
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0010:[<c0106fd0>] CPU: 0EIP is at  
<4> EFLAGS: 00000282    Not tainted
<4>EAX: 00000004 EBX: 00000001 ECX: 080720e0 EDX: 00001800
<4>ESI: 00001800 EDI: 080720e0 EBP: bffeb7a8 DS: 002b ES: 002b
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: 
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0010:[<c0106fd0>] CPU: 0EIP is at  
<4> EFLAGS: 00000282    Not tainted
<4>EAX: 00000004 EBX: 00000001 ECX: 080720e0 EDX: 00001800
<4>ESI: 00001800 EDI: 080720e0 EBP: bffeb7a8 DS: 002b ES: 002b
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: 
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0023:[<4013f284>] CPU: 0EIP is at  
<4> ESP: 002b:bffeb778 EFLAGS: 00000282    Not tainted
<4>EAX: ffffffe0 EBX: 00000001 ECX: 080720e0 EDX: 00001800
<4>ESI: 00001800 EDI: 080720e0 EBP: bffeb7a8 DS: 002b ES: 002b
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: 
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0010:[<c011e1e6>] CPU: 0EIP is at  
<4> EFLAGS: 00000286    Not tainted
<4>EAX: 00000001 EBX: dcc3e000 ECX: 0000000d EDX: 00000000
<4>ESI: 0000000d EDI: 00000286 EBP: 00000000 DS: 0018 ES: 0018
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: [<c013df1d>]  (0xdcc3ff50)
<4>[<c0135b99>]  (0xdcc3ff7c)
<4>[<c011a24b>]  (0xdcc3ff94)
<4>[<c0108458>]  (0xdcc3ffac)
<4>[<c0107003>]  (0xdcc3ffc0)
<4>
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0023:[<4013f284>] CPU: 0EIP is at  
<4> ESP: 002b:bffeb778 EFLAGS: 00000282    Not tainted
<4>EAX: ffffffe0 EBX: 00000001 ECX: 080720e0 EDX: 00001800
<4>ESI: 00001800 EDI: 080720e0 EBP: bffeb7a8 DS: 002b ES: 002b
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: 
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0010:[<c011e1e6>] CPU: 0EIP is at  
<4> EFLAGS: 00000286    Not tainted
<4>EAX: 00000001 EBX: dcc3e000 ECX: 0000000d EDX: 00000000
<4>ESI: 0000000d EDI: 00000286 EBP: 00000000 DS: 0018 ES: 0018
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: [<c013df1d>]  (0xdcc3ff50)
<4>[<c0135b99>]  (0xdcc3ff7c)
<4>[<c0108458>]  (0xdcc3ffa0)
<4>[<c0107003>]  (0xdcc3ffc0)
<4>
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0023:[<4013f284>] CPU: 0EIP is at  
<4> ESP: 002b:bffeb778 EFLAGS: 00000282    Not tainted
<4>EAX: ffffffe0 EBX: 00000001 ECX: 080720e0 EDX: 00001800
<4>ESI: 00001800 EDI: 080720e0 EBP: bffeb7a8 DS: 002b ES: 002b
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: 
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0023:[<4013f284>] CPU: 0EIP is at  
<4> ESP: 002b:bffeb778 EFLAGS: 00000282    Not tainted
<4>EAX: ffffffe0 EBX: 00000001 ECX: 080720e0 EDX: 00001800
<4>ESI: 00001800 EDI: 080720e0 EBP: bffeb7a8 DS: 002b ES: 002b
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: 
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0010:[<c011e1e6>] CPU: 0EIP is at  
<4> EFLAGS: 00000286    Not tainted
<4>EAX: 00000001 EBX: dcc3e000 ECX: 0000000d EDX: 00000000
<4>ESI: 0000000d EDI: 00000286 EBP: 00000000 DS: 0018 ES: 0018
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: [<c013df1d>]  (0xdcc3ff50)
<4>[<c0135b99>]  (0xdcc3ff7c)
<4>[<c0108458>]  (0xdcc3ffa0)
<4>[<c0107003>]  (0xdcc3ffc0)
<4>
<6>SysRq : Show Regs
<4>
<4>Process: 767, {                 tar}
<4>Kernel 2.4.9-e.33BOOT
<4>EIP: 0010:[<c011e1e8>] CPU: 0EIP is at  
<4> EFLAGS: 00000286    Not tainted
<4>EAX: 00000000 EBX: dcc3e000 ECX: 0000000d EDX: 00000000
<4>ESI: 0000000d EDI: 00000286 EBP: 00000000 DS: 0018 ES: 0018
<4>CR0: 8005003b CR2: 080cf00c CR3: 1cbe1000 CR4: 000006d0
<4>Call Trace: [<c013df1d>]  (0xdcc3ff50)
<4>[<c0135b99>]  (0xdcc3ff7c)
<4>[<c011a24b>]  (0xdcc3ff94)
<4>[<c0108458>]  (0xdcc3ffac)
<4>[<c0107003>]  (0xdcc3ffc0)
<4>


Comment 36 Jeremy Katz 2003-12-17 22:01:08 UTC
I've put an updates.img up at
http://people.redhat.com/~katzj/tarhang.img.  It contains a workaround
that I think should solve the hang you're seeing.  Instructions for use:
1) Download and copy to a floppy disk
2) Boot with 'linux updates'
3) Provide the floppy when prompted
4) See what happens :)

Feedback would be much appreciated.

Comment 37 Gary Lerhaupt 2003-12-18 16:35:04 UTC
Using the above updates disk along with my kickstart floppy, the 
issue seems to have been cleared up.  This appears to be a suitable 
workaround for now but I need to understand the impacts of setting 
LD_ASSUME_KERNEL=2.2.5 and do more testing to ensure this is proper.

Comment 38 Larry Troan 2004-01-06 19:11:45 UTC
FROM ISSUE TRACKER
Event posted 12-29-2003 12:28pm by glerhaupt with duration of 0.00
I cannot reproduce the issue in RHEL 2.1 U3 RC 1.  It appears fixed.

Can you comment on what the exact fix was?  By the way, it appears in
addition to needing a QLA2342 to reproduce this issue, you also need a
PERC3 in conjunction with it.


Comment 39 Jeremy Katz 2004-01-06 19:18:46 UTC
Basically what was happening was that the kernel was being installed
before libredhat-kernel.  The librt in the i686 glibc depends on the
existence of the libredhat-kernel stub for some AIO functions.  When
we shipped RHEL2.1 originally, the version of tar included did not
depend on librt.  Later, an errata version of tar began linking
against librt to provide sub-second resolution on timestamps.  

Setting LD_ASSUME_KERNEL makes it so that the i386 glibc is used for
any scriptlet processing and thus not the librt that depends on
libredhat-kernel's functionality.  This is only a workaround for U3. 
In the future, we're going to go back to a fixed version of tar which
doesn't have this requirement on librt. 
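
The effect of the workaround can be sketched as a wrapper that sets the variable for a single command. This is an assumption about the general shape only: with_assume_kernel is a hypothetical helper name, and how the updates.img actually sets the variable inside the installer is not shown in this report.

```shell
# with_assume_kernel CMD [ARGS...]: hypothetical helper that runs one
# command with LD_ASSUME_KERNEL=2.2.5 in its environment.
with_assume_kernel() {
    # The assignment prefix applies to this command only; the calling
    # shell's environment is left unchanged.
    LD_ASSUME_KERNEL=2.2.5 "$@"
}
```

On the affected release this would be used as with_assume_kernel /sbin/new-kernel-pkg --mkinitrd --depmod --install 2.4.9-e.24, so that tar and the other scriptlet commands load the i386 glibc. On a current glibc the variable may be ignored or may stop dynamically linked programs from loading, so the sketch is only meaningful on a system of the era discussed here.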

Comment 49 Jason Baron 2004-02-12 20:55:38 UTC
This appears fixed to me at this point. I'm closing