Bug 112953

Summary: LTC5443-Network install of ES 2.1 QU2 fails on xSeries BladeCenter
Product: Red Hat Enterprise Linux 2.1
Reporter: IBM Bug Proxy <bugproxy>
Component: anaconda
Assignee: Jeremy Katz <katzj>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike McLean <mikem>
Severity: medium
Priority: medium
Version: 2.1
CC: tao
Target Milestone: ---
Target Release: ---
Hardware: i386
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2004-06-09 20:36:08 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description IBM Bug Proxy 2004-01-06 16:46:53 UTC
The following has been reported by IBM LTC:
Network install of RH ES 2.1 QU2 fails on xSeries BladeCenter
Hardware Environment:

IBM BladeCenter with DLINK ESM / IBM MM / any number of 8678 blades


Software Environment:

Red Hat Enterprise Linux ES 2.1 QU2


Steps to Reproduce:

From the testing that we have done, we have ruled out that this is a
hardware issue.  The testing in the lab is as follows:
1.  Created our own RedHat kickstart using RedHat AS 2.1 QU2 and ran the
kickstart over the weekend on 5 blades.  It ran successfully.
2.  Using the customer's CDs (and we verified the media and successfully
compared them to ES 2.1 QU2 ISOs downloaded directly from RedHat), we
were able to recreate the hang on all 7 blades we were using for test.
It failed on either kernel-smp-2.4.xxxx or kernel-2.4.xxxx on all of
them.  The hardware is not hung; the software is stalled.  If you
switch to the shell, you can kill a tar process and the install will
complete.

It will install sometimes and fail others, which would explain why the
customer does not see this with all their installs.

We have been able to install over the network using RedHat AS 2.1 QU2
(over the weekend) and RedHat 9.0 (for weeks and thousands of installs
on 28 blades).  We do not see this as a hardware problem.  Something is
wrong with the ES 2.1 QU2 distribution.


Additional Information:

Customer's normal operating procedure is to power off and reboot any
blade that hangs or has some other problem.  If the blade does not boot
cleanly, they immediately re-image the blade.  Therefore, installation
hangs are disruptive to their environment.

Customer is convinced this is an IBM HW issue, but we cannot find any
evidence of a HW problem.  We would like the LTC to provide some method
of determining why the tar process is hung, and some workaround or fix.
This case has a fair amount of political heat behind it because of the
customer's contention that IBM HW is the root cause of this failure.

I can provide any other information or remote access to a BladeCenter
if required.

I just checked the NOS info website... The HS20 blades have been
certified for RH Enterprise Linux 2.1.  I've sent email to the NOS team
asking if they have seen this bug (on RH ES 2.1).  Awaiting their
reply...

Got the following reply from Wendy Hung (NOS Team in Raleigh):
-----------------------------------------------------------------------------
Hi Khoa,
The NOS team does not usually perform kickstart installs as part of
testing.  
I did take a look at the bug and have some recommendations:
- Do not use the customer's CDs.  Burn new ones to test.
- Do not use the same kickstart file used for AS.  Be sure that the
syntax in the kickstart file is correct.

Also, I was not able to find any reports of problems installing ES on Red 
Hat's Bugzilla site.

Thanks,
Wendy Hung
-------------------------------------------------------------------------------

Ted - please try the recommendations from Wendy Hung above.  If the
blades still hang, please re-open this bug report and we will submit
this to Red Hat.  Thanks.

Khoa, it appeared in our testing to be bad media.  We sent the customer
our CDs to use on their systems and the failure still occurred.
We would like to engage your team and see if your engineers can figure
this out.  Thanks.

OK, the customer has an IGS support line contract, so I'd like to engage
the IGS L3 team on this.  I'd also recommend that this bug be submitted
to Red Hat for investigation - I believe that Red Hat already has some
BladeCenter hardware, which we shipped to them as part of their OS
support.  Thanks.

I already sent a note to engage the IGS L3 team...

No response from the IGS L3 team so far; I just pinged Chris Ansari
(manager) again.

In all cases that I have seen so far, the system is hung while trying to
install the base kernel RPM.  It is dying in mkinitrd in the following
tar sequence:

(cd $MNTIMAGE; tar cf - .) | (cd $MNTPOINT; tar xf -)

The really weird part is that the first tar is succeeding, but the
second tar is
dying with the following error:

"tar: error while loading shared libraries: libredhat-kernel.so.1:
cannot open
shared object file: No such file or directory"

I have been able to run this tar sequence manually on a hung system with
no problem, so this appears to be a very timing-specific library issue.
There is no hardware-specific issue that I can see that triggers this
(the second tar should not even be looking for this library).  As this
appears to be specific to the RedHat installer (the libaio RPM that
provides the library that tar intermittently thinks it needs is not
installed until later), and this customer does not have a SupportLine
contract as earlier reported, this really needs to be driven back to
RedHat.  I'm going to recommend that the PE group experiment with the
new U3 release bits to see if this timing issue has been resolved in the
latest install image.

Glen - please submit this to Red Hat.  Thanks.
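As an aside on the failure point, the mkinitrd copy step quoted above can be exercised in a self-contained form (a sketch with stand-in temporary directories, not the shipped mkinitrd code; `set -o pipefail` is a bash option that surfaces a failure in either tar instead of letting the pipeline's status reflect only the second tar):

```shell
#!/bin/bash
# Stand-in for mkinitrd's (cd $MNTIMAGE; tar cf - .) | (cd $MNTPOINT; tar xf -)
# pipeline, with pipefail so a dying tar (e.g. the loader error above)
# is reported instead of leaving the step's status ambiguous.
set -o pipefail

MNTIMAGE=$(mktemp -d)   # stand-in source tree
MNTPOINT=$(mktemp -d)   # stand-in destination
echo "initrd content" > "$MNTIMAGE/file.txt"

if (cd "$MNTIMAGE" && tar cf - .) | (cd "$MNTPOINT" && tar xf -); then
    echo "copy ok: $(cat "$MNTPOINT/file.txt")"
else
    echo "copy failed" >&2
fi

rm -rf "$MNTIMAGE" "$MNTPOINT"
```

In the failing case described in this report, the second tar exits nonzero at exec time, so with pipefail the whole pipeline would report failure rather than stalling silently.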

Comment 1 Jeremy Katz 2004-01-06 16:50:56 UTC
This is worked around in update 3.

Comment 2 IBM Bug Proxy 2004-01-13 02:57:37 UTC
----- Additional Comments From ruddk.com  2004-01-12 18:48 -------
This has not been fixed in update 3.  The PE group was still able to
replicate an install hang using the latest U3 bits.  It has the exact
same fingerprint (although it is dying in the 2.4.9-e.34 kernel RPM
this time):

...
Installing kernel.
tar: error while loading shared libraries: libredhat-kernel.so.1: cannot
open shared object file: No such file or directory

This is really looking like some sort of intermittent library path bug.
What I am seeing is that the i686 libraries are being picked up very
intermittently.  When things work, this is what an ldd on tar returns:

        libpthread.so.0 => /lib/libpthread.so.0 (0x40018000)
        librt.so.1 => /lib/librt.so.1 (0x4004b000)
        libc.so.6 => /lib/libc.so.6 (0x4005d000)
        /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)

When it doesn't work, ldd returns:

        libpthread.so.0 => /lib/i686/libpthread.so.0 (0x40018000)
        librt.so.1 => /lib/i686/librt.so.1 (0x40049000)
        libc.so.6 => /lib/i686/libc.so.6 (0x4005c000)
        /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
        libredhat-kernel.so.1 => not found

It is /lib/i686/librt.so.1 that has the reference to
libredhat-kernel.so.1.  As the libaio RPM has not been installed yet,
any tar instance that picks up /lib/i686/librt.so.1 will fail.

As a test, I chroot'ed to /mnt/sysimage, and ran "ldd /bin/tar" in
a loop for a while.  Out of 2059 loops, /lib/i686/librt.so.1 was
referenced 42 times.

Wild hunch:  It is possible that the heavy USB activity found in a blade
environment is a catalyst for this problem.  It appeared that I was able
to generate more of these incorrect library references when I was
changing the remote console between blades. 

Comment 3 IBM Bug Proxy 2004-01-13 22:17:20 UTC
----- Additional Comments From lepore.com  2004-01-13 17:05 -------
Glen,

In light of the new information gathered, could you re-open the Red Hat bug 
(it's currently closed)?

Thanks.

Mike 

Comment 4 Bob Johnson 2004-01-13 22:46:46 UTC
IBM - this is for Issue tracker - shipping product - I don't go
hunting for bugs that's why you have a TAM and issue tracker.

Comment 5 IBM Bug Proxy 2004-01-14 15:42:14 UTC
----- Additional Comments From lepore.com  2004-01-14 10:30 -------
Glen will be moving this to Issue Tracker. We will set-up a meeting to discuss 
this with Red Hat once this is open in issue tracker. 

Comment 6 IBM Bug Proxy 2004-01-14 17:57:33 UTC
----- Additional Comments From gjlynx.com (prefers email via gjohnson.com)  2004-01-14 12:44 -------

Comment 7 IBM Bug Proxy 2004-01-14 21:38:16 UTC
----- Additional Comments From khoa.com  2004-01-14 16:26 -------
Latest update from Kevin Rudd:

This does not appear to be something that is limited to the blade
servers.  I have been successful in replicating the inconsistent
library behavior on an x440 system that I have in the lab.  This was
done on both the U2 and U3 releases of rhes21.  I'm about to load up
rhas21 to confirm that this is really not an issue in that environment.

My earlier thoughts about USB being a factor can be ignored.

Thanks,
       -Kevin 

Comment 8 IBM Bug Proxy 2004-01-14 22:37:06 UTC
----- Additional Comments From ruddk.com  2004-01-14 17:26 -------
Ignore the thoughts about USB being a factor.  The inconsistent nature
of the problem can be misleading at times.

I have been able to replicate the library behavior on a non-blade system
(an x440 system).  In addition, I have replicated this with both AS2.1
and ES2.1.

For my test, I modified the kernel RPM so that a long sleep was added to
its %post processing.  This pauses the install process at the same point
where it has hung in the previous failures.  Once at that point, I am
able to switch over to the shell virtual console (F2) and run ldd loop
tests.  My test is just a simple loop:

chroot /mnt/sysimage /bin/bash

i=0
while true
do
if ldd /bin/tar | grep i686
then
  echo "i686 path picked up after $i loops"
  break
fi
i=$((i+1))
done

I have seen /lib/i686/librt.so.1 referenced after as few as 14 loops
and as many as 8067 loops. 
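The loop above can be collapsed into a single-shot health check: an unresolvable DT_NEEDED entry shows up as "not found" in ldd output. A sketch follows; /mnt/sysimage and the i686 path are from this report and exist only in the installer environment, so it is shown against the local system, where every dependency should resolve:

```shell
# One-shot version of the loop test. On a hung installer this would
# be run as: chroot /mnt/sysimage ldd /bin/tar
if ldd "$(command -v sh)" | grep -q 'not found'; then
    echo "dangling dependency detected"
else
    echo "all libraries resolved"
fi
```

Because the i686 path is only picked up intermittently, a single clean pass does not prove the image is healthy; the looped form above remains the more reliable reproducer.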

Comment 9 IBM Bug Proxy 2004-01-14 22:40:05 UTC
----- Additional Comments From ruddk.com  2004-01-14 17:26 -------
(Duplicate of comment 8.)

Comment 10 IBM Bug Proxy 2004-01-15 16:17:17 UTC
----- Additional Comments From lepore.com  2004-01-15 11:06 -------
The comment above indicates this was fixed in U3, so it seems there may
already be some understanding of the issue and/or its root cause.  This
issue is top priority for IBM right now because several hundred machines
will not be shipped until this is resolved.  Hopefully Kevin's
information to reproduce the failure in U3, combined with any knowledge
Red Hat already has on this, could be enough for us to find a fix.  Red
Hat's help on this would be very valuable and much appreciated.

Thanks. 

Comment 12 IBM Bug Proxy 2004-01-15 20:58:16 UTC
https://enterprise.redhat.com/issue-tracker/?module=issues&action=view&tid=31562&gid=43
added issue tracker and reopening here.

Comment 13 Jeremy Katz 2004-06-09 20:36:08 UTC
An even better fix went into U4 (fixed tar package).