Red Hat Bugzilla – Bug 112953
LTC5443-Network install of ES 2.1 QU2 fails on xSeries BladeCenter
Last modified: 2007-11-30 17:06:53 EST
The following has been reported by IBM LTC:
Network install of RH ES 2.1 QU2 fails on xSeries BladeCenter
IBM BladeCenter with DLINK ESM / IBM MM / any number of 8678 blades
Red Hat Enterprise ES 2.1 QU 2 update
Steps to Reproduce:
From the testing that we have done we have ruled out that this is a
hardware problem. The testing in the lab is as follows:
1. Created our own RedHat kickstart using RedHat AS 2.1 QU2 and ran it
over the weekend on 5 blades. It ran successfully.
2. Using the customer's CDs (and we verified the media and successfully
compared them to ES 2.1 QU2 ISOs downloaded directly from RedHat), we were able
to recreate the hang on all 7 blades we were using for test. It failed in
either kernel-smp-2.4.xxxx or kernel-2.4.xxxx
on all of them. The hardware is not hung; the software is stalled. If you
change to the shell you can kill a tar process and the install will continue.
It will install sometimes and fail others, which would explain why the
customer does not see this with all their installs.
We have been able to install over the network using RedHat AS 2.1 (over the
weekend) and RedHat 9.0 (for weeks and thousands of installs on 28 blades).
We do not see this as a hardware problem. Something is wrong with the
ES 2.1 install.
Customer's normal operating procedure is to power off and reboot any blade that
hangs or has some other problem. If the blade does not boot cleanly, they
immediately re-image the blade. Therefore, installation hangs are a significant
problem in their environment.
Customer is convinced this is an IBM HW issue, but we cannot find any evidence
of a HW problem. We would like the LTC to provide some method of determining
why the tar process is hung, and some workaround or fix.
This case has a fair amount of political heat behind it because of the
customer's contention that IBM HW is the root cause of this failure.
I can provide any other information or remotes access to a BladeCenter
if required.I just checked the NOS info website....The HS20 blades
have been certified
for RH Enterprise Linux 2.1. I've sent email to the NOS team asking
have seen this bug (on RH ES 2.1). Awaiting their reply....Got the
following reply from Wendy Hung (NOS Team in Raleigh):
The NOS team does not usually perform kickstart installs as part of our testing.
I did take a look at the bug and have some recommendations:
- Do not use the customer's CDs. Burn new ones to test.
- Do not use the same kickstart file used for AS. Be sure that the information
in the kickstart file is correct.
Also, I was not able to find any reports of problems installing ES on Red
Hat's Bugzilla site.
- Please try the recommendations from Wendy Hung above. If the blades
still hang, please re-open this bug report and we will submit this to
Red Hat. Thanks.

Khoa, it appeared in our testing to be bad media. We sent the customer our
CDs to use on their systems and the failure still occurred.
We would like to engage your team and see if your engineers can figure
out what is going on. Thanks.

OK, the customer has IGS support line contract, so I'd like to engage
the IGS L3 team on this. I'd also recommend that this bug be submitted
to Red Hat for investigation - and I believe that Red Hat already has
some bladecenter hardware which we shipped to them as part of their
OS support. Thanks.

I already sent a note to engage the IGS L3 team.

No response from the IGS L3 team so far. I just pinged Chris again.

In all cases that I have seen so far, the system is hung while trying to install
the base kernel RPM. It is dying in mkinitrd in the following tar sequence:
(cd $MNTIMAGE; tar cf - .) | (cd $MNTPOINT; tar xf -)
The really weird part is that the first tar is succeeding, but the
second tar is dying with the following error:
"tar: error while loading shared libraries: libredhat-kernel.so.1: cannot open
shared object file: No such file or directory"
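For reference, the mkinitrd fragment above is the standard pipe-based directory copy. A minimal standalone sketch of the same idiom (the temporary directories and sample file here are placeholders created for illustration, not taken from the installer):

```shell
#!/bin/sh
# Copy one directory tree into another the way the mkinitrd fragment
# does: the first tar archives the source to stdout; the second tar,
# running inside the destination, unpacks from stdin.
SRC=$(mktemp -d)
DST=$(mktemp -d)
mkdir -p "$SRC/lib"
echo hello > "$SRC/lib/file.txt"
(cd "$SRC"; tar cf - .) | (cd "$DST"; tar xf -)
cat "$DST/lib/file.txt"   # prints: hello
```

If either tar in the pipe fails to start (as in the error above), the copy is silently incomplete, which is why the install stalls rather than erroring out cleanly.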
I have been able to run this tar sequence manually on a hung system
without problem, so this appears to be a very timing specific library issue.
There is no hardware specific issue that I can see that triggers this (the
tar process should not even be looking for this library). As this appears to be
in the RedHat installer (the libaio RPM that provides the library that tar
intermittently thinks it needs is not installed until later), and this
customer does not have a SupportLine contract as earlier reported, this really
should be driven back to RedHat. I'm going to recommend that the PE group
retest with the new U3 release bits to see if this timing issue has been
resolved in the latest install image.

Glen - please submit this to Red Hat. Thanks.
This is worked around in update 3.
----- Additional Comments From firstname.lastname@example.org 2004-01-12 18:48 -------
This has not been fixed in update 3. The PE group was still able to
replicate an install hang using the latest U3 bits. It has the exact
same fingerprint (although it is dying in the 2.4.9-e.34 kernel RPM this time):
tar: error while loading shared libraries: libredhat-kernel.so.1: cannot
open shared object file: No such file or directory
This is really looking like some sort of intermittent library path bug.
What I am seeing is that the i686 libraries are being picked up very
intermittently. When things work, this is what an ldd on tar returns:
libpthread.so.0 => /lib/libpthread.so.0 (0x40018000)
librt.so.1 => /lib/librt.so.1 (0x4004b000)
libc.so.6 => /lib/libc.so.6 (0x4005d000)
/lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
When it doesn't work, ldd returns:
libpthread.so.0 => /lib/i686/libpthread.so.0 (0x40018000)
librt.so.1 => /lib/i686/librt.so.1 (0x40049000)
libc.so.6 => /lib/i686/libc.so.6 (0x4005c000)
/lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
libredhat-kernel.so.1 => not found
It is /lib/i686/librt.so.1 that has the reference to
libredhat-kernel.so.1. As the libaio RPM has not been installed yet,
any tar instance that picks up /lib/i686/librt.so.1 will fail.
As a test, I chroot'ed to /mnt/sysimage, and ran "ldd /bin/tar" in
a loop for a while. Out of 2059 loops, /lib/i686/librt.so.1 was
referenced 42 times.
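The loop test described above can be sketched as follows. Since the failure only reproduces inside the installer's chroot, this standalone version replays the two ldd listings captured in this report instead of running "ldd /bin/tar" live; the replay harness itself is an assumption, only the listings are from the comments above:

```shell
#!/bin/sh
# Count how many runs bound the i686 librt, which drags in the
# not-yet-installed libredhat-kernel.so.1 and makes tar fail.
# run_ok / run_bad are ldd excerpts quoted from this bug report.
run_ok='librt.so.1 => /lib/librt.so.1 (0x4004b000)'
run_bad='librt.so.1 => /lib/i686/librt.so.1 (0x40049000)
libredhat-kernel.so.1 => not found'

hits=0
total=0
for run in "$run_ok" "$run_bad"; do
    total=$((total+1))
    # A run that bound the i686 librt is a failing run: that library
    # is the one carrying the libredhat-kernel.so.1 dependency.
    if printf '%s\n' "$run" | grep -q '/lib/i686/librt.so.1'; then
        hits=$((hits+1))
    fi
done
echo "i686 librt picked up in $hits of $total runs"
```

In the real test the same grep over live ldd output matched 42 times out of 2059 loops, which is what makes the failure look intermittent.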
Wild hunch: It is possible that the heavy USB activity found in a blade
environment is a catalyst for this problem. It appeared that I was able
to generate more of these incorrect library references when I was
changing the remote console between blades.
----- Additional Comments From email@example.com 2004-01-13 17:05 -------
In light of the new information gathered, could you re-open the Red Hat bug
(it's currently closed)?
IBM - this is for Issue Tracker - shipping product - I don't go
hunting for bugs; that's why you have a TAM and Issue Tracker.
----- Additional Comments From firstname.lastname@example.org 2004-01-14 10:30 -------
Glen will be moving this to Issue Tracker. We will set up a meeting to discuss
this with Red Hat once this is open in Issue Tracker.
----- Additional Comments From email@example.com(prefers email via firstname.lastname@example.org) 2004-01-14 12:44 -------
----- Additional Comments From email@example.com 2004-01-14 16:26 -------
Latest update from Kevin Rudd:
This does not appear to be something that is limited to the blade
servers. I have been successful in replicating the inconsistent
library behavior on an x440 system that I have in the lab. This was
done on both the U2 and U3 releases of rhes21. I'm about to load up
rhas21 to confirm that this is really not an issue in that environment.
My earlier thoughts about USB being a factor can be ignored.
----- Additional Comments From firstname.lastname@example.org 2004-01-14 17:26 -------
Ignore the thoughts about USB being a factor. The inconsistent nature
of the problem can be misleading at times.
I have been able to replicate the library behavior on a non-blade system
(an x440 system). In addition, I have replicated this with both AS2.1 and ES2.1.
For my test, I modified the kernel RPM so that a long
sleep was added to its %post processing. This pauses the install
process at the same point that it has been found in the previous
hangs. Once at that point, I am able to switch over to the shell
virtual console (F2), and run ldd loop tests. My test is just a simple loop:
chroot /mnt/sysimage /bin/bash
i=0; while :; do i=$((i+1))
  if ldd /bin/tar | grep -q i686; then
    echo "i686 path picked up after $i loops"; break
  fi
done
I have seen /lib/i686/librt.so.1 referenced after as few as 14 loops
and as many as 8067 loops.
----- Additional Comments From email@example.com 2004-01-15 11:06 -------
The comment above indicates this was fixed in U3, so it seems there may already
be some understanding of the issue and/or its root cause. This issue is top
priority for IBM right now because several hundred machines will not be shipped
until this is resolved. Hopefully Kevin's information to reproduce the failure
in U3, combined with any knowledge Red Hat already has on this, could be enough
for us to find a fix. Red Hat's help on this would be very valuable and much
appreciated. Added to Issue Tracker and reopening here.
An even better fix went into U4 (fixed tar package).