Bug 463275 - Failure of Compute Node to Install on RHEL 5.2 HPC combination
Summary: Failure of Compute Node to Install on RHEL 5.2 HPC combination
Keywords:
Status: NEW
Alias: None
Product: Red Hat HPC Solution
Classification: Red Hat
Component: initrd-templates
Version: 5.2
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: OCS Support
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-09-22 19:48 UTC by Tom Lehmann
Modified: 2023-05-20 07:33 UTC (History)
CC: 0 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:



Description Tom Lehmann 2008-09-22 19:48:09 UTC
Description of problem: Head node fails to load the compute node during initial installation


Version-Release number of selected component (if applicable): Beta HPC package RHEL 5.2


How reproducible: Load RHEL and the HPC package on an Intel SR1250ML (Melstone) server. The head node loads without any problem, but when you try to load the compute node with addhost, it fails to get a DHCP lease and the installation fails.


Steps to Reproduce:
1. Load the head node with RHEL 5.2 and the Beta HPC package (no problem with defaults).
2. Run through the tests before using addhost. All should be successful.
3. Invoke addhost on the head node and select the compute-rhel group and rack 0. Addhost goes into listen mode.
4. Boot the compute node. The PXE boot succeeds and the first stage loads correctly, but when the first stage starts to execute it tries to get a DHCP lease and fails. Installation stalls at that point (a quick check for this is sketched below).
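One quick way to narrow this down (a sketch, assuming the head node runs the standard dhcpd and logs to /var/log/messages; the MAC address shown is a placeholder) is to check whether the compute node's DHCP request ever reaches the head node:

	# On the head node, watch for DHCP traffic while the compute node boots
	$> tail -f /var/log/messages | grep -i dhcp
	# Or search after the fact for the compute node's MAC address (placeholder shown)
	$> grep -i "00:15:17:aa:bb:cc" /var/log/messages
	# No DHCPDISCOVER at all suggests the NIC never came up in the first stage,
	# i.e. a driver problem rather than a dhcpd configuration problem.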
  
Actual results: No installation on compute node

Solution: After about a week of investigation it was discovered that the igb driver that RHEL 5.2 (and Kusu) uses is out of date. When the /tftpboot/kusu/initrd image is updated with the latest version of the Intel igb driver, the system loads correctly and the installation succeeds.

The SR1250ML uses the Intel 82575EB NIC chip, which requires the igb driver.
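For reference, updating the image amounts to something like the following (a rough sketch: it assumes the first-stage image is a gzipped cpio archive, and the image name, kernel version and module path shown are illustrative, not the exact ones Kusu uses):

	# Unpack the existing first-stage initrd (assumed to be a gzipped cpio archive)
	$> mkdir /tmp/initrd-work && cd /tmp/initrd-work
	$> zcat /tftpboot/kusu/initrd-compute.img | cpio -idmv
	# Locate the stale module and drop in the igb.ko built from Intel's current source
	$> find . -name igb.ko
	$> cp /tmp/igb-build/igb.ko ./lib/modules/2.6.18-92.el5/kernel/drivers/net/igb/igb.ko
	# Repack the image and put it back in place
	$> find . | cpio -o -H newc | gzip -9 > /tftpboot/kusu/initrd-compute.img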


Expected results:


Additional info:

Comment 1 Daniel Riek 2008-09-29 02:06:21 UTC
Is the hardware you are using certified for Red Hat Enterprise Linux?

If you are using hardware that requires newer drivers than what has been enabled in Red Hat Enterprise Linux so far, it obviously can't be expected to work.

Red Hat has a way to release drivers asynchronously. But before that happens, the hardware in question can't be certified for RHEL.

There also might not yet be support for these asynchronously added drivers in the HPC Solution. That is a question for Platform.

Comment 2 Rafael Garabato 2008-09-29 13:45:54 UTC
I have the same problem.

Where can I find the list of Certified Hardware for Red Hat Enterprise Linux?

The Gigabit chipset that I have is the 82575EB. As far as I know, the igb driver supports this chipset; however, I don't know if it is certified for Red Hat.

Comment 3 Daniel Riek 2008-09-29 18:27:24 UTC
The HW Cert can be found at https://hardware.redhat.com/

If it doesn't work, I guess it is not yet certified. You can request a new driver through the regular Intel / Red Hat process.

Comment 4 Tom Lehmann 2008-09-30 13:48:09 UTC
The system I am using (Intel OEM SR1250ML) contains two of the X38ML motherboards that are on the RHEL certified list. The system can be loaded from DVD media without any issue. It's only when you try to load using PXE that the problem occurs, due to the incorrect driver in the first stage.

Comment 5 Ronald Pacheco 2008-09-30 17:19:43 UTC
Please test the igb driver from Bug 436040 and post your results here and in that BZ.

Comment 6 Tom Lehmann 2008-10-01 02:30:52 UTC
I do not seem to have access to Bug 436040.  The driver I used was the latest posted at the Intel support web site.

Comment 7 Rafael Garabato 2008-10-01 19:21:39 UTC
What should I do? Install one of the kernel rpm packages available at http://people.redhat.com/agospoda/#rhel5?

kernel-2.6.18-116.el5.gtest.57.x86_64.rpm

Comment 8 Ronald Pacheco 2008-10-01 20:14:52 UTC
Tom,

Please go to the url referenced in Comment #7 to access rpms with the drivers we're considering for RHEL 5.3.

Comment 9 Tom Lehmann 2008-10-02 21:38:37 UTC
Ronald,

Sorry, no joy.  The new kernel RPM apparently doesn't correct the situation.

I loaded the head node from my original media (DVD).  After updating the system and registering it with RHN I downloaded and installed the new kernel RPM you suggested above.  As far as I can tell, the update was successful.

I then registered the system for the HPC feature on RHN.

I ran yum install ocs and ocs-setup to get the head node ready to install compute nodes.

All of the intermediate tests passed, so I started addhost on the head node. I booted the compute node from the network. The PXE boot went correctly and the head node noted compute-00-00 at the proper address. On the compute node the first stage started running and tried to find the boot files on the head node. It failed in apparently the same way as the unmodified version of the HPC package.

Question: The HPC installation asks the system to copy all of the install media to the system. Is there a chance that the installation procedure overwrites some of the files modified by the RPM? Remember, I fixed the initrd just before I ran addhost. If the ocs installation had overwritten any files, my modification would have overwritten whatever had just been done.
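One way I can think of to check, assuming the image is still a gzipped cpio archive (the image name and module path below are again illustrative), is to look inside the initrd that addhost actually serves and confirm which igb module it carries:

	# List the image contents and see which igb.ko is present
	$> zcat /tftpboot/kusu/initrd-compute.img | cpio -it | grep igb
	# Extract just that module and check its version string
	$> zcat /tftpboot/kusu/initrd-compute.img | cpio -idmv "*igb.ko"
	$> modinfo ./lib/modules/*/kernel/drivers/net/igb/igb.ko | grep -i "^version"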

Tom Lehmann

Comment 10 Rafael Garabato 2008-10-07 21:49:00 UTC
The new driver and kernel seem to work fine, at least for the provisioning.

Mark helped me to get it working. 
The node group must be updated to use the new kernel.

These are the steps I followed:

1) Copy the kernel RPM into /depot/kits/rhel/5/x86_64/Server
	$> cp kernel-2.6.18-116.el5.gtest.57.x86_64.rpm /depot/kits/rhel/5/x86_64/Server/
2) Run repoman
	$> repoman -u  -r rhel5_x86_64
3) Edit the driverpacks table in the kusudb database and change the "dpname" to the new kernel RPM
	$> sqlrunner -q "update driverpacks SET dpname=\"kernel-2.6.18-116.el5.gtest.57.x86_64.rpm\" where dpid=1"
4) Run driverpatch
	$> driverpatch nodegroup name=compute-rhel
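
To double-check that step 3 took effect, something like this should work (assuming sqlrunner will run an arbitrary query against kusudb, as it does in step 3):

	$> sqlrunner -q "select dpid, dpname from driverpacks where dpid=1"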

Comment 11 OCS Support 2008-10-08 17:42:28 UTC
What about using the "pci=nomsi" kernel parameter? It's mentioned in bug 460349.

You can set the kernel parameter for a nodegroup using ngedit.  We used it with RHEL 5.2 on a small Melstone cluster that was exhibiting the issue, and it seemed to help.
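
For illustration, after setting the parameter with ngedit the node group's PXE entry should end up looking roughly like this (a sketch assuming a standard pxelinux configuration; the kernel and initrd names are placeholders):

	label compute-rhel
	  kernel kusu/vmlinuz-compute
	  append initrd=kusu/initrd-compute.img pci=nomsi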

