IT#189898: ice-csn nodes with RHEL5.2 won't PXE boot unless power-cycled

Description of problem:
PXE boot on ice-csn nodes works only once after the AC power cords are plugged in.

How reproducible:
Always.

Steps to Reproduce:
The leader nodes are configured to have PXE as the first boot option in their boot order. The same results are seen when using F12 to trigger a PXE boot.
* halt system
* unplug system
* plug in system
* power up
* PXE configured to boot from local disk works
* system starts up
* after system is up, issue a reboot command
* system reboots, BIOS comes up, etc.
* PXE boot is attempted but fails:
  -> PXE moving dash appears on the console
  -> DHCP server sees the DHCP requests from the node
  -> PXE can't hear the DHCP server replies
  -> PXE fails saying it didn't hear valid responses
* repeat the cycle from the top (including unplugging AC power) and PXE works again

It appears that the driver puts the NIC into some weird state where it cannot PXE boot after the kernel is loaded.

If the e1000 driver is replaced with one built from sourceforge and the steps above repeated, the PXE boot does not fail. There's no need to unplug the node to get PXE to work. The sourceforge driver used was e1000-7.6.15.5; it was only tested on SLES, but should presumably work on RHEL too.

Actual results:
PXE boot works only once after AC power cords are plugged in.

Expected results:
PXE boot works every time.
From erikj:
rhel5.2 and rhel5.1 have the problem. rhel5.0 does not have the problem (but also wasn't using e1000e as far as I can tell).

Here is the driver information I have for what's in rhel5.2 (broken for pxe after reboots on ice-csn hardware) and what is in the SF version:

old driver from rhel5.2:
driver: e1000e
version: 0.2.0
firmware-version: 2.1-12
bus-info: 0000:04:00.0

new driver from sourceforge (works):
driver: e1000e
version: 0.4.1.7-NAPI
firmware-version: 2.1-12
bus-info: 0000:04:00.0

==================================

lsmod confirms that e1000, not e1000e, is loaded in 5.0.

Linus' tree has
drivers/net/e1000e/netdev.c:#define DRV_VERSION "0.3.3.3-k2"

George
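The two driver listings above look like `ethtool -i` output, though the comment does not say how they were collected, so that is an assumption. A minimal Python sketch for comparing two such listings field by field follows; the helper names `parse_ethtool_info` and `changed_fields` are hypothetical and not part of any tool mentioned in this report.

```python
def parse_ethtool_info(text):
    """Parse 'key: value' lines (the layout `ethtool -i` prints) into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as the PCI
        # address "0000:04:00.0" keep their own colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def changed_fields(old, new):
    """Return {field: (old_value, new_value)} for every field that differs."""
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(set(old) | set(new))
        if old.get(key) != new.get(key)
    }


# Listings copied from the comment above.
RHEL52 = """\
driver: e1000e
version: 0.2.0
firmware-version: 2.1-12
bus-info: 0000:04:00.0
"""

SOURCEFORGE = """\
driver: e1000e
version: 0.4.1.7-NAPI
firmware-version: 2.1-12
bus-info: 0000:04:00.0
"""

if __name__ == "__main__":
    diff = changed_fields(parse_ethtool_info(RHEL52),
                          parse_ethtool_info(SOURCEFORGE))
    for field, (old, new) in diff.items():
        print(f"{field}: {old} -> {new}")
```

For these two listings only the `version` field differs; the firmware version and bus address are identical, which is consistent with the problem being in the driver rather than in the NIC firmware.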
There are substantial differences between upstream (Linus' tree) and what's in 5.3 for the files under drivers/net/e1000e. This is good news in that a backport of a specific patch may be all that is required. Alright John, I will get back to other work :). George
Has it been confirmed that this is still broken on the latest 5.3 development kernels? There have been some significant changes to the e1000e driver for 5.3 (we updated to 0.3.3.3-k2). I'd love to know if this has been fixed upstream and in rhel5.3 or if we need to add this to the list of bugs-fixed-in-sourceforge-driver-but-not-upstream.
Andy, I think this was fixed upstream (if it was the same problem) and I think you already backported the change. Can somebody at RH try it with your 5.3 test kernel?
George, any chance you can test the latest rhel5.3 beta kernel?
I will tomorrow morning. I got a different behavior Thursday afternoon with the patch from rhkernel list applied, but I am not sure if the problem is solved. George
Gospo, I am seeing a different behavior, but I am also seeing another problem, probably unrelated. I get beyond the point where PXE won't load anaconda, but anaconda crashes after asking for the nfs mount point. This is rebooting from the 2.6.18-118 kernel. I also had a problem with a failure to format the disk drive on one attempt. I need to try this on other machines. George
George, that is interesting. It seems to me like this might indicate that the PXE failure from the description might be fixed though. Do you think that's a fair statement?
Hi Andy, Yes, I have tried the beta on a couple of systems and it is working fine. The snapshot1 is currently being worked on by SGI QE and when they confirm it working I will flag this as verified. Thanks, George
closing based on comment #9