Bug 151872
Summary: | Installer fails to properly assign IP address from DHCP during kickstart | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | root |
Component: | anaconda | Assignee: | Anaconda Maintenance Team <anaconda-maint-list> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Mike McLean <mikem> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4.0 | CC: | dhu, hhd405131, prante, rareigh, steve |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i386 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | U5/U1 | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2005-09-19 17:51:07 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
Description
root
2005-03-23 03:58:29 UTC
Does it help if you boot with 'linux linksleep=30'?

I tried the linksleep option and 4 times out of 5 it failed with exactly the same symptoms as described above. On the 4th try it worked, but it did not work when I rebooted and tried a 5th time. This is consistent with what I described above, where it works at random once every 5 to 10 attempts. So I don't think linksleep made any difference.

*** Bug 153748 has been marked as a duplicate of this bug. ***

I'm having similar problems on an HP DL 380 G4 (with Broadcom NetXtreme 5704 gigabit NICs plugged into a Cisco Catalyst 4000 10/100). PXE DHCP works fine, but then kickstart can't get an IP via DHCP. In my case, the DHCP server doesn't hear anything from the client after the PXE DHCP request and before kickstart. It doesn't matter what I change linksleep to; I've tried values as high as two minutes (120). ES 4.

So I thought it was related to PXE and tried a boot disk, and got the same thing: "pump told us: no DHCP reply received", and then it kicks me out to the IP configuration screen. A patch would be greatly appreciated.

I had the same problem and traced it back to the LAN switch spanning tree configuration. My understanding is that during initialization the port got reset, which caused the switch's spanning tree logic to put it into a non-forwarding state for several seconds. This in turn caused DHCP to time out and fail. Changing the port's STP configuration to Edge Port eliminated the problem.

If it was an STP problem, wouldn't this affect other computers on the same switch? There were no problems with DHCP on other computers, or even on the same computer booting in rescue mode.

On most switches, the "Edge Port" setting is configured individually for each port. Even if all ports are configured the same, my experience was that some DHCP requests (such as the one during the Anaconda install) consistently failed, while others (such as the ones from PXE boot and dhclient) consistently succeeded, even on the same port.
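The spanning tree explanation above can be made concrete with a rough timing model. This is a minimal sketch with several assumptions. It uses classic IEEE 802.1D timers: a 15-second forward delay for each of the listening and learning states, so about 30 seconds pass before a freshly reset port forwards traffic. It treats an Edge Port or portfast port as forwarding immediately. It assumes a client that doubles its wait between DHCPDISCOVER transmissions starting at 4 seconds, in the style of RFC 2131 but without its random jitter. The function names and attempt counts are hypothetical. This report does not give the actual retry schedules of pump, the PXE firmware, or dhclient.

```python
from __future__ import annotations

# Assumed classic IEEE 802.1D default: a port spends FORWARD_DELAY_S in the
# listening state and again in the learning state before it forwards frames.
FORWARD_DELAY_S = 15.0


def forwarding_starts_at(edge_port: bool, forward_delay: float = FORWARD_DELAY_S) -> float:
    """Seconds after link-up at which the switch port starts forwarding."""
    # An Edge Port / portfast port skips listening and learning entirely.
    return 0.0 if edge_port else 2 * forward_delay


def backoff_send_times(attempts: int, first_wait: float = 4.0) -> list[float]:
    """Times (seconds after link-up) at which a hypothetical client sends
    DHCPDISCOVER, doubling its wait after each unanswered attempt."""
    times, t, wait = [], 0.0, first_wait
    for _ in range(attempts):
        times.append(t)
        t += wait
        wait *= 2
    return times


def dhcp_can_succeed(send_times: list[float], edge_port: bool) -> bool:
    """True if at least one request leaves after the port is forwarding;
    requests sent earlier are silently dropped by the switch."""
    start = forwarding_starts_at(edge_port)
    return any(t >= start for t in send_times)
```

Under these assumptions, consider a client that gives up after four attempts, sent at 0, 4, 12 and 28 seconds. None of its requests get through a non-edge port. A fifth attempt at 60 seconds would get through, and on an edge port the very first request does. That is one possible reading of why clients with different timeout tolerance behaved differently on the same port.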
Perhaps the difference comes down to each DHCP client's timeout tolerance, or perhaps some drivers reset the port in a way that triggers the STP non-forwarding state while others don't.

Ran into the same problem with a kickstart install of FC4-test2 on a Dell desktop. The integrated Ethernet is:

    02:0c.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02)

I disabled the on-board NIC and installed a 100M Intel e100 NIC, and the installation went smoothly with no problems at all. After the install, I removed the e100 and re-enabled the on-board gigabit NIC, and it works fine. It looks like the problem exists only with kickstart and gigabit NICs (Broadcom, Intel).

That's consistent with what I'm seeing too: it worked fine on an old desktop machine with some old NIC in it, but it doesn't work on our new servers with a Tyan i7210 motherboard and built-in gigabit NIC.

All of my test boxes at this point are gigabit and I'm not seeing anything like this at all. What sort of switch are you plugged into?

Netgear FSM726S managed stackable switch. Which setting(s) in particular should I look at on the switch?

Seeing this too with a Dell GX260 and its integrated e1000 NIC connected to a Cisco Catalyst 4006. STP is enabled on the ports. If I connect the Dell to a cheap no-name switch, kickstart works fine.

I can't duplicate this unless the port is set to do spanning tree, which a port intended for non-routing hardware normally should not be. With some routers it should also be possible to get correct behavior with spanning tree turned on, but only by using Cisco's "bpduguard" feature or an analogous one. I haven't tested this, though, so your mileage may vary.

Got this fixed by setting the 'portfast' option on the Cisco switch.

We have an installation with RHEL3 U5 on several HP ProLiant DL-360 G4 and DL-380 G4 servers together with Cisco Catalyst switches (spanning tree), and both models were constantly failing to perform kickstarts because of this issue.
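Several reports above describe a driver resetting the link and the installer giving up before traffic actually flows. A "wait for link" step can only observe the NIC's carrier state, and the `linksleep` boot option discussed earlier in this report is assumed to control roughly that kind of step. The following minimal sketch uses a hypothetical helper, `wait_for_carrier`, which is not anaconda's loader code. It polls the Linux sysfs file `/sys/class/net/<interface>/carrier` and returns as soon as it reads `1`. Carrier up only means the physical link is established. It says nothing about whether the switch port has left the spanning tree listening and learning states, which may explain why raising linksleep made no difference for several reporters.

```python
import time
from pathlib import Path
from typing import Optional


def wait_for_carrier(carrier_path: str, timeout: float, interval: float = 0.5) -> Optional[float]:
    """Poll a sysfs carrier file until it reads "1".

    Returns the seconds waited, or None if the timeout expires first.
    A "1" only means the physical link is up; a switch port that is still
    in an STP listening/learning state will keep dropping DHCP traffic.
    """
    path = Path(carrier_path)
    start = time.monotonic()
    while True:
        try:
            if path.read_text().strip() == "1":
                return time.monotonic() - start
        except OSError:
            # The read fails while the interface is administratively down
            # (or if the path does not exist); treat that as "no link yet".
            pass
        if time.monotonic() - start >= timeout:
            return None
        time.sleep(interval)
```

A call such as `wait_for_carrier("/sys/class/net/eth0/carrier", timeout=30)` (the interface name is hypothetical) would return within a second or so of link-up on the hardware described here. That can still be well before the switch forwards the first DHCP request.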
On those servers, no DHCP packets could be observed on the wire, and NFS mounts did not succeed either. It's not the switch: kickstart failed even with a crossover cable. The patch to the pump library at https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=110036#c5 helped us.

    02:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
    Subsystem: Compaq Computer Corporation NC7782 Gigabit Server Adapter (PCI-X, 10,100,1000-T)

It's difficult to say who is to blame. Under certain circumstances, these onboard Broadcom NetXtreme gigabit NICs can take up to two minutes(!) after initialization, with both the tg3 and bcm5700 drivers, before they are able to send packets over the network. We confirmed this by waiting a while after the error message and then continuing the manual install by simply pressing the Return key. It would be good if this issue could be fixed. Why does anaconda use "pumpDisableInterface" when it confuses the gigabit devices? Or is it a driver or setup issue? It took us three days to figure this out.

The patch in bug 110036 is already included in RHEL4 GA and in RHEL3 U4 and later. This should be resolved in RHEL3 and RHEL4 with current update releases as long as you are running with either a) spanning tree disabled or b) portfast enabled. If you are still having problems and can GUARANTEE that this is the case, please open a separate issue per person. Unfortunately, this is a case where the symptoms look the same even though there are a number of possible root causes, and things get very confusing if multiple people try to use the same report.

If you have this issue on an HP switch, make sure that LACP is also turned off on the port you are connecting to, as it can add 3 seconds or more before the link comes up. So even if you are using {no STP, STP/portfast, RSTP/edge, MSTP/edge} for fast spanning tree, you may still time out with LACP and a NIC that is slow to come online (tg3 in my case).
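The two-minute Broadcom delay described above suggests that a single short DHCP attempt is fragile on such hardware, and so does the observation that simply waiting and pressing Return let the install continue. The following is a hedged sketch only. The `try_dhcp` callback, the 120-second deadline, and the 5-second pause are hypothetical, not pump's or anaconda's actual interface or defaults. The idea is that a loader could keep retrying until an overall deadline instead of failing after the first timeout. The clock and sleep functions are injectable so the logic can be tested without real waiting.

```python
import time
from typing import Callable, Optional


def dhcp_with_deadline(
    try_dhcp: Callable[[], Optional[str]],
    deadline_s: float = 120.0,
    retry_pause_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Retry a single DHCP exchange until it yields an address or the deadline passes.

    try_dhcp is a hypothetical callback that runs one DHCP exchange and
    returns the leased address as a string, or None if no reply arrived.
    Returns the address, or None so the caller can fall back to manual
    IP configuration.
    """
    start = clock()
    while True:
        address = try_dhcp()
        if address is not None:
            return address
        # Stop if another pause would take us past the overall deadline.
        if clock() - start + retry_pause_s > deadline_s:
            return None
        sleep(retry_pause_s)
```

The design choice here is an overall deadline rather than a fixed number of attempts. That keeps the worst-case wait predictable even if a single exchange itself takes a variable amount of time.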
To turn LACP off from the HP switch CLI, you can do something like:

    conf t
    no int <int list> lacp

So I did:

    no int 1-22 lacp

to disable LACP on ports 1-22, which are connected to end stations. This will bounce all of those ports, so during business hours only do it on the ports you have to :-)

Before I did this, I saw the following in the switch log:

    I 10/11/05 11:55:43 ports: port 9 is now off-line
    I 10/11/05 11:55:46 ports: port 9 is Blocked by LACP
    I 10/11/05 11:55:49 ports: port 9 is Blocked by STP
    I 10/11/05 11:55:49 ports: port 9 is now on-line

and I could not kickstart. Now I see:

    I 10/11/05 12:14:01 ports: port 9 is now off-line
    I 10/11/05 12:14:04 ports: port 9 is Blocked by STP
    I 10/11/05 12:14:04 ports: port 9 is now on-line

and kickstart works fine. FYI, linksleep didn't seem to do anything; the loader waited 4 seconds.

I got this fixed by enabling spanning-tree portfast (STP) on the switches. Works great now.
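To check how much time LACP adds on a given port, you can pull the off-line to on-line interval out of the HP switch log. This minimal sketch assumes one log entry per line, in exactly the format shown in the switch logs quoted in this report, with a month/day/year date. `link_up_delay` is a hypothetical helper, not part of any HP tooling.

```python
import re
from datetime import datetime

# One HP switch log entry, e.g. "I 10/11/05 11:55:43 ports: port 9 is now off-line".
# The date is assumed to be month/day/year.
LOG_ENTRY = re.compile(
    r"^I (\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ports: port (\d+) is (.+)$"
)


def link_up_delay(log_text: str, port: int) -> float:
    """Seconds between the last "now off-line" entry for a port and the
    "now on-line" entry that follows it."""
    offline = online = None
    for line in log_text.splitlines():
        match = LOG_ENTRY.match(line.strip())
        if match is None or int(match.group(2)) != port:
            continue
        when = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
        event = match.group(3)
        if event == "now off-line":
            # A new outage starts; forget any earlier on-line time.
            offline, online = when, None
        elif event == "now on-line" and offline is not None:
            online = when
    if offline is None or online is None:
        raise ValueError(f"no off-line/on-line pair found for port {port}")
    return (online - offline).total_seconds()
```

On the two port 9 logs quoted in this report it returns 6.0 seconds with LACP enabled and 3.0 seconds with it disabled. That matches the 3 extra seconds attributed to LACP.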