Bug 151872

Summary: Installer fails to properly assign IP address from DHCP during kickstart
Product: Red Hat Enterprise Linux 4
Reporter: root
Component: anaconda
Assignee: Anaconda Maintenance Team <anaconda-maint-list>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike McLean <mikem>
Severity: medium
Priority: medium
Version: 4.0
CC: dhu, hhd405131, prante, rareigh, steve
Hardware: i386
OS: Linux
Fixed In Version: U5/U1
Doc Type: Bug Fix
Last Closed: 2005-09-19 17:51:07 UTC

Description root 2005-03-23 03:58:29 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050301 Firefox/1.0.1 Red Hat/1.0.1-1.4.3.centos4.1

Description of problem:
When trying to kickstart a system, the installer frequently and randomly fails to assign the IP address offered to it by the DHCP server. Sometimes it apparently fails to contact the DHCP server at all. Booting the same machine from the same CD and entering rescue mode works reliably, i.e. the IP address gets assigned properly from DHCP. In kickstart mode, however, it can take anywhere from 5 to 10 tries to get it to work, depending on my luck :-)


Version-Release number of selected component (if applicable):
anaconda-that-comes-with-RHEL4

How reproducible:
Sometimes

Steps to Reproduce:
1. After setting up DHCP on the network server with the correct MAC address for the network card, boot off RHEL4 CD1.
2. Type 'linux ks=http://192.168.6.9/ks/' at the installer boot prompt, where 192.168.6.9 is the IP of the server with the kickstart files on it.
3. Select the correct Ethernet interface for kickstart when prompted.
  

Actual Results:  Most of the time it would try DHCP for a while, then give up and present the 'Configure TCP/IP' screen. The Alt-F3 console would show a message like "pump told us: no DHCP reply isn't a wireless adaptor". The DHCP log on the server shows this:
Mar 23 14:39:05 app dhcpd: DHCPDISCOVER from 00:e0:81:64:1f:7a via eth0
Mar 23 14:39:05 app dhcpd: DHCPOFFER on 192.168.6.46 to 00:e0:81:64:1f:7a via eth0
Mar 23 14:39:05 app dhcpd: DHCPDISCOVER from 00:e0:81:64:1f:7a via eth0
Mar 23 14:39:05 app dhcpd: DHCPOFFER on 192.168.6.46 to 00:e0:81:64:1f:7a via eth0
i.e. there is no DHCPACK, so the attempted kickstart never properly completes the DHCP transaction to get an IP address.
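
A pattern like this can be spotted mechanically in a longer log. The following is a minimal sketch, assuming dhcpd syslog lines of the shape quoted above; the function name and the regular expression are illustrative and not part of any dhcpd tooling.

```python
import re
from collections import defaultdict

# Matches dhcpd syslog lines of the shape quoted above, e.g.
# "Mar 23 14:39:05 app dhcpd: DHCPOFFER on 192.168.6.46 to 00:e0:81:64:1f:7a via eth0"
LINE_RE = re.compile(
    r"dhcpd: (DHCP[A-Z]+) .*?((?:[0-9a-f]{2}:){5}[0-9a-f]{2})", re.IGNORECASE
)


def incomplete_clients(log_lines):
    """Return the MACs that were sent a DHCPOFFER but never a DHCPACK."""
    seen = defaultdict(set)
    for line in log_lines:
        match = LINE_RE.search(line)
        if match:
            message, mac = match.group(1).upper(), match.group(2).lower()
            seen[mac].add(message)
    return sorted(
        mac for mac, messages in seen.items()
        if "DHCPOFFER" in messages and "DHCPACK" not in messages
    )


# The four log lines from this report.
LOG = """\
Mar 23 14:39:05 app dhcpd: DHCPDISCOVER from 00:e0:81:64:1f:7a via eth0
Mar 23 14:39:05 app dhcpd: DHCPOFFER on 192.168.6.46 to 00:e0:81:64:1f:7a via eth0
Mar 23 14:39:05 app dhcpd: DHCPDISCOVER from 00:e0:81:64:1f:7a via eth0
Mar 23 14:39:05 app dhcpd: DHCPOFFER on 192.168.6.46 to 00:e0:81:64:1f:7a via eth0
"""

print(incomplete_clients(LOG.splitlines()))
```

For the log above this reports the single client 00:e0:81:64:1f:7a, since it was offered an address but no DHCPACK was ever logged for it.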

Interestingly, if I assign the IP address and other settings manually from 'Configure TCP/IP', it still won't work: it'll try for a while, then silently fall back to a CD-based installation.



Expected Results:  The machine should have correctly completed the DHCP transaction and begun the kickstart. This correct operation happens randomly, once every few tries, probably every 5th to 10th time.

Additional info:

These cards are all built-in gigabit Ethernet cards on Tyan i7210 motherboards, and all come up using the e1000 driver in the RHEL4 install.

Comment 2 Jeremy Katz 2005-03-28 18:52:55 UTC
Does it help if you boot with 'linux linksleep=30'?

Comment 3 root 2005-03-28 23:16:42 UTC
I tried the linksleep option and 4 times out of 5 it failed with the exact same
symptoms as described above. On the 4th time it worked but did not work when
rebooted and tried for a 5th time. This is consistent with what I described
above where it works randomly every 5 to 10 times.

So I don't think linksleep made any difference.

Comment 4 Chris Lumens 2005-04-08 04:40:08 UTC
*** Bug 153748 has been marked as a duplicate of this bug. ***

Comment 5 Steve Seremeth 2005-04-08 19:36:57 UTC
I'm having similar problems on an HP DL 380 G4 (w/ Broadcom NetXtreme 5704
gigabit NICs plugged into Cisco Catalyst 4000 10/100).  PXE DHCP works fine,
then kickstart can't get an IP via DHCP.  In my case, the DHCP server doesn't
even hear from the client post-PXE-DHCP request and pre-kickstart.  It doesn't
matter what length I change the linksleep to.  I've tried values as high as two
minutes (120).  ES 4.  So I thought it was related to PXE and tried a boot disk
and got the same thing:

pump told us: no DHCP reply received

then kicks me out to the IP config screen.  A patch would be greatly appreciated.

Comment 6 Haris Hadjiioannou 2005-04-26 03:40:46 UTC
I had the same problem and traced it back to the LAN switch spanning tree
configuration. 

My understanding is that during initialization the port got reset, and caused
the switch's spanning tree logic to set it to a non-forwarding state for several
seconds. This in turn caused dhcp to time out and fail.

Changing the port STP configuration to Edge Port eliminated the problem.


Comment 7 root 2005-04-26 03:48:22 UTC
If it was an STP problem wouldn't this affect other computers on the same
switch? There were no problems with DHCP on other computers or even on the same
computer booting in rescue mode.

Comment 8 Haris Hadjiioannou 2005-04-29 18:49:07 UTC
On most switches, the "Edge Port" setting is individually configured for each
port. Even if all ports are configured the same, my experience was that some
DHCP requests (such as the one during Anaconda install) consistently failed,
while others (such as the ones from PXE boot and dhclient) consistently
succeeded, even on the same port.

Perhaps it has to do with their timeout tolerance, or perhaps some drivers reset
the port in a way that triggers STP non-forwarding state, while others don't.
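
For a sense of scale on the timing argument above: under classic IEEE 802.1D spanning tree with default timers, a port that has just come up spends one forward-delay interval (15 seconds by default) in listening and another in learning before it forwards traffic, whereas an edge/PortFast port skips both. The sketch below only does that arithmetic. The 20-second client budget is a made-up illustrative number, not the actual timeout of pump or anaconda, and the thread does not say which STP variant or timers these switches were running.

```python
FORWARD_DELAY_S = 15  # IEEE 802.1D default forward delay, in seconds


def seconds_until_forwarding(edge_port, forward_delay=FORWARD_DELAY_S):
    """Time a classic-STP port spends in listening + learning after link-up."""
    return 0 if edge_port else 2 * forward_delay


def dhcp_fits(client_budget_s, edge_port):
    """True if the client is still retrying DHCP when the port starts forwarding."""
    return client_budget_s > seconds_until_forwarding(edge_port)


ASSUMED_BUDGET_S = 20  # hypothetical DHCP retry window, for illustration only

for edge in (False, True):
    print(
        f"edge_port={edge}: blocked for {seconds_until_forwarding(edge)} s, "
        f"DHCP fits in {ASSUMED_BUDGET_S} s budget: {dhcp_fits(ASSUMED_BUDGET_S, edge)}"
    )
```

Under these assumptions a client that gives up before the 30 seconds have passed never sees a reply, while a client with a longer retry window on the same port would succeed, which would be consistent with some DHCP clients working and others failing.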


Comment 9 Need Real Name 2005-05-13 17:41:59 UTC
Ran into the same problem when doing a kickstart install of FC4-test2 on a Dell
desktop. The integrated Ethernet is:
02:0c.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet
Controller (rev 02)

I disabled the on-board NIC and installed a 100M Intel e100 NIC, and the
installation went smoothly with no problem at all. After the install, I removed
the e100 and re-enabled the on-board gigabit NIC, and it works fine.

Looks like the problem exists only with kickstart and gigabit NICs (Broadcom,
Intel).


Comment 10 root 2005-05-22 23:34:05 UTC
That's consistent with what I'm seeing too - it worked fine on an old desktop
machine with just some old NIC in it. But it doesn't work on our new servers
with Tyan i7210 mobo and inbuilt gigabit NIC.

Comment 11 Jeremy Katz 2005-05-23 15:35:26 UTC
All of my test boxes at this point are gigabit and I'm not seeing anything like
this at all.  What sort of switch are you plugged into?

Comment 12 root 2005-05-24 07:13:18 UTC
Netgear FSM726S managed stackable switch.

What setting(s) in particular should I look at on the switch?

Comment 13 Janne Liimatainen 2005-06-10 08:28:50 UTC
Seeing this too with a Dell GX260 and its integrated e1000 NIC connected to a
Cisco Catalyst 4006. STP is enabled on the ports.

If I connect the Dell to an elcheapo-noname switch, kickstart works fine.

Comment 14 Peter Jones 2005-06-13 15:09:59 UTC
I can't duplicate this unless the port is set to do spanning tree, which
normally a port intended for use with non-routing hardware should not be.

With some routers it should also be possible to attain correct functionality
with spanning tree turned on, but only using Cisco's "bpduguard" feature or an
analogous feature.  I haven't tested this, though, so your mileage may vary.

Comment 15 Janne Liimatainen 2005-06-17 13:24:12 UTC
Got this fixed by setting a 'portfast' option on the Cisco switch.

Comment 16 Jörg Prante 2005-06-25 21:56:28 UTC
We have an installation with RHEL3 U5 on several HP Proliant DL-360 G4 and
DL-380 G4 together with Cisco Catalyst switch (spanning tree) and both were
constantly failing to perform kickstarts because of this issue.

No DHCP packets could be observed on the wire, and NFS mounts did not succeed
either.

It's not the switch; kickstart failed even with a crosslink cable.

The patch to the "pump library"
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=110036#c5
helped us. 

02:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit
Ethernet (rev 10)
        Subsystem: Compaq Computer Corporation NC7782 Gigabit Server Adapter
(PCI-X, 10,100,1000-T)


It's difficult to say who's to blame. These onboard Gigabit Broadcom NetXtreme
Ethernet NICs can take up to two minutes(!) under certain circumstances after
initialization with both tg3 and bcm5700 drivers before they are able to send
packets over the net. We confirmed this by waiting a while after the error
message and then continuing the manual install by simply pressing the return key.

It would be nice if this issue could be fixed. Why is "pumpDisableInterface" used
in anaconda to confuse the Gigabit devices? Or is it a driver or setup issue? It
took us three days to figure it out.
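
One generic way to cope with a NIC that is slow to come up is to wait for the kernel to report link carrier before starting DHCP. The following is only a sketch of that idea, not how anaconda, pump, or the linksleep option are implemented. It assumes a Linux system that exposes /sys/class/net/<iface>/carrier, and the function names and parameters are made up for illustration. Carrier alone also says nothing about whether the switch port is already forwarding.

```python
import time


def read_carrier(iface):
    """Return True if the kernel reports link carrier on iface (Linux sysfs)."""
    try:
        with open(f"/sys/class/net/{iface}/carrier") as f:
            return f.read().strip() == "1"
    except OSError:
        # Interface missing, or administratively down (the read then fails).
        return False


def wait_for_link(iface, timeout_s, poll_s=1.0, probe=read_carrier, sleep=time.sleep):
    """Poll until the link is up; return seconds waited, or None on timeout."""
    waited = 0.0
    while True:
        if probe(iface):
            return waited
        if waited >= timeout_s:
            return None
        sleep(poll_s)
        waited += poll_s


# Example use (interface name and timeout are placeholders):
#   if wait_for_link("eth0", timeout_s=120) is None:
#       print("no link after 120 s, not starting DHCP")
```

The probe and sleep functions are parameters so the polling logic can be exercised without real hardware.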

Comment 17 Jeremy Katz 2005-06-27 20:16:41 UTC
The patch in bug 110036 is already included in RHEL4 GA and RHEL3 U4 and later.

Comment 18 Jeremy Katz 2005-09-19 17:51:07 UTC
This should be resolved in RHEL3 and RHEL4 with current update releases as long
as you are running with either a) spanning tree disabled or b) port fast enabled.

If you are still having problems and can GUARANTEE that this is the case, please
open a separate issue per person.  

This is unfortunately something where symptoms make things look the same when
there are a number of possible root causes that get very confused if multiple
people try to use the same report.

Comment 21 Allen Smith 2005-10-11 21:27:03 UTC
If you have this issue on an HP switch, make sure that LACP is also turned off
on the port you are connecting to as it can add 3 seconds or more of time before
the link comes up. So even if you are using {no STP, STP/portfast, RSTP/edge,
MSTP/edge} for fast spanning tree, you may still time out with LACP and a NIC
that is slow to come online (tg3 in my case).

From the CLI you can do something like:

conf t
no int <int list> lacp

So I did:

no int 1-22 lacp

to disable lacp on ports 1-22 which are connected to end stations. This will
bounce all of those ports, so only do it on the ports you have to during
business hours :-)

Before I did this I saw the following in the switch log:

I 10/11/05 11:55:43 ports: port 9 is now off-line
I 10/11/05 11:55:46 ports: port 9 is Blocked by LACP
I 10/11/05 11:55:49 ports: port 9 is Blocked by STP
I 10/11/05 11:55:49 ports: port 9 is now on-line

and I could not kickstart.

Now I see:

I 10/11/05 12:14:01 ports: port 9 is now off-line
I 10/11/05 12:14:04 ports: port 9 is Blocked by STP
I 10/11/05 12:14:04 ports: port 9 is now on-line

and kickstart works fine.
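
To compare the two log excerpts above, the time each port stayed down can be computed from the timestamps. This is a small sketch that assumes the log line shape shown in this comment (the date is read as MM/DD/YY); the function name is illustrative and not part of any switch tooling.

```python
from datetime import datetime


def outage_seconds(log_lines):
    """Seconds between 'off-line' and the next 'on-line' event, per port."""
    went_down = {}
    outages = []
    for raw in log_lines:
        line = raw.rstrip()
        parts = line.split()
        # Expected shape: I MM/DD/YY HH:MM:SS ports: port N is ...
        if len(parts) < 7 or parts[3] != "ports:" or parts[4] != "port":
            continue
        stamp = datetime.strptime(f"{parts[1]} {parts[2]}", "%m/%d/%y %H:%M:%S")
        port = parts[5]
        if line.endswith("now off-line"):
            went_down[port] = stamp
        elif line.endswith("now on-line") and port in went_down:
            outages.append((port, (stamp - went_down.pop(port)).total_seconds()))
    return outages


BEFORE = """\
I 10/11/05 11:55:43 ports: port 9 is now off-line
I 10/11/05 11:55:46 ports: port 9 is Blocked by LACP
I 10/11/05 11:55:49 ports: port 9 is Blocked by STP
I 10/11/05 11:55:49 ports: port 9 is now on-line
"""

AFTER = """\
I 10/11/05 12:14:01 ports: port 9 is now off-line
I 10/11/05 12:14:04 ports: port 9 is Blocked by STP
I 10/11/05 12:14:04 ports: port 9 is now on-line
"""

print(outage_seconds(BEFORE.splitlines()))  # with LACP enabled
print(outage_seconds(AFTER.splitlines()))   # with LACP disabled
```

On these excerpts port 9 is down for 6 seconds with LACP enabled and 3 seconds with it disabled, which matches the "3 seconds or more" that LACP adds.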


FYI linksleep didn't seem to do anything; the loader waited 4 seconds.

Comment 22 Perry Huang 2006-07-31 16:44:49 UTC
I got this fixed by enabling spanning-tree portfast (STP) on the switches. Works great now.