189795 – DHCP timeouts during Kickstart

Bug 189795 - DHCP timeouts during Kickstart

Summary: DHCP timeouts during Kickstart

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	anaconda
Sub Component:
Version:	4.0
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Martin Sivák
QA Contact:	Alexander Todorov
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	189792 226814
Depends On:
Blocks:

Reported:	2006-04-24 18:35 UTC by Jon Stanley
Modified:	2018-10-19 21:42 UTC
CC List:	15 users
Fixed In Version:	RHBA-2008-0653
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-07-24 19:05:31 UTC
Target Upstream Version:
Embargoed:

Attachments
Dumb hardcoded fix to pump timeout and retries (1.31 KB, patch) 2007-06-29 21:48 UTC, Doug Scoular	no flags
Add dhcptimeout parameter to loader (2.35 KB, patch) 2007-11-27 12:04 UTC, Martin Sivák	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2008:0653	0	normal	SHIPPED_LIVE	anaconda bug fix and enhancement update	2008-07-23 15:01:42 UTC

Description Jon Stanley 2006-04-24 18:35:44 UTC

+++ This bug was initially created as a clone of Bug #136482 +++

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3)
Gecko/20040913 Firefox/0.10.1

Description of problem:
As of RHEL 3 Update 2 I cannot Kickstart via DHCP.

This problem is due to the way network interfaces are being brought
up. Prior to RHEL 3 Update 2, anaconda would not bring the network
interface down and then back up in order to initiate a DHCP request; it
would simply do a "hot-reconfiguration" of the Kickstart interface. In
other words, the interface didn't lose its link with the switch to get
the address.

Now, when kickstart requests a DHCP address it completely downs the
interface and then brings it back up. This is a problem for us because
the DHCP timeout is shorter than the time it takes our switch ports
(Cisco 2900) to go into a forwarding state. As a result, our servers
are never able to get their DHCP lease.

I'm not sure if this problem is in anaconda, dhclient, initscripts, or
the tg3 driver itself.

This problem is in RHEL 3 Updates 2 & 3, as well as RHEL 4 Beta 1, and RHEL 4 
at least until update 2.

Version-Release number of selected component (if applicable):
RHEL 3 Updates 2 & 3 

How reproducible:
Always

Steps to Reproduce:
1. Request a new DHCP address via Anaconda/Kickstart

Actual Results:  The interface is disabled entirely, then re-enabled,
which causes the switchport to be reset every time.

Expected Results:  The interface should not be completely turned off
then on to get a DHCP address.

Additional info:

This is a new behavior for Red Hat Linux. In previous releases (RHEL 3
Update 1 and before) it could get a DHCP address without resetting the
interface.

-- Additional comment from katzj on 2004-10-20 09:42 EST --
Update 2 actually didn't change the behavior at all, but some drivers
changed and seem to exacerbate the behavior more.  Update 3 adds some
fixes and Update 4 (beta to be released soon) adds another set.

-- Additional comment from matt on 2004-10-20 09:52 EST --
I know that when I use the boot.iso from the initial release of RHEL 3
and Update 1 that I don't have this problem. I never lose the link
between my NIC and the switch during DHCP requests. However, on
Updates 2 and 3, I do. This problem persists on RHEL 4 Beta 1.

-- Additional comment from james_wildman on 2004-11-02 13:53 EST --
I observed the same symptoms with U2, U3, and RH4 Beta 1 on a new HP 
DL585.  If I used a static ip and was willing to cycle through 
the "Can't find server" message a few times (1-3), it would go ahead 
and install.  I don't have access to the switch to tell what it was 
seeing.  

lspci yields...
lspci...
02:06.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 
Gigabit Ethernet (rev 10)
02:06.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 
Gigabit Ethernet (rev 10)

-- Additional comment from marc-redhatbugzilla on 2005-02-16 08:56 EST --
This is the same bug as Bug#15896 which was marked WONTFIX many years ago.
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=15896

http://lists.us.dell.com/pipermail/linux-poweredge/2004-March/037152.html
has more info -- it's related to spanning tree convergence time, which
exceeds the dhcp retry timeout period for the second dhcp sequence --
the one where anaconda is just about to mount the nfs install media.

-- Additional comment from jon.stanley on 2006-03-15 12:34 EST --
I need to disagree with the WONTFIX of the other bug.  I have this problem, and
it's very pervasive on Cisco hardware.

The workaround for this on the network side is to turn on 'spanning-tree
portfast' on an IOS based switch.  However, this is not viable in all network
topologies or with all network administration practices.

The purpose of portfast is to cause a port to go into the STP forwarding state
immediately when link comes up, rather than listening for BPDUs and then
deciding to forward.  With portfast turned on, a loop in the network is not
caught before the port starts forwarding (for instance, if someone hooks a
switch up to the port with two uplinks into the layer 2 infrastructure, you
have a loop).

-- Additional comment from katzj on 2006-04-24 14:19 EST --
Mass-closing lots of old bugs which are in MODIFIED (and thus presumed to be
fixed).  If any of these are still a problem, please reopen or file a new bug
against the release which they're occurring in so they can be properly tracked.

Comment 1 Jeremy Katz 2006-04-26 18:34:40 UTC

*** Bug 189792 has been marked as a duplicate of this bug. ***

Comment 2 Stephen P. Schaefer 2006-06-28 18:45:53 UTC

WONTFIX is unacceptable: even with spanning-tree portfast set on the Cisco
2900's, they still don't settle down fast enough for me, dropping the DHCP
request.  More specifically, after PXE successfully uses DHCP and tftp to get
the kernel onto the box, my DHCP server never sees anaconda's DHCP DISCOVER.  I
don't see this problem outside the kickstart environment (well, not reproducibly
or often), so perhaps you can adjust the timeouts and the number of retries used
by anaconda.  I'm not the most brilliant code hacker, but when I last looked at
the anaconda code, I didn't see any provision for retrying the DHCP DISCOVER.

My usual workaround is to put a $40 5-port no-name unmanaged "switch" in between
the system and the Cisco switch, but at the moment my Sun V40z's PXE won't work
through the $40 switch, only directly connected to the 2900 - ARGGGH!

Comment 3 Jon Stanley 2006-06-28 19:04:44 UTC

I saw over on nahant-list that a new feature has been added to anaconda in
RHEL4U4 beta, where there's a kernel argument of nicdelay=<x> that controls how
long anaconda waits before it sends the initial DHCP request.  I've not actually
used/seen this, but it seems to be a workaround to this problem if it really
does exist.

It would also be worthwhile to note that anaconda drops link between DHCP tries,
so you get into a vicious cycle.  Anaconda should not drop link.

FWIW - I just kickstarted some boxes off a Cisco 3560, and portfast got DHCP to
work.
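
The vicious cycle described in this comment can be illustrated with a toy model. This is only a sketch under stated assumptions, not anaconda or pump code: the function name and its parameters are hypothetical, and it assumes the switch port starts forwarding only after a fixed number of seconds of uninterrupted link.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model (hypothetical, not taken from anaconda or pump).
 * Assumes the switch port forwards only after forward_delay_s seconds of
 * uninterrupted link. Returns the number of seconds until a DHCP reply
 * could first be received, or -1 if every attempt times out.
 */
int seconds_until_lease(int forward_delay_s, int attempt_timeout_s,
                        int attempts, bool drop_link_between_attempts)
{
    assert(forward_delay_s >= 0 && attempt_timeout_s > 0 && attempts >= 0);

    int link_up_s = 0;   /* seconds of continuous link so far */
    int elapsed_s = 0;   /* total seconds spent across attempts */

    for (int i = 0; i < attempts; i++) {
        if (drop_link_between_attempts)
            link_up_s = 0;            /* the port restarts its STP wait */

        int remaining_s = forward_delay_s - link_up_s;
        if (remaining_s <= attempt_timeout_s) {
            if (remaining_s < 0)
                remaining_s = 0;      /* port is already forwarding */
            return elapsed_s + remaining_s;
        }
        link_up_s += attempt_timeout_s;
        elapsed_s += attempt_timeout_s;
    }
    return -1;
}
```

In this model, with a 50 second port delay and a 30 second attempt timeout, no number of retries helps while the link is dropped between tries, whereas keeping the link up lets the second attempt succeed.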

Comment 4 Ray Van Dolson 2007-01-04 01:22:49 UTC

I am also having this issue with RHEL 4 Update 4.

I've tried setting nicdelay to 240 and linksleep to 60.  This results in long
delays during DHCP discovery, but no IP address.  pump continues to fail.

As soon as I put an intermediary hub between the server and the switch, an IP is
grabbed immediately.  A caveat... with nicdelay and linksleep set so high, it's
obvious that the link is brought up and down at several points before the
installation begins.  Why is this?  Shouldn't the link just stay up after an
initial IP is acquired?

Obviously disabling STP would likely fix the problem.  However, ISC's DHCP
client (which is used once the system is installed) has no problems getting an
IP address, so it seems pump should be able to deal with this as well.

RHEL4 Update 4
Dell PowerEdge 2950

What additional information can I provide to help get this issue fixed? 
Unfortunately, I don't know the brand of switch being used in our datacenter,
but can get that information.

Comment 5 Gavin Edwards 2007-01-24 16:54:05 UTC

I have just tried installing my first Dell PowerEdge 2950 server with RHEL 4 WS
Update 4 and am having exactly the same problem.

I have a build server environment where I create a simple boot CD that loads
isolinux and tells it to search for a kickstart script via NFS and I see the
following behaviour. This is done by using the standard Red Hat tools to do so
(So I'm not using any third party information and am using the isolinux, kernel,
anaconda... etc versions that are correct for the version of Red Hat I am trying
to install)

1. The CD boots isolinux and the machine tries to obtain a DHCP address.
2. The machine reports that the DHCP request failed and asks for static IP
information, which I fill in.
3. The NIC comes up on the static IP, retrieves the kickstart file which tells
it to use DHCP, then the NIC switches to using the DHCP assigned address
successfully and the actual install begins.

My network expert colleague reports that as far as the DHCP server is concerned
only one DHCP request is made and a dynamic IP is successfully assigned.

Comment 6 Ray Van Dolson 2007-01-24 18:43:33 UTC

Interesting.

One odd thing I've noticed is that when I specify a _static_ IP from the boot
parameters, if I set the nicdelay as low as even 10 seconds everything works fine.

However, if I choose DHCP and set nicdelay even up to 200 seconds, the pump
client is never able to acquire an IP address.

I would think if STP truly is to blame, either the DHCP request would eventually
work with a long enough delay and/or the static IP request would not work so
quickly (it should also have to wait for the link to come up).

And the fact still remains that in the exact same environment, ISC's client
works perfectly.

Gavin, are you able to set up a mirror port and do some packet dumps comparing
DHCP requests from pump vs DHCP requests from ISC (once the machine is configured)?

I have to put in a request to get my hands on a Catalyst switch to reproduce our
server room environment which could take a little while. :-)

Comment 7 Gavin Edwards 2007-01-25 13:16:58 UTC

Hi Ray,

I asked my networking colleague about this and he said he doesn't really have
the time to spend on doing this at the moment, especially considering we have a
workaround (putting in the IP address manually). Sorry.

Comment 8 Ray Van Dolson 2007-01-25 16:29:33 UTC

Unfortunate.  One of those problems that's annoying enough to want to fix, but
not quite bad enough to warrant the time. :-)

Hopefully will get my hands on a Catalyst so I can do some testing myself.

Comment 9 william ramthun 2007-02-27 18:17:04 UTC

I'm also experiencing a similar issue where network negotiation between server
NIC and switch port does not occur before anaconda times out, thus my
http://kickstart config is not read, and I don't win the prize...:

I'm installing from boot.iso mounted via Virtual CD-ROM thru the HP ILO port.

Install/boot OS: RHEL3 U5
Server Hardware: HP DL385 (using onboard NICs)
Switch Hardware: Cisco 6509

"Spanning-Tree Portfast" is enabled on my switch port

Boot image is utilizing the tg3 NIC driver.  The server NICs are Broadcom.

Comment 10 Ray Van Dolson 2007-02-27 18:34:04 UTC

If you specify a static IP address instead of relying on DHCP do things work for
you?

IMO, if the problem really is STP et al, there should _still_ be a delay with
the static IP as the switch should take the same amount of time to bring up the
port and perform STP calculations whether or not DHCP or static assignment is used.

Comment 11 Doug Scoular 2007-05-07 04:38:47 UTC

Hi All,

Just a note to say that I've been told that having portfast set on all ports
on a switch would be a very bad idea. So I'm not convinced that enabling portfast
is a valid workaround. Neither is insisting on static IPs as our automated
bulk deployment system relies on DHCP and it would create a huge infrastructure
change.

Understanding why anaconda and more specifically, pump, is failing would be
a better path to take. I think the problem really relates to Anaconda's use of
pump.

Has anyone got a detailed technical description of what the root cause is ?

I'll try and do some investigation myself... 

Cheers,

Doug

Comment 12 Doug Scoular 2007-06-29 05:52:09 UTC

Hi All,
Okay, I finally had some time and cause to revisit this issue.

When a switch is using the Spanning Tree Protocol(STP) it can take up to
50 seconds after link is raised on a port for the algorithm to allow the
forwarding of packets on the port.

RedHat provide two solutions to counter this situation:

1) Enable portfast on the port to reduce the time between link being raised and
   packets getting forwarded.
2) Use nicdelay and linksleep to increase the amount of time the anaconda stage
   one loader will wait and retry dhcp.

Both these solutions have drawbacks. 1) is not always possible if it is against
a site's policy and requires the intervention of a network engineer. 2) doesn't
work because the anaconda stage one loader relies on pumpDhcpClassRun without
passing an "override" parameter to change the pump default timeout and number 
of retries. Since pumpDhcpClassRun brings down the link and then only waits a
default of 30 seconds it never sees any packets and certainly no DHCPOFFERS.

I have raised a RedHat issue and provided a simplistic patch which passes
pumpDhcpClassRun an appropriate override parameter. So I'm hopeful
this bugzilla will be squashed soon.

Cheers,

Doug
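
The timing mismatch in the comment above can be written out as a small calculation. This is a sketch, not part of anaconda or pump: the function names are hypothetical, and the timer values are the default IEEE 802.1D ones (max age 20 s, forward delay 15 s), which is the commonly cited breakdown of the 50 second figure. A given switch may be configured differently.

```c
#include <assert.h>

/* Default IEEE 802.1D timers in seconds (assumed; switches can override them). */
enum { STP_MAX_AGE_S = 20, STP_FORWARD_DELAY_S = 15 };

/* Worst case before a port forwards: max age, then listening, then learning. */
int stp_worst_case_delay_s(int max_age_s, int forward_delay_s)
{
    assert(max_age_s >= 0 && forward_delay_s >= 0);
    return max_age_s + 2 * forward_delay_s;
}

/* Seconds by which a DHCP wait falls short of the port delay (0 if long enough). */
int dhcp_wait_shortfall_s(int dhcp_wait_s, int port_delay_s)
{
    assert(dhcp_wait_s >= 0 && port_delay_s >= 0);
    int shortfall_s = port_delay_s - dhcp_wait_s;
    return shortfall_s > 0 ? shortfall_s : 0;
}
```

With the defaults, the worst case is 20 + 15 + 15 = 50 seconds, so the 30 second wait mentioned above ends 20 seconds before the port may begin forwarding.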

Comment 13 Ray Van Dolson 2007-06-29 15:22:00 UTC

Excellent, thanks Doug.  Probably better to have opened an official support
issue than relying on bugzilla ;)

Could you post back here and let us all know what your resolution is?

Would be great to see this in RHEL4 U6 (or an interim update), not just RHEL5.

Comment 14 Doug Scoular 2007-06-29 21:45:38 UTC

Hi Ray,
I have opened an official support issue (125366), but I wanted to keep
the wider community informed. We don't use RHEL5 at all and I built my
embryonic fix against RHEL4U4. I'm hoping RH will take my patch and
identification of the core problem and use these to produce a more
wide ranging patch.

However, for those who cannot wait for a better, professionally produced patch,
here's my diff with hideous hard-coded values. Its purpose was really to
highlight that the issue lies with pump's pumpDhcpClassRun method being left
to its own devices rather than being overridden. I chose somewhat excessive
values, but they seem to work. Here's my patch against anaconda 10.1.1.46:

diff -uNr anaconda-10.1.1.46/loader2/net.c anaconda-10.1.1.46-dug/loader2/net.c
--- anaconda-10.1.1.46/loader2/net.c    2006-04-20 06:30:56.000000000 +1000
+++ anaconda-10.1.1.46-dug/loader2/net.c        2007-06-29 13:05:50.000000000 +1000
@@ -689,11 +689,29 @@

 char * doDhcp(char * ifname,
               struct networkDeviceConfig *dev, char * dhcpclass) {
+    extern int num_link_checks;
+    extern int post_link_sleep;
+    struct pumpOverrideInfo override;
+
+    /*
+     * Originally thought I could use num_link_checks and
+     * post_link_sleep but this confuses the two sets of wait code.
+     *
+     * Unsure if we should have customisable waits for link in
+     * anaconda at all when we want to DHCP. Let pump handle
+     * custom waits and retries methinks - dscoular
+     * Hard coding for now.
+     */
+    memset(&override, 0, sizeof(override));
+    pumpInitOverride ( &override );
+    override.timeout = 100; /* post_link_sleep; out for now */
+    override.numRetries = 40; /* num_link_checks; out for now */
+
     setupWireless(dev);
     logMessage("running dhcp for %s", ifname);
     return pumpDhcpClassRun(ifname, 0, 0, NULL,
                             dhcpclass ? dhcpclass : "anaconda",
-                            &dev->dev, NULL);
+                            &dev->dev, &override);

 }

I'll attach it too just in case it gets munged. You'll have to rebuild a
patched anaconda from the source rpm and then take the loader binary from
anaconda-10.1.1.46/loader2/loader and inject it into your initrd.img as
/sbin/loader.

Cheers,

Doug

Comment 15 Doug Scoular 2007-06-29 21:48:09 UTC

Created attachment 158272 [details]
Dumb hardcoded fix to pump timeout and retries

Probably best to wait for an official fix.

Comment 16 David Mays 2007-07-18 13:00:14 UTC

I have been fighting this for a few days now, and this dialog seems to help me 
understand. I am having this problem, but it seems to shut the interface 
down when loading the first rpm package. It sometimes shuts down on the 
minstg2.img load, but always on the rpm load. 

I did load SUSE 10.2 and they seem to resolve this weirdness by loading in a 
single stage: the pxeboot loads a 64MB ramdisk and loads all it needs in a 
single initrd.img file. This would be a major change of direction for the 
anaconda installer, but would it not be a lot cleaner? All the servers that I 
am buying for production right now have more than 2GB of RAM, so how about
loading the first cdrom into RAM and going from there?

Comment 17 David Cantrell 2007-07-24 15:25:50 UTC

*** Bug 226814 has been marked as a duplicate of this bug. ***

Comment 19 Martin Sivák 2007-11-27 12:04:26 UTC

Created attachment 269681 [details]
Add dhcptimeout parameter to loader

This should be a little bit better than the "dumb hardcoded" patch.

Comment 20 Eugene Teo (Security Response) 2007-12-17 02:28:14 UTC

Martin,

Can you create a test package for testing please? I tried to apply your patch to
anaconda-10.1.1.63 but it does not patch properly.

Thanks,
Eugene

Comment 21 Martin Sivák 2007-12-19 07:58:25 UTC

(In reply to comment #20)

> I tried to apply your patch to
> anaconda-10.1.1.63 but it does not patch properly.
> 
> Thanks,Eugene

According to RHEL trees, the anaconda version in RHEL4 is 10.1.1.67. What is the
exact version of RHEL you are using?

Comment 22 Eugene Teo (Security Response) 2007-12-19 08:07:31 UTC

Customer is running RHEL4U4.

Comment 24 Eugene Teo (Security Response) 2008-01-03 06:41:11 UTC

Hi Martin,

Just to let you know that I have generated a test RPM. Getting my customer to
test the patch you submitted. Will ping you when I get some feedback.

Comment 29 Martin Sivák 2008-02-04 10:00:04 UTC

And the patch is already in our RHEL4 queue, so I'm setting this to MODIFIED.

Comment 43 Ray Van Dolson 2008-06-03 20:54:02 UTC

Folks, is this slated for inclusion in RHEL 4.7?

Comment 44 James G. Brown III 2008-06-03 21:05:15 UTC

This will be included in 4.7 as a parameter called dhcptimeout, e.g.
dhcptimeout=60

- James
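
For readers wondering how a boot parameter of this shape might be consumed, here is a minimal sketch. It is not the loader's actual parser: the function name, the fallback argument, and the 3600 second upper bound are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not the loader's actual code.
 * Scans a space-separated kernel command line for "dhcptimeout=<seconds>"
 * and returns the value, or fallback_s if the key is absent or malformed.
 */
int parse_dhcptimeout(const char *cmdline, int fallback_s)
{
    static const char key[] = "dhcptimeout=";
    const size_t key_len = sizeof(key) - 1;

    assert(cmdline != NULL);

    for (const char *p = cmdline; *p != '\0'; ) {
        while (*p == ' ')
            p++;                               /* skip separators */
        if (strncmp(p, key, key_len) == 0) {
            char *end = NULL;
            long value = strtol(p + key_len, &end, 10);
            /* accept only a positive number that ends the token;
               the 3600 cap is an arbitrary sanity limit */
            if (end != p + key_len && (*end == ' ' || *end == '\0')
                && value > 0 && value <= 3600)
                return (int)value;
            return fallback_s;
        }
        while (*p != ' ' && *p != '\0')
            p++;                               /* skip to the next token */
    }
    return fallback_s;
}
```

A call such as parse_dhcptimeout("linux ks dhcptimeout=60", 30) would yield 60, while a command line without the parameter falls back to the supplied default.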

Comment 45 Ray Van Dolson 2008-06-03 21:50:51 UTC

Thanks James.

I tested the patch mentioned in #19 above and built a new loader against the
anaconda 10.1.1.81 sources.

The patch applied cleanly, but when booting with the dhcptimeout parameter, we
encountered issues (resulting in it not working at all for us).  I added two log
entries in doDhcp in net.c as follows:

    logMessage("override.timeout set to %d", override.timeout);
    logMessage("dhcpTimeout set to %d", dev->dhcpTimeout);

Here is what I observed:

1. Booted with modified initrd and dhcptimeout=90
2. Initial automated DHCP request fails, observed the following in tty2:

  override.timeout set to 0
  dhcpTimeout set to 0

3. Back on tty1 we just told the installer to "try again" with DHCP and observed
the following on tty2:

  override.timeout set to 45
  dhcpTimeout set to -1222348285

An IP address is successfully obtained here, but...

4. Immediately anaconda of course wants to bounce the interface:

  override.timeout set to 0
  dhcpTimeout set to 0

This fails obviously and we cannot continue.

I modified doDhcp to hardcode a timeout of 90 seconds for override.timeout and
now we can complete our installation with DHCP.

Any idea why the above is happening?  The patch appears to have touched both
loader.c and net.c correctly -- I can see the dhcptimeout parsing logic in
loader.c, but somewhere along the way (and between interface bounces too
perhaps) the information is lost and we drop back to default behavior.

I'd be happy to attach the tarball of my patched source for verification if you
like.

Comment 46 Ray Van Dolson 2008-06-04 19:53:05 UTC

This happens in RHEL 5.x as well.  I haven't tried patching the initrd yet however.

Comment 48 errata-xmlrpc 2008-07-24 19:05:31 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0653.html

Comment 54 Alexander Todorov 2008-09-24 06:52:04 UTC

(In reply to comment #46)
> This happens in RHEL 5.x as well.  I haven't tried patching the initrd yet
> however.

Ray,
is there a bz for RHEL 5.x ? Can you please file one if you still see this issue?

Thanks!

Comment 55 Martin Sivák 2008-09-24 08:43:22 UTC

The patches which made it into 4.7 were:


http://git.fedorahosted.org/git/anaconda.git?p=anaconda.git;a=commit;h=774f045eb9bb10c764e30ae1f864ecd5cf5730b0
http://git.fedorahosted.org/git/anaconda.git?p=anaconda.git;a=commit;h=3685a4d392551b299dd828b02927a744b3bacadc

And they solved all the issues mentioned here (the uninitialized variable and not passing the value correctly inside the net code)
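
The two classes of fix named above (initializing the variable, and passing the value through the net code) can be sketched generically. This is not the content of the linked commits: the struct and function names are hypothetical stand-ins, the 30 second default is the figure quoted in comment 12, and the retry count is an arbitrary placeholder.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the loader's device config and pump's override. */
struct device_config { int dhcp_timeout_s; };
struct dhcp_override { int timeout_s; int num_retries; };

/* 30 s is the default quoted in this thread; the retry count is illustrative. */
enum { DEFAULT_TIMEOUT_S = 30, DEFAULT_RETRIES = 5 };

/* Zero the whole struct so no field is ever read before it is set. */
void device_config_init(struct device_config *dev)
{
    assert(dev != NULL);
    memset(dev, 0, sizeof(*dev));
}

/* Build the override from the device config; 0 means "not set, use default". */
struct dhcp_override make_override(const struct device_config *dev)
{
    assert(dev != NULL);
    struct dhcp_override ov = { DEFAULT_TIMEOUT_S, DEFAULT_RETRIES };
    if (dev->dhcp_timeout_s > 0)
        ov.timeout_s = dev->dhcp_timeout_s;
    return ov;
}
```

The design point is that every DHCP run derives its override from the same initialized configuration, so a later interface bounce cannot silently fall back to a zero or garbage timeout like the values logged in comment 45.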

Comment 56 Martin Sivák 2008-09-24 08:46:23 UTC

The bugzilla numbers for 5.x are #198147, #254032 and they were verified as well.
