+++ This bug was initially created as a clone of Bug #136482 +++

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20040913 Firefox/0.10.1

Description of problem:
As of RHEL 3 Update 2 I cannot Kickstart via DHCP. This problem is due to the way network interfaces are being brought up. I'm not sure if the problem lies in

Prior to RHEL 3 Update 2, anaconda would not bring the network interface down and then back up in order to initiate a DHCP request; it would simply do a "hot-reconfiguration" of the Kickstart interface. In other words, the interface didn't lose its link with the switch to get the address.

Now, when kickstart requests a DHCP address, it completely downs the interface and then brings it back up. This is a problem for us because the DHCP timeout is shorter than the time it takes our switch ports (Cisco 2900) to go into a forwarding state. As a result, our servers are never able to get their DHCP lease.

I'm not sure if this problem is in anaconda, dhclient, initscripts, or the tg3 driver itself. This problem is in RHEL 3 Updates 2 & 3, as well as RHEL 4 Beta 1, and RHEL 4 at least until update 2.

Version-Release number of selected component (if applicable):
RHEL 3 Updates 2 & 3

How reproducible:
Always

Steps to Reproduce:
1. Request a new DHCP address via Anaconda/Kickstart

Actual Results:
The interface is disabled entirely, then re-enabled, which causes the switchport to be reset every time.

Expected Results:
The interface should not be completely turned off then on to get a DHCP address.

Additional info:
This is a new behavior for Red Hat Linux. In previous releases (RHEL 3 Update 1 and before) it could get a DHCP address without resetting the interface.

-- Additional comment from katzj on 2004-10-20 09:42 EST --
Update 2 actually didn't change the behavior at all, but some drivers changed and seem to exacerbate the behavior. Update 3 adds some fixes and Update 4 (beta to be released soon) adds another set.
-- Additional comment from matt on 2004-10-20 09:52 EST --
I know that when I use the boot.iso from the initial release of RHEL 3 and Update 1, I don't have this problem. I never lose the link between my NIC and the switch during DHCP requests. However, on Updates 2 and 3, I do. This problem persists on RHEL 4 Beta 1.

-- Additional comment from james_wildman on 2004-11-02 13:53 EST --
I observed the same symptoms with U2, U3, and RH4 Beta 1 on a new HP DL585. If I used a static IP and was willing to cycle through the "Can't find server" message a few times (1-3), it would go ahead and install. I don't have access to the switch to tell what it was seeing. lspci yields...

lspci...
02:06.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
02:06.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)

-- Additional comment from marc-redhatbugzilla on 2005-02-16 08:56 EST --
This is the same bug as Bug #15896, which was marked WONTFIX many years ago.
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=15896

http://lists.us.dell.com/pipermail/linux-poweredge/2004-March/037152.html
has more info: it's related to spanning tree convergence time, which exceeds the DHCP retry timeout period for the second DHCP sequence (the one where anaconda is just about to mount the NFS install media).

-- Additional comment from jon.stanley on 2006-03-15 12:34 EST --
I need to disagree with the WONTFIX of the other bug. I have this problem, and it's very pervasive on Cisco hardware. The workaround for this on the network side is to turn on 'spanning-tree portfast' on an IOS based switch. However, this is not viable in all network topologies or with all network administration practices. The purpose of portfast is to cause a port to go into the STP forwarding state immediately when link comes up, rather than listening for BPDUs first and then deciding to forward.
With portfast turned on, a loop can form in the network: for instance, if someone hooks a switch up to the port with two uplinks into the layer 2 infrastructure, you have a loop.

-- Additional comment from katzj on 2006-04-24 14:19 EST --
Mass-closing lots of old bugs which are in MODIFIED (and thus presumed to be fixed). If any of these are still a problem, please reopen or file a new bug against the release in which they're occurring so they can be properly tracked.
*** Bug 189792 has been marked as a duplicate of this bug. ***
WONTFIX is unacceptable: even with spanning-tree portfast set on the Cisco 2900's, they still don't settle down fast enough for me, dropping the DHCP request. More specifically, after PXE successfully uses DHCP and tftp to get the kernel onto the box, my DHCP server never sees anaconda's DHCP DISCOVER. I don't see this problem outside the kickstart environment (well, not reproducibly or often), so perhaps you can adjust the timeouts and the number of retries used by anaconda. I'm not the most brilliant code hacker, but when I last looked at the anaconda code, I didn't see any provision for retrying the DHCP DISCOVER. My usual workaround is to put a $40 5-port no-name unmanaged "switch" in between the system and the Cisco switch, but at the moment my Sun V40z's PXE won't work through the $40 switch, only directly connected to the 2900 - ARGGGH!
I saw over on nahant-list that a new feature has been added to anaconda in RHEL4U4 beta, where there's a kernel argument of nicdelay=<x> that controls how long anaconda waits before it sends the initial DHCP request. I've not actually used/seen this, but it seems to be a workaround to this problem if it really does exist. It would also be worthwhile to note that anaconda drops link between DHCP tries, so you get into a vicious cycle. Anaconda should not drop link. FWIW - I just kickstarted some boxes off a Cisco 3560, and portfast got DHCP to work.
I am also having this issue with RHEL4 Update 4. I've tried setting nicdelay to 240 and linksleep to 60. This results in long delays during DHCP discovery, but no IP address. pump continues to fail. As soon as I put an intermediary hub between the server and the switch, an IP is grabbed immediately.

A caveat... with nicdelay and linksleep set so high, it's obvious that the link is brought up and down at several points before the installation begins. Why is this? Shouldn't the link just stay up after an initial IP is acquired?

Obviously disabling STP would likely fix the problem. However, ISC's DHCP client (which is used once the system is installed) has no problems getting an IP address, so it seems pump should be able to deal with this as well.

RHEL4 Update 4
Dell PowerEdge 2950

What additional information can I provide to help get this issue fixed? Unfortunately, I don't know the brand of switch being used in our datacenter, but I can get that information.
I have just tried installing my first Dell PowerEdge 2950 server with RHEL 4 WS Update 4 and am having exactly the same problem. I have a build server environment where I create a simple boot CD that loads isolinux and tells it to search for a kickstart script via NFS. The CD is built with the standard Red Hat tools, so I'm not using any third party components and am using the isolinux, kernel, anaconda, etc. versions that are correct for the version of Red Hat I am trying to install. I see the following behaviour:

1. The CD boots isolinux and the machine tries to obtain a DHCP address.
2. The machine reports that the DHCP request failed and asks for static IP information, which I fill in.
3. The NIC comes up on the static IP and retrieves the kickstart file, which tells it to use DHCP; the NIC then switches to the DHCP assigned address successfully and the actual install begins.

My network expert colleague reports that as far as the DHCP server is concerned only one DHCP request is made and a dynamic IP is successfully assigned.
Interesting. One odd thing I've noticed is that when I specify a _static_ IP from the boot parameters, if I set the nicdelay as low as even 10 seconds everything works fine. However, if I choose DHCP and set nicdelay even up to 200 seconds, the pump client is never able to acquire an IP address. I would think if STP truly is to blame, either the DHCP request would eventually work with a long enough delay and/or the static IP request would not work so quickly (it should also have to wait for the link to come up). And the fact still remains that in the exact same environment, ISC's client works perfectly. Gavin, are you able to set up a mirror port and do some packet dumps comparing DHCP requests from pump vs DHCP requests from ISC (once the machine is configured)? I have to put in a request to get my hands on a Catalyst switch to reproduce our server room environment which could take a little while. :-)
Hi Ray, I asked my networking colleague about this and he said he doesn't really have the time to spend on doing this at the moment, especially considering we have a workaround (putting in the IP address manually). Sorry.
Unfortunate. One of those problems that's annoying enough to want to fix, but not quite bad enough to warrant the time. :-) Hopefully will get my hands on a Catalyst so I can do some testing myself.
I'm also experiencing a similar issue where network negotiation between server NIC and switch port does not occur before anaconda times out, thus my HTTP kickstart config is not read, and I don't win the prize...

I'm installing from boot.iso mounted via Virtual CD-ROM thru the HP ILO port.

Install/boot OS: RHEL3 U5
Server Hardware: HP DL385 (using onboard NICs)
Switch Hardware: Cisco 6509

"Spanning-Tree Portfast" is enabled on my switch port. The boot image is utilizing the tg3 NIC driver. The server NICs are Broadcom.
If you specify a static IP address instead of relying on DHCP, do things work for you? IMO, if the problem really is STP et al, there should _still_ be a delay with the static IP, as the switch should take the same amount of time to bring up the port and perform STP calculations whether DHCP or static assignment is used.
Hi All, Just a note to say that I've been told that having portfast set on all ports on a switch would be a very bad idea. So I'm not convinced that enabling portfast is a valid workaround. Neither is insisting on static IPs as our automated bulk deployment system relies on DHCP and it would create a huge infrastructure change. Understanding why anaconda and more specifically, pump, is failing would be a better path to take. I think the problem really relates to Anaconda's use of pump. Has anyone got a detailed technical description of what the root cause is ? I'll try and do some investigation myself... Cheers, Doug
Hi All,

Okay, I finally had some time and cause to revisit this issue. When a switch is using the Spanning Tree Protocol (STP) it can take up to 50 seconds after link is raised on a port for the algorithm to allow the forwarding of packets on the port. RedHat provide two solutions to counter this situation:

1) Enable portfast on the port to reduce the time between link being raised and packets getting forwarded.
2) Use nicdelay and linksleep to increase the amount of time the anaconda stage one loader will wait and retry dhcp.

Both these solutions have drawbacks:

1) is not always possible if it is against a site's policy, and it requires the intervention of a network engineer.
2) doesn't work because the anaconda stage one loader relies on pumpDhcpClassRun without passing an "override" parameter to change the pump default timeout and number of retries. Since pumpDhcpClassRun brings down the link and then only waits a default of 30 seconds, it never sees any packets and certainly no DHCPOFFERs.

I have raised a RedHat issue and provided a simplistic patch which passes pumpDhcpClassRun an appropriate override parameter. So I'm hopeful this bugzilla will be squashed soon.

Cheers,
Doug
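The timing argument above can be sketched in a few lines of C. This is a toy model, not pump or anaconda code: `first_lease_time` and its parameters are made up for illustration, and it assumes the client keeps retransmitting for the whole timeout window and that every link bounce restarts the switch's STP timer.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the race between STP and the DHCP client.
 *   stp_delay   seconds after link-up before the switch port forwards
 *   timeout     seconds the client waits per DHCP attempt
 *   retries     number of DHCP attempts
 *   bounce_link true if the client drops and re-raises link before each
 *               attempt (assumed to restart the STP timer)
 * Returns the second at which a lease could first be obtained, or -1 if
 * every attempt times out before the port starts forwarding.
 */
int first_lease_time(int stp_delay, int timeout, int retries, bool bounce_link)
{
    assert(stp_delay >= 0 && timeout > 0 && retries > 0);

    int now = 0;        /* seconds since the first link-up */
    int link_up_at = 0; /* time the link last came up */

    for (int attempt = 0; attempt < retries; attempt++) {
        if (bounce_link)
            link_up_at = now;   /* port goes back to listening/learning */

        int forwarding_at = link_up_at + stp_delay;

        /* Does the port start forwarding inside this attempt's window? */
        if (forwarding_at < now + timeout)
            return forwarding_at > now ? forwarding_at : now;

        now += timeout;         /* attempt timed out, move on to the next */
    }
    return -1;
}
```

With a 50 second forwarding delay and a 30 second timeout, `first_lease_time(50, 30, 3, true)` returns -1 no matter how many retries are added, because each bounce pushes the forwarding time out again. The same call with `bounce_link` set to false returns 50, and a single 100 second attempt, `first_lease_time(50, 100, 1, true)`, also returns 50.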
Excellent, thanks Doug. Probably better to have opened an official support issue than relying on bugzilla ;) Could you post back here and let us all know what your resolution is? Would be great to see this in RHEL4 U6 (or an interim update), not just RHEL5.
Hi Ray,

I have opened an official support issue (125366), but I wanted to keep the wider community informed. We don't use RHEL5 at all and I built my embryonic fix against RHEL4U4. I'm hoping RH will take my patch and identification of the core problem and use these to produce a more wide ranging patch.

However, for those who cannot wait for a better, professionally produced patch, here's my diff with hideous hard-coded values. Its purpose was really to highlight that the issue lies with pump's pumpDhcpClassRun method being left to its own devices rather than being overridden. I chose somewhat excessive values, but they seem to work.

Here's my patch against anaconda 10.1.1.46:

diff -uNr anaconda-10.1.1.46/loader2/net.c anaconda-10.1.1.46-dug/loader2/net.c
--- anaconda-10.1.1.46/loader2/net.c 2006-04-20 06:30:56.000000000 +1000
+++ anaconda-10.1.1.46-dug/loader2/net.c 2007-06-29 13:05:50.000000000 +1000
@@ -689,11 +689,29 @@
 char * doDhcp(char * ifname, struct networkDeviceConfig *dev, char * dhcpclass) {
+    extern int num_link_checks;
+    extern int post_link_sleep;
+    struct pumpOverrideInfo override;
+
+    /*
+     * Originally thought I could use num_link_checks and
+     * post_link_sleep but this confuses the two sets of wait code.
+     *
+     * Unsure if we should have customisable waits for link in
+     * anaconda at all when we want to DHCP. Let pump handle
+     * custom waits and retries methinks - dscoular
+     * Hard coding for now.
+     */
+    memset(&override, 0, sizeof(override));
+    pumpInitOverride ( &override );
+    override.timeout = 100;   /* post_link_sleep; out for now */
+    override.numRetries = 40; /* num_link_checks; out for now */
+
     setupWireless(dev);
     logMessage("running dhcp for %s", ifname);
     return pumpDhcpClassRun(ifname, 0, 0, NULL,
                             dhcpclass ? dhcpclass : "anaconda",
-                            &dev->dev, NULL);
+                            &dev->dev, &override);
 }

I'll attach it too just in case it gets munged.
You'll have to rebuild a patched anaconda from the source rpm and then take the loader binary from anaconda-10.1.1.46/loader2/loader and inject it into your initrd.img as /sbin/loader.

Cheers,
Doug
Created attachment 158272 [details]
Dumb hardcoded fix to pump timeout and retries

Probably best to wait for an official fix.
I have been fighting this for a few days now, and this dialog seems to help me understand. I am having this problem, but it seems to shut the interface down when loading the first rpm package. It sometimes shuts down on the minstg2.img load, but always on the rpm load.

I did load SUSE 10.2, and they seem to resolve this weirdness by loading in a single stage: the pxeboot loads a 64MB ramdisk and loads all it needs in a single initrd.img file. This would be a major change of direction for the anaconda installer, but would it not be a lot cleaner? All the servers that I am buying for production right now have more than 2GB of RAM, so how about loading the first cdrom into RAM and going from there?
*** Bug 226814 has been marked as a duplicate of this bug. ***
Created attachment 269681 [details]
Add dhcptimeout parameter to loader

This should be a little bit better than the "dumb hardcoded" patch.
Martin, Can you create a test package for testing please? I tried to apply your patch to anaconda-10.1.1.63 but it does not patch properly. Thanks, Eugene
(In reply to comment #20)
> I tried to apply your patch to
> anaconda-10.1.1.63 but it does not patch properly.
>
> Thanks,Eugene

According to RHEL trees, the anaconda version in RHEL4 is 10.1.1.67. What is the exact version of RHEL you are using?
Customer is running RHEL4U4.
Hi Martin, Just to let you know that I have generated a test RPM. Getting my customer to test the patch you submitted. Will ping you when I get some feedback.
And the patch is already in our RHEL4 queue, so I'm setting this to MODIFIED.
Folks, is this slated for inclusion in RHEL 4.7?
This will be included in 4.7 as a parameter called dhcptimeout, e.g. dhcptimeout=60 - James
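For readers unfamiliar with how a boot parameter like this reaches the installer, here is a minimal sketch of pulling `dhcptimeout=N` out of a kernel command line string. It is not the loader.c code from the patch; `parse_dhcptimeout`, its fallback argument, and the example `ks=` value are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not anaconda's parser: return the value of
 * "dhcptimeout=N" found in a space-separated command line, or `fallback`
 * if the option is absent or its value is not a positive number.
 */
int parse_dhcptimeout(const char *cmdline, int fallback)
{
    static const char key[] = "dhcptimeout=";
    const size_t keylen = sizeof(key) - 1;
    const char *p = cmdline;

    assert(cmdline != NULL);

    while ((p = strstr(p, key)) != NULL) {
        /* Accept the key only at the start of a token. */
        if (p == cmdline || p[-1] == ' ') {
            const char *val = p + keylen;
            char *end;
            long v = strtol(val, &end, 10);

            /* Require digits followed by end of token. */
            if (end != val && v > 0 && v <= 86400 &&
                (*end == '\0' || *end == ' ' || *end == '\n'))
                return (int)v;
            return fallback;
        }
        p += keylen;   /* matched inside another word, keep searching */
    }
    return fallback;
}
```

For example, `parse_dhcptimeout("ks=nfs:server:/ks.cfg dhcptimeout=60", 30)` returns 60, and a command line without the option returns the fallback of 30. The 86400 upper bound is an arbitrary sanity limit chosen for this sketch.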
Thanks James. I tested the patch mentioned in #19 above and built a new loader against the anaconda 10.1.1.81 sources. The patch applied cleanly, but when booting with the dhcptimeout parameter we encountered issues (resulting in it not working at all for us). I added two log entries in doDhcp in net.c as follows:

    logMessage("override.timeout set to %d", override.timeout);
    logMessage("dhcpTimeout set to %d", dev->dhcpTimeout);

Here is what I observed:

1. Booted with modified initrd and dhcptimeout=90.

2. Initial automated DHCP request fails; observed the following on tty2:

    override.timeout set to 0
    dhcpTimeout set to 0

3. Back on tty1 we just told the installer to "try again" with DHCP and observed the following on tty2:

    override.timeout set to 45
    dhcpTimeout set to -1222348285

   An IP address is successfully obtained here, but...

4. Immediately anaconda of course wants to bounce the interface:

    override.timeout set to 0
    dhcpTimeout set to 0

   This fails obviously and we cannot continue.

I modified doDhcp to hardcode a timeout of 90 seconds for override.timeout and now we can complete our installation with DHCP.

Any idea why the above is happening? The patch appears to have touched both loader.c and net.c correctly -- I can see the dhcptimeout parsing logic in loader.c, but somewhere along the way (and between interface bounces too perhaps) the information is lost and we drop back to default behavior.

I'd be happy to attach the tarball of my patched source for verification if you like.
This happens in RHEL 5.x as well. I haven't tried patching the initrd yet however.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0653.html
(In reply to comment #46)
> This happens in RHEL 5.x as well. I haven't tried patching the initrd yet
> however.

Ray, is there a bz for RHEL 5.x? Can you please file one if you still see this issue? Thanks!
The patches which made it into 4.7 were:

http://git.fedorahosted.org/git/anaconda.git?p=anaconda.git;a=commit;h=774f045eb9bb10c764e30ae1f864ecd5cf5730b0
http://git.fedorahosted.org/git/anaconda.git?p=anaconda.git;a=commit;h=3685a4d392551b299dd828b02927a744b3bacadc

And they solved all the issues mentioned here (the uninitialized variable and not passing the value correctly inside the net code).
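As an illustration of that class of bug (not the contents of the commits above), here is a small C sketch. The struct and function names are hypothetical stand-ins; the point is only that zeroing the config once, storing the boot-time value in it, and deriving the override from it on every DHCP run keeps the timeout stable across interface bounces instead of yielding 0 or stack garbage.

```c
#include <assert.h>
#include <string.h>

#define FALLBACK_TIMEOUT 30   /* illustrative default, in seconds */

struct net_config {           /* hypothetical stand-in for the device config */
    int dhcp_timeout;         /* seconds; 0 means "not set on the boot line" */
};

struct dhcp_override {        /* hypothetical stand-in for the pump override */
    int timeout;
};

/* Zero the whole struct first so no field holds leftover stack contents. */
void net_config_init(struct net_config *cfg, int boot_timeout)
{
    assert(cfg != NULL);
    memset(cfg, 0, sizeof(*cfg));
    cfg->dhcp_timeout = boot_timeout;
}

/* Build the override from the stored config on every DHCP run. */
void build_override(const struct net_config *cfg, struct dhcp_override *ov)
{
    assert(cfg != NULL && ov != NULL);
    memset(ov, 0, sizeof(*ov));
    ov->timeout = cfg->dhcp_timeout > 0 ? cfg->dhcp_timeout
                                        : FALLBACK_TIMEOUT;
}
```

Calling `build_override` twice on a config initialised with 90 gives a timeout of 90 both times, which is the behaviour the tester expected from dhcptimeout=90 on the first attempt, the retry, and the bounce that follows.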
The bugzilla numbers for 5.x are #198147, #254032 and they were verified as well.