Red Hat Bugzilla – Bug 123053
boot hangs bringing up eth0 (dhclient dhcp internet link)
Last modified: 2014-03-16 22:45:13 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4.1)
Description of problem:
The boot process will hang indefinitely while bringing up eth0 (the
internet over a dhcp-based cable modem network). I must use ctl-alt-del to
get any response out of the system. There are a few weird aspects to
what I am seeing.
Background: I build and deploy firewalls based on FC1. Recently on 3
separate deployments, I've noticed said bug during deployment. I
first install and configure FC1 off-site, during which eth0 comes up
fine during a test boot on a local cable modem network. I then deploy
the firewall onsite and immediately get said bug symptoms. Rebooting
doesn't help. It won't get past waiting for eth0. The deployment
sites are on the same cable modem ISP as me, but are in a different
neighbourhood and a different subnet.
Here is where it gets weird. If I reboot the machine, this time with
the eth0 LAN cable physically unplugged, the system boots fine (with
eth0 of course failing with a no media error). I can then plug in the
cable and from a shell do an ifup eth0 which will bring up the
internet with no problems. Even weirder, now if I reboot with the
cable plugged in, eth0 starts up fine! The problem, so far, does not
seem to ever reoccur on that deployment.
I've had this happen on at least 3 deployments recently that I can
remember. However, I've deployed at least 4 other FC1 firewalls since
December and I don't recall having this problem with those. It almost
feels as though the problem is caused by a regression bug in some
update that has been released in the past few months. I am currently
standardized on kernel-2.4.22-1.2179.nptl and all other packages are
at the newest stable release versions.
The problem appears (just a hunch) to be caused by a major change in
the IP address between my off-site test deployment and the on-site
real deployment. It's like dhclient can't cope with the change. Now,
our ISP's dhcp server is notoriously slow, easily taking 10-40 seconds
to respond. Perhaps bug #120531 is part of the problem.
As mentioned, this problem only seems to occur the first time the
machine is deployed, so that makes it very hard to test new
theories/ideas to try to stomp this bug. To do a test run I must
build a new system and go on-site to do the one-shot test deployment.
The other big problem that I'm really at a loss to understand is that
/var/log/messages does not have any entries from the attempted
(hanging) boots! I thought syslogd started first, but the logs are
completely devoid of information regarding these failed boots.
Perhaps ctl-alt-del reboots don't flush the logs to disk? Regardless
of why, this makes debugging this problem even worse.
Another thought is the network cards I am using. The latest time the
bug happened, it was a tulip-driver card (SMC1244-TX), which is the
predominant card I use lately. However, the bug has also occurred
where a ne2k-pci card was eth0. These cards all have no problems once
past this initial bug. However, perhaps bug #111610 is related as
that seems to indicate a regression in the tulip driver. In some
cases, the cards were working perfectly in previous RH7.3 deployments.
Lastly, though I have looked into this at great length, there is a
small possibility that my custom iptables setup is somehow preventing
eth0 from coming up that initial time. But even if I was stupidly
dropping DHCP packets, the boot-time eth0 ifup should not hang
indefinitely. Someone should add a SIGALRM sanity timer in there or
something. It seems unlikely it's my firewall anyways, as the manual
ifup or subsequent reboots have no problems with my firewall
(unchanged) in place and I have gone over the rules many times for
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. build new firewall deployment, test boot off-site
2. take on-site to deploy, boot with LAN cable plugged in
Actual Results: boot hangs saying "bringing up eth0", never finishing
(waited at least an hour a few times).
Expected Results: should bring up eth0 ("OK"), or fail after an
appropriate timeout and continue booting.
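To illustrate the "fail after an appropriate timeout" behaviour requested above, here is a minimal sketch of a SIGALRM-based sanity timer around an external command. It is not how the initscripts or dhclient actually work; the names run_with_alarm and CommandTimeout are hypothetical and exist only for this example.

```python
import signal
import subprocess


class CommandTimeout(Exception):
    """Raised by the SIGALRM handler when the deadline passes (hypothetical name)."""


def _on_alarm(signum, frame):
    # Raising here interrupts the blocking wait() in run_with_alarm.
    raise CommandTimeout()


def run_with_alarm(cmd, timeout_s):
    """Run cmd; return its exit status, or None if it ran longer than timeout_s seconds.

    Must be called from the main thread, because Python only allows
    signal handlers to be installed there.
    """
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    proc = subprocess.Popen(cmd)
    signal.alarm(timeout_s)  # arm the timer (whole seconds)
    try:
        return proc.wait()
    except CommandTimeout:
        proc.kill()  # give up on the hung command
        proc.wait()  # reap the child so no zombie is left behind
        return None
    finally:
        signal.alarm(0)  # disarm the timer
        signal.signal(signal.SIGALRM, old_handler)
```

A boot script built along these lines could call something like run_with_alarm(["/sbin/ifup", "eth0"], 60), treat a None result as "FAILED", and carry on with the rest of the boot. The path and the 60-second value are assumptions for illustration only.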
I'm assuming you can shut the machine down 'normally' (it's not a
Enabling sysrq and getting the output of sysrq-t may help, I suppose.
Ctl-alt-del does a "normal" shutdown procedure, so (I think) the
answer you're looking for is "yes".
I'll try what you suggest next time I see the problem (after I look up
how to do what you say!).
A followup: this bug *may* have been caused by me dropping some incorrect
packets in my iptables setup. I've since decommissioned the system that showed
this behaviour the most and updated my iptables script. The problem has *not*
returned yet so either it was hardware specific or related to my iptables.
However, *even* if I'm dropping bootps/bootpc traffic incorrectly the machine
should at least boot past that point (after a timeout) rather than hanging.
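For anyone auditing their own rules for the same mistake: bootps and bootpc are UDP ports 67 and 68, and a DHCP client on eth0 needs traffic between those two ports to get through. The sketch below only shows that classification; the function name is_dhcp_packet is hypothetical, and the sketch does not cover relay-agent traffic or anything about the reporter's actual iptables script, which is not shown in this report.

```python
BOOTPS_PORT = 67  # DHCP/BOOTP server port ("bootps")
BOOTPC_PORT = 68  # DHCP/BOOTP client port ("bootpc")


def is_dhcp_packet(protocol, src_port, dst_port):
    """Return True for UDP traffic between the bootpc and bootps ports."""
    if protocol.lower() != "udp":
        return False
    # Client requests go 68 -> 67; server replies come back 67 -> 68.
    return {src_port, dst_port} == {BOOTPS_PORT, BOOTPC_PORT}
```

A rule set that drops packets for which this check is true on the external interface would leave the DHCP client without replies.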
Closing bugs on older releases. Apologies for any lack of response.
Unfortunately, without really specific knowledge of what was getting dropped,
it's hard to determine why it was hanging. As such, I have to close this bug. If
the issue comes back, please reopen with the sysrq output, and any information
on the iptables config that is causing this.
Follow-up: this problem has not shown up on any FC3 boxes I manage. It was
either unique to FC1 or it was a problem with my iptables setup I have since fixed.