From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4.1) Gecko/20031114

Description of problem:
The boot process hangs indefinitely while bringing up eth0 (the internet connection, over a DHCP-based cable modem network). I must press Ctrl-Alt-Del to get any response out of the system. There are a few weird aspects to what I am seeing.

Background: I build and deploy firewalls based on FC1. On 3 separate recent deployments I've noticed this bug. I first install and configure FC1 off-site, where eth0 comes up fine during a test boot on a local cable modem network. I then deploy the firewall on-site and immediately get the symptoms. Rebooting doesn't help; it won't get past waiting for eth0. The deployment sites are on the same cable modem ISP as me, but are in a different neighbourhood and on a different subnet.

This is where it gets weird. If I reboot the machine with the eth0 LAN cable physically unplugged, the system boots fine (with eth0, of course, failing with a no-media error). I can then plug in the cable and run ifup eth0 from a shell, which brings up the internet with no problems. Even weirder, if I now reboot with the cable plugged in, eth0 starts up fine! So far, the problem does not seem to ever reoccur on that deployment.

I've had this happen on at least 3 recent deployments that I can remember. However, I've deployed at least 4 other FC1 firewalls since December and I don't recall having this problem with those. It almost feels as though the problem is a regression in some update released in the past few months. I am currently standardized on kernel-2.4.22-1.2179.nptl and all other packages are at the newest stable release versions.

The problem appears (just a hunch) to be caused by a major change in the IP address between my off-site test deployment and the on-site real deployment. It's like dhclient can't cope with the change. Our ISP's DHCP server is also notoriously slow, easily taking 10-40 seconds to respond. Perhaps bug #120531 is part of the problem.

As mentioned, this problem only seems to occur the first time the machine is deployed, which makes it very hard to test new theories or ideas to stomp this bug. To do a test run I must build a new system and go on-site for a one-shot test deployment.

The other big problem, which I'm really at a loss to understand, is that /var/log/messages does not have any entries from the attempted (hanging) boots! I thought syslogd started first, but the logs are completely devoid of information about these failed boots. Perhaps Ctrl-Alt-Del reboots don't flush the logs to disk? Regardless of why, this makes debugging the problem even harder.

Another thought is the network cards I am using. The latest time the bug happened, eth0 was a tulip-driver card (SMC1244-TX), which is the predominant card I use lately. However, the bug has also occurred where a ne2k-pci card was eth0. These cards all have no problems once past this initial bug. However, perhaps bug #111610 is related, as that seems to indicate a regression in the tulip driver. In some cases, the cards had been working perfectly in previous RH7.3 deployments for years.

Lastly, though I have looked into this at great length, there is a small possibility that my custom iptables setup is somehow preventing eth0 from coming up that initial time. But even if I was stupidly dropping DHCP packets, the boot-time ifup of eth0 should not hang indefinitely. Someone should add a SIGALRM sanity timer in there or something. It seems unlikely to be my firewall anyway, as the manual ifup and subsequent reboots have no problems with my firewall (unchanged) in place, and I have gone over the rules many times as a sanity check.

Version-Release number of selected component (if applicable):
initscripts-7.42.2-1

How reproducible:
Sometimes

Steps to Reproduce:
1. Build new firewall deployment, test boot off-site.
2. Take on-site to deploy, boot with LAN cable plugged in.

Actual Results: Boot hangs saying "bringing up eth0", never finishing (waited at least an hour a few times).

Expected Results: Should bring up eth0 ("OK"), or fail after an appropriate timeout and continue booting.

Additional info:
I'm assuming you can shut the machine down 'normally' (it's not a kernel lockup). Enabling sysrq and getting the output of sysrq-t may help, I suppose.
Ctl-alt-del does a "normal" shutdown procedure, so (I think) the answer you're looking for is "yes". I'll try what you suggest next time I see the problem (after I look up how to do what you say!).
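For anyone else who needs to look it up: the sysrq-t suggestion comes down to enabling the kernel's magic SysRq support and then asking for a task dump. The following is a minimal sketch only, assuming a kernel built with magic SysRq support that exposes /proc/sys/kernel/sysrq and /proc/sysrq-trigger (an assumption that may not hold for every kernel discussed here). The function name request_task_dump is made up for illustration, and it must run as root.

```python
def request_task_dump(enable_path="/proc/sys/kernel/sysrq",
                      trigger_path="/proc/sysrq-trigger"):
    """Enable magic SysRq, then request a dump of all task states.

    The default paths are the usual Linux procfs locations; they are
    parameters so the sketch can be exercised against ordinary files.
    """
    # Writing 1 enables all magic SysRq functions (requires root).
    with open(enable_path, "w") as f:
        f.write("1\n")
    # Writing 't' asks the kernel to log the state of every task.
    with open(trigger_path, "w") as f:
        f.write("t\n")
```

The dump goes to the kernel log and console, not to the caller. During a boot that is already hung there is no shell to run this from, so the practical route is to enable SysRq ahead of time and press Alt-SysRq-T on the console when the hang occurs.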
A followup: this bug *may* have been caused by me dropping some incorrect packets in my iptables setup. I've since decommissioned the system that showed this behaviour the most and updated my iptables script. The problem has *not* returned yet so either it was hardware specific or related to my iptables. However, *even* if I'm dropping bootps/bootpc traffic incorrectly the machine should at least boot past that point (after a timeout) rather than hanging.
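To make the requested behaviour concrete, here is a rough sketch of a SIGALRM sanity timer wrapped around a command that may block forever. It is an illustration only and not how initscripts or dhclient actually work; the names run_with_alarm and BringUpTimeout are hypothetical. It assumes a Unix system and must be called from the main thread, since that is where Python delivers signals.

```python
import signal
import subprocess


class BringUpTimeout(Exception):
    """Raised by the SIGALRM handler when the time limit expires."""


def _on_alarm(signum, frame):
    raise BringUpTimeout()


def run_with_alarm(cmd, seconds):
    """Run cmd; return its exit status, or None if it was killed
    after exceeding `seconds`."""
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    proc = subprocess.Popen(cmd)
    signal.alarm(seconds)  # arm the sanity timer
    try:
        return proc.wait()
    except BringUpTimeout:
        # The command did not finish in time: kill it and move on.
        proc.kill()
        proc.wait()
        return None
    finally:
        signal.alarm(0)  # disarm the timer
        signal.signal(signal.SIGALRM, old_handler)
```

A boot script could then call something like run_with_alarm(["ifup", "eth0"], 60), report a failure when the result is None, and continue with the rest of the boot. The 60-second limit is an arbitrary choice here; it would need to be comfortably longer than the 10-40 seconds the DHCP server is said to take.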
Closing bugs on older releases. Apologies for any lack of response. Unfortunately, without really specific knowledge of what was getting dropped, it's hard to determine why it was hanging. As such, I have to close this bug. If the issue comes back, please reopen with the sysrq output, and any information on the iptables config that is causing this.
Follow-up: this problem has not shown up on any FC3 boxes I manage. It was either unique to FC1 or it was a problem with my iptables setup I have since fixed.